Category Archives: XQuery

Lazy XML Enrichment

One of my big gripes with most content-oriented software is that it requires a big bang approach (see The First Step’s a Doozy). The basic premise behind most content software is roughly:

1. If you do all this hard work to perfectly standardize the schema of your content, perfectly tag it, and possibly perfectly shred it, then

2. You can do cool stuff like content repurposing, content integration, multi-channel content delivery, and custom publishing.

The problem is, of course, that the first step is lethal. Many content software projects blow up on the launchpad because they can’t get beyond step 1. Our first customer had been stuck on step 1 for 18 months with Oracle before they found Mark Logic. (We loaded their content in a week.) At a recent Federal tradeshow, we had dinner with some folks from Booz Allen who’d been trying to load some semi-structured message traffic data into a relational database for months. We told them to swing by our booth the next day. Our sales engineer then loaded their content over a cup of coffee while eating a muffin and built a basic application in an hour. They couldn’t believe it.

In most companies, even publishers, content is a mess. It’s in 100 different places in 15 different formats, and each defined format is usually more of an aspiration than a standard. Once, at a multi-billion dollar publisher, one of our technical guys actually found this sentence in some internal documentation: “it is believed that this tag is used to …” Only folklore describes the schema.

So when it comes to the general problem of making XML more rich (i.e., having more tags that indicate more meaning), many people take the same big-bang approach: “Well, step 1 would be to put all the content into a single schema (which alone could kill you) and run it through a dozen different entity, fact, sentiment, concept, and summarization ‘extractors’ that can mark up the content and fragments of it with lots of new and powerful tags (which alone could cost millions).”

Again, step 1 becomes lethal.

At Mark Logic we advocate that people consider the opposite approach. Instead of:

  • Step 1: make the content perfect so you can enable any application you want to build
  • Step 2: build an application

We say:

  • Step 1: figure out the application you want to build
  • Step 2: figure out which portions of your markup need to be improved to build that application
  • Step 3: improve only that markup, sometimes manually, sometimes with extraction software, and sometimes with heuristics (i.e., rules of thumb) coded in XQuery
  • Step 4: build your application and get some business value from it
  • Step 5: repeat the process, driven by subsequent application requirements

I call this lazy XML enrichment. You could call it application-driven, as opposed to infrastructure-driven, content cleanup. I think it’s an infinitely better approach because it delivers business results faster and eliminates the risk of either never finishing the first step because it’s impossible, or having funding yanked by the business because it runs out of patience with an IT project that’s showing no ostensible progress.

At this point, I’d like to direct those of technical heart to Matt Turner’s Discovering XQuery blog where he provides a detailed post (code included) that shows an example of lazy, heuristic-based XML enrichment, here.

  • Matt’s example shows lazy enrichment because the only markup he needs for his desired application is related to weapons, so that’s all he adds.
  • Matt’s example is heuristic-based because he devises a way to find weapons in XQuery, and then uses XQuery to tag them as such.
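
For readers who want a feel for what a heuristic of this kind looks like without leaving the page, here is a minimal sketch. It is not Matt’s code, and it is written in Python rather than XQuery; the term list, the `para` and `weapon` element names, and the function names are all assumptions made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical rule of thumb: a short list of terms that count as weapons.
WEAPON_TERMS = ["rifle", "pistol", "grenade"]
WEAPON_RE = re.compile(
    r"\b(?:" + "|".join(WEAPON_TERMS) + r")s?\b", re.IGNORECASE
)


def tag_weapons(para):
    """Wrap weapon mentions in the leading text of one element in <weapon> tags.

    Only the text that precedes any child element is examined, which keeps
    the sketch short. Returns the number of mentions tagged.
    """
    text = para.text or ""
    matches = list(WEAPON_RE.finditer(text))
    if not matches:
        return 0
    # Text before the first match stays as the element's own text.
    para.text = text[:matches[0].start()]
    for i, match in enumerate(matches):
        weapon = ET.Element("weapon")
        weapon.text = match.group(0)
        # The tail holds the text between this match and the next one.
        next_start = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        weapon.tail = text[match.end():next_start]
        para.insert(i, weapon)
    return len(matches)


def enrich(xml_text):
    """Add <weapon> markup to every <para> and return the enriched XML."""
    root = ET.fromstring(xml_text)
    # Materialize the list first so the tree is not modified while iterating.
    for para in list(root.iter("para")):
        tag_weapons(para)
    return ET.tostring(root, encoding="unicode")
```

The point of the sketch is the scope: it touches only the markup one application needs and leaves everything else in the document exactly as it was.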

How The Web Disrupts the RDBMS World

I found an interesting post on The Future of Software minisite run by the GigaOM network, best known for Om Malik and his GigaOM blog. The post is entitled “Data 2.0: How the Web disrupts our relational database world” and is written by Nitin Borwankar.

The post begins with:

The great online shift is creating massive amounts of data – whether it is videos on YouTube or social networking profiles on MySpace. And that data is stored in databases, making them the key component of the new web infrastructure. But managing that information isn’t easy

I think he nails the problem statement. The Web world is changing fast. And relational databases are having trouble keeping up.

The good news is that database management will be vastly different in the future. In fact, change has already begun; it just isn’t (cliché alert!) “evenly distributed” yet.

He then goes on to describe some leading examples of companies or problems that are pushing the relational database envelope.

  1. Yahoo’s creation of its own user management software based on BerkeleyDB
  2. Google’s MapReduce
  3. Amazon’s S3 (simple storage service) and SQS (simple queue service) which externalize operations normally done by a database.
  4. The general use of Lucene, Nutch, and Solr to do indexing of unstructured content, “something an old relational database cannot do well.”
  5. The graph-structured data problem (also known as the parts explosion problem) inherent in social networking and which remains an Achilles’ heel for relational databases

So while I generally agree with his thesis, the examples cited are basically all technology companies who are able to write their own system-level software to bypass and/or accommodate the limitations of relational databases.

My question is: what about everybody else? What are they supposed to do?

My short answer is — perhaps not shockingly — MarkLogic. At MarkLogic, we call Data 2.0 “content.”

  • We manage XML natively
  • We manage graph-structured data easily
  • We manage, search, store, and index text and XML natively

Some companies will always be able to write their own stuff to get around problems. But the reason MarkLogic exists is to provide a commercial DBMS that “the rest of us” can use when managing content and building web applications with it.

See this post on top-to-bottom XML for more.

EMC Goes Dutch

In what appears to be a bungled press launch, EMC (owner of Documentum) has announced the acquisition of Rotterdam-based XML vendor X-Hive.

Why do I say bungled? Because one of my Google alerts caught this eWeek story on Saturday before I could find an official press release on either company’s website or the wire services. And I found stories on Gilbane and CMS Watch announcing the deal as well. (And it’s now end of day Monday and I’ve still not seen any official indication of this on either site.)

First, let’s look at the numbers. The terms of the deal were not disclosed, so we don’t have much to work with, but they did say that X-Hive currently employs 25 people. With that, plus some standard ratios and basic math, we can work up a valuation estimate.

Assuming sales/employee/year in the broad range of $200K to $300K yields annual revenues of $5.0M to $7.5M. Since X-Hive is 11 years old and employs 25 people, it’s safe to assume the average growth rate has been quite low over the company’s history. But let’s be charitable and assume that they were getting some traction with their recent S1000D initiative, so let’s guess they were growing at 25% to 50% over the past few years.

That, plus a look at EMC’s historical deals, suggests a valuation of 2 to 5 times annual revenues, implying a valuation range of $10M to $37.5M. Eyeball correcting that, and knowing the company is venture-backed, suggests to me a range of $25M to $50M. If I had to guess one number in the range, I’d say $35M.
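
For anyone who wants to check the math, here is the back-of-the-envelope calculation as a small sketch. The function and variable names are mine; the inputs are the figures quoted above. Note that it reproduces only the mechanical $10M to $37.5M range. The $25M to $50M range and the $35M guess are eyeball adjustments, not something the arithmetic produces.

```python
# Inputs taken from the post: headcount, sales per employee, revenue multiples.
EMPLOYEES = 25
SALES_PER_EMPLOYEE = (200_000, 300_000)  # low and high, dollars per year
REVENUE_MULTIPLE = (2, 5)                # low and high valuation multiples


def revenue_range(employees, sales_per_employee):
    """Estimate annual revenue as headcount times sales per employee."""
    low, high = sales_per_employee
    return (employees * low, employees * high)


def valuation_range(revenues, multiples):
    """Apply the low multiple to low revenue and the high multiple to high revenue."""
    return (revenues[0] * multiples[0], revenues[1] * multiples[1])


revenues = revenue_range(EMPLOYEES, SALES_PER_EMPLOYEE)
valuation = valuation_range(revenues, REVENUE_MULTIPLE)
print(f"Revenue estimate:   ${revenues[0]:,} to ${revenues[1]:,}")
print(f"Valuation estimate: ${valuation[0]:,} to ${valuation[1]:,}")
```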

Next, let’s analyze strategy. X-Hive has three primary product lines.

X-Hive/DB, an XML content server with built-in search capabilities, in the same category as MarkLogic

X-Hive/Docato, an XML content management system, in the same category as Vasont, Astoria, and XyEnterprise.

X-Hive/AMDS, an aviation document management system that I believe was built for Northwest Airlines, in the same rough category as offerings from Jouve/InfoTrust.

My strategic concern with X-Hive has always been focus. While the offerings are layered on each other, the reality is you have a 25-person company in the Netherlands conducting war on three fronts. All three categories in which the company competes are highly competitive, and X-Hive has approximately 8 people working per category. That strikes me as way below critical mass.

Perhaps controversially, I believe that X-Hive’s strategy in moving towards aviation was divergent from EMC’s interest. It’s well known that Documentum has poor XML handling. While X-Hive was heading off to aviation, I think that EMC was looking to improve its XML capabilities. But I think EMC has taken a more tactical than strategic approach.

Why? If you think about it, EMC has an interesting problem. While they have strong positions in the storage and ECM layers of the stack, they have no presence at the database layer, which is controlled by the MOI (Microsoft, Oracle, and IBM) oligopoly. What’s worse, the MOI are rising up from the database level and attacking Documentum on its home turf — e.g., Microsoft’s increasing investment in SharePoint, Oracle’s purchase of Stellent, and IBM’s FileNet acquisition.

A creative strategy for EMC would be to play defense at the ECM layer by playing offense at the database layer (which in this context includes relational database, enterprise search, and content server technology), by integrating best-of-breed technologies at that level and then attacking with a strong unified data/content database story.

But I think EMC views XML the same way that Oracle viewed BI – as a tactical, tick-box item, and not as a strategic opportunity.

Let’s talk about that some more. As you may know, I was part of the executive team that took Business Objects from $30M in 1994 to nearly $1B in 2004 when I left to join Mark Logic. Over an approximately 15-year period, a $1B company was built directly underneath the nose of Oracle, one of the most viciously competitive companies in high-technology.

What’s more, Oracle had competing products (Reports, Discoverer) from day one. Business Objects was founded by a marketing director and a sales manager from Oracle France, after they unsuccessfully ran the idea up the corporate flagpole at Oracle, so the company was on Oracle’s radar from day one. Thus, I can derive the non-existence of Business Objects from first principles quite easily. But – and here’s the catch – it does exist, and today it’s about a $1.5B company.

So how in the world did that happen? My take:

• Oracle never saw BI as strategic. For them and many other companies, “tool” was a four-letter word, and BI a tick-box category to be avoided. Consequently, Oracle’s best people never worked on BI.

• Oracle was distracted. Its repeated failings in the much larger applications (e.g., ERP, CRM) market were a constant source of distraction. There were always bigger fish to fry.

• The market structure lent itself to independents. Most customers had multiple DBMSs and ERP/CRM systems and wanted BI as a unifying layer across that underlying chaos – this lent credence to players who could credibly claim agnosticism across the lower layers.

The result? Oracle made disposable BI products that were good enough to throw in for free on a purchase order as a discounting alternative, but not good enough to be seriously considered by someone who viewed BI as strategic. In effect, Oracle skimmed the sludge from the bottom of the market, leaving the cream for vendors like Business Objects and Cognos.

That’s my belief for how things will work out with EMC, Documentum, and X-Hive. By taking this approach to both EMC’s database-layer problem and Documentum’s XML problem, they are (in my humble opinion) screaming “tick box” and not strategic.

Finally, in the event that I’ve gotten it wrong and EMC really does believe that they are going to attack Oracle, Microsoft, and IBM at the database layer with (1/3rd of) 25 folks in Holland, then I’d say that I think they’re tilting at windmills, if you’ll pardon the pun.

Celebrating XML Independence

Today, I’d like to highlight a (4th of July holiday) post on Matt Turner’s Discovering XQuery blog. Matt’s post refers to this article, entitled XQuery: The Server Language, on XML.com, written by Kurt Cagle.

I’d read Kurt’s article when it was posted on June 6 and had meant to blog on it, but didn’t get around to it (or frankly, much blogging at all) during the busy month of June. Nevertheless, here are a few chunky morsels from Kurt’s article:

As an XML developer, one of the problems that I come across almost invariably within these [server-side scripting] languages is the fact that they are shaped by people who view XML as something of an afterthought, a small subset of the overall language that’s intended to satisfy those strange people who think in angle brackets.

He then shows an example (that warmed Matt Turner’s heart) of how often people have to create HTML by composing strings in-line. More morsels:

The original intent of the developers of XQuery was to use it, not surprisingly, as an XML-oriented query language. XQuery is not itself XML based (nor for that matter is XPath), but all of its operations are designed to work with XML documents or XML databases to provide a way of filtering or manipulating that XML to produce some form of output, most typically as XML or HTML.

Intriguingly, as a filter on XML, XQuery has seen only limited success. Part of this has to do with the fact that a significant number of the databases currently in use are SQL based, not XML based, so the benefits to gained by using an XML query filter are offset by the need to convert relational data into XML in the first place.

While I’d agree with Kurt thus far on the market adoption of XQuery and the hassle introduced by having to map XML to an RDBMS (see this post on Top-to-Bottom XML Apps), we at Mark Logic like to think of ourselves as the exception to the slow XQuery adoption rule. While XQuery is not a huge wind at our back, we have been able to grow the company eight-fold since I joined in 3Q04 and that growth is most definitely helped by the de-risking that comes with XQuery by virtue of it being both an industry standard and an eventual, inexorable replacement for SQL.

(If green is the new black, then XQuery is the new SQL, and SQL the new COBOL.)

Kurt concludes his article with:

This article serves as a very basic introduction to XQuery as a server language. I will be addressing this topic in more detail in subsequent articles in this series, examining some of the more sophisticated capabilities and the gotchas inherent in working with XQuery and eXist, and showing what explosive power you can release when you combine eXist or other rest based XQuery engines with XForms and Ajax.

My prediction is that REST based XML databases like eXist will seriously challenge the existing raft of server languages, from ASP to Ruby, within the next couple of years. Right now, it’s something of a closed secret among a few developers, but the power, sophistication and ease of use inherent in working with the XML as if it were a natural part of the server landscape can only be understood by trying it.

I couldn’t agree more with the bolded statement and we all look forward to seeing the subsequent articles in the series.

Web Applications: The Virtues of Top-to-Bottom XML

I think that most people now correctly perceive our product, MarkLogic Server, as an XML content server, a special-purpose DBMS designed specifically for handling XML marked-up content. That’s the good news.

The better news is that many of these same people are figuring out what that means when it comes to developing web applications – specifically, that you can use an XML content server to build web applications using XML top-to-bottom. No Java required. No relational tables required. No application server required. (And no expense for all those supporting products.)

Don’t get me wrong. Many customers choose to use MarkLogic as the XML repository and query system in their architecture, building their applications in Java, using an application server, and making calls out to MarkLogic to process XML queries. Lots of people use the product in that way. That’s fine.

But, people soon realize, when you have a DBMS and query language (XQuery) that directly outputs XML (e.g., XHTML) which can be directly rendered by a browser, and when that “query” language is really a misnamed and underpositioned programming language easily capable of developing entire applications, you can say:

“Wait a minute. My content’s in XML. My browser speaks XML. Why not build my whole app top-to-bottom in XML and XQuery?”

Good question. And the answer is you can. And in many cases, you probably should. What are the advantages of doing so?

  • Use of a high-level, standard, powerful programming language, XQuery. High-level and powerful translate to greater development and maintenance productivity. Standard translates to risk reduction and freedom of choice. (Aside: While XQuery is not a big-hype, overnight-success type of technology like Ajax, XQuery continues to march along with certain inevitability. In my mind, there is no question that XQuery will be the database programming language of the future – it is superior to SQL, it is more general than SQL and ergo applicable to a broader class of problems, and all major DBMS vendors are already committed to it. The question is not will XQuery become mainstream, but when?)
  • Elimination of three impedance mismatches: Java/XML, XML/relational, and Java/relational. Java is object-oriented, XML is hierarchical, and relational databases are tabular. The mapping between these three different data models generates a lot of zero-value-added work in developing an application. When you’re XML top-to-bottom, poof, that work’s all gone.
  • Elimination of tiers. I had lunch a while back with a top engineer at Oracle who told me that he believed the limiting factor on database application performance was becoming scheduling. That is, hardware and databases were becoming so fast that scheduling work across tiers was becoming the limiting factor in performance. His suggested solution? Eliminate tiers. Well, top-to-bottom XML does exactly that.

CQ Leads Government Technology Story

I wanted to highlight this story in Government Technology that features Mark Logic customer Congressional Quarterly (CQ).

One of my new memes is that just as the relational database enabled two large secondary markets (i.e., business applications, business analytics) so will XML content servers (such as MarkLogic) drive the creation of two huge secondary markets (i.e., content applications, content analytics).

As it turns out, most of our publishing customers build content applications (e.g., Elsevier’s PathConsult) and most of our government customers do content analytics.

CQ lives at the intersection of our two largest markets — they are a publisher that covers government — and they have built a very interesting content analytic application called CQ Legislative Impact.

This story, entitled X Factor, is primarily about XQuery, the query language that MarkLogic Server natively speaks.

It leads off with an interview of CQ’s senior software architect, Hank Hoffman:

It’s one thing to compile hearing dates, vote counts and committee actions, but it’s quite something else to make those data points relate meaningfully to one another. A year ago, Hoffman found what he was looking for in the form of XQuery […]

“You can do some very powerful things with just a very few lines of code,” Hoffman said, explaining that XQuery makes interpreting and managing masses of XML data a much simpler proposition […]

Other snippets include:

“If all you have is relational data, and you want to create tables, SQL is a great language. The problem is that the game has changed,” said Jonathan Robie, XQuery technology lead and chief scientist at Massachusetts-based DataDirect Technologies […]

He’s referring to the recent rise of XML as the predominant language driving the Internet and data storage in general — an evolution that has pushed demand for tools to query and manage XML data. That’s where XQuery comes in. […] Several companies, including Microsoft, IBM, MarkLogic and Saxonica, have moved to commercialize XQuery with diverse tools aimed at easing its implementation.

[…]

Users say it’s relatively easy to acquire a fluency in XQuery basics. Harvey turned to the language to develop an interactive dictionary, and recalls boarding a train not knowing a thing about it. “By the time I took the train to New York, had a meeting and took the train back,” she said, “I had a working product that I could give to my client. If you are familiar with XML and XML technologies, it is not that hard to work with.”

Search Engine as Implemented in MarkLogic

One of our field consultants, Matt Turner, has started a blog called Discovering XQuery. Since Matt is often in demand and spends lots of time working on customer engagements, he hasn’t found the time to do many posts, but the ones he has done are quite good. So please check out his blog, linked above, and egg him on to do more posting.

At my request, Matt took on a subject that I thought needed more explaining. As frequent readers will know, MarkLogic Server is an XML content server. An XML content server is a special-purpose DBMS designed to handle XML content. In MarkLogic’s case, that means large amounts of XML content at very high performance.

Because DBMSs are generally not designed for handling content, at Mark Logic we usually compete with search engines (sometimes tied to DBMSs) in customer engagements. One question that invariably comes up is: how does MarkLogic differ from a search engine?

There are many answers:

  • MarkLogic is a DBMS; a search engine is an indexing system. Think VSAM vs. Oracle.
  • MarkLogic has transactions. So the second a document is inserted into a database, it’s visible to all subsequent queries. (There is no indexing latency.)
  • MarkLogic has updates. Like any DBMS, we allow updates and do proper concurrency control when performing them.
  • MarkLogic has read-consistent snapshots. Like Oracle, MarkLogic shows you the results of a query that consistently reflect the state of the database at the start of your query. (This is also sometimes called read consistency, or non-blocking consistent reads.)
  • MarkLogic has a query language as its interface, instead of an API.
  • MarkLogic’s query language (XQuery) is a W3C standard, and not a proprietary vendor API.
  • Because XQuery is a powerful language, much processing can be pushed to the database tier, resulting in applications with little or no middle tier. With search engines, you typically write a thick middle tier of Java code to process documents returned by the search engine. For example, if you want to extract all footnotes from a document, MarkLogic can return this directly from an XQuery; a search engine will return links to all documents with footnotes and you then have to create a DOM tree for each document and traverse it to find and extract all footnotes.
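
To give a sense of the middle-tier work described in that last bullet, here is a minimal sketch of the search-engine side of the footnote example. It uses Python in place of the Java mentioned above, and the `footnote` element name and the function name are assumptions for illustration. On the XQuery side, the same request would be roughly a single path expression along the lines of `//footnote`, evaluated inside the database.

```python
import xml.etree.ElementTree as ET


def extract_footnotes(xml_documents):
    """Collect footnote text from documents a search engine has handed back.

    Each document is parsed into a tree and walked by the application,
    which is exactly the work the middle tier ends up owning.
    """
    footnotes = []
    for xml_text in xml_documents:
        root = ET.fromstring(xml_text)      # build a tree for each document
        for node in root.iter("footnote"):  # traverse it looking for footnotes
            # Join nested text so inline markup inside a footnote is kept as text.
            footnotes.append("".join(node.itertext()).strip())
    return footnotes
```

Multiply that by every question the application needs to ask of its content and you get the thick middle tier.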

MarkLogic uses search-engine indexing and query processing techniques, but it is a DBMS. MarkLogic also uses search-engine scaling techniques.

But the big conceptual difference is that MarkLogic is a platform for building content applications. And one basic, almost trivial, content application is “enterprise search” — i.e., returning links to documents that contain a given word or phrase.

For example, say that you had a collection of XML content and wanted to have enterprise search functionality against it. You could, with about a page of code, implement an XML enterprise search engine using XQuery. And here’s what it would look like. Thanks to Matt for banging out the example (as well as the car metaphor).