
MarkLogic Server Purer than Ivory

MarkLogic Server was recently run against the XQuery Test Suite, and I’m pleased to report that we are a purer form of XQuery than Ivory is of soap.

MarkLogic Server weighed in at 99.9% on the test suite, whereas Ivory is only 99 and 44/100ths percent pure, if you believe the classic marketing slogan, which ended up as the basis for a country song by Ronnie Milsap, Pure Love. (Chorus: pure love, you’re the picture of pure love, ninety-nine and forty-four one-hundredths percent pure love.)

Here is the XQuery Test Suite Result Summary. The detailed report that lists every single test on a pass/fail basis is here.

Dare I say:

MarkLogic Server,
You’re the picture of XQuery,
(More than) 99 and 44 one-hundredths percent pure XQuery

Interviewed in Dr. Dobb’s Journal

Check out this interview with me, published this past Thursday in Dr. Dobb’s Journal and entitled XML as a Content Platform.

Excerpt on history:

In essence, because RDBMSs date back to the 1960s and pure XML databases only date back to around 2000, the XML database vendors get the coveted chance to “start over” in designing a database system. So we can quickly incorporate a lot of the features put in RDBMSs over the past few decades while at the same time optimizing for XML.

Excerpt on XQuery:

Because XQuery was the DBMS community’s chance to start over and they took it. XQuery is superior to SQL for a number of reasons. It’s a full programming language, not just a data manipulation language. It handles XML natively, and XML is indeed becoming more and more pervasive.

Excerpt on “SQL is COBOL” to our kids:

Our kids will think of SQL the way that we think of COBOL. (“Daddy, do you mean you used a database language that assumed all data was stored in tables and didn’t natively understand XML?” “Yes, Muffin, and I used to have to sew my own clothes, too!”)

XML: Good, Bad, Bloated?

GCN ran an article last month, entitled XML: The Good, The Bad, and the Bloated, about which I wanted to share a few thoughts.

The article begins (bolding mine):

Depending on whom you talk to, Extensible Markup Language is either the centralized solution for managing cross-platform and cross-agency data sharing, or it’s a bloated monster that’s slowly taking over data storage and forcing too much data through networks during queries.

Which view is accurate?

In general, I believe XML’s flexibility and cross-platform capabilities far outshine any negatives. But if XML files are not properly planned and managed, there is a good possibility that you could experience XML bloat.

First, I’ll note that the author balances the pro/con of XML and comes out pro: XML’s benefits outweigh its stated and perceived disadvantages.

Now, let’s move on to the cons:

But XML bloat occurs when files are poorly constructed or not equipped for the jobs they must perform. There is a strong temptation to cram too much information into files, which makes them larger than they need to be. When an agency needs only part of the data, [it] often has to accept the whole file, including long blocks of text.

First, I’d say that “long blocks of text” are often the data in which analysts are interested, so we must be careful not to quickly classify them as baggage (i.e., let’s not be too data-centric in today’s world).

Second, I’d agree that the blind marking of everything in XML can be wasteful. That’s why I’ve long advocated a “lazy” approach where:

  • You first decide application requirements and then create XML tags in order to support them, iterating over time on both the application requirements and the sophistication of the XML to support them.

As opposed to a far-too-common “big-bang” approach whereby:

  • You design “the ultimate schema,” which can answer virtually any possible application requirement, and then spend enormous time and money first designing it, and then trying to migrate your data/content to it.

The problems with the big-bang approach are many:

  • Designing the ultimate schema is a Sisyphean task.
  • You spend money investing in XML richness which has no short-term return; i.e., you over-design for the short term.
  • You lose your budget mid-term because while you’re designing perfection, the business has seen no value and loses faith in the project.

As I like to say, “big-bang approaches often result in a big bang,” or, similarly, with too many content-oriented systems “the first step’s a doozy” beyond which you never pass.

At Mark Logic, we’re trying to change all that in three ways:

  • By delivering a forgiving XML system that accepts content in a rather ragged form, enabling you to ingest XML immediately and begin delivering value against it.
  • By evangelizing a lazy XML enrichment and migration approach that delivers business value faster than big-bang approaches.

With Mark Logic, the question is not “how much slower do I have to go than an RDBMS and get the benefits of XML,” it’s typically “how much faster does it go than an RDBMS and still deliver the benefits of XML?”

In customer benchmarks, we’ve seen outperformance of 10:1 as common, and outperformance of an RDBMS by 100:1 is certainly not unheard-of. Ask our customers and partners: MarkLogic is fast.

The article continues (bolding mine):

Luckily, technologies are evolving that can help with XML bloat.

First is the evolution of platform-based XML solutions that offer a single system to author, tag, store and manage XML files. They also allow developers to set the policies for dynamic XML integration into other documents or applications. Mark Logic is one of the best-known purveyors of such solutions, …

A lot of XML bloat perception comes from the idea that you’re inserting tags into ASCII files and those files increase by the size of the tags which, at times, appear material relative to the size of the content.

As a trivial example, if you have an XML element named publication-author, with value (i.e., the author’s name) “Joe,” then you have added 41 characters of “overhead” (begin and end tags) to the underlying data of 3 characters. And, if Joe has authored 1,000 documents in the collection, you’d argue that you’ve added 41,000 characters of overhead for 3,000 characters of data. And you’d see precisely that if you looked at an ASCII serialization of the XML.
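The arithmetic is easy to check. Here is a minimal Python sketch, assuming the element carries no attributes and that each of the 1,000 documents contains exactly one such element (the variable names are mine, purely for illustration):

```python
# Count the characters that the begin and end tags add around one value.
name = "publication-author"
value = "Joe"

start_tag = f"<{name}>"   # name plus two angle brackets
end_tag = f"</{name}>"    # name plus two angle brackets and a slash
overhead_per_element = len(start_tag) + len(end_tag)

documents = 1_000  # assumption: one such element per document
total_overhead = overhead_per_element * documents
total_data = len(value) * documents

print(overhead_per_element, total_overhead, total_data)  # prints: 41 41000 3000
```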

But good XML systems don’t store XML that way. XML is naturally tree-structured and XML documents are stored as trees. What’s more, the element names (i.e., the tags) are typically hashed. So the 18-character “publication-author” element name gets hashed to 64 bits once, and every time the tag appears in the corpus only the hash value is stored. So it’s not 41K of overhead to 3K of content in the preceding example; it’s more like 2K to 3K.
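To make the idea concrete, here is a toy Python sketch in which each element name is stored once and every node refers to it by a fixed-size key. It is purely illustrative and is not MarkLogic’s actual storage format: the NameTable class is hypothetical, and deriving the 64-bit key from the first 8 bytes of a SHA-256 digest is my assumption (hash collisions are ignored).

```python
import hashlib


class NameTable:
    """Toy name table: each distinct element name is stored exactly once."""

    def __init__(self):
        self._names = {}  # 64-bit key -> element name

    def intern(self, name: str) -> int:
        # Assumption: take the first 8 bytes of SHA-256 as a 64-bit key.
        digest = hashlib.sha256(name.encode("utf-8")).digest()
        key = int.from_bytes(digest[:8], "big")
        self._names.setdefault(key, name)  # collisions ignored in this sketch
        return key

    def lookup(self, key: int) -> str:
        return self._names[key]

    def __len__(self) -> int:
        return len(self._names)


table = NameTable()
# 1,000 author elements become (key, value) pairs; no tag text is repeated.
nodes = [(table.intern("publication-author"), "Joe") for _ in range(1_000)]

print(len(nodes), len(table))     # 1,000 nodes share a single stored name
print(table.lookup(nodes[0][0]))  # the name is recovered from the key
```

The point of the sketch is only that the cost of a long, descriptive element name is paid once per corpus rather than once per occurrence.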

In fact, by Mark Logic rules of thumb, the picture often looks like:

  • 1MB of text source content, which becomes
  • 3MB of XML, which becomes
  • 300K of compressed XML in MarkLogic, which becomes
  • 1MB of compressed XML + indexes in MarkLogic

Simply put, it’s often the case that the content blows up a bit in XML only to be compressed to 1/10th its size, only to be re-inflated through indexing back to its original size.

Now this certainly isn’t true every time. Sometimes content + indexes ends up 2-5x the original size. But critics should remember: (1) you then have rich XML tags that enable you to do something with the content and (2) you then have indexes so you can do it fast. (Often the counter-arguments make it sound like nothing is gained for the size increase.)

Finally, I’d add two points:

  • With magnetic disk storage well under $1/gigabyte (e.g., this drive) for consumer applications and maybe $10/gigabyte in a mid-range SAN, to put it bluntly: should you care? Despite our (potentially advancing) age and attitudes about storage costs, we should not conserve storage for conservation’s sake, but instead optimize our computing investment so as to maximize overall return, paying heed to the relative costs of subsystems and to the value of the functionality they enable.
  • Your XML can be as big or rich as you want it to be. And with MarkLogic, you can change that richness over time. Our presumption is that you are adding elements because you want to use them to deliver business value, so technically speaking there should be no “wasted elements” (i.e., elements that merely inflate size and deliver no value). That is, if you’re paying attention and following a lazy XML approach, then your XML should be no richer than the functionality required by your applications, and ergo, by definition, there is no waste or bloat.

Basically, if your content gets bigger, it’s simply because you wanted to do more things with it.

Startup Zeitgeist

Seedcamp, a London-based, week-long camp for European entrepreneurs, recently did an interesting exercise. They took the several hundred applications they received for their event and made tagclouds. Here’s what they found.

What are you creating?

How will you make money?

What tools will you use?

(I’d love to see XQuery in the toolset, but happy to see that database, server, and XML are already there.)

And who says you can’t do interesting analytics on content? I thought this was fascinating. Check out Seedcamp’s blog post about the exercise, here.

Norm Learns Rule 1

One of the fun things about Mark Logic is that we unite people from different computing backgrounds: database people, search engine people, content management people, the odd computational linguistics person, and — of course — document/XML people.

Aside: one of my big theses of computing life is that individuals tend to stovepipe into a single computing camp early on, fail to cross-breed / cross-read, and thus the camps end up quite in-bred and incommunicado over time. That’s one reason why I deliberately “jumped camps” in leaving Business Objects four years ago, hopping from BI into unstructured data / content / documents / XML.

But I digress.

We recently hired Norm Walsh, a pretty big guy in the document camp, which elicited comments such as the following from his fellow camp members:

I’m wondering how in the hell some obscure “XQuery Content” company stole Norm Walsh away from Sun. […] Anyone care to provide some insight? Is Mark Logic really *that* good?

That was fun.

But what’s been even more fun is taking someone who is clearly distinguished in one camp and introducing him to another. Towards that end, I’m happy to report that Norm is now officially certified in what I call rule 1 of database performance: push constraints to data, don’t move data to constraints.

Believe it or not, rule 1 appears quite counter-intuitive to document people, who seem to innately want to materialize DOM trees and then process them in a middle tier.
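For readers from the document camp, here is rule 1 in miniature. This is only a sketch, using Python’s built-in sqlite3 module as a stand-in for any database; the docs table, its contents, and the function names are made up for illustration. Both functions return the same answer, but the first ships every row to the application and filters there, while the second lets the database apply the constraint and ships only the matches.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, author TEXT)")
conn.executemany(
    "INSERT INTO docs (author) VALUES (?)",
    [("Joe",) if i % 100 == 0 else ("Someone Else",) for i in range(10_000)],
)
conn.execute("CREATE INDEX docs_author ON docs (author)")


def move_data_to_constraint(author):
    # Anti-pattern: fetch every row, then filter in the application tier.
    rows = conn.execute("SELECT id, author FROM docs ORDER BY id").fetchall()
    return len(rows), [row_id for row_id, a in rows if a == author]


def push_constraint_to_data(author):
    # Rule 1: the database evaluates the constraint and returns only matches.
    rows = conn.execute(
        "SELECT id FROM docs WHERE author = ? ORDER BY id", (author,)
    ).fetchall()
    return len(rows), [row_id for (row_id,) in rows]


moved_slow, ids_slow = move_data_to_constraint("Joe")
moved_fast, ids_fast = push_constraint_to_data("Joe")
print(moved_slow, moved_fast, ids_slow == ids_fast)  # prints: 10000 100 True
```

The first number in each result is how many rows crossed from the database to the application: 10,000 versus 100 for the same answer. Materializing whole DOM trees in a middle tier is the document-world version of the first function.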

Because I’m so wed to the database viewpoint, I have trouble expressing it in a document-person way. That’s why I’m happy that Norm has recounted his journey here, in a post entitled Thinking Differently about XML.

The Publishing [R]evolution

I’m posting the slides that Darin McBeath from Elsevier presented at the XML Holland conference a few months back. I’m sorry about the delay, but I wanted to be sure it was OK with Darin and the process got stuck on my back burner.

In addition to an all-around great speech, Darin introduced two concepts that I liked a lot.

  • Fewer moving parts
  • Find the ringtones

“Fewer moving parts” was Darin’s metaphor on simplicity in building pure XML-based systems (with XML content and XQuery as the query / programming language). It’s always hard to argue the business benefits of simplicity without doing detailed costing analysis. I thought it was creative of Darin to use this metaphor to drive the point home. We know jet engines are safer than piston engines because they are simpler and have fewer parts. The same could be said of Nokia vs. Motorola phones. Fewer parts works. When you build content applications on XML content with XQuery and an XML content server, you have fewer moving parts. No Java layer. No relational mappings. See this post, The Virtues of Top-to-Bottom XML, for more.

“Find the ringtones” was another cool Darin idea. As you probably know, ringtones are a multi-billion dollar business. The amazing thing about ringtones is that you can charge $3.00 for fifteen seconds of a song which in its three-minute entirety would sell for $0.99. Less really can be more. Darin’s challenge to publishers was to “find the ringtones” in their content. Where, in different sections of the publishing business, can you deliver higher value and increased revenue — by offering less? That’s a cool question. And in an increasingly information-overloaded world, a smart one.

In the better late than never department, here are Darin’s slides.

SQL/XQuery Franglais Frankenqueries

One of our consultants is doing some testing of MarkLogic vs. XML-extended relational databases, and he sent me an example of the kinds of queries you need to write when you’re mixing SQL and XQuery/XPath. Here is an example:

SELECT XMLQUERY('$p/Citation/Index/ConceptCodeList/ConceptCode' PASSING P.XMLDATA AS "p")
FROM AllCitations AS p
WHERE contains (XMLDATA,
  '(SECTION( "/Citation/Index/ChemicalData/ChemicalList/ChemicalName") "leucovorin")&(SECTION( "/Citation/Index/ConceptCodeList/ConceptCode") "Pharmacology")'
) = 1;

A few things spring to mind when I see queries like this:

  • This is why people made XQuery — so you wouldn’t have to write stuff like this.
  • Why in the world do you need to mix XPath and SQL in this way? In a theoretically bi-lingual SQL/XQuery database, can I just write document-oriented queries purely in XQuery and not mess around with selecting columns that are themselves XMLQUERYs? Answer: in DB2’s ironically named pureXML, you need to use SQL as the outer framework if you want to use full-text indexing; so yes, you must do this.
  • Are there more than 10 people in the world who will understand what the answer to this query is supposed to be? SQL and XQuery each have their own semantics, and few people deeply understand them. How many people understand not only both SQL and XQuery semantics, but also how they interact? (It reminds me of trying to find a tax guy in France who could do both the US and French systems at the same time.) I watched two world-class experts debate what the correct answer was to such a query for 20 minutes. Does Joe Programmer even have a chance?