Search Engine as Implemented in MarkLogic

One of our field consultants, Matt Turner, has started a blog called Discovering XQuery. Since Matt is often in demand and spends lots of time working on customer engagements, he hasn’t found the time to do many posts, but the ones he has done are quite good. So please check out his blog, linked above, and egg him on to do more posting.

At my request, Matt took on a subject that I thought needed more explaining. As frequent readers will know, MarkLogic Server is an XML content server. An XML content server is a special-purpose DBMS designed to handle XML content. In MarkLogic’s case, that means handling large amounts of XML content with very high performance.

Because DBMSs are generally not designed for handling content, at Mark Logic we usually compete with search engines (sometimes tied to DBMSs) in customer engagements. One question that invariably comes up is: how does MarkLogic differ from a search engine?

There are many answers:

  • MarkLogic is a DBMS; a search engine is an indexing system. Think VSAM vs. Oracle.
  • MarkLogic has transactions. So the second a document is inserted into a database, it’s visible to all subsequent queries. (There is no indexing latency.)
  • MarkLogic has updates. Like any DBMS, we allow updates and do proper concurrency control when performing them.
  • MarkLogic has read-consistent snapshots. Like Oracle, MarkLogic returns query results that consistently reflect the state of the database at the start of your query. (This is also sometimes called read consistency, or non-blocking consistent reads.)
  • MarkLogic has a query language as its interface, instead of an API.
  • MarkLogic’s query language (XQuery) is a W3C standard, and not a proprietary vendor API.
  • Because XQuery is a powerful language, much processing can be pushed to the database tier, resulting in applications with little or no middle tier. With search engines, you typically write a thick middle tier of Java code to process documents returned by the search engine. For example, if you want to extract all footnotes from a document, MarkLogic can return this directly from an XQuery; a search engine will return links to all documents with footnotes and you then have to create a DOM tree for each document and traverse it to find and extract all footnotes.
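
To make the footnote example concrete, here is a minimal sketch of the middle-tier work that the search-engine approach implies. It is written in Python rather than Java purely for brevity. The element name `footnote`, the sample documents, and the helpers `extract_footnotes` and `footnotes_for_hits` are all assumptions made up for illustration, not part of any product API.

```python
import xml.etree.ElementTree as ET

# Hypothetical documents that a search engine might hand back as hits.
HITS = {
    "doc1.xml": "<article><p>Text<footnote>First note</footnote></p>"
                "<p>More<footnote>Second note</footnote></p></article>",
    "doc2.xml": "<article><p>Other<footnote>Third note</footnote></p></article>",
}


def extract_footnotes(xml_text):
    """Parse one document into a tree and collect the text of every footnote."""
    root = ET.fromstring(xml_text)
    # Walk the whole tree; itertext() also picks up text in nested markup.
    return ["".join(node.itertext()).strip() for node in root.iter("footnote")]


def footnotes_for_hits(hits):
    """Repeat the parse-and-traverse step for every document in the hit list."""
    return {uri: extract_footnotes(xml_text) for uri, xml_text in hits.items()}


if __name__ == "__main__":
    for uri, notes in footnotes_for_hits(HITS).items():
        print(uri, notes)
```

With a query language at the database tier, the same extraction can usually be written as a single path expression (something like `//footnote`), so the parse-and-traverse loop above never has to be written in application code.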

MarkLogic uses search-engine indexing and query processing techniques, but it is a DBMS. MarkLogic also uses search-engine scaling techniques.

But the big conceptual difference is that MarkLogic is a platform for building content applications. And one basic, almost trivial, content application is “enterprise search” — i.e., returning links to documents that contain a given word or phrase.
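
To show how small that basic task is, here is a rough sketch of "return links to documents that contain a given word or phrase". It is plain Python over an in-memory dictionary, not XQuery and not MarkLogic, and it does a linear scan with no index at all. The corpus, the URIs, and the `search` function are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical corpus: URI -> XML content.
CORPUS = {
    "/docs/a.xml": "<doc><title>Sports cars</title>"
                   "<p>A fast car needs good brakes.</p></doc>",
    "/docs/b.xml": "<doc><title>Carpets</title>"
                   "<p>Choosing a carpet for the hall.</p></doc>",
}


def normalize(text):
    """Lowercase and collapse runs of whitespace to single spaces."""
    return " ".join(text.lower().split())


def search(corpus, phrase):
    """Return the URIs of documents whose text contains the word or phrase."""
    # Word boundaries keep "car" from matching inside "carpet".
    pattern = re.compile(r"\b" + re.escape(normalize(phrase)) + r"\b")
    hits = []
    for uri, xml_text in corpus.items():
        # Join text nodes with spaces so words in adjacent elements stay separate.
        text = normalize(" ".join(ET.fromstring(xml_text).itertext()))
        if pattern.search(text):
            hits.append(uri)
    return hits


if __name__ == "__main__":
    print(search(CORPUS, "good brakes"))
```

A real engine would consult an index rather than scan every document, but the result has the same shape: a list of links.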

For example, say that you had a collection of XML content and wanted to have enterprise search functionality against it. You could, with about a page of code, implement an XML enterprise search engine using XQuery. And here’s what it would look like. Thanks to Matt for banging out the example (as well as the car metaphor).
