The Oracle of Unstructured Information: A Three-Horse Race

I’m often asked about my business vision for Mark Logic. My answer is simple: I want to be the Oracle of unstructured information.

When I say this, I mean more the Oracle of 1995 than the one of 2009 — i.e., the one focused on DBMS software and applications, not the CA-like financial consolidator of today. I want to be the clear leader in providing DBMS software for the development of applications and business intelligence systems.

  • In Oracle’s case that was relational DBMS software for the development of data applications.
  • In Mark Logic’s case that means XML DBMS software for the development of information applications.

By information, I mean the full information continuum, from structured data to semi-structured data to unstructured data. (It’s a concept that Gartner’s Rita Knox writes about for those interested in further information.)

In many ways, it sells MarkLogic short to call it an XML DBMS. That is why we position it instead as an XML Server. While the one-word difference may seem nuanced, there is an important underlying distinction because MarkLogic Server is really three things rolled into one:

  • A database management system
  • A search engine
  • An application server

So, in one sense, MarkLogic Server is more than a DBMS. But, then again, that’s exactly what you’d expect from a next-generation DBMS — to be both different-from and more-than the prior generation. MarkLogic doesn’t just change the database software market, it changes the way you build database-driven, information-centric web applications.

As such, it’s actually a new piece of information infrastructure software on top of which our customers build a wide range of systems.

Now, I don’t think Mark Logic is the only company that aspires to become the Oracle of unstructured information. I think it’s a three-horse race between Mark Logic, Autonomy, and Endeca. Since I talk a lot about Mark Logic in this blog, let me instead take a moment to discuss the other participants.

The Financial Consolidator: Autonomy
Autonomy started out as an enterprise search company with a semantic and probabilistic flair, a firm that emphasized complex algorithms and theories to help improve search results. After some early success, that strategy resulted in the company getting stuck in the $50M to $60M revenue range for 5 years (2000-2004). To break out of those doldrums, Autonomy made the bold and risky acquisition of Verity, a struggling search rival 2.2x its size, in November 2005. They then acquired email archiving provider Zantaz for $375M in July 2007 and web content management system vendor Interwoven for $775M in January 2009.

My perception is that Autonomy is a well-run M&A shop. The reported financials are excellent and the operating margins are strong at 35%. Personally, I’d feel significantly more reassured if Autonomy reported under US GAAP, as most leading software companies do, as opposed to the seemingly looser IFRS rules under which fellow European search provider Fast Search and Transfer seemed to get itself into quite some trouble. I was at the France-based Business Objects for 9 years and we reported under both US GAAP and French rules the entire time, so I know first-hand that it can be done.

Autonomy’s challenge is that organic growth typically wanes in such M&A-driven ventures and the company, often despite intentions to the contrary, becomes addicted to inorganic growth. Note that when company A buys company B, for the following four quarters company A benefits from an inorganic growth tailwind, because its year-over-year comparison sets the combined revenue (A+B) against A alone. By the way, this analysis suggests that Autonomy is due for another acquisition since it’s been a bit over a year since the Interwoven announcement.
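To make the (A+B)/A arithmetic concrete, here is a minimal Python sketch. It assumes the simplest possible model: the prior-year base is A's standalone revenue, the current period is the combined company, and both businesses grow organically at the same rate. The function name and the revenue figures are illustrative and are not Autonomy's reported numbers; only the 2.2x size ratio is borrowed from the Verity example above.

```python
def reported_growth(a_revenue, b_revenue, organic_rate=0.0):
    """Year-over-year growth A reports in the four quarters after buying B.

    Assumption: the prior-year base is A alone, the current period is
    A plus B, and both grow organically at `organic_rate`.
    """
    current = (a_revenue + b_revenue) * (1.0 + organic_rate)
    return current / a_revenue - 1.0


# Hypothetical figures: A has 100 of revenue and buys a rival 2.2x its size.
flat = reported_growth(100.0, 220.0, organic_rate=0.0)
modest = reported_growth(100.0, 220.0, organic_rate=0.05)
print(f"zero organic growth: {flat:.0%} reported")
print(f"5% organic growth:   {modest:.0%} reported")
```

Under these assumptions the acquirer shows triple-digit growth even with no organic growth at all. Once B's revenue is in the prior-year base, after four quarters, the reported figure falls back to the organic rate, which is why the next deal starts to look attractive.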

Technologically, acquisition-based strategies create hodgepodges, not platforms. While Autonomy is doing well size-wise in the race, my prediction is that the acquisition strategy will continue and result in a de facto shift away from technology and information platforms and towards continued acquisitions, the financial engineering that accompanies them, and the major task of integrating them.

A firm can talk technology all day long, but if its revenues and growth are coming from acquisitions then it is a financial play, not a technology one. We are, after all, what we do, not what we say, nor what we aspire to do.

The Drifter: Endeca
Endeca started out as a search and discovery company with an original problem statement: how can we make it easier to find things on eBay? This led to an early focus on combining both textual and data constraints (e.g., Frank Sinatra memorabilia costing less than $25) as well as the notion of multi-dimensional (or, faceted) navigation for the iterative refinement of queries.
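For readers who have not seen this pattern, here is a minimal Python sketch of combining a textual constraint with data constraints and then computing facet counts for the next refinement. It is an illustration of the general idea only, not Endeca's implementation; the listings, field names, and functions are all made up, and the text match is a naive substring test.

```python
from collections import Counter

# Hypothetical listings; field names and values are made up for illustration.
LISTINGS = [
    {"title": "Frank Sinatra signed photo", "category": "memorabilia", "price": 120.0},
    {"title": "Frank Sinatra concert poster", "category": "memorabilia", "price": 18.0},
    {"title": "Frank Sinatra greatest hits LP", "category": "music", "price": 12.5},
    {"title": "Dean Martin concert poster", "category": "memorabilia", "price": 20.0},
]


def search(listings, text, max_price=None, facets=None):
    """Combine a textual constraint with data constraints."""
    terms = text.lower().split()
    facets = facets or {}
    hits = []
    for item in listings:
        title = item["title"].lower()
        if not all(term in title for term in terms):
            continue  # textual constraint: every term must occur in the title
        if max_price is not None and item["price"] >= max_price:
            continue  # data constraint: strictly less than max_price
        if any(item.get(field) != value for field, value in facets.items()):
            continue  # facet values the user has already selected
        hits.append(item)
    return hits


def facet_counts(hits, field):
    """Count the values of one facet among the current hits."""
    return Counter(item[field] for item in hits)


hits = search(LISTINGS, "frank sinatra", max_price=25.0)
counts = facet_counts(hits, "category")
refined = search(LISTINGS, "frank sinatra", max_price=25.0,
                 facets={"category": "memorabilia"})
```

The counts are what a site would display next to each facet value; picking one simply adds another constraint and reruns the query, which is the iterative refinement loop.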

This naturally led to a strong position in e-commerce search, which was a key company focus in the early and mid 2000s. Many online retailers use Endeca to power their e-commerce sites, including one of my personal favorites, K&L Wines, a local wine shop that I have been known to frequent.

Wine, in fact, provides a great demo for this technology because you have plenty of facets (e.g., price, country, region, grape, points) and plenty of the kind of ambiguity that search engines handle well through taxonomy and thesaurus, but that usually stumps databases. For example, searching for “claret” on the K&L site properly returns a collection of Bordeaux wines in which the word “claret” never appears in the name or description, despite the fact that claret is not part of the official French Bordeaux classification. Someone who knows wine has configured the system to understand that some people refer to Bordeaux as claret, and thus the system has processed the query appropriately. Wine also provides other merchandising opportunities such as up-sell, cross-sell, and recommendations.
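The claret behavior can be sketched in a few lines of Python. This is a guess at the general mechanism (a thesaurus entry that maps a query term to a facet value), not a description of how K&L or Endeca actually configure it; the thesaurus, the catalog, and the function name are all hypothetical.

```python
# Hypothetical thesaurus and catalog; entries are illustrative only.
THESAURUS = {
    "claret": {"region": "Bordeaux"},
}

WINES = [
    {"name": "Chateau Example Rouge", "region": "Bordeaux", "country": "France"},
    {"name": "Example Ridge Zinfandel", "region": "Sonoma County", "country": "United States"},
    {"name": "Domaine Example Pinot Noir", "region": "Burgundy", "country": "France"},
]


def find_wines(query, wines=WINES, thesaurus=THESAURUS):
    """Match a query against names, or against a facet the thesaurus maps it to."""
    term = query.strip().lower()
    mapped = thesaurus.get(term, {})
    results = []
    for wine in wines:
        in_name = term in wine["name"].lower()
        via_thesaurus = bool(mapped) and all(
            wine.get(field) == value for field, value in mapped.items()
        )
        if in_name or via_thesaurus:
            results.append(wine)
    return results


claret_hits = find_wines("claret")
```

Here the query term never appears in any wine's name, yet the Bordeaux entry comes back because the thesaurus rewrites the term into a region constraint. A plain substring or equality match in a database would return nothing.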

In a prior revision of this post, I referred to Endeca as “the bolter,” but I decided to change that characterization because I’m not sufficiently sure how Endeca’s MDEX engine looks on the inside. I’ve always believed it’s a layer on top of an RDBMS and a search engine that bolts them together, but I’m not sure enough to bet my whole characterization on it.

What’s more, I think “drifter” better characterizes the company because I find them — like many before them — to fall into the grass-is-greener trap of getting bored with their core market (i.e., e-commerce) and instead venturing off in many new and different directions.

My prime example for this particular strategic malady is Informatica. When we partnered with them in the mid 1990s at Business Objects, Informatica was focused on data integration. Somewhere along the way, they got bored with data integration and decided to move into the “analytic applications” market, certainly egged on by the industry analysts of the day. One day ETL was all over their website; the next day it was gone.

In terms of marketing execution, the overnight transition was stunning. In terms of strategy, it was a disaster. It turns out that Informatica’s customers liked ETL. They didn’t know what analytic applications were. And if they did know and did want one, they could call either their BI or ERP vendor, both of which were viewed as a more logical supplier.

The other problem with Informatica’s analytic applications charge of the light brigade was that it alienated their partners. After years of successful partnership, Business Objects was forced to work more closely with Ascential and we later acquired Acta in our own ETL counter-strike. All of this because Informatica got bored with its market. (Sometimes I wonder if this phenomenon isn’t better labeled board-dom than boredom.)

If you’ve not heard, Informatica is doing quite well these days; the stock’s almost doubled in the past year with, guess what, a data integration focus. The strategy that was “too limiting” as a $150M company is working just fine as a $500M one.

I tell the story of Informatica because I feel like Endeca’s repeating it, though perhaps not executing as crisply on their way. Endeca’s message remains a bit of a hodgepodge. Here they’re a database. Here, they’re an information access platform. Here, they’re an enterprise search engine. And I believe the company’s go-to-market strategy is no more focused. After a strong historical focus in e-commerce, Endeca launched a big manufacturing push, and they continue to dabble in government and in media/publishing, Mark Logic’s two core markets.

The result: while I believe Endeca is approximately three times Mark Logic in overall size, I believe that Mark Logic is three times Endeca’s size in both publishing and government. That is the power of focus.

The moral is that in your hurry to get beyond e-commerce, you can end up everywhere and nowhere at the same time.

That said, I believe a lot of smart people still work at Endeca and I believe there is a strong and persistent tradition around the concept of uniting structured and unstructured information in an application platform. Bolted search/database or not, I expect Endeca to rediscover focus going forward and to re-emerge as a strong participant in the three-horse race.

Whither Oracle?
One might say it’s presumptuous to say that Oracle won’t be the Oracle of unstructured information, but I think it’s not. Since this post is long enough already, I won’t attempt to make a complete argument here. Instead, I’ll tee up a few points to show my line of reasoning and, hopefully, return to this topic in a future post.

  • Oracle is a financial consolidator
  • The NoSQL movement is as much about the RDBMS oligopoly and associated vendor pricing and practices as it is about technology
  • In my opinion, the Innovator’s Dilemma paradox fully applies to Oracle in this instance

So, who’s in the three-horse race? Mark Logic, Autonomy, and Endeca. May the market-focused, technology disruptor win!

6 responses to “The Oracle of Unstructured Information: A Three-Horse Race”

  1. Dave, it's quite a long and interesting post. Kudos to your patience. It helped me a lot to understand the ins and outs of MarkLogic's competitors' focus and strategies. Will be eagerly waiting for your thoughts on "Why Oracle is not in the 3-horse race and cannot become Oracle of unstructured data".

  2. Dave – I'm going to quote this article to you when we're both old and grey!

  3. Out of curiosity, Dave, does MarkLogic support the sorts of taxonomy-based searches shown on the K&L Wine site? It seems like publishers, especially in Science and Technology, would benefit from being able to leverage taxonomies and ontologies defined for markets and fields of study that have unusually large and complicated domain vocabularies.

    For example, on the K&L site, if you narrow the wines down by region to California, you will then be able to further filter by sub-region specific to California (Napa Valley, Sonoma County, etc). This level of structure probably isn't in the document itself, but is in the definition of the facet space.

    I'm asking because it seems like this type of structure (the knowledge that Sonoma County and Napa Valley are both in California) is what a structured DB can really help with, but I'm struck that there is very little mention of taxonomies, ontologies, etc on the MarkLogic site.

  4. Felciano, here is an answer provided by Jason Hunter of the Mark Logic team:

    Taxonomies and ontologies are (in our view) an application-level feature. MarkLogic supports taxonomies and ontologies by providing robust building-blocks that make it really easy to build them, in the various shapes and sizes that our customers require. At K&L, for example, there are a few ways to handle the hierarchical geographic representation.

    * Have each document know its country, sub-region, and appellation.

        <geography>
          <country>United States</country>
          <subregion>California</subregion>
          <appellation>Napa Valley</appellation>
        </geography>

      The top listing would be to show country values (and counts) matching the other constraints. The sub-region listing would be to show subregion values and counts matching the other constraints plus the country = "X" constraint. We can very quickly do that. And the appellation listing would be to show appellation values and counts matching the other constraints plus the established country and subregion constraints. This works fine for a taxonomy that isn't under a lot of flux. This also lets you do what K&L does and show both Country and Sub-Region as top-level listings.

    * Use geospatial indexes. Normally this isn't an option, but it would be here for a geographic constraint. Assign a geo point to every winery. Define countries, subregions, and appellations by different polygon bounding boxes.

    * Use an external value-centric definition. This is somewhat like the geospatial indexes approach in that you give each item its final value (but textual instead of geographic) and use an external definition to define the higher-up groupings (California consists of Napa Valley, Sonoma County, and …). Use OR queries to group the different text values. This works well when the groupings change more often.

    * Use scalar index bucketing. For anything with a number value (i.e. prices, sizes, ratings) there's built-in support for bucketing. Each item gets a price and you can bucket to arbitrary groupings. You can shrink the bucket sizes on each refinement. Bucket sizes can be declared externally or calculated programmatically based on peeking at the search results.

    This is just a sampling of ways you can use our built-in query and analytic tools to build a hierarchical or otherwise nested taxonomy or ontology. Looking at the K&L site it'd be pretty easy to do all the features I see. And with MarkLogic it's possible to do it against millions of items before even having to cluster, and billions of items in a cluster.
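As a rough illustration of the first and last approaches in the answer above (each item knows its place in the hierarchy, plus scalar bucketing), here is a small Python sketch. It is not MarkLogic code and uses no MarkLogic API; the items, field names, functions, and bucket edges are all hypothetical, and a real system would answer these questions from indexes rather than by scanning a list.

```python
import bisect
from collections import Counter

# Hypothetical items; each one knows its own country, subregion, and price.
ITEMS = [
    {"country": "United States", "subregion": "California",
     "appellation": "Napa Valley", "price": 45.0},
    {"country": "United States", "subregion": "California",
     "appellation": "Sonoma County", "price": 28.0},
    {"country": "United States", "subregion": "Oregon",
     "appellation": "Willamette Valley", "price": 35.0},
    {"country": "France", "subregion": "Bordeaux",
     "appellation": "Margaux", "price": 90.0},
]


def value_counts(items, field, **constraints):
    """Values and counts for `field` among items matching the constraints so far."""
    matching = [
        item for item in items
        if all(item[key] == value for key, value in constraints.items())
    ]
    return Counter(item[field] for item in matching)


def price_buckets(items, edges):
    """Count items per bucket index; `edges` are ascending bucket boundaries.

    Index 0 is below edges[0], index i covers [edges[i-1], edges[i]),
    and the last index is everything at or above edges[-1].
    """
    counts = Counter()
    for item in items:
        counts[bisect.bisect_right(edges, item["price"])] += 1
    return counts


countries = value_counts(ITEMS, "country")
subregions = value_counts(ITEMS, "subregion", country="United States")
appellations = value_counts(ITEMS, "appellation",
                            country="United States", subregion="California")
buckets = price_buckets(ITEMS, [30, 50])
```

Each drill-down level is the same counting query with one more constraint added, and shrinking the bucket sizes on refinement amounts to calling `price_buckets` again with a finer list of edges.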

  5. Dave

    Interestingly, it seems that Endeca is now organized by three markets (Public Sector, Enterprise & Ebusiness), as evidenced by their new management structure.

  6. Sameer,

    Thanks for sharing. Very interesting.
