Category Archives: XML content server

Norm Learns Rule 1

One of the fun things about Mark Logic is that we unite people from different computing backgrounds: database people, search engine people, content management people, the odd computational linguistics person, and — of course — document/XML people.

Aside: one of my big theses of computing life is that individuals tend to stovepipe into a single computing camp early on, fail to cross-breed / cross-read, and thus the camps end up quite in-bred and incommunicado over time. That’s one reason why I deliberately “jumped camps” in leaving Business Objects four years ago, hopping from BI into unstructured data / content / documents / XML.

But I digress.

We recently hired Norm Walsh, a pretty big guy in the document camp, which elicited comments such as the following from his fellow camp members:

I’m wondering how in the hell some obscure “XQuery Content” company stole Norm Walsh away from Sun. [...] Anyone care to provide some insight? Is Mark Logic really *that* good?

That was fun.

But what’s been even more fun is helping someone who is clearly a distinguished individual in one camp and introducing him to another. Towards that end, I’m happy to report that Norm is now officially certified in what I call rule 1 of database performance: push constraints to data, don’t move data to constraints.

Believe it or not, rule 1 appears quite counter-intuitive to document people who seem to innately want to materialize DOM trees and then process them in a middle tier.
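For readers who think better in code, here is a minimal sketch of rule 1. It uses Python’s built-in sqlite3 module purely as a stand-in for any DBMS; the contracts table, its columns, and both function names are hypothetical and invented for illustration. The first function drags every row into the application tier and filters there; the second hands the constraint to the database so only matching rows come back.

```python
import sqlite3

# Hypothetical in-memory table standing in for a large database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE contracts (id INTEGER PRIMARY KEY, state TEXT, value REAL)"
)
conn.executemany(
    "INSERT INTO contracts (state, value) VALUES (?, ?)",
    [("TX", 100.0), ("CA", 250.0), ("TX", 75.0), ("NY", 300.0)],
)


def move_data_to_constraints(conn, state):
    """Anti-pattern: fetch every row, then filter in the application tier."""
    rows = conn.execute("SELECT id, state, value FROM contracts").fetchall()
    return [row for row in rows if row[1] == state]


def push_constraints_to_data(conn, state):
    """Rule 1: the database applies the predicate and returns only matches."""
    return conn.execute(
        "SELECT id, state, value FROM contracts WHERE state = ? ORDER BY id",
        (state,),
    ).fetchall()
```

Both functions return the same rows. The difference is how much data crosses the wire to get there, which is invisible with four rows and painful with four hundred million.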

Because I’m so wed to the database viewpoint, I have trouble expressing it in a document-person way. That’s why I’m happy that Norm has recounted his journey here, in a post entitled Thinking Differently about XML.

Thoughts on Category Creation and Information Access Platforms [Revised]

[Revised 8/2/08; still working on cleaning up this consciousness stream.]

Back in the old days, it seemed easy to create a category in software. Look at the database market, for example:

  • IBM invents the relational DBMS (RDBMS) category
  • Oracle, Ingres, and Informix enter in a largely undifferentiated way, though Informix eventually drifts towards the low-end/cheap segment
  • Sybase creates the derivative category of high-performance OLTP RDBMS.
  • Arbor re-christens the failed multi-dimensional DBMS as the OLAP Server
  • Tandem creates the non-stop RDBMS with its superb fault tolerance
  • Illustra launches the universal DBMS and is quickly acquired by Informix
  • Sybase launches the bitmap-indexed DBMS with SybaseIQ
  • Teradata launches the data-warehouse DBMS category

And you can find just as many examples outside database-land.

  • ASK defines the manufacturing resource planning (MRP) category
  • SAP hijacks MRP, redefines it as ERP, and goes on to become the world’s largest applications software company
  • PeopleSoft invents the HRMS category
  • Gartner Group’s Howard Dresner invents the business intelligence (BI) category, re-christening and re-framing what was formerly known as DSS or EIS.
  • Siebel pioneers the sales force automation (SFA) category
  • Scopus pioneers call center automation (CCA)
  • Companies like Rubric pioneer enterprise marketing automation (EMA)
  • Siebel, through acquisition, coalesces SFA, CCA, and EMA into a single category called customer relationship management (CRM)
  • Oracle and SAP work to coalesce CRM back into ERP. Such is the ebb and flow of categories.

(And I could go on and on — BPM, KM, CMS, WCM, ECM, LMS, DRM, SCM, PLM, ETL, DI, EII — but I think I’ll stop here with the initials list.)

People are still creating categories today, and sometimes it looks easy. Uber-categories have been quite popular in the past decade as people have focused on different ways of developing and delivering software:

  • SaaS as an uber-category has worked well, with a variety of offerings in various SaaS sub-categories (e.g., Salesforce, NetSuite)
  • Appliances have done pretty much the same thing — i.e., offering an appliance alternative for a wide variety of existing categories (e.g., a data warehouse appliance a la Netezza)
  • Open source has also done the same thing — again serving as a different flavor/dimension for a wide variety of largely existing software categories.

Only a few genuinely new categories have emerged, virtualization being the most obvious example. (Though you could argue that virtualization is itself an uber-category covering storage virtualization, server virtualization, et cetera.)

Companies are still working to carve new categories, particularly in the database market.

Sometimes vendors and/or the analysts who cover them try to impose either a straight name change (e.g., from MD-DBMS to OLAP) or a strategic shift (e.g., from BI to analytic applications) in category. Sometimes they’re just bored. Sometimes a vendor’s trying to redefine the market in line with its strengths. Sometimes an analyst is trying to make his/her mark on the industry and earn the coveted “father/mother of [category name],” much as Howard Dresner successfully did with BI.

BI got bored with its name several times during my tenure at Business Objects. At one point both the analysts and Informatica were trying to re-dub the category “analytic applications” in an attempt to get a fresh name and raise the abstraction level from tools to applications. Informatica nearly died on that hill.

Later, analysts tried to redefine the category, dubbing it corporate performance management (CPM), and arguing that business intelligence needed to link with financial planning systems. While knowing actuals is good, knowing actuals compared to the plan is better, and using actuals to drive the future plan better still. Cognos nearly tripped over itself repositioning around CPM, ultimately acquiring Adaytum, which in turn led to SRC’s eventual acquisition by Business Objects.

In an art-imitates-life sort of way, one wonders if the analysts predicted a move in the market or provoked it? My chips are on the latter.

This stream-of-consciousness is a long way of winding up to a single question: are enterprise search vendors successfully repositioning themselves as “information access platforms” or not?

Background: the enterprise-search-related vendors (e.g., Fast/Microsoft, Endeca) and search/content analysts who cover them are in the midst of an attempted category repositioning:

  • The term “enterprise search” is now seemingly dead, having been contaminated by the Google Appliance. When a shark gets in the water, all the fish jump out.
  • The word “information” is increasingly being used as a unifying term to describe both data and content (aka, unstructured data)
  • Enterprise search vendors are increasingly calling themselves “information access platforms” (though not generally abbreviated as IAP, I will do so here for brevity).

For example, consider Endeca’s corporate boilerplate:

Endeca’s innovative information access software that helps people explore, analyze, and understand complex information, guiding them to unexpected insights and better decisions. The Endeca Information Access Platform, built around a new class of access-optimized database, powers applications that combine the ease of searching and browsing with the analytical power of business intelligence.

I have a number of concerns on and related to this attempted shift:

  • The important thing about categories is that they exist in the mind of the customer. Analysts and vendors can try to put them there — but they have to stick. In my mind, IAP is not sticking. I have never heard a customer say: “I need to go out and get an IAP.”
  • I do, however, believe that “information” might well stick as an overall term, meaning both data and content (aka, structured and unstructured data).
  • It is not clear to me why someone who desires a unified platform for “information” would turn to a search vendor. Search engines were designed as read-only indexes to help people find documents containing tokens; hardly ideal as an application development platform.
  • In my estimation, someone managing “special” data should turn to a database vendor. While databases have classically not handled “special” data well, databases were designed as application platforms, and there is a whole new class of specialized databases emerging for handling various “special” types of data.
  • While I think a unified platform is a dandy vision, I think no one is close to delivering a unified platform that handles all types of data equally well. Bolting Lucene and MySQL together isn’t a platform. Relational databases still do a poor job with both content and many types of data (e.g., sparse, hierarchical, or semi-structured). XML servers (like MarkLogic) handle XML brilliantly, but need work before they can match RDBMSs at classical relational data.
  • I believe that someone who needs a crawl-and-index-the-intranet value proposition should use the Google Appliance. So while I think the search vendors are correct in their desire to flee, I don’t think that “information access platform” is a good refuge.

Overall, my chips remain on the don’t come line for the attempted category repositioning from “enterprise search” to “information access platform.” You can find my stack on the come line for the emerging “special-purpose database” category and “XML servers” as an instance of them.

Stonebraker's "One Size Fits All" Papers

As frequent readers know, one of my memes is the rise of special-purpose databases, whether they be data warehouse appliances like Netezza, stream databases like Streambase, or OLAP (aka multi-dimensional) databases like Essbase, recently purchased by Oracle through the Hyperion acquisition.

I believe that MarkLogic is one of a class of special-purpose DBMSs that will be necessary to handle new requirements that were never envisioned when the RDBMS was born. The relational database is now pushing 40 years old since its invention (and pushing 30 since the first implementations in commercial products).

An easy way of seeing the problem is to think about the computers you used even 20 years ago, their disk and memory configuration, their network connection speed, the types of data they managed, and the applications they ran. For me, that would be a 1 MIPS MicroVAX II with 8 MB of memory, 256 MB of disk space, 40 users (among other things I was the sysadmin), and we used it to run a technical support call tracking system at Ingres, then known as Relational Technology, Inc.

While RDBMSs have proven remarkably extensible, for certain classes of applications (e.g., ultra-low latency trading) and databases (e.g., managing tens to hundreds of terabytes of XML documents), they are simply not appropriate.

As it turns out, I’m not the only person who sees this problem. Michael Stonebraker, noted computer science professor (formerly of UC Berkeley and now of MIT), serial entrepreneur (a founder of Ingres, Illustra, Cohera, Streambase, and Vertica), and general database visionary, thinks the same thing.

Towards that end, he co-authored two papers:

  • One Size Fits All: An Idea Whose Time Has Come and Gone. This paper makes the argument that the relational database cannot be extended ad infinitum, demonstrates how RDBMSs are inappropriate for several new applications, and argues that the DBMS market will fragment into a series of special-purpose engines, perhaps unified by a common front-end parser.
  • One Size Fits All: Part 2, Benchmarking Results. This paper buttresses the first with benchmark results for relational vs. special-purpose databases in several applications. Interestingly and pragmatically, Stonebraker argues that most people won’t even consider a special-purpose database (largely due to inertia) unless it is at least 10x faster than relational for a given application. He then demonstrates several applications where you can see 10 – 100x gains in performance. (Large text and XML contentbases are one of the cases he discusses, citing Google’s creation of their own file system and software stack to deal with Internet-scale documentbases.)

I have always found Stonebraker’s work very clear; he’s one of the few authors of academic computer science literature whose work I can always read and understand. Take a look at the articles.

If you’re not up for the papers, then here’s an interview in Red Hat Magazine that hits many of the key points. (But bear in mind he’s doing PR for Vertica here, so the examples are a bit biased towards column-orientation, and I’m sure the webinar mentioned at the bottom is a Vertica one.)

Change is Good: You Go First

This post’s title is one of my favorite sayings because it perfectly captures our conflicting attitudes toward change. Intellectually, people know that change is necessary for advancement, but emotionally, most of us still don’t like it.

Happily, for companies like Mark Logic, there are always some brave souls willing to try changing the way they do things. Sometimes these people are driven to change by external forces (e.g., publishers who know that objects like Google in the rear-view mirror are indeed closer than they appear). Sometimes, they’re just adventurous spirits working in groups dedicated to technology exploration. Sometimes, they’re open to change simply because the mission is too important not to be (e.g., preventing terrorism).

The idea for this post came to me during a recent sales call. We were visiting a publisher which was looking to replace its search engine because it was expensive, hard to configure, and under-performing expectations. Moreover, the supplier was discontinuing support of the product, forcing a potential upgrade.

The good news was that these folks had found Mark Logic and were willing to hear what we had to say. But I was worried they were “wedged” in a search paradigm. As I said on the call:

If you’re just looking to replace your search engine the way you might change the oil filter in a car, then you should just go do that; there are plenty of them out there. If, however, you’re looking to change the way you build information products, to add enormous agility to that process, and to save the expense of buying and integrating a search engine and a DBMS to boot, then you should consider Mark Logic.

Look. The paradigm defines the outcome. If you spec a vehicle as requiring wooden wheels, a spring-loaded bench, leather reins, a hand brake, and a low hay/mile consumption rate, then you are never, ever going to come up with a car.

The fact is that disruptive technologies almost never have every feature of those they replace, especially at first. (Recall that it took about a decade for the relational DBMS to become production OLTP worthy.)

So if you want to stare at Mark Logic through an enterprise search engine lens, happily you will find that it has a lot of things that search engines don’t (e.g., read/write, transactions, database-style query language). But you’ll also find it’s missing a few things that search engines do have (a recent, now-neutralized example is proximity search; see the aside below).

But that’s not the point. If you remove the search engine lens and frame the question not as “do you have reins and a hand brake” but instead as “what’s the best vehicle to get from A to B” — i.e., “what’s the best platform on which to build new information products” — then you’ll find the answer is most certainly Mark Logic and I can find about 30 happy publishers who’ll confirm that.

Time will tell where this customer ends up. They were great people, and we had a great meeting, so I hope they’ll choose to work with us. Either way, I feel for them since it’s never easy facing these sorts of challenges.

Like the headline says: change is good, but you go first.

Aside on Proximity Search

Proximity search is the name of a feature that lets you find all documents where word-A is within N words of word-B. It’s a popular search engine feature and until version 3, something that MarkLogic lacked. I like to talk about proximity because it provides a fascinating example related to disruptive change.

From a purist XML content server perspective, proximity search is a hack, a workaround to a problem that enterprise search engines face.

For example, if you want to find all contracts governed by Texas law, you could use your enterprise search engine to do a simple keyword search on “Texas” and “governing.” But say your company’s in Texas, so every contract has Texas in numerous address blocks. And every contract presumably also has a governing law section. So your query will return literally every contract in your database. Not so useful.

Proximity search addresses this problem by letting you say: find all documents where “governing” is within 10 words of “Texas”. It’s not a bad fix, if you’re an enterprise search vendor.
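To make the mechanics concrete, here is a rough sketch of a proximity test in Python. It is a naive illustration only: the tokenizer and the function names are my own assumptions, and a real search engine would answer this from positional indexes rather than by scanning text.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def within_n_words(text, word_a, word_b, n):
    """Return True if any occurrence of word_a is within n words of word_b."""
    tokens = tokenize(text)
    positions_a = [i for i, tok in enumerate(tokens) if tok == word_a.lower()]
    positions_b = [i for i, tok in enumerate(tokens) if tok == word_b.lower()]
    return any(abs(i - j) <= n for i in positions_a for j in positions_b)
```

A contract that mentions Texas only in an address block, far from the word “governing”, fails the test, while one whose governing law clause names Texas passes.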

But an XML person sees this problem differently: XML has structure, so use it. The search becomes: find all documents with a section-heading element that contains “governing” and that contain “Texas” in the first paragraph of the subsequent section. You don’t need proximity to answer this question in an XML content server.
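Here is a sketch of that structural query, written in Python with the standard library’s ElementTree so it can be run anywhere. The element names (contract, section, heading, para) are a hypothetical schema I made up for illustration, and I have simplified the query to look at the first paragraph of the section that carries the heading. Note that this sketch parses the document in the application, which is exactly what rule 1 says not to do at scale; in an XML content server the same constraint would be expressed in the query language and evaluated by the server.

```python
import xml.etree.ElementTree as ET


def governed_by(xml_text, jurisdiction):
    """True if a section whose heading mentions 'governing' names the
    jurisdiction in its first paragraph (hypothetical schema)."""
    root = ET.fromstring(xml_text)
    for section in root.iter("section"):
        # findtext returns the text of the first matching child element.
        heading = section.findtext("heading", default="")
        first_para = section.findtext("para", default="")
        if ("governing" in heading.lower()
                and jurisdiction.lower() in first_para.lower()):
            return True
    return False


texas_contract = """
<contract>
  <address>Austin, Texas</address>
  <section>
    <heading>Governing Law</heading>
    <para>This agreement is governed by the laws of Texas.</para>
  </section>
</contract>
"""

new_york_contract = """
<contract>
  <address>Austin, Texas</address>
  <section>
    <heading>Governing Law</heading>
    <para>This agreement is governed by the laws of New York.</para>
  </section>
</contract>
"""
```

Both sample contracts contain “Texas” in an address block, but only the first matches, because the structure tells the query where to look.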

So think about this: we get asked to add a feature in our product that was added to one of the technologies we’re replacing in order to fix a limitation in what they had. Wow. It’s a bit like asking for blinders for your car’s headlights.

But we did it. Why? Because proximity’s still useful in an XML content server, because XML-aware proximity is even cooler (find these elements near those elements), and because it’s about 10x easier to tell this story when our product contains proximity than when it doesn’t. Interesting, n’est-ce pas?

Web Applications: The Virtues of Top-to-Bottom XML

I think that most people now correctly perceive our product, MarkLogic Server, as an XML content server, a special-purpose DBMS designed specifically for handling XML marked-up content. That’s the good news.

The better news is that many of these same people are figuring out what that means when it comes to developing web applications – specifically, that you can use an XML content server to build web applications using XML top-to-bottom. No Java required. No relational tables required. No application server required. (And no expense for all those supporting products.)

Don’t get me wrong. Many customers choose to use MarkLogic as the XML repository and query system in their architecture, building their applications in Java, using an application server, and making calls out to MarkLogic to process XML queries. Lots of people use the product in that way. That’s fine.

But, people soon realize, when you have a DBMS and query language (XQuery) that directly outputs XML (e.g., xHTML) which can be directly rendered by a browser, and when that “query” language is really a misnamed and underpositioned programming language easily capable of developing entire applications, you can say:

“Wait a minute. My content’s in XML. My browser speaks XML. Why not build my whole app top-to-bottom in XML and XQuery?”

Good question. And the answer is you can. And in many cases, you probably should. What’s the advantage of so doing?

  • Use of a high-level, standard, powerful programming language, XQuery. High-level and powerful translate to greater development and maintenance productivity. Standard translates to risk reduction and freedom of choice. (Aside: While XQuery is not a big-hype, overnight-success type of technology like Ajax, XQuery continues to march along with certain inevitability. In my mind, there is no question that XQuery will be the database programming language of the future – it is superior to SQL, it is more general than SQL and ergo applicable to a broader class of problems, and all major DBMS vendors are already committed to it. The question is not will XQuery become mainstream, but when?)
  • Elimination of three impedance mismatches: Java/XML, XML/relational, and Java/relational. Java is object-oriented, XML is hierarchical, and relational databases are tabular. The mapping between these three different data models generates a lot of zero-value-added work in developing an application. When you’re XML top-to-bottom, poof, that work’s all gone.
  • Elimination of tiers. I had lunch a while back with a top engineer at Oracle who told me that he believed the limiting factor on database application performance was becoming scheduling. That is, hardware and databases are becoming so fast that scheduling work across tiers was becoming the limiting factor in performance. His suggested solution? Eliminate tiers. Well top-to-bottom XML does exactly that.

Honey, I Shrunk the Company: Convera Sells Retrievalware to Fast

Two days ago, Norwegian enterprise search vendor Fast Search and Transfer announced an agreement to purchase the Retrievalware business from Convera for $23M. You can find the press release here.

Let’s try to understand what this means.

First, some background on Convera. Technically, Convera is a seven-year old company created through the combination of Excalibur Technologies and Intel’s Interactive Media Services division. I’d always thought of Convera as the re-branding and reincarnation of Excalibur, a search company that has been around for over 20 years. Convera always struck me as a company that historically did well in Federal government (e.g., defense, intelligence), but that never appreciated its own strengths.

Financially, Convera has not done well. For example, in its most recent quarter, 4Q07 (FY ends on 1/31), Convera reported total revenues of $2.8M, down 24% from 4Q06, and a net loss of $9.7M. Retrievalware revenues in 4Q07 totaled $2.6M, down 27% from 4Q06. Looking over the longer term, the FY06 10-K shows on page 23 that annual revenues have monotonically decreased since 2004, descending from $29.3M to $25.M to $21.0M and, per this press release, to $16.7M in 2007, reducing the company nearly by half over the past 4 years.

I’d occasionally joked that it was perhaps appropriate that the company’s headquarters were on Gallows Road.

Convera has some quirkiness in its history, detailed in this Washington Post story. I’d guess that one reason Convera has not been content simply to be a Federal play is that Herb Allen is a media mogul, running an exclusive conference in Sun Valley, and arguably the premier investment house in media and entertainment. Hey, when you’re on the Forbes Billionaire List already, why mess around with a Federal play when, with luck, you might convert it to the next Google, and without luck you lose what amounts to a rounding error? When billionaires play, it’s rarely to make pocket change and it’s usually for keeps.

This is speculation on my part, but my guess is that Allen’s involvement is what accounts for Convera’s schizophrenic past, as evidenced by this graphic that I took off their homepage today.

To me, Convera is one small ($10M run-rate), shrinking company with two strategies: vertical search platform and enterprise search engine. Or, I should say, was.

After this deal, it seems that Convera becomes a tiny ($800K run-rate) company with one strategy. While it’s hard to believe — and I’ve had to check the figures a few times to do so — Convera seems to have sold the business that accounts for 93% of their revenues. While I might question their wisdom or sanity, I certainly can’t fault them on commitment.
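For anyone who wants to check the figures too, here is the back-of-envelope arithmetic as a short Python sketch. The inputs are the numbers quoted above; the one assumption is that run-rate means the most recent quarter multiplied by four.

```python
# Figures quoted in this post, in millions of dollars.
q4_total_revenue = 2.8           # 4Q07 total revenue
q4_retrievalware_revenue = 2.6   # 4Q07 Retrievalware revenue
annual_revenue_2004 = 29.3
annual_revenue_2007 = 16.7

# Assumption: run-rate is the latest quarter annualized (times four).
run_rate_before = q4_total_revenue * 4
run_rate_after = (q4_total_revenue - q4_retrievalware_revenue) * 4

# Share of quarterly revenue that goes to Fast with Retrievalware.
retrievalware_share = q4_retrievalware_revenue / q4_total_revenue

# Fractional revenue decline from 2004 to 2007.
four_year_decline = 1 - annual_revenue_2007 / annual_revenue_2004

print(f"Run-rate before the sale: ${run_rate_before:.1f}M")
print(f"Run-rate after the sale:  ${run_rate_after:.1f}M")
print(f"Retrievalware share:      {retrievalware_share:.0%}")
print(f"Decline since 2004:       {four_year_decline:.0%}")
```

On these inputs the before figure comes to about $11M (which I have been rounding to $10M), the after figure to about $0.8M, the Retrievalware share to about 93%, and the decline since 2004 to about 43%.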

Let’s flip over to the Fast side of the equation.

Since no MBA who passed quant class would pay $23M for a $10M business shrinking at 24%, there needs to be more going on here. In this IWR blog post, CEO John Lervik says that the deal helps Fast in “aiming at the lucrative government market,” which this InformationWeek story says accounts for about 70% of the acquired business.

That’s consistent with Fast’s recent comments about tactical acquisitions, and I suppose the business argument is that they can try to sell their search technology to the Retrievalware installed base. The success of that strategy will depend on a number of variables:

  • Have Retrievalware customers already and long-ago found alternative paths forward?
  • Are those that remain customers merely interested in keeping existing systems running?
  • Is enterprise search technology the appropriate replacement technology?
  • Will government customers, particularly in the sensitive defense and intelligence sectors where Convera did much of its work, be comfortable buying from foreign suppliers? [See note below.]

In our experience, particularly in Federal government, XML content servers are often a better replacement technology than contemporary search engines. That’s because (1) government likes XML as a storage format since it’s open and standard, (2) XQuery can express arbitrarily complex queries, (3) it’s easy to hook a series of best-of-breed extraction / enrichment tools together in an open architecture, and (4) government contentbases are often massive in scale and require the ability to run very complex queries against very large contentbases with high performance.

The last point requires obeying “rule 1” of database performance, which troubles search engines because, compared to XML content servers, they have a limited ability to push constraints to data.

As for Convera’s vertical search platform strategy, I’ll say one thing: they have most definitely burned the ships on landing in the New World.

Time will tell whether they go on to greatness or get eaten by the natives. Either way, there’s no going back now.

# # #

Note: I do not claim definitive expertise on whether the US government or sectors of it can or should buy software from US or foreign suppliers. While I do know that the Buy American Act exists, it seems to exclude software in section 25.103 (e). Despite that, I often hear that there are “issues” with foreign suppliers in the more sensitive sectors of government and I would welcome email pointing me to relevant regulations. Meantime, I have disabled comments on this post to avoid repeating a problem I had in the past with what I suspect were competitors testifying anonymously and anecdotally to the contrary. Since it’s my blog, I will share my opinion based on the people I’ve asked this question. Please feel free to send me information (e.g., links to regulations) so I can learn more.

See the FAQ for information on my comment policy.

Buxton IEEE Article: Beyond Search, Content Applications

Mark Logic’s own Stephen Buxton, co-author of the definitive tome, Querying XML, has recently published an article in IT Pro (a publication of the IEEE Computer Society) entitled “Beyond Search: Content Applications.”

Here is a link to the article (subscription required). If you follow the link you can either view the abstract or buy the article for $19. Here’s a link to the editor’s introduction of the issue (free), where he says:

“Stephen Buxton’s article on XML content servers describes the unique capabilities of this form of repository system and the extreme precision and information extraction that it can achieve. The server’s content of unstructured text is richly tagged, usually by inflow entity extractors or taxonomies. This provides a high degree of semantic quality and makes high relevancy search and disambiguation possible. Search, as well as other applications, can be developed to sit atop the server and take full advantage of the metadata. In this way, the enterprise can benefit from true information extraction in search as well as in other applications requiring high precision and a degree of semantic awareness.”

In the article Buxton differentiates enterprise search engines from XML content servers as candidate platforms for content applications.

He also discusses several example content applications, including:

  • The Oxford University Press African American Studies Center, an online product for social sciences libraries and researchers that does extensive content integration and repurposing
  • O’Reilly Media’s SafariU, a custom publishing system that enables professors to build custom books, online through a web interface with printed versions shipped to the campus bookstore in about 2 weeks
  • Elsevier’s PathConsult, a highly contextual application designed for pathologists in order to assist them in the tricky task of differential diagnosis.

It’s worth the $19 — go ahead and get the article. Heck, it’s cheaper and faster to read than his book!