Category Archives: Enterprise Search

Change is Good: You Go First

This post’s title is one of my favorite sayings because it perfectly captures our conflicting attitudes toward change. Intellectually, people know that change is necessary for advancement, but emotionally, most of us still don’t like it.

Happily, for companies like Mark Logic, there are always some brave souls willing to try changing the way they do things. Sometimes these people are driven to change by external forces (e.g., publishers who know that objects like Google in the rear-view mirror are indeed closer than they appear). Sometimes, they’re just adventurous spirits working in groups dedicated to technology exploration. Sometimes, they’re open to change simply because the mission is too important not to be (e.g., preventing terrorism).

The idea for this post came to me during a recent sales call. We were visiting a publisher that was looking to replace its search engine because it was expensive, hard to configure, and underperforming expectations. Moreover, the supplier was discontinuing support for the product, forcing an upgrade decision.

The good news was that these folks had found Mark Logic and were willing to hear what we had to say. But I was worried they were “wedged” in a search paradigm. As I said on the call:

If you’re just looking to replace your search engine the way you might change the oil filter in a car, then you should just go do that; there are plenty of them out there. If, however, you’re looking to change the way you build information products, to add enormous agility to that process, and to save the expense of buying and integrating a search engine and a DBMS to boot, then you should consider Mark Logic.

Look. The paradigm defines the outcome. If you spec a vehicle as requiring wooden wheels, a spring-loaded bench, leather reins, a hand brake, and a low hay/mile consumption rate, then you are never, ever going to come up with a car.

The fact is that disruptive technologies almost never have every feature of those they replace, especially at first. (Recall that it took about a decade for the relational DBMS to become production OLTP worthy.)

So if you want to stare at Mark Logic through an enterprise search engine lens, happily you will find that it has a lot of things that search engines don’t (e.g., read/write, transactions, a database-style query language). But you’ll also find it’s missing a few things that search engines do have (e.g., proximity search, a recent and now-neutralized example; see the aside below).

But that’s not the point. If you remove the search engine lens and frame the question not as “do you have reins and a hand brake” but instead as “what’s the best vehicle to get from A to B” — i.e., “what’s the best platform on which to build new information products” — then you’ll find the answer is most certainly Mark Logic and I can find about 30 happy publishers who’ll confirm that.

Time will tell where this customer ends up. They were great people, and we had a great meeting, so I hope they’ll choose to work with us. Either way, I feel for them since it’s never easy facing these sorts of challenges.

Like the headline says: change is good, but you go first.

Aside on Proximity Search

Proximity search is the name of a feature that lets you find all documents where word-A is within N words of word-B. It’s a popular search engine feature and, until version 3, something that MarkLogic lacked. I like to talk about proximity because it provides a fascinating example related to disruptive change.

From a purist XML content server perspective, proximity search is a hack, a workaround to a problem that enterprise search engines face.

For example, if you want to find all contracts governed by Texas law, you could use your enterprise search engine to do a simple keyword search on “Texas” and “governing.” But say your company’s in Texas, so every contract has Texas in numerous address blocks. And every contract presumably also has a governing law section. So your query will return literally every contract in your database. Not so useful.

Proximity search addresses this problem by letting you say: find all documents where “governing” is within 10 words of “Texas”. It’s not a bad fix, if you’re an enterprise search vendor.

But an XML person sees this problem differently: XML has structure, so use it. The search becomes: find all documents with a section-heading element that contains “governing” and that contain “Texas” in the first paragraph of the subsequent section. You don’t need proximity to answer this question in an XML content server.
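
To make that concrete, here’s a minimal XQuery sketch of the structural approach. The element names (section, section-heading, p) and the collection name are assumptions for illustration, not a real schema:

    (: Find contracts whose governing-law section mentions Texas.          :)
    (: section, section-heading, p, and "contracts" are all assumed names. :)
    for $contract in fn:collection("contracts")
    where fn:exists(
      $contract//section[section-heading[fn:contains(., "governing")]]
                        [p[1][fn:contains(., "Texas")]]
    )
    return fn:base-uri($contract)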

So think about this: we get asked to add to our product a feature that one of the technologies we’re replacing invented to work around its own limitations. Wow. It’s a bit like asking for blinders for your car’s headlights.

But we did it. Why? Because proximity’s still useful in an XML content server, because XML-aware proximity is even cooler (find these elements near those elements), and because it’s about 10x easier to tell this story when our product contains proximity than when it doesn’t. Interesting, n’est-ce pas?
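
For the record, here’s roughly the shape both flavors take using MarkLogic’s cts: query functions. This is a sketch of the general idea, and the element names are invented for the example:

    (: Classic proximity: "governing" within 10 words of "Texas". :)
    cts:search(fn:collection(),
      cts:near-query((cts:word-query("governing"),
                      cts:word-query("Texas")), 10))

    (: XML-aware proximity: "governing" in a section-heading element near :)
    (: "Texas" in a p element. The element names are assumptions.         :)
    cts:search(fn:collection(),
      cts:near-query((cts:element-word-query(xs:QName("section-heading"), "governing"),
                      cts:element-word-query(xs:QName("p"), "Texas")), 10))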

Honey, I Shrunk the Company: Convera Sells Retrievalware to Fast

Two days ago, Norwegian enterprise search vendor Fast Search and Transfer announced an agreement to purchase the Retrievalware business from Convera for $23M. You can find the press release here.

Let’s try to understand what this means.

First, some background on Convera. Technically, Convera is a seven-year-old company created through the combination of Excalibur Technologies and Intel’s Interactive Media Services division. I’d always thought of Convera as the rebranding and reincarnation of Excalibur, a search company that had been around for over 20 years. Convera always struck me as a company that historically did well in the Federal government (e.g., defense, intelligence) but never appreciated its own strengths.

Financially, Convera has not done well. In its most recent quarter, 4Q07 (the fiscal year ends on 1/31), Convera reported total revenues of $2.8M, down 24% from 4Q06, and a net loss of $9.7M. Retrievalware revenues in 4Q07 totaled $2.6M, down 27% from 4Q06. Over the longer term, the FY06 10-K shows on page 23 that annual revenues have decreased monotonically since 2004, descending from $29.3M to $25.0M to $21.0M and, per this press release, to $16.7M in 2007, nearly halving the company over the past four years.

I’d occasionally joked that it was perhaps appropriate that the company’s headquarters were on Gallows Road.

Convera has some quirkiness in its history, detailed in this Washington Post story. I’d guess that one reason Convera has not been content simply to be a Federal play is that Herb Allen is a media mogul, running an exclusive conference in Sun Valley and arguably the premier investment house in media and entertainment. Hey, when you’re already on the Forbes Billionaire List, why mess around with a Federal play when, with luck, you might convert it into the next Google, and, without luck, you lose what amounts to a rounding error? When billionaires play, it’s rarely for pocket change and it’s usually for keeps.

This is speculation on my part, but my guess is that Allen’s involvement is what accounts for Convera’s schizophrenic past, as evidenced by this graphic that I took off their homepage today.

To me, Convera is one small ($10M run-rate), shrinking company with two strategies: vertical search platform and enterprise search engine. Or, I should say, was.

After this deal, it seems that Convera becomes a tiny ($800K run-rate) company with one strategy. While it’s hard to believe — and I’ve had to check the figures a few times to do so — Convera seems to have sold the business that accounts for 93% of their revenues. While I might question their wisdom or sanity, I certainly can’t fault them on commitment.

Let’s flip over to the Fast side of the equation.

Since no MBA who passed quant class would pay $23M for a $10M business shrinking at 27%, there needs to be more going on here. In this IWR blog post, CEO John Lervik says that the deal helps Fast in “aiming at the lucrative government market,” which this InformationWeek story says accounts for about 70% of the acquired business.

That’s consistent with Fast’s recent comments about tactical acquisitions, and I suppose the business argument is that they can try to sell their search technology to the Retrievalware installed base. The success of that strategy will depend on a number of variables:

  • Have Retrievalware customers long since found alternative paths forward?
  • Are the customers that remain merely interested in keeping existing systems running?
  • Is enterprise search technology the appropriate replacement technology?
  • Will government customers, particularly in the sensitive defense and intelligence sectors where Convera did much of its work, be comfortable buying from foreign suppliers? [See note below.]

In our experience, particularly in Federal government, XML content servers are often a better replacement technology than contemporary search engines. That’s because (1) government likes XML as a storage format since it’s open and standard, (2) XQuery can express arbitrarily complex queries, (3) an open architecture makes it easy to hook a series of best-of-breed extraction and enrichment tools together, and (4) government contentbases are often massive, requiring very complex queries to run against very large contentbases with high performance.

The last point requires obeying “rule 1” of database performance, which troubles search engines because, compared to XML content servers, they have a limited ability to push constraints to data.
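
To illustrate, here’s a hedged sketch of constraint pushdown in XQuery against MarkLogic: both the full-text and the structural constraint sit inside a single cts:search call, so they resolve against the server’s indexes instead of being filtered in a middle tier. The collection and element names are invented for the example:

    (: Both constraints resolve against indexes inside the server. :)
    cts:search(fn:collection("reports"),
      cts:and-query((
        cts:word-query("pipeline"),
        cts:element-value-query(xs:QName("country"), "Nigeria")
      )))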

As for Convera’s vertical search platform strategy, I’ll say one thing: they have most definitely burned the ships on landing in the New World.

Time will tell whether they go on to greatness or get eaten by the natives. Either way, there’s no going back now.

# # #

Note: I do not claim definitive expertise on whether the US government, or sectors of it, can or should buy software from foreign suppliers. I do know that the Buy American Act exists, but it appears to exclude software in section 25.103 (e). Despite that, I often hear that there are “issues” with foreign suppliers in the more sensitive sectors of government. Since it’s my blog, I’ll share the opinion I’ve formed from the people I’ve asked, and I have disabled comments on this post to avoid repeating a problem I had in the past with what I suspect were competitors testifying anonymously and anecdotally to the contrary. Please feel free to email me information (e.g., links to relevant regulations) so I can learn more.

See the FAQ for information on my comment policy.

The High Cost of Ineffective Search

Just a quick post pointing to a recent article on the costs associated with ineffective enterprise search.

Tidbits include:

  • According to IDC, a company with 1,000 information workers can expect more than $5M in annual wasted salary costs because of poor search.
  • A recent survey of 1,000 middle managers found that more than half the information they find during searching is useless.
  • According to Butler Group, as much as 10% of a company’s salary costs are wasted through ineffective search.
  • According to Sue Feldman of IDC, people spend 9-10 hours per week searching for information and aren’t successful 1/3 to 1/2 the time.

As I always say, there’s a reason why “enterprise search sucks” returns over 1M hits on Google, including posts from luminaries such as Jon Udell and Tony Byrne.

While Mark Logic is not out to solve the generic enterprise search problem, I have long believed that enterprise search, as a category, will become stuck between a rock and a hard place.

  • The rock is the commoditization of the low-end enterprise search market through offerings like the Google Appliance and IBM OmniFind Yahoo Edition. This will suck the money out of the low end, the generic crawl-and-index market.
  • The hard place is DBMSs — specifically, DBMS-based content applications built to help people in specific roles perform specific tasks. Some people build these applications today by trying to bolt together an enterprise search engine and a DBMS (e.g., Oracle + Verity or Lucene + MySQL), but increasingly I believe people will use XML content servers (special-purpose DBMSs designed to handle content) for this purpose.

When you think about it, an inverted keyword index can only help you so much when trying to solve a problem — even if you gussy it up with taxonomies and sexy extraction technology. In the end, an application designed to solve a specific problem will trump a souped-up tool every time.

Buxton IEEE Article: Beyond Search, Content Applications

Mark Logic’s own Stephen Buxton, co-author of the definitive tome, Querying XML, has recently published an article in IT Pro (a publication of the IEEE Computer Society) entitled “Beyond Search: Content Applications.”

Here is a link to the article (subscription required). If you follow the link, you can either view the abstract or buy the article for $19. Here’s a link to the editor’s introduction to the issue (free), where he says:

“Stephen Buxton’s article on XML content servers describes the unique capabilities of this form of repository system and the extreme precision and information extraction that it can achieve. The server’s content of unstructured text is richly tagged, usually by inflow entity extractors or taxonomies. This provides a high degree of semantic quality and makes high relevancy search and disambiguation possible. Search, as well as other applications, can be developed to sit atop the server and take full advantage of the metadata. In this way, the enterprise can benefit from true information extraction in search as well as in other applications requiring high precision and a degree of semantic awareness.”

In the article Buxton differentiates enterprise search engines from XML content servers as candidate platforms for content applications.

He also discusses several example content applications, including:

  • The Oxford University Press African American Studies Center, an online product for social science libraries and researchers that does extensive content integration and repurposing
  • O’Reilly Media’s SafariU, a custom publishing system that enables professors to build custom books online through a web interface, with printed versions shipped to the campus bookstore in about two weeks
  • Elsevier’s PathConsult, a highly contextual application designed to assist pathologists with the tricky task of differential diagnosis.

It’s worth the $19 — go ahead and get the article. Heck, it’s cheaper and faster to read than his book!

Search Engine as Implemented in MarkLogic

One of our field consultants, Matt Turner, has started a blog called Discovering XQuery. Since Matt is often in demand and spends lots of time working on customer engagements, he hasn’t found the time to do many posts, but the ones he has done are quite good. So please check out his blog, linked above, and egg him on to do more posting.

At my request, Matt took on a subject that I thought needed more explaining. As frequent readers will know, MarkLogic Server is an XML content server: a special-purpose DBMS designed to handle XML content, and in MarkLogic’s case, large amounts of it with very high performance.

Because DBMSs are generally not designed to handle content, at Mark Logic we typically compete with search engines (sometimes tied to DBMSs) in customer engagements. One question that invariably comes up: how does MarkLogic differ from a search engine?

There are many answers:

  • MarkLogic is a DBMS; a search engine is an indexing system. Think VSAM vs. Oracle.
  • MarkLogic has transactions. So the second a document is inserted into a database, it’s visible to all subsequent queries. (There is no indexing latency.)
  • MarkLogic has updates. Like any DBMS, we allow updates and do proper concurrency control when performing them.
  • MarkLogic has read-consistent snapshots. Like Oracle, MarkLogic shows you the results of a query that consistently reflect the state of the database at the start of your query. (This is also sometimes called read consistency, or non-blocking consistent reads.)
  • MarkLogic has a query language as its interface, instead of an API.
  • MarkLogic’s query language (XQuery) is a W3C standard, and not a proprietary vendor API.
  • Because XQuery is a powerful language, much processing can be pushed to the database tier, resulting in applications with little or no middle tier. With search engines, you typically write a thick middle tier of Java code to process documents returned by the search engine. For example, if you want to extract all footnotes from a document, MarkLogic can return this directly from an XQuery; a search engine will return links to all documents with footnotes and you then have to create a DOM tree for each document and traverse it to find and extract all footnotes.
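
To make the footnote example concrete, the MarkLogic side really is just a couple of lines. The document URI and the footnote element name are assumptions for the sketch:

    (: Return every footnote element, directly from the database tier. :)
    for $note in fn:doc("/contracts/msa.xml")//footnote
    return $note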

MarkLogic uses search-engine indexing, query-processing, and scaling techniques, but it is a DBMS.

But the big conceptual difference is that MarkLogic is a platform for building content applications. And one basic, almost trivial, content application is “enterprise search” — i.e., returning links to documents that contain a given word or phrase.

For example, say you have a collection of XML content and want enterprise search functionality against it. You could, with about a page of code, implement an XML enterprise search engine using XQuery. And here’s what it would look like. Thanks to Matt for banging out the example (as well as the car metaphor).
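
Matt’s post has the full page of code; as a teaser, a bare-bones version looks something like the sketch below. The hard-coded query string and the shape of the results are my inventions, not Matt’s:

    (: Minimal keyword search: relevance-ranked, top ten hits. :)
    let $query := "governing law"  (: hard-coded here; a real app would read a request parameter :)
    return
      <results>{
        for $doc in fn:subsequence(
                      cts:search(fn:collection(), cts:word-query($query)), 1, 10)
        return <hit uri="{fn:base-uri($doc)}" score="{cts:score($doc)}"/>
      }</results>

cts:search returns results in relevance order by default, so the ten hits returned here are the ten most relevant ones.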