Category Archives: text analytics

Open Text Snags Nstein

Open Text Corp. today announced that it was acquiring Montreal-based text mining and publishing solutions vendor Nstein Technologies for CDN $0.65 per share, or CDN $35M (equivalent to US $33.5M). That price represents a 100% premium over the trailing 30-day average closing price of Nstein’s common shares, which trade publicly on the TSX Venture Exchange (TSXV).

In its most recently reported financial period, 3Q09, Nstein reported (all figures CDN) $4.6M in revenues and -$0.8M in EBITDA.  Revenue was down 24% sequentially and 17% year-over-year.  Given the $18.4M run-rate (4 x the $4.6M quarter) and the $24.2M in TTM revenues, Open Text paid 1.9x run-rate and 1.4x TTM revenues for the small, largely text-mining-focused concern.  While the 100% premium is surely good news for shareholders, it’s off a pre-premium valuation of roughly $17.5M, which is less than 1x TTM revenues (0.72x to be precise).  Then again, the company was both losing money and shrinking.

I’ve charted 11 quarters of Nstein history above, which makes the picture pretty clear.  Even the 2/08 acquisition of Picdar couldn’t get growth going, organic or otherwise.

In terms of focus, Nstein’s roots were in text mining.  The Eurocortex acquisition brought them a poor man’s CMS, with Nstein paying less for the company than large Documentum customers pay for a license.  Picdar brought them digital asset management.  So you had a company doing $4.6M a quarter split across three areas:  text mining, CMS, and DAM.  Given the abnormally low 52% gross margins, a whole lot of that revenue must have been services, so they were maybe doing $2M a quarter in license.  That’s about $0.7M in license for each of the three areas, which basically rounds down to nothing.  Remember the expression:  if you try to be all things to all people, you can end up being nothing to everyone.  This appears to be yet another example.

To my knowledge, this focus splitting was done in the name of “solutions,” though what the company was known for — to the extent it was known at all — was text mining.  I’ve previously blogged on such solutions strategies, and Nstein’s in particular:  NStein 2Q08, Growth Slows:  The Moldy Sandwich.

The tension highlighted in the “moldy sandwich” argument is between creating a truly best-of-breed component (e.g., a sentiment analysis engine) and offering customers complete solutions to problems.  Companies are invariably pulled by their salesforces toward the latter, while most companies can only credibly offer the former.  Simply put:  do you want to offer your customers great ham, great cheese, or great mayo — and ask them to build the sandwich — or do you want to offer them a complete sandwich, but made from bad ingredients?  For most technology companies, I’d say you’re kidding yourself if you think you can do both.

While I’ve never been a fan of the moldy sandwich strategy, I both know and like several of the folks at Nstein, and want to offer my congratulations to them on this deal.  While I’m guessing the CMS will go away and the DAM customers will be moved to Artesia, I’m reasonably sure that they have found a nice home for the text mining engine and gotten a reasonable valuation for the firm (given its trajectory) and a nice pop for shareholders in the process.

Semantic Technology at the New York Times

I recently had the pleasure of meeting Evan Sandhaus, semantic technologist at The New York Times R&D, and wanted to highlight and share a few things that we discussed.

Evan gave an information-packed, 79-slide keynote address at the recent Semantic Technology Conference in San Jose. During our meeting we went through some of the slides, and they were fantastic. The slides aren’t publicly posted yet, but I hope they soon will be, and I’ll update this post with a link if and when they are.

He also told me about the New York Times’ recent release of a 1.8M article corpus to the computer science research community, known as The New York Times Annotated Corpus. The corpus includes nearly every article published in the New York Times for twenty years (between 1/1/87 and 6/19/07) in XML format (NITF to be precise) along with various metadata about the articles.

They believe the corpus can be a valuable resource for a number of natural language processing research areas, including document summarization, document categorization, and automatic content extraction. I think that’s true not only because it’s real content in real volume, but because that content comes with real, high-quality metadata that you can use to build upon and/or validate various text processing algorithms.

Finally, in prepping for the meeting I found this video interview with Evan at the New York Semantic Meetup. Great stuff, embedded below.

SAS Acquires Teragram

In a not terribly surprising move, SAS announced on Monday that it is acquiring Teragram. Seth Grimes of Intelligent Enterprise covers the announcement here. CMS Watch discusses it here.

This is basically a replay of the $76M Business Objects / Inxight deal last May.

At a trends level, this is about the BI vendors (can you say that anymore? Perhaps I need to say BI vendor, since SAS is the only independent left) wanting to perform analytics against both structured and unstructured data. For most of the evolution of the BI category, unstructured data was not part of the picture. Only in the past few years has it really hit the radar in the BI community.

The approach these folks typically take is to use text mining engines to structure the unstructured data. For example, identify which documents talk about which products. Identify which documents have which tone. Then load that data into the data warehouse so you can run a report that shows sales by week with two more columns added: number of emails to customer support and % negative in tone.
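
To make that concrete, here is a minimal sketch of the “structure the unstructured” step, written in XQuery since that’s the language this blog tends to reach for. The collection URI, the element names, and the crude keyword-based tone test are all invented for illustration; in practice the product and tone columns would come from a real text mining engine rather than a word list.

    xquery version "1.0";

    (: Hypothetical sketch: derive warehouse-friendly rows from support emails.
       The collection URI, element names, and keyword "tone" test are assumptions. :)
    declare variable $negative-words := ("refund", "broken", "angry", "cancel");

    for $email in collection("support-emails")/email
    let $body := lower-case(string($email/body))
    let $is-negative := some $w in $negative-words satisfies contains($body, $w)
    return
      <row>
        <product>{ string($email/product) }</product>
        <week>{ string($email/@week) }</week>
        <negative>{ if ($is-negative) then 1 else 0 }</negative>
      </row>

Each resulting row is now plain data that can be loaded into the warehouse and aggregated into the “emails per week, % negative in tone” columns described above.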

Put differently, when you have a multi-billion dollar data infrastructure and you’re faced with content, what do you want to do with it? Turn it into data. This is not a bad start, and it does enable more and better dashboards and reports. However, to build really powerful content analytic applications, I believe you need a specialized server (i.e., an XML server like MarkLogic) to handle the documents.
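
By way of contrast, here is an equally minimal sketch of querying the documents themselves instead of a flattened extract (again with invented element names and collection URI). Because the content stays in document form, the query can hand back the actual offending paragraphs, not just a count of them.

    xquery version "1.0";

    (: Hypothetical sketch: a document question a warehouse row can't answer,
       e.g. "show me the paragraphs in support emails that mention batteries".
       Collection URI and element names are assumptions. :)
    for $email in collection("support-emails")/email
    where contains(lower-case(string($email/body)), "battery")
    return
      <hit product="{ string($email/product) }">
        { $email/body/para[contains(lower-case(.), "battery")] }
      </hit>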

In many ways, it’s like OLAP. If you wanted basic slice and dice, you could build that into a reporting tool. But if you wanted real OLAP — big databases, instant performance, complex analysis — then you needed a specialized DBMS — an OLAP server — to do it.

Lazy XML Enrichment

One of my big gripes with most content-oriented software is that it requires a big bang approach (see The First Step’s a Doozy). The basic premise behind most content software is roughly:

1. If you do all this hard work to perfectly standardize the schema of your content, perfectly tag it, and possibly perfectly shred it, then

2. You can do cool stuff like content repurposing, content integration, multi-channel content delivery, and custom publishing.

The problem is, of course, that the first step is lethal. Many content software projects blow up on the launchpad because they can’t get beyond step 1. Our first customer had been stuck on step 1 for 18 months with Oracle before they found Mark Logic. (We loaded their content in a week.) At a recent Federal tradeshow, we had dinner with some folks from Booz Allen who’d been trying to load some semi-structured message traffic data into a relational database for months. We told them to swing by our booth the next day. Our sales engineer then loaded their content over a cup of coffee while eating a muffin and built a basic application in an hour. They couldn’t believe it.

In most companies — even publishers — content is a mess. It’s in 100 different places in 15 different formats, and each defined format is usually more of an aspiration than a standard. Once, at a multi-billion dollar publisher, one of our technical guys actually found this sentence in some internal documentation: “it is believed that this tag is used to …” Only folklore describes the schema.

So when it comes to the general problem of making XML richer — i.e., having more tags that indicate more meaning — many people take the same big-bang approach: “Well, step 1 would be to put all the content into a single schema (which alone could kill you) and run it through a dozen different entity, fact, sentiment, concept, and summarization ‘extractors’ that can mark up the content and its fragments with lots of new and powerful tags (which alone could cost millions).”

Again, step 1 becomes lethal.

At Mark Logic we advocate that people consider the opposite approach. Instead of:

  • Step 1: make the content perfect so you can enable any application you want to build
  • Step 2: build an application

We say:

  • Step 1: figure out the application you want to build
  • Step 2: figure out which portions of your markup need to be improved to build that application
  • Step 3: improve only that markup, sometimes manually, sometimes with extraction software, and sometimes with heuristics (i.e., rules of thumb) coded in XQuery
  • Step 4: build your application and get some business value from it
  • Step 5: repeat the process, driven by subsequent application requirements

I call this lazy XML enrichment. You could call it application-driven, as opposed to infrastructure-driven, content cleanup. I think it’s an infinitely better approach because it delivers business results faster and eliminates the risk of either never finishing the first step because it’s impossible, or having funding yanked by the business because it runs out of patience with an IT project that’s showing no ostensible progress.

At this point, I’d like to direct those of technical heart to Matt Turner’s Discovering XQuery blog where he provides a detailed post (code included) that shows an example of lazy, heuristic-based XML enrichment, here.

  • Matt’s example shows lazy enrichment because the only markup he needs for his desired application is related to weapons, so that’s all he adds.
  • Matt’s example is heuristic-based because he devises a way to find weapons in XQuery, and then uses XQuery to tag them as such; a minimal sketch of that pattern follows below.
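
For readers who don’t want to click through just yet, here is a minimal sketch of that style of heuristic enrichment. To be clear, this is not Matt’s code: the weapon list and element names are invented, and the heuristic is deliberately naive. The point is the shape of the approach, namely adding only the markup the application needs and leaving everything else alone.

    xquery version "1.0";

    (: Hypothetical sketch of lazy, heuristic-based enrichment: only weapon
       markup is added, because that's all the target application needs.
       The weapon list and element names are invented for illustration. :)
    declare variable $weapons := ("dagger", "rapier", "musket");

    declare function local:enrich($node as node()) as node()*
    {
      typeswitch ($node)
        case element() return
          element { node-name($node) } {
            $node/@*,
            for $child in $node/node() return local:enrich($child)
          }
        case text() return
          if (some $w in $weapons satisfies contains(lower-case($node), $w))
          then <weapon>{ string($node) }</weapon> (: naive: wraps the whole text node :)
          else $node
        default return $node
    };

    local:enrich(<line>Hamlet draws his rapier in Act V.</line>)

A production heuristic would wrap only the matching term rather than the whole text node, but the pattern is the same: find something you can recognize in XQuery, add just that tag, and move on to building the application.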