
NetBase Tragicomedy: The Perils of "Magic" and Language Processing

It’s no secret that I’m not a big fan of “magic” in software. You could argue I’m still bearing the scars from BusinessMiner, one of our few failed products at Business Objects. You could argue that for some tasks, magic is a necessary evil, and I wouldn’t argue back too hard. Many Mark Logic customers rely on “magic” to automatically enrich content, adding XML tags that identify entities (e.g., people, places, geopolitical organizations) or sentiment (e.g., positive, negative, or neutral), or even to geo-code content with latitude and longitude that we then index, thus enabling geo-queries against content.
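To make that concrete, here’s a rough sketch of what enriched content can look like. The tag names and attributes are illustrative inventions, not any particular vendor’s schema:

    <para>
      <sentence sentiment="positive">
        The <org type="geopolitical">United Nations</org> praised the
        relief effort in <place lat="29.95" lon="-90.07">New Orleans</place>
        led by <person>Jane Smith</person>.
      </sentence>
    </para>

Once tags like these are in place and indexed, entity, sentiment, and geo-queries become ordinary structured queries against the content.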

While I confess to some ignorance about how the magical tools work, it’s my perception that on a bad day they’re 50% accurate and on a good one they’re 80%. Now, one could argue that content enriched at 80% accuracy is far more valuable than unenriched content, and you’d be right. All I’m saying is that I’m glad I’m not in the business of making the software that does it, because, customers being customers, nobody wants to hear that 80% is great and 100% is unattainable. Perhaps that’s my lack of deep expertise in the field talking. Or perhaps it’s my belief that humans are uncomfortable around black boxes.

The other reason I don’t like magic is that it can fail in truly spectacular ways. What’s the expression? To err is human. To really foul things up requires natural language processing.

This happened today with NetBase, a company whose high-level messaging is fairly similar to Mark Logic’s, though happily with very different technology and business strategy.

NetBase recently launched healthBase, “a new health research showcase to find treatments, causes, and complications of any condition [and the] pros and cons of any drug, food, or treatment.”

Sounds nice. But, today they were slaughtered on TechCrunch with a story headlined: NetBase Thinks You Can Get Rid of Jews with Alcohol and Salt. Excerpt:

Several of our readers tested out the site and found that healthBase’s semantic search engine has some major glitches (see the comments). One of the most unfortunate examples is when you type in a search for “AIDS,” one of the listed causes of the disease is “Jew.” Really.

The ridiculousness continues. When you click on Jew, you can see proper “Treatments” for Jews, “Drugs And Medications” for Jews and “Complications” for Jews. Apparently, “alcohol” and “coarse salt” are treatments to get rid of Jews, as is Dr. Pepper! Who knew?

Here’s a great demo of why I don’t want to sell semantic processing technology. This is the reply NetBase gave TechCrunch:

This is an unfortunate example of homonymy, i.e., words that have different meanings.

The showcase was not configured to distinguish between the disease “AIDS” and the verb “aids” (as in aiding someone). If you click on the result “Jew” you see a sentence from a Wikipedia page about 7th Century history: “Hispano-Visigothic king Egica accuses the Jews of aiding the Muslims, and sentences all Jews to slavery.” Although Wikipedia contains a lot of great health information it also contains non-health related information (like this one) that is hard to filter out.

I hate to be pedestrian, but isn’t that just a fancy way of saying it doesn’t work? It reminds me of the quip about Autonomy: when the Bayesian and Shannon information-theory magic isn’t working, they simply tell the customer that they’re not smart enough to understand why. Nice.

Now, for the hapless NetBase, the AIDS query was just the beginning. They got destroyed in the blog comments, which quickly turned into a contest to find the silliest results.

  • The treatment for venture capital is funding. The cons is fool.
  • Masturbation causes insanity and is cured by cocaine.
  • The treatment for Twitter is Facebook. (This one might be right.)
  • The treatment for Microsoft is Viagra.
  • Babies are caused by smoking and brain damage.

It goes on and on. Now, yes, many of the silly queries are outside the health domain, but there has to be a better way to answer them.

One active commenter, Dave, who coined the “tragicomedy” description and who isn’t me, had this to offer:

The tragi-comic failure of Netbase can teach a lot to every company in the Semantic space.

Lesson 1

Don’t even try to boil the ocean of the WWW with these technologies. [The] Internet is full of valuable information, but crap (or opinions) is 90% [of it]; the cost of getting rid of this crap and sav[ing] only the good stuff is very high, [and] that’s [what] makes [it] so hard to succe[e]d, even for Google and Microsoft with billions [of dollars].

Lesson 2

Linguistic approaches are likely going to fail because search engines (and machines) can’t distinguish joke/seriousness, sarcasm/shame, and sentiments in general. The semantic meaning is [not] right there in the words of a text.

Lesson 3

If you choose to apply such approaches to one specific topic like Medicine (good choice), then stick to that topic; that means accept as INPUT only medical terms and provide as OUTPUT only medical terms.

This last point requires human intervention and predefined taxonomies/ontologies, but NetBase claims that they don’t need either, [i.e., that] their engine is fully automatic; hence the failure, too.

The Perils of Text-Only Search

You won’t be surprised to know that I use a series of Google Alerts to help me track events relevant to Mark Logic. I often have two problems with them:

  • Old content is mis-identified as new. You can ask any Mark Logician about the number of times I’ve forwarded a story that I thought was a hot news item only to discover it was four years old and that Google had nevertheless “alerted” me to its existence. I highlight this here because it bugs me, but I will not drill into it.
  • Content is mis-parsed, resulting in erroneous matches and alerts.

For example, today I received a Google Alert on “Mark Logic” for this Reading Eagle story, entitled Douglass Township Man Waives Hearing on Charges He Fired Gun in Neighborhood. Wondering if we had a wayward employee, I read the story, which is about a man named Jeffrey W. Logic, charged with firing several shots near a group of people assembled in a neighbor’s front yard.

Here’s the text that generated the hit:

Logic pulled out a gun and fired several shots into one of the car’s tires. He also fired a shot into the pavement, and a stone or bullet fragment ricocheted and struck the driver in the neck, causing a red mark.

Logic started to walk away when two men who had been at the party approached him. Logic pointed the gun at one of them but the man swatted it away. Logic then fired into the ground once more.

What happened? The words “mark” and “logic” appear consecutively in the text, but they’re not in the same paragraph, let alone the same sentence. Clearly, if you’ll pardon the pun, this result is a misfire, but it highlights an important problem with full-text search engines: they understand neither the structure nor the semantics of the content they index.

For example, in an XML representation, you might indicate structure by using <para> tags for paragraphs and <sentence> tags for sentences. When searching, you could then say “find all the <sentences> that contain the phrase ‘Mark Logic’” and you wouldn’t get the false match that Google returned.
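As a sketch in XQuery (the standard language for querying XML, and what you’d use against Mark Logic), assuming the hypothetical <sentence> markup above; the substring test here stands in for a proper word-indexed query:

    (: find all sentences containing the phrase "Mark Logic" :)
    for $s in //sentence
    where contains(lower-case(string($s)), "mark logic")
    return $s

Because no single <sentence> in the Reading Eagle story contains both words, this query correctly skips it.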

Awareness of structural markup is important not only because it eliminates false matches, but because it enables you to express more powerful queries (from both a search and retrieval perspective), such as:

  • Find all the <figures> that have <captions> that contain the phrase “survival rate”
  • Return the <authors> and <abstracts> of articles that contain the word “lymphoma” and have <captions> that contain the phrase “survival rate”
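In XQuery, those two might look roughly like this; every element name (<figure>, <caption>, <article>, and so on) is an assumption about the markup, not a real schema:

    (: figures whose captions contain "survival rate" :)
    for $f in //figure
    where $f/caption[contains(lower-case(.), "survival rate")]
    return $f

    (: authors and abstracts of articles that mention "lymphoma"
       and have a caption containing "survival rate" :)
    for $a in //article
    where contains(lower-case(string($a)), "lymphoma")
      and $a//caption[contains(lower-case(.), "survival rate")]
    return <result>{ $a/author, $a/abstract }</result>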

Or, more powerfully, perform a citation analysis:

  • Return the <authors> and <abstracts> of <articles> with <citations> to <articles> written by <author> “Sandra Horning”
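Sketched the same way, and assuming each hypothetical <article> carries an id attribute and its <citation> elements carry ref attributes pointing at cited articles:

    (: authors and abstracts of articles citing anything by Sandra Horning :)
    let $horning-ids := //article[author = "Sandra Horning"]/@id
    for $a in //article[citation/@ref = $horning-ids]
    return <result>{ $a/author, $a/abstract }</result>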

But even better is the ability of the system to understand semantic markup, coming, for example, from a taxonomy or an automatic entity-extraction tool:

  • And find them only in the <articles> that contain references to the <drug> “Rituxan,” which is a <monoclonal antibody> that the system knows is also called “Rituximab” and “MabThera.”
  • And which contain the <disease> diffuse large b-cell lymphoma, which the system knows is a <b-cell lymphoma>, which is a <lymphoma>, which is a <blood cancer>, which is a <cancer>.
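Extending the citation sketch, here’s roughly how taxonomy-driven expansion could look. The synonym list and disease hierarchy are hard-coded purely for illustration; in a real system they would come from the taxonomy or ontology rather than be typed into the query:

    (: synonym and hierarchy expansion, hard-coded for illustration only :)
    let $rituxan-names := ("Rituxan", "Rituximab", "MabThera")
    let $lymphoma-tree := ("diffuse large b-cell lymphoma",
                           "b-cell lymphoma", "lymphoma",
                           "blood cancer", "cancer")
    let $horning-ids   := //article[author = "Sandra Horning"]/@id
    for $a in //article[citation/@ref = $horning-ids]
    where $a/drug = $rituxan-names and $a/disease = $lymphoma-tree
    return <result>{ $a/author, $a/abstract }</result>

Swapping those two lists is all it takes to vary the query, which is exactly the point of the permutations below.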

Then think of the simple permutations of this query you can run:

  • Against all monoclonal antibodies, not just Rituxan
  • Against all lymphomas, not just diffuse large b-cell. Or against all blood cancers.
  • Not just against articles citing Horning, but against those citing any other author.

And then think of all the queries you can run against this same corpus when you apply any number of any combination of full-text, structural, and semantic constraints.

And pause.

If you’ve wondered why I say that Mark Logic lets you run database-style queries against content,* hopefully you now understand.

It’s not just about catching that Mark is the last word of sentence 20 and Logic is the first word of sentence 21. It’s about combining structural, semantic, and full-text constraints in virtually any combination. And that unleashes a mind-boggling amount of query power, a power that we’re accustomed to having against data but are only now beginning to understand against content.


* If you want to search only within items the author bolded for emphasis, you can do that, too!

** Or if you want to search only within footnotes, as they sometimes do in finance, you can do that, too!

Google Settlement: Implications for Publishers White Paper

I’m happy to announce the availability of a white paper, produced with information-industry veteran Bill Rosenblatt of Giant Steps Media, that analyzes the effects of the Google settlement with publishers and identifies new opportunities that result from it.

From the introduction:

The first part of this white paper describes the Settlement Agreement in the litigation, including the Book Rights Registry, the initial set of business models that Google and publishers will implement, and the set of business models that the Settlement Agreement contemplates in the future.

The second part discusses the future opportunities for publishers, particularly those that depend on publishers’ ability to build XML-based content architectures and make content available in structured formats with standardized metadata. It then discusses the capabilities that will be necessary for publishers to adopt in order to take advantage of these opportunities, including systems, tools, processes, and standards adoption where appropriate. Of course, a growing number of publishers are already starting to adopt these capabilities.

From the start of the second section:

The future business models contemplated in Section 4.7 of the Settlement Agreement differ qualitatively from the way that Google currently works with publishers – mainly in that they include several opportunities that require the availability of content in structural rather than page-oriented formats.

I believe the agreement enables Google to challenge Amazon in the sale of online books (and, importantly, derivatives thereof), and therefore that publishers need to think of Google not only as a discoverability channel but also as a distribution channel, and ergo be ready to distribute their content in the way(s) that Google asks.

To me, this unsurprisingly suggests the need to store content in a centralized XML repository where it can quickly be repurposed, reformatted, and otherwise sliced and diced, enabling experimentation with new and different ways to sell it.

John Kreisa of Mark Logic presented on the settlement with Bill Rosenblatt at last week’s O’Reilly Tools of Change for Publishing conference; there is an article in Publishers Weekly about the panel.

Bill Rosenblatt has blogged about the white paper and about the settlement itself on his Copyright and Technology blog.

You can download the white paper via the Mark Logic site (and be asked to provide some information) here. Or you can use the back door and download the paper directly via the Giant Steps site, here.


Top 5 Predictions for Publishers in 2009 Webinar

Come to a webinar next week, sponsored by Mark Logic and entitled Gilbane’s Top 5 Predictions for Publishers in 2009, featuring speaker Steve Paxhia, lead analyst with The Gilbane Group.

Steve will discuss trends from his upcoming report, entitled “Digital Platforms and Technologies for Book Publishers: Implementations Beyond eBook,” in which he identifies five important trends that are changing the landscape for information providers:

  • The Domain Strikes Back. Traditional publishers leverage their domain expertise to create premium, authoritative digital products that trump free and informed internet content.
  • Discoverability Overcomes Paranoia. Publishers realize the value in being discovered online, as research shows that readers do buy whole books and subscriptions based on excerpts and previews.
  • Custom, Custom, Custom. XML technology enables publishers to cost-effectively create custom products, a trend that has rapidly accelerated in the last six to nine months, especially in the educational textbook segment.
  • Communities Count. Communities will exert greater influence on digital publishing strategies, as providers engage readers to help build not only their brands but also their products.
  • Print on Demand. Print on demand increases in production quality and cost-effectiveness, leading to larger runs, more short-run custom products, and deeper backlists.

Learn more about these trends and find out if your company has the tools, processes, and attitudes required to exploit them in an uncertain market. All attendees will receive a copy of the completed research report from Gilbane.

For more information and/or to register, go here. Steve’s a great speaker. I’m sure you’ll find the webinar a great use of an hour.

Thanks and Happy New Year

As I sit in snowy Tahoe with a glass of pinot noir in hand (and some Veuve Clicquot chilling for midnight), I thought I’d take a moment to say “thank you” to all.

  • Thanks to Mark Logic customers for having faith in us and for trusting us with your business.
  • Thanks to Mark Logic employees for delivering a record year, driving strong growth in 2008, and, despite a chilling economic environment, record fourth-quarter sales.
  • Thanks to the Mark Logic board and investors for your continuing support.
  • Thanks to readers of the Mark Logic CEO Blog. Subscriptions and site visits are at record levels, and I’ve had some major recognition for the blog this year, including the recent kudos from ReadWriteWeb, which makes the time spent feel worthwhile and the atypical approach seem validated.

Happy New Year!

Best,
Dave