NetBase Tragicomedy: The Perils of "Magic" and Language Processing

It’s no secret that I’m not a big fan of “magic” in software. You could argue I’m still bearing the scars from BusinessMiner, one of our few failed products, at Business Objects. You could argue that for some tasks, magic is a necessary evil, and I wouldn’t argue back too hard. Many Mark Logic customers rely on “magic” to automatically enrich content, adding XML tags that identify entities (e.g., people, places, geopolitical organizations), sentiment (e.g., positive, negative or neutral), or even geo-code content with latitude and longitude that we then index, thus enabling geo-queries against content.

While I confess to some ignorance about how the magical tools work, it’s my perception that on a bad day they’re 50% accurate and on a good one they’re 80%. Now one could argue that content that’s enriched at 80% accuracy is way more valuable than unenriched content, and you’d be right. All I’m saying is I’m glad I’m not in the business of making the software that does that, because — customers being customers — nobody wants to hear that 80% is great and 100% is unattainable. Perhaps it’s my lack of deep expertise in the field. Or perhaps it’s my belief that humans are uncomfortable around black boxes.

The other reason I don’t like magic is that it can fail in truly spectacular ways. What’s the expression? To err is human. To really foul things up requires natural language processing.

This happened today with NetBase, a company whose high-level messaging is fairly similar to Mark Logic’s though happily with very different technology and business strategy.

NetBase recently launched healthBase, “a new health research showcase to find treatments, causes, and complications of any condition [and the] pros and cons of any drug, food, or treatment.”

Sounds nice. But, today they were slaughtered on TechCrunch with a story headlined: NetBase Thinks You Can Get Rid of Jews with Alcohol and Salt. Excerpt:

Several of our readers tested out the site and found that healthBase’s semantic search engine has some major glitches (see the comments). One of the most unfortunate examples is when you type in a search for “AIDS,” one of the listed causes of the disease is “Jew.” Really.

The ridiculousness continues. When you click on Jew, you can see proper “Treatments” for Jews, “Drugs And Medications” for Jews and “Complications” for Jews. Apparently, “alcohol” and “coarse salt” are treatments to get rid of Jews, as is Dr. Pepper! Who knew?

Here’s a great demo of why I don’t want to sell semantic processing technology. Here’s the reply NetBase gave TechCrunch:

This is an unfortunate example of homonymy, i.e., words that have different meanings.

The showcase was not configured to distinguish between the disease “AIDS” and the verb “aids” (as in aiding someone). If you click on the result “Jew” you see a sentence from a Wikipedia page about 7th Century history: “Hispano-Visigothic king Egica accuses the Jews of aiding the Muslims, and sentences all Jews to slavery.” Although Wikipedia contains a lot of great health information it also contains non-health related information (like this one) that is hard to filter out.
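To make that failure mode concrete, here is a minimal Python sketch. It is not NetBase's pipeline, which I haven't seen; `crude_stem`, `naive_match`, and `acronym_aware_match` are hypothetical names invented for illustration. It assumes an engine that lowercases and stems every token, which is one plausible way the disease "AIDS" and the verb form "aiding" could collapse into the same term.

```python
import re

# The Wikipedia sentence quoted in NetBase's reply.
SENTENCE = (
    "Hispano-Visigothic king Egica accuses the Jews of aiding the Muslims, "
    "and sentences all Jews to slavery."
)


def crude_stem(token: str) -> str:
    """Lowercase a token and strip one common suffix (deliberately naive)."""
    word = token.lower()
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def naive_match(query: str, sentence: str) -> bool:
    """Match on stems only, ignoring case: 'AIDS' and 'aiding' both become 'aid'."""
    target = crude_stem(query)
    tokens = re.findall(r"[A-Za-z]+", sentence)
    return any(crude_stem(tok) == target for tok in tokens)


def acronym_aware_match(query: str, sentence: str) -> bool:
    """Treat an all-caps query as an acronym that must appear verbatim."""
    tokens = re.findall(r"[A-Za-z]+", sentence)
    if query.isupper():
        return query in tokens
    return naive_match(query, sentence)


print(naive_match("AIDS", SENTENCE))          # the stem-only matcher is fooled
print(acronym_aware_match("AIDS", SENTENCE))  # the case-aware matcher is not
```

The case check is only a heuristic (it would miss text written in all lowercase or all capitals), which is rather the point: every patch like this fixes one homonym and leaves the next one waiting.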

I hate to be pedestrian, but isn’t that just a fancy way of saying it doesn’t work? It reminds me of the quip about Autonomy, where, when the Bayesian and Shannon’s Information Theory magic isn’t working, they simply tell the customer that they’re not smart enough to understand why. Nice.

Now, for the hapless NetBase, the AIDS query was just the beginning. They got destroyed in the blog comments, which quickly turned into a contest to find the silliest results.

  • The treatment for venture capital is funding. The cons is fool.
  • Masturbation causes insanity and is cured by cocaine.
  • The treatment for Twitter is Facebook. (This one might be right.)
  • The treatment for Microsoft is Viagra.
  • Babies are caused by smoking and brain damage.

It goes on and on. Now yes, many of the silly queries are outside the health domain, but there has to be a better way to answer them.

One active commenter, Dave, who coined the “tragicomedy” description and who isn’t me, had this to offer:

The tragi-comic failure of Netbase can teach a lot to every company in the Semantic space.

Lesson 1

Don’t even try to boil the ocean of the WWW with these technologies. [The] Internet is full of valuable information but crap (or opinions) is 90% [of it], the cost of getting rid of this crap and save only the good stuff is very high, [and] that’s [what] makes [it] so hard to succeed even for Google and Microsoft with billions [of dollars].

Lesson 2

Linguistic approaches are likely going to fail because search engines (and machines) can’t distinguish joke/seriousness, sarcasm/shame and sentiments in general. The semantic meaning is right there not in the words of a text.

Lesson 3

If you choose to apply such approaches to one specific topic like Medicine (good choice) then stick to that topic, that means accept as INPUT only medical terms and provide as OUTPUTS only medical terms.

This last point requires human intervention and predefined taxonomies/ontologies but Netbase claims that they don’t need them both, [i.e., that] their engine is fully automatic -> the failure too.
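Dave's third lesson is easy to sketch in Python. What follows is a minimal illustration, not anything NetBase built: `MEDICAL_TERMS` is a tiny made-up stand-in for the human-curated taxonomy he says is required, and `is_medical` and `filtered_search` are hypothetical names. The idea is simply to refuse queries that fall outside the vocabulary and to drop results that do.

```python
from typing import List, Optional

# Hypothetical, hand-curated vocabulary. A real system would load a
# maintained medical taxonomy or ontology instead of a hard-coded set.
MEDICAL_TERMS = {"diabetes", "insulin", "aspirin", "migraine"}


def is_medical(term: str) -> bool:
    """Return True if the term appears in the controlled vocabulary."""
    return term.strip().lower() in MEDICAL_TERMS


def filtered_search(query: str, raw_results: List[str]) -> Optional[List[str]]:
    """Gate both the input and the output on the controlled vocabulary.

    Returns None for an out-of-domain query, so the caller can say
    "not a medical term" instead of guessing at an answer.
    """
    if not is_medical(query):
        return None
    return [result for result in raw_results if is_medical(result)]


print(filtered_search("Twitter", ["Facebook"]))                # None
print(filtered_search("diabetes", ["insulin", "Dr. Pepper"]))  # ['insulin']
```

The trade-off is the one Dave names: somebody has to build and maintain that vocabulary, which is exactly the human effort a "fully automatic" engine claims to avoid.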

3 responses to “NetBase Tragicomedy: The Perils of "Magic" and Language Processing”

  1. The poison comes from the semantic that has flooded the Web with the version 3.0 just as the hack tips have flooded in the Web with the number 2.0. Everyone makes semantic as Monsieur Jourdain in Le Bourgeois Gentilhomme makes prose. Some even claim to define the syntax from semantic. It is well known: semantic is the mother of the syntax, itself is the lexical parent node. All modern philosophers know that and they are becoming more numerous. Magic invaded the planet. The alchemists are back, new religions are multiplying, bankers are increasingly conjuring, elections produce results more surprising. However the weather is still unclear, vaccines are still too difficult to develop, programs are squatted by more and more cockroaches, web designers are recruited from colleges. I just visited a Web service that I did not know before. I subscribe to in specifying data transmitted in http (without S, because the site does not support S – S for syntax Security). However semantic is safe because they speak learnedly about security. A few minutes later I try to open my personal space and my browser gets stuck, my keyboard was unresponsive. I ask my inspector to investigate, he does not know any of the semantic point, but he found that the code inside was very badly written. I am often in trouble because my browser was set for myself to identify spelling errors. Just for fun I wrote the character "below" < and behind some other characters, nothing too serious because I am not a pirate. This site is very serious, however, many French players in the web are present, they speak very carefully the incoming web. ( The species Homo deserves perhaps not really attribute SAPIENS, probably ranking ontological abusive or premature. It is not enough to store everything we have at hand on a magnetic disk in either 2.0 or 3.0 format to organize things for themselves. The magic is necessary because the human brain is lazy. 
The human brain is brimming with ideas but hate the logic and syntax. I think I understand why Alan Turing committed suicide. But some people continue to say that holy Logic was below the cradle of the Big Bang, later a witch gave an apple to a brave man who thought becoming a scientist. The magic goose must be fought every day. Semantic is not OWL or RDF. The OWL on its tree is laughing at the clever fox.

  2. Unfortunately, this case stands as an indictment of NLP and "magic" in large part because of fixable problems that were neglected long after they were identified. I hope that the company puts some needed fixes into place soon and represents the state of Natural Language technology better.

  3. Pingback: NetBase Thinks You Can Get Rid Of Jews With Alcohol And Salt | JK Technologies |
