It’s no secret that I’m not a big fan of “magic” in software. You could argue I’m still bearing the scars from BusinessMiner, one of our few failed products, at Business Objects. You could argue that for some tasks, magic is a necessary evil, and I wouldn’t argue back too hard. Many Mark Logic customers rely on “magic” to automatically enrich content, adding XML tags that identify entities (e.g., people, places, geopolitical organizations), sentiment (e.g., positive, negative or neutral), or even geo-code content with latitude and longitude that we then index, thus enabling geo-queries against content.
While I confess to some ignorance about how the magical tools work, it’s my perception that on a bad day they’re 50% accurate and on a good one they’re 80%. Now you could argue that content that’s enriched at 80% accuracy is way more valuable than unenriched content, and you’d be right. All I’m saying is I’m glad I’m not in the business of making the software that does that because, customers being customers, nobody wants to hear that 80% is great and 100% is unattainable. Perhaps it’s my lack of deep expertise in the field. Or perhaps it’s my belief that humans are uncomfortable around black boxes.
The other reason I don’t like magic is that it can fail in truly spectacular ways. What’s the expression? To err is human. To really foul things up requires natural language processing.
NetBase recently launched healthBase, “a new health research showcase to find treatments, causes, and complications of any condition [and the] pros and cons of any drug, food, or treatment.”
Sounds nice. But today they were slaughtered on TechCrunch in a story headlined: NetBase Thinks You Can Get Rid of Jews with Alcohol and Salt. Excerpt:
Several of our readers tested out the site and found that healthBase’s semantic search engine has some major glitches (see the comments). One of the most unfortunate examples is when you type in a search for “AIDS,” one of the listed causes of the disease is “Jew.” Really.
The ridiculousness continues. When you click on Jew, you can see proper “Treatments” for Jews, “Drugs And Medications” for Jews and “Complications” for Jews. Apparently, “alcohol” and “coarse salt” are treatments to get rid of Jews, as is Dr. Pepper! Who knew?
Here’s a great demo of why I don’t want to sell semantic processing technology. Here’s the reply NetBase gave TechCrunch:
This is an unfortunate example of homonymy, i.e., words that have different meanings.
The showcase was not configured to distinguish between the disease “AIDS” and the verb “aids” (as in aiding someone). If you click on the result “Jew” you see a sentence from a Wikipedia page about 7th Century history: “Hispano-Visigothic king Egica accuses the Jews of aiding the Muslims, and sentences all Jews to slavery.” Although Wikipedia contains a lot of great health information it also contains non-health related information (like this one) that is hard to filter out.
I hate to be pedestrian, but isn’t that just a fancy way of saying it doesn’t work? It reminds me of the quip about Autonomy, where, when the Bayesian and Shannon’s Information Theory magic isn’t working, they simply tell the customer that they’re not smart enough to understand why. Nice.
Now, for the hapless NetBase, the AIDS query was just the beginning. They got destroyed in the blog comments, which quickly turned into a contest to find the silliest results.
- The treatment for venture capital is funding. The cons is fool.
- Masturbation causes insanity and is cured by cocaine.
- The treatment for Twitter is Facebook. (This one might be right.)
- The treatment for Microsoft is Viagra.
- Babies are caused by smoking and brain damage.
It goes on and on. Now yes, many of the silly queries are out of the health domain, but there has to be a better way to answer them.
One active commenter, Dave, who coined the “tragicomedy” description and who isn’t me, had this to offer:
The tragi-comic failure of Netbase can teach a lot to every company in the Semantic space.
Don’t even try to boil the ocean of the WWW with these technologies. [The] Internet is full of valuable information but crap (or opinions) is 90% [of it], the cost of getting rid of this crap and save only the good stuff is very high, [and] that’s [what] makes [it] so hard to succeed even for Google and Microsoft with billions [of dollars].
Linguistic approaches are likely going to fail because search engines (and machines) can’t distinguish joke/seriousness, sarcasm/shame and sentiments in general. The semantic meaning is right there not in the words of a text.
If you choose to apply such approaches to one specific topic like Medicine (good choice) then stick to that topic, that means accept as INPUT only medical terms and provide as OUTPUTS only medical terms.
This last point requires human intervention and predefined taxonomies/ontologies but Netbase claims that they don’t need them both, [i.e., that] their engine is fully automatic -> the failure too.
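For the curious, here is roughly what Dave’s “medical terms in, medical terms out” rule might look like in practice. This is only a sketch under my own assumptions: the tiny `MEDICAL_TERMS` set and the `filter_results` function are hypothetical stand-ins for a curated taxonomy, and none of it reflects how NetBase’s engine actually works.

```python
# Hypothetical controlled vocabulary. A real system would load a curated
# medical taxonomy or ontology rather than this hand-written set.
MEDICAL_TERMS = frozenset({"aids", "hiv", "diabetes", "insulin", "asthma"})


def normalize(term: str) -> str:
    """Lowercase and trim a term so lookups ignore case and stray spaces."""
    return term.strip().lower()


def in_vocabulary(term: str) -> bool:
    """Return True only if the term appears in the controlled vocabulary."""
    return normalize(term) in MEDICAL_TERMS


def filter_results(query: str, candidates: list[str]) -> list[str]:
    """Apply the gate on both sides of the search.

    INPUT: refuse queries that are not medical terms.
    OUTPUT: drop candidate answers that are not medical terms.
    """
    if not in_vocabulary(query):
        return []
    return [c for c in candidates if in_vocabulary(c)]


if __name__ == "__main__":
    # A medical query keeps only the in-vocabulary candidates.
    print(filter_results("AIDS", ["Jew", "HIV", "coarse salt"]))
    # An out-of-domain query is refused outright.
    print(filter_results("Twitter", ["Facebook"]))
```

Note that a gate like this does nothing about the underlying homonymy: the engine would still misread the Wikipedia sentence. It only keeps the misreading from reaching the results page, and it costs exactly the human-curated vocabulary that the magic was supposed to make unnecessary.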