The Perils of Text-Only Search

You won't be surprised to know that I use a series of Google Alerts to help me track events relevant to Mark Logic. I often have two problems with them:

Old content is mis-identified as new. You can ask any Mark Logician about the number of times I've forwarded a story that I thought was a hot news item only to discover it was four years old and that Google had nevertheless "alerted" me to its existence. I highlight this here because it bugs me, but I will not drill into it.

Content is mis-parsed, resulting in erroneous matches and alerts.

For example, today I received a Google Alert on "Mark Logic" for this Reading Eagle story, entitled "Douglass Township Man Waives Hearing on Charges He Fired Gun in Neighborhood." Wondering if we had a wayward employee, I read the story, which is about a man named Jeffrey W. Logic, who is charged with firing several shots near a group of people assembled in a neighbor's front yard.

Here's the text that generated the hit. (Bolding mine.)

Logic pulled out a gun and fired several shots into one of the car's tires. He also fired a shot into the pavement, and a stone or bullet fragment ricocheted and struck the driver in the neck, causing a red mark.

Logic started to walk away when two men who had been at the party approached him. Logic pointed the gun at one of them but the man swatted it away. Logic then fired into the ground once more.

What happened? The words mark and logic are sequentially related in the text. But they're not in the same paragraph, let alone the same sentence. Clearly, if you'll pardon the pun, this result is a misfire, but it highlights an important problem with full-text search engines: they understand neither the structure nor the semantics of the content they are indexing.

For example, in an XML representation, you might indicate structure by using <para> tags to indicate paragraphs and <sentence> tags to indicate sentences. When searching, you could then say "find all the <sentences> that contain the phrase 'Mark Logic'" and you wouldn't get the false match that Google returned.
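To make the difference concrete, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree. The <story>, <para>, and <sentence> markup is an assumed, shortened rendering of the excerpt above, and the helper names (tokens, flat_match, sentence_matches) are hypothetical. It is an illustration of the idea, not how Google or Mark Logic actually index content.

```python
import re
import xml.etree.ElementTree as ET

# Assumed markup of a shortened version of the excerpt; the tags are illustrative.
DOC = """
<story>
  <para>
    <sentence>He also fired a shot into the pavement, causing a red mark.</sentence>
  </para>
  <para>
    <sentence>Logic started to walk away when two men approached him.</sentence>
  </para>
</story>
"""

def tokens(text):
    """Lowercase word tokens, ignoring punctuation and whitespace."""
    return re.findall(r"[a-z0-9]+", text.lower())

def contains_phrase(words, phrase_words):
    """True if phrase_words appears as a consecutive run inside words."""
    n = len(phrase_words)
    return any(words[i:i + n] == phrase_words
               for i in range(len(words) - n + 1))

def flat_match(root, phrase):
    """Text-only search: throw away the markup and match across everything."""
    return contains_phrase(tokens("".join(root.itertext())), tokens(phrase))

def sentence_matches(root, phrase):
    """Structure-aware search: the phrase must fall inside one <sentence>."""
    target = tokens(phrase)
    return [s for s in root.iter("sentence")
            if contains_phrase(tokens("".join(s.itertext())), target)]

root = ET.fromstring(DOC)
flat_hit = flat_match(root, "Mark Logic")
scoped_hits = sentence_matches(root, "Mark Logic")

print(flat_hit)          # True: the same false match the alert produced
print(len(scoped_hits))  # 0: no single sentence contains the phrase
```

The only difference between the two searches is the scope in which the phrase is evaluated, which is exactly the information a text-only index has discarded.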

Awareness of structural markup is important not only because it eliminates false matches, but because it enables you to express more powerful queries (from both a search and retrieval perspective), such as:

Find all the <figures> that have <captions> that contain the phrase "survival rate"

Return the <authors> and <abstracts> of articles that contain the word "lymphoma" and have <captions> that contain the phrase "survival rate"

Or, more powerfully, perform a citation analysis:

Return the <authors> and <abstracts> of <articles> with <citations> to <articles> written by <author> "Sandra Horning"
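Queries like these can be sketched against a toy corpus. The example below is a rough illustration in Python, again with xml.etree.ElementTree. The corpus, the element names, the id and ref attributes, and the authors other than the one named above are all made up for the example, and the function names are hypothetical. A real system would answer these from an index rather than by scanning.

```python
import xml.etree.ElementTree as ET

# Hypothetical corpus: tag names, attributes, and most values are invented.
CORPUS = """
<corpus>
  <article id="a1">
    <author>Sandra Horning</author>
    <abstract>Outcomes in lymphoma.</abstract>
  </article>
  <article id="a2">
    <author>A. Example</author>
    <abstract>A follow-up study of lymphoma treatment.</abstract>
    <figure><caption>Five-year survival rate by treatment arm</caption></figure>
    <citation ref="a1"/>
  </article>
  <article id="a3">
    <author>B. Sample</author>
    <abstract>An unrelated study.</abstract>
    <figure><caption>Study design</caption></figure>
  </article>
</corpus>
"""

root = ET.fromstring(CORPUS)

def figures_with_caption(root, phrase):
    """Find all <figure>s whose <caption> contains the phrase."""
    return [f for f in root.iter("figure")
            if phrase.lower() in (f.findtext("caption") or "").lower()]

def articles_citing(root, author_name):
    """Return (author, abstract) for articles that cite work by author_name."""
    # First pass: ids of the articles written by the named author.
    written = {a.get("id") for a in root.iter("article")
               if any(au.text == author_name for au in a.findall("author"))}
    # Second pass: articles with a <citation> pointing at one of those ids.
    return [(a.findtext("author"), a.findtext("abstract"))
            for a in root.iter("article")
            if any(c.get("ref") in written for c in a.findall("citation"))]

figures = figures_with_caption(root, "survival rate")
citing = articles_citing(root, "Sandra Horning")

print(len(figures))  # 1
print(citing)        # [('A. Example', 'A follow-up study of lymphoma treatment.')]
```

Note that the citation query is effectively a join between articles, which is the kind of operation we normally associate with databases rather than search engines.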

But even better is the ability for the system to understand semantic markup, for example, markup coming from a taxonomy or an automatic entity extraction tool.

And find them only in the <articles> that contain references to the <drug> "Rituxan" which is a <monoclonal antibody> which the system knows is also called "Rituximab" and "MabThera."

And which contain the <disease> diffuse large b-cell lymphoma, which the system knows is a <b-cell lymphoma> which is a <lymphoma> which is a <blood cancer> which is a <cancer>

Then think of the simple permutations of this query you can run:

Against all monoclonal antibodies, not just Rituxan
Against all lymphomas, not just diffuse large b-cell. Or against all blood cancers.
Against not just those citing Horning, but against those citing any other author.
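Here is a small sketch of how those permutations fall out once the system knows the taxonomy. The is-a chain and the Rituxan synonyms come from the example above. Everything else (the "leukemia" entry, the "placebo-x" drug, the article tags, and the function names) is a hypothetical stand-in for what a taxonomy or entity extraction tool might supply.

```python
# Broader-term links: child -> parent. "leukemia" is added for illustration.
IS_A = {
    "diffuse large b-cell lymphoma": "b-cell lymphoma",
    "b-cell lymphoma": "lymphoma",
    "lymphoma": "blood cancer",
    "leukemia": "blood cancer",
    "blood cancer": "cancer",
    "rituxan": "monoclonal antibody",
}

# Alternate names mapped to one canonical name.
SYNONYMS = {"rituximab": "rituxan", "mabthera": "rituxan"}

def canonical(term):
    """Normalize case and resolve synonyms to the canonical term."""
    t = term.lower()
    return SYNONYMS.get(t, t)

def ancestors(term):
    """Yield the term itself and every broader category above it."""
    t = canonical(term)
    while t is not None:
        yield t
        t = IS_A.get(t)

def is_a(term, category):
    """True if term equals category or sits anywhere beneath it."""
    return canonical(category) in ancestors(term)

# Hypothetical entity tags, as an extraction tool might have produced them.
ARTICLES = {
    "a1": {"drug": "MabThera", "disease": "diffuse large b-cell lymphoma"},
    "a2": {"drug": "Rituximab", "disease": "leukemia"},
    "a3": {"drug": "placebo-x", "disease": "diffuse large b-cell lymphoma"},
}

def find(articles, drug, disease):
    """Ids of articles whose tagged drug and disease fall under the given terms."""
    return sorted(aid for aid, tags in articles.items()
                  if is_a(tags["drug"], drug) and is_a(tags["disease"], disease))

narrow = find(ARTICLES, "Rituxan", "diffuse large b-cell lymphoma")
any_antibody = find(ARTICLES, "monoclonal antibody", "lymphoma")
any_blood_cancer = find(ARTICLES, "monoclonal antibody", "blood cancer")

print(narrow)            # ['a1']
print(any_antibody)      # ['a1']
print(any_blood_cancer)  # ['a1', 'a2']
```

Each permutation is the same query with one term swapped for a broader one, so widening the search costs nothing more than changing an argument.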

And then think of all the queries you can run against this same corpus when you apply any number, in any combination, of full-text, structural, and semantic constraints.

And pause.

If you wonder why I say that

Mark Logic lets you run database-style queries against content*

you hopefully now understand why.

It's not just about catching that Mark is the last word of sentence 20 and Logic is the first word in sentence 21. It's about combining structural, semantic, and full-text constraints in virtually any combination. And that unleashes a mind-boggling amount of query power. A power, by the way, that we're accustomed to against data, but are only now beginning to understand against content.

---
* If you want to search only within items the author bolded for emphasis, you can do that, too!

** Or if you want to search only within footnotes, as they sometimes do in finance, you can do that, too!
