The Perils of Text-Only Search

You won’t be surprised to know that I use a series of Google Alerts to help me track events relevant to Mark Logic. I often have two problems with them:

  • Old content is mis-identified as new. You can ask any Mark Logician about the number of times I’ve forwarded a story that I thought was a hot news item only to discover it was four years old and that Google had nevertheless “alerted” me to its existence. I highlight this here because it bugs me, but I will not drill into it.
  • Content is mis-parsed, resulting in erroneous matches and alerts.

For example, today I received a Google Alert on “Mark Logic” for this Reading Eagle story, entitled “Douglass Township Man Waives Hearing on Charges He Fired Gun in Neighborhood.” Wondering if we had a wayward employee, I read the story, which is about a man named Jeffrey W. Logic, who is charged with firing several shots near a group of people assembled in a neighbor’s front yard.

Here’s the text that generated the hit. (Bolding mine.)

Logic pulled out a gun and fired several shots into one of the car’s tires. He also fired a shot into the pavement, and a stone or bullet fragment ricocheted and struck the driver in the neck, causing a red mark.

Logic started to walk away when two men who had been at the party approached him. Logic pointed the gun at one of them but the man swatted it away. Logic then fired into the ground once more.

What happened? The words mark and logic are sequentially related in the text. But they’re not in the same paragraph, let alone the same sentence. Clearly, if you’ll pardon the pun, this result is a misfire, but it highlights an important problem with full-text search engines: they understand neither the structure nor the semantics of the content they are indexing.
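To make the misfire concrete, here is a minimal Python sketch of a structure-blind phrase match next to one that respects paragraph boundaries. It is only an illustration: I am assuming a simple tokenizer that throws away punctuation and paragraph breaks, which is not a claim about how Google actually indexes pages. The names `tokens` and `contains_phrase` are made up for this example.

```python
import re

# Two sentences quoted from the story. The first closes one paragraph with
# "mark." and the second opens the next paragraph with "Logic".
paragraphs = [
    "He also fired a shot into the pavement, and a stone or bullet fragment "
    "ricocheted and struck the driver in the neck, causing a red mark.",
    "Logic started to walk away when two men who had been at the party "
    "approached him.",
]

def tokens(text):
    """Lowercase word tokens; punctuation and whitespace are discarded."""
    return re.findall(r"[a-z0-9]+", text.lower())

def contains_phrase(words, phrase):
    """True if the words of the phrase appear consecutively in the token list."""
    n = len(phrase)
    return any(words[i:i + n] == phrase for i in range(len(words) - n + 1))

phrase = ["mark", "logic"]

# Structure-blind: flatten the whole document, then look for adjacent tokens.
flat_hit = contains_phrase(tokens(" ".join(paragraphs)), phrase)

# Structure-aware: the phrase may only match inside a single paragraph.
scoped_hit = any(contains_phrase(tokens(p), phrase) for p in paragraphs)

print(flat_hit, scoped_hit)
```

The flattened search reports a hit because "mark" and "logic" end up as neighboring tokens once the paragraph break is gone; the scoped search does not.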

For example, in an XML representation, you might indicate structure by using <para> tags to indicate paragraphs and <sentence> tags to indicate sentences. When searching, you could then say “find all the <sentences> that contain the phrase ‘Mark Logic’” and you wouldn’t get the false match that Google returned.
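A rough sketch of that sentence-scoped search, using Python's standard `xml.etree.ElementTree` rather than MarkLogic itself. The <article> wrapper and the helper name `sentences_with_phrase` are assumptions made for illustration; the sentence text comes from the story quoted above.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical markup: sentences from the story wrapped in <para>/<sentence>.
doc = ET.fromstring("""<article>
  <para>
    <sentence>He also fired a shot into the pavement, and a stone or bullet fragment ricocheted and struck the driver in the neck, causing a red mark.</sentence>
  </para>
  <para>
    <sentence>Logic started to walk away when two men who had been at the party approached him.</sentence>
    <sentence>Logic then fired into the ground once more.</sentence>
  </para>
</article>""")

def sentences_with_phrase(root, phrase):
    """Return the text of every <sentence> whose own words contain the phrase."""
    words = [re.escape(w) for w in phrase.split()]
    pattern = re.compile(r"\b" + r"\s+".join(words) + r"\b", re.IGNORECASE)
    return [s.text for s in root.iter("sentence")
            if s.text and pattern.search(s.text)]

# No single sentence contains both words, so the false match disappears.
false_match = sentences_with_phrase(doc, "Mark Logic")

# A phrase that really does sit inside one sentence is still found.
real_match = sentences_with_phrase(doc, "red mark")

print(len(false_match), len(real_match))
```

Because the match is evaluated one <sentence> element at a time, words that are merely adjacent across a boundary never get the chance to form a phrase.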

Awareness of structural markup is important not only because it eliminates false matches, but because it enables you to express more powerful queries (from both a search and retrieval perspective), such as:

  • Find all the <figures> that have <captions> that contain the phrase “survival rate”
  • Return the <authors> and <abstracts> of articles that contain the word “lymphoma” and have <captions> that contain the phrase “survival rate”
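The two queries above can be sketched in a few lines of Python over a toy corpus. This is an assumption-laden illustration, not MarkLogic syntax: the <corpus> wrapper, the sample articles, the author names, and both function names are invented here.

```python
import re
import xml.etree.ElementTree as ET

# Toy corpus with placeholder authors and text, invented for this example.
corpus = ET.fromstring("""<corpus>
  <article>
    <author>A. Placeholder</author>
    <abstract>Outcomes in lymphoma patients.</abstract>
    <figure><caption>Five-year survival rate by treatment arm</caption></figure>
  </article>
  <article>
    <author>B. Placeholder</author>
    <abstract>A survey of imaging methods.</abstract>
    <figure><caption>Scanner throughput</caption></figure>
  </article>
</corpus>""")

def text_of(elem):
    """All text inside an element, lowercased."""
    return "".join(elem.itertext()).lower()

def figures_with_caption_phrase(root, phrase):
    """Search: every <figure> that has a <caption> containing the phrase."""
    return [f for f in root.iter("figure")
            if any(phrase in text_of(c) for c in f.findall("caption"))]

def authors_and_abstracts(root, word, caption_phrase):
    """Search on two constraints, then retrieve only two kinds of element."""
    results = []
    for art in root.iter("article"):
        has_word = word in re.findall(r"[a-z0-9-]+", text_of(art))
        has_caption = any(caption_phrase in text_of(c)
                          for c in art.iter("caption"))
        if has_word and has_caption:
            results.append(([a.text for a in art.findall("author")],
                            [a.text for a in art.findall("abstract")]))
    return results

figures = figures_with_caption_phrase(corpus, "survival rate")
hits = authors_and_abstracts(corpus, "lymphoma", "survival rate")
print(len(figures), hits)
```

Note how the second function separates what is searched (the article text and its captions) from what is returned (authors and abstracts), which is the search-versus-retrieval distinction mentioned above.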

Or, more powerfully, perform a citation analysis:

  • Return the <authors> and <abstracts> of <articles> with <citations> to <articles> written by <author> “Sandra Horning”
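A citation analysis is essentially a join, which a small Python sketch can show. Everything about the markup is assumed for the example: the `id` and `ref` attributes, the placeholder authors and abstracts, and the function name `citing_articles`. Only the cited author's name comes from the query above.

```python
import xml.etree.ElementTree as ET

# Toy corpus: article a2 cites article a1 through a hypothetical ref attribute.
corpus = ET.fromstring("""<corpus>
  <article id="a1">
    <author>Sandra Horning</author>
    <abstract>Abstract of the cited article.</abstract>
  </article>
  <article id="a2">
    <author>Placeholder Author</author>
    <abstract>Abstract of the citing article.</abstract>
    <citation ref="a1"/>
  </article>
  <article id="a3">
    <author>Another Placeholder</author>
    <abstract>Abstract of an unrelated article.</abstract>
  </article>
</corpus>""")

def citing_articles(root, cited_author):
    """Authors and abstracts of articles that cite work by cited_author."""
    # Step 1: collect the ids of articles written by the cited author.
    by_author = {a.get("id") for a in root.iter("article")
                 if any(au.text == cited_author for au in a.findall("author"))}
    # Step 2: keep articles holding a <citation> that points at one of those ids.
    return [([au.text for au in a.findall("author")],
             [ab.text for ab in a.findall("abstract")])
            for a in root.iter("article")
            if any(c.get("ref") in by_author for c in a.findall("citation"))]

result = citing_articles(corpus, "Sandra Horning")
print(result)
```

The two steps mirror what a relational database would do with a join between an articles table and a citations table, except that here both sides live inside the content.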

But, even better is the ability for the system to understand semantic markup, for example, coming from a taxonomy or an automatic entity extraction tool.

  • And find them only in the <articles> that contain references to the <drug> “Rituxan” which is a <monoclonal antibody> which the system knows is also called “Rituximab” and “MabThera.”
  • And which contain the <disease> diffuse large b-cell lymphoma, which the system knows is a <b-cell lymphoma> which is a <lymphoma> which is a <blood cancer> which is a <cancer>
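Here is a minimal sketch of what "the system knows" could mean in practice: a synonym table plus is-a links that let a query on a broad term match a narrowly tagged entity. The hierarchy and synonyms are the ones listed above; the dictionary layout and the function names (`canonical`, `ancestors`, `matches`) are assumptions for illustration, not how any particular taxonomy tool stores them.

```python
# Is-a links, child to parent, taken from the hierarchy described above.
IS_A = {
    "diffuse large b-cell lymphoma": "b-cell lymphoma",
    "b-cell lymphoma": "lymphoma",
    "lymphoma": "blood cancer",
    "blood cancer": "cancer",
    "rituxan": "monoclonal antibody",
}

# Alternate names mapped to one canonical name.
SYNONYMS = {"rituximab": "rituxan", "mabthera": "rituxan"}

def canonical(term):
    """Lowercase a term and replace a known synonym with its canonical name."""
    t = term.lower()
    return SYNONYMS.get(t, t)

def ancestors(term):
    """The canonical term followed by every broader category above it."""
    chain = [canonical(term)]
    while chain[-1] in IS_A:
        chain.append(IS_A[chain[-1]])
    return chain

def matches(tagged_entity, query_term):
    """True if an entity tagged in a document falls under the query term."""
    return canonical(query_term) in ancestors(tagged_entity)

print(ancestors("MabThera"))
print(matches("diffuse large b-cell lymphoma", "blood cancer"))
```

With this in place, widening a query is just a matter of swapping the query term for one higher up the chain, while a narrower term does not match a broader tag.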

Then think of the simple permutations of this query you can run:

  • Against all monoclonal antibodies, not just Rituxan
  • Against all lymphomas, not just diffuse large b-cell. Or against all blood cancers.
  • Against not just those citing Horning, but against those citing any other author.

And then think of all the queries you can run against this same corpus when you apply any number of any combination of full-text, structural, and semantic constraints.

And pause.

If you wonder why I say that Mark Logic lets you run database-style queries against content*, you hopefully now understand why.

It’s not just about catching that Mark is the last word of sentence 20 and Logic is the first word in sentence 21. It’s about combining structural, semantic, and full-text constraints in virtually any combination. And that unleashes a mind-boggling amount of query power. A power, by the way, that we’re accustomed to against data, but are only now beginning to understand against content.

* If you want to search only within items the author bolded for emphasis, you can do that, too!

** Or if you want to search only within footnotes, as they sometimes do in finance, you can do that, too!

4 responses to “The Perils of Text-Only Search”

  1. Interesting! I think you covered most cases, except one, where you want the system to ignore the structural tag. Can you set Mark Logic to ignore presentation-level tags inside of words like em? Will Mark Logic find a hit for a search on "superbad" if the source text breaks up the word as follows: [b]super[/b]bad (superbad)? (I'm using square instead of angle brackets.) Can you set Mark Logic to ignore such "superfluous" tags in full-text search?

  2. Yes and no. The server can be configured to ignore “superfluous tags” for phrases. In fact, by default we ignore many common HTML ones like the bold tag. So for a document with: <b>hello</b>world you can search for “hello world” and MarkLogic would find the document. We call this feature phrase-through. However, the behavior of the server is such that these tags act as word boundaries. Hence, tags through a word actually break the word into two. Given: <b>super</b>bad a search for “superbad” would miss, and a search for “super bad” would hit. I’d call this capability word-through, though we don't currently support it.

  3. Dave, of course – 1) I'm guessing that the old pages for which you receive alerts *are* new: they're new to Google's index. I'd say that returning alerts for newly indexed content is valid, although allowing you to specify you want only newly published content would be nice. 2) I guess Google is doing greedy pattern matching, casting a wide net. I bet they have some level of regex support, however. Have you tried being more precise? Seth

  4. Hi Seth, Thanks for reading. Indeed on your first point, I agree, the pages are in one sense new, because they are probably updated, but then again they are in another sense old. For example, RSS seems to do a great job (and I'm not sure how) at showing only one post, the most recent, in my feed, and at knowing that it's the same post, just modified. Google seems to lose track of when a page is the same page, just updated, vs. a new page. And I've embarrassed myself about 10 times at Mark Logic sending mails to the whole company saying "holy cow, breaking news, did you know that" [insert something that happened 2 years ago.] As for search syntax in Google, I'm not sure there's an easy answer for my trivial problem with two adjacent words spanning a paragraph break. And, as I'm sure you know, my real point is that that is a *trivial* example of a much broader point on XML-awareness in content. I checked here for a solution on Google and, on a quick glance, didn't see one.
