The Perils of Text-Only Search

You won’t be surprised to know that I use a series of Google Alerts to help me track events relevant to Mark Logic. I often have two problems with them:

  • Old content is mis-identified as new. You can ask any Mark Logician about the number of times I’ve forwarded a story that I thought was a hot news item only to discover it was four years old and that Google had nevertheless “alerted” me to its existence. I highlight this here because it bugs me, but I will not drill into it.
  • Content is mis-parsed, resulting in erroneous matches and alerts.

For example, today I received a Google Alert on “Mark Logic” for this Reading Eagle story, entitled Douglass Township Man Waives Hearing on Charges He Fired Gun in Neighborhood. Wondering if we had a wayward employee, I read the story, which is about a man named Jeffrey W. Logic, who is charged with firing several shots near a group of people assembled in a neighbor’s front yard.

Here’s the text that generated the hit. (Bolding mine.)

Logic pulled out a gun and fired several shots into one of the car’s tires. He also fired a shot into the pavement, and a stone or bullet fragment ricocheted and struck the driver in the neck, causing a red mark.

Logic started to walk away when two men who had been at the party approached him. Logic pointed the gun at one of them but the man swatted it away. Logic then fired into the ground once more.

What happened? The words mark and logic are sequentially related in the text. But they’re not in the same paragraph, let alone the same sentence. Clearly, if you’ll pardon the pun, this result is a misfire, but it highlights an important problem with full-text search engines: they understand neither the structure nor the semantics of the content they are indexing.
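To make the failure concrete, here is a minimal sketch of how a structure-blind phrase match could produce this hit. It is an assumption about the general mechanism, not a description of how Google actually indexes text; the function name flat_phrase_match is hypothetical.

```python
import re

# Excerpt from the story; the blank line marks the paragraph break.
text = (
    "He also fired a shot into the pavement, and a stone or bullet fragment "
    "ricocheted and struck the driver in the neck, causing a red mark.\n\n"
    "Logic started to walk away when two men who had been at the party "
    "approached him."
)

def flat_phrase_match(text, phrase):
    """Naive phrase match over a token stream that ignores case,
    punctuation, and sentence or paragraph boundaries."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    words = phrase.lower().split()
    n = len(words)
    # Slide a window of n tokens across the stream and compare.
    return any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1))

# "mark" ends one paragraph and "logic" starts the next,
# so the two tokens end up adjacent in the flattened stream.
print(flat_phrase_match(text, "Mark Logic"))
```

Once punctuation and paragraph breaks are thrown away, nothing separates the last word of one paragraph from the first word of the next.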

For example, in an XML representation, you might indicate structure by using <para> tags to indicate paragraphs and <sentence> tags to indicate sentences. When searching, you could then say “find all the <sentences> that contain the phrase ‘Mark Logic’” and you wouldn’t get the false match that Google returned.
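A sentence-scoped search might look roughly like the sketch below. It uses Python's standard xml.etree.ElementTree rather than any Mark Logic API, the <story> wrapper is an assumed root element, and the third sentence is invented purely so that there is one genuine match.

```python
import re
import xml.etree.ElementTree as ET

# Two sentences from the story plus one invented sentence (labeled as such).
doc = ET.fromstring("""
<story>
  <para>
    <sentence>He also fired a shot into the pavement, and a stone or bullet fragment ricocheted and struck the driver in the neck, causing a red mark.</sentence>
  </para>
  <para>
    <sentence>Logic started to walk away when two men who had been at the party approached him.</sentence>
    <sentence>This invented sentence mentions Mark Logic by name.</sentence>
  </para>
</story>
""")

def sentences_with_phrase(root, phrase):
    """Return the text of every <sentence> that contains the phrase."""
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    return [s.text for s in root.iter("sentence") if pattern.search(s.text or "")]

for hit in sentences_with_phrase(doc, "Mark Logic"):
    print(hit)
```

Because the phrase is tested one <sentence> at a time, "mark" at the end of one sentence can never pair with "Logic" at the start of the next.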

Awareness of structural markup is important not only because it eliminates false matches, but because it enables you to express more powerful queries (from both a search and retrieval perspective), such as:

  • Find all the <figures> that have <captions> that contain the phrase “survival rate”
  • Return the <authors> and <abstracts> of articles that contain the word “lymphoma” and have <captions> that contain the phrase “survival rate”
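The two queries above can be sketched against a toy corpus as follows. The element names come from the bullets; the <corpus> wrapper, the article contents, and the helper names are all hypothetical, and the code uses Python's standard library rather than a real content database.

```python
import re
import xml.etree.ElementTree as ET

# Invented placeholder articles, for illustration only.
corpus = ET.fromstring("""
<corpus>
  <article id="a1">
    <author>A. Example</author>
    <abstract>Outcomes in lymphoma patients.</abstract>
    <figure><caption>Five-year survival rate by treatment arm.</caption></figure>
    <figure><caption>Study enrollment by site.</caption></figure>
  </article>
  <article id="a2">
    <author>B. Sample</author>
    <abstract>Survival rate trends in cardiology.</abstract>
    <figure><caption>Heart rate over time.</caption></figure>
  </article>
</corpus>
""")

def has_phrase(text, phrase):
    """Case-insensitive whole-word phrase test; None counts as empty text."""
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return re.search(pattern, text or "", re.IGNORECASE) is not None

def figures_with_caption(root, phrase):
    """Find all <figure> elements whose <caption> contains the phrase."""
    return [f for f in root.iter("figure") if has_phrase(f.findtext("caption"), phrase)]

def authors_and_abstracts(root, word, caption_phrase):
    """Return (author, abstract) for articles that contain the word anywhere
    and have at least one <caption> containing the phrase."""
    results = []
    for article in root.iter("article"):
        full_text = " ".join(article.itertext())
        captions = [c.text for c in article.iter("caption")]
        if has_phrase(full_text, word) and any(has_phrase(c, caption_phrase) for c in captions):
            results.append((article.findtext("author"), article.findtext("abstract")))
    return results

print(len(figures_with_caption(corpus, "survival rate")))
print(authors_and_abstracts(corpus, "lymphoma", "survival rate"))
```

Note that article a2 mentions "survival rate" in its abstract, so a text-only search for the phrase would return it; the caption constraint is what excludes it.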

Or, more powerfully, perform a citation analysis:

  • Return the <authors> and <abstracts> of <articles> with <citations> to <articles> written by <author> “Sandra Horning”
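A citation query of this kind amounts to a join between articles, which is where it starts to resemble a database query. The sketch below assumes a hypothetical ref attribute on <citation> that points at the cited article's id; the three articles are invented placeholders, and only the author name is taken from the example above.

```python
import xml.etree.ElementTree as ET

# Invented placeholder corpus; ids, abstracts, and citation links are hypothetical.
corpus = ET.fromstring("""
<corpus>
  <article id="a1">
    <author>Sandra Horning</author>
    <abstract>Hypothetical abstract one.</abstract>
  </article>
  <article id="a2">
    <author>A. Example</author>
    <abstract>Hypothetical abstract two.</abstract>
    <citation ref="a1"/>
  </article>
  <article id="a3">
    <author>B. Sample</author>
    <abstract>Hypothetical abstract three.</abstract>
    <citation ref="a2"/>
  </article>
</corpus>
""")

def citing_articles(root, cited_author):
    """Return (authors, abstract) for every article that cites an article
    written by cited_author."""
    # Step 1: ids of the articles written by the author of interest.
    cited_ids = {
        a.get("id")
        for a in root.iter("article")
        if cited_author in [au.text for au in a.findall("author")]
    }
    # Step 2: articles with at least one citation pointing at those ids.
    results = []
    for article in root.iter("article"):
        refs = {c.get("ref") for c in article.findall("citation")}
        if refs & cited_ids:
            authors = [au.text for au in article.findall("author")]
            results.append((authors, article.findtext("abstract")))
    return results

print(citing_articles(corpus, "Sandra Horning"))
```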

But, even better is the ability for the system to understand semantic markup, for example, coming from a taxonomy or an automatic entity extraction tool.

  • And find them only in the <articles> that contain references to the <drug> “Rituxan” which is a <monoclonal antibody> which the system knows is also called “Rituximab” and “MabThera.”
  • And which contain the <disease> diffuse large b-cell lymphoma, which the system knows is a <b-cell lymphoma> which is a <lymphoma> which is a <blood cancer> which is a <cancer>

Then think of the simple permutations of this query you can run:

  • Against all monoclonal antibodies, not just Rituxan
  • Against all lymphomas, not just diffuse large b-cell. Or against all blood cancers.
  • Against not just those citing Horning, but against those citing any other author.
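The semantic constraints, and the way they widen, can be sketched with two small lookup tables: one for synonyms and one for is-a relationships. The drug names and the disease hierarchy are the ones given above; the table layout, the function names, and the three tagged articles are hypothetical, standing in for whatever a taxonomy or entity extraction tool would actually supply.

```python
# Synonyms the system "knows": canonical name -> all known names.
SYNONYMS = {"rituxan": {"rituxan", "rituximab", "mabthera"}}

# Is-a hierarchy: term -> its immediate parent.
IS_A = {
    "rituxan": "monoclonal antibody",
    "diffuse large b-cell lymphoma": "b-cell lymphoma",
    "b-cell lymphoma": "lymphoma",
    "lymphoma": "blood cancer",
    "blood cancer": "cancer",
}

def canonical(name):
    """Map any known synonym to its canonical name."""
    for canon, names in SYNONYMS.items():
        if name in names:
            return canon
    return name

def is_a(term, category):
    """True if term equals category or has it as an ancestor in IS_A."""
    while True:
        if term == category:
            return True
        if term not in IS_A:
            return False
        term = IS_A[term]

# Invented articles, already tagged with extracted entities.
articles = [
    {"id": "a1", "drugs": {"rituximab"}, "diseases": {"diffuse large b-cell lymphoma"}},
    {"id": "a2", "drugs": {"mabthera"}, "diseases": {"b-cell lymphoma"}},
    {"id": "a3", "drugs": {"aspirin"}, "diseases": {"lymphoma"}},
]

def search(articles, drug, disease):
    """Ids of articles mentioning a drug and a disease that fall under the given terms."""
    return [
        a["id"]
        for a in articles
        if any(is_a(canonical(d), drug) for d in a["drugs"])
        and any(is_a(x, disease) for x in a["diseases"])
    ]

print(search(articles, "rituxan", "diffuse large b-cell lymphoma"))
print(search(articles, "monoclonal antibody", "lymphoma"))
```

Widening the query is just a matter of passing a broader term: the first call matches only a1, while the second also picks up a2, and swapping "lymphoma" for "blood cancer" or "cancer" requires no other change.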

And then think of all the queries you can run against this same corpus when you apply any number of any combination of full-text, structural, and semantic constraints.

And pause.

If you wonder why I say that Mark Logic lets you run database-style queries against content*, you hopefully now understand why.

It’s not just about catching that Mark is the last word of sentence 20 and Logic is the first word in sentence 21. It’s about combining structural, semantic, and full-text constraints in virtually any combination. And that unleashes a mind-boggling amount of query power. A power, by the way, that we’re accustomed to against data, but are only now beginning to understand against content.


* If you want to search only within items the author bolded for emphasis, you can do that, too!

** Or if you want to search only within footnotes, as they sometimes do in finance, you can do that, too!

NY Times Calls Out Tom Siebel on Death of IT Claims

About six months ago, I blogged my notes from a speech given to the Alliance of Chief Executives by Tom Siebel, entitled From IT to ET (from information technology to energy technology). While I enjoyed the speech, I had a few problems with it:

  • It struck me as unduly and anecdotally negative on the future of information technology. It reeked of a “the party’s over because I’m leaving” mentality.
  • Some of it felt like pure spin. When talking about his new venture, C3 (which I think stands for carbon-conscious consumer), Siebel seemed to go miles out of his way to position it as an energy technology company, not an information technology company. But, as far as I could tell, it was going to make a software application to help enterprises manage their carbon footprint, kind of an ERP for carbon. Quoting Siebel: “an application to monitor, monetize, and mitigate the carbon footprint.” To me, that’s clearly information technology.

Basically, it felt like his thesis was “IT is dead, long live ET” and any, uh, “facts” that contradicted it were overlooked or spun away. So I was happy to see this piece in the New York Times this morning, Are The Glory Days Long Gone for IT?, which challenges some of Siebel’s claims.

The chart above shows the perils of defining the world by one’s own experience. While Siebel characterized 1980-2000 (his prime years) as the go-go years of IT, in reality, the fastest period of IT growth was between 1961 and 1971.

Timothy Bresnahan, a Stanford economist, similarly does not accept Mr. Siebel’s contention that the decline in growth rates this decade, which encompasses two recessions, signals a permanent end to I.T.’s record of growing faster than the larger economy. “It is early days to say the game is over,” he said.

When the economy recovers, there is no dearth of unfinished projects for I.T., he said, like “automating white-collar work and automating buying and selling in markets.”

To me, the latter point is critical. Rather than defining the future of the industry by its recent growth rate or by one person’s successes, we should define it in terms of work left to do. And from that perspective, there is plenty of work left and plenty of growth associated with that work.

In my estimation, it’s not that 1980-2000 were the two big decades of IT; they were the two decades of data. The growth that Siebel refers to was all driven by the relational database, the applications layered on it, and the analytics on top of those applications.

One key reason I joined Mark Logic was that, in some way, I agree with Siebel: we are hitting diminishing marginal returns, not on IT overall, but on what we can do with data. While we have seen great strides in what we can do with data over the past 30 years, content, by contrast, still lives in the Stone Age.

Documents are stored redundantly. They’re not centralized in databases. They’re not re-used. They’re not controlled and managed by applications. They’re not analyzed. Compared to data, content is the Wild West. It’s my belief that this will change, and change radically, over the years to come.

The full story, written by San Jose State business school professor Randall Stross, is here.

Netflix’s 128 Slides on Culture. Awesome.

The blogosphere is lit up with discussion of this 128-slide internal presentation on culture at Netflix, dug up by the Hacking Netflix blog and discussed in this TechCrunch story.

I’m a big believer in the importance of corporate culture as a competitive advantage and often use colorful analogies (e.g., swarming antibodies, Lord of the Flies) to describe one aspect of the culture (mediocrity intolerance) that I want at Mark Logic. Slides 41-51 do a good job of describing what happens in this regard as a company grows, which I witnessed firsthand at Business Objects.

So while I’m not ready to sign up for everything in the Netflix deck, I think it’s about as visionary and practical a piece on corporate culture as anything I’ve seen. Truly excellent stuff.

How Do You Query a Key-Value Store Anyway?

Just picked this joke up via a (warning: expletive) cartoon a friend mailed me. To spare those who want to avoid the expletive, here’s the text.

Bob: So, how do I query the database?
IT guy: It’s not a database. It’s a key-value store.
Bob: OK, it’s not a database. How do I query it?
IT guy: You write a distributed map-reduce function in Erlang.
Bob: Did you just tell me to go screw myself?
IT guy: I believe I did, Bob.

Two Great Posts on Media Industry Disruption

I’ve been off filling my brain at the Stanford Graduate School of Business for the past two weeks, so I haven’t been able to post much. I have nevertheless managed to keep my Tweetstream going, so if you’re not already following me on Twitter, you may wish to consider doing so. I am changing my sharing pattern to include more Tweets, based upon the realization that bit.ly makes it very easy to do so and that I only blog on somewhere between 5% and 25% of the topics that I throw on my to-blog list.

On digging through the deluge of RSS articles I found on my return, I located two particularly interesting posts on disruption of the media industry.

The first is a post by Michael Nielsen, a quantum information theorist and seemingly very smart fellow, entitled Is Scientific Publishing About To Be Disrupted, which includes links to some great posts about the challenges facing newspapers, and provides not only a great general discussion of how industry disruption happens, but also a specific look at media overall and scientific publishing in particular. I’d never heard of Nielsen before, but I’ve already subscribed to his blog because he strikes me as a real Renaissance individual working on fascinating projects like a book on The Future of Science, a series of posts on Google’s Technology Stack, along with the odd post on things like Why The World Needs Quantum Mechanics.

The second is a two-part post on ReadWriteWeb entitled Bits of Destruction Hit the Book Publishing Business (Part 1 and Part 2). These posts focus on three waves rocking the publishing industry (Google Book Search, e-Books, and print on demand) and their consequences for various participants in the book publishing value chain. In the end they predict that future book revenues will be split 33/33/33 among the author, the (web) publisher, and the e-book or print-on-demand deliverer.

Excerpt:

Here is a bookstore owner’s nightmare. Customer walks in; browses around; has grand old time in this temple of knowledge; peruses a book that costs $27; takes out Kindle and orders it for $17, right there in front of your nose, using your wi-fi connection. Aaagh!

You wake up sweating at 3:00 in the morning

Both posts are well worth reading, but save some time to do so and be sure to hit lots of the links embedded in the Nielsen post.