Category Archives: Mark Logic

The 20th Century Called. It Wants Its Relational Database Back.

I saw this piece of creative the other day for a tradeshow ad and loved it.  Remember, Ted Codd invented the relational database in 1970 with his paper “A Relational Model of Data for Large Shared Data Banks.”  This PDF of the classic looks about as old as the ad.  (Do PDFs age?)  Enjoy!

My Slides from the MarkLogic 2010 Digital Publishing Summit

Just a quick post to share my slides from this year’s standing-room-only 2010 Digital Publishing Summit at the Plaza Hotel.

Thank you to everyone for attending!

Six Thoughts on The NoSQL Movement

We are in the middle of one of our periodic analyst tours at MarkLogic, where we meet about 50 top software industry analysts focused on areas like enterprise search, enterprise content management, and database management systems.  The NoSQL movement is one of four key topics we are covering, and while I’d expected some lively discussions about it, most of the time we have found ourselves educating people about NoSQL.

In this post, I’ll share the six key points we’re making about NoSQL on the tour.

Our first point is that NoSQL systems come in many flavors and it’s not just about key/value stores.  These flavors include:

  • Key/value stores (e.g., Hadoop)
  • Document databases (e.g., MarkLogic, CouchDB)
  • Graph databases (e.g., AllegroGraph)
  • Distributed caching systems (e.g., Memcached)

Our second point is that NoSQL is part of a broader trend in database systems:  specialization.  The jack-of-all-trades relational database (e.g., Oracle, DB2) works reasonably well for a broad range of applications — but it is a master of none.  For any specific application, you can design a specialized DBMS that will outperform Oracle by 10 to 1000 times.  Specialization represents, in aggregate, the biggest threat to the big-three DBMS oligopolists.  Examples of specialized DBMSs include:

  • Streambase, Skyler:  real-time stream processing
  • MarkLogic:  semi-structured data
  • Vertica, Greenplum:  mid-range data warehousing
  • Aster:  large-scale (aka “big data”) analytic data warehousing
  • VoltDB:  high volume transaction processing
  • MATLAB:  scientific data management

Our third point is that NoSQL is largely orthogonal to specialization.  There are specialized NoSQL databases (e.g., MarkLogic) and there are specialized SQL databases (e.g., Aster, Volt).  The only case where I think there are zero examples is general-purpose NoSQL systems.  While I’m sure many of the NoSQL crowd would argue that their systems can do everything, is anyone *really* going to run general ledger or opportunity management on Hadoop?   I don’t think so.

Our fourth point is that NoSQL isn’t about open source.  The software-wants-to-be-free crowd wants to build open source into the definition of NoSQL and I believe that is both incorrect and a mistake.  It’s incorrect because systems like MarkLogic (which uses an XML data model and XQuery) are indisputably NoSQL.  And it’s a mistake because technology movements should be about technology, not business models.  (The open source NoSQL gang can solve its problem simply by affiliating with both the NoSQL technology movement and the open source business model movements.)

As CEO of a company that’s invested a lot of energy in supporting standards, I find our fifth point rather ironic: most open source NoSQL systems have proprietary interfaces.  People shouldn’t confuse “can access the source code” with “can write applications that call standard interfaces” and can therefore swap components easily.   If you take offense at the word proprietary, that’s fine.  You can call them unique instead.  But the point is that an application written on Cassandra cannot practically be moved to Couch, regardless of whether you can access the source code of both Couch and Cassandra.

Our sixth point is that we think MarkLogic provides a best-of-both-worlds option between open source NoSQL systems and traditional DBMSs.  Like open source NoSQL systems, MarkLogic provides shared-nothing clustering on inexpensive hardware, superior support for unstructured data, document-orientation, and high performance.  But like traditional databases, MarkLogic speaks a high-level query language, implements industry standards, and is commercial-grade, supported software.  This means that customers can scale applications on inexpensive computers and storage, avoid the pains of normalization and joins, have systems that run fast and can be implemented by normal database programmers, and feel safe that their applications are built via a standard query language (XQuery) that is supported by scores of vendors.

Slides from Mark Logic Digital Publishing Summit

I’m at the Mark Logic Digital Publishing Summit at The Plaza Hotel in New York. While I’m not sure what the “official” means will be for sharing presentation slides, based on a few requests at lunch I’ve uploaded my slides and David Worlock’s slides to SlideShare and embedded them here.

Great event, over 550 registered, almost ran out of chairs at lunch. Thanks to everyone for coming!

My slides:

David’s slides:

Mark Logic Highlighted in San Jose Mercury News Story on Venture Capital

I’m pleased to report that Mark Logic was highlighted in a San Jose Mercury News story published yesterday about the resurgence of VC-backed startups one year after the famous Sequoia Rest in Peace Good Times meeting.

The story was published as part of the Mercury News’s quarterly venture capital survey.

The story begins:

When Dave Kellogg arrived at Sequoia Capital on that day in early October 2008, “the last chair in the room was in the front row,” he recalled. “My penance for being a little bit late.”

Kellogg is the CEO of Mark Logic, a startup that helps business clients make sense of the chaos of unstructured data. He wound up with an excellent seat for an auspicious moment in Silicon Valley lore — the “R.I.P. Good Times” briefing that drove home the severity of the financial industry crisis for the startup economy. Initially intended exclusively for leaders of companies backed by Sequoia’s investments, it would be inadvertently leaked by one CEO and sail around the Web like an early Halloween ghoul.

Not only has the story received great visibility in Silicon Valley, a quote of mine from it was picked up by the Wall Street Journal’s Venture Capital Dispatch blog, here.

The full story is available here. The Mercury News quarterly venture capital survey is here. Another recent piece of Mark Logic business press coverage, from the San Jose Business Journal, is here.

Mark Logic in One Adjective-Filled Sentence by Jason Hunter

Embedded below please find the short slide deck that Jason Hunter presented at the recent Oakland NoSQL Meetup. I thought I’d share the deck here because Jason’s mission was unique: we were sponsoring a meeting of people largely opposed to commercial software and definitely opposed to SQL databases, so he had to tread lightly, say what he had to say, and get out.

Here’s what he said.

NetBase Tragicomedy: The Perils of "Magic" and Language Processing

It’s no secret that I’m not a big fan of “magic” in software. You could argue I’m still bearing the scars from BusinessMiner, one of our few failed products, at Business Objects. You could argue that for some tasks, magic is a necessary evil, and I wouldn’t argue back too hard. Many Mark Logic customers rely on “magic” to automatically enrich content, adding XML tags that identify entities (e.g., people, places, geopolitical organizations), sentiment (e.g., positive, negative or neutral), or even geo-code content with latitude and longitude that we then index, thus enabling geo-queries against content.

While I confess to some ignorance about how the magical tools work, it’s my perception that on a bad day they’re 50% accurate and on a good one they’re 80%. Now one could argue that content that’s enriched at 80% accuracy is way more valuable than unenriched content, and you’d be right. All I’m saying is I’m glad I’m not in the business of making the software that does that, because — customers being customers — nobody wants to hear that 80% is great and 100% is unattainable. Perhaps it’s my lack of deep expertise in the field. Or perhaps it’s my belief that humans are uncomfortable around black boxes.

The other reason I don’t like magic is that it can fail in truly spectacular ways. What’s the expression? To err is human. To really foul things up requires natural language processing.

This happened today with NetBase, a company whose high-level messaging is fairly similar to Mark Logic’s though happily with very different technology and business strategy.

NetBase recently launched healthBase, “a new health research showcase to find treatments, causes, and complications of any condition [and the] pros and cons of any drug, food, or treatment.”

Sounds nice. But, today they were slaughtered on TechCrunch with a story headlined: NetBase Thinks You Can Get Rid of Jews with Alcohol and Salt. Excerpt:

Several of our readers tested out the site and found that healthBase’s semantic search engine has some major glitches (see the comments). One of the most unfortunate examples is when you type in a search for “AIDS,” one of the listed causes of the disease is “Jew.” Really.

The ridiculousness continues. When you click on Jew, you can see proper “Treatments” for Jews, “Drugs And Medications” for Jews and “Complications” for Jews. Apparently, “alcohol” and “coarse salt” are treatments to get rid of Jews, as is Dr. Pepper! Who knew?

Here’s a great demo of why I don’t want to sell semantic processing technology. Here’s the reply Netbase gave TechCrunch:

This is an unfortunate example of homonymy, i.e., words that have different meanings.

The showcase was not configured to distinguish between the disease “AIDS” and the verb “aids” (as in aiding someone). If you click on the result “Jew” you see a sentence from a Wikipedia page about 7th Century history: “Hispano-Visigothic king Egica accuses the Jews of aiding the Muslims, and sentences all Jews to slavery. ” Although Wikipedia contains a lot of great health information it also contains non-health related information (like this one) that is hard to filter out.

I hate to be pedestrian, but isn’t that just a fancy way of saying it doesn’t work? It reminds me of the quip about Autonomy, where, when the Bayesian and Shannon’s Information Theory magic isn’t working, they simply tell the customer that they’re not smart enough to understand why. Nice.
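For the curious, here is a minimal Python sketch of how this kind of collision can happen. The `naive_stem` function is a toy normalizer I made up for illustration; it is an assumption, not a description of how NetBase actually processes text. It lowercases each word and strips a common suffix, which is enough to make the disease “AIDS” and the verb “aiding” look like the same term.

```python
def naive_stem(word):
    """Toy normalizer (hypothetical): lowercase, then strip one common suffix."""
    w = word.lower().strip('.,;:!?"')
    for suffix in ("ing", "s"):
        # Only strip when at least three characters would remain.
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


def stemmed_match(query, text):
    """True if any word in the text normalizes to the same stem as the query."""
    q = naive_stem(query)
    return any(naive_stem(token) == q for token in text.split())


def exact_match(query, text):
    """Case-sensitive, whole-word match with no normalization."""
    return query in text.split()


sentence = "Egica accuses the Jews of aiding the Muslims"

print(naive_stem("AIDS"), naive_stem("aiding"))   # both normalize to the same stem
print(stemmed_match("AIDS", sentence))            # the verb collides with the disease
print(exact_match("AIDS", sentence))              # no collision without normalization
```

Normalization like this usually helps recall, which is presumably why tools do it. The sketch only shows that, without something extra to separate a disease name from a verb, the same step can produce a confident wrong answer.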

Now, for the hapless NetBase, the AIDS query was just the beginning. They get destroyed in the blog comments, which quickly turned into a contest to find the silliest results.

  • The treatment for venture capital is funding. The cons is fool.
  • Masturbation causes insanity and is cured by cocaine.
  • The treatment for Twitter is Facebook. (This one might be right.)
  • The treatment for Microsoft is Viagra
  • Babies are caused by smoking and brain damage

It goes on and on. Now yes, many of the silly queries are out of the health domain, but there has to be a better way to answer them.

One active commenter, Dave, who coined the “tragicomedy” description and who isn’t me, had this to offer:

The tragi-comic failure of Netbase can teach a lot to every company in the Semantic space.

Lesson 1

Don’t even try to boil the ocean of the WWW with these technologies. [The] Internet is full of valuable information but crap (or opinions) is 90% [of it], the cost of getting rid of this crap and save only the good stuff is very high, [and] that’s [what] makes [it] so hard to succeed even for Google and Microsoft with billions [of dollars].

Lesson 2

Linguistic approaches are likely going to fail because search engines (and machines) can’t distinguish joke/seriousness, sarcasm/shame and sentiments in general. The semantic meaning is right there not in the words of a text.

Lesson 3

If you choose to apply such approaches to one specific topic like Medicine (good choice) then stick to that topic , that means accept as INPUT only medical terms and provide as OUTPUTS only medical terms.

This last point requires human intervention and predefined taxonomies/ontologies but Netbase claims that they don’t need them both, [i.e., that] their engine is fully automatic —> the failure too.

Why Mark Logic Isn't Bigger Than Oracle. Or Is It?

I had a great meeting with a Silicon Valley technology executive this afternoon. He was a very clever chap with all the right credentials: an engineering master’s, a top MBA, experience at Google. The works.

He asked us the standard set of questions that tech executives do.

  • Is this a database or search engine? (Yes. As in both.)
  • Does it have transaction consistency? (Yes.)
  • Does it provide real-time search? (Yes.)
  • Does it have a query language? (Yes. XQuery.)
  • Does it scale? (Yes, to 100+ TB today)
  • Does it cluster on cheap hardware? (Yes.)
  • Do you require schema adherence? (No.)
  • Can you handle semi-structured content? (Yes.)
  • Is it native XML? (Yes.)
  • What database does it run on? (Itself. i.e., it is a DBMS.)
  • Do you use Lucene? (No. It has our own built-in search engine.)
  • Is it a database bolted to a search engine? (No. It’s both but as an integral hybrid.)

We went on to cover common customer use-cases, talking about the generic reasons why customers buy Mark Logic:

  • Messy XML. (Various structures, unknown structures, changing structure.)
  • Big content. (Tens to hundreds of terabytes and nothing else will perform.)
  • DBMS/search-engine integration fatigue. (Tired of paying the costs of development, maintenance, and synchronization.)
  • Semi-structured data. (Such as the wide, sparse table problem.)
  • “Special” search requirements, where “special” could mean any combination of: structured, parametric, real-time, language-specific, or geo-coded search.

Then, suddenly, he caught us off-guard with his next question:

If you actually do all this, then why aren’t you bigger than Oracle?

In the meeting, we stumbled: “uh, gosh, well, that’s a question we don’t hear often.” Now, had I brought my A-game to the meeting, here’s what I would have said:

On an age-adjusted basis, we are.

For my source, check out this New York Times post, How Long Does It Take to Build a Technology Empire? The post measures various high-tech companies on their historical revenue ramps, specifically addressing the question how long did it take to get to $50M in revenues?

The answer for Oracle is 10 years. In Oracle’s year six, according to the graphic — which is interactive, you should play around with it — they were a mere $5M in inflation-adjusted revenues. Mark Logic, in year six (from its A-round funding) is many times that and should break $50M, in my estimation and with continued good fortune, in year 7.


The Perils of Text-Only Search

You won’t be surprised to know that I use a series of Google Alerts to help me track events relevant to Mark Logic. I often have two problems with them:

  • Old content is mis-identified as new. You can ask any Mark Logician about the number of times I’ve forwarded a story that I thought was a hot news item only to discover it was four years old and that Google had nevertheless “alerted” me to its existence. I highlight this here because it bugs me, but I will not drill into it.
  • Content is mis-parsed, resulting in erroneous matches and alerts.

For example, today I received a Google Alert on “Mark Logic” for this Reading Eagle story, entitled Douglass Township Man Waives Hearing on Charges He Fired Gun in Neighborhood. Wondering if we had a wayward employee, I read the story which is about a man named Jeffrey W. Logic, who is charged with firing several shots near a group of people assembled in a neighbor’s front yard.

Here’s the text that generated the hit. (Bolding mine.)

Logic pulled out a gun and fired several shots into one of the car’s tires. He also fired a shot into the pavement, and a stone or bullet fragment ricocheted and struck the driver in the neck, causing a red mark.

Logic started to walk away when two men who had been at the party approached him. Logic pointed the gun at one of them but the man swatted it away. Logic then fired into the ground once more.

What happened? The words mark and logic are sequentially related in the text. But they’re not in the same paragraph, let alone the same sentence. Clearly, if you’ll pardon the pun, this result is a misfire, but it highlights an important problem with full-text search engines: they understand neither the structure nor the semantics of the content they are indexing.

For example, in an XML representation, you might indicate structure by using <para> tags to indicate paragraphs and <sentence> tags to indicate sentences. When searching, you could then say “find all the <sentences> that contain the phrase ‘Mark Logic’” and you wouldn’t get the false match that Google returned.
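To make the difference concrete, here is a minimal Python sketch using the standard library’s ElementTree. The sample document is an abridged version of the excerpt above, and the <article>, <para>, and <sentence> markup is my own assumption about how such a story might be tagged. The helper names are hypothetical, and this is not how MarkLogic (or Google) implements search.

```python
import re
import xml.etree.ElementTree as ET

# Abridged sample text; the markup is assumed for illustration.
DOC = """<article>
  <para>
    <sentence>A fragment struck the driver in the neck, causing a red mark.</sentence>
  </para>
  <para>
    <sentence>Logic started to walk away when two men approached him.</sentence>
  </para>
</article>"""


def tokens(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def contains_phrase(text, phrase):
    """True if the words of the phrase appear consecutively in the text."""
    words, target = tokens(text), tokens(phrase)
    n = len(target)
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))


root = ET.fromstring(DOC)

# Text-only view: flatten everything, ignoring paragraph and sentence boundaries.
flat_text = " ".join(root.itertext())
text_only_hit = contains_phrase(flat_text, "Mark Logic")

# Structure-aware view: the phrase must fall inside a single <sentence>.
sentence_hits = [s.text for s in root.iter("sentence")
                 if contains_phrase(s.text, "Mark Logic")]

print(text_only_hit)   # True: "mark" ends one sentence and "Logic" starts the next
print(sentence_hits)   # []: no single sentence contains the phrase
```

The only thing that changes between the two searches is the scope of the match, which is exactly the information a text-only index throws away.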

Awareness of structural markup is important not only because it eliminates false matches, but because it enables you to express more powerful queries (from both a search and retrieval perspective), such as:

  • Find all the <figures> that have <captions> that contain the phrase “survival rate”
  • Return the <authors> and <abstracts> of articles that contain the word “lymphoma” and have <captions> that contain the phrase “survival rate”
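As a rough sketch of the second query, here is one way to express it in Python with ElementTree. The tiny corpus, its element names, and the author names are all invented for illustration, and the substring matching is deliberately crude. In practice you would write this in a query language like XQuery against an indexed database rather than scanning documents in a loop.

```python
import xml.etree.ElementTree as ET

# Invented three-article corpus; names and text are placeholders.
CORPUS = """<corpus>
  <article>
    <author>A. Example</author>
    <abstract>Outcomes in diffuse large b-cell lymphoma.</abstract>
    <figure><caption>Five-year survival rate by treatment arm.</caption></figure>
  </article>
  <article>
    <author>B. Sample</author>
    <abstract>A review of lymphoma imaging.</abstract>
    <figure><caption>Scan of a typical patient.</caption></figure>
  </article>
  <article>
    <author>C. Placeholder</author>
    <abstract>Survival rate trends in cardiology.</abstract>
    <figure><caption>Survival rate over time.</caption></figure>
  </article>
</corpus>"""


def full_text(elem):
    """All text under an element, lowercased, with whitespace collapsed."""
    return " ".join("".join(elem.itertext()).lower().split())


def authors_and_abstracts(root, word, caption_phrase):
    """Articles containing `word` anywhere AND `caption_phrase` inside a <caption>."""
    results = []
    for article in root.iter("article"):
        has_word = word in full_text(article)  # crude substring test
        has_caption = any(caption_phrase in full_text(c)
                          for c in article.iter("caption"))
        if has_word and has_caption:
            results.append((article.findtext("author"),
                            article.findtext("abstract")))
    return results


results = authors_and_abstracts(ET.fromstring(CORPUS), "lymphoma", "survival rate")
print(results)
```

Only the first article satisfies both constraints: the second mentions lymphoma but has no matching caption, and the third has a matching caption but never mentions lymphoma.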

Or, more powerfully, perform a citation analysis:

  • Return the <authors> and <abstracts> of <articles> with <citations> to <articles> written by <author> “Sandra Horning”

But, even better is the ability for the system to understand semantic markup, for example, coming from a taxonomy or an automatic entity extraction tool.

  • And find them only in the <articles> that contain references to the <drug> “Rituxan” which is a <monoclonal antibody> which the system knows is also called “Rituximab” and “MabThera.”
  • And which contain the <disease> diffuse large b-cell lymphoma, which the system knows is a <b-cell lymphoma> which is a <lymphoma> which is a <blood cancer> which is a <cancer>

Then think of the simple permutations of this query you can run:

  • Against all monoclonal antibodies, not just Rituxan
  • Against all lymphomas, not just diffuse large b-cell. Or against all blood cancers.
  • Against not just those citing Horning, but against those citing any other author.
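Here is a minimal Python sketch of what “the system knows” might look like. The is-a chain and synonyms come straight from the example above; the article records, the `aspirin` entry, and all the function names are hypothetical, and a real taxonomy would be far larger and managed by a proper tool.

```python
# Is-a relationships and synonyms taken from the example in the text.
IS_A = {
    "diffuse large b-cell lymphoma": "b-cell lymphoma",
    "b-cell lymphoma": "lymphoma",
    "lymphoma": "blood cancer",
    "blood cancer": "cancer",
    "rituxan": "monoclonal antibody",
}
SYNONYMS = {"rituxan": {"rituximab", "mabthera"}}

# Reverse lookup from each alias to its canonical name.
CANONICAL = {alias: name for name, aliases in SYNONYMS.items() for alias in aliases}


def canonical(term):
    """Map a term or any of its known aliases to one canonical, lowercase name."""
    term = term.lower()
    return CANONICAL.get(term, term)


def ancestors(term):
    """Walk the is-a chain upward and return every broader category."""
    chain = []
    while term in IS_A:
        term = IS_A[term]
        chain.append(term)
    return chain


def is_a(term, category):
    """True if the term is the category itself or falls somewhere beneath it."""
    term = canonical(term)
    return term == category or category in ancestors(term)


def search(articles, drug_category, disease_category):
    """IDs of articles whose tagged drug and disease fall under the given categories."""
    return [a["id"] for a in articles
            if is_a(a["drug"], drug_category) and is_a(a["disease"], disease_category)]


# Hypothetical articles, already tagged by an entity extraction step.
articles = [
    {"id": 1, "drug": "MabThera", "disease": "diffuse large b-cell lymphoma"},
    {"id": 2, "drug": "aspirin", "disease": "diffuse large b-cell lymphoma"},
    {"id": 3, "drug": "Rituximab", "disease": "lymphoma"},
]

narrow = search(articles, "rituxan", "diffuse large b-cell lymphoma")
broad = search(articles, "monoclonal antibody", "blood cancer")
print(narrow)  # [1]
print(broad)   # [1, 3]
```

Widening the query from one drug and one disease to whole categories is a change of two arguments, which is the point of the permutations listed above.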

And then think of all the queries you can run against this same corpus when you apply any number of any combination of full-text, structural, and semantic constraints.

And pause.

If you wonder why I say that Mark Logic lets you run database-style queries against content*, you hopefully now understand why.

It’s not just about catching that Mark is the last word of sentence 20 and Logic is the first word in sentence 21. It’s about combining structural, semantic, and full-text constraints and in virtually any combination. And that unleashes a mind-boggling amount of query power. A power, by the way, that we’re accustomed to against data, but are now only beginning to understand against content.


* If you want to search only within items the author bolded for emphasis, you can do that, too!

** Or if you want to search only within footnotes, as they sometimes do in finance, you can do that, too!

Fun at the Mark Logic Company Picnic

Like many companies, we have an annual summer event. Since many Mark Logicians have young children, we bias the entertainment a bit in that direction. This year we had a face-painter, a bubble-blower, and a truly amazing balloon-twister.

During the event, two of our marketers cut the face-painting line (making a few kids cry in the process) and put the face-painter to good marketing use by, well, body-painting some key Mark Logic messaging on themselves: Mark Logic — The Best Place to Put Your XML.

(Big, bad a** fast, XQuery is on the arms.)

Gosh, we have so much fun around here at times it scares me.