Category Archives: MarkLogic

The Pillorying of MarkLogic: Why Selling Disruptive Technology To the Government is Hard and Risky

There’s a well-established school of thought that high-tech startups should focus on a few vertical markets early in their development.  The question is whether government should be one of them.

The government seems to think so.  They run a handful of programs to encourage startups to focus on government.  Heck, the CIA even has a venture arm right on Sand Hill Road, In-Q-Tel, whose mission is to find startups who are not focused on the Intelligence Community (IC) and to help them find initial customers (and provide them with a dash of venture capital) to encourage them to do so.

When I ran MarkLogic between mid-2004 and 2010, we made the strategic decision to focus on government as one of our two key verticals.  While it was then, and still is, rather contrarian to do so, we nevertheless decided to focus on government for several reasons.

  • The technology fit was very strong.  There are many places in government, including the IC, where they have a bona fide need for a hybrid database / search engine, such as MarkLogic.
  • Many people in government were tired of the Oracle-led oligopoly in the RDBMS market and were seeking alternatives.  (Think:  I’m tired of writing Oracle $40M checks.)  While this was true in other markets, it was particularly true in government because their problems were compounded by lack of good technical fit — i.e., they were paying an oligopolist a premium price for technology that was not, in the end, terribly well suited to what they were doing.
  • Unlike other markets (e.g., Finance, Web 2.0) where companies could afford the high-caliber talent able to use the then-new open source NoSQL alternatives, government — with the exception of the IC — was not swimming in such talent.  Ergo, government really needed a well-supported enterprise NoSQL system usable by a more typical engineer.

The choice had always made me nervous for a number of reasons:

  • Government deals were big, so it could lead to feast-or-famine revenue performance unless you were able to figure out how to smooth out the inherent volatility.
  • Government deals ran through systems integrators (SI) which could greatly complexify the sales cycle.
  • Government was its own tribe, with its own language, and its own idiosyncrasies (e.g., security clearances).  While bad from the perspective of commercial expansion, these things also served as entry barriers that, once conquered, should provide a competitive advantage.

The only thing I hadn’t really anticipated was the politics.

It had never occurred to me, for example, that in a $630M project — where MarkLogic might get maybe $5 to $10M — that someone would try to blame failure of what appears to be one of the worst-managed projects in recent history on a component that’s getting say 1% of the fees.

It makes no sense.  But now, for the second time, the New York Times has written an article about the HealthCare.gov fiasco where MarkLogic is not only one of very few vendors even mentioned but somehow implicated in the failures because it is different.

HealthCare.gov

Let me start with a few of my own observations on HealthCare.gov from the sidelines.  (Note that I, to my knowledge, was never involved with the project during my time at MarkLogic.)

From the cheap seats the problems seem simple:

  • Unattainable timelines.  You don’t build a site “just like Amazon.com” using government contractors in a matter of quarters.  Amazon has been built over the course of more than a decade.
  • No Beta program.  It’s incomprehensible to me that such a site would go directly from testing into production without quarters of Beta.  (Remember, not so long ago, that Google ran Betas for years?)
  • No general oversight.  It seems that there was no one playing the general contractor role.  Imagine if you built a house with plumbers, carpenters, and electricians not coordinated by a strong central resource.
  • Insufficient testing.  The absent Beta program aside, it seems the testing phase lasted only weeks, that certain basic functionality was not tested, and that it’s not even clear if there was a code-freeze before testing.
  • Late changes.  Supporting the idea that there was no code freeze are claims that the functional spec was changing weeks before the launch.

Sadly, these are not rare problems on a project of this scale.  This kind of stuff happens all the time, and each of these problems is a hallmark of a “train wreck” software development project.

To me, guessing from a distance, it seems pretty obvious what happened.

  • Someone who didn’t understand how hard it was to build ordered up a website of very high complexity with totally unrealistic timeframes.
  • A bunch of integrators (and vendors) who wanted their share of the $630M put in bids, probably convincing themselves in each part of the system that if things went very well that they could maybe make the deadlines or, if not, maybe cut some scope.  (Remember you don’t win a $50M bid by saying “the project is crazy and the timeframe unrealistic.”)
  • Everybody probably did their best but knew deep down that the project was failing.
  • Everyone was afraid to admit that the project was failing because nobody likes to deliver bad news, and it seems that there was no one central coordinator whose job it was to do so.

Poof.  It happens all the time.  It’s why the world has generally moved away from big-bang projects and towards agile methodologies.

While sad, this kind of story happens.  The question is how the New York Times ends up writing two articles where the failure is somehow blamed on MarkLogic.  Why is MarkLogic even mentioned?  This is the story of a project run amok, not the story of a technology component failure.

Politics and Technology

The trick with selling disruptive technology to the government is that you encounter two types of people.

  • Those who look objectively at requirements and try to figure out which technology can best do the job.  Happily, our government contains many of these types of people.
  • Those who look at their own skill sets and view any disruptive technology as a threat.

I met many Oracle-DBA-lifers during my time working with the government.  And I’m OK with their personal decision to stop learning, not refresh their skills, not stay current on technology, and to want to ride a deep expertise in the Oracle DBMS into a comfortable retirement.  I get it.  It’s not a choice I’d make, but I can understand.

What I cannot understand, however, is when someone takes a personal decision and tries to use it as a reason to not use a new technology.  Think:  I don’t know MarkLogic, it is new, ergo it is a threat to my personal career plan, and ergo I am opposed to using MarkLogic, prima facie, because it’s not aligned with my personal interests.  That’s not OK.

To give you an idea of how warped this perspective can get (and while this may be urban myth), I recall hearing a story that one time a Federal contractor called a whistle-blower line to report the use of MarkLogic on a system instead of Oracle.  All I could think of was Charlton Heston at the end of Soylent Green saying, “I’ve seen it happening … it’s XML … they’re making it out of XML.”

The trouble is that these folks exist and they won’t let go.  The result:  when a $630M poorly managed project gets in trouble, they instantly raise and re-raise decisions made about technology with the argument that “it’s non-standard.”

Oracle was non-standard in 1983.  Thirty years later it’s too standard (i.e., part of an oligopoly) and not adapted to the new technical challenges at hand.  All because some bright group of people wanted to try something new, to meet a new challenge, that cost probably a fraction of what Oracle would have charged, the naysayers and Oracle lifers will challenge it endlessly saying it’s “different.”

Yes, it is different.  And that, as far as I can tell, was the point.  And if you think that looking at 1% of the costs is the right way to diagnose a struggling $630M project, I’d beg to differ.  Follow the money.

###

FYI, in researching this post, I found this just-released HealthCare.gov progress report.

The 20th Century Called. It Wants Its Relational Database Back.

I saw this piece of creative the other day for a tradeshow ad and loved it.  Remember, Ted Codd invented the relational database in 1970 with his paper “A Relational Model of Data for Large Shared Data Banks.”  This PDF of the classic looks about as old as the ad.  (Do PDFs age?)  Enjoy!

My Slides from the MarkLogic 2010 Digital Publishing Summit

Just a quick post to share my slides from this year’s standing-room-only 2010 Digital Publishing Summit at the Plaza Hotel.

Thank you to everyone for attending!

Six Thoughts on The NoSQL Movement

We are in the middle of one of our periodic analyst tours at MarkLogic, where we meet about 50 top software industry analysts focused in areas like enterprise search, enterprise content management, and database management systems.  The NoSQL movement is one of four key topics we are covering, and while I’d expected some lively discussions about it, most of the time we have found ourselves educating people about NoSQL.

In this post, I’ll share the six key points we’re making about NoSQL on the tour.

Our first point is that NoSQL systems come in many flavors and it’s not just about key/value stores.  These flavors include:

  • Key/value stores (e.g., Hadoop)
  • Document databases (e.g., MarkLogic, CouchDB)
  • Graph databases (e.g., AllegroGraph)
  • Distributed caching systems (e.g., Memcached)

Our second point is that NoSQL is part of a broader trend in database systems:  specialization.  The jack-of-all-trades relational database (e.g., Oracle, DB2) works reasonably well for a broad range of applications — but it is a master of none.  For any specific application, you can design a specialized DBMS that will outperform Oracle by 10 to 1000 times.  Specialization represents, in aggregate, the biggest threat to the big-three DBMS oligopolists.  Examples of specialized DBMSs include:

  • Streambase, Skyler:  real-time stream processing
  • MarkLogic:  semi-structured data
  • Vertica, Greenplum:  mid-range data warehousing
  • Aster:  large-scale (aka “big data”) analytic data warehousing
  • VoltDB:  high volume transaction processing
  • MATLAB:  scientific data management

Our third point is that NoSQL is largely orthogonal to specialization.  There are specialized NoSQL databases (e.g., MarkLogic) and there are specialized SQL databases (e.g., Aster, Volt).  The only case where I think there are zero examples is general-purpose NoSQL systems.  While I’m sure many of the NoSQL crowd would argue that their systems can do everything, is anyone *really* going to run general ledger or opportunity management on Hadoop?   I don’t think so.

Our fourth point is that NoSQL isn’t about open source.  The software-wants-to-be-free crowd wants to build open source into the definition of NoSQL and I believe that is both incorrect and a mistake.  It’s incorrect because systems like MarkLogic (which uses an XML data model and XQuery) are indisputably NoSQL.  And it’s a mistake because technology movements should be about technology, not business models.  (The open source NoSQL gang can solve its problem simply by affiliating with both the NoSQL technology movement and the open source business model movements.)

As CEO of a company that’s invested a lot of energy in supporting standards, our fifth point was that, rather ironically, most open source NoSQL systems have proprietary interfaces.  People shouldn’t confuse “can access the source code” with “can write applications that call standard interfaces” and ergo can swap components easily.   If you take offense at the word proprietary, that’s fine.  You can call them unique instead.  But the point is an application written on Cassandra is not practically moved to Couch, regardless of whether you can access the source code of both Couch and Cassandra.

Our sixth point is that we think MarkLogic provides a best-of-both-worlds option between open source NoSQL systems and traditional DBMSs.  Like open source NoSQL systems, MarkLogic provides shared-nothing clustering on inexpensive hardware, superior support for unstructured data, document orientation, and high performance.  But like traditional databases, MarkLogic speaks a high-level query language, implements industry standards, and is commercial-grade, supported software.  This means that customers can scale applications on inexpensive computers and storage, avoid the pains of normalization and joins, have systems that run fast, can be implemented by normal database programmers, and feel safe that their applications are built via a standard query language (XQuery) that is supported by scores of vendors.

Slides from Mark Logic Digital Publishing Summit

I’m at the Mark Logic Digital Publishing Summit at The Plaza Hotel in New York. While I’m not sure what the “official” means will be for sharing presentation slides, based on a few requests at lunch I’ve uploaded my slides and David Worlock’s slides to SlideShare and embedded them here.

Great event, over 550 registered, almost ran out of chairs at lunch. Thanks to everyone for coming!

My slides:

David’s slides:

NetBase Tragicomedy: The Perils of "Magic" and Language Processing

It’s no secret that I’m not a big fan of “magic” in software. You could argue I’m still bearing the scars from BusinessMiner, one of our few failed products, at Business Objects. You could argue that for some tasks, magic is a necessary evil, and I wouldn’t argue back too hard. Many Mark Logic customers rely on “magic” to automatically enrich content, adding XML tags that identify entities (e.g., people, places, geopolitical organizations), sentiment (e.g., positive, negative or neutral), or even geo-code content with latitude and longitude that we then index, thus enabling geo-queries against content.

While I confess to some ignorance about how the magical tools work, it’s my perception that on a bad day they’re 50% accurate and on a good one they’re 80%. Now one could argue that content that’s enriched at 80% accuracy is way more valuable than unenriched content, and you’d be right. All I’m saying is I’m glad I’m not in the business of making the software that does that, because — customers being customers — nobody wants to hear that 80% is great and 100% is unattainable. Perhaps it’s my lack of deep expertise in the field. Or perhaps it’s my belief that humans are uncomfortable around black boxes.

The other reason I don’t like magic is that it can fail in truly spectacular ways. What’s the expression? To err is human. To really foul things up requires natural language processing.

This happened today with NetBase, a company whose high-level messaging is fairly similar to Mark Logic’s though happily with very different technology and business strategy.

NetBase recently launched healthBase, “a new health research showcase to find treatments, causes, and complications of any condition [and the] pros and cons of any drug, food, or treatment.”

Sounds nice. But, today they were slaughtered on TechCrunch with a story headlined: NetBase Thinks You Can Get Rid of Jews with Alcohol and Salt. Excerpt:

Several of our readers tested out the site and found that healthBase’s semantic search engine has some major glitches (see the comments). One of the most unfortunate examples is when you type in a search for “AIDS,” one of the listed causes of the disease is “Jew.” Really.

The ridiculousness continues. When you click on Jew, you can see proper “Treatments” for Jews, “Drugs And Medications” for Jews and “Complications” for Jews. Apparently, “alcohol” and “coarse salt” are treatments to get rid of Jews, as is Dr. Pepper! Who knew?

Here’s a great demo of why I don’t want to sell semantic processing technology. Here’s the reply NetBase gave TechCrunch:

This is an unfortunate example of homonymy, i.e., words that have different meanings.

The showcase was not configured to distinguish between the disease “AIDS” and the verb “aids” (as in aiding someone). If you click on the result “Jew” you see a sentence from a Wikipedia page about 7th Century history: “Hispano-Visigothic king Egica accuses the Jews of aiding the Muslims, and sentences all Jews to slavery. ” Although Wikipedia contains a lot of great health information it also contains non-health related information (like this one) that is hard to filter out.

I hate to be pedestrian, but isn’t that just a fancy way of saying it doesn’t work? It reminds me of the quip about Autonomy, where, when the Bayesian and Shannon’s Information Theory magic isn’t working, they simply tell the customer that they’re not smart enough to understand why. Nice.
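To make the homonymy problem concrete, here is a minimal Python sketch of one plausible way such a collision can arise. This is not a claim about NetBase's actual pipeline: the function names and the deliberately crude suffix-stripping rule are my own assumptions, chosen only to show how lowercasing plus naive stemming can fold the disease name and the verb into the same index term.

```python
def naive_normalize(token):
    """Lowercase the token, then strip one common suffix (a crude stemmer)."""
    t = token.lower()
    for suffix in ("ing", "s"):
        # Only strip when at least three characters of stem would remain.
        if t.endswith(suffix) and len(t) > len(suffix) + 2:
            return t[: -len(suffix)]
    return t


def case_aware_normalize(token):
    """Same as above, but leave all-caps tokens (likely acronyms) untouched."""
    if token.isupper() and len(token) > 1:
        return token
    return naive_normalize(token)


# The disease and the verb collapse to one term under the naive rule...
print(naive_normalize("AIDS"), naive_normalize("aiding"))
# ...but stay distinct once case is treated as a signal.
print(case_aware_normalize("AIDS"), case_aware_normalize("aiding"))
```

Treating case as a signal is only one possible mitigation, and it would fail on text that is all upper case or all lower case, which is part of why this kind of processing is hard to get to 100%.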

Now, for the hapless NetBase, the AIDS query was just the beginning. They get destroyed in the blog comments, which quickly turned into a contest to find the silliest results.

  • The treatment for venture capital is funding. The cons is fool.
  • Masturbation causes insanity and is cured by cocaine.
  • The treatment for Twitter is Facebook. (This one might be right.)
  • The treatment for Microsoft is Viagra
  • Babies are caused by smoking and brain damage

It goes on and on. Now yes, many of the silly queries are out of the health domain, but there has to be a better way to answer them.

One active commenter, Dave, who coined the “tragicomedy” description and who isn’t me, had this to offer:

The tragi-comic failure of Netbase can teach a lot to every company in the Semantic space.

Lesson 1

Don’t even try to boil the ocean of the WWW with these technologies. [The] Internet is full of valuable information but crap (or opinions) is 90% [of it], the cost of getting rid of this crap and save only the good stuff is very high, [and] that’s [what] makes [it] so hard to succeed even for Google and Microsoft with billions [of dollars].

Lesson 2

Linguistic approaches are likely going to fail because search engines (and machines) can’t distinguish joke/seriousness, sarcasm/shame and sentiments in general. The semantic meaning is right there not in the words of a text.

Lesson 3

If you choose to apply such approaches to one specific topic like Medicine (good choice) then stick to that topic , that means accept as INPUT only medical terms and provide as OUTPUTS only medical terms.

This last point requires human intervention and predefined taxonomies/ontologies but Netbase claims that they don’t need them both, [i.e., that] their engine is fully automatic --> the failure too.

The Perils of Text-Only Search

You won’t be surprised to know that I use a series of Google Alerts to help me track events relevant to Mark Logic. I often have two problems with them:

  • Old content is mis-identified as new. You can ask any Mark Logician about the number of times I’ve forwarded a story that I thought was a hot news item only to discover it was four years old and that Google had nevertheless “alerted” me to its existence. I highlight this here because it bugs me, but I will not drill into it.
  • Content is mis-parsed, resulting in erroneous matches and alerts.

For example, today I received a Google Alert on “Mark Logic” for this Reading Eagle story, entitled Douglass Township Man Waives Hearing on Charges He Fired Gun in Neighborhood. Wondering if we had a wayward employee, I read the story which is about a man named Jeffrey W. Logic, who is charged with firing several shots near a group of people assembled in a neighbor’s front yard.

Here’s the text that generated the hit. (Bolding mine.)

Logic pulled out a gun and fired several shots into one of the car’s tires. He also fired a shot into the pavement, and a stone or bullet fragment ricocheted and struck the driver in the neck, causing a red mark.

Logic started to walk away when two men who had been at the party approached him. Logic pointed the gun at one of them but the man swatted it away. Logic then fired into the ground once more.

What happened? The words mark and logic are sequentially related in the text. But they’re not in the same paragraph, let alone the same sentence. Clearly, if you’ll pardon the pun, this result is a misfire, but it highlights an important problem with full-text search engines: they understand neither the structure nor the semantics of the content they are indexing.

For example, in an XML representation, you might indicate structure by using <para> tags to indicate paragraphs and <sentence> tags to indicate sentences. When searching, you could then say “find all the <sentences> that contain the phrase ‘Mark Logic’” and you wouldn’t get the false match that Google returned.
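As a rough illustration of the difference, here is a small Python sketch using the standard library's xml.etree.ElementTree. The markup, the helper names, and the tokenizer are my own assumptions for illustration; this is not how Google or Mark Logic index content. It runs the same two-word phrase match once against the flattened text and once within each <sentence> element.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical markup for part of the passage quoted above.
doc = ET.fromstring("""
<article>
  <para>
    <sentence>Logic pulled out a gun and fired several shots into one of the car's tires.</sentence>
    <sentence>He also fired a shot into the pavement, and a stone or bullet fragment ricocheted and struck the driver in the neck, causing a red mark.</sentence>
  </para>
  <para>
    <sentence>Logic started to walk away when two men who had been at the party approached him.</sentence>
  </para>
</article>
""")


def words(text):
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def has_phrase(tokens, phrase):
    """True if the phrase tokens appear consecutively in the token list."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))


phrase = ["mark", "logic"]

# Text-only view: all structure is thrown away before matching.
flat_hit = has_phrase(words(" ".join(doc.itertext())), phrase)

# Structure-aware view: the phrase must fall inside a single <sentence>.
sentence_hits = [s.text for s in doc.iter("sentence")
                 if has_phrase(words(s.text), phrase)]

print(flat_hit)       # True: "mark" ends one paragraph, "Logic" starts the next
print(sentence_hits)  # []: no single sentence contains the phrase
```

The only change between the two searches is the unit the phrase is matched within, which is exactly the information a text-only index discards.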

Awareness of structural markup is important not only because it eliminates false matches, but because it enables you to express more powerful queries (from both a search and retrieval perspective), such as:

  • Find all the <figures> that have <captions> that contain the phrase “survival rate”
  • Return the <authors> and <abstracts> of articles that contain the word “lymphoma” and have <captions> that contain the phrase “survival rate”

Or, more powerfully, perform a citation analysis:

  • Return the <authors> and <abstracts> of <articles> with <citations> to <articles> written by <author> “Sandra Horning”
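The structural and citation queries above can be sketched in a few lines of Python with xml.etree.ElementTree. Everything here is an assumption made for illustration: the tiny corpus, the element and attribute names (id, ref), the co-authors "Pat Example" and "Lee Sample", and the abstracts are all made up, and a real system would answer these from indexes rather than by scanning every element.

```python
import xml.etree.ElementTree as ET

# Toy corpus with invented content; only the author name comes from the text above.
corpus = ET.fromstring("""
<corpus>
  <article id="a1">
    <author>Sandra Horning</author>
    <abstract>Outcomes in lymphoma treatment.</abstract>
    <figure><caption>Five-year survival rate by stage</caption></figure>
  </article>
  <article id="a2">
    <author>Pat Example</author>
    <abstract>A follow-up lymphoma study.</abstract>
    <figure><caption>Dose schedule</caption></figure>
    <citation ref="a1"/>
  </article>
  <article id="a3">
    <author>Lee Sample</author>
    <abstract>Unrelated cardiology work.</abstract>
    <figure><caption>Survival rate after surgery</caption></figure>
  </article>
</corpus>
""")


def text_of(elem):
    """All text under an element, lowercased, with whitespace collapsed."""
    return " ".join("".join(elem.itertext()).split()).lower()


def figures_with_caption(root, phrase):
    """Find all <figure>s that have a <caption> containing the phrase."""
    return [fig for fig in root.iter("figure")
            if any(phrase in text_of(c) for c in fig.iter("caption"))]


def articles_matching(root, word, caption_phrase):
    """Return (author, abstract) of articles containing the word and
    having a <caption> that contains the phrase."""
    return [(art.findtext("author"), art.findtext("abstract"))
            for art in root.iter("article")
            if word in text_of(art).split()
            and any(caption_phrase in text_of(c) for c in art.iter("caption"))]


def citing(root, author_name):
    """Return (author, abstract) of articles citing an article by author_name."""
    target_ids = {a.get("id") for a in root.iter("article")
                  if a.findtext("author") == author_name}
    return [(a.findtext("author"), a.findtext("abstract"))
            for a in root.iter("article")
            if any(c.get("ref") in target_ids for c in a.iter("citation"))]


print(len(figures_with_caption(corpus, "survival rate")))     # 2
print(articles_matching(corpus, "lymphoma", "survival rate"))  # only a1 qualifies
print(citing(corpus, "Sandra Horning"))                        # only a2 qualifies
```

Note that the second query mixes a full-text constraint (the word anywhere in the article) with a structural one (the phrase inside a caption), and the third follows a reference from one article to another, which is closer to a database join than to a search.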

But, even better is the ability for the system to understand semantic markup, for example, coming from a taxonomy or an automatic entity extraction tool.

  • And find them only in the <articles> that contain references to the <drug> “Rituxan” which is a <monoclonal antibody> which the system knows is also called “Rituximab” and “MabThera.”
  • And which contain the <disease> diffuse large b-cell lymphoma, which the system knows is a <b-cell lymphoma> which is a <lymphoma> which is a <blood cancer> which is a <cancer>
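Here is a minimal sketch of what "the system knows" might look like, assuming the taxonomy is held as a simple is-a map plus a synonym table. The dictionaries and function names are hypothetical and hold only the relationships named in the bullets above; a real taxonomy would be far larger and would not live in a Python dict.

```python
# Hypothetical is-a relationships, taken from the examples in the text.
IS_A = {
    "diffuse large b-cell lymphoma": "b-cell lymphoma",
    "b-cell lymphoma": "lymphoma",
    "lymphoma": "blood cancer",
    "blood cancer": "cancer",
    "rituxan": "monoclonal antibody",
}

# Hypothetical synonym table: other names the system knows for a term.
SYNONYMS = {"rituxan": {"rituximab", "mabthera"}}


def ancestors(term):
    """Walk the is-a chain upward from a term, nearest category first."""
    chain = []
    while term in IS_A:
        term = IS_A[term]
        chain.append(term)
    return chain


def descendants(category):
    """All known terms whose is-a chain passes through the category."""
    return {t for t in IS_A if category in ancestors(t)}


def expand(term):
    """The term itself, everything beneath it, and all of their synonyms."""
    terms = {term} | descendants(term)
    for t in list(terms):
        terms |= SYNONYMS.get(t, set())
    return terms


def matches(tagged_terms, query_term):
    """True if any tag on a document falls under the (expanded) query term."""
    return bool(expand(query_term) & {t.lower() for t in tagged_terms})


print(ancestors("diffuse large b-cell lymphoma"))
print(sorted(expand("monoclonal antibody")))
print(matches({"MabThera"}, "monoclonal antibody"))                   # True
print(matches({"diffuse large b-cell lymphoma"}, "blood cancer"))     # True
```

Widening a query from one drug to a whole class, or from one disease to every kind of blood cancer, then amounts to passing a broader term to expand.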

Then think of the simple permutations of this query you can run:

  • Against all monoclonal antibodies, not just Rituxan
  • Against all lymphomas, not just diffuse large b-cell. Or against all blood cancers.
  • Against not just those citing Horning, but against those citing any other author.

And then think of all the queries you can run against this same corpus when you apply any number of any combination of full-text, structural, and semantic constraints.

And pause.

If you wonder why I say that Mark Logic lets you run database-style queries against content* you hopefully now understand why.

It’s not just about catching that Mark is the last word of sentence 20 and Logic is the first word in sentence 21. It’s about combining structural, semantic, and full-text constraints and in virtually any combination. And that unleashes a mind-boggling amount of query power. A power, by the way, that we’re accustomed to against data, but are now only beginning to understand against content.


* If you want to search only within items the author bolded for emphasis, you can do that, too!

** Or if you want to search only within footnotes, as they sometimes do in finance, you can do that, too!