Category Archives: Database management system

My Slides from the MarkLogic Government Summit: “Relationertia”

Below please find an embedded copy of the slides I presented a few weeks back at the MarkLogic Government Summit at the Ritz-Carlton in Tyson’s Corner.

I had three fun quotes/concepts from this session.

First, I created a new word to describe all the reasons organizations use relational databases to try and solve problems for which they were never designed and at which they are suboptimal:  relationertia.  You know those reasons:

  • It’s safe
  • We have it already
  • It’s what we know
  • It’s free at the project level (if expensive at the agency one)

The fact is relational databases are about 40 years old and were never designed to solve some of the problems that government agencies are throwing at them.  To drive home the age point, I made a list of “other things” that happened in 1970, the year that Codd’s seminal paper was published.

  • Janis Joplin died
  • The Beatles broke up, after releasing Let It Be
  • The first 747 entered service
  • The first episode of All My Children aired

It was a long time ago.  (And that was the second fun thing.)

The third fun thing was to dust off one of my favorite old saws:  if your only tool’s a hammer, then every problem looks like a nail.  Or, as I more colorfully saw on Twitter today:  if your only tool’s a chainsaw, then every problem looks like a Zombie.

Applying this idea to relational databases, we come up with:

If your only data modeling element’s a table, then every problem looks like a column.

The slides are embedded below.

The Information Continuum and the Three Types of Subtly Semi-Structured Information

We generally refer to MarkLogic Server as an XML server, which is a special-purpose database management system (DBMS) for unstructured information.  This often sparks debate about the term “unstructured” and the information continuum in general.  Surprisingly, while both analysts and vendors frequently discuss the concept, the Wikipedia entry for information continuum is weak, and I couldn’t easily find a nice picture of it, so I decided to make my own.

The general idea that information spans a continuum with regard to structure is pretty much undisputed.  The placement of any given type of information on that continuum is more problematic.  While it seems clear that purchase orders are highly structured and that free text is not, the placement of, for example, email is more interesting.  Some might argue that email is unstructured.  In fact, only the body of an email is unstructured and there is plenty of metadata (e.g., from, send-to, date, subject) wrapping an email.  In addition, an email’s body actually does have latent structure:  while it may not be explicit, you typically have a salutation followed by numerous paragraphs of text, a sign-off, a signature, and perhaps a legal footer.  Email is unquestionably semi-structured.

In fact, I believe that the vast majority of information is semi-structured.  PowerPoint decks have slides, slides have titles and bullets.  Contracts are typically Word documents, but have more-or-less standard sections.  Proposals are usually Word or PowerPoint documents that tend to have similar structures.  Even the humble tweet is semi-structured:  while the contents are ostensibly 140 unstructured characters, the anatomy of a tweet reveals lots of metadata (e.g., location) and even the contents contain some structural information (e.g., RT indicating re-tweet or #hashtags serving as topical metadata).
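
To make the tweet example concrete, here is a minimal Python sketch that pulls the latent structure out of a tweet body.  The function name, the sample text, and the handling of @mentions are my own illustrative assumptions; only the RT marker and #hashtags come from the discussion above.

```python
import re


def tweet_structure(text):
    """Extract the structural hints hiding in a tweet's free-text body."""
    return {
        # A leading "RT" marks a re-tweet.
        "is_retweet": re.match(r"\s*RT\b", text) is not None,
        # #hashtags serve as topical metadata.
        "hashtags": re.findall(r"#(\w+)", text),
        # @mentions are an extra assumption, not discussed in the post.
        "mentions": re.findall(r"@(\w+)", text),
    }


# Hypothetical sample tweet, made up for illustration.
example = tweet_structure("RT @example_user: XML is not the same as structure #xml #dbms")
print(example)
```

The body is still a string of characters, but a few simple patterns are enough to recover metadata from it, which is what puts it in the semi-structured middle of the continuum.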

Now let’s consider XML content.  Some would argue that XML is definitionally structured.  But I’d say that an arbitrary set of documents all stored within <document> and </document> tags is only faux structured; it appears structured because it’s XML, but the XML is just used as a container.  A corpus of twenty 2,000-page medical textbooks in 6 different schemas is indeed structured, but not well so.  To paraphrase an old saw about standards:  the nice thing about structures is that there are so many to choose from.  I believe that knowing content is marked up in XML reveals nothing about its structure, i.e., that XML-ness and structure are orthogonal.  Put differently, XML is simply a means of representing information.  The information represented may be highly structured (e.g., 100 purchase orders all in perfect adherence to a given schema) or highly unstructured (e.g., 20 documents only vaguely complying with 20 different schemas).

I have two primary beliefs about the information continuum:

  • The vast majority of information is semi-structured. There is relatively little highly structured and relatively little completely unstructured information out there.  Most information lies somewhere in the fat middle.  I overlaid a bell curve on top of the information continuum to reflect volume.
  • Even information that initially appears structured is often semi-structured.  I see three types of this subtly semi-structured information which, hopefully without being too cute, I’ll abbreviate as SSSI.  The three types are (1) schema as aspiration, (2)  time-varying schema, and (3) unknowable schema.

Let’s look at each of the three types more closely.

Schema as Aspiration

The first type of subtly semi-structured information (SSSI) is where a schema exists, but only notionally.  The schema itself is either poorly defined (actual quote:  “it is believed that this element is used for”) or well defined but not followed.  This is frequently the case with publishing and media companies.  Here are two free jokes that work well at any publishing conference:

  • Raise your hand if you have a standard schema.  Keep it up if your content actually adheres to it.
  • Oxymorons aside, how many of you have 3 or more “standard” schemas, 5 or more, … do  I hear 10?

These jokes are funny because of the state of the content.  This state is the result of two primary business trends:  (1) consolidation — most large publishers have been built through M&A thus inheriting numerous different standards, each of which may be only partly implemented — and (2) licensing — publishers frequently license content from numerous other sources, each with its own standard format.

Time-Varying Schema

The second case of SSSI is where you have a well defined, enforced schema at any moment in time, but it keeps changing over time.  Typically this happens for one of two reasons:

  • The business reality that you’re modeling is changing.  For example, in 2009 Federal Sales was part of Eastern Sales but in 2010 it becomes its own division.  This makes comparison of Eastern results between 2009 and 2010 potentially difficult.  In BI circles, this is known as the slowly changing dimension problem.
  • Standards keep changing.  If you’re modeling information in a corporate- or industry-standard schema and that schema is changing, then your information becomes semi-structured because it is contained within multiple different schemas.  Sometimes you can avoid this by migrating all prior information to the current schema, but sometimes (e.g., massive data volumes, regulatory desire to not change existing records) you will not.

When viewed with a flash camera this information looks well structured.  When you look at the movie, you can clearly see that it’s not.

Unknowable Schema

The last case of SSSI is where you have an unknowable schema.  Consider terrorist tracking.  If you were to make a schema for a terrorist database, here are some of the attributes that spring to mind:  name, alias(es), address, former address(es), height, weight, hair color, eye color, member-of, enemy-of, friend-of, tattoos/markings.

Here are some problems with this:

  • Many of the attributes are multi-valued, such as alias or friend-of.  In a de-normalized approach, this means dealing with repeating group problems and creating N columns (e.g., alias, alias1, alias2, and up to the maximum number of aliases for any terrorist).  Normalization would take care of the repeating group but at the cost of creating a table for each multi-valued attribute and then having to join back to those tables when you run queries.  (One such real system ended up with 500 tables, with the result that no one could find anything.)
  • It is difficult to create a type for the tattoo attribute.  First, it’s multi-valued.  Second, while tattoos are sometimes images, they often contain text (e.g., Mom) and sometimes in a foreign language (e.g., 愛, the Chinese symbol for love).  Since you’re trying to secure the nation against threat you don’t want to throw away any potentially valuable information, but it’s not obvious how to store this.
  • New attributes are coming all the time.  Say you get a shoe print on a suspect as he runs away.  You need to add a shoe-size attribute to the database.  Say a terrorist runs away and leaves a pair of eyeglasses.  Now we need to add eyeglass prescription.  My favorite is what’s called pocket litter.  You find a piece of paper in a person’s pocket and it has a number on it.  It could be a phone number, a lock combination, or maybe map coordinates.  You don’t know what it is but, again, since you don’t want to throw away any potentially valuable information, you have to find a place to store it.
  • Combining an enormous number of potential attributes with the reality that very few are known for most individuals creates two problems:  (1) you end up with a sparse table which is not well handled in most RDBMSs and (2) you end up hitting column limits.
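
The normalization trade-off in the first bullet can be sketched with Python’s built-in sqlite3 module.  This is only an illustration under my own assumptions:  the table names, column names, and sample rows are hypothetical, and a real system would have many more multi-valued attributes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
    -- Normalization means one extra table per multi-valued attribute.
    CREATE TABLE alias  (person_id INTEGER REFERENCES person(id), alias TEXT);
    CREATE TABLE tattoo (person_id INTEGER REFERENCES person(id), description TEXT);
""")

# Hypothetical sample data.
conn.execute("INSERT INTO person VALUES (1, 'Person A')")
conn.executemany("INSERT INTO alias VALUES (?, ?)",
                 [(1, "Alias One"), (1, "Alias Two")])
conn.execute("INSERT INTO tattoo VALUES (1, 'Mom')")

# Reassembling one person means joining back to every attribute table.
rows = conn.execute("""
    SELECT p.name, a.alias, t.description
    FROM person p
    LEFT JOIN alias  a ON a.person_id = p.id
    LEFT JOIN tattoo t ON t.person_id = p.id
    ORDER BY a.alias
""").fetchall()

for row in rows:
    print(row)
```

With only two multi-valued attributes the query already needs two joins and returns one row per alias, repeating the tattoo each time.  Every new multi-valued attribute adds another table and another join.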

Another example of unknowable schemas would be in financial services, modeling derivatives.   Because derivatives are sometimes long-lived instruments (e.g., 30 years) you may face the time-varying schema problem.  In addition, you have the unknowable schema problem because the industry is constantly creating new products.  First we had CDOs and CDSs on banks, then single-tranche CDOs, then CDSs on single-tranche CDOs, and then synthetic CDOs.  If this makes your head hurt in terms of understanding, then think for a minute about data modeling.  How are you going to store these complex products in a database?   And what are you going to do with the never-ending stream of new ones — last I heard they were considering selling derivatives on movies.

(As it turns out XML is a great way to model both these problems as you can easily add new attributes on the fly and only provide values for attributes where you know them.)
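
As a rough sketch of that last point, the following Python uses the standard library’s xml.etree.ElementTree to hold two records that share almost no attributes.  The element names and values are invented for illustration and do not reflect any real schema or MarkLogic API.

```python
import xml.etree.ElementTree as ET

# Two hypothetical records: each carries only the attributes that are known.
records = ET.fromstring("""
<people>
  <person>
    <name>Person A</name>
    <alias>Alias One</alias>
    <alias>Alias Two</alias>
    <tattoo>Mom</tattoo>
  </person>
  <person>
    <name>Person B</name>
    <shoe-size>11</shoe-size>
    <pocket-litter>555 0100</pocket-litter>
  </person>
</people>
""")


def attributes_of(person):
    """Return {attribute name: [values]} for whatever this record holds."""
    found = {}
    for child in person:
        # Repeated elements (e.g., alias) simply accumulate as multiple values.
        found.setdefault(child.tag, []).append(child.text)
    return found


people = [attributes_of(p) for p in records.findall("person")]
print(people)

# A brand-new attribute is added to one record without touching the others.
new_attr = ET.SubElement(records.find("person"), "eyeglass-prescription")
new_attr.text = "-2.50"
```

Multi-valued attributes are just repeated elements, unknown attributes are simply absent rather than stored as empty columns, and a new attribute needs no schema change.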

To finish the post, I’ll revisit the statement I started with:  we generally refer to MarkLogic Server as an XML server, a special-purpose database management system (DBMS) for unstructured information.  Going forward, I think I’ll keep saying that because it’s simpler, but at the MarkLogic 201 level, the more precise statement is:  a special-purpose DBMS for semi-structured information.

There’s way more semi-structured information out there.  Realizing that information is semi-structured is sometimes subtle.  And semi-structured information is, in fact, the optimization point for our product.  So what’s MarkLogic in three concepts?  Speed, scale, and semi-structured information.

Classifying Database Management Systems: Regular and NoSQL

Thanks to two major trends, DBMS specialization and the NoSQL movement, the database management systems space is generating more interest and more innovation than at any time I can remember since the 1980s.  Ever since around 1990, when the relational database management system (RDBMS) became firmly established, IT has played DBMS roulette:  spin the wheel and use the DBMS on which the needle lands, be it Oracle, DB2, or SQL Server.  (If you think this trivializes things, not so fast:  a friend who was the lead DBMS analyst at a major analyst firm once quipped to me that this wheel-spinning was his job, circa 1995.)

Obviously, there was always some rational basis for DBMS selection — IBM shops tended to pick DB2, best-of-breed buyers liked Oracle, performance whizzes and finance types often picked Sybase, and frugal shoppers would choose SQL Server, and later MySQL — but there was no differentiation in the model.  All these choices were relational database management systems.

Over time, our minds became dulled to orthogonal dimensions of database differentiation:

  • The database model.  For years, we lived in the database equivalent world of Henry Ford’s Model T:  any model you want as long as it’s relational.
  • The potential for trade-offs in fundamental database-ness.  We became binary and religious about what it meant to be a database management system and that attitude blinded us to some fundamental trade-offs that some users might want to make (e.g., trading consistency for scalability, or trading ACID transactions for BASE).

The latter is the domain of Brewer’s CAP theorem which I will not discuss today.  The former, the database model, will be the subject of this post.

Every DBMS has some native modeling element (NME). For example, in an RDBMS that NME is the relation (or table).  Typically that NME is used to store everything in the DBMS.  For example, in an RDBMS:

  • User data is stored in tables.
  • Indexes are implemented as tables which are joined back to the base tables.
  • Administration information is stored in tables.
  • Security is usually handled through tables  and joins.
  • Unusual data types (e.g., XML) are stored in “odd columns” in tables.  (If your only model’s a table, every problem looks like a column.)

In general, the more naturally the data you’re storing maps to the paradigm (or NME) of the database, the better things will work.  For example, you can model XML documents as tables and store them in an RDBMS, or you can model tables in XML and store them as XML documents, but those approaches will tend to be more difficult to implement and less efficient to process than simply storing tables in an RDBMS and XML documents in an XML server (e.g., MarkLogic).

The question is not whether you can model documents as tables or tables as documents.  The answer is almost always yes.  Thus, the better question is should you?  The most famous example of this type of modeling problem is the storage of hierarchical data in an RDBMS.  To quote this article on managing hierarchical data in MySQL:

Most users at one time or another have dealt with hierarchical data in a SQL database and no doubt learned that the management of hierarchical data is not what a relational database is intended for.

(Personally, I blame the failure of Microsoft’s WinFS on this root problem — file systems are inherently hierarchical — but that’s  a story for a different day.)
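
To give a feel for the mismatch, here is a small sketch using Python’s sqlite3 module and one common relational encoding of a hierarchy, the adjacency list (each row points at its parent).  The table, the sample categories, and the choice of a recursive query are my own assumptions for illustration, not something taken from the article quoted above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE category (id INTEGER PRIMARY KEY, parent_id INTEGER, name TEXT)"
)
# Hypothetical hierarchy: root -> (docs -> contracts), media
conn.executemany(
    "INSERT INTO category VALUES (?, ?, ?)",
    [(1, None, "root"), (2, 1, "docs"), (3, 2, "contracts"), (4, 1, "media")],
)

# Walking the tree takes a recursive query: start at the root, then keep
# joining the table to the rows found so far until no children remain.
subtree = conn.execute("""
    WITH RECURSIVE descendants(id, name, depth) AS (
        SELECT id, name, 0 FROM category WHERE id = 1
        UNION ALL
        SELECT c.id, c.name, d.depth + 1
        FROM category c JOIN descendants d ON c.parent_id = d.id
    )
    SELECT name, depth FROM descendants ORDER BY depth, name
""").fetchall()

for name, depth in subtree:
    print("  " * depth + name)
```

The hierarchy can certainly be stored this way, but the tree shape lives in the query rather than in the data model, whereas in a document the nesting itself is the hierarchy.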

I believe the best way to classify DBMSs is by their native modeling element.

  • In hierarchical databases, the NME is the hierarchy.  Example:  IMS.
  • In network databases, it’s the (directed, acyclic) graph. Example:  IDMS.
  • In relational databases, it’s the relation (or, table).  Example:  Oracle.
  • In object databases, it’s the (typically C++) object class. Example:  Versant.
  • In multi-dimensional databases, it’s the hypercube. Example:  Essbase.
  • In document databases, it’s the document. Example:  CouchDB.
  • In key/value stores, it’s the key/value pair. Example:  Redis.
  • In XML databases, it’s the XML document. Example:  MarkLogic.

The biggest limitation of this approach is that classifying by model fails to capture implementation differences. Some examples:

  • I would classify columnar DBMSs (e.g., Vertica) as relational if they model data as tables, and key/value stores (e.g., HBase) as such if they model data in key/value pairs.  This fails to capture the performance advantage that Vertica gets on certain data warehousing problems due to its column orientation.
  • I would classify all relational databases as relational, despite implementation optimizations.  For example, this approach fails to capture Teradata’s optimizations for large-scale data warehousing, Aster’s optimizations for analytics on big data, or Volt’s optimizations for what Curt Monash calls HVSP.
  • I would classify all XML databases as XML databases, despite possible optimization differences for the two basic XML use-cases:  (1) XML as message wrapper vs. (2) XML as document markup.

Nevertheless, I believe that DBMSs should be classified first by model and then sub-classified by implementation optimization.  For example, a relational database optimized for big data analytics (Aster).  An XML database optimized for large amounts of semi-structured information marked in XML (MarkLogic).

In closing, I’d say that we are seeing increasing numbers of customers coming to Mark Logic saying:  “well, I suppose we could have modeled this data relationally, but in our business we think of this information as documents and we’ve decided that it’s easier and more natural to manage it that way, so we decided to give you a call.”

After thinking about this for some time, I have one response:  keep calling!

No matter how you want to think about MarkLogic Server — an XML server, an XML database, or an XML document database — dare I say an [XML] [document] server|database  — it’s definitely a document-oriented, XML-oriented database management system and a great place to put any information that you think is more naturally modeled as documents.

XML: YAFF, YADT, or Whole World?

If you have a bunch of XML and are looking for a place to put it, then I think I may have come up with a simple test that might be helpful.

In talking with prospective vendors of XML repositories (definition: software that lets you store, search, analyze and deliver XML), try to establish what I’ll call “XML vision compatibility.” Quite simply, try to figure out if the vendor’s vision of XML is consistent with your own. To help with that exercise, I’ll define what I see as the three common XML vendor visions:

  • YAFF (yet another file format)
  • YADT (yet another data type)
  • Whole world

YAFF Vendors
Vendors with the YAFF vision view XML as yet another file format. ECM vendors clearly fall into this category (“oh yes, XML is one of the 137 file formats you can manage in our system”). So do enterprise search vendors (“oh yes, we have filters for XML formatted files which clear out all those nasty tags and feed our indexing engine the lovely text.”)

For example, let’s look at how EMC Documentum — one of the more XML-aggressive ECM vendors — handles XML on its website.

Hmm. There’s no XML on that page. But lots of information about records management, digital asset management, document capture, collaboration and document management (it’s not there either). Gosh, I wonder where it is? SAP integration? Don’t think so. Hey, let’s try Documentum Platform, whatever that is.

Not there, either. Now that’s surprising because I really have no idea where else it might be. Oh, wait a minute. I didn’t scroll the page down. Let’s try that.

There we go. We finally found it. I knew they were committed to XML. What’s going on here is that EMC has a huge, largely vendor consolidation-driven (e.g., Documentum, Captiva, Document Sciences, x-Hive, Kazeon) vision of what content management is. And XML is just one tiny piece of that vision. XML is, well, yet another file format among the scores that they have to manage, archive, capture, and provide workflow, compliance, and process management against. The vision isn’t about XML. It’s about content. That’s nice if you have an ECM problem (and a lot of money to solve it); it’s not so nice if you have an XML problem, or more precisely a problem that can be solved with XML.

YADT Vendors
Vendors with the YADT vision view XML as yet another data type. These are the relational database management system vendors (e.g., Oracle) who have decided that the best way to handle XML is to make it a valid datatype for a column in a table.

The roots of this approach go back to the late 1980s and Ingres 6.3 (see this semi-related blast from the past) which was the first commercial DBMS to provide support for user-defined datatypes. All the primitives for datatyping were isolated from the core server code and made extensible through standard APIs. So, for example, if you wanted to store complex numbers of the form (a, bi) all you had to do was to write some primitives so the server would know:

  • What they look like — i.e., (a, bi)
  • Any range constraints (the biggest, the smallest)
  • What operators should be available (e.g., +, -)
  • How to implement those operators — (a, bi) + (c, di) = (a+c, (b+d)i)
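
The four primitives above map fairly naturally onto a small class. The following Python sketch is only an analogy for the idea of a user-defined type, not the Ingres API; the class name and the range limit are hypothetical choices of mine.

```python
from dataclasses import dataclass

LIMIT = 1e6  # hypothetical range constraint, chosen only for illustration


@dataclass(frozen=True)
class ComplexValue:
    a: float  # real part
    b: float  # coefficient of i

    def __post_init__(self):
        # Range constraints: the biggest and smallest allowed components.
        if abs(self.a) > LIMIT or abs(self.b) > LIMIT:
            raise ValueError("component out of range")

    def __str__(self):
        # What values look like: (a, bi)
        return f"({self.a}, {self.b}i)"

    def __add__(self, other):
        # (a, bi) + (c, di) = (a+c, (b+d)i)
        return ComplexValue(self.a + other.a, self.b + other.b)

    def __sub__(self, other):
        # (a, bi) - (c, di) = (a-c, (b-d)i)
        return ComplexValue(self.a - other.a, self.b - other.b)


total = ComplexValue(1, 2) + ComplexValue(3, 4)
print(total)
```

Defining the type is the easy part. Nothing here tells a query optimizer how to index or search such values.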

It was, as far as I remember, yet another clever idea from the biggest visionary in database management systems after Codd himself: Michael Stonebraker, then of UC Berkeley and now of MIT. After founding Ingres, Stonebraker went on to found Illustra, which was all about “datablades,” a sexy new name for user-defined types. Datablades, in turn, became sexy bait for Informix to buy the company with an eye towards leveraging the technology towards unseating Oracle from its leadership position. It didn’t happen.

User-defined datatypes basically didn’t work. There were two key problems:

  • You had user-written code running in the same address space as the database server. This made it nearly impossible to determine fault when the server crashed. Was it a database server bug, or did the customer cause a problem in implementing a UDT? While RDBMS customers were well qualified to write applications and SQL, writing server-level code was quite another affair. This was a bad idea.
  • Indexing and query processing performance. It’s fairly simple to say that, for example, a text field looks like a string of words and the + operator means concatenate. It’s basically impossible for an end customer to tell the query optimizer how to process queries involving those text fields and how to build indexes that maximize query performance. If getting stuff into UDTs was a level-5 challenge, getting stuff back out quickly was a level-100 one.

So while the notion of end users adding types to a DBMS basically failed, when XML came along the database vendors dusted off this approach, in effect saying: let’s use all those hooks we put in to build support for XML types ourselves. And they did. Hence what I call the “XML column” approach to storing XML in a relational database.

After all, if your only data modeling element’s a table, then every problem looks like a column.

Now this approach isn’t necessarily bad. If, for example, you have a bunch of resumes and want to store attribute data in columns (e.g., name, address, phone, birthdate) and keep an XML copy of the resume alongside, then this might be a reasonable way to do things. That is, if you have a lot of data and a touch of XML, this may be the right way to do things.

So again, it comes down to vision alignment. If XML is just another type of data that you want to store in a column, then this might work for you. Bear in mind you’ll:

  • Probably have to set up separate text and pre-defined XML path indexes (a hassle on regular schemas, an impossibility on irregular ones),
  • Face some limitations in how those indexes can be combined and optimized in processing queries,
  • Need to construct frankenqueries that mix SQL and XQuery, whose mixed-language semantics are sometimes so obscure that I’ve seen experts argue for hours about what the “correct” answer for a given query is,
  • And suffer from potentially crippling performance problems as you scale to large amounts of XML.

But if those aren’t problems, then this approach might work for you.

This is what it looks like when a vendor has a YADT vision. Half the fun in storing XML in an RDBMS is figuring out which query language and which storage options you want to use. See the table that starts on page 9, spans four pages, and considers nearly a dozen criteria to help you decide which of the three primary storage options you should use:

See this post from IBM for more Oracle-poking on the complexity of storage options available. Excerpt:

Oracle has long claimed that the fact that Oracle Database has multiple different ways to store XML data is an advantage. At last count, I think they have something like seven different options:

  • Unstructured
  • XML-Object-Relational, where you store repeating elements in CLOBs
  • XML-Object-Relational, where you store repeating elements in VARRAY as LOBs
  • XML-Object-Relational, where you store repeating elements in VARRAY as nested tables
  • XML-Object-Relational, where you store repeating elements in VARRAY as XMLType pointers to BLOBs
  • XML-Object-Relational, where you store repeating elements in VARRAY as XMLType pointers to nested tables
  • XML-Binary

Their argument is that XML has diverse use cases and you need different storage methods to handle those diverse use cases. I don’t know about you, but I find this list to be a little bewildering. How do you decide among the options? And what happens if you change your mind and want to change storage method?

Such is life in the land of putting XML in tables because your database management system has columns.

Whole World Vendors
Vendors with the whole world vision view XML as, well, their whole world.

And when I say XML, I don’t mean information that’s already in XML. I mean information that is either already in XML (e.g., documents, information in any horizontal or industry-specific XML standard) or that is best modeled in XML (e.g., sparse data, irregular information, semi-structured information, information in no, multiple, and/or time-varying schemas).

“Whole world” vendors don’t view XML as one format, but as a plethora: DocBook, DITA, S1000D, XHTML, TEI, XBRL, the HL7 standards in healthcare, the ACORD standards in insurance, Microsoft’s Office Open XML format, Open Document Format, Adobe’s IDML, Chemical Markup Language, MathML, the DoD’s DDMS metadata standard, semantic web standards like RDF and OWL, and scores of others.

Whole world vendors don’t view XML tags as “something that gets in the way of the text” and thus they don’t provide filters for XML files. Nor do they require schema adherence because they know that XML schema compliance, in real life, tends to be more of an aspiration than a reality. So they allow you to load and index XML, as is, avoiding the “first step’s a doozy” problem, and enabling lazy clean-up of XML information.

Whole world vendors don’t try to model XML in tables simply because they have a legacy tabular data model. Instead, their native modeling element (NME) is the XML document. That is:

  • In a hierarchical DBMS the NME is the hierarchy
  • In a network DBMS the NME is the graph
  • In a relational DBMS the NME is the table
  • In an object DBMS the NME is the object class hierarchy
  • In an OLAP, or multi-dimensional, DBMS the NME is the hypercube
  • And in an XML server, or native XML, DBMS the NME is the XML document

Whole world vendors don’t bolt a search engine to a DBMS because they know XML is often document-centric, making search an integral function, and requiring a fundamentally hybrid search/database — as opposed to a bolted-together search/database — approach.

Here is what it looks like when you encounter a whole world vendor:


Dear CIO: Stop Writing Big Checks for Commodity (Database) Software

Dear CIO,

What’s wrong with this picture?

  • At 50%+, Oracle’s operating margins have never been higher
  • The differentiation of Oracle’s database technology, however, has never been lower and the number of both core and specialized alternatives has never been greater.

So what’s going on? You, kind Sir or Madam, are being milked. What’s worse is that you, in an example of collective behavioral dysfunction, have inadvertently played a role in setting up the milking. What happened?

  • Like all smart CIOs you followed a bit of herd mentality when it came to core technology. Pity the poor fools who, back in the day, bet big on Ingres or Sybase. You played it safe and went with Oracle, IBM, or if your requirements weren’t too heavy, Microsoft.
  • The problem is, of course, that everyone executed the same strategy you did. Hence, the market created a system of increasing returns where the strong vendors got stronger and the weak ones died. The result: the RDBMS market is an (order of magnitude) $10B/year market, structured as an oligopoly with 3 players. Most other software markets worked out the same way.
  • You were focused on standardization. You realized that through a combination of decentralized IT decision making and growth-by-acquisition your organization had become a kitchen sink of enterprise software. You had everything. In order to reduce the administrative, training, and license acquisition costs, you fought tooth and nail with your divisions to standardize the environment. You said, “Heck, it’s all the same stuff in the end, folks, so let’s make Oracle our DBMS standard, Business Objects our BI standard, Documentum our ECM standard, and SAP our ERP standard.”
  • And you won. Mostly. There’s still some Cognos in finance. And marketing didn’t totally give up on Interwoven. But, for the most part, you won. You reduced the entropy of your IT environment and drove cost savings for your organization.

The problem is you’ve won the battle but lost the war. Why? Because if, as you say, the “stuff really is all the same” you shouldn’t standardize on the most expensive product. You should standardize on the cheapest.

  • Do you really need to be paying those big fees to Oracle for enterprise licenses? Wouldn’t MySQL do?
  • Are you really using all the functionality of that $1M/year Documentum ECM system? Wouldn’t SharePoint or Alfresco do?
  • For BI, do you need all the bells and whistles of BusinessObjects? Wouldn’t Pentaho or Qlikview do a fine job, at a fraction of the cost?

But these alternatives are obvious. Heck, even “the establishment” (i.e., Gartner) says it’s safe to tread in the open source water. So the question is, what’s holding you back?

  • Switching costs. It’s hard to move off Oracle or Documentum and you don’t want to pay the nut to do so.
  • Organizational inertia. Your whippersnapper DBAs who were in their 30s in the 1980s are now in their 50s. They’re thinking that change devalues their knowledge and experience; some just want to cruise into retirement. But that’s their personal agenda, not your enterprise one.
  • Accounting. You made it free for your divisions to keep using Documentum, Oracle, or BusinessObjects because you bought an enterprise license. While this appeared to “save” you money on a per-license basis, and it helped support your standardization initiative, it squashed innovation in your divisions, reinforced the organizational inertia, and has a lot of people using the wrong tool for the job, resulting in projects that either take more or more expensive hardware than necessary (Oracle is good at this), that take too long to develop, or that simply fail.

So, what do I recommend doing about all this? I suggest that you adopt these policies which, for full disclosure, are at least partially in the self-interest of this blog’s author:

  • Stop writing big checks for commodity software. Every time a big check comes along, ask yourself: is this software differentiated or commoditized? Be willing to pay a premium for differentiated software, and price shop commodity software. Call a group of your smartest staff together periodically to help you make the commodity versus differentiated call.

  • When you see a big check coming for commodity software, make a migration plan. My hunch is that most of the time, you can create a nice 3-year ROI in the transition from premium to cheaper software. (This reminds me of the time I visited an investment bank’s CIO asking about their Documentum strategy. The answer: “our Documentum strategy is to get off Documentum,” because we’re paying too much and using too little.)

  • Stop doing enterprise agreements that create poor economic incentives within your organization. Don’t pay $XM at the enterprise level, spread that as a “tax” across your divisions, and then make use of certain software “free.” It distorts project reality, creates false incentives, squashes innovation, and generates lots of hidden costs. If you want to negotiate a master agreement and discount rate, that’s fine. Shoot for centralized discounts without central planning.
  • Don’t worry that the prior policies will create mayhem. While I understand that you don’t want arbitrary taste differences increasing the entropy of your enterprise software portfolio, recognize that with the first policy you’ve solved that problem already. If you deem a category (e.g., core RDBMS, enterprise search) commoditized, then you are going to force people to pick on cost. You’ll get standardization on the commodity categories, just on the least expensive alternatives. The only entropy you’ll need to manage will be on the differentiated software which, having dispatched the commodity majority, you’ll have time to explore, study, and exploit.
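
The “3-year ROI” hunch in the migration-plan policy above is easy to check with back-of-the-envelope arithmetic. Here is a minimal sketch, assuming a simple model in which ROI is the net saving over the period divided by the one-time migration cost. The function name and every figure are hypothetical, chosen only to illustrate the calculation; plug in your own numbers.

```python
def three_year_roi(premium_annual_cost, commodity_annual_cost,
                   migration_cost, years=3):
    """Return the ROI of migrating, as a fraction of the migration cost.

    Assumed model (not from any vendor or analyst):
    ROI = (savings over the period - migration cost) / migration cost.
    """
    if migration_cost <= 0:
        raise ValueError("migration_cost must be positive")
    savings = (premium_annual_cost - commodity_annual_cost) * years
    return (savings - migration_cost) / migration_cost


# Hypothetical figures, for illustration only.
roi = three_year_roi(
    premium_annual_cost=1_000_000,   # what the premium product costs per year
    commodity_annual_cost=200_000,   # what the cheaper alternative costs per year
    migration_cost=1_200_000,        # one-time cost of switching
)
print(f"3-year ROI: {roi:.0%}")  # prints "3-year ROI: 100%"
```

A negative result means the switching costs outweigh the savings over the period, which is exactly the case where the migration plan should wait.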

Why am I taking the time to write this note to you? Back in the 1980s I was a foot soldier in the relational database revolution, and today I’m the CEO of one specialized DBMS company and on the board of another.

  • Mark Logic makes an XML server which can save great amounts of time and money in creating applications against unstructured information, replacing the combination of an RDBMS, an enterprise search engine, and an application server. Not only can Mark Logic manage 100s of TB of XML, the system eliminates the object/relational/hierarchical impedance mismatch between Java, SQL, and XML that hampers developer productivity. Mark Logic was recently named the fourth fastest-growing IT company in Silicon Valley.
  • Aster Data makes a specialized data warehouse DBMS that runs on low-cost commodity hardware with a shared nothing architecture and leverages in-database MapReduce technology for parallelism and high scalability.

And during the past 25 years or so I’ve watched the market evolve. While I fully understand the policies and market forces that have led us to where we are, I feel like we’ve come full circle. Vendor power is now concentrated in the big three. Vendor margins top 50%. Big vendors don’t innovate; they consolidate. Inertia has set in at customer organizations. And there’s a major platform shift in progress; last time it was mainframe to minicomputer, this time it’s cloud.

Things feel a lot to me the way they did in 1985, just past the dawn of the relational revolution. So in one way I’m writing to point out the oft-overlooked obvious: stop paying premium prices for commodity items. And in another way I’m saying, take the money you save in so doing and invest it in innovative technologies that:

  • Drive competitive advantage (which will matter again as we come out of the Great Recession)
  • Enable the Internet-scale applications you’ll need to face the coming information deluge
  • Reform the application development stack in ways that make sense for the coming generation of information applications, not that made sense for the last generation of data-centric ones.

Thank you for reading my note. If you have any questions or comments, please give me a ping at dave-dot-kellogg-at-marklogic-com or comment on this post.

Sincerely,

Dave Kellogg

Gartner Names "Specialized Systems" A Top 10 Strategic Technology

Leading IT analyst firm Gartner has named “specialized systems” to its list of top 10 strategic technologies for 2009. While I’m sure Gartner wasn’t thinking specifically of Mark Logic (among other reasons, we’ve not spoken with David Cearley, though I do know him from my Business Objects days), I would indeed argue that Mark Logic fits perfectly into this trend.

Here’s what Gartner says about specialized systems:

Specialized Systems. Appliances have been used to accomplish IT purposes, but only with a few classes of function have appliances prevailed. Heterogeneous systems are an emerging trend in high-performance computing to address the requirements of the most demanding workloads, and this approach will eventually reach the general-purpose computing market. Heterogeneous systems are also specialized systems with the same single-purpose limitations of appliances, but the heterogeneous system is a server system into which the owner installs software to accomplish its function.

While this is a generalized description, the point is clear: for high-performance computing, you will increasingly partition your workload amongst a heterogeneous network of servers each designed and optimized for a specific task. For MarkLogic Server, that task is high-performance XQuery evaluation against large XML databases, documentbases, and/or contentbases.

I’d also say that this argument is similar to one that Mike Stonebraker makes: that as you partition your workload against various, specialized (database) servers (e.g., OLTP, data warehousing, stream processing, XML processing, scientific data processing) you will find that, by elimination, there is no apparent need for a general-purpose database. That is, that every purpose a DBMS serves is a special purpose and we will therefore soon see the end of the era dominated by the general-purpose DBMS.

By the way, I’d also argue that Mark Logic has a role in one of Gartner’s other top 10 trends, web-oriented architectures.

Web-Oriented Architectures. The Internet is arguably the best example of an agile, interoperable and scalable service-oriented environment in existence. This level of flexibility is achieved because of key design principles inherent in the Internet/Web approach, as well as the emergence of Web-centric technologies and standards that promote these principles. The use of Web-centric models to build global-class solutions cannot address the full breadth of enterprise computing needs. However, Gartner expects that continued evolution of the Web-centric approach will enable its use in an ever-broadening set of enterprise solutions during the next five years.

As I’ve said here before, once a customer starts to use MarkLogic as a platform / repository / search engine for their XML, they soon realize that it’s easier to write web applications in a pure top-to-bottom XML fashion than in the dual mapping from an XML-oriented browser to an object-oriented Java layer to a table-oriented (relational) DBMS. That’s the subject of a different post. If you’re interested in top-to-bottom XML, then go here.

Gartner’s top 10 list of strategic technologies for 2009 is here.


Positioning MarkLogic Server

Here’s a great picture from our VP of engineering, Ron Avnur, on how he positions MarkLogic Server relative to other software categories. It’s an elegant and simple way of explaining where we fit.

The two dimensions are structure and query type. Structure can either be predefined or ad hoc (and often, in the document world, there is a predefined structure that no one actually uses, which is de facto ad hoc). Query types can either be predefined (i.e., known in advance) or ad hoc (i.e., not known in advance).

Let’s look at the quadrants that result:

  • Bottom left is where both structure and queries are predefined. Hierarchical DBMSs, like IMS, live in this quadrant. In these (now legacy) systems, the structure of the data is rigidly defined as are the queries that may be run against them. These databases provide high performance, but their inflexibility became their Achilles’ heel.
  • Bottom right is where structure is predefined but queries are ad hoc. The quadrant defines the relational database, which brought unprecedented flexibility to database querying, eventually enabling the modern BI market. Data structure is predefined through the creation of tables with defined names/columns to hold the data. Queries are ad hoc — in a well designed relational database, the system can provide the results for almost any imaginable query. (And with the right indexes, it can provide those results fairly quickly.)
  • Top left is where queries are predefined but structure is not. This (and this is non-obvious to most people) is the zone of the enterprise search engine. People tend to think of search engines as providing high flexibility because you can type any word in the search box. In reality, seen from a database viewpoint, search engines provide a small number of parametrized queries. (It’s the parametrization that gives the impression of flexibility.) The small number of queries include (1) return list of documents where document contains word or phrase, (2) return list of documents where field-in-document contains word or phrase, (3) either query (1) or (2) where word or phrase is replaced with the search engine’s basically Boolean primitive query language (i.e., AND, OR, NOT).
  • The top right is the tricky zone where both queries and structure are not defined in advance. This is the zone of the XML Server, like MarkLogic. In these systems, content can be ingested “as is” without adherence to any predefined structure. Queries are ad hoc, and written in XQuery with full-text extensions. Given the proper indexes, these systems can run virtually any query against the content with high performance.
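
To make the search-engine point in the top-left bullet concrete, here is a minimal sketch of those three parametrized queries over a tiny in-memory corpus. It is not how any real search engine (or MarkLogic) is implemented; the names `Doc`, `has_phrase`, `matches`, and `search`, and the sample documents, are all hypothetical. The point is only that “type anything in the box” reduces to one fixed query shape with different parameters.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: str
    fields: dict  # field name -> text, e.g. {"title": "...", "body": "..."}


def has_phrase(text, phrase):
    """True if the words of `phrase` occur consecutively in `text`."""
    words = text.lower().split()
    target = phrase.lower().split()
    n = len(target)
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))


def matches(doc, expr, field=None):
    """Evaluate a phrase or a Boolean expression against one document.

    `expr` is a string (word or phrase) or a tuple:
    ("AND", e1, e2), ("OR", e1, e2), or ("NOT", e).
    If `field` is given, only that field is searched; otherwise all fields.
    """
    if isinstance(expr, str):
        texts = [doc.fields.get(field, "")] if field else doc.fields.values()
        return any(has_phrase(text, expr) for text in texts)
    op = expr[0]
    if op == "AND":
        return matches(doc, expr[1], field) and matches(doc, expr[2], field)
    if op == "OR":
        return matches(doc, expr[1], field) or matches(doc, expr[2], field)
    if op == "NOT":
        return not matches(doc, expr[1], field)
    raise ValueError(f"unknown operator: {op!r}")


def search(corpus, expr, field=None):
    """The single parametrized query: return ids of matching documents."""
    return [doc.doc_id for doc in corpus if matches(doc, expr, field)]


# Hypothetical sample corpus.
corpus = [
    Doc("d1", {"title": "Relational databases", "body": "Tables and columns hold the data"}),
    Doc("d2", {"title": "XML servers", "body": "Documents are ingested as is"}),
    Doc("d3", {"title": "Search engines", "body": "Documents are indexed by word"}),
]

print(search(corpus, "documents are"))                       # query (1)
print(search(corpus, "servers", field="title"))              # query (2)
print(search(corpus, ("AND", "documents", ("NOT", "word")))) # query (3)
```

Every call has the same shape; only the parameters change. Nothing here can join documents, aggregate values, or restructure results, which is what separates this quadrant from the ad hoc queries of the two right-hand quadrants.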

Hopefully this sheds some light on my soundbite that: “at Mark Logic, we are doing for (XML) documents what the relational database did for data.”
