The Information Continuum and the Three Types of Subtly Semi-Structured Information

We generally refer to MarkLogic Server as an XML server, which is a special-purpose database management system (DBMS) for unstructured information.  This often sparks debate about the term “unstructured” and the information continuum in general.  Surprisingly, while both analysts and vendors frequently discuss the concept, the Wikipedia entry for information continuum is weak, and I couldn’t easily find a nice picture of it, so I decided to make my own.

The general idea that information spans a continuum with regard to structure is pretty much undisputed. The placement of any given type of information on that continuum is more problematic. While it seems clear that purchase orders are highly structured and that free text is not, the placement of, for example, email is more interesting. Some might argue that email is unstructured. In fact, only the body of an email is unstructured, and there is plenty of metadata (e.g., from, to, date, subject) wrapping it. In addition, an email's body actually does have latent structure: while it may not be explicit, you typically have a salutation followed by numerous paragraphs of text, a sign-off, a signature, and perhaps a legal footer. Email is unquestionably semi-structured.

In fact, I believe that the vast majority of information is semi-structured. PowerPoint decks have slides; slides have titles and bullets. Contracts are typically Word documents, but have more-or-less standard sections. Proposals are usually Word or PowerPoint documents that tend to have similar structures. Even the humble tweet is semi-structured: while the contents are ostensibly 140 unstructured characters, the anatomy of a tweet reveals lots of metadata (e.g., location), and even the contents contain some structural information (e.g., RT indicating a re-tweet or #hashtags serving as topical metadata).

Now let’s consider XML content. Some would argue that XML is definitionally structured. But I’d say that an arbitrary set of documents all stored within <document> and </document> tags is only faux structured; it appears structured because it’s XML, but the XML is just used as a container. A corpus of twenty 2,000-page medical textbooks in 6 different schemas is indeed structured, but not well structured. To paraphrase an old saw about standards: the nice thing about structures is that there are so many to choose from. I believe that knowing content is marked up in XML reveals nothing about its structure, i.e., that XML-ness and structure are orthogonal. Put differently, XML is simply a means of representing information. The information represented may be highly structured (e.g., 100 purchase orders all in perfect adherence to a given schema) or highly unstructured (e.g., 20 documents only vaguely complying with 20 different schemas).
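To make the point concrete, here is a minimal sketch (the element names are invented for illustration, not taken from any real schema). Both fragments below are well-formed XML, but only the first carries real structure; the second just uses XML as a wrapper around free text.

  <!-- Illustrative sketch only; element names invented for this post -->

  <!-- Highly structured: every element has a defined, queryable meaning -->
  <purchase-order id="PO-1234">
    <customer>Acme Corp</customer>
    <line-item sku="X-100" quantity="3" unit-price="19.99"/>
    <ship-date>2010-05-01</ship-date>
  </purchase-order>

  <!-- Faux structured: XML used only as a container around free text -->
  <document>
    Dear Dr. Smith, thank you for the referral. The patient presented with...
  </document>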

I have two primary beliefs about the information continuum:

  • The vast majority of information is semi-structured. There is relatively little highly structured and relatively little completely unstructured information out there.  Most information lies somewhere in the fat middle.  I overlaid a bell curve on top of the information continuum to reflect volume.
  • Even information that initially appears structured is often semi-structured. I see three types of this subtly semi-structured information, which, hopefully without being too cute, I’ll abbreviate as SSSI. The three types are (1) schema as aspiration, (2) time-varying schema, and (3) unknowable schema.

Let’s look at each of the three types more closely.

Schema as Aspiration

The first type of subtly semi-structured information (SSSI) is where a schema exists, but only notionally.  The schema itself is either poorly defined (actual quote:  “it is believed that this element is used for”) or well defined but not followed.  This is frequently the case with publishing and media companies.  Here are two free jokes that work well at any publishing conference:

  • Raise your hand if you have a standard schema.  Keep it up if your content actually adheres to it.
  • Oxymorons aside, how many of you have 3 or more “standard” schemas, 5 or more… do I hear 10?

These jokes are funny because of the state of the content. This state is the result of two primary business trends: (1) consolidation, since most large publishers have been built through M&A and have thus inherited numerous different standards, each of which may be only partly implemented; and (2) licensing, since publishers frequently license content from numerous other sources, each with its own standard format.

Time-Varying Schema

The second case of SSSI is where you have a well-defined, enforced schema at any moment in time, but it keeps changing over time. Typically this happens for one of two reasons:

  • The business reality that you’re modeling is changing. For example, in 2009 Federal Sales was part of Eastern Sales, but in 2010 it became its own division. This makes comparison of Eastern results between 2009 and 2010 potentially difficult. In BI circles, this is known as the slowly changing dimension problem.
  • Standards keep changing. If you’re modeling information in a corporate- or industry-standard schema and that schema is changing, then your information becomes semi-structured because it is contained within multiple different schemas. Sometimes you can avoid this by migrating all prior information to the current schema, but sometimes (e.g., massive data volumes, a regulatory requirement not to alter existing records) you cannot.

When viewed with a flash camera, this information looks well structured. When you watch the movie, you can clearly see that it’s not.
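As a rough sketch of what that movie looks like (element names and figures are invented for illustration), the same kind of sales record might be captured differently in consecutive years:

  <!-- Illustrative sketch only; element names and figures are invented -->

  <!-- 2009: Federal Sales reported as a unit within Eastern Sales -->
  <sales-result year="2009" schema-version="1.0">
    <division>Eastern</division>
    <sub-unit>Federal</sub-unit>
    <revenue currency="USD">1200000</revenue>
  </sales-result>

  <!-- 2010: Federal Sales is now its own division -->
  <sales-result year="2010" schema-version="2.0">
    <division>Federal</division>
    <revenue currency="USD">1500000</revenue>
  </sales-result>

Each record is valid against the schema of its day, but a query comparing Eastern results across years has to understand both versions; taken together, the data is only semi-structured.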

Unknowable Schema

The last case of SSSI is where you have an unknowable schema.  Consider terrorist tracking.  If you were to make a schema for a terrorist database, here are some of the attributes that spring to mind:  name, alias(es), address, former address(es), height, weight, hair color, eye color, member-of, enemy-of, friend-of, tattoos/markings.

Here are some problems with this:

  • Many of the attributes are multi-valued, such as alias or friend-of. In a de-normalized approach, this means dealing with repeating-group problems and creating N columns (e.g., alias1, alias2, alias3, and so on up to the maximum number of aliases for any terrorist). Normalization would take care of the repeating group, but at the cost of creating a table for each multi-valued attribute and then having to join back to those tables when you run queries. (One such real system ended up with 500 tables, with the result that no one could find anything.)
  • It is difficult to create a type for the tattoo attribute. First, it’s multi-valued. Second, while tattoos are sometimes images, they often contain text (e.g., Mom), sometimes in a foreign language (e.g., 愛, the Chinese character for love). Since you’re trying to secure the nation against threats, you don’t want to throw away any potentially valuable information, but it’s not obvious how to store this.
  • New attributes are coming along all the time. Say you get a shoe print from a suspect as he runs away; you need to add a shoe-size attribute to the database. Say a terrorist runs away and leaves a pair of eyeglasses; now you need to add an eyeglass-prescription attribute. My favorite is what’s called pocket litter. You find a piece of paper in a person’s pocket and it has a number on it. It could be a phone number, a lock combination, or maybe map coordinates. You don’t know what it is, but again, since you don’t want to throw away any potentially valuable information, you have to find a place to store it.
  • Combining an enormous number of potential attributes with the reality that very few are known for most individuals creates two problems: (1) you end up with a sparse table, which most RDBMSs do not handle well, and (2) you end up hitting column limits.

Another example of unknowable schemas is modeling derivatives in financial services. Because derivatives are sometimes long-lived instruments (e.g., 30 years), you may face the time-varying schema problem. In addition, you have the unknowable schema problem because the industry is constantly creating new products. First we had CDOs and CDSs on banks, then single-tranche CDOs, then CDSs on single-tranche CDOs, and then synthetic CDOs. If this makes your head hurt in terms of understanding, then think for a minute about data modeling. How are you going to store these complex products in a database? And what are you going to do with the never-ending stream of new ones? (Last I heard, they were considering selling derivatives on movies.)

(As it turns out, XML is a great way to model both of these problems, because you can easily add new attributes on the fly and provide values only for the attributes you know.)
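For instance, a sparse, evolving record like the ones in the terrorist-tracking example might be sketched as follows (element names are invented for illustration): multi-valued attributes simply repeat, and a new attribute such as shoe size or pocket litter can be added to one record without touching any other.

  <!-- Illustrative sketch only; element names invented for this post -->
  <person id="P-0417">
    <name>John Doe</name>
    <alias>J.D.</alias>
    <alias>The Ghost</alias>
    <friend-of ref="P-0222"/>
    <tattoo script="zh">愛</tattoo>
    <!-- attributes added later, for this record only -->
    <shoe-size>10.5</shoe-size>
    <pocket-litter>555-0147 (phone number? lock combination? coordinates?)</pocket-litter>
  </person>

Records that lack these elements simply omit them; there is no sparse table to manage and no column limit to hit.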

To finish the post, I’ll revisit the statement I started with:  we generally refer to MarkLogic Server as an XML server, a special-purpose database management system (DBMS) for unstructured information.  Going forward, I think I’ll keep saying that because it’s simpler, but at the MarkLogic 201 level, the more precise statement is:  a special-purpose DBMS for semi-structured information.

There’s far more semi-structured information out there than anything else. Realizing that information is semi-structured is sometimes subtle. And semi-structured information is, in fact, the optimization point for our product. So what’s MarkLogic in three concepts? Speed, scale, and semi-structured information.

Responses to “The Information Continuum and the Three Types of Subtly Semi-Structured Information”

  1. My experience with HL7 over the last few maintenance releases (v3) is that the schemas, developed by committee, are very swingy, but that the actual transforms we use to work with the underlying data change very little from release to release, because many changes take place in the transmission wrapper rather than the payload (i.e., they are procedural changes rather than informational).

    I think tools like OASIS CAM and MLS hold a great deal of potential for dealing with these sorts of devolutionary schemas, for much the same reason ISO Schematron has been indispensable.

    “Sparse data” and “Supplementing data through enrichment” are also good S’s :)

  2. I wrote an article for ARMA recently with Microsoft’s help that focused on ‘semi-structured data’. It might be of interest…

    http://nevertalkwhenyoucannod.typepad.com/nevertalk/2010/03/managing-social-data-is-sharepoint-the-answer.html

    Andrew
