One near-universal pattern I’ve noticed with content software is that the process for using it usually looks like this:
- Step 1: prepare your content to be loaded in the system
- Step 2: load the content into the system
- Step 3: get the benefits of the system
The simple problem is that step one’s a doozy. Because of step one, many content initiatives are never started, while many others are doomed to fail. We’ve worked with customers who have been stuck on “step one” for literally 18 months and have gotten nowhere toward achieving the goals of their content initiatives.
The problem isn’t specific to any type of content software. Just about all content software requires this doozy of a first content-preparation step in some form.
- Relational databases require shredding content if you want to store content in anything other than an opaque BLOB or CLOB.
- Web content management systems require you to break your content into bite-sized morsels that are then recombined to form web pages.
- Document management systems require fragmenting documents to establish a unit of granularity around which the system will operate (e.g., storing it, collecting metadata about it).
- Publishing systems, both commercial and in-house, typically require that the content be presented in a single, consistent format (e.g., a DTD).
- Search engines require up-front configuration to enable fielded searches. To enable fine-grained search, they require you to “burst” content to the granularity at which you want to index: if you want paragraph-level indexing of a 20-page document, you will need to burst it into 100 or more one-paragraph files and then index those.
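To make that bursting step concrete, here is a minimal sketch of what it might look like in Python. The splitting rule (blank lines between paragraphs), file naming, and directory layout are all invented for illustration; a real pipeline would be driven by the search engine’s own indexing requirements.

```python
# Hypothetical sketch of the "bursting" step: splitting one document into
# per-paragraph files so each paragraph can be indexed as its own unit.
from pathlib import Path


def burst(document: str, out_dir: str) -> list:
    """Split a document on blank lines and write one file per paragraph."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    files = []
    for i, para in enumerate(paragraphs):
        path = out / f"para_{i:04d}.txt"  # invented naming scheme
        path.write_text(para, encoding="utf-8")
        files.append(path)
    return files
```

Even in this toy form you can see the cost: the “document” stops existing as a unit, and you now own a pile of fragment files whose lifecycle you must manage alongside the original.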
The process goes by many names: content preparation, content manufacturing, content transformation, intermediary file creation, content normalization.
No matter what you call it, you’re doing the same thing: spending time and energy preparing content to be loaded into a system before you can even start thinking of getting some value from it.
You can’t help but get the impression that it’s all backwards. That you’re adapting the content to the software when the software should be adapting to the content.
Perhaps you’re so used to the status quo that you’re wondering: what’s so hard about step one?
- If you’re dealing with massive volumes of content, it can take months to do the content preparation.
- If you’re dealing with rapidly arriving content, it’s possible that content is arriving faster than you can prepare it and a backlog is building (we’ve seen this with a few search engine customers who’ve moved to MarkLogic).
- If you’re integrating content from hundreds or thousands of sources, it is difficult and time-consuming to transform it all into a single DTD, and to ensure your transformations keep up with changes in the source formats over time.
- If you’re working in military intelligence, you are simply not in a position to alter the content creation process and control the DTD in which content is formatted (hey bad guys, please all standardize on this DTD). Structured authoring tools aren’t the answer here.
I think “step one” — more than any other factor — is the reason why 80% of unstructured content lives in files on the file system and not in databases. So, if step one is the killer, is there any way to avoid it?
Absolutely. That’s exactly what we do at Mark Logic. Our XML content server loads content “as is” — no preparation is required. We take whatever content you provide (and loading can be as simple as dragging a file to a WebDAV-mapped folder) and we index what’s there. If it’s in XML, we will index all the text and all the XML structure and content. If it’s in a desktop format (e.g., Word or PDF), then we’ll first automatically convert it to XML, and then perform the text and XML indexing process.
There’s no magic, however. If some documents use “author” tags while others use “creator,” then you will need to either (1) create synonyms, (2) write queries that search for both, or (3) transform documents over time to a single format. But the point is you can do all that after the content is loaded, and after you have received some incremental value.
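Option (2) — querying across the inconsistent markup rather than fixing it up front — can be sketched in a few lines. This uses Python’s `xml.etree.ElementTree` as a stand-in for XQuery against a content server, and the sample documents are invented:

```python
# Sketch of querying across inconsistent markup after loading content
# "as is": some documents tag the author as <author>, others as <creator>.
import xml.etree.ElementTree as ET

docs = [
    ET.fromstring("<doc><author>Smith</author><title>On Content</title></doc>"),
    ET.fromstring("<doc><creator>Jones</creator><title>Step One</title></doc>"),
]


def by_author(name: str) -> list:
    """Return titles of documents whose <author> OR <creator> matches `name`."""
    hits = []
    for doc in docs:
        elem = doc.find("author")
        if elem is None:
            elem = doc.find("creator")  # fall back to the variant tag
        if elem is not None and elem.text == name:
            hits.append(doc.findtext("title"))
    return hits

by_author("Jones")  # matches via <creator>, no up-front normalization needed
```

The same fallback logic is a one-line `union` in XQuery; the point is simply that the reconciliation happens at query time, after loading, not as a precondition to loading.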
In XQuery you can write queries that are as powerful as the markup in your documents. If you have no markup, you degenerate to something a bit more powerful than regular text search. If you have a little markup you can run basic queries. If you have a lot of markup you can write very powerful queries.
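The “queries as powerful as your markup” gradient can be illustrated with a small example — again in Python/ElementTree rather than XQuery, with invented sample documents:

```python
# Two versions of the same content: one with no markup, one richly marked up.
import xml.etree.ElementTree as ET

plain = ET.fromstring("<doc>Kellogg wrote about content servers.</doc>")
rich = ET.fromstring(
    "<doc><author>Kellogg</author>"
    "<body>wrote about <topic>content servers</topic>.</body></doc>"
)

# No markup: all you can do is full-text-style matching on the whole document.
found_somewhere = "Kellogg" in "".join(plain.itertext())

# Rich markup: you can ask *where* a term appears -- is "Kellogg" the author,
# or merely mentioned in the body? -- and target structures like <topic>.
is_the_author = rich.findtext("author") == "Kellogg"
topic = rich.find("body/topic").text
```

With no markup you get text search; with a little you can field queries on `author`; with a lot you can target arbitrary structures — and you can add that markup incrementally after loading.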
What I like about this approach is that it’s pragmatic and incremental. The ROI curve isn’t flat for 18 months with a hoped-for “big bang” once everything is loaded. You can load your content quickly, start getting value from it early, and then increase the value you get both by building applications and by normalizing and enriching the markup over time.
It’s not a big bang approach. There is no step-one doozy. While it may not sound like a big deal, I think it’s a major paradigm reversal in the world of content software.
The box should probably say: Batteries not included. No content preparation necessary. Some XQuery assembly required.