Category Archives: XML repository

XML: YAFF, YADT, or Whole World?

If you have a bunch of XML and are looking for a place to put it, then I think I may have come up with a simple test that might be helpful.

In talking with prospective vendors of XML repositories (definition: software that lets you store, search, analyze and deliver XML), try to establish what I’ll call “XML vision compatibility.” Quite simply, try to figure out if the vendor’s vision of XML is consistent with your own. To help with that exercise, I’ll define what I see as the three common XML vendor visions:

  • YAFF (yet another file format)
  • YADT (yet another data type)
  • Whole world

YAFF Vendors
Vendors with the YAFF vision view XML as yet another file format. ECM vendors clearly fall into this category (“oh yes, XML is one of the 137 file formats you can manage in our system”). So do enterprise search vendors (“oh yes, we have filters for XML formatted files which clear out all those nasty tags and feed our indexing engine the lovely text.”)
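To make the "filter" idea concrete, here is a minimal sketch in Python (standard library only) of what a tag-stripping filter does before handing text to an indexer. The sample document and the function name `strip_tags` are assumptions made up for illustration; this is not any vendor's actual filter.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the element names are invented for this sketch.
SAMPLE = """<article>
  <title>Quarterly results</title>
  <author>Pat Lee</author>
  <body>Revenue grew in <region>Europe</region>.</body>
</article>"""


def strip_tags(xml_text: str) -> str:
    """Return only the text of an XML document, discarding all markup."""
    root = ET.fromstring(xml_text)
    # itertext() yields every text node in document order; split() and
    # join() then collapse the leftover whitespace between elements.
    return " ".join("".join(root.itertext()).split())


flat = strip_tags(SAMPLE)
print(flat)
```

Once the filter has run, the indexer sees a plain string of words. Nothing in it records that "Pat Lee" was an author or that "Europe" was a region, which is exactly the information the markup was carrying.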

For example, let’s look at how EMC Documentum — one of the more XML-aggressive ECM vendors — handles XML on its website.

Hmm. There’s no XML on that page. But lots of information about records management, digital asset management, document capture, collaboration and document management (it’s not there either). Gosh, I wonder where it is? SAP integration? Don’t think so. Hey, let’s try Documentum Platform, whatever that is.

Not there, either. Now that’s surprising because I really have no idea where else it might be. Oh, wait a minute. I didn’t scroll the page down. Let’s try that.

There we go. We finally found it. I knew they were committed to XML. What’s going on here is that EMC has a huge, largely vendor consolidation-driven (e.g., Documentum, Captiva, Document Sciences, x-Hive, Kazeon) vision of what content management is. And XML is just one tiny piece of that vision. XML is, well, yet another file format among the scores that they have to manage, archive, capture, and provide workflow, compliance, and process management against. The vision isn’t about XML. It’s about content. That’s nice if you have an ECM problem (and a lot of money to solve it); it’s not so nice if you have an XML problem, or more precisely a problem that can be solved with XML.

YADT Vendors
Vendors with the YADT vision view XML as yet another data type. These are the relational database management system vendors (e.g., Oracle) who have decided that the best way to handle XML is to make it a valid datatype for a column in a table.

The roots of this approach go back to the late 1980s and Ingres 6.3 (see this semi-related blast from the past) which was the first commercial DBMS to provide support for user-defined datatypes. All the primitives for datatyping were isolated from the core server code and made extensible through standard APIs. So, for example, if you wanted to store complex numbers of the form (a, bi) all you had to do was to write some primitives so the server would know:

  • What they look like — i.e., (a, bi)
  • Any range constraints (the biggest, the smallest)
  • What operators should be available (e.g., +, -)
  • How to implement those operators — (a, bi) + (c, di) = (a+c, (b+d)i)
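The four primitives above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Ingres API (which is not shown in this post): the class name `Complex`, the `parse` method, and the `LIMIT` bound are all hypothetical.

```python
import re
from dataclasses import dataclass

# Hypothetical range constraint: each component must stay within +/- LIMIT.
LIMIT = 1e6

# What a literal looks like: "(a, bi)", e.g. "(1, 2i)" or "(-0.5, 3i)".
_LITERAL = re.compile(r"\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)i\s*\)")


@dataclass(frozen=True)
class Complex:
    a: float  # real part
    b: float  # imaginary part

    def __post_init__(self):
        # Range constraint: reject values outside the allowed bounds.
        if abs(self.a) > LIMIT or abs(self.b) > LIMIT:
            raise ValueError("component out of range")

    @classmethod
    def parse(cls, text: str) -> "Complex":
        """Turn the external form (a, bi) into a value of the type."""
        match = _LITERAL.fullmatch(text.strip())
        if match is None:
            raise ValueError(f"not a complex literal: {text!r}")
        return cls(float(match.group(1)), float(match.group(2)))

    def __add__(self, other: "Complex") -> "Complex":
        # (a, bi) + (c, di) = (a+c, (b+d)i)
        return Complex(self.a + other.a, self.b + other.b)

    def __sub__(self, other: "Complex") -> "Complex":
        # (a, bi) - (c, di) = (a-c, (b-d)i)
        return Complex(self.a - other.a, self.b - other.b)

    def __str__(self) -> str:
        return f"({self.a:g}, {self.b:g}i)"


total = Complex.parse("(1, 2i)") + Complex.parse("(3, 4i)")
print(total)  # (4, 6i)
```

Note what the sketch leaves out: nothing here tells a query optimizer how to index or compare these values.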

It was, as far as I remember, yet another clever idea from the biggest visionary in database management systems after Codd himself: Michael Stonebraker then of UC Berkeley and now of MIT. After founding Ingres, Stonebraker went on to found Illustra, which was all about “datablades,” a sexy new name for user-defined types. Datablades, in turn, became sexy bait for Informix to buy the company with an eye towards leveraging the technology towards unseating Oracle from its leadership position. It didn’t happen.

User-defined datatypes basically didn’t work. There were two key problems:

  • You had user-written code running in the same address space as the database server. This made it nearly impossible to determine fault when the server crashed. Was it a database server bug, or did the customer cause a problem in implementing a UDT? While RDBMS customers were well qualified to write applications and SQL, writing server-level code was quite another affair. This was a bad idea.
  • Indexing and query processing performance. It’s fairly simple to say that, for example, a text field looks like a string of words and the + operator means concatenate. It’s basically impossible for an end customer to tell the query optimizer how to process queries involving those text fields and how to build indexes that maximize query performance. If getting stuff into UDTs was a level-5 challenge, getting stuff back out quickly was a level-100 one.

So while the notion of end users adding types to a DBMS basically failed, when XML came along the database vendors dusted off this approach, saying in effect: let’s use all those hooks we put in to build support for XML types ourselves. And they did. Hence what I call the “XML column” approach to storing XML in a relational database.

After all, if your only data modeling element’s a table, then every problem looks like a column.

Now this approach isn’t necessarily bad. If, for example, you have a bunch of resumes and want to store attribute data in columns (e.g., name, address, phone, birthdate) and keep an XML copy of the resume alongside, then this might be a reasonable way to do things. That is, if you have a lot of data and a touch of XML, this may be the right way to do things.
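Here is a rough sketch of the resume scenario using Python's built-in `sqlite3` and `xml.etree.ElementTree`. SQLite has no XML datatype, so a plain TEXT column stands in for a vendor's XML column type; the table, the column names, the sample rows, and the `people_with_skill` helper are all assumptions made for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
# Attribute data lives in ordinary columns; the resume itself rides along
# as XML text (a stand-in for a real XML column type).
conn.execute("CREATE TABLE resumes (name TEXT, phone TEXT, resume_xml TEXT)")
conn.executemany(
    "INSERT INTO resumes VALUES (?, ?, ?)",
    [
        ("Ann Ng", "555-0100",
         "<resume><skill>XQuery</skill><skill>SQL</skill></resume>"),
        ("Bo Cruz", "555-0101",
         "<resume><skill>Java</skill></resume>"),
    ],
)


def people_with_skill(skill: str) -> list[str]:
    """SQL handles the relational part; a path expression handles the XML part."""
    matches = []
    query = "SELECT name, resume_xml FROM resumes ORDER BY name"
    for name, xml_text in conn.execute(query):
        doc = ET.fromstring(xml_text)
        # Without a path index, every stored document is parsed and scanned.
        if any(el.text == skill for el in doc.findall("./skill")):
            matches.append(name)
    return matches


print(people_with_skill("SQL"))  # ['Ann Ng']
```

Even in this toy version, one question takes two languages (SQL for the row, a path expression for the document), and the XML side is a full scan. That is fine for a lot of data and a touch of XML; it gets harder as the proportions reverse.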

So again, it comes down to vision alignment. If XML is just another type of data that you want to store in a column, then this might work for you. Bear in mind you’ll:

  • Probably have to set up separate text and pre-defined XML path indexes (a hassle on regular schemas, an impossibility on irregular ones),
  • Face some limitations in how those indexes can be combined and optimized in processing queries,
  • Need to construct frankenqueries that mix SQL and XQuery, whose mixed-language semantics are sometimes so obscure that I’ve seen experts argue for hours about what the “correct” answer for a given query is,
  • And suffer from potentially crippling performance problems as you scale to large amounts of XML.

But if those aren’t problems, then this approach might work for you.

This is what it looks like when a vendor has a YADT vision. Half the fun in storing XML in an RDBMS is figuring out which query language and which storage options you want to use. See the table that starts on page 9, spans four pages, and considers nearly a dozen criteria to help you decide which of the three primary storage options you should use:

See this post from IBM for more Oracle-poking on the complexity of storage options available. Excerpt:

Oracle has long claimed that the fact that Oracle Database has multiple different ways to store XML data is an advantage. At last count, I think they have something like seven different options:

  • Unstructured
  • XML-Object-Relational, where you store repeating elements in CLOBs
  • XML-Object-Relational, where you store repeating elements in VARRAY as LOBs
  • XML-Object-Relational, where you store repeating elements in VARRAY as nested tables
  • XML-Object-Relational, where you store repeating elements in VARRAY as XMLType pointers to BLOBs
  • XML-Object-Relational, where you store repeating elements in VARRAY as XMLType pointers to nested tables
  • XML-Binary

Their argument is that XML has diverse use cases and you need different storage methods to handle those diverse use cases. I don’t know about you, but I find this list to be a little bewildering. How do you decide among the options? And what happens if you change your mind and want to change storage method?

Such is life in the land of putting XML in tables because your database management system has columns.

Whole World Vendors
Vendors with the whole world vision view XML as, well, their whole world.

And when I say XML, I don’t mean information that’s already in XML. I mean information that is either already in XML (e.g., documents, information in any horizontal or industry-specific XML standard) or that is best modeled in XML (e.g., sparse data, irregular information, semi-structured information, information in no, multiple, and/or time-varying schemas).

“Whole world” vendors don’t view XML as one format, but as a plethora: DocBook, DITA, S1000D, XHTML, TEI, XBRL, the HL7 standards in healthcare, the ACORD standards in insurance, Microsoft’s Office Open XML format, OpenDocument Format, Adobe’s IDML, Chemical Markup Language, MathML, the DoD’s DDMS metadata standard, semantic web standards like RDF and OWL, and scores of others.

Whole world vendors don’t view XML tags as “something that gets in the way of the text” and thus they don’t provide filters for XML files. Nor do they require schema adherence because they know that XML schema compliance, in real life, tends to be more of an aspiration than a reality. So they allow you to load and index XML, as is, avoiding the “first step’s a doozy” problem, and enabling lazy clean-up of XML information.
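As a toy illustration of loading and indexing XML "as is," the Python sketch below indexes two documents that share no schema, keyed by element path and word. It is not how any particular product's index works; the sample documents and the `index_document` function are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical documents with no shared schema.
DOCS = {
    "doc1": "<memo><to>Ann</to><body>Ship the <b>report</b> today</body></memo>",
    "doc2": "<article><title>Report</title><section><para>Draft</para></section></article>",
}


def index_document(doc_id, xml_text, index):
    """Record every (element path, word) pair without consulting a schema."""

    def walk(elem, path):
        here = f"{path}/{elem.tag}"
        # Index the words found anywhere beneath this element.
        for word in " ".join(elem.itertext()).lower().split():
            index[(here, word)].add(doc_id)
        for child in elem:
            walk(child, here)

    walk(ET.fromstring(xml_text), "")


index = defaultdict(set)
for doc_id, xml_text in DOCS.items():
    index_document(doc_id, xml_text, index)

# The same word can be found by where it occurs, not just that it occurs.
print(sorted(index[("/memo/body/b", "report")]))    # ['doc1']
print(sorted(index[("/article/title", "report")]))  # ['doc2']
```

The tags are kept and used as part of the index key, so a search can ask for "report" in a title as opposed to "report" anywhere, and neither document had to be validated or reshaped before loading.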

Whole world vendors don’t try to model XML in tables simply because they have a legacy tabular data model. Instead, their native modeling element (NME) is the XML document. That is:

  • In a hierarchical DBMS the NME is the hierarchy
  • In a network DBMS the NME is the graph
  • In a relational DBMS the NME is the table
  • In an object DBMS the NME is the object class hierarchy
  • In an OLAP, or multi-dimensional, DBMS the NME is the hypercube
  • And in an XML server, or native XML, DBMS the NME is the XML document

Whole world vendors don’t bolt a search engine to a DBMS because they know XML is often document-centric, making search an integral function, and requiring a fundamentally hybrid search/database — as opposed to a bolted-together search/database — approach.

Here is what it looks like when you encounter a whole world vendor:


Unlearning the Relational Model

Thanks to a Google Alert I stumbled into this interesting post entitled The Content Imperative: Unlearning the Relational Model in another CEO blog, that of Joel Amoussou of Montreal-based Efasoft.

Says Joel:

The following are some fundamental differences between content and relational data:
  • Content is created to be human readable
  • Content can be rendered in multiple presentation formats such as print, web, and wireless devices. Therefore it is very important to cleanly separate content from presentation
  • Content can have an inherent deep hierarchical structure. For example, think about the book/part/chapter/section/subsection/paragraph hierarchy
  • The relationships between content items are expressed through hierarchical containment and hyperlinks
  • Content is often mixed (in the sense of mixed content in XML). For example inside a paragraph, some words are italicized, in bold, or underlined to indicate special meaning
  • Content can have multi-valued properties such as the authors of a document. Multi-valued properties are not supported by SQL.
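Two of these points, mixed content and multi-valued properties, are easy to see in a small example. The Python sketch below uses the standard library's `xml.etree.ElementTree`; the sample document is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: two authors (a multi-valued property) and a
# paragraph whose text is interleaved with inline markup (mixed content).
DOC = """<article>
  <author>Ann Ng</author>
  <author>Bo Cruz</author>
  <para>Store it <i>as is</i> and clean up <b>later</b>.</para>
</article>"""

root = ET.fromstring(DOC)

# Multi-valued property: simply repeat the element.
authors = [a.text for a in root.findall("author")]

# Mixed content: text before, inside, and after each inline element.
para = root.find("para")
pieces = [("text", para.text)]
for child in para:
    pieces.append((child.tag, child.text))
    pieces.append(("text", child.tail))

print(authors)
print(pieces)
```

The order of the pieces matters and the inline elements sit in the middle of a sentence, which is awkward to represent as rows and columns.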

He continues, starting an argument in favor of XML:

The problem with unstructured content is that it cannot be processed and queried like the well-structured relational data stored by the RDBMS on which your ERP and CRM systems sit. XML goes beyond tags (in the web 2.0 sense), taxonomies, full-text search, and content categorization to provide fine-grained content discovery, query, and processing capabilities. With XML, the document becomes the database. If your business is content (you are a media company, a publisher, or the technical documentation department of a manufacturing company), then you should seriously consider the benefits of XML in terms of content longevity, reuse, repurposing, and cross-media publishing.

And goes on to discuss XQuery:

The relational data model is based on set theory and predicate logic. Data is represented as n-ary relations and manipulated with relational algebra. CMS vendors and even standard bodies have tried to fork SQL in order to support hierarchies and multi-value properties. It is clear however that XQuery is a superior alternative, specifically designed to address those content-related concerns.

And then finally argues in favor of XML databases over a JCR repository when dealing with large amounts of content:

You should seriously consider a native XML database when dealing with large quantities of document-oriented XML documents.

I couldn’t agree more. (Hey, I think I like this guy). The post also includes some discussion of data vs. content modeling and some interesting parallel history between SGML/XML and the RDBMS.


Thoughts on Category Creation and Information Access Platforms [Revised]

[Revised 8/2/08; still working on cleaning up this consciousness stream.]

Back in the old days, it seemed easy to create a category in software. Look at the database market, for example:

  • IBM invents the relational DBMS (RDBMS) category
  • Oracle, Ingres, and Informix enter in a largely undifferentiated way, though Informix eventually drifts towards the low-end/cheap segment
  • Sybase creates the derivative category of high-performance OLTP RDBMS.
  • Arbor re-christens the failed multi-dimensional DBMS as the OLAP Server
  • Tandem creates the non-stop RDBMS with its superb fault tolerance
  • Illustra launches the universal DBMS and is quickly acquired by Informix
  • Sybase launches the bitmap-indexed DBMS with SybaseIQ
  • Teradata launches the data-warehouse DBMS category

And you can find just as many examples outside database-land.

  • ASK defines the manufacturing resource planning (MRP) category
  • SAP hijacks MRP, redefines it as ERP, and goes on to become the world’s largest applications software company
  • PeopleSoft invents the HRMS category
  • Gartner Group’s Howard Dresner invents the business intelligence (BI) category, re-christening and re-framing what was formerly known as DSS or EIS.
  • Siebel pioneers the sales force automation (SFA) category
  • Scopus pioneers call center automation (CCA)
  • Companies like Rubric pioneer enterprise marketing automation (EMA)
  • Siebel, through acquisition, coalesces SFA, CCA, and EMA into a single category called customer relationship management (CRM)
  • Oracle and SAP work to coalesce CRM back into ERP. Such is the ebb and flow of categories.

(And I could go on and on — BPM, KM, CMS, WCM, ECM, LMS, DRM, SCM, PLM, ETL, DI, EII — but I think I’ll stop here with the initials list.)

People are still creating categories today, and sometimes it looks easy. Uber-categories have been quite popular in the past decade as people have focused on different ways of developing and delivering software:

  • SaaS as an uber-category has worked well, with a variety of offerings in various SaaS sub-categories (e.g., Salesforce, NetSuite)
  • Appliances have done pretty much the same thing — i.e., offering an appliance alternative for a wide variety of existing categories (e.g., a data warehouse appliance a la Netezza)
  • Open source has also done the same thing — again serving as a different flavor/dimension for a wide variety of largely existing software categories.

Only a few genuinely new categories have emerged, virtualization being the most obvious example. (Though you could argue that virtualization is itself an uber-category covering storage virtualization, server virtualization, et cetera.)

Companies are still working to carve new categories, particularly in the database market:

Sometimes vendors and/or the analysts who cover them try to impose either a straight name change (e.g., from MD-DBMS to OLAP) or a strategic shift (e.g., from BI to analytic applications) in category. Sometimes they’re just bored. Sometimes a vendor’s trying to redefine the market in line with its strengths. Sometimes an analyst is trying to make his/her mark on the industry and earn the coveted “father/mother of [category name],” much as Howard Dresner successfully did with BI.

BI got bored with its name several times during my tenure at Business Objects. At one point both the analysts and Informatica were trying to re-dub the category “analytic applications” in an attempt to get a fresh name and raise the abstraction level from tools to applications. Informatica nearly died on that hill.

Later, analysts tried to redefine the category, dubbing it corporate performance management (CPM), and arguing that business intelligence needed to link with financial planning systems. While knowing actuals is good, knowing actuals compared to the plan is better, and using actuals to drive the future plan better still. Cognos nearly tripped over itself repositioning around CPM, ultimately acquiring Adaytum, which in turn led to SRC’s eventual acquisition by Business Objects.

In an art-imitates-life sort of way, one wonders if the analysts predicted a move in the market or provoked it? My chips are on the latter.

This stream-of-consciousness is a long way of winding up to a single question: are enterprise search vendors successfully repositioning themselves as “information access platforms” or not?

Background: the enterprise-search-related vendors (e.g., Fast/Microsoft, Endeca) and search/content analysts who cover them are in the midst of an attempted category repositioning:

  • The term “enterprise search” is now seemingly dead, having been contaminated by the Google Appliance. When a shark gets in the water, all the fish jump out.
  • The word “information” is increasingly being used as a unifying term to describe both data and content (aka, unstructured data)
  • Enterprise search vendors are increasingly calling themselves “information access platforms” (though not generally abbreviated as IAP, I will do so here for brevity).

For example, consider Endeca’s corporate boilerplate:

Endeca’s innovative information access software that helps people explore, analyze, and understand complex information, guiding them to unexpected insights and better decisions. The Endeca Information Access Platform, built around a new class of access-optimized database, powers applications that combine the ease of searching and browsing with the analytical power of business intelligence.

I have a number of concerns on and related to this attempted shift:

  • The important thing about categories is that they exist in the mind of the customer. Analysts and vendors can try to put them there — but they have to stick. In my mind, IAP is not sticking. I have never heard a customer say: “I need to go out and get an IAP.”
  • I do, however, believe that “information” might well stick as an overall term, meaning both data and content (aka, structured and unstructured data).
  • It is not clear to me why someone who desires a unified platform for “information” would turn to a search vendor. Search engines were designed as read-only indexes to help people find documents containing tokens; hardly ideal as an application development platform.
  • In my estimation, someone managing “special” data should turn to a database vendor. While databases have classically not handled “special” data well, databases were designed as application platforms, and there is a whole new class of specialized databases emerging for handling various “special” types of data.
  • While I think a unified platform is a dandy vision, I think no one is close to delivering a unified platform that handles all types of data equally well. Bolting Lucene and MySQL together isn’t a platform. Relational databases still do a poor job with both content and many types of data (e.g., sparse, hierarchical, or semi-structured). XML servers (like MarkLogic) handle XML brilliantly, but need work before they can match RDBMSs at classical relational data.
  • I believe that someone who needs a crawl-and-index-the-intranet value proposition should use the Google Appliance. So while I think the search vendors are correct in their desire to flee, I don’t think that “information access platform” is a good refuge.

Overall, my chips remain on the don’t come line for the attempted category repositioning from “enterprise search” to “information access platform.” You can find my stack on the come line for the emerging “special-purpose database” category and “XML servers” as an instance of them.

Documentum Post DeWalt: One Year Later

I’ve never met Dave DeWalt, but I’ve met plenty of folks who have, and they universally say good things about him. So I figured it wasn’t great news for the Documentum group at EMC when DeWalt left about a year ago to become CEO of McAfee.

Today I found an excellent post on CMSwire entitled Documentum: One Year After Dave DeWalt. Among other things it points to a superb post by John Newton, co-founder of Documentum and now co-founder and CTO of Alfresco, entitled The Departed, which goes into great depth about what DeWalt accomplished at Documentum and John’s suspicions as to why he left. If nothing else, read Newton’s post; I don’t know how I somehow missed it a year ago.

Here’s an excerpt from the CMSwire post by Marko Sillanpaa:

But gone is the passion and energy Dave and his team brought to content management. While some may disagree with the idea that content management is cool, I doubt few felt that way after seeing Dave’s keynotes. Rappelling from the ceiling or entry on motorcycles or horseback (even with diapers) woke you up in the morning and got you listening to the rest of the presentation, no matter how late you stayed at the table in Vegas.

In contrast, the EMC World 2007 keynotes were given with all the enthusiasm of a tenured professor in a second rate junior college. You could really see the difference between the west coast software and the east coast hardware marketing.

Overall, the post starts with a pretty grim impression of the post-DeWalt world, but then shows signs of hope, starting with the un-retirement of Documentum’s other co-founder, Howard Shao:

Documentum had been a tight knit family. And fortunately, in mid-year Howard Shao came out of retirement to hold the family together. It was disappointing though that while he left with a roar there was not even a peep when he returned. Howard’s return did what it intended. It settled folks down and even brought a few people back.

Joining Howard to take the reigns of CM&A was Mark Lewis, who had held several roles inside EMC including CTO. He’s only been in the role for six months so there’s been little time for change but EMC World is coming up. We’ll see if this long time EMC leader finally looks across all of the EMC products. It still baffles me that after three years few of the product lines talk to each other (EMC’s Newest Competitor EMC?). The other question, can he motivate the troops?

Marko ends his post on a hopeful note for Documentum’s future. I’m slightly less optimistic than he is because of one word: SharePoint. Actually, two words: SharePoint and Alfresco.

I think a likely future for the ECM category is SharePoint attacking from the left with Microsoft’s standard iterative-improvement approach and Alfresco attacking from the right as the alternative to SharePoint. First-generation ECM vendors end up as the IBM mainframes in that scenario (i.e., they’re expensive and everybody has one, but they aren’t deploying new apps on them). I’ve blogged before on the similarities between ECM and BI, and I believe that while BI jelled as an integrated category, ECM never did.

But then again, I do have a bone to pick, because EMC acquired x-Hive a while back and while there is a high degree of complementarity between MarkLogic and Documentum, there is a fair degree of functional overlap with x-Hive.

However, I believe an XML content server is strategic infrastructure for the customers we’re targeting and they won’t just take what comes in the box with a CMS. So while I expect the vendor relationship to be more complex than in the past, I do believe that plenty of customers will use Documentum for content management and MarkLogic for their XML repository.

That said, looking to the future, I do believe that SharePoint will put a squeeze on the classical ECM vendors and become ubiquitous, so we’re increasing our investment in SharePoint and Microsoft Office integration. And we’re thinking about an Alfresco relationship as my spider sense says there’s a good chance they will end up successfully positioning as the SharePoint alternative.