Category Archives: XML

Gartner Names "Specialized Systems" A Top 10 Strategic Technology

Leading IT analyst firm Gartner has named “specialized systems” to its list of top 10 strategic technologies for 2009. While I’m sure Gartner wasn’t thinking specifically of Mark Logic (for, among other reasons, the fact that we’ve not spoken with David Cearley, though I do know him from my Business Objects days), I would indeed argue that Mark Logic fits perfectly into this trend.

Here’s what Gartner says about specialized systems:

Specialized Systems. Appliances have been used to accomplish IT purposes, but only with a few classes of function have appliances prevailed. Heterogeneous systems are an emerging trend in high-performance computing to address the requirements of the most demanding workloads, and this approach will eventually reach the general-purpose computing market. Heterogeneous systems are also specialized systems with the same single-purpose limitations of appliances, but the heterogeneous system is a server system into which the owner installs software to accomplish its function.

While this is a generalized description, the point is clear: for high-performance computing, you will increasingly partition your workload amongst a heterogeneous network of servers each designed and optimized for a specific task. For MarkLogic Server, that task is high-performance XQuery evaluation against large XML databases, documentbases, and/or contentbases.

I’d also say that this argument is similar to one that Mike Stonebraker makes: that as you partition your workload across various specialized (database) servers (e.g., OLTP, data warehousing, stream processing, XML processing, scientific data processing), you will find that, by elimination, there is no apparent need for a general-purpose database. That is, every purpose a DBMS serves is a special purpose, and we will therefore soon see the end of the era dominated by the general-purpose DBMS.

By the way, I’d also argue that Mark Logic has a role in one of Gartner’s other top 10 trends, web-oriented architectures.

Web-Oriented Architectures. The Internet is arguably the best example of an agile, interoperable and scalable service-oriented environment in existence. This level of flexibility is achieved because of key design principles inherent in the Internet/Web approach, as well as the emergence of Web-centric technologies and standards that promote these principles. The use of Web-centric models to build global-class solutions cannot address the full breadth of enterprise computing needs. However, Gartner expects that continued evolution of the Web-centric approach will enable its use in an ever-broadening set of enterprise solutions during the next five years.

As I’ve said here before, once a customer starts to use MarkLogic as a platform / repository / search engine for their XML, they soon realize that it’s easier to write web applications in a pure top-to-bottom XML fashion than with the dual mapping from an XML-oriented browser to an object-oriented Java layer to a table-oriented (relational) DBMS. That’s the subject of a different post. If you’re interested in top-to-bottom XML, then go here.

Gartner’s top 10 list of strategic technologies for 2009 is here.


XML: Why You Should Care

The folks at O’Reilly Media have created an excellent blog around their ToC (Tools of Change for Publishing) meme and event. As part of that, they are running a series called StartWithXML that has some excellent material on the topic of XML and publishing.

One of the first posts in the StartWithXML project is entitled Why You Should Care About XML by Andrew Savikas, with whom I had the pleasure of speaking on a panel at the Gilbane conference in San Francisco a few months back. Excerpt:

But there are several reasons why it’s really really important for publishers to start paying attention to XML right now, and across their entire workflow:

  • XML is here to stay, for the reasonably foreseeable future. While it’s always dangerous to attempt to predict expiration dates on technology, I think it’s fair to assume XML will have a shelf life at least as long as ASCII, which has been with us for more than 40 years, and isn’t going anywhere soon.
  • Web publishing and print publishing are converging, and writing and production for print will be much more influenced by the Web than vice-versa. It will only get harder to succeed in publishing without putting the Web on par with (or ahead of) print as the primary target. The longer you wait to get that content into Web-friendly and re-usable XML, the worse.

Many in publishing balk at bringing XML “up the stack” to the production, editing, or even the authoring stage. And with good reason; XML isn’t really meant to be created or edited by hand (though a nice feature is that in a pinch it easily can be). There are two places to look for useful clues about how XML will actually fit into a publisher’s workflow: Web publishing and the “alpha geeks.”

He then goes on to examine both web publishing and alpha-geek behavior in order to provide a lay of the future publishing land. See the post for more.

O’Reilly is also hosting a StartWithXML one-day forum in New York City on 1/13/09 at the McGraw-Hill Auditorium.

Unlearning the Relational Model

Thanks to a Google Alert I stumbled into this interesting post entitled The Content Imperative: Unlearning the Relational Model in another CEO blog, that of Joel Amoussou of Montreal-based Efasoft.

Says Joel:

The following are some fundamental differences between content and relational data:

  • Content is created to be human readable
  • Content can be rendered in multiple presentation formats such as print, web, and wireless devices. Therefore it is very important to cleanly separate content from presentation
  • Content can have an inherent deep hierarchical structure. For example, think about the book/part/chapter/section/subsection/paragraph hierarchy
  • The relationships between content items are expressed through hierarchical containment and hyperlinks
  • Content is often mixed (in the sense of mixed content in XML). For example inside a paragraph, some words are italicized, in bold, or underlined to indicate special meaning
  • Content can have multi-valued properties such as the authors of a document. Multi-valued properties are not supported by SQL.
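To make two of Joel’s bullets concrete (mixed content and multi-valued properties), here is a minimal Python sketch; the document structure and element names are invented for illustration:

```python
import xml.etree.ElementTree as ET

# A hypothetical fragment showing a multi-valued property (two <author>
# elements on one document) and mixed content (an inline <b> inside a
# paragraph's text).
doc = ET.fromstring("""
<doc>
  <author>Alice</author>
  <author>Bob</author>
  <para>Some words are <b>bold</b> for emphasis.</para>
</doc>
""")

# Multi-valued property: all authors come back as a list, no join table needed.
authors = [a.text for a in doc.findall("author")]
print(authors)  # ['Alice', 'Bob']

# Mixed content: text and child elements interleave inside <para>.
para = doc.find("para")
print(para.text)           # 'Some words are '
print(para.find("b").text) # 'bold'
print(para.find("b").tail) # ' for emphasis.'
```

Representing either of these in a flat relational row requires extra tables or denormalization; in XML both are native to the data model.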

He continues, starting an argument in favor of XML:

The problem with unstructured content is that it cannot be processed and queried like the well-structured relational data stored by the RDBMS on which your ERP and CRM systems sit. XML goes beyond tags (in the web 2.0 sense), taxonomies, full-text search, and content categorization to provide fine-grained content discovery, query, and processing capabilities. With XML, the document becomes the database. If your business is content (you are a media company, a publisher, or the technical documentation department of a manufacturing company), then you should seriously consider the benefits of XML in terms of content longevity, reuse, repurposing, and cross-media publishing.

And goes on to discuss XQuery:

The relational data model is based on set theory and predicate logic. Data is represented as n-ary relations and manipulated with relational algebra. CMS vendors and even standard bodies have tried to fork SQL in order to support hierarchies and multi-value properties. It is clear however that XQuery is a superior alternative, specifically designed to address those content-related concerns.
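The point about hierarchies is easy to see in miniature. A hedged sketch using Python’s ElementTree (a tiny subset of XPath/XQuery-style navigation; the book structure is invented):

```python
import xml.etree.ElementTree as ET

# A hypothetical book fragment with the deep containment hierarchy
# described in the post: book/part/chapter/section/subsection/paragraph.
book = ET.fromstring("""
<book>
  <part><chapter>
    <section>
      <para>Intro text.</para>
      <subsection><para>Deep detail.</para></subsection>
    </section>
  </chapter></part>
</book>
""")

# One descendant traversal walks the hierarchy directly, at any depth.
# The relational equivalent needs a self-join (or a forked-SQL extension)
# per level of nesting.
section = book.find(".//section")
paras = [p.text for p in section.iter("para")]
print(paras)  # ['Intro text.', 'Deep detail.']
```

Full XQuery generalizes this kind of path navigation with joins, construction, and full-text predicates, which is what makes it a better fit for content than hierarchy-bolted-on SQL.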

And then finally argues in favor of XML databases over a JCR repository when dealing with large amounts of content:

You should seriously consider a native XML database when dealing with large quantities of document-oriented XML documents.

I couldn’t agree more. (Hey, I think I like this guy). The post also includes some discussion of data vs. content modeling and some interesting parallel history between SGML/XML and the RDBMS.


Forbes Interview: Corporate Pack Rats

I recently had a conversation with Ed Sperling from Forbes, who runs their “CIO Chat” column. Today, Forbes ran a story entitled Corporate Pack Rats resulting from that interview.

It’s hard to talk about content these days without talking about e-discovery, email archive search (wouldn’t MarkMail be wonderful at that?), and compliance. So the story starts out with a chat about that.

I then go on to one of my new rants: why does everyone want to play offense (think: business intelligence) with their data, but simply play defense (think: records management, e-discovery) with their content? Yes, not going to jail is important, but don’t you believe there’s value in your corporate documents/content that can help you build better products, serve customers better, and improve the efficiency of your operations?

This excerpt summarizes it well:

Do CIOs get this?

Most CIOs? No. The vast majority are still in a place where they’re trying to avoid getting in trouble with their documents.

We later started talking about one of my favorite topics, XML, where there’s another nice excerpt:

Does all the content have to be tagged with XML, because there’s a lot of content that predates XML?

The better the tags, the better the queries. If you want to find all documents that contain the words “bird strike,” any text search engine can do that. If you want to find all documents that classify procedures related to approach, if all that is tagged, you can get a pinpointed result. Without tags, you may learn that somewhere in the 300-page PDF are the words “bird strike.” That’s not very helpful. With the tags, you can increase the precision of searches and their granularity.
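The “bird strike” example from the interview can be sketched in a few lines. This is an illustrative Python toy, not how MarkLogic works internally, and the element names are invented:

```python
import xml.etree.ElementTree as ET

# Hypothetical maintenance-manual fragment.
manual = ET.fromstring("""
<manual>
  <procedure phase="approach"><step>Check for bird strike damage.</step></procedure>
  <procedure phase="taxi"><step>Log any bird strike reports.</step></procedure>
  <narrative>A bird strike was mentioned in passing.</narrative>
</manual>
""")

# Untagged, full-text style: any element whose text mentions the phrase,
# including incidental mentions in narrative prose.
hits = [e.tag for e in manual.iter() if e.text and "bird strike" in e.text]
print(hits)  # ['step', 'step', 'narrative']

# Tagged, pinpointed: only steps in procedures for the approach phase.
approach = manual.findall(".//procedure[@phase='approach']/step")
print([s.text for s in approach])  # ['Check for bird strike damage.']
```

The tags are what let the second query exclude the incidental narrative mention: better tags, better queries.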

Finally, another nice excerpt related to the slow, inexorable move towards XML:

There will be transition issues, but over the next three to five years we’re going to move from a “.doc” world to “.docx.” Right now, rounding up it’s 1%. But in five years, rounding down it will be 100%.

Indeed. And that’s one big change.

XML: Good, Bad, Bloated?

GCN ran an article last month, entitled XML: The Good, The Bad, and the Bloated, about which I wanted to share a few thoughts.

The article begins (bolding mine):

Depending on whom you talk to, Extensible Markup Language is either the centralized solution for managing cross-platform and cross-agency data sharing, or it’s a bloated monster that’s slowly taking over data storage and forcing too much data through networks during queries.

Which view is accurate?

In general, I believe XML’s flexibility and cross-platform capabilities far outshine any negatives. But if XML files are not properly planned and managed, there is a good possibility that you could experience XML bloat.

First, I’ll note that the author balances the pro/con of XML and comes out pro: XML’s benefits outweigh its stated and perceived disadvantages.

Now, let’s move on to the cons:

But XML bloat occurs when files are poorly constructed or not equipped for the jobs they must perform. There is a strong temptation to cram too much information into files, which makes them larger than they need to be. When an agency needs only part of the data, [it] often has to accept the whole file, including long blocks of text.

First, I’d say that “long blocks of text” are often the data in which analysts are interested, so we must be careful not to quickly classify them as baggage (i.e., let’s not be too data-centric in today’s world).

Second, I’d agree that the blind marking of everything in XML can be wasteful. That’s why I’ve long advocated a “lazy” approach where:

  • You first decide application requirements and then create XML tags in order to support them, iterating over time on both the application requirements and the sophistication of the XML to support them.

As opposed to a far-too-common “big-bang” approach whereby:

  • You design “the ultimate schema,” which can answer virtually any possible application requirement, and then spend enormous time and money first designing it, and then trying to migrate your data/content to it.

The problems with the big-bang approach are many:

  • Designing the ultimate schema is a Sisyphean task.
  • You spend money investing in XML richness which has no short-term return; i.e., you over-design for the short term.
  • You lose your budget mid-term because while you’re designing perfection, the business has seen no value and loses faith in the project.

As I like to say, “big-bang approaches often result in a big bang,” or, similarly, with too many content-oriented systems “the first step’s a doozy” beyond which you never pass.

At Mark Logic, we’re trying to change all that:

  • By delivering a forgiving XML system that accepts content in a rather ragged form, enabling you to ingest XML immediately and begin delivering value against it.
  • By evangelizing a lazy XML enrichment and migration approach that delivers business value faster than big-bang approaches.

With Mark Logic, the question is not “how much slower do I have to go than an RDBMS to get the benefits of XML?” It’s typically “how much faster can I go than an RDBMS and still deliver the benefits of XML?”

In customer benchmarks, we’ve seen outperformance of 10:1 as common, and outperformance of an RDBMS by 100:1 is certainly not unheard of. Ask our customers and partners: MarkLogic is fast.

The article continues (bolding mine):

Luckily, technologies are evolving that can help with XML bloat.

First is the evolution of platform-based XML solutions that offer a single system to author, tag, store and manage XML files. They also allow developers to set the policies for dynamic XML integration into other documents or applications. Mark Logic is one of the best-known purveyors of such solutions, …

A lot of XML bloat perception comes from the idea that you’re inserting tags into ASCII files and those files increase by the size of the tags which, at times, appear material relative to the size of the content.

As a trivial example, if you have an XML element named publication-author with the value “Joe” (i.e., the author’s name), then you have added 41 characters of “overhead” (begin and end tags) to 3 characters of underlying data. And if Joe has authored 1,000 documents in the collection, you could argue that you’ve added 41,000 characters of overhead for 3,000 characters of data. And you’d see precisely that if you looked at an ASCII serialization of the XML.

But good XML systems don’t store XML that way. XML is naturally tree-structured, and XML documents are stored as trees. What’s more, the element names (i.e., the tags) are typically hashed. So the 18-character publication-author element name gets hashed to 64 bits once, and every time the tag appears in the corpus only the hash value is stored. So it’s not 41K of overhead to 3K of content in the preceding example; it’s more like 2K to 3K.
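The arithmetic behind that example, as a quick sketch (the 8-byte-per-occurrence figure is an assumption for illustration; real per-node costs vary by implementation):

```python
# Back-of-the-envelope arithmetic for the serialization example above.
tag = "publication-author"                   # an 18-character element name
begin, end = f"<{tag}>", f"</{tag}>"
serialized_overhead = len(begin) + len(end)  # 20 + 21 = 41 characters per occurrence
data = len("Joe")                            # 3 characters of actual data

docs = 1000
print(serialized_overhead * docs)            # 41000 characters of tags, serialized
print(data * docs)                           # 3000 characters of data

# In a tree store, the name is hashed once; each occurrence then holds only
# a fixed-width identifier (assume 8 bytes / 64 bits per node here).
hashed_overhead = 8 * docs                   # 8000 bytes instead of 41000 characters
```

The order of magnitude is the point: tens of thousands of tag characters in the ASCII serialization collapse to a few thousand bytes of fixed-width identifiers in the tree.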

In fact, by Mark Logic rules of thumb, the picture often looks like:

  • 1MB of text source content, which becomes
  • 3MB of XML, which becomes
  • 300K of compressed XML in MarkLogic, which becomes
  • 1MB of compressed XML + indexes in MarkLogic

Simply put, it’s often the case that the content blows up a bit in XML only to be compressed to 1/10th its size, only to be re-inflated through indexing back to its original size.
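The 10:1 compression in that rule of thumb is plausible precisely because XML tags repeat constantly. A quick illustration with ordinary zlib compression (illustrative only; this is not MarkLogic’s actual on-disk format):

```python
import zlib

# Repeated tags are extremely compressible, which is why serialized
# "bloat" largely disappears on disk.
xml = "<publication-author>Joe</publication-author>\n" * 1000
packed = zlib.compress(xml.encode())

print(len(xml))     # 45000 raw characters
print(len(packed))  # far smaller after compression
```

On text this repetitive, general-purpose compression easily beats 10:1; real mixed content compresses less dramatically, but tag repetition still does most of the work.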

Now this certainly isn’t true every time. Sometimes content + indexes ends up 2-5x the original size. But critics should remember: (1) you then have rich XML tags that enable you to do something with the content, and (2) you then have indexes so you can do it, fast. (Often the counter-arguments make it sound like nothing is gained for the size increase.)

Finally, I’d add two points:

  • With magnetic disk storage well under $1/gigabyte (e.g., this drive) for consumer applications and maybe $10/gigabyte in a mid-range SAN … to put it bluntly … should you care? Despite our (potentially advancing) ages and attitudes about storage costs, we should not conserve storage for conservation’s sake, but instead optimize our computing investment to maximize overall return, paying heed to the relative costs of subsystems and to the value of the functionality they enable.
  • Your XML can be as big or rich as you want it to be. And with MarkLogic, you can change that richness over time. Our presumption is that you are adding elements because you want to use them to deliver business value, so, technically speaking, there should be no “wasted elements” — i.e., elements that merely inflate size and deliver no value. That is, if you’re paying attention and following a lazy XML approach, then your XML should be no richer than the functionality required by your applications, and ergo — by definition — there is no waste or bloat.

Basically, if your content gets bigger, it’s simply because you wanted to do more things with it.

The Specialized Database Argument: Performance

People sometimes ask: what’s the argument for special-purpose databases like MarkLogic, as opposed to general-purpose databases like DB2, Oracle, or SQL Server? While I have written much on this topic, in the end I think it boils down to one word: performance.

The big 3 database oligopoly has proven that the general-purpose database management system (DBMS) can indeed be bloated into a wide scope of functionality (today’s RDBMSs are so bloated that most analysts now drop the R, because they’ve long since stopped being relational).

So while the big 3 can bloat the DBMS, what they can’t do is optimize it for each special case. By definition, the general-purpose DBMS needs to be optimized for general purposes. When trade-offs are encountered, you must design for the general case.

That’s what creates the opening for specialized DBMSs. For example, MarkLogic is not optimized for the general case — a bit of transaction processing, a bit of data warehousing, a bit of analytics, a bit of text, a bit of XML, a bit of spatial indexing, a bit of data mining, a bit of huge deployments, a bit of tiny ones, a bit of OLAP, a bit of memory-residency, and so on.

MarkLogic is optimized for the specific case of large amounts of semi-structured XML data, typically containing lots of text. The result: performance numbers that simply crush the competition when they’re playing in our house.

For example, while I can’t go into specifics, one of our technical staff sent out an email this morning that went like this:

Another 100x Win Against XXXXX

Today, I indexed XML in 137 seconds which took XXXXX 4 hours, even though they were running on beefier hardware. Due to other pressing deadlines [and the already clear victory], I didn’t have time to optimize the MarkLogic side. Had I been able to do threading and cache tuning, I’m quite sure I could have sped up the MarkLogic side by 4x.

Is this magic? No.

While I think the world of our engineering team and I do believe they have built a tremendous product, there’s no magic. It’s simply a great implementation focused on a specific XML-based use case. No general-purpose player can beat that.

Startup Zeitgeist

Seedcamp, a London-based, week-long camp for European entrepreneurs, recently did an interesting exercise. They took the several hundred applications they received for their event and made tag clouds. Here’s what they found.

What are you creating?

How will you make money?

What tools will you use?

(I’d love to see XQuery in the toolset, but I’m happy to see that database, server, and XML are already there.)

And who says you can’t do interesting analytics on content? I thought this was fascinating. Check out Seedcamp’s blog post about the exercise, here.