GCN ran an article last month, entitled XML: The Good, The Bad, and the Bloated, about which I wanted to share a few thoughts.
The article begins (bolding mine):
Depending on whom you talk to, Extensible Markup Language is either the centralized solution for managing cross-platform and cross-agency data sharing, or it’s a bloated monster that’s slowly taking over data storage and forcing too much data through networks during queries.
Which view is accurate?
In general, I believe XML’s flexibility and cross-platform capabilities far outshine any negatives. But if XML files are not properly planned and managed, there is a good possibility that you could experience XML bloat.
First, I’ll note that the author balances the pro/con of XML and comes out pro: XML’s benefits outweigh its stated and perceived disadvantages.
Now, let’s move on to the cons:
But XML bloat occurs when files are poorly constructed or not equipped for the jobs they must perform. There is a strong temptation to cram too much information into files, which makes them larger than they need to be. When an agency needs only part of the data, [it] often has to accept the whole file, including long blocks of text.
First, I’d say that “long blocks of text” are often the data in which analysts are interested, so we must be careful not to quickly classify them as baggage (i.e., let’s not be too data-centric in today’s world).
Second, I’d agree that the blind marking of everything in XML can be wasteful. That’s why I’ve long advocated a “lazy” approach where:
- You first decide application requirements and then create XML tags in order to support them, iterating over time on both the application requirements and the sophistication of the XML to support them.
As opposed to a far-too-common “big-bang” approach whereby:
- You design “the ultimate schema,” which can answer virtually any possible application requirement, and then spend enormous time and money first designing it, and then trying to migrate your data/content to it.
The problems with the big-bang approach are many:
- Designing the ultimate schema is a Sisyphean task.
- You spend money investing in XML richness which has no short-term return; i.e., you over-design for the short-term
- You lose your budget mid-term because while you’re designing perfection, the business has seen no value and loses faith in the project.
As I like to say, “big-bang approaches often result in a big bang,” or, similarly, with too many content-oriented systems “the first step’s a doozy” beyond which you never pass.
At Mark Logic, we’re trying to change all that in three ways:
- By delivering a forgiving XML system that accepts content in a rather ragged form, enabling you to ingest XML immediately and begin delivering value against it.
- By evangelizing a lazy XML enrichment and migration approach that delivers business value faster than big-bang approaches.
- By delivering a high-performance XML server that ingests and indexes XML in a very efficient way.
With Mark Logic, the question is not “how much slower do I have to go than an RDBMS and get the benefits of XML,” it’s typically “how much faster does it go than an RDBMS and still deliver the benefits of XML?“
The article continues (bolding mine):
Luckily, technologies are evolving that can help with XML bloat.
First is the evolution of platform-based XML solutions that offer a single system to author, tag, store and manage XML files. They also allow developers to set the policies for dynamic XML integration into other documents or applications. Mark Logic is one of the best-known purveyors of such solutions, …
A lot of XML bloat perception comes from the idea that you’re inserting tags into ASCII files and those files increase by the size of the tags which, at times, appear material relative to the size of the content.
As a trivial example, if you have an XML element is named publication-author, with value (i.e., the author’s name) “Joe,” then you have added 41 characters of “overhead” (begin and end tags) to the underlying data of 3 characters. And, if Joe has authored 1,000 documents in the collection, you’d argue that you’ve added 41,000 characters of overhead for 3,000 characters of data. And you’d see precisely that if you looked at an ASCII serialization of the XML.
But good XML systems don’t store XML that way. XML is naturally tree-structured and XML documents are stored as trees. What’s more, the element names (i.e., the tags) are typically hashed. So the 20-character “publication-author” element name get hashed to 64 bits once and every time the tag appears in the corpus only the hash-value is stored. So it’s not 41K of overhead to 3K of content in the preceding example, it’s more 2K to 3K.
In fact, by Mark Logic rules of thumb, the picture often looks like:
- 1MB of text source content, which becomes
- 3MB of XML, which becomes
- 300K of compressed XML in MarkLogic, which becomes
- 1MB of compressed XML + indexes in MarkLogic
Simply put, it’s often the case that the content blows up a bit in XML only to be compressed to 1/10th its size, only to be re-inflated through indexing back to its original size.
Now this certainly isn’t true every time. Sometimes content + indexes ends up 2-5x the original size. But critics should remember: (1) you then have rich XML tags that enable you to do something with the content and (2) you then have indexes so you do it, fast. (Often the counter-arguments make it sound like nothing is gained for the size increase.)
Finally, I’d add two points:
- With magnetic disk storage well less than $1/gigabyte (e.g., this drive) for consumer applications and maybe $10/gigabyte in a mid-range SAN …. to put it bluntly … should you care? Despite our (potentially advancing) age and attitudes about storage costs,
we should not conserve storage for conservation’s sake, but instead optimize our computing investment so as to maximize overall return paying heed to the relative costs of subsystems and to value of functionality enabled by them.
- Your XML can be as big or rich as you want it to be. And with MarkLogic, you can change that richness over time. Our presumption is that you are adding elements because you want to use them to deliver business value so technically speaking, there should be no “wasted elements” — i.e., elements that merely inflate size and deliver no value. That is, if you’re paying attention and following a lazy XML approach, then your XML should be no richer than the functionality required by your appliactions, and ergo — by definition — there is no waste or bloat.
Basically, if your content gets bigger, it’s simply because you wanted to do more things with it.