Lessons From WinFS

Microsoft put WinFS into Beta today, so it seems like a good time to talk about WinFS and what it tells us about the future of file systems and databases.

WinFS is a new file system that was originally supposed to be in the Windows Vista release (né Longhorn), which is now slated for delivery in Fall 2006. Vista has been delayed several times. The word on the street was that those delays were largely due to problems developing WinFS, leading to the conclusion that WinFS was deliberately de-scoped from Vista in order to help Microsoft finally ship it.

Microsoft calls WinFS a “unified storage system.” As far as I can tell, it’s a new XML-based file system. When you consider that Microsoft is (1) moving the Office document formats to XML, (2) embracing metadata tags with features like Smart Tags in Office, and (3) developing a new XML-based file system, you can get your first lesson from WinFS.

Lesson 1: XML is important. It is a trend, not a fad.

One duty of a file system is helping you organize (and subsequently find) things. That’s primarily been accomplished through a hierarchical directory structure. Most people don’t do a good job organizing things, whether it’s paper files or their computer equivalent. Some problems are human nature: we just aren’t strict or repeatable in building ad hoc taxonomies. Others are caused by the hierarchical requirement that a file end up in one place. (For example, does the 2005 operating plan presentation belong in the 11/04 board meeting folder or the 2005 plan folder?) While you could use shortcuts to effectively put the file in both places, they are error-prone and few people actually use them.

So while directories are most certainly helpful, we know that they should be complemented with search if you actually want to find anything. It must be embarrassing to Microsoft that customers have resorted to third-party search engines, such as Google Desktop or X1, to search for information in their file system. This leads to lesson 2.

Lesson 2: search is important. Hierarchies alone don’t let you find things.

When you think about searching a file system, two types of searches come to mind. The first is text search – e.g., find me all files containing the word “semantic.” The second is metadata search – e.g., find me all files created by “Strohlein,” or about “RDF.”
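To make the distinction concrete, here is a minimal Python sketch of the two search types. It says nothing about how WinFS or any desktop search product works internally; the `Doc` class, the sample documents, and the author name "Smith" are all hypothetical, invented for illustration.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Doc:
    """A hypothetical document: body text plus a little metadata."""
    path: str
    text: str
    author: str
    topics: set = field(default_factory=set)


def text_search(docs, word):
    """Return paths of documents whose body contains the given word."""
    word = word.lower()
    return [d.path for d in docs if word in re.findall(r"\w+", d.text.lower())]


def metadata_search(docs, author=None, topic=None):
    """Return paths of documents matching the given metadata fields."""
    hits = []
    for d in docs:
        if author is not None and d.author != author:
            continue
        if topic is not None and topic not in d.topics:
            continue
        hits.append(d.path)
    return hits


# Made-up sample data; note that memos/rdf.txt never uses the word "RDF".
DOCS = [
    Doc("notes/web.txt", "Notes on the semantic web.", "Strohlein", {"RDF"}),
    Doc("plans/2005.txt", "Operating plan for 2005.", "Smith", {"planning"}),
    Doc("memos/rdf.txt", "Why triples matter.", "Smith", {"RDF"}),
]

print(text_search(DOCS, "semantic"))
print(metadata_search(DOCS, author="Strohlein"))
print(metadata_search(DOCS, topic="RDF"))
```

The last query is the interesting one: a document can be about "RDF" without containing the word, so a metadata search can find things a text search misses.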

Historically, experts believed that you needed librarians with controlled vocabularies and rigorously defined taxonomies in order to get value from metadata tagging. Increasingly, people believe that while the old approach does indeed work, you can also accomplish a lot with an organic approach where authors tag their content in an ad hoc fashion, sometimes called a folksonomy. This is a radically different philosophy that nonetheless produces a valuable result. For an example of the power of ad hoc tagging, check out Flickr. This leads to our third lesson.

Lesson 3: metadata is important.
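As a rough illustration of why ad hoc tags sidestep the "one file, one folder" problem, here is a small Python sketch of a tag index. The `TagIndex` class and the file names are hypothetical; this is not how Flickr or WinFS implements tagging, just one simple way to do it.

```python
from collections import defaultdict


class TagIndex:
    """Hypothetical in-memory index mapping free-form tags to file paths."""

    def __init__(self):
        self._by_tag = defaultdict(set)

    def tag(self, path, *tags):
        """Attach any number of ad hoc tags to a file."""
        for t in tags:
            self._by_tag[t.strip().lower()].add(path)

    def find(self, *tags):
        """Return the files that carry every one of the given tags."""
        sets = [self._by_tag.get(t.strip().lower(), set()) for t in tags]
        return sorted(set.intersection(*sets)) if sets else []


index = TagIndex()
# The operating plan no longer has to live in exactly one folder.
index.tag("operating-plan-2005.ppt", "2005 plan", "Board meeting 11/04")
index.tag("budget-2005.xls", "2005 plan")

print(index.find("2005 plan"))
print(index.find("board meeting 11/04"))
```

The only normalization here is lowercasing and trimming whitespace. A real folksonomy has to cope with synonyms and misspellings, which is the price of skipping the controlled vocabulary.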

Metadata has always been the step-child of data, and I’ve always quipped that “meta-data companies tend to produce meta-revenues.” In the future, this will be decreasingly true. In fact, I’d argue that data and metadata relate to each other the way physiological and safety needs do in Maslow’s hierarchy. You only worry about safety after you have food and water. You only worry about metadata after you have data.

To steal an example from my past life at Business Objects, only after you know sales of watches in NYC in April do you wonder when the data was refreshed, from which system did it come, and was it net of an allowance for returns?

One disconcerting thing I’ve heard about WinFS is that Bill Gates himself dictated that it be implemented on top of SQL Server.

On one hand, you can understand that, as chief software architect, Gates doesn’t want Microsoft running around with redundant DBMSs and file systems. For example, Outlook uses neither the file system nor one of Microsoft’s DBMSs (e.g., Access, SQL Server) to store messages in your personal folders. Outlook seems to have implemented a file system within a mail client that loads everything into one giant .OST file on your machine. (Mine is nearly 1 GB.)

On the other hand, the Achilles’ heel of the relational database has always been modeling hierarchy. File systems are hierarchical. XML is strictly hierarchical. So while I’m sure it sounded really good in a meeting to say something like “let’s put all our wood behind our one strategic DBMS arrow in SQL Server,” I am also quite sure that such a decision could easily doom the project to failure.

Lesson 4: beware things that model hierarchy in a relational database.
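To show where the friction comes from, here is a minimal sketch of one common way to store a folder tree in a relational table, the adjacency list, using Python's built-in `sqlite3` module. The `node` table, its columns, and the sample folders are assumptions made up for this example; nothing here reflects how WinFS actually maps files onto SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Adjacency list: each row points at its parent; the root has no parent.
conn.execute(
    "CREATE TABLE node ("
    " id INTEGER PRIMARY KEY,"
    " parent_id INTEGER REFERENCES node(id),"
    " name TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO node (id, parent_id, name) VALUES (?, ?, ?)",
    [
        (1, None, "docs"),
        (2, 1, "board"),
        (3, 1, "plans"),
        (4, 2, "2004-11"),
        (5, 4, "operating-plan.ppt"),
        (6, 3, "2005"),
    ],
)


def subtree_paths(conn, root_id):
    """Return the path of root_id and everything beneath it, sorted."""
    rows = conn.execute(
        """
        WITH RECURSIVE sub(id, path) AS (
            SELECT id, name FROM node WHERE id = ?
            UNION ALL
            SELECT n.id, sub.path || '/' || n.name
            FROM node AS n JOIN sub ON n.parent_id = sub.id
        )
        SELECT path FROM sub ORDER BY path
        """,
        (root_id,),
    )
    return [row[0] for row in rows]


print(subtree_paths(conn, 2))
```

Because the depth of the tree is not known in advance, a plain join cannot answer "everything under this folder"; the query has to recurse, one level per step. A file system or an XML document gives you that containment for free.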

Time will tell how well Microsoft has implemented WinFS. Time will also tell how the line between DBMS and file system plays out. For the past 20 years, DBMS vendors have said “the file system is dead, long live the database” and for the past 20 years about 80% of all information has resisted databases, stubbornly remaining on the file system, increasingly accessed via search engines.

I believe the explanation for all this is simple:

  1. RDBMSs were designed to model data, not content. Codd wasn’t thinking about Word documents and PDFs when he designed the relational model.
  2. People like the simplicity of the file system.
  3. People are only willing to do content preparation for very high-value content. Hence, the generally un-penetrated market for ECM systems.

By the way, if you’re looking for a server that

  1. Handles XML content natively
  2. Models hierarchy without a problem
  3. Has “document” as the core of its data model, not “table”
  4. Handles metadata flawlessly
  5. Enables sophisticated database-style queries that can do text search, structural search, metadata search — or combine all three at once

Then you should give us a call. You don’t need to wait for a new file system that does part of the job. MarkLogic Server does the whole job, today.
