Rule 1 of Database Performance

Here’s a link to a post done by Matt Turner on his Discovering XQuery blog that discusses Publishing 2.0 and content logic.

In this post Matt discusses what I call the “thick middle tier” problem with most search-engine-based content applications.

Here’s the issue. Search engines (1) return lists of links to documents and (2) allow only fairly basic “query” (and I’m reluctant to even call them that) predicates to be applied in the search engine.

As a result, a typical search-engine-based application ends up with a thick middle tier of Java code that (1) systematically materializes each document in the returned list as a DOM tree and then (2) does subsequent processing on that document using Java.

As Matt points out, you might be tempted to think of this work as “your application” or “business logic,” but in reality it’s not. It’s content processing, not business or application processing. This approach is bad for several reasons:

  • Productivity is negatively impacted because you have to do low-level content processing yourself, and typically in a relatively low-level language, like Java
  • Performance is negatively impacted because you end up with an architecture that violates “rule 1” of database performance — push processing to the data, don’t bring data to the processing

All DBMSs strive for compliance with rule 1.

  • Query optimizers always apply the most restrictive predicate first (e.g., apply emp-id = 178 before sex = female)
  • Query optimizers always do lookup joins from the table with the most restrictive predicates on it (where dept.dname = “fieldmkt” as opposed to emp.name = “*stein*”)
  • It’s why everyone loves stored procedures. Not only do they minimize client/server interaction and allow pre-compilation, most importantly, they push processing to the data.

I’m not going to criticize people who built systems this way historically. Prior to products like MarkLogic, the thick-middle-tier architecture was the best you could do. DBMSs couldn’t handle content so the best you could do was to leave your content in files (or stuff it in BLOBs), index it with a search engine, and then build these thick-middle-tier applications.

But in the future it doesn’t have to be this way. With systems like MarkLogic, you can now build content applications using a standard query language (XQuery) and the “correct” allocation of processing across tiers. This has the following benefits:

  • Improved productivity because XQuery is a relatively high-level language
  • Greatly improved performance because you can thin-out the middle tier and push content processing to the XML content server (which is both optimized to do it and close to the content)
  • Openness and standardization, which makes it easier to find skilled resources, eliminates vendor lock-in, and makes software integration generally easier.
  • Flexibility. Typically with enough smarts in the middle layer you can hack something together than runs one query fast. The trick is when you want to run many and/or new queries fast — in that case, you really need the right architecture — i.e., one that pushes processing to the content instead of bringing content to the processing.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.