Fingerprinting Content

I recently ran across a cool (sister, Sequoia-backed) company called Port Authority Technologies that sells information leak protection solutions. They sell an appliance that sits next to your firewall and looks, at the network level, for content that's being sent outside and shouldn't be. What kind of content might that be? Contracts, marketing plans, draft financial press releases, source code, customer databases, and such.

On discovering them, I first wondered: where were they when HP needed them? (If you've not yet heard, it seems that Patricia Dunn and three other people involved in the HP affair are going to be indicted.) It seems clear that monitoring outbound network traffic is a much better way of protecting information than calling phone companies under false pretexts. Then, always in touch with my inner marketer, I dreamed of the PR they could be generating on the back of the HP scandal.

Then I connected the dots on a series of companies using what I call "content fingerprinting" technology for interesting applications. Content fingerprinting is about scanning content, recognizing and remembering it at a deep level, and then re-recognizing it when it's about to head out the door, even if it has been transformed in some way: broken into network packets for transmission, zipped, deliberately fragmented into pieces, run through a series of global substitutions to mask it, or even (I think) encrypted. See here for more.
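To make the idea concrete, here is a minimal sketch of one common way to fingerprint text: hash every run of k consecutive words (a "shingle") and compare sets of hashes. This is an illustration only, not a description of how Port Authority or any other vendor actually does it. The function names, the choice of k = 5, and the sample documents are all assumptions. It copes with fragments and with changes in case, spacing, or punctuation; it does nothing about zipped or encrypted content, which would need to be unpacked before fingerprinting.

```python
import hashlib
import re


def normalize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def fingerprint(text: str, k: int = 5) -> set[int]:
    """Hash every run of k consecutive words (a shingle) to a 64-bit integer."""
    words = normalize(text)
    shingles = (" ".join(words[i:i + k]) for i in range(len(words) - k + 1))
    return {
        int.from_bytes(hashlib.sha1(s.encode("utf-8")).digest()[:8], "big")
        for s in shingles
    }


def containment(protected: set[int], outbound: set[int]) -> float:
    """Fraction of the outbound shingles that also occur in the protected document."""
    if not outbound:
        return 0.0
    return len(protected & outbound) / len(outbound)


# Hypothetical sample data, invented for this sketch.
protected_doc = (
    "The merger agreement between Acme and Globex will close on the "
    "first of March pending board approval"
)
leaked_fragment = "agreement   between ACME and Globex, will close on the first"
harmless_note = "Lunch is at noon in the cafeteria on the second floor today"

protected_fp = fingerprint(protected_doc)
leak_score = containment(protected_fp, fingerprint(leaked_fragment))
harmless_score = containment(protected_fp, fingerprint(harmless_note))
```

A monitoring appliance would presumably compute something like `leak_score` for outbound content and raise a flag above some threshold. Masking by global substitution would lower the score rather than zero it out, as long as some shingles survive untouched.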

It seems the most popular applications of this technology are intellectual property protection, leak prevention, and compliance. Companies focused on these applications include:

For more background on these types of applications, check out this article.

I know another company that uses similar fingerprinting technology for a very different application. Palamida effectively sells an open-source detector to software companies. So instead of using the technology to first crawl sensitive corporate documents and then sniff for them on the way out, Palamida themselves go crawl every open source repository they can find. Then, they call up the VP of Engineering -- which is particularly fun during a proposed acquisition -- and say: "are you sure that none of your engineers have ever incorporated a bit of open source code that they shouldn't have ... and put it in your product, and potentially per the GPL, therefore turned your entire product into open source?"
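The crawl-first, match-later flow described here can be sketched as an inverted index from shingle hashes to the projects they came from. This is a guess at the general shape, not Palamida's actual method; the class name `FingerprintIndex`, the project names `libfoo` and `libbar`, and the code snippets are hypothetical.

```python
import hashlib
import re
from collections import Counter, defaultdict


def shingle_hashes(code: str, k: int = 4) -> set[str]:
    """Tokenize source code (ignoring whitespace) and hash each run of k tokens."""
    tokens = re.findall(r"\w+|[^\w\s]", code)
    return {
        hashlib.sha1(" ".join(tokens[i:i + k]).encode("utf-8")).hexdigest()
        for i in range(len(tokens) - k + 1)
    }


class FingerprintIndex:
    """Maps each shingle hash to the set of crawled projects containing it."""

    def __init__(self) -> None:
        self._sources: defaultdict[str, set[str]] = defaultdict(set)

    def add(self, project: str, code: str) -> None:
        """Record every shingle of a crawled project's code."""
        for h in shingle_hashes(code):
            self._sources[h].add(project)

    def match(self, code: str) -> Counter:
        """Count, per project, how many shingles of `code` appear in the index."""
        hits: Counter = Counter()
        for h in shingle_hashes(code):
            for project in self._sources.get(h, ()):
                hits[project] += 1
        return hits


# Hypothetical crawled projects, invented for this sketch.
LIBFOO_SRC = "int clamp(int v, int lo, int hi) { return v < lo ? lo : v > hi ? hi : v; }"
LIBBAR_SRC = "def greet(name): print('hello', name)"

index = FingerprintIndex()
index.add("libfoo", LIBFOO_SRC)
index.add("libbar", LIBBAR_SRC)

# A product that pasted in libfoo's function, reindented.
product_src = "static int helper() { return 0; }\n\n    " + LIBFOO_SRC.replace(" { ", "\n{\n    ")
hits = index.match(product_src)
```

Because the index is built once from the crawl, checking a new codebase only costs one lookup per shingle, which is what would make the awkward phone call to the VP of Engineering cheap to set up.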

Content fingerprinting technology also seems to be a great way for publishers to crawl the Internet and look for scraped, stolen, or otherwise misappropriated content. I'm sure publishers do some of this today, but my guess is they are using more basic methods, which means they could be missing a lot.

Speaking of publishers whose content has been misappropriated or scraped for someone else's profit, look at what these guys did with my soccer post. That's this post, scraped, with some irrelevant tags added, with ads alongside that are generating revenue for someone else. I'm not sure if it's illegal, but it's certainly not my intent to have someone else scrape my posts and effectively sell them.
