Category Archives: Search

Internet Search: The Reality of Link-Buying and Comment Spam

Google search today has, in my opinion, degenerated to roughly where keyword search was a decade ago.  Most searches, particularly those with commercial intent, have been search-engine-optimized, spammed, link-farmed, or content-farmed to the point of uselessness.

As Michael Arrington succinctly put it:  Search Still Sucks.  I’d actually quibble with the “still” — it’s taken a decade of cat-and-mouse to make Google as bad today as AltaVista was in 2000.

One of the many reasons search has degenerated is link-buying.  One of the benefits of running a blog is that you get to see tactics like link-buying and comment spam first-hand.  In this post, I thought I’d share that first-hand look.

Here is an email I received today that exemplifies link-buying.

That’s it.  If you write a post and link to my client, I’ll pay you.  It can’t be easy for Google to algorithmically figure out which links I’ve put in naturally and which ones I’ve been paid to insert.  It’s not even obviously possible, though getting close probably is.  But it can’t be easy.

For comment spam, here is what the comment dashboard looks like in my blog, which is powered by WordPress.

Since Google is all about inbound links, comment spammers either load their comments up with links (see last entry above) or enter a seemingly innocuous text comment with a blog/web address that is the link they’re promoting (see Minh’s entry).

The amazing thing about comment spam is the volume.  My blog has had 4,600 spam comments in the past 60 days.   While I believe these are much easier to detect than purchased links — particularly for the blogging platform if not the search engine — the volume is certainly impressive.  Note that since WordPress bundles Akismet, all of these spam comments were picked off before Google had to deal with them.  But I’m sure for plenty of blogs that’s not the case.

If you look at the history of search and spam, it’s pretty simple:

Phase 1:  keyword frequency.  Rank pages by the TF/IDF of search keywords.   Spammers then quickly discover how to load pages and/or tags with keywords to inflate their rank.

Phase 2:  inbound link frequency and authority.  Rank pages by the number and authority of inbound links.  Pages that themselves have lots of inbound links have higher authority than those that don’t.  Spammers slowly discover the aforementioned techniques and eventually beat this as well.
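To make Phase 1 concrete, here is a minimal sketch of TF/IDF scoring — my own toy illustration, not any search engine’s actual formula — showing why keyword stuffing worked: repeating a term inflates its term frequency, and hence its score.

```python
import math
from collections import Counter

def tf_idf(term, doc, corpus):
    # term frequency: the share of the document's words that are `term`
    tf = Counter(doc)[term] / len(doc)
    # document frequency: how many documents in the corpus contain `term`
    df = sum(1 for d in corpus if term in d)
    if df == 0:
        return 0.0
    # terms that are rare across the corpus get a higher weight
    return tf * math.log(len(corpus) / df)

corpus = [
    ["cheap", "pills", "cheap", "cheap", "buy"],    # keyword-stuffed spam page
    ["search", "engine", "ranking", "research"],
    ["buy", "books", "online"],
]
```

Because the spam page repeats “cheap” three times in five words, `tf_idf("cheap", corpus[0], corpus)` dwarfs the score of the unstuffed term “buy” on the same page — exactly the loophole Phase 1 spammers exploited.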

I believe the world is strongly in need of a phase 3 approach and I suspect it will involve curation.  Consider some more of Arrington’s comments:

Yes, search is very hard. But Silicon Valley is really good at doing hard things. The real problem right now is that there’s a perception that Google is untouchable in search. When a venture capitalist sees a pitch from a new search startup all they can think about is the Cuil debacle. And since venture capitalists are just about the most risk averse people in Silicon Valley, the funds just don’t flow.

But all the evidence suggests otherwise. Demand Media is worth $1.6 billion, and their entire business is based on pushing cheap, useless content into Google to get a few stray links. If Google was good at search, Demand Media wouldn’t exist. And Bing wouldn’t be making solid gains in search market share. And JC Penney wouldn’t be able to massively game search results for a few months, during the holiday season, without getting caught until months later.

We need to see a real competitor emerge in search. If only because it will make Google up its game, and make all of us a lot happier.

This is one reason I’m watching Blekko.  While I’m not in love with the way they currently do curation (i.e., slashtags), I do believe that they are focusing on the right core concept.  For more information on Blekko, you can read this TechCrunch article to which, I should probably say, I linked by choice and not for profit.

Quick Take on the Dassault Systèmes Acquisition of Exalead

Today, in what I consider a surprising move, French PLM and CAD vendor Dassault Systèmes announced the acquisition of French enterprise search vendor Exalead for €135M or, according to my calculator, $161M.  Here is my quick take on the deal:

  • While I don’t have precise revenue figures, my guess is that Exalead was aiming at around $25M in 2010 revenues, putting the price/sales multiple at 6.4x current-year sales, which strikes me as pretty good given what I’m guessing is around a 25% growth rate.  (This source says $21M in software revenue, though the year is unclear and it’s not clear if software means software-license or software-related.  This source, which I view as quite reliable, says $22.7M in total revenue in 2009 and implies around 25% growth.  Wikipedia says €15.5M in 2008 revenues, which equals exactly $22.7M at the average exchange rate.  This French site says €12.5M in 2008 revenues.  The Qualis press release — presumably an excellent source — says €14M ($19.5M) in 2009 revenues.  Such is the nature of detective work.)
  • I am surprised that Dassault would be interested in search-based applications, Exalead’s latest focus.  While PLM vendors have always had an interest in content delivery and life-cycle documentation (e.g., a repair person entering feedback on documentation that directly feeds into future product requirements), I’d think they’d want to buy an enterprise techpubs / DITA vendor rather than a search vendor to do so, as in the PTC / Arbortext deal of 2005.  Nevertheless, Dassault President and CEO Bernard Charlès said that with Exalead they could build “a new class of search-based applications for collaborative communities.”  There is more information, including a fairly cryptic video which purports to explain the deal, on a Dassault micro-site devoted to the Exalead acquisition, which ends with the phrase:  search-based applications for lifelike experience.  Your guess as to what that means is as good as mine.
  • A French investment firm called SCA Qualis owned 83% of Exalead, having steadily built up its position from 51% in 2005 to 83% in 2008 through successive rounds of €5M, €12M, and €5M in 2005, 2006, and 2008, respectively.  This causes me to question the CrunchBase profile saying that Exalead had raised a total of $15.6M.  (You can see €22M since 2005 and the company was founded in 2000.  I’m guessing there was $40M to $50M invested in total, though some reports are making me think it’s twice that.)
  • The prior bullet suggests that Qualis took $133M of the sale price and everybody else split $27M, assuming there were no active liquidation preferences on the Qualis money.
  • Given the European focus, the search focus, and the best-and-brightest angle (Exalead had more than its share of impressive grandes écoles graduates), one wonders why Autonomy didn’t end up owning Exalead, as opposed to a PLM/CAD company.  My guess is Autonomy took a look, but the deal got too pricey for them because they are less interested in paying up for great technology and more interested in buying much larger revenue streams at much lower multiples.  In some sense, Autonomy’s presumed “pass” on this deal is more proof that they are no longer a technology company and are instead a CA-like, Oracle-like financial consolidation play.  (By the way, there’s nothing wrong with being a financial play in my view; I just dislike pretending to be one thing when you’re actually another.)
  • One wonders what role, if any, the other French enterprise search vendor, Sinequa, played in this deal.  They, too, have some great talent from France’s famed Ecole Polytechnique, and presumably some nice technology to go along with it.

Here are some links to other coverage of the deal:

The Perils of Text-Only Search

You won’t be surprised to know that I use a series of Google Alerts to help me track events relevant to Mark Logic. I often have two problems with them:

  • Old content is mis-identified as new. You can ask any Mark Logician about the number of times I’ve forwarded a story that I thought was a hot news item only to discover it was four years old and that Google had nevertheless “alerted” me to its existence. I highlight this here because it bugs me, but I will not drill into it.
  • Content is mis-parsed, resulting in erroneous matches and alerts.

For example, today I received a Google Alert on “Mark Logic” for this Reading Eagle story, entitled Douglass Township Man Waives Hearing on Charges He Fired Gun in Neighborhood. Wondering if we had a wayward employee, I read the story which is about a man named Jeffrey W. Logic, who is charged with firing several shots near a group of people assembled in a neighbor’s front yard.

Here’s the text that generated the hit. (Bolding mine.)

Logic pulled out a gun and fired several shots into one of the car’s tires. He also fired a shot into the pavement, and a stone or bullet fragment ricocheted and struck the driver in the neck, causing a red mark.

Logic started to walk away when two men who had been at the party approached him. Logic pointed the gun at one of them but the man swatted it away. Logic then fired into the ground once more.

What happened? The words mark and logic appear sequentially in the text. But they’re not in the same paragraph, let alone the same sentence. Clearly, if you’ll pardon the pun, this result is a misfire, but it highlights an important problem with full-text search engines: they understand neither the structure nor the semantics of the content they are indexing.

For example, in an XML representation, you might indicate structure by using <para> tags to indicate paragraphs and <sentence> tags to indicate sentences. When searching, you could then say “find all the <sentences> that contain the phrase ‘Mark Logic’” and you wouldn’t get the false match that Google returned.

Awareness of structural markup is important not only because it eliminates false matches, but because it enables you to express more powerful queries (from both a search and retrieval perspective), such as:

  • Find all the <figures> that have <captions> that contain the phrase “survival rate”
  • Return the <authors> and <abstracts> of articles that contain the word “lymphoma” and have <captions> that contain the phrase “survival rate”

Or, more powerfully, perform a citation analysis:

  • Return the <authors> and <abstracts> of <articles> with <citations> to <articles> written by <author> “Sandra Horning”
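A toy sketch of the sentence-scoped idea — in stdlib Python over a hypothetical mini-document, not MarkLogic’s actual query language — shows how structure awareness kills the false match: a phrase that straddles a sentence boundary never matches, because each <sentence> is searched on its own.

```python
import xml.etree.ElementTree as ET

# Hypothetical marked-up snippet echoing the news story: "mark" ends one
# sentence and "Logic" begins the next, so flat text would false-match.
doc = ET.fromstring("""
<article>
  <para>
    <sentence>A ricochet struck the driver, causing a red mark.</sentence>
    <sentence>Logic started to walk away.</sentence>
  </para>
  <para>
    <sentence>Mark Logic sells an XML content server.</sentence>
  </para>
</article>
""")

def sentences_containing(root, phrase):
    # Only a <sentence> whose own text holds the entire phrase matches;
    # adjacency across two sentences is invisible at this scope.
    return [s.text for s in root.iter("sentence") if phrase in (s.text or "")]

hits = sentences_containing(doc, "Mark Logic")
```

Run against this snippet, only the genuine “Mark Logic sells an XML content server.” sentence is returned; the “red mark. / Logic started…” pair that fooled the flat-text alert is not.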

But, even better is the ability for the system to understand semantic markup, for example, coming from a taxonomy or an automatic entity extraction tool.

  • And find them only in the <articles> that contain references to the <drug> “Rituxan” which is a <monoclonal antibody> which the system knows is also called “Rituximab” and “MabThera.”
  • And which contain the <disease> diffuse large b-cell lymphoma, which the system knows is a <b-cell lymphoma> which is a <lymphoma> which is a <blood cancer> which is a <cancer>
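The synonym and is-a expansion just described can be sketched with a couple of dictionaries — all names here are illustrative stand-ins for what a real taxonomy or entity-extraction tool would supply:

```python
# Hypothetical taxonomy fragments; a real system would load these from
# an ontology or an automatic entity-extraction pipeline.
SYNONYMS = {
    "Rituxan": {"Rituxan", "Rituximab", "MabThera"},
}
IS_A = {  # child -> parent
    "diffuse large b-cell lymphoma": "b-cell lymphoma",
    "b-cell lymphoma": "lymphoma",
    "lymphoma": "blood cancer",
    "blood cancer": "cancer",
}

def ancestors(term):
    """Walk the is-a chain upward, so a query for a broad class
    ('cancer') can match a document tagged with a specific subtype."""
    out = []
    while term in IS_A:
        term = IS_A[term]
        out.append(term)
    return out

def matches(query, doc_term):
    # A document term matches if it is the query, a synonym of it,
    # or a descendant of it in the taxonomy.
    names = SYNONYMS.get(query, {query})
    return doc_term in names or query in ancestors(doc_term)
```

With this in place, a query for “Rituxan” matches a document mentioning “MabThera,” and a query for “lymphoma” (or “cancer”) matches one tagged with “diffuse large b-cell lymphoma” — which is exactly what makes the permutations below cheap to run.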

Then think of the simple permutations of this query you can run:

  • Against all monoclonal antibodies, not just Rituxan
  • Against all lymphomas, not just diffuse large b-cell. Or against all blood cancers.
  • Against not just those citing Horning, but against those citing any other author.

And then think of all the queries you can run against this same corpus when you apply any number of any combination of full-text, structural, and semantic constraints.

And pause.

If you wonder why I say that Mark Logic lets you run database-style queries against content* you hopefully now understand why.

It’s not just about catching that Mark is the last word of sentence 20 and Logic is the first word in sentence 21. It’s about combining structural, semantic, and full-text constraints in virtually any combination. And that unleashes a mind-boggling amount of query power. A power, by the way, that we’re accustomed to against data, but are now only beginning to understand against content.


* If you want to search only within items the author bolded for emphasis, you can do that, too!

** Or if you want to search only within footnotes, as they sometimes do in finance, you can do that, too!

Twazzup: A Nice Twitter Search Engine

As part of writing my previous post on swine flu, Twitter, and The Wisdom of Crowds, I ran into a nice, real-time, alternative Twitter search engine, called Twazzup, presumably as in, “what’s up?”

Most folks are probably aware of Summize, which was acquired by Twitter in July 2008 and is now at http://search.twitter.com. I think Twazzup one-ups Twitter search in a few areas:

  • It shows you the TPH, presumably meaning tweets per hour, on a topic. Right now, “swine flu” is running at 6,667 TPH.
  • While they both show hot topics, Twazzup does a much better job of finding and suggesting related queries. For example, Twazzup is suggesting: Mexico, #swineflu, news, avoid. Twitter search is showing cool / nifty queries that aren’t related: #haiku, listening to, “is down.”
  • Twazzup shows a featured tweet (presumably using some authority mechanism), related pictures, and related news stories.

When using Twazzup, John Battelle’s database of intentions springs immediately to mind, and frankly, because it’s real-time and it’s not just search phrases but little proclamations, I think Twitter/Twazzup does a much better job of sticking a thermometer in the public consciousness than a log of Google search phrases.

Using that thermometer what, besides swine flu, is on the public’s mind at present? Apophis, which is evidently an asteroid that might hit the Earth in 2036.

Gosh, folks are in an apocalyptic mood.

Stephen Arnold covers Twazzup here.