Archive for the ‘search engines’ Category

Robots Exclusion Protocol leaves ACAP dead in the water

June 3rd, 2008 by Paul Watson

In my first post on this blog I wrote (critically) about ACAP - a thoroughly wrong-headed attempt by some publishers to enforce stringent limitations on the way search engines index the content that publishers make public on their websites.

Today ACAP is completely dead in the water.

Google, Yahoo and Microsoft (see those links for details) today jointly announced their backing for the existing Robots Exclusion Protocol (REP), which comprises robots.txt, the Sitemap protocol, and individual page meta elements.

The people behind ACAP are probably still claiming that they have the backing of the world’s 4th largest search engine, Exalead, but in the words of Bill Hicks: “Yeah, maybe, but you know what, after the first 3 largest [search engines], there’s a REAL big fucking drop-off.”

The decision by the three big search engines to back the existing REP standard—and to clarify exactly how they implement it—is a great example of these three competitors working together to the benefit of both website owners and searchers.
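For anyone who hasn't looked at REP in a while, here is a minimal sketch of the robots.txt side of it, using Python's standard `urllib.robotparser`. The robots.txt content, the `ExampleBot` user agent and the example.com URLs are all made up for illustration; the page-level equivalent is a meta element such as `<meta name="robots" content="noindex">`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary site (example.com is a placeholder).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse from memory, no network access

# "ExampleBot" is an invented crawler name; it falls under the "*" rules.
private_ok = parser.can_fetch("ExampleBot", "https://example.com/private/page.html")
news_ok = parser.can_fetch("ExampleBot", "https://example.com/news/")
sitemaps = parser.site_maps()  # needs Python 3.8 or later

print(private_ok, news_ok, sitemaps)
```

A well-behaved crawler would run a check like `can_fetch` before requesting each URL. Nothing enforces this, which is rather the point: REP works because the big crawlers choose to honour it.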

UPDATE - 4th June 2008: I just heard from a colleague that the whole ACAP debacle could have been avoided.  ACAP was primarily conceived as a way to convey rights/permissions metadata when feeding data from one partner organisation to another (for example, from a publisher to Amazon).

For some unknown reason the people behind ACAP decided to try to roll it out as a website technology.

This was obviously a huge strategic error, and it backs up my belief that the people behind this technology just don’t get the web. As a protocol for communicating permissions information from a publisher to Amazon or Google Books (not Google Search!) in a data feed it’s probably fine.  But ACAP has no place on the web.

What prompted the people behind ACAP to try to force it onto the web is unimaginable.  This ill-conceived idea was doomed from the start, especially when combined with their secretiveness (they have a forum on their site, but it’s hidden from view and they only give out logins to selected partners) and their attitude when replying to the tidal wave of criticism they received from bloggers.


OpenID on Google, Yahoo!, Microsoft, IBM and VeriSign

February 8th, 2008 by Paul Watson

For those of you who haven’t come into contact with it before, OpenID is an open standard for single sign-on that works on many websites. In their own words: “OpenID eliminates the need for multiple usernames across different websites, simplifying your online experience.” And it does.

Some time ago, I set up the main domain of this website as an OpenID delegate (I use MyOpenID as my OpenID server), which basically means that I can put my own website’s URL into any OpenID login on another OpenID-enabled website, enter my OpenID password when asked for it, and be logged into that website, enabling me to comment on a blog without having to create yet another login for a blog I was unlikely to ever comment on again.
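Delegation of this kind mostly comes down to a couple of `link` elements in the head of your home page, which a consuming site reads to find out where to send you. Below is a rough sketch, in Python, of that discovery step. The `rel` names `openid.server` and `openid.delegate` come from OpenID 1.x HTML discovery; the page content, the `OpenIDLinkFinder` class and both URLs are placeholders invented for the example, not my real settings.

```python
from html.parser import HTMLParser

# Hypothetical home page; both URLs are placeholders, not real endpoints.
HOME_PAGE = """
<html><head>
<title>My site</title>
<link rel="openid.server" href="https://provider.example/server">
<link rel="openid.delegate" href="https://me.provider.example/">
</head><body>Hello</body></html>
"""

class OpenIDLinkFinder(HTMLParser):
    """Collects <link rel="openid.*" href="..."> elements from a page."""

    def __init__(self):
        super().__init__()
        self.links = {}

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower()
        href = attr.get("href")
        if rel.startswith("openid.") and href:
            self.links[rel] = href

finder = OpenIDLinkFinder()
finder.feed(HOME_PAGE)
print(finder.links)
```

The site you are logging into would then redirect you to the `openid.server` address to authenticate, while still treating your own URL as your identity.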

Up until now, OpenID hasn’t had many big players apart from the teen-angst-fest that is LiveJournal - it’s been mainly geek sites such as 37signals and Six Apart (although it’s also available as a plug-in for WordPress and Drupal - I have an OpenID login for this blog, which is built with WordPress).

So yesterday’s announcement that Google, IBM, Microsoft, VeriSign, and Yahoo! have joined the board of the OpenID Foundation is hugely important. To get a couple of those companies joining would have been big. To get all five of them is enormous.

So, if all five of these behemoths implemented OpenID then what would things be like? Well, the prospect of using the same ID (in my case, my personal website’s URL, which I can usually be relied on to remember!) to log in to Google Analytics, Google Webmaster Tools, YouTube, Yahoo! Messenger and Flickr would make my life much easier.

Admittedly both Google and Yahoo! had already made moves towards OpenID - last month Yahoo! announced that Yahoo! IDs would become OpenIDs (effectively tripling the number of OpenID accounts by adding the 248 million Yahoo! IDs to OpenID’s existing 120 million accounts).

Google swiftly followed suit a couple of days later by announcing that its blogging platform Blogger would allow users to use their blog’s URL as an OpenID URL (so long as it was hosted on BlogSpot).

This is stage one - now any BlogSpot-hosted blogger, Flickr user or anyone with a Yahoo! login will be able to log in to external sites that use OpenID. That’s a huge advancement.

But hopefully this latest announcement will take Google and Yahoo! one step further. At the moment the two search giants are still only providers of OpenID (they turn your existing account into an OpenID account), but they won’t accept OpenIDs as logins (although you can log in to comment at Blogger using OpenID).

This hope of mine is strengthened by a comment yesterday by Yahoo!’s Jeremy Zawodny:

“Oh, and before anyone jumps on me about this not being “full” (meaning bi-directional) OpenID support, I’m quite aware of that. Consuming OpenID is a different beast that can’t happen overnight. Give it some time. I’m optimistic that we’ll get there.”


Microsoft, Yahoo, and Google

February 4th, 2008 by Paul Watson

There’s already been a lot of comment about Microsoft’s recent hostile takeover bid for Yahoo (not least from Google itself).

Google’s objections are based around arguments against Microsoft’s monopoly (hardly a threat when Google’s share of the search market is bigger than Microsoft’s and Yahoo’s combined). As Google’s chief legal officer David Drummond puts it:

Could Microsoft now attempt to exert the same sort of inappropriate and illegal influence over the Internet that it did with the PC? While the Internet rewards competitive innovation, Microsoft has frequently sought to establish proprietary monopolies — and then leverage its dominance into new, adjacent markets.

Could the acquisition of Yahoo! allow Microsoft — despite its legacy of serious legal and regulatory offenses — to extend unfair practices from browsers and operating systems to the Internet?

While I agree with Drummond’s synopsis of Microsoft’s strategy of creating proprietary monopolies I don’t think this is going to be a problem here.

Firstly, Microsoft will certainly pursue this strategy when they already dominate a market, but when faced with competition they can be forced to “play nice”.

Take the rise of Firefox - since Mozilla’s standards-compliant browser gained 20% or so of the market, Microsoft have started building standards-compliance into Internet Explorer. Admittedly, the latest announcement about IE8’s arse-about-face browser versioning metadata shows that their proposed implementation is terrible, but there’s no doubting that IE7 is more standards-compliant than IE6 was, and the news that IE8 passes the Acid2 test is very encouraging indeed.

So, when faced with a significant competitor fighting on an “open/standards-compliant” ticket, Microsoft will toe the line.

Secondly, Yahoo’s big problem is that, in order to compete with Google, it needs to become less like Microsoft. If Yahoo is bought by Microsoft then Yahoo will fail - because of all the things that David Drummond mentions. And most fundamentally, as Umair Haque says:

“Neither company has the DNA to take on Google (let alone the massive number of startups waiting in the wings). Sure, they might collectively have the resources.

But DNA will always constrain YahooSoft from utilizing those resources in ways that create value.”

Bill Gates might be thinking that if he buys Yahoo then he can add their 20% of the search market to Microsoft’s 12% to make a combined 32% against Google’s 54% share, but given a Microsoftization of Yahoo, the chances are that Yahoo’s users will jump ship to Google, giving Google a worrying 74% market share.

I say “worrying” because, while I really like some of the things Google have done, that doesn’t mean that I want them in a position of unquestionable dominance. At the moment they are undoubtedly the largest search engine, but Yahoo are still a visible competitor - an alternative if I want one (particularly for APIs).

Finally it would be a shame to see Yahoo subsumed by Microsoft. I don’t want to log into Flickr with a Microsoft Passport or Windows Live ID (which would surely be the result of a takeover).

While Yahoo are still struggling behind Google, they are doing some good stuff. Their online applications such as Flickr are good, and they actually engage with the web community (they put on good presentations at d.Construct 2006 and @media 2006 in London).

I think all that would end if they were bought by Microsoft.


Automated Content Access Protocol (ACAP)

December 4th, 2007 by Paul Watson

OK, well I’ve spent some time reviewing this new specification.

It’s a mixture of a couple of useful new qualifiers to the old robots.txt standard and a lot of anally-retentive control-freakery written by people who still don’t get “the internet”.

The good points:

  • Extends robots.txt to allow site owners to define sets of folders/files by regular expression and apply instructions to bots “en masse” - nice idea.
  • Extends robots.txt to add time-based crawl permissions (e.g. you can specify that the folder “/news/2007/” is only indexed until the end of 2007 - after which you’d want search engines to drop last year’s news from their indices and start crawling the 2008 news instead).
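
To make those two good ideas concrete, here is a small Python sketch of how a crawler might combine pattern matching with a time limit. This is not ACAP syntax, and it is not how any search engine actually implements it: the `CrawlRule` structure, the `may_index` function and the example paths are all assumptions invented for illustration.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class CrawlRule:
    pattern: str                  # regular expression matched against the URL path
    allow: bool                   # whether matching paths may be indexed
    until: Optional[date] = None  # last day on which an "allow" still applies

# Hypothetical rule set for an imaginary site.
RULES = [
    CrawlRule(r"^/news/2007/", allow=True, until=date(2007, 12, 31)),
    CrawlRule(r"^/drafts/.*\.html$", allow=False),
]

def may_index(path: str, today: date, rules=RULES) -> bool:
    """Return True if `path` may be indexed on `today`; the first matching rule wins."""
    for rule in rules:
        if re.search(rule.pattern, path):
            if rule.allow and rule.until is not None:
                return today <= rule.until  # a time-limited permission lapses
            return rule.allow
    return True  # no rule matched: crawlable by default, as with robots.txt

print(may_index("/news/2007/story.html", date(2007, 12, 4)))  # still inside the window
print(may_index("/news/2007/story.html", date(2008, 1, 1)))   # window has closed
```

The first-match-wins ordering is just one possible design choice; a real specification would have to say exactly how overlapping patterns are resolved.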

The bloody terrible points (control-freakery and lack of understanding):

  • Tries to introduce a new protocol to disallow an aggregator site (e.g. Google) from displaying a thumbnail of a page - Why?!?
  • Tries to introduce a new protocol to disallow an aggregator site (e.g. Google) from displaying a snippet of content - Again, why?!?
  • Tries to introduce a new protocol that allows publishers to set the exact length (to the character) of any snippet shown on an aggregator site. Oh dear god, why are publishers so anal about this!
  • Tries to introduce a new protocol which demands that aggregators don’t parse the content they’ve crawled outside of a specific context. No, no, no!
  • Tries to embed take-down notices to crawlers within robots.txt. Oh, FFS - this is publishing at its most over-controlling (and therefore self-destructive).

Thankfully there is no legal requirement that the search engines take any notice whatsoever of this new “technical framework”.

I kinda hope that the tech community takes the good points out of this spec (pattern-matching, time constraints etc) and just upgrades the good old robots.txt standard, ignoring the worst excesses of control-freakery that the publishing industry have slipped in.

A bit of blog reaction to ACAP:

  • Lauren Weinstein - “…Boiled down to the bottom line, I can’t help but sense that the intended shift in responsibility that appears to be associated with ACAP could lead to an entire new wave of litigation and possible information restrictions — enriching lawyers to be sure — but quite possibly being a significant negative development for Internet users in general.”
  • Ian Douglas (Daily Telegraph) - “…The new protocol focuses entirely on the desires of publishers, and only those publishers who fear what web users will do with the content if they don’t retain control over it at every point…ACAP might well be adopted by a lot of publishers (although not, so far, by any search engines anyone has heard of), but we’ll all be a little poorer as a result.”
  • Martin Belam - “It seems like a weak electronic online DRM - with the vague promise that in the future more ’stuff’ will be published, precisely because you can do less with it…”

In the interests of fairness I tried to find a positive article about ACAP, but there’s absolutely nothing.

Luckily this ACAP protocol does not have the support of the search engines and so is likely to fail and die.

The ACAP site does brag that “Major search engines are engaged in the project. Exalead, the world’s fourth largest search engine has been a full participant in the project.”

Exalead? Who the hell are they? If you can’t claim the involvement of Google and/or Yahoo in any search-engine-specific project then you’re dead on your feet.

And in the case of ACAP, I’m glad.
