Friday, August 1, 2008

Hidden Web: Don't Love It, Leave It

There's been a bit of buzz lately about Google's "failure" to effectively search the "hidden (deep) web". In the discussions I've been seeing, the hidden web is equated with stuff in academic and digital library repositories, i.e., "OAI-based resources" (which I assume to mean OAI/PMH).

I have to say: repositories != hidden web. The hidden web is simply the stuff the search engines don't find. Systems that expose information about their content only through OAI/PMH interfaces might make up a small part of the hidden web, because the bots never get to see that content, but frankly the hidden web holds way more stuff than what's in Fedora and DSpace at universities. Just ask Wikipedia.

The assertion that repository content == the hidden web is circular and false rhetoric that obscures the real problem: people are fighting the web instead of working with it. If you fight it, it will ignore you. This sort of thinking also makes hay for enterprises like the Internet Search Environment Number that seem to me to be trying to carve out business models that exploit, perpetuate and promote the cloistering of content and the rationing of information discovery.

Yesterday, Peter Millington posted what's effectively the antidote on the JISC-REPOSITORIES list (cross-posted to other lists). I reproduce it here in full because it's good advice not just for repositories but for anybody who is putting complex collections of content on the web and wants that content to be discoverable and useful:

Ways to snatch defeat from the jaws of victory
Peter Millington
SHERPA Technical Development Officer
University of Nottingham

You may have set up your repository and filled it with interesting papers, but it is still possible to screw things up technically so that search engines and harvesters cannot index your material. Here are seven common gotchas spotted by SHERPA:
  1. Require all visitors to have a username and password
  2. Do not have a 'Browse' interface with hyperlinks between pages
  3. Set a 'robots.txt' file and/or use 'robots' meta tags in HTML headers that prevent search engine crawling
  4. Restrict access to embargoed and/or other (selected) full texts
  5. Accept poor quality or restrictive PDF files
  6. Hide your OAI Base URL
  7. Have awkward URLs
Full explanations and some solutions are given at: http://www.sherpa.ac.uk/documents/ways-to-screw-up.html

If you know of any other ways in which things may go awry, please contact us and we will consider adding them to the list.

I'm happy to say: Pleiades gets a clean bill of health if we count nos. 5 and 6 as non-applicable (since we're not a repository per se and we don't have a compelling use case for OAI/PMH or PDF).
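
For what it's worth, gotcha no. 3 is easy to check for yourself. Here is a minimal sketch, assuming Python's standard urllib.robotparser module. The two robots.txt files, the crawlable helper and the repository.example.org URL are all made up for illustration; none of them comes from SHERPA or from any real repository.

```python
from urllib.robotparser import RobotFileParser


def crawlable(robots_txt: str, url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt that shuts every crawler out of the whole site.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

# Hypothetical robots.txt that only keeps crawlers out of an admin area.
BLOCK_ADMIN_ONLY = """\
User-agent: *
Disallow: /admin/
"""

# Hypothetical item page in a repository's browse interface.
url = "http://repository.example.org/browse/item/123"

print(crawlable(BLOCK_ALL, url))
print(crawlable(BLOCK_ADMIN_ONLY, url))
```

This only looks at robots.txt. A 'robots' meta tag in the page header can still keep a page out of the index, so a False here is conclusive but a True is not.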

Disclaimer: we are exploring the use of OAI/ORE through our Concordia project. One of the things we like most about it is that its primary serialization format is Atom, which is already indexed by the big search engines. With the web.

2 comments:

Matt Theobald said...

Hello, Tom,

ISEN doesn't really exist yet, so I can understand how it might be difficult to determine what exactly it is. No cloistering, no rationing, just easier open access to interfaces to deep web content that is already on the network. ISEN is not a subscription model for information consumers; that would never work.
ISEN is about universal open access to free information that is simply better organized.

Feel free to contact me with questions or concerns about ISEN.

Unknown said...

Hi Theo:

Thanks for offering this clarification of intent. I look forward to seeing more details about plans for ISEN appear on its website and blog. I gained the impressions expressed above after reading both.

Best,
Tom