dark web

This page looks at what has variously been labelled the dark web (or dark internet), invisible web, deep web or hidden web - online resources that are inaccessible (for example because they are protected by firewalls) or that are not identified by search engines because they feature a robot exclusion tag in their metadata.

It covers an introduction, the contested size of the hidden web, the basis of that invisibility and strategies for retrieval.

introduction

Much of the content on the web - and more broadly on the net - is not readily identifiable through directories or public search engines such as Google, nor readily accessible once it has been identified.

Those navigation tools may broadly identify online resources that are not publicly available (for example by providing an abstract of a report or journal article). Some content may simply be undetected by search engines. 'Google Still Not Indexing Hidden Web URLs' by Kat Hagedorn & Joshua Santelli in 14(7) D-Lib (2008), for example, comments that the leading search engine - and its peers - misses substantial OAI content.

Unavailability may reflect the content owner's or publisher's desire to wholly exclude outsiders from an intranet or database. It may instead reflect the exigencies of online publishing, with some publishers providing wide access to subscribers or on an item-by-item (or sessional) basis after payment of an access fee. Some commercial content appears online on a short-term basis (eg for a day or a week) before moving behind a firewall. Other content is detectable through abstracts but is distributed by email rather than on the web.

The elusive nature of such content has resulted in characterisations of the dark web, the deep (as distinct from surface) web, the invisible web or the hidden web. Those characterisations can be misleading, as some content resides on the internet rather than on the part of the net that we label the web.

how big

As noted in discussion elsewhere on this site regarding internet metrics, the size and composition of the 'invisible web' is contentious.

That is partly because of definitional disagreements.

Is 'invisibility' attributable to deficiencies in search engine technology, given that there are substantial numbers of web pages that can be accessed by ordinary people (ie without a password or payment) but are not indexed by the major search engines?

Elsewhere we have noted claims that the largest public search engines regularly visit and index less than 20% of all static web pages, arguably not a tragedy given the ephemeral nature of much blogging and the prevalence of domain name tasting.

Does 'deepness' include corporate databases and intranets that are not publicly accessible but have some connection to the net and from which, for example, an employee with appropriate authorisation might download a document while away from the office?

Many - perhaps most - corporate networks have some connection to the net, on a permanent or ad hoc basis. Should those networks, and the millions of files they hold, be included in the dark net? Major cultural institutions now provide access to large bibliographic and image databases, with data being displayed 'on the fly'. Is that content part of the deep web?

Contention also reflects uncertainty about data, with disagreement about systematic counting of static websites and pages, the indeterminate number of sites that dynamically display content and the number of pages so displayed.

It is common to encounter claims that the overall number of 'pages' in the surface and submerged web is around 4 trillion. It is less common to see a detailed methodology for derivation of that number, or figures from major search engines such as Google about both the number of sites/pages they have spidered and their estimates of what has not been spidered. There has been no authoritative inventory of commercial publishing sites and cultural institution sites.

basis

Why can't resources be readily found and accessed? Reasons for invisibility vary widely.

Some content is in fact not meant to be invisible. It may be undetected because it is new: search engines do not purport to provide instant identification of all content on an ongoing basis and lags in discovery are common. The content may have been found by a search engine but is not displayed because the site/page has been 'sandboxed' (a mechanism used by some engines to inhibit problematical publishing by adult content vendors and similar businesses).

Some content is invisible because of search engine priorities: the search engine is programmed not to bother with the less frequented parts of cyberspace, in particular those pages that get no traffic and that are not acknowledged by other sites.

A rule of thumb is that a search engine's failure to show results does not exclude the content's existence.

Some publicly-accessible content is displayed 'on the fly', ie only appears when there is a request from a user. Such a request might involve entering a search query on a whole-of-web or site-specific search engine. It might instead involve clicking a link or using an online form to access information otherwise held behind a firewall, often information aggregated in response to the specific request.

Examples include job listings, financial data services (with currency or share prices being updated on an ongoing basis in real time), online white pages and 'colour pages' directories, output from a range of government databases (eg patent registers) and travel sites (with pricing of some airline seats and hotel rooms for example reflecting the demand expressed by queries from users of the site).

Some engines have traditionally ignored dynamically generated web pages whose URLs feature a long string of parameters (eg a date plus search terms plus region). The rationale often provided is that such pages are likely to duplicate cached content; some specialists have occasionally fretted that the spider will be induced to go around in circles.
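
As a purely hypothetical illustration, a dynamically generated job listing might be served from an address such as

   http://www.example.com/jobs/results?region=nsw&date=2008-07-01&q=accountant&session=a91f

where the query string (everything after the '?') encodes the request. Two such strings may resolve to identical content, or to content that changes from one visit to the next, which is one reason spiders have treated parameter-laden URLs warily.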

Some content is meant to be invisible, with access being provided only to authorised users (a class that usually does not include whole of web search engines). The restriction may involve use of a password. Access may involve an ongoing subscription (often to a whole database). It may instead involve sessional, non-subscription use of a whole database or merely delivery of an individual document (with much online scholarly journal publishing for example selling a PDF of a single article at a price equivalent to a hardcover academic book).

Such sites are proliferating, serving specialist markets (often corporate/institutional users rather than individuals without a business/academic affiliation). They include academic library subscriptions, reports by some major technical and financial publishers, and some newspapers.

Many newspapers have adopted a slightly different strategy, providing free access to excerpts (with engines such as Google thus being able to spider a 'teaser' rather than the full content of a particular item or even set of multimedia files) or pulling 'archival' content behind a firewall after a certain period of time.

That 'time-limited access' often allows ongoing access by subscribers, who may have paid for the privilege or may instead merely have supplied information that allows the publisher to build a fuzzy picture of their demographics. Removal often means that a search engine retains the URL in its index, with future visits to that page being met by a sign-up form. The boundaries of the dark web are porous: some content is cached by an engine and can be discerned through a diligent search.
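
By way of illustration, at the time of writing Google supports query operators along the lines of

   cache:www.example.com/report.html
   site:example.com "annual report"

the first requesting the engine's stored copy of a page, the second confining results to a single domain (the addresses shown here are placeholders). Either can surface a cached copy or a teaser page after the original has moved behind a sign-up form.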

Some content is in fact online, without a password or other restriction, but is invisible to search engines (rather than to anyone who knows the URL for the particular file).

That invisibility may be based on the site operator's use of the 'robot exclusion' or robots.txt file or tag, which signals to a search engine - when spidered - that the particular file or part of the site is not to be indexed. Invisibility may be even more low-tech, based on the absence of a link pointing to a particular page (ie it does not form part of the hierarchical relationship in a static web site, with the homepage/index page pointing to subsidiary pages).
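
For illustration, a minimal robots.txt file asking all compliant spiders to stay out of one directory (the directory name here is invented) might read

   User-agent: *
   Disallow: /reports/

while an individual page can carry an equivalent instruction in its metadata:

   <meta name="robots" content="noindex, nofollow">

Compliance is voluntary: the exclusion protocol merely requests that well-behaved spiders not index the material; it does not prevent access by anyone who knows the URL.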

Geocoding (aka geo-tagging) and other filtering of content means that for some people particular parts of the web - for example those that deal with human rights, criticism of their ruler or adult content - are dark. Filtering may be intended to restrict access by a nation's population to offensive or subversive content. It may instead form part of a business strategy, with online broadcasters for example trialling systems that seek to restrict access outside specific locations. Such restriction may assume that there is a close and reliable correlation between a geographic location and the IP address of the user's computer.
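
A minimal sketch of that assumed correlation, in Python and using an invented address-to-country table rather than a real geolocation database (the address blocks, country codes and function names below are purely illustrative), might look like this:

   import ipaddress

   # hypothetical mapping of address blocks to country codes;
   # real services rely on commercial geolocation databases
   GEO_TABLE = {
       ipaddress.ip_network('203.0.113.0/24'): 'AU',
       ipaddress.ip_network('198.51.100.0/24'): 'US',
   }

   ALLOWED = {'AU'}  # eg a broadcaster licensed only for Australian viewers

   def country_of(address):
       ip = ipaddress.ip_address(address)
       for network, country in GEO_TABLE.items():
           if ip in network:
               return country
       return None  # address not in the table

   def may_view(address):
       return country_of(address) in ALLOWED

   print(may_view('203.0.113.7'))   # True  - treated as Australian
   print(may_view('198.51.100.7'))  # False - treated as American

As the paragraph above suggests, the weakness lies in the premise: proxies, corporate gateways and address reallocation can place a viewer's apparent location in the 'wrong' country.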

retrieval

Savvy researchers, of course, do not rely solely on Google, MSN or Yahoo! (and certainly not only on results from the first page of a search list). Much information in the dark web can be identified and even accessed through diligence or social networks, given that the best search engine is often a contact who knows what information is available and has the key needed to unlock the door.

One response is to use specialist search engines such as JSTOR and Medline to identify rather than access documents. Some documents may be published in print formats, eg in journals that can be consulted in major libraries or accessed on a document-by-document basis through inter-library copying arrangements.

Another response, as noted above, is to use acquaintances - or even generous librarians, officials and corporate employees - to gain access via an institutional or corporate subscription to a commercial database such as LexisNexis or Factiva.

A third response is to pay for the content, whether on an item by item basis (offered by major journal publishers such as Elsevier and Blackwell), on a sessional basis or on a subscription basis.

A further response is to make use of site-specific search engines and directories, ie to navigate through corporate or other sites in search of documents or dynamically generated information that is readily available to a visitor but does not appear on a whole-of-web engine such as MSN. That response can be important given lags in data collection by the major engines, with delays of weeks or months being common before new information appears in their search results.





next page (image searching)





version of July 2008
© Bruce Arnold
caslon.com.au | caslon analytics