dark web
This page looks at what has variously been labelled the
dark web (or dark internet), invisible web, deep web or
hidden web - online resources that are inaccessible (for
example because they are protected by firewalls) or that are not
identified by search engines because they feature a robot
exclusion tag in their metadata.
It covers an introduction, the question of how big the hidden
web is, the basis of that invisibility and approaches to retrieval.
introduction
Much of the content on the web - and more broadly on the
net - is not readily identifiable through directories
or public search engines such as Google, and is not necessarily
accessible even when it has been identified.
Those navigation tools may broadly identify online resources
that are not publicly available (for example by providing an
abstract of a report or journal article). Some content
may simply go undetected by search engines. 'Google Still
Not Indexing Hidden Web URLs' by
Kat Hagedorn & Joshua Santelli in 14(7) D-Lib
(2008), for example, comments
that the leading search engine and its peers miss
substantial OAI content.
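One reason engines can miss such material is that OAI repositories
typically expose their records through an OAI-PMH harvesting
interface rather than through ordinary hyperlinks, so a crawler
that only follows links never encounters them. A harvest request
takes a form along the lines of the following (the base URL is
hypothetical; the verb and metadataPrefix parameters are part of
the protocol) -

   http://repository.example.edu/oai?verb=ListRecords&metadataPrefix=oai_dc

and returns XML metadata records about items in the repository
rather than the items themselves.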
Unavailability may reflect the content owner's or publisher's
desire to wholly exclude access by outsiders to an intranet
or database. It may instead reflect the exigencies of
online publishing, with
some publishers providing wide access to subscribers or
on an item-by-item (or sessional) basis after payment
of an access fee. Some commercial content appears online
on a short-term basis (eg for a day or a week) before
moving behind a firewall. Other content is detectable
through abstracts but is distributed by email
rather than on the web.
The elusive nature of such content has resulted in characterisations
of the dark web, the deep (as distinct from surface) web,
the invisible web or the hidden web. Those characterisations
can be misleading, as some content resides elsewhere on the
internet rather than on the part of the net that we label the web.
how big
As noted in discussion elsewhere on this site regarding
internet metrics, the size
and composition of the 'invisible web' is contentious.
That is partly because of definitional disagreements.
Is 'invisibility' attributable to deficiencies in search
engine technology, given that there are substantial numbers
of web pages that can be accessed by ordinary people (ie
without a password or payment) but are not indexed by
the major search engines?
Elsewhere we have noted claims that the largest public
search engines regularly visit and index less than 20%
of all static web pages, arguably not a tragedy given
the ephemeral nature of much blogging
and the prevalence of domain name
tasting.
Does 'deepness' include corporate databases and intranets
that are not publicly accessible but have some connection
to the net and from which, for example, an employee with
appropriate authorisation might download a document while
away from the office?
Many - perhaps most - corporate networks have some connection
to the net, on a permanent or ad hoc basis. Should those
networks, and the millions of files they hold, be included
in the dark net? Major cultural institutions now provide
access to large bibliographic and image databases, with
data being displayed 'on the fly'. Is that content part
of the deep web?
Contention also reflects uncertainty about data, with
disagreement about systematic counting of static websites
and pages, the indeterminate number of sites that dynamically
display content and the number of pages so displayed.
It is common to encounter claims that the overall number
of 'pages' in the surface and submerged web is around
4 trillion. It is less common to see a detailed methodology
for derivation of that number, or figures from major search
engines such as Google about both the number of sites/pages
they have spidered and their estimates of what has not
been spidered. There has been no authoritative inventory
of commercial publishing sites and cultural institution
sites.
basis
Why can't resources be readily found and accessed? Reasons
for invisibility vary widely.
Some content is in fact not meant to be invisible. It
may be undetected because it is new: search engines do
not purport to provide instant identification of all content
on an ongoing basis and lags in discovery are common.
The content may have been found by a search engine but
is not displayed because the site/page has been 'sandboxed'
(a mechanism used by some engines to inhibit problematical
publishing by adult content vendors and similar businesses).
Some content is invisible because of search engine priorities:
the search engine is programmed not to bother with the
less frequented parts of cyberspace, in particular those
pages that get no traffic and that are not acknowledged
by other sites.
A rule of thumb is that the absence of results from a search
engine does not mean that the content does not exist.
Some publicly-accessible content is displayed 'on the
fly', ie only appears when there is a request from a user.
Such a request might involve entering a search query on
a whole-of-web or site-specific search engine. It might
instead involve clicking a link or using an online form
to access information otherwise held behind a firewall,
often information aggregated in response to the specific
request.
Examples include job listings, financial data services
(with currency or share prices being updated on an ongoing
basis in real time), online white pages and 'colour
pages' directories, output from a range of government
databases (eg patent registers)
and travel sites (with pricing of some airline seats and
hotel rooms for example reflecting the demand expressed
by queries from users of the site).
Some engines have traditionally ignored dynamically generated
web pages whose URLs feature a long string of parameters
(eg a date plus search terms plus region). The rationale
often provided is that such pages are likely to duplicate
cached content; some specialists have occasionally fretted
that the spider will be induced to go around in circles.
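As an illustration (the domain and parameter names below are
hypothetical), a dynamically generated listing might sit behind
a URL such as

   http://www.example.com/jobs/search?region=nsw&date=2008-06-01&q=librarian&page=3

A crawler configured to skip or truncate long query strings will
never request that page, so its content stays out of the index
even though any visitor who fills in the site's search form can
see it.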
Some content is meant to be invisible, with access
being provided only to authorised users (a class that
usually does not include whole of web search engines).
The restriction may involve use of a password. Access
may involve an ongoing subscription (often to a whole
database). It may instead involve sessional, non-subscription
use of a whole database or merely delivery of an individual
document (with much online scholarly journal publishing
for example selling a PDF of a single article at a price
equivalent to a hardcover academic book).
Such sites are proliferating, serving specialist markets
(often corporate/institutional users rather than individuals
without a business/academic affiliation). They include
academic library subscriptions, reports by some major
technical and financial publishers, and some newspapers.
Many newspapers have adopted a slightly different strategy,
providing free access to excerpts (with engines such as
Google thus being able to spider a 'teaser' rather than
the full content of a particular item or even set of multimedia
files) or pulling 'archival' content behind a firewall
after a certain period of time.
That 'time-limited access' often allows ongoing access
by subscribers, who may have paid for the privilege or
may instead merely have supplied information that allows
the publisher to build a fuzzy picture of their demographics.
After removal, search engines often preserve the URL, with
future visits to that page being met by a sign-up form.
The boundaries of the dark web are porous: some content
is cached by an engine and can be discerned through a
diligent search.
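One route through that porous boundary, where it is available,
is the 'cache:' operator offered by Google, which retrieves the
engine's stored copy of a page. A query along the lines of

   cache:www.example.com/report-2007.html

(the URL is hypothetical) may surface text that has since been
pulled behind a sign-up form, although cached copies are neither
complete nor permanent.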
Some content is in fact online, without a password or
other restriction, but is invisible to search engines
(rather than to anyone who knows the URL for the particular
file).
That invisibility may be based on the site operator's use
of the 'robot exclusion' or robots.txt
file or tag, which signals to a search engine - when spidered
- that the particular file or part of the site is not
to be indexed. Invisibility may be even more low-tech,
based on the absence of a link pointing to a particular
page (ie it does not form part of the hierarchical relationship
in a static web site, with the homepage/index page pointing
to subsidiary pages).
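The exclusion mechanisms themselves are simple. A robots.txt
file at the root of a site might read

   User-agent: *
   Disallow: /drafts/
   Disallow: /members/

and an individual page can carry a tag such as

   <meta name="robots" content="noindex, nofollow">

(the directory names here are illustrative only). Both are
requests rather than enforcement: well-behaved spiders honour
them, but they do not stop a visitor - or a less scrupulous
crawler - who already has the URL.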
Geocoding (aka geo-tagging)
and other filtering
of content mean that for some people particular parts
of the web - for example those that deal with human rights,
criticism of their ruler or adult content - are dark.
Filtering may be intended to restrict access by a nation's
population to offensive or subversive content. It may
instead form part of a business strategy, with online
broadcasters for example trialling systems that seek to
restrict access outside specific locations. Such restriction
may assume that there is a close and reliable correlation
between a geographic location and the IP address of the
user's computer.
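A minimal sketch of that assumption, in Python, with an invented
prefix table standing in for the commercial geo-IP databases that
broadcasters and other filterers actually use:

   # Sketch only: the prefix-to-country table is invented for
   # illustration; real geolocation services rely on large,
   # frequently updated databases rather than a handful of prefixes.
   ALLOWED_COUNTRIES = {"AU", "NZ"}

   PREFIX_TO_COUNTRY = {
       "203.0.113.": "AU",   # hypothetical allocation
       "198.51.100.": "US",  # hypothetical allocation
   }

   def country_for(ip_address):
       # Crudely map an address to a country via its prefix.
       for prefix, country in PREFIX_TO_COUNTRY.items():
           if ip_address.startswith(prefix):
               return country
       return None  # unknown addresses fall through

   def is_blocked(ip_address):
       # Content is 'dark' for visitors outside the licensed region.
       return country_for(ip_address) not in ALLOWED_COUNTRIES

   print(is_blocked("203.0.113.7"))   # False: treated as an Australian visitor
   print(is_blocked("198.51.100.9"))  # True: treated as outside the region

Proxies, VPNs and corporate gateways break the assumed correlation
between address and location, which is one reason such blocking
is leaky.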
retrieval
Savvy researchers, of course, do not rely solely on Google,
MSN or Yahoo! (and certainly not only on results from
the first page of a search list). Much information in
the dark web can be identified and even accessed through
diligence or social networks, given that the best search
engine is often a contact who knows what information is
available and has the key needed to unlock the door.
One response is to use specialist search engines such as
JSTOR and Medline to identify rather than access documents.
Some documents may be published in print formats, eg in
journals that can be consulted in major libraries or accessed
on a document by document basis through inter-library
copying arrangements.
Another response, as noted above, is to use acquaintances
- or even generous librarians, officials and corporate
employees - to gain access via an institutional or corporate
subscription to a commercial database such as LexisNexis
or Factiva.
A third response is to pay for the content, whether on
an item by item basis (offered by major journal publishers
such as Elsevier and Blackwell), on a sessional basis
or on a subscription basis.
A further response is to make use of site-specific search
engines and directories, ie navigate through corporate
or other sites in search of documents or dynamically generated
information that is readily available to a visitor but
does not appear on a whole of web engine such as MSN.
That response can be important given the lags in
data collection by major engines, with delays of weeks
or months being common before new information appears
in their search results.
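Most whole-of-web engines also support a 'site:' operator that
restricts results to a single domain, a partial substitute for a
site-specific engine where the site has been spidered at all. A
query such as

   site:example.gov.au "patent register"

(the domain is hypothetical) only helps with pages the engine has
already found; content behind forms or firewalls still requires
the approaches described above.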