online content

This page examines the content of the web, including estimates of the number of pages and images, the volatility of content and the severity of link-rot.

It covers -

  • the size of the web
  • the non-text net
  • the number of personal sites
  • the volatility of content
  • link-rot
  • the persistence of domains

size of the web

The precise number of pages or documents on the internet perhaps matters less than a broad sense of its size (and growth), and of the disagreement about those figures.

As of 2001 the latest academic estimate from the US was that the web had some 800 million pages - this page is 800 million plus one - with Northern Light, supposedly the most inclusive search engine at that time, covering less than 16% of that figure. The O'Neill, Lavoie & Bennett Trends paper suggested that by 2002 the figure had grown to 1.4 billion publicly accessible pages.

Google reported that by December 2002 it had indexed almost 2.5 billion individual pages, increasing to 3.1 billion by February 2003. As of January 2004 Google claimed to cover 3,307,998,701 pages. In February 2004 it announced that it covered "6 billion items": 4.28 billion web pages, 880 million images and 845 million Usenet messages.

Three of the seminal papers - often referred to as the 'NEC studies' - are How Big is the Web (HBW), Accessibility & Distribution of Information on the Web (ADIW) and the 1998 and 1999 Search Engine Coverage Update (SECU) by Steve Lawrence & C Lee Giles. The most recent of those papers suggests that the web is growing faster than search engine coverage and that dead links are increasingly common.

In early 2001 Inktomi and NEC Research estimated that there were more than a billion "unique pages". IDC's The Global Market Forecast for Internet Usage & Commerce report forecast that the global online population would grow from 240 million in 1999 to 602 million in 2003, with the number of web pages climbing from 2.1 billion in 1999 to 16.5 billion in 2003.

US metrics company Cyveillance estimated that there were over 2.1 billion pages on the web (heading towards 4 billion by the end of 2001) with the "average page" having 23 internal links, 6 external links and 14 images. The US Federal Library & Information Center estimated that the federal government alone had over 27 million publicly accessible pages online.

BrightPlanet, a new entrant to the search engine market, claims that "the deep Web" contains "550 billion individual documents", with only a small fraction indexed by its competitors. 

That figure, like many web statistics, is problematical. More importantly, unlike the 'surface web', deep web content is generally not publicly accessible: it typically involves a subscription or per-item fee, or resides on a corporate intranet. That is one reason for concern about digital divides. It is also a reason why academic and public libraries have an ongoing role.

The major 2001 and 2003 studies by Hal Varian & Peter Lyman scoping the 'information universe' - quantifying what is produced, transmitted, consumed and archived - are also of relevance. The 1997 paper A Methodology for Sampling the World Wide Web by Edward O'Neill, Patrick McClain & Brian Lavoie will interest statistics buffs.

In 2005 the National Library of Australia (PDF) published an initial report on what was claimed as "the first whole Australian domain harvest", identifying some "185 million unique documents" from 811,523 hosts. 67% of the documents were text/html, 17% were JPEG images, 11% were GIF images and 1.6% were PDFs. The harvested content came to 6.69 terabytes.

the non-text net

There is no consensus about

  • the number of still images (eg photographs), video recordings, animations and sound recordings on the net
  • the rates at which that content is growing
  • the nations from which most of that content is originating

In February 2005 Google announced that its cache of the web had reached over a billion images, up from some 880 million in February 2004. Some questions about audio and image searching are discussed elsewhere in this guide.

number of personal sites

Figures about the number of 'personal sites' (homepages and blogs) are problematical.

Sonia Livingstone & Magdalena Bober's 2004 UK Children Go Online: Surveying the experiences of young people & their parents (PDF) has been interpreted as suggesting that "34% of UK kids" (the 9 to 19 cohort) have personal pages. Research such as The Construction of Identity in the Personal Homepages of Adolescents, a 1998 paper by Daniel Chandler & Dilwyn Roberts-Young, indicates that most homepages are created by older adolescents. One might infer from figures highlighted in our discussion of blogging that few personal homepages are actively maintained (or indeed visited).

volatility of content

Wallace Koehler's paper on Digital Libraries & WWW Persistence estimates that the 'half life' of a web page is less than two years and the half life of a site is a bit more than two years.

That is in line with more restricted research such as the 1997 Rate of Change & other Metrics: a Live Study of the World Wide Web paper by Fred Douglis, Anja Feldmann & Balachander Krishnamurthy and the 2000 paper How dynamic is the web? by Brian Brewington & George Cybenko. The latter estimated that 20% of pages are less than twelve days old, with only 25% older than one year.
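As a back-of-envelope illustration, a two-year half-life implies that roughly half of today's pages will still resolve in two years, a quarter in four, and so on. The sketch below assumes simple exponential decay - an assumption made purely for illustration, not the model used in the papers cited above - and the function name and sample horizons are likewise illustrative.

  # Illustrative only: what a two-year half-life would imply under a
  # simple exponential decay assumption (not Koehler's actual methodology).

  def surviving_fraction(years, half_life_years=2.0):
      """Fraction of pages expected to still resolve after `years` years."""
      return 0.5 ** (years / half_life_years)

  if __name__ == "__main__":
      for years in (1, 2, 4, 6):
          print(f"after {years} year(s): {surviving_fraction(years):.1%} still resolve")
      # roughly 70.7%, 50%, 25% and 12.5% respectively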

The O'Neill, Lavoie & Bennett Trends paper comments that

while the public Web, in terms of number of sites, is getting smaller, public Web sites themselves are getting larger. In 2001, the average number of pages per public site was 413; in 2002, that number had increased to 441.

Alexander Halavais' 'Social Weather' On The Web (PDF) suggests that blogs are the most dynamic web content.

of links

There have been few large-scale studies of link-rot, ie broken links that result in the 404 'not found' error in your browser.
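Detecting link-rot is mechanically simple: request each cited URL and record the HTTP status. The following is a minimal sketch using only the Python standard library; the sample URLs are placeholders rather than addresses cited on this page, and a real survey would also follow redirects, retry transient failures and rate-limit its requests.

  # Minimal link-rot checker using only the standard library.
  # It issues a HEAD request per URL and treats 404/410 (or a failed
  # connection) as a candidate dead link.

  import urllib.request
  import urllib.error

  def check_link(url, timeout=10):
      """Return the HTTP status code for url, or None if the request fails."""
      request = urllib.request.Request(url, method="HEAD")
      try:
          with urllib.request.urlopen(request, timeout=timeout) as response:
              return response.status
      except urllib.error.HTTPError as err:
          return err.code          # eg 404 Not Found, 410 Gone
      except (urllib.error.URLError, OSError):
          return None              # DNS failure, timeout, refused connection

  if __name__ == "__main__":
      urls = ["http://example.com/", "http://example.com/missing-page"]
      for url in urls:
          status = check_link(url)
          label = "dead" if status in (None, 404, 410) else "alive"
          print(f"{url}: {status} ({label})")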

A 2003 paper in Science by Robert Dellavalle, Eric Hester, Lauren Heilig, Amanda Drake, Jeff Kuntzman, Marla Graber & Lisa Schilling on 'Going, Going, Gone: Lost Internet References' examined internet citations in The New England Journal of Medicine, The Journal of the American Medical Association and Nature. Web content was cited in over 1,000 items published between 2000 and 2003. At three, 15 and 27 months after publication the proportion of inactive references grew from 3.8% to 10% and then 13%.

Other studies suggest that rates of link rot in print-format citations of online content outside the sciences may be as high as 40% after three years. One 2002 US study suggested that up to 50% of URLs cited in articles in two IT journals were inaccessible within four years.

It has similarly been claimed that, across some 2,500 UK government sites, around 27% of URLs become invalid each year as sites are restructured, cease to operate after administrative reorganisations or have documents taken offline.

and of domains

The O'Neill Trends paper also comments that

In addition to a slower rate of new site creation, the rate at which existing sites disappear may have increased. Analysis of the 2001 and 2002 Web sample data suggests that as much as 17 percent of the public Web sites that existed in 2001 had ceased to exist by 2002. Many of those who created Web sites in the past have apparently determined that continuing to maintain the sites is no longer worthwhile. Economics is one motivating factor for this: the "dot-com bust" resulted in many Internet-related firms going out of business; other companies scaled back or even eliminated their Web-based operations .... Other analysts note a decline in Web sites maintained by private individuals — the so-called "personal" Web sites. Some attribute this decline to the fact that many free-of-charge Web hosting agreements are now expiring, and individuals are unwilling to pay fees in order to maintain their site

A State of the Domain study (PDF) highlighted volatility in domain registrations from August 2001 to August 2002 -

        not renewed by        renewed by a       not renewed    not previously
        current registrant    new registrant                    registered (ie wholly new)
  com   11.2m                 2.2m (20%)         9.0m           7.1m
  org   1.4m                  0.18m (13%)        1.2m           0.7m
  net   2.3m                  0.35m (15%)        1.9m           1.1m

Paul Clemente's The State of the Net (New York: McGraw-Hill 1998) is now dated but offers a snapshot of figures before the dot-com crash.



