lies, spin and web stats
This page considers internet statistics and their abuse.
introduction
The internet is a young technology, with unfamiliar terms,
uncertain measures and markets where the desire for information
often outweighs an ability to critically evaluate data.
It is also a technology where some people place an almost
religious faith in numbers. It is one where many people
have come to expect that figures will be both large and
inconsistent with data from life offline, because the
internet is supposedly 'special', eg during the dot com
bubble -
- pundits forecast that traffic would double every hundred
days during the coming decade
- gurus claimed that dot-com alchemy would allow enterprises
to make substantial profits even though costs stubbornly
remained greater than sales revenue.
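The first of those forecasts can be tested with simple compounding arithmetic. The following Python sketch (the function names and the 365-day year are our assumptions, not anything the pundits published) works out what doubling every hundred days would imply over a single year and over a decade.

```python
# Back-of-envelope check of "traffic doubles every hundred days",
# sustained for a decade. The 365-day year is a simplifying assumption.

DAYS_PER_YEAR = 365


def growth_factor(doubling_days, years):
    """Return the total multiple by which traffic grows after `years`."""
    doublings = years * DAYS_PER_YEAR / doubling_days
    return 2.0 ** doublings


def annual_multiple(doubling_days):
    """Return the growth multiple over a single year."""
    return growth_factor(doubling_days, 1)


if __name__ == "__main__":
    print(f"per year:   x{annual_multiple(100):.1f}")
    print(f"per decade: x{growth_factor(100, 10):.2e}")
```

On those assumptions traffic would grow roughly 12.5-fold each year and on the order of a hundred billion-fold over the decade, which gives some sense of why the forecast deserved more scrutiny than it received.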
It is thus unsurprising that some observers have concerns
regarding the abuse of internet statistics (in particular
demographic projections) and conflicting reports about
particular markets, where figures from different vendors
frequently diverge by over a thousand per cent. As with
past media revolutions such as radio and television, many
audience measurement mechanisms are fuzzy and there is a
temptation to lie or simply echo dubious claims, which if
repeated often enough are embodied in conventional wisdom.
Instances of spin and outright
lies reflect factors such as -
- the audience's unfamiliarity with statistical concepts and
discomfort with statistical analysis, characterised by some
as an aspect of digital literacy
- the absence of authoritative benchmarks
- uncritical propagation of problematical data by government
agencies (including Australia's NOIE and DCITA) and by other
gatekeepers
- the nature of much mass and specialist media, with journalists
and publishers having an interest in 'exciting' news or
striking figures (and on occasion being captured by their
sources)
- hype by vendors of products and services and by promoters
such as brokers, venture capital and private equity fund
managers
- triumphalism, with some observers failing to recognise
similarities with past economic and technological developments
and thus not scrutinising some of the more outrageous claims
- cheerleading by analysts and advocacy organisations, with
bodies such as ISOC feeling a need to defend 'their' internet
- the absence, particularly prior to the 2000 Crash, of penalties
for naivety, characterised by one Canberra official as "no
one ever got fired for believing Gartner but people get
monstered for pointing out that the king is wearing digital
clothes"
- subversion through click fraud
Pages
throughout this site highlight conflicting claims regarding
infrastructure, online publishing (eg the number of sites),
commercial activity (adult
industry advocates and critics both have an incentive
to exaggerate the size of the online erotica business)
and characteristics of online populations.
A simple example is the number of "internet users"
in Australia as of early 2007. eMarketer estimated that
the number of users was 13.1 million. The Nielsen//NetRatings
figure was 11.5 million; the Australian Bureau of Statistics
estimate of 10.6 million users was some 2.5 million less
than eMarketer's.
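The size of that disagreement is easy to quantify. A minimal Python sketch, using only the three figures quoted above (the function name spread is ours), expresses the gap between the highest and lowest estimate in absolute and percentage terms.

```python
# Estimates of Australian "internet users" in early 2007, in millions,
# as quoted in the text above.
estimates = {
    "eMarketer": 13.1,
    "Nielsen//NetRatings": 11.5,
    "Australian Bureau of Statistics": 10.6,
}


def spread(figures):
    """Return (absolute gap, gap as a percentage of the lowest figure)."""
    low = min(figures.values())
    high = max(figures.values())
    gap = high - low
    return gap, 100.0 * gap / low


if __name__ == "__main__":
    gap, pct = spread(estimates)
    print(f"gap: {gap:.1f} million ({pct:.1f}% above the lowest estimate)")
```

The highest figure is about 24% above the lowest, a handy margin for anyone wanting to pick the number that suits the argument.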
As with traditional teledensity
counts a polemicist can pick a figure to illustrate a
particular argument - Australia is ahead of the pack or
lagging behind peers, digital divides
are widening or narrowing, market opportunities beckon
...
common fudges
What are some common fudges? They include -
- confusion in terms
- extrapolation from an unrepresentative sample
- mistaking correlation for causation
- assuming that growth rates will remain constant
- providing a gross rather than a per capita figure
- assuming that the availability of connectivity (or access
to hardware and software) equals ongoing use or a specific
type of use
Examples are -
- the Australian government's announcement that all agencies
are "online" (a metric that does not distinguish a single
official with a dialup connection from every officer having
broadband, or a single web page from a rich resource for
citizens, and that says nothing about the quality of what
is online)
- acknowledgement that approximately 50% of people who download
Firefox actually try it and that 25% actively use it on an
ongoing basis
- claims that one in 10 players who regularly play online
games start a physical relationship with a fellow gamer
Such abuses are evident elsewhere. One London tabloid, for
example, shrilled in 2006 that "Britain's plumbers, electricians
and locksmiths drink the equivalent of 1.3 baths of tea"
each year, a figure that is somewhat less exciting when
you do the maths and recognise that annual consumption
of 120 litres of Darjeeling equals roughly one soft drink
can per day. Announcement in 2007 of a £300 million increase
in UK spending on childcare failed to impress people with
a calculator, who did the maths and recognised that it meant
only £1.15 per child per week.
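Both checks take a few lines. The Python sketch below is illustrative only: the 330 ml can size is our assumption, as is the reading of the £300 million as an annual amount, and the function names are hypothetical. It converts the tea figure to a daily quantity and shows how many children the childcare figures would imply.

```python
DAYS_PER_YEAR = 365
WEEKS_PER_YEAR = 52
CAN_LITRES = 0.33  # assumed size of a standard soft drink can


def litres_per_day(litres_per_year):
    """Convert an annual volume to a daily volume."""
    return litres_per_year / DAYS_PER_YEAR


def implied_children(total_pounds, pounds_per_child_per_week):
    """Children implied by spreading an annual total at a weekly per-child rate."""
    return total_pounds / (pounds_per_child_per_week * WEEKS_PER_YEAR)


if __name__ == "__main__":
    daily = litres_per_day(120)
    print(f"tea: {daily:.2f} litres/day, {daily / CAN_LITRES:.2f} cans/day")

    # Assumes the 300 million pounds is an annual amount.
    children = implied_children(300e6, 1.15)
    print(f"childcare: about {children / 1e6:.1f} million children implied")
```

On those assumptions the tea works out at a shade under one can per day and the childcare figures imply around five million children.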
Many of the web traffic statistics accepted by advertisers
and scholars are artefacts from a 'faith based science',
as the user is reliant on claims that cannot be readily
tested and compared. Those claims might be made by a site
owner (whose figures are not independently audited) or by
third party web tracking services (which may use different
mechanisms or merely different definitions to those of their
competitors and thus not enable ready benchmarking).
As noted earlier in this guide, site operators have claimed
that their figures are accurate because they see the number
of hits on their pages, rather than inferring hits from
toolbars used by an unrepresentative demographic or data
provided by individual ISPs. That has provoked questions
about whether advertisers can trust an individual site
operator not to 'cook' its figures and whether it is possible
for advertisers to choose between competing sites on the
basis of claimed figures.
In the US Forbes famously claimed some 15 million visitors
per month to its sites, more than double the 7.3 million
that metrics specialist comScore reported for the same
sites. Confidence in claims and counterclaims is eroded
by 'restatements' from specialists, with Nielsen//NetRatings
for example in 2006 restating its reported figures regarding
Entrepreneur.com from 7.6 million monthly visits to 2
million visits. That is a substantial change if you were
paying for ad exposure or investing in the site operator
on the basis of claimed traffic. (The discussion elsewhere
on this site regarding audience measurement notes that
similar restatements have occurred in relation to radio,
television and newspaper readership figures: net data
restatements are merely the most egregious.)
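The scale of those discrepancies can be expressed with two simple ratios. The following Python sketch uses only the figures quoted above; the function names are ours, and the cost multiplier assumes (hypothetically) an advertiser paying a flat fee regardless of actual traffic.

```python
def ratio(claimed, measured):
    """How many times larger the claimed figure is than the measured one."""
    return claimed / measured


def percent_change(before, after):
    """Percentage change from an original figure to a restated figure."""
    return 100.0 * (after - before) / before


if __name__ == "__main__":
    # Forbes claim vs comScore report (millions of monthly visitors).
    print(f"Forbes claim / comScore figure: {ratio(15.0, 7.3):.2f}")

    # Entrepreneur.com before and after restatement (millions of monthly visits).
    print(f"Entrepreneur.com restatement: {percent_change(7.6, 2.0):.1f}%")

    # Under a flat fee, cost per visit rises by the factor the audience shrank.
    print(f"Cost per visit multiplier: {ratio(7.6, 2.0):.1f}")
```

The restatement cut the reported audience by roughly 74%, which under a flat fee would make each visit nearly four times as expensive as the advertiser had assumed.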
Confidence is also eroded by potential partiality in much
sponsored research. Sponsorship of some studies has led
some savvy observers to suggest that the data should be
labelled as 'vendor research' or simply as promo.
Conflicts in claims about what people are searching for
are highlighted here.
glossy factoids
Why is problematical research influential? One reason
is that users want to believe. Another reason is that
much output from commercial research firms is wrapped
in the trappings of authority: priced out of the reach
of many scholars or other independent analysts, replete
with jargon and buzzwords, hyped as commissioned or used
by leading private and public sector organisations, embodying
a range of charts and tables, drawing on proprietary data
analysis mechanisms and surveys.
Influence can be self-reinforcing: users refer to studies
and to specialists because they know their peers use them.
The more a report is cited the more likely it will be
referred to and the greater the authority for its author
to gain support for further research (alas, often research
that just massages the initial figures and that may not
be relevant in another location).
Many journalists and (more importantly) most end-users
seem unwilling or unable to articulate why they believe
such studies and the extent to which they believe. That
is perhaps because many of the statistics are pulled from
media releases (free) rather than the full reports (expensive).
A more significant reason is that the basis of the data
and compliance with any standards are usually opaque,
even if an observer has access to the full text of the
particular report and has had an opportunity to scrutinise
past reports from the vendor in order to identify 'restatements'
and anomalies.
primers
Darrell Huff's How To Lie With Statistics (New
York: Norton 1993) has not been substantially updated
since its first appearance in the early 1950s but is of
excellent value. John Paulos' A Mathematician Reads
The Newspaper (New York: Anchor 1996) and The
Tiger That Isn't: Seeing Through a World of Numbers
(London: Profile 2007) by Michael Blastland & Andrew
Dilnot are other lighthearted looks at the use and abuse
of mathematics in the mass and specialist media, complemented
by Gene Epstein's more splenetic Econospinning: How
to Read Between the Lines When the Media Manipulate the
Numbers (New York: Wiley 2006).
Joel Best's Damned Lies & Statistics: Untangling
Numbers From The Media, Politicians & Activists
(Berkeley: Uni of California Press 2001) and Jane Miller's
The Chicago Guide to Writing about Numbers: The Effective
Presentation of Quantitative Information (Chicago:
Uni of Chicago Press 2004) are harder going but perhaps
more valuable.
The Design guide on this site points
to recommended studies about the interpretation and creation
of statistical graphics. Three of particular note are
Edward Tufte's
The Visual Display of Quantitative Information
(1992), Envisioning Information (1990) and
Visual Explanations: Images & Quantities, Evidence
& Narrative (1997) - all published by Graphics
Press (Cheshire, Connecticut).
For an overview of data collection and interpretation
issues we recommend Andrew Odlyzko's important 2000 paper
on Internet Growth: Myth & Reality, Use & Abuse
and Michael Dahn's paper
Counting Angels on a Pinhead: Critically Interpreting
Web Size Estimates.
For another perspective see Alain Desrosières'
The Politics of Large Numbers - a History of Statistical
Reasoning (Cambridge: Harvard Uni Press 1998), Margo
Anderson's The American Census: A Social History (New
Haven: Yale Uni Press 1988) and essays in Statistics
& Society: The Arithmetic of Politics (London:
Arnold 1999) edited by Daniel Dorling and Stephen Simpson.
sectoral studies and standards
The US White Paper on Electronic Journal Statistics
(WPEJS),
reflecting the International Coalition of Library
Consortia 1998 Guidelines for Statistical Measures
of Usage of Web-Based Indexed, Abstracted & Full Text
Resources (ICOLC),
deals with library statistics.
The Australian Internet Industry Association (IIA)
is encouraging development of a set of standard measures
for the local online industry, including agreed standards
for "Site Centric/Rating, and Ad Server Measurement".
The University of Southern California has published a paper
(PDF)
mapping competing US industry measures. It should be read
in conjunction with the outstanding paper
by Thomas Novak & Donna Hoffman on New Metrics
for New Media: Toward the Development of Web Measurement
Standards.
measuring the information economy
Questions about mapping the size, shape and volatility
of the 'new economy' are explored in the Information Economy
guide elsewhere on this
site.
DIY spin generators
Robert Orenstein's 'Irresponsible Internet Statistics
Generator' (IISG)
retains its value for those trying to make sense of some
of the loopier government, academic and business projections.