On the Web, Research Work Proves Ephemeral
By Rick Weiss
It was in the mundane course of getting a scientific paper published
that physician Robert Dellavalle came to the unsettling realization that
the world was dissolving before his eyes.
The world, that is, of footnotes, references and Web pages.
Dellavalle, a dermatologist with the Veterans Affairs Medical Center in
Denver, had co-written a research report featuring dozens of footnotes --
many of which referred not to books or journal articles but, as is
increasingly the case these days, to Web sites that he and his colleagues
had used to substantiate their findings.
Problem was, it took about two years for the article to wind its way to
publication. And by that time, many of the sites they had cited had moved
to other locations on the Internet or disappeared altogether, rendering
useless all those Web addresses -- also known as uniform resource locators
(URLs) -- they had provided in their footnotes.
"Every time we checked, some were gone and others had moved," said
Dellavalle, who is on the faculty at the University of Colorado Health
Sciences Center. "We thought, 'This is an interesting phenomenon itself.
We should look at this.' "
He and his co-workers have done just that, and what they have found is
not reassuring to those who value having a permanent record of scientific
progress. In research described in the journal Science last month, the
team looked at footnotes from scientific articles in three major journals
-- the New England Journal of Medicine, Science and Nature -- at three
months, 15 months and 27 months after publication. The prevalence of
inactive Internet references grew during those intervals from 3.8 percent
to 10 percent to 13 percent.
"I think of it like the library burning in Alexandria," Dellavalle said,
referring to the 48 B.C. sacking of the ancient world's greatest
repository of knowledge. "We've had all these hundreds of years of stuff
available by interlibrary loan, but now things just a few years old are
disappearing right under our noses really quickly."
Dellavalle's concerns reflect those of a growing number of scientists and
scholars who are nervous about their increasing reliance on a medium that
is proving far more ephemeral than archival. In one recent study,
one-fifth of the Internet addresses used in a Web-based high school
science curriculum disappeared over 12 months.
Another study, published in January, found that 40 percent to 50 percent
of the URLs referenced in articles in two computing journals were
inaccessible within four years.
"It's a huge problem," said Brewster Kahle, digital librarian at the
Internet Archive in San Francisco. "The average lifespan of a Web page
today is 100 days. This is no way to run a culture."
Of course, even conventional footnotes often lead to dead ends. Some
experts have estimated that as many as 20 percent to 25 percent of all
published footnotes have typographical errors, which can lead people to
the wrong volume or issue of a sought-after reference, said Sheldon
Kotzin, chief of bibliographic services at the National Library of
Medicine in Bethesda.
But the Web's relentless morphing affects a lot more than footnotes.
People are increasingly dependent on the Web to get information from
companies, organizations and governments. Yet, of the 2,483 British
government Web sites, for example, 25 percent change their URL each year,
said David Worlock of Electronic Publishing Services Ltd. in London.
That matters in part because some documents exist only as Web pages --
for example, the British government's dossier on Iraqi weapons. "It only
appeared on the Web," Worlock said. "There is no definitive reference
where future historians might find it."
Web sites become inaccessible for many reasons. In some cases individuals
or groups that launched them have moved on and have removed the material
from the global network of computer systems that makes up the Web. In
other cases the sites' handlers have moved the material to a different
virtual address (the URL that users type in at the top of the browser
page) without providing a direct link from the old address to the new one.
When computer users try to access a URL that has died or moved to a new
location, they typically get what is called a "404 Not Found" message,
which reads in part: "The page cannot be displayed. The page you are
looking for is currently unavailable."
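For readers who want to see how a dead link is detected in practice, here is a minimal Python sketch of the kind of check a study like Dellavalle's implies. It is not the team's actual method, which the article does not describe. The function names (fetch_status, classify, inactive_share) are hypothetical, and treating anything other than a 2xx response as inactive is an assumption made for illustration.

```python
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or None if no server answered."""
    try:
        request = Request(url, method="HEAD")
        # urlopen follows redirects, so a moved page that redirects
        # properly reports the status of its new location.
        with urlopen(request, timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        # The server answered, but with an error such as 404.
        return err.code
    except (URLError, OSError, ValueError):
        # DNS failure, refused connection, timeout, or malformed URL.
        return None


def classify(status):
    """Map a status code to a coarse label (labels are illustrative)."""
    if status is None:
        return "unreachable"
    if status == 404:
        return "not found"
    if 200 <= status < 300:
        return "active"
    return "other"


def inactive_share(urls, fetch=fetch_status):
    """Fraction of urls that do not come back as active."""
    if not urls:
        return 0.0
    inactive = sum(1 for url in urls if classify(fetch(url)) != "active")
    return inactive / len(urls)
```

Passing a different fetch function, for example a dictionary lookup of previously recorded status codes, lets the same tally be run without touching the network.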
So common are such occurrences today, and so iconic has that message
become in the Internet era, that at least one eclectic band has named
itself "404 Not Found," and humorists have launched countless knockoffs of
the page -- including www.mamselle.ca/error.html, which looks like a
standard error page but scolds people for spending too much time on their
computers ("This page cannot be displayed because you need some fresh air
. . .") and www.coxar.pwp.blueyonder.co.uk, which offers political
commentary about the U.S. war in Iraq ("The weapons you are looking for
are currently unavailable.").
Not all apparently inaccessible Web sites are really beyond reach.
Several organizations, including the popular search engine Google and
Kahle's Internet Archive (www.archive.org), are taking snapshots of Web
pages and archiving them as fast as they can so they can be viewed even
after they are pulled down from their sites. The Internet Archive already
contains more than 200 terabytes of information (a terabyte is a million
million bytes) -- equivalent to about 200 million books. Every month it is
adding 20 more terabytes, equivalent to the number of words in the entire
Library of Congress.
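The arithmetic behind those comparisons can be worked out in a few lines of Python. The per-book size and the doubling time below are inferences from the figures quoted above, not numbers the Internet Archive supplied, and the variable names are illustrative.

```python
TERABYTE = 10**12  # a million million bytes, as defined in the article

archive_bytes = 200 * TERABYTE        # current holdings
books_equivalent = 200_000_000        # "about 200 million books"

# Implied size of one book: one million bytes.
bytes_per_book = archive_bytes // books_equivalent

monthly_growth_bytes = 20 * TERABYTE  # added every month

# At the same ratio, each month adds the equivalent of 20 million books.
books_added_per_month = monthly_growth_bytes // bytes_per_book

# At a constant 20 terabytes a month, the holdings would double in 10 months.
months_to_double = archive_bytes // monthly_growth_bytes
```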
"We're trying to make sure there's a good historical record of at least
some subsets of the Web, and at least some record of other parts," Kahle
said. "We're injecting the past into the present."
But with an estimated 7 million new pages added to the Web every day,
archivists can do little more than play catch-up. So others are creating
new indexing and retrieval systems that can find Web pages that have
wandered to new addresses.
One such system, known as DOI (for digital object identifier), assigns a
virtual but permanent bar code of sorts to participating Web pages. Even
if the page moves to a new URL, it can always be found via its DOI.
Standard browsers cannot by themselves find documents by their DOIs. For
now, at least, users must use go-between "registration agencies" -- such
as one called CrossRef -- and "handle servers," which together work like
digital switchboards to lead subscribers to the DOI-labeled pages they
seek. A hodgepodge of other retrieval systems is cropping up, as well --
all part of the increasingly desperate effort to keep the ballooning Web's
contents findable.
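The switchboard idea can be illustrated with a toy Python sketch. It is not the real DOI or handle-server software. The Resolver class, its method names, the identifier and the example.org addresses are all hypothetical, chosen only to show how a permanent label can keep pointing at a page whose URL changes.

```python
class Resolver:
    """Toy switchboard mapping permanent identifiers to current URLs."""

    def __init__(self):
        self._locations = {}

    def register(self, identifier, url):
        """Assign an identifier to a page; an identifier is never reused."""
        if identifier in self._locations:
            raise ValueError(f"{identifier} is already registered")
        self._locations[identifier] = url

    def move(self, identifier, new_url):
        """Record that the page now lives at new_url."""
        if identifier not in self._locations:
            raise KeyError(identifier)
        self._locations[identifier] = new_url

    def resolve(self, identifier):
        """Return the page's current URL, or None if unknown."""
        return self._locations.get(identifier)


resolver = Resolver()
resolver.register("10.9999/example.1", "http://example.org/old/report.html")

# The publisher reorganizes its site; the identifier stays the same.
resolver.move("10.9999/example.1", "http://example.org/archive/report.html")

current_url = resolver.resolve("10.9999/example.1")
```

A footnote that cites the identifier rather than the address keeps working after the move, provided whoever moved the page also updates the registry.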
If it all sounds complicated, it is. But consider the stakes: The Web
contains unfathomably more information than did the Alexandria library. If
our culture ends up unable to retrieve and use that information, then all
that knowledge will, in effect, have gone up in smoke.
Research editor Margot Williams contributed to this report.