What Web Ephemerality Means for Your Hyperlinks

Hyperlinks are a powerful tool for journalists and their readers. Diving deep into the context of an article is just a click away. But hyperlinks are a double-edged sword: for all the internet’s boundlessness, what’s on the web can also be changed, moved, or made to disappear entirely.

The fragility of the web poses a problem for any field of work or interest that depends on written materials. Loss of reference material, negative effects on SEO, and malicious hijacking of valuable outbound links are among the harmful consequences of a broken URL. More fundamentally, it leaves articles from decades past as shells of their former selves, severed from their original sources and context. And the problem goes beyond journalism. In a 2014 study, for example, researchers (including some members of this team) found that nearly half of all hyperlinks in Supreme Court opinions led to content that had either changed since it was originally published or disappeared from the internet.

Hosts control URLs. When they remove content from a URL, whether intentionally or not, readers who follow the link find it inaccessible. This often irreversible degradation of web content is commonly referred to as linkrot. It is accompanied by the related problem of content drift, or the generally unannounced changes – retractions, additions, replacements – made to the content at a particular URL.

Our team of researchers at Harvard Law School has undertaken a project to better understand the extent and characteristics of journalistic linkrot and content drift. We examined the hyperlinks in New York Times articles, from the launch of the Times website in 1996 through mid-2019, working from a dataset provided to us by the Times. The substantial linkrot and content drift we found reflect the inherent difficulty of pointing durable links at a volatile web. The Times, notably, is a well-resourced flagship of digital journalism with a strong institutional archiving structure; that even it confronts the linkrot challenge indicates the problem has not yet been fully understood or addressed across the field.

The link dataset on which we built our analysis was assembled by Times software engineers, who extracted the URLs embedded in archived articles and bundled them with basic article metadata such as section and publication date. We measured linkrot by writing a script to visit each of the unique “deep” URLs in the dataset and record HTTP response codes, redirects, and server timeouts. Based on this analysis, we labeled each link as “rotten” (removed or inaccessible) or “intact” (returning a valid page).
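
A minimal sketch of this kind of check in Python, using the requests library, might look like the following. The function here is illustrative rather than our actual script, which handled many more edge cases:

    import requests

    def check_link(url, timeout=30):
        """Label a URL 'rotten' (removed or inaccessible) or 'intact'."""
        try:
            # Follow redirects and record the final response; timeouts,
            # DNS failures, and connection errors all count as rot.
            response = requests.get(url, timeout=timeout, allow_redirects=True)
        except requests.RequestException:
            return "rotten"
        # A final status below 400 suggests a valid page; 4xx and 5xx
        # responses indicate removed or inaccessible content.
        return "intact" if response.ok else "rotten"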

Two million hyperlinks

We found that, of the 553,693 articles within the purview of our study (meaning they included URLs on nytimes.com), there were a total of 2,283,445 hyperlinks pointing to content outside of nytimes.com. Seventy-two percent of these were “deep links” with a path to a specific page, such as example.com/article, on which we focused our analysis (as opposed to links to just a homepage, such as example.com, which made up the rest of the dataset).
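
As a concrete illustration of the distinction, a hypothetical classifier using Python’s standard library might look like this; it sketches the definition above rather than the pipeline used on the Times dataset:

    from urllib.parse import urlparse

    def is_deep_link(url):
        """True if the URL has a path to a specific page, not just a homepage."""
        return urlparse(url).path not in ("", "/")

    assert is_deep_link("http://example.com/article")
    assert not is_deep_link("http://example.com")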

Of these deep links, 25 percent were completely inaccessible. Linkrot became more common with age: 6 percent of links from 2018 had rotted, compared with 43 percent of links from 2008 and 72 percent of links from 1998. Fifty-three percent of all articles containing deep links had at least one rotten link.

Rot by section

Certain sections of the Times showed much higher rates of rotten URLs. Links in the Sports section, for example, showed a rot rate of about 36 percent, as opposed to 13 percent for The Upshot. This disparity is largely explained by time: the average age of a link in The Upshot is 1,450 days, compared with 3,196 days in the Sports section.

To gauge the extent to which these chronological differences alone explain the variation in rot rate across sections, we developed a metric, the relative rot rate (RRR), which shows whether a section has undergone proportionally more or less linkrot than the Times overall. Of the fifteen sections with the most articles, the Health section had the lowest RRR, falling about 17 percent below the baseline frequency of linkrot. The Travel section had the highest, with the links in its articles rotting at a rate more than 17 percent above the baseline.
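
The arithmetic behind the metric is simple. One plausible formulation, sketched below as a Python function (the precise definition appears in our full report; this version is for illustration only):

    def relative_rot_rate(section_rotten, section_total,
                          overall_rotten, overall_total):
        """Signed fraction above (+) or below (-) the paper-wide baseline.

        A value of -0.17 means a section's links rotted about 17 percent
        less often than the baseline; +0.17 means 17 percent more often.
        """
        section_rate = section_rotten / section_total
        baseline_rate = overall_rotten / overall_total
        return section_rate / baseline_rate - 1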

A section that reports largely on government affairs or education may be at a disadvantage simply because deep links to domains such as .gov or .edu show higher rates of rot. These URLs are volatile by nature: whitehouse.gov will always have the same URL, but it fundamentally changes content and structure with each new administration. It is precisely because their domains are fixed that their deep links are brittle.

Content drift

Of course, returning a valid page is not the same as returning the page as seen by the author who originally included the link in an article. Linkrot’s counterpart, content drift, can render the content at the end of a URL misleading or radically different from what the original linker intended. For example, a 2008 article on a congressional race refers to a member of the New York City Council and links to what was then his page on the city government’s website. Clicking the same link today takes you to the page of the council member currently representing that district.

To identify the prevalence of content drift, we performed a human review of 4,500 URLs randomly sampled from those our script had labeled intact. For the purposes of this review, we defined a drifted link as a URL in a Times article that no longer pointed to the relevant information the article was referring to when it was published. Based on this review, reviewers marked each sampled URL as “intact” or “drifted.”
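
Drawing such a sample is straightforward. A sketch, assuming the intact-labeled URLs sit in a list (the seed here is ours, included only to make the draw reproducible):

    import random

    def draw_review_sample(intact_urls, k=4500, seed=0):
        """Draw a fixed-size, reproducible random sample for human review."""
        return random.Random(seed).sample(intact_urls, k)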

Thirteen percent of the intact links in this 4,500-link sample had drifted significantly since the Times published them. Drift, like rot, grew with age: 4 percent of accessible links published in 2019 articles had drifted, compared with 25 percent of accessible links from 2009.

The way forward

Linkrot and content drift at this scale across the New York Times are not a sign of neglect but a reflection of the state of modern online citation. The ability to share information quickly through links has enriched journalism; the fact that it is compromised by the web’s fundamental volatility points to the need for new practices, workflows, and technologies.

Options for retroactive mitigation are limited but still important to consider. The Internet Archive hosts an impressive, though far from complete, assortment of website snapshots. It is best understood as a way to patch instances of linkrot and content drift after the fact. Publications could help raise the visibility of the Internet Archive and similar services as a tool for readers, or even automatically replace broken links with links to archives, as the Wikipedia community has done.
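
The Internet Archive exposes a public availability API that makes this kind of replacement tractable. A minimal sketch in Python, with error handling omitted (the optional timestamp parameter asks for the capture closest to a given date, such as an article’s publication date):

    import requests

    WAYBACK_API = "https://archive.org/wayback/available"

    def nearest_snapshot(url, timestamp=None):
        """Return the closest Wayback Machine snapshot URL, or None."""
        params = {"url": url}
        if timestamp:
            params["timestamp"] = timestamp  # YYYYMMDD
        data = requests.get(WAYBACK_API, params=params, timeout=30).json()
        closest = data.get("archived_snapshots", {}).get("closest")
        return closest["url"] if closest and closest.get("available") else None

A rot checker like the one sketched earlier could then swap a broken link for its snapshot, much as Wikipedia’s bots do.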

Yet more fundamental measures are needed. Some journalists have embraced proactive solutions, such as taking screenshots and storing static images of websites. But that does not solve the problem for the reader who comes across an inaccessible link.

New frameworks for thinking about the purpose of a given link will help strengthen the intertwined processes of journalism and research. Before linking, for example, journalists could decide whether they want a dynamic link to the ever-changing web, which risks rot or drift but allows deeper exploration of a topic, or a frozen archival snapshot, fixed to represent exactly what the author saw at the time of writing. Newsrooms, and the people who support them, should build technical tools that streamline this more sophisticated linking process, giving editors fine-grained control over how their articles interact with other web content.

Newsrooms should consider adopting tools that suit their workflows and make link preservation an integral part of the journalistic process. Partnerships between library and information professionals and digital newsrooms would be fruitful in shaping these strategies. Such partnerships have already produced domain-specific solutions, like those offered to the legal field by the Harvard Law School Library’s Perma.cc project (on which the authors of this piece work or have worked).

The skills of information professionals should be paired with the specific concerns of digital journalism to surface particular needs and areas for development. For example, exploring more automated detection of linkrot and content drift would allow newsrooms to balance the value of external links against archival considerations while still publishing at scale.

Digital journalism has grown enormously over the past decade, taking a vital place in the historical record. Linkrot is already eroding that record – and it’s not going away on its own.

For an extended version of this research, as well as more information on the methodology and dataset, visit https://cyber.harvard.edu/publication/2021/paper-record-meets-ephemeral-web.

Below is a list of archived citations included in this article:

https://www.cjr.org/tow_center_reports/the-dire-state-of-news-archiving-in-the-digital-age.php archived at https://perma.cc/FEW8-EBPH

https://www.searchenginejournal.com/404-errors-google-crawling-indexing-ranking/261541/#close archived at https://perma.cc/H23H-28CJ

https://www.buzzfeednews.com/article/deansterlingjones/links-for-sale-on-major-news-wesbites archived at https://perma.cc/D6B3-2Z2A

https://cityroom.blogs.nytimes.com/2008/06/12/democrats-rally-around-mcmahon-in-si-race/ archived at https://perma.cc/W95N-U4S9

https://council.nyc.gov/district-49/ archived at https://perma.cc/QNZ6-SEXP

John Bowers, Clare Stanton, and Jonathan Zittrain are the authors. John Bowers is a JD candidate at Yale Law School and an affiliate of the Berkman Klein Center for Internet & Society at Harvard University, where he worked as a senior research coordinator before coming to Yale. Clare Stanton is the outreach and communications manager for the Perma.cc project at Harvard Law School’s Library Innovation Lab. Jonathan Zittrain is a professor of law and computer science at Harvard, where he co-founded the Berkman Klein Center for Internet & Society.
