The half-life of citations

Why are the references in your research so old? That’s feedback I remember receiving on my first bit of true research, my honours dissertation. The examiner wasn’t as blunt as my paraphrasing, but the gist of his comment was memorable enough. At the time, it seemed an odd comment. I now realise that it’s a valid concern.

Half the citations in my dissertation were more than 20 years old. A quarter were over 40. Being new to research, I didn’t see anything odd about this. In the intervening nine years, I haven’t changed my tune. I still disproportionately cite old papers. In all six papers I’ve written as first author, the median age of papers I cite has been 10–15 years. For the papers I read, that range is more like 5–10 years.

Don’t get me wrong, I still cite a lot of recent papers. That’s “recent” in the scientific literature sense of something less than five years old, or thereabouts. But even though I cite recent papers frequently, I do it less frequently than others: 47% of papers I cite are 10 years old or less, compared to 59% in the papers I read.

Before you stop reading, I realise that no one cares about these two percentages. They aren’t even that different. More importantly, there’s no good reason for you to care about something as esoteric as the age of papers cited by a scientist you’ve likely never met. (My original title for this post was the self-indulgent “I’m not an early adopter of the scientific literature”, but I’m still unsure whether that’s a good thing or not.)

The more general and more interesting point is how citations drop off as they get older. Focus on the red histogram below:

[Figure: histograms of the ages of cited papers]
Many of the papers I cite are older than me. I probably shouldn’t try to keep this up too much longer.

To look for patterns in citations, I took 360 papers that I’ve accumulated over the last six years and calculated the age of the papers cited at the time each paper was published. All papers were published between 2000 and 2020. Most are from the fields of physical oceanography or fluid mechanics. Your field will differ, but the general principles should hold. (See the end of the post for the detailed methods.)

The shape of the red histogram above might look familiar: it’s an exponential decay. In fact, exponential decay is a surprisingly good fit for how citations age. Let me repeat the plot above, but with the y axis on a log scale so that exponential decay becomes a straight line:

[Figure: the same histogram with a logarithmic y axis]

Every nine years, the likelihood of a paper in my collection being cited halves. This halving keeps going for nearly 100 years, after which time the exponential decay model starts to become meaningless (anyone want a fraction of a citation?).
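One way to read a half-life off a histogram like this is to fit a straight line to the logarithm of the counts and convert the slope. The Python sketch below is an illustration only, not the code behind the figure: the function name and the synthetic counts (built to halve every nine years) are assumptions made for the example.

```python
import math

def half_life_from_counts(ages, counts):
    """Least-squares fit of log(count) against age; returns the half-life.

    If count ~ A * 2**(-age / T), then log(count) is linear in age with
    slope -ln(2) / T, so T = -ln(2) / slope.
    """
    # Empty bins have no logarithm, so they are skipped.
    pairs = [(a, math.log(c)) for a, c in zip(ages, counts) if c > 0]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return -math.log(2) / slope

# Synthetic counts that halve every nine years (illustrative, not the post's data).
ages = list(range(1, 61))
counts = [1000 * 0.5 ** (age / 9) for age in ages]
half_life = half_life_from_counts(ages, counts)
print(f"Estimated half-life: {half_life:.2f} years")
```

Real counts are noisy, especially in the sparse old-age bins, so the range of ages included in the fit will affect the answer.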

Assuming that exponential decay of citations holds for each paper individually, the number of citations a paper has after nine years would be half of the total that it ever accumulates (counting in nine-year blocks, 1 + 1/2 + 1/4 + 1/8 + 1/16 + 1/32 + … = 2, and the first block is the 1). Except, of course, this is a bad assumption. A small subset of foundational papers is likely what keeps the old literature going.
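For the idealised single-paper case, the arithmetic can be written out as follows. This is a sketch under the stated assumption of exponential decay with a half-life of T = 9 years. The symbols N (citations collected in the first nine years), T, and C are introduced here only for illustration.

```latex
% Citations in successive nine-year blocks form a geometric series
C_{\text{total}} = N\left(1 + \tfrac{1}{2} + \tfrac{1}{4} + \cdots\right)
                 = N \sum_{k=0}^{\infty} 2^{-k} = 2N,
\qquad
\frac{C_{\text{first nine years}}}{C_{\text{total}}} = \frac{N}{2N} = \frac{1}{2}.

% Equivalent continuous form: fraction of all citations accumulated by age t
\frac{C(t)}{C_{\text{total}}} = 1 - 2^{-t/T}
```

Setting t = T gives one half, and t = 2T gives three quarters, matching the first two terms of the series.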

As much as we all want our papers to be timeless, statistically that won’t happen. The best chance of having our work recognised (i.e., getting cited) is in the first few years. As we see in the graph, that’s where the citations are most frequent. This drops off quickly. In fact, over the first 10 years, the half-life is closer to six years than nine.

I can think of two good reasons for a disproportionately large number of citations of young papers: (1) papers are being published at an increasing rate and (2) citing recent papers signals to others that we’re keeping up with the latest research. My guess is the latter reason is more important.

To back up a statement, sometimes you need to cite a specific paper, regardless of how old it is. Just as often though, you can take your pick of who you want to cite. This is especially true in the introduction when statements tend to be more general. It might go something like “Process X is commonly observed in setting Y (Smith, 2019; Jones, 2020)”. There may be nothing special about the Smith or Jones papers, other than that they’re the most recent examples of process X.

Whether or not a fixation on the newest studies is a problem boils down to personal preference. But one thing’s for sure: a hot topic is, in itself, a poor rationale for its study. Philip Moriarty, on a bit of a tangent in a comment elsewhere, promises to pull his hair out should he read just one more scientific paper that kicks off with “In recent years, [phenomenon/material/property X] has become of increasing interest …” Jeremy Fox shares the sentiment, though doesn’t promise to pull out his hair.

Assuming you agree that we shouldn’t focus on new papers at the expense of the older literature, you might think that things are only going to get worse. Surely the prevalence of Twitter and other platforms that promote the latest papers will only encourage a myopic view toward such papers? In one sense, that’s correct: recent papers are being cited more often. But so are old papers. If I compare histograms like those above but separate the data into papers published 2000–2010 and 2011–2020, the distributions remain the same. What does change is how many papers are being cited.
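Comparing the shape of two age distributions when one period simply has more citations means normalising each histogram by its own total. A minimal Python sketch of that idea follows. The function name, the five-year bin width, and the made-up ages are all assumptions for illustration, not the data or code behind this post.

```python
from collections import Counter

def normalised_histogram(ages, bin_width=5):
    """Return {bin_start: fraction of citations} for a list of citation ages."""
    bins = Counter((age // bin_width) * bin_width for age in ages)
    total = sum(bins.values())
    return {start: count / total for start, count in sorted(bins.items())}

# Made-up ages for two hypothetical periods. The later period has twice as
# many citations overall but the same mix of ages.
early = [1, 2, 3, 4, 6, 7, 12, 18]
late = early * 2

same_shape = normalised_histogram(early) == normalised_histogram(late)
print(normalised_histogram(early))
print("Same shape:", same_shape)
```

Because each histogram is divided by its own total, a doubling in the number of citations leaves the fractions unchanged, which is the sense in which the distributions can "remain the same" while the counts grow.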

[Figure: number of citations per paper against publication year]
The number of citations per paper has nearly doubled in the last 20 years

Since the literature can only keep getting bigger, it makes sense that the number of citations per paper should increase as well.

Extrapolating the graph back in time suggests papers had close to zero citations 20–40 years ago. Seems about right.

 

 


Methods

The 360 papers that I used in my analysis are the same ones I’ve used before for analysing word usage. As I noted there, I converted the papers from PDF form to text files, then analysed the text files.

For each paper, I searched for all numbers resembling a citation. I looked for patterns like 18XX, 19XX, or 20XX preceded by a left parenthesis, left bracket, semicolon, period, or comma. The shell command for those who really care is

grep -E -o '[\(|\[|\,|\;|\, |\; |\. ](18|19|20)[0-9][0-9]'

The main problem with this is matching a number in the text that isn’t a citation but gets counted because it fits the pattern. I’m assuming this has minimal effect since I’m treating all papers the same way. That said, I’ve ignored any citations from the year the paper was published (i.e., an age of 0 years) in case the year shows up in, say, the running header (or similar), which would potentially count as a citation on every page.
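For anyone wanting to try something similar, here is a rough Python sketch of the year-matching and age calculation. It is an approximation, not the exact pipeline used for this post: the function name, the sample sentence, the check that no fifth digit follows the year, and the dropping of years later than the publication year are all assumptions added for the example. One thing worth knowing about the grep command above is that, inside a bracket expression, `|` and `\` are literal characters and `, ` is just a comma and a space, so the class also matches a year preceded by a space. The sketch keeps that behaviour.

```python
import re

# Assumed pattern: a year starting 18, 19, or 20, preceded by "(", "[", ",",
# ";", ".", or a space, and not followed by another digit (so 20195 is skipped).
YEAR_PATTERN = re.compile(r"[(\[,;. ]((?:18|19|20)\d\d)(?!\d)")

def cited_ages(text, publication_year):
    """Ages in years of year-like matches, skipping age 0 and future years."""
    ages = []
    for match in YEAR_PATTERN.finditer(text):
        age = publication_year - int(match.group(1))
        if age > 0:  # age 0 may just be the running header; negative is not a citation
            ages.append(age)
    return ages

# Hypothetical sentence from a paper published in 2020.
sample = "Process X is common (Smith 2019; Jones, 2020) and was first noted in [1987]."
ages = cited_ages(sample, publication_year=2020)
print(ages)
```

For the sample sentence, 2019 and 1987 are kept (ages 1 and 33) and 2020 is dropped because its age is 0.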

My method also double counts citations: once (or more) when cited in the text and again in the reference list. Again, since I treat all papers the same way, I’m ignoring this issue.

Author: Ken Hughes

Post-doctoral research scientist in physical oceanography
