Every writer leaves a hidden fingerprint in their texts whether they know it or not. It’s hidden in the relative usage of words: some words appear more than average and other words less. Imagine there’s a rumour that a well-established author has written a new book under a pen name, but they’re pretending that this is not the case. One way to test whether the authors are one and the same is to count the number of mundane words like and, but or -ly adverbs used within the new book and then compare the numbers to the author’s past works. Authors use surprisingly similar numbers of each word over the length of a book. Don’t believe me? Then check out Ben Blatt’s book Nabokov’s Favorite Word is Mauve.
The title of this post is a nod to Blatt’s book. In it, he statistically analyses word frequency in a range of texts from literature to fan fiction to New York Times bestsellers. He uses numbers to teach us about writing. Early on, he shows how a reduction in usage of -ly adverbs correlates with a book’s appeal. This is but one of many predictors of a text’s success based only on word frequency. In the same vein, I’m going to scrutinise my own scientific writing to find room for improvement. Navel-gazing? Yes. Will you learn something if you read on? Also yes.
As much as I try to avoid it, it’s instilled in me as a scientist to remove all possible voice from my writing. I end up using the same cliched words and transitions that abound in the papers I read. I’m going to consider three types of words in particular:
- Vague quantifiers such as poor or strong
- Conjunctive adverbs such as however or consequently
- Words with positive connotations such as robust or novel
The collection of my writing to scrutinise is the six papers I’ve written as first author. The control group is 362 papers I’ve accumulated over the years. I’ll refer to these as my library. My papers average 19 PDF pages, close to the average of my library (17 PDF pages). In total, my library comprises 6350 pages, which equates to two million words, give or take. That said, I’m not going to divide by paper length or word count because it’s easier to work with round numbers. And to paraphrase another blog-post-level analysis: there could be lots of interesting sub-analyses but I haven’t done them. You get what you pay for here! (The rest of the methods are at the end, where they belong.)
A few weeks ago, a co-author called me out for overusing appreciable and appreciably in a soon-to-be-submitted paper. It might have seemed pedantic, except there were 17 instances, far more than is reasonable for a word with a loose definition. I quickly whittled it down to three. Even that is still more than I should be using. 90% of papers in my library never refer to anything as appreciable. At the other end of the scale, the worst offender has nine uses in 19 pages. The second-worst has five, albeit across a 55-page book chapter. The third-worst has four across 26 pages. These three papers all have the same author! Their other paper in my library has one appreciable. This word is part of their fingerprint.
My three most recent papers use appreciable or appreciably a total of 11 times. That puts me in the top 1%. Probably not something to boast about. I need to revert to the approach in my first three papers where I never used either word.
While more common than appreciable and appreciably, poor and poorly are less common than I expected. Two-thirds of the papers in my library are free of these words, compared to one-third of my papers. By comparison, 1% of papers had double-digit usage. Four of the five papers with the most were review papers. My guess as to why is that review papers can dedicate much of the article to pointing out deficiencies with existing knowledge and methods. We have poor understanding of … Models perform poorly in simulating … (Review papers are also longer than average, which I’m not accounting for.)
Strong/stronger/strongly is different. It gets used a lot. The median usage is seven times per paper. 10% of papers have at least 20 instances. Again, several of the papers with the highest usage are review papers. As I expected, my own usage of strong* is high. My median usage is 11 times per paper, but my mean is 16 thanks to one paper with 38 instances. Everything is strong in that paper: spatial differences are strong, mixing is strong, mean currents are strong, tidal currents are strong, exchange flow is strong, correlation of seasonal cycles is strong, ice cover is strong, shears are strong… Well, you get the picture. Again, this is not wording to be proud of.
However is a word I use often, but also one that I actively try to avoid. It’s tempting because it makes an easy transition between two sentences. However, I don’t want to routinely contradict what I’ve just written. (You see what I did there, right?)
I use however on average 14 times per paper. That’s twice the going rate in my library. Only once have I used it fewer than 10 times in a paper. On that note, credit to the 2.5% of papers that avoid it and the 6% of papers with a single use. Please, teach me your ways.
While we’re on however, we might as well check on but. In written English in all forms, but is the 24th most common word, being used four times for every 1000 words. However is 333rd, being used 15 times less often. In scientific writing, that’s not the case of course. I use the two words about equally often, and so do others. In my library, the mean and median usage of but outdoes however by only 30%.
Consequently is another of my go-to words for joining sentences. Relative to however, I use consequent or consequently less in an absolute sense (a median of eight times per paper), but relative to my library I use it more than anyone. In my library, the most prolific users of consequent* are two papers with 10 instances. Three of my own papers have 9, 10, and 11 instances. I’m sure some of these could have been culled, but I’d still be a long way from the median usage of consequently, which is somehow zero times per paper. How do people avoid this convenient adverb?
Words with a positive connotation
Although we’d rather not be salespeople, as scientists we do have to promote aspects of our work, especially these days. Using PubMed abstracts as a dataset, Vinkers et al. (2015) showed that the words robust, novel, innovative, and unprecedented are used 25–150 times more often than they were 40 years ago.
Here, finally, may be some words I use less frequently than others. While I use robust four times over six papers, I don’t use the other three words once. Then again, these words aren’t actually that common, at least in my library. Innovative and unprecedented don’t appear in 97% and 95% of papers, respectively. (When unprecedented does appear in my library, it’s often to describe changes in the Arctic, which are hardly a positive thing.) Novel is more common, appearing in 12% of papers in my library. The journal Cell goes as far as discouraging the use of novel, noting that it is overused, tends not to add meaning, and is difficult to verify.
Instead of cliched words, how about an exclamation mark for emphasis? There are 31 in total across my library from 22 papers (6% of papers). Four first authors come up twice in this list of 22. Using exclamation marks is part of their fingerprint. I’d place most of the first authors of these papers as mid-to-late career scientists. Whether there’s anything to take from that, I’m not sure.
With one exception, two exclamation marks is the maximum per paper. The exception has five. All five occur in the space of 275 words during an entertaining anecdote about trying to locate an oceanographic instrument inside a shipping container packed by someone who speaks a different language. The anecdote was written by my colleague Pavan who avoided the outdated but still tacit practice of keeping scientific papers free of enthusiasm. (It’s either because he’s an engineer, a genial person, or both.) While we don’t need five exclamation marks per page, surely there’s no harm in writing our science more like Pavan.
Toward improving writing
I’m now going to be hyperaware anytime I write strong, appreciable, however, or consequently. I can take some solace that multiple howevers may be good for my narrative index. Nevertheless, I should still aim to limit my usage of these four words, among others. I could, of course, reach for the thesaurus (or try to write shorter papers). But using clear or distinct instead of strong, say, is a band-aid solution.
Fiction writing may provide an answer. Novelists are encouraged to avoid -ly adverbs on the grounds that using them is lazy. A character’s emotions, for example, should be clear from context. A writer shouldn’t need to include “… he replied angrily”. Show, don’t tell is the oft-quoted rule. The same idea holds when applied to scientific writing. I shouldn’t need to tell the reader that a process is strong or appreciable. With good writing, this will be self-evident.
My library of 362 papers comprises most of the papers on my laptop’s hard drive published after 2000. (Excluding earlier papers avoids any problems arising from conversion of scanned pages to PDF.) Each PDF was converted to a text file using pdftotext. To search for words in a file, I prefer Ag. On the command line, a search for but looks like
ag --stats 'but[ ,.)!:?\n]' filename.txt
To ensure I count instances of but without also including words like butter, I’ve limited the search to but followed by a space, a new line, or punctuation such as a comma or period. The astute reader will notice that words ending in but, such as debut, will also be counted. For the words I analysed, I expect this to have a negligible effect. I only realised this potential overcounting after having written the post, and I don’t plan to go back and re-analyse anything.
The search expression is the heart of the method. The rest is a wrapper, i.e., looping through all words, keeping track of the outcome, and calculating basic statistics.
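For anyone wanting to try something similar, here is a rough Python sketch of what such a wrapper could look like. It is not the script I ran: the function names (count_word, summarise, analyse) and the folder layout (one .txt file per paper) are made up for illustration, and it swaps the Ag search for Python’s re module with word boundaries, which sidesteps the overcounting of words that merely end in but.

```python
import re
import statistics
from pathlib import Path


def count_word(text: str, stem: str, prefix_match: bool = False) -> int:
    """Count case-insensitive whole-word matches of `stem` in `text`.

    With prefix_match=True, any word that starts with `stem` counts,
    so 'strong' also matches 'stronger' and 'strongly'.
    """
    tail = r"\w*" if prefix_match else ""
    # \b is a word boundary, so 'but' matches neither 'butter' nor 'debut'.
    pattern = re.compile(rf"\b{re.escape(stem)}{tail}\b", re.IGNORECASE)
    return len(pattern.findall(text))


def summarise(counts: list[int]) -> dict[str, float]:
    """Basic per-paper statistics for one word (needs at least one count)."""
    return {
        "median": statistics.median(counts),
        "mean": statistics.mean(counts),
        "max": max(counts),
        # Fraction of papers that never use the word.
        "share_zero": sum(c == 0 for c in counts) / len(counts),
    }


def analyse(folder: str, stems: list[str]) -> dict[str, dict[str, float]]:
    """Loop over every .txt file in `folder` (hypothetical layout) and
    summarise the usage of each stem across the papers."""
    texts = [p.read_text(errors="ignore") for p in sorted(Path(folder).glob("*.txt"))]
    return {stem: summarise([count_word(t, stem) for t in texts]) for stem in stems}
```

Calling analyse("library_txt", ["however", "but"]) would return one small dictionary of statistics per word, where library_txt is a placeholder folder name. Note that these are raw counts per paper, in keeping with the rest of the post, so longer papers are not normalised.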