A direct and quantifiable impact on science to come out of my PhD was the 50-odd times that I brewed coffee for the department morning tea. Scientists turned up and got coffee; I got thanked for helping make that happen.
Despite its impact, brewing coffee is not listed on my CV. Instead, I have publications. Yet, compared to coffee, the direct impacts of these publications are hard to define.
In computer programming, code smells are “surface indications that usually correspond to deeper problems in the system”. Duplicated code is one example. Copying a code fragment into many different places is generally considered bad form; Don’t Repeat Yourself is a well-known principle of software development. However, duplicating code can be beneficial if, say, it makes the code easier to read and maintain.
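To make the duplication smell concrete, here’s a minimal, hypothetical Python sketch (the readings and the conversion are invented purely for illustration): the first version repeats the same formula three times, the second follows DRY so a mistake can be fixed in one place.

```python
# Smelly: the same conversion copy-pasted for each (made-up) reading
site_a_psi, site_b_psi, site_c_psi = 14.7, 15.2, 13.9
site_a_kpa = site_a_psi * 6.89476
site_b_kpa = site_b_psi * 6.89476
site_c_kpa = site_c_psi * 6.89476

# DRY: one helper, so a typo or a change of constant is fixed in one place
def psi_to_kpa(pressure_psi):
    """Convert a pressure from psi to kilopascals."""
    return pressure_psi * 6.89476

site_a_kpa = psi_to_kpa(site_a_psi)
site_b_kpa = psi_to_kpa(site_b_psi)
site_c_kpa = psi_to_kpa(site_c_psi)
```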
Although code smells are undesirable, “they are not technically incorrect and do not prevent the program from functioning.”
By this description, I’d argue that smells also exist in scientific papers. Hence, I’m proposing a few of these easy-to-spot (aka sniffable) features that may point to a deeper underlying issue.
Scientific writing is obsessed with other scientific writing and itself.
Phrases like ‘this paper’ and ‘this study’ are everywhere in scientific writing—which is not a problem per se. Used well, these phrases concisely differentiate the current study from others. Used poorly, these phrases fill the word count without adding value to the reader.
Never, for example, start a Conclusion with ‘In this paper, we showed…’ or ‘The main conclusions of this paper are…’. The first few words of a Conclusion (any section, in fact) are precious. Don’t waste them reminding me that I’m reading a paper in which you’ve shown or concluded something. Tell me something profound—something about your science.
“In this paper, we showed…” is a signpost (aka metadiscourse). It’s writing about the writing. And it’s one of the main reasons that so much of science writing, like any academic writing, is so boring.
Line graphs are the Swiss army knives of data visualisation. They can be almost anything… which is both good and bad.
Line graphs are slow to interpret
Many graphs serve one clear purpose. Take the five graphs below:
Even without labels, it’s clear what role each of these graphs serves:
Pie chart—components of a total
Thermometer—progress toward a goal amount
Speedometer—percentage of the largest possible value
Histogram—distribution of values
Box plot—statistical summaries of several datasets
In other words, if I’m presented with one of the graphs above, I have an immediate head start on interpreting it. If, instead, I’m presented with a line graph, I’m forced to read the axis labels and limits first.
Deciphering text is the slowest way to take in information. Shape is fastest, then colour, and only then text. This so-called Sequence of Cognition, popularised by Alina Wheeler, is something marketers need to know about.
I typically write 100–200 lines of code each time I develop a scientific figure that is destined for publication. This is a dangerous length because it’s easy to create a functioning mess. With shorter code fragments, it’s feasible to start over from scratch, and with thousands of lines of code, it makes sense to invest time upfront to organise and plan. But in between these extremes lurks the temptation to write a script that feels coherent at the time but just creates problems for future you.
Let’s say you want to create a moderately complicated figure like this:
A script for this figure could be envisaged as a series of sequential steps (a minimal sketch follows the list):
Read data in from a csv file
Remove any flagged data
Create four subplots
Plot the first line of data against time
Label the y axis
Set the y axis limit
Repeat steps 4–6 for the second and third lines of data
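Sketched in Python with pandas and Matplotlib, those steps might look like the following; the file name, column names, flag convention, and axis limits are all invented for illustration, and a real script would differ.

```python
import pandas as pd
import matplotlib.pyplot as plt

# 1-2. Read the data and drop anything flagged as bad
data = pd.read_csv("timeseries.csv", parse_dates=["time"])
data = data[data["flag"] == 0]

# 3. Create four subplots that share the time axis
fig, axes = plt.subplots(4, 1, sharex=True, figsize=(8, 10))

# 4-7. Plot each variable against time, label the y axis, and set its limits
# (only three of the four panels are filled in this sketch)
variables = [
    ("temperature", "Temperature (°C)", (0, 30)),
    ("salinity", "Salinity (g/kg)", (30, 40)),
    ("pressure", "Pressure (dbar)", (0, 100)),
]
for ax, (column, label, limits) in zip(axes, variables):
    ax.plot(data["time"], data[column])
    ax.set_ylabel(label)
    ax.set_ylim(limits)

axes[-1].set_xlabel("Time")
fig.savefig("figure.png", dpi=300)
```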
Comments within code are harmless, right? They don’t affect run-time, so you might as well use them whenever something might be unclear.
I hope you aren’t nodding your head, because a liberal use of comments is the wrong approach. Not all types of code comments are evil, but many are rightfully despised by programmers as (i) band-aid solutions to bad code, (ii) redundant, or even (iii) worse than no comment at all.
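As a hypothetical illustration (the snippet and its numbers are made up), compare a redundant comment with one that records information the code itself cannot express:

```python
# A made-up snippet purely to contrast comment styles
n_samples = 1200

# Redundant: the comment just restates the code
# Divide n_samples by 60
minutes = n_samples / 60

# Useful: the comment records reasoning the code cannot express
# The logger stores one sample per second, so 60 samples make one minute
minutes = n_samples / 60
```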
The same is true for scientific figures and their captions. In fact, many of the rules discussed in the post Best Practices for Writing Code Comments remain valid when we replace comments and code with captions and figures, respectively.
Each item below comes with a brief elaboration.
Anscombe’s quartet
Four distinct datasets (x vs y) that produce the same summary statistics (mean, variance, correlation coefficient, and line of best fit)
HSL colour space
A colour space that defines colours in terms of their Hue (e.g., red or blue), Saturation (vivid to washed out), and Lightness (white to black)
Whitespace
The area within a design (website, poster, figure, etc) that lacks text, images, or other elements
HARKing
A questionable approach to research: Hypothesising After the Results are Known
Fermi problems
Problems in which an answer cannot be estimated outright but is instead derived as the product of more easily estimated quantities (e.g., how many grains of rice are eaten across the world every year?)
Use cases of .png and .jpg images
The JPG format is optimised for photos, whereas PNGs are for graphs and diagrams
That vs which
That and which, although similar, have opposing implications about whether a clause is restrictive or not
McNamara Fallacy
Basing a decision only on numbers or other objective measures, without reference to any qualitative factors
Version control
A tool for tracking and recording all changes to software and other digital files as they evolve
Serial position effect
The human tendency to better remember what happened at the start and the end and forget what happened in the middle
Zenodo and Figshare
Online repositories for datasets, code, and other research output
Your carbon footprint
A typical person living in a western country will have an annual footprint of 5–20 tonnes CO2
Matthew effect
Well-known scientists get cited more often than lesser-known ones, leading to a positive feedback loop
Hand-waving solutions
A metaphor for an answer that might gloss over details, be vague, or rely on many approximations
Logarithmic scales
Scales that increase geometrically (e.g., 1, 2, 4, 8, 16, …) rather than linearly (2, 4, 6, 8, …)
“Data” is a plural
It can sound odd, but ‘data were collected’ is correct and ‘data was collected’ is not
Left-branching sentences
A sentence structure to avoid because the initial words only make sense as the sentence nears its end
Regression to the mean
A statistical tendency for outliers in an initial experiment to deviate less in a subsequent experiment
ImageMagick
Software for all manner of image manipulations and conversions that can be run from the command line
Bash shell
The default command-line shell on most Linux systems
Butterworth filters
A widely used approach for smoothing time-series data
Golden ratio
The value 1.618…; an aesthetically pleasing aspect ratio for a rectangle among many other claims to fame
Types of map projections
Flattening the earth to a two-dimensional image can be achieved in numerous ways, each with its own pros and cons
DOI and PMID
Unique digital identifiers that can point to publications, datasets, software, and more
Edward Tufte
An early name in data visualisation and author of several books on the topic
Widows and orphans
A line at the beginning or end of a paragraph that is separated from the rest by a page break
Construction cost of the Large Hadron Collider
One of the most expensive scientific experiments took ~3 billion Swiss Francs to build (or ~5 billion US dollars back in 2001)
Resolution of an electron microscope
Electron microscopes can resolve objects as small as 0.1 nanometers
Adding epicycles
Tweaking a fundamentally flawed theory in a last-ditch effort to make it explain observations
Why governments fund basic scientific research
Among many reasons, basic scientific research (i) lowers the barrier for firms that want to develop new products and (ii) develops skilled scientists and engineers who can capitalise on research undertaken elsewhere
William Shockley’s thoughts on productivity
Shockley speculated that a small number of scientists can be exponentially more productive in total because the creation of a scientific paper is the combination of many individual tasks, and productivity in each of these tasks multiplies together to give overall productivity.
Kerning
Adjusting the spacing between individual letters in text to improve aesthetics
How last authorship varies across fields
Depending on scientific field, the last author either did the least work, is the group leader, obtained funding for the project, or has a surname near the end of the alphabet
Project Jupyter
An open source project that simplifies and promotes interactive use of many programming languages
Uncertainty propagation
The uncertainty of a derived quantity (e.g., kinetic energy derived from speed and mass) can be calculated from the uncertainty of the input quantities following simple—though sometimes tedious—arithmetic
SSH (Secure Shell)
The standard way to access a remote server via the command line
Difference between a hyphen and a minus sign
Although similar, they should not be confused; a hyphen (-) is a short dash used to combine words, whereas a minus sign is longer (−)
Optimal number of characters per line
A line of text should have 60–70 characters (counting spaces) for a single-column layout and 40–50 for multiple columns (see page 32 of Detail in Typography)
The .eps file type
A predecessor to PDF that was developed in the late 1980s and is almost obsolete
Stroke and fill
For line drawings, the edge is known as the stroke and the interior is known as the fill
The Greek alphabet
The order doesn’t matter, but knowing the individual letters is worthwhile
Text anti-aliasing
The smoothing of text to improve its appearance (especially relevant at coarse resolution)
Triptychs
A three-panel image or collection of images (and an easy way to create an attractive title slide)
Active voice
Better than passive voice in most cases
Fast Fourier transform
An algorithm that makes much of modern technology possible
Pseudoscience
Statements and methods purportedly grounded in science but obviously flawed
ORCID
A unique digital identifier for a researcher that is linked with their scholarly works
Transistors
Your phone likely has billions of them
RAM
A computer’s short-term memory in a sense (distinct from the long-term memory that is the hard drive)
Effective cost of a night’s worth of observation from a large telescope
Functions
One of the building blocks of any programming language that, typically, (i) takes one or more inputs, (ii) does whatever to those inputs, and (iii) returns an output
Argument from authority fallacy
The incorrect assumption that a claim is true because it is coming from an authority figure
Preregistered studies
Studies in which the methodology and hypothesis are published before data are obtained
Strawman argument
Misrepresenting a claim or changing its context so as to make it easier to argue against
For loops
The simplest way in most programming languages to make a computer do something again and again
Illusion of explanatory depth
Most people are overconfident in their understanding of a complex phenomenon or procedure until they try to explain it step by step
How regression works
Calculating a line of best fit is one of those things everyone should do manually at least once to understand a procedure that can otherwise be a black box (see the sketch after this list)
Bayesian statistics
An approach to statistics in which probabilities are continually updated as new information is obtained
System 1 and 2 thinking
Two distinct ways of thinking: system 1 is fast and driven by intuition and emotion, whereas system 2 is slower and more deliberate
Pasteur’s quadrant
Use-inspired basic research, or the view that basic and applied research aren’t mutually exclusive
Sayre’s law
In any dispute the intensity of feeling is inversely proportional to the value of the issues at stake
Planning fallacy
The tendency to underestimate the time needed to complete a task (e.g., writing a scientific paper) even with prior experience in the same or similar tasks
Floating point numbers
The system used by computers that allows a small number of bits (each a zero or one) to represent a wide range of numbers (e.g., 64 bits can closely approximate any number, positive or negative, up to about 1.8×10^308)
The Dobzhansky Template
A format coined by scientist-turned-filmmaker Randy Olson that aims to drill down to the essence of an idea: Nothing in ___ makes sense except in light of ___ (e.g., nothing in biology makes sense except in light of evolution)
Newman design squiggle
A visual metaphor for the design process that works equally well for the process of doing science
Gestalt Laws
Design laws, grounded in psychology, for how humans perceive combinations of objects or elements
Einstellung effect
An inefficient approach to problem solving in which you rely on methods that worked in the past despite better ones being available
Bike-shedding
Also known as the Law of Triviality, bike-shedding is giving undue emphasis to minor matters, such as the design of the bike shed to be included within the development of a nuclear power plant
Simpson’s Paradox
Subsets of a dataset, all of which have a negative statistical trend, can still produce a positive trend in the overall dataset
Parkinson’s law
Work expands to fill the time available for its completion
Epistemic trespassing
When an expert in a given field trespasses into another and makes claims where they lack expertise
Decline effect
The strength or effect size of a scientific result tends to decline over successive replications
Base rate neglect
Misjudging the probability of an event because intuitive, individuating information crowds out the base rate (e.g., thinking it’s more likely than not that someone who is 6-foot-8 plays basketball professionally, when in fact the chances are a fraction of 1%)
Identifiable victim effect
The desire to assist a specific individual facing a certain hardship but not a large, unknown group of people facing the same hardship
Deriving incorrect conclusions by overly focusing on clusters of data points that may have arisen by chance
Survivorship bias
A type of selection bias in which the dataset contains only people who made it past some hurdle
BANs: big-ass numbers
One of the simplest ways to visualise data in which, in place of graphs, a few select metrics are displayed as numbers in large text
Anchoring bias
The tendency (and salesperson’s boon) for people to focus on relative changes from an initial value rather than the absolute amount
The difference between science and engineering
Scientists aim to generate new knowledge and engineers aim to apply knowledge to solve real-world problems
Researcher degrees of freedom
A measure of the flexibility a scientist has in developing, analysing, and publishing an experiment
Banking to 45°
As a rule of thumb, the aspect ratio of a line graph should be one in which the changes to be emphasised have a slope of ~45°
The linear model of innovation
The conjecture that basic research informs applied research, which promotes development and production, which ultimately lead to economic growth
Daryl Bem’s precognition paper
An infamous study—one that passed peer review—purporting to show that people can essentially see briefly into the future
Starting with the cake
A teaching philosophy that starts with the big picture rather than tedious fundamentals
arXiv
One of the original preprint servers (now 30 years old)
Principle of least astonishment
A guideline that encourages a design (say, an interface or piece of software) to be built to behave in a way that most users expect it to
Germanic vs Latinate words
Words with a Germanic heritage tend to be simpler and less pretentious than those derived from Latin
Altmetrics
A type of citation measure that counts mentions in blogs, tweets, and other social media rather than standard citations in scientific papers
scite.ai
A (now expensive) AI service that summarises the different ways a paper is cited (supported, contrasted, or mentioned) rather than merely counting the number of citations
Root cause analysis
A problem solving technique that looks to solve the underlying issue rather than the immediate (and possibly superficial) problem
Complex vs complicated
Something that is complicated may involve a tedious number of straightforward steps, whereas something that is complex may have multiple nonlinear interactions and emergent behaviour
WEIRD subjects
People from Western, Educated, Industrialized, Rich, and Democratic societies who are over-represented in scientific studies involving human subjects
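As a companion to the ‘How regression works’ entry above, here is a minimal Python sketch of computing a line of best fit by hand (ordinary least squares with a single predictor); the data are made up for illustration.

```python
# Made-up data: fit y = a + b*x by ordinary least squares, by hand
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# Slope: covariance of x and y divided by the variance of x
b = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sum(
    (xi - x_mean) ** 2 for xi in x
)

# Intercept: the fitted line passes through (x_mean, y_mean)
a = y_mean - b * x_mean

print(f"y ≈ {a:.2f} + {b:.2f} x")  # prints y ≈ 0.14 + 1.96 x
```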
I used my laptop to scan the text of 360 scientific papers for use of the word exciting (and excited and other variants). I got 195 matches. That’d suggest that scientists imbue their writing with their own excitement for science. Except that 191 of those matches are physics jargon (as in wind excites ocean waves) rather than the everyday meaning. Remove those and we’re left with ~1% of papers indicating any excitement.
‘That’s a weird thing to look into’ is what you’re thinking, so two bits of context. First, there’s lots to be learned about scientific writing by looking at word usage statistics; see my two previous posts. Second, I came across one of these rare uses of exciting with its everyday meaning, and it stood out! Which is messed up. It’s a common word, yet it struck me as out of place in a scientific paper. Not because I think it should be, but because it is.
For comparison, I looked at the words interesting and interestingly. The result: 237 matches, all of which correspond to their everyday usage. (Including interest and interested in my search more than doubles the number of matches.)
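For anyone curious how such a scan might be done, here’s a rough Python sketch; the folder of plain-text papers is hypothetical, and separating the physics jargon from the everyday meaning still has to be done by eye.

```python
import re
from pathlib import Path

# Match excite/excites/excited/exciting/excitement as whole words
pattern = re.compile(r"\bexcit(e|es|ed|ing|ement)\b", re.IGNORECASE)

n_papers = 0
n_papers_with_match = 0
n_matches = 0
for path in Path("paper_texts").glob("*.txt"):  # hypothetical folder
    text = path.read_text(errors="ignore")
    n_papers += 1
    hits = pattern.findall(text)
    n_matches += len(hits)
    n_papers_with_match += bool(hits)

print(f"{n_matches} matches across {n_papers_with_match} of {n_papers} papers")
```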
As scientists, we record our findings in perpetuity in PDFs—literally simulations of pieces of paper. It’s time to be more dynamic and invoke a proliferation of media types. We don’t need to get rid of the notion of a paper or stop using a PDF as the version of record. But we do need to complement them with something less static. What follows is an approach I recently took using video.
The final sentence of my latest paper (preprint) steers the reader to a video that stands in place of a Conclusion section. And I’m guessing this video is a much more compelling Conclusion than any possible combination of words.
Here’s the gist of the final paragraph (paraphrased to avoid jargon):
Our simulation was made possible by tuning against measurements from a new instrument. This observation-informed simulation depicts instabilities as they evolve throughout the day. It is best appreciated as an animation (doi.org/10.5281/zenodo.4306935).
Too many scientific figures are ugly. I see three possible reasons:
Laziness: scientists could make nice figures, but don’t put in the effort
Obliviousness: scientists are unaware their figures are ugly
Indifference: scientists care only about the data, not their presentation
Take the following published scientific figure (suitably disguised):
Let’s list the problems: (1) Space is poorly used and data are cramped. (2) Text is bold for no reason. (3) Multiple fonts are used. (4) Tick marks are barely visible. (5) Some labels don’t fit in their respective boxes. (6) Axis values are unnecessarily repeated. (7) Dashed and dash-dotted lines are ugly. (8) Mathematical symbols are not italicised.
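Several of these problems can be headed off with a few default settings. Here’s an illustrative Matplotlib sketch; the values are examples, not the settings used for the figure above.

```python
import matplotlib.pyplot as plt

# Illustrative defaults that address several of the numbered problems
plt.rcParams.update({
    "font.family": "sans-serif",   # (3) a single font family throughout
    "font.weight": "normal",       # (2) no bold text without a reason
    "axes.labelweight": "normal",
    "xtick.major.size": 4,         # (4) tick marks you can actually see
    "ytick.major.size": 4,
    "mathtext.default": "it",      # (8) italicised mathematical symbols
    "lines.linestyle": "-",        # (7) solid lines unless there's a reason
})

# (1) let Matplotlib manage the spacing so panels aren't cramped
fig, ax = plt.subplots(constrained_layout=True)
```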