## Line graphs: the best and worst way to visualise data

Line graphs are the Swiss army knives of data visualisation. They can be almost anything… which is both good and bad.

### Line graphs are slow to interpret

Many graphs serve one clear purpose. Take the five graphs below:

Even without labels, it’s clear what role each of these graphs serves:

• Pie chart—components of a total
• Thermometer—progress toward a goal amount
• Speedometer—percentage of the largest possible value
• Histogram—distribution of values
• Box plot—statistical summaries of several datasets

In other words, if I’m presented with one of the graphs above, I have an immediate head start on interpreting it. If, instead, I’m presented with a line graph, I’m forced to read the axis labels and limits first.

Deciphering text is the slow way to take in information. Shape is fastest, then colour, and only then text. This so-called Sequence of Cognition, popularised by Alina Wheeler, is something marketers need to know about.


## A better way to code up scientific figures

I typically write 100–200 lines of code each time I develop a scientific figure that is destined for publication. This is a dangerous length because it’s easy to create a functioning mess. With shorter code fragments, it’s feasible to start over from scratch, and with thousands of lines of code, it makes sense to invest time upfront to organise and plan. But in between these extremes lurks the temptation to write a script that feels coherent at the time, but just creates problems for future you.

Let’s say you want to create a moderately complicated figure like this:

A script for this figure could be envisaged as a series of sequential steps:

1. Read data in from a csv file
2. Remove any flagged data
3. Create four subplots
4. Plot the first line of data against time
5. Label the y axis
6. Set the y axis limit
7. Repeat steps 4–6 for the second and third lines of data
8. Add the coloured contours and grey contour lines
9. Label the time axis
10. Add various annotations

## Captioning a scientific figure is like commenting code

Comments within code are harmless, right? They don’t affect run-time, so you might as well use them whenever there’s any doubt something is unclear.

I hope you aren’t nodding your head, because a liberal use of comments is the wrong approach. Not all types of code comments are evil, but many are rightfully despised by programmers as (i) band-aid solutions to bad code, (ii) redundant, or even (iii) worse than no comment at all.

The same is true for scientific figures and their captions. In fact, many of the rules discussed in the post Best Practices for Writing Code Comments remain valid when we replace comments and code with captions and figures, respectively.


## 100 things a scientist should know about

Inspired by 250 things an architect should know but 60% less ambitious


1. Anscombe’s quartet

Four distinct datasets (x vs y) that produce the same summary statistics (mean, variance, correlation coefficient, and line of best fit)
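The claim is easy to check for yourself. Here is a minimal sketch that uses the values from Anscombe’s 1973 paper; the names `quartet`, `correlation`, and `summarise` are my own for illustration. Each of the four datasets should come out with a mean of 9 and variance of 11 in x, a mean of about 7.5 and variance of about 4.1 in y, and a correlation coefficient of about 0.82.

```python
from math import sqrt
from statistics import mean, variance

# Anscombe's four datasets; sets I-III share the same x values
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def correlation(x, y):
    """Pearson correlation coefficient, computed from its definition."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def summarise(x, y):
    """Summary statistics that hide how different the datasets look."""
    return {
        "mean_x": mean(x),
        "mean_y": mean(y),
        "var_x": variance(x),  # sample variance
        "var_y": variance(y),
        "r": correlation(x, y),
    }


summaries = {name: summarise(x, y) for name, (x, y) in quartet.items()}
for name, stats in summaries.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

The numbers agree to about two decimal places, yet a scatter plot of each dataset looks nothing like the others.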

2. HSL colour space

A colour space that defines colours in terms of their Hue (e.g., red or blue), Saturation (vivid to washed out), and Lightness (white to black)
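As a quick illustration, Python’s built-in `colorsys` module converts between RGB and this colour space. The wrapper `rgb_to_hsl` is a hypothetical helper of mine that reorders the output, because `colorsys` returns the components as hue, lightness, saturation.

```python
import colorsys


def rgb_to_hsl(r, g, b):
    """Convert RGB (each 0-1) to (hue in degrees, saturation, lightness)."""
    h, l, s = colorsys.rgb_to_hls(r, g, b)  # note the H, L, S ordering
    return h * 360, s, l


print(rgb_to_hsl(1.0, 0.0, 0.0))  # pure red -> (0.0, 1.0, 0.5)
print(rgb_to_hsl(1.0, 1.0, 1.0))  # white    -> (0.0, 0.0, 1.0)
print(rgb_to_hsl(0.0, 0.0, 0.0))  # black    -> (0.0, 0.0, 0.0)
```

Pure red sits at full saturation and mid lightness; pushing lightness to 1 or 0 gives white or black regardless of hue.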

3. Whitespace

The area within a design (website, poster, figure, etc) that lacks text, images, or other elements

4. HARKing

A questionable approach to research: Hypothesising After the Results are Known

5. Fermi problems

Problems in which an answer cannot be estimated outright but is instead derived as the product of more easily estimated quantities (e.g., how many grains of rice are eaten across the world every year?)

6. Use cases of .png and .jpg images

The JPG format is optimised for photos, whereas PNGs are for graphs and diagrams

7. That vs which

That and which, although similar, have opposing implications about whether a clause is restrictive or not

8. McNamara Fallacy

Basing a decision on only numbers or other objective measures without reference to any qualitative factors

9. Version control

A tool for tracking and recording all changes to software and other digital files as they evolve

10. Serial position effect

The human tendency to better remember what happened at the start and the end and forget what happened in the middle

11. Zenodo and Figshare

Online repositories for datasets, code, and other research output

12. Your carbon footprint

A typical person living in a western country will have an annual footprint of 5–20 tonnes CO₂

13. Matthew effect

Well-known scientists get cited more often than lesser-known ones, leading to a positive feedback loop

14. Hand-waving solutions

A metaphor for an answer that might gloss over details, be vague, or rely on many approximations

15. Logarithmic scales

Scales that increase geometrically (e.g., 1, 2, 4, 8, 16, …) rather than linearly (2, 4, 6, 8, …)

16. “Data” is a plural

It can sound odd, but data were collected is correct and data was collected is not

17. Left-branching sentences

A sentence structure to avoid because the initial words only make sense as the sentence nears its end

18. Regression to the mean

A statistical tendency for outliers in an initial experiment to deviate less in a subsequent experiment

19. ImageMagick

Software for all manner of image manipulations and conversions that can be run from the command line

20. Bash shell

The default command line interface on most Linux distributions

21. Butterworth filters

A widely used approach for smoothing time-series data

22. Golden ratio

The value 1.618…; an aesthetically pleasing aspect ratio for a rectangle among many other claims to fame

23. Types of map projections

Flattening the earth to a two-dimensional image can be achieved in numerous ways, each with its own pros and cons

24. DOI and PMID

Unique digital identifiers that can point to publications, datasets, software, and more

25. Edward Tufte

An early name in data visualisation and author of several books on the topic

26. Widows and orphans

A line at the beginning or end of a paragraph that is separated from the rest by a page break

27. Construction cost of the Large Hadron Collider

One of the most expensive scientific experiments took ~3 billion Swiss Francs to build (or ~5 billion US dollars back in 2001)

28. Resolution of an electron microscope

Electron microscopes can resolve objects as small as 0.1 nanometres

29. Adding epicycles

Tweaking a fundamentally flawed theory in a last-ditch effort to make it explain observations

30. Why governments fund basic scientific research

Among many reasons, basic scientific research (i) lowers the barrier for firms that want to develop new products and (ii) develops skilled scientists and engineers who can capitalise on research undertaken elsewhere

31. William Shockley’s thoughts on productivity

Shockley speculated that a small number of scientists can be exponentially more productive in total because the creation of a scientific paper is the combination of many individual tasks, and productivity in each of these tasks multiplies together to give overall productivity.

32. Kerning

Adjusting the spacing between individual letters in text to improve aesthetics

33. How last authorship varies across fields

Depending on scientific field, the last author either did the least work, is the group leader, obtained funding for the project, or has a surname near the end of the alphabet

34. Project Jupyter

An open source project that simplifies and promotes interactive use of many programming languages

35. Uncertainty propagation

The uncertainty of a derived quantity (e.g., kinetic energy derived from speed and mass) can be calculated from the uncertainty of the input quantities following simple (though sometimes tedious) arithmetic
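For the kinetic energy example, the arithmetic could look like the sketch below. It assumes the uncertainties in mass and speed are independent and small enough for the usual first-order rule, in which relative uncertainties add in quadrature and each is weighted by its exponent. The function name and the input values are made up for illustration.

```python
from math import sqrt


def kinetic_energy_with_uncertainty(mass, sigma_mass, speed, sigma_speed):
    """Return E = 0.5 * m * v**2 and its first-order uncertainty."""
    energy = 0.5 * mass * speed**2
    rel_mass = sigma_mass / mass
    rel_speed = sigma_speed / speed
    # Speed is squared, so its relative uncertainty counts twice
    rel_energy = sqrt(rel_mass**2 + (2 * rel_speed) ** 2)
    return energy, energy * rel_energy


# Hypothetical measurements: m = 2.0 ± 0.1 kg and v = 3.0 ± 0.2 m/s
energy, sigma_energy = kinetic_energy_with_uncertainty(2.0, 0.1, 3.0, 0.2)
print(f"E = {energy:.1f} ± {sigma_energy:.1f} J")  # E = 9.0 ± 1.3 J
```

Note that the speed measurement dominates the result: a 7% uncertainty in speed contributes far more than the 5% uncertainty in mass.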

36. SSH (Secure Shell)

The standard way to access a remote server via the command line

37. Difference between a hyphen and a minus sign

Although similar, they should not be confused; a hyphen (-) is a short dash used to combine words, whereas a minus sign is longer (−)

38. Optimal number of characters per line

A line of text should have 60–70 characters (counting spaces) for a single-column layout and 40–50 for multiple columns (see page 32 of Detail in Typography)

39. The .eps file type

A predecessor to PDF that was developed in the late 1980s and is almost obsolete

40. Stroke and fill

For line drawings, the edge is known as the stroke and the interior is known as the fill

41. The Greek alphabet

The order doesn’t matter, but knowing the individual letters is worthwhile

42. Text anti-aliasing

The smoothing of text to improve its appearance (especially relevant at coarse resolution)

43. Triptychs

A three-panel image or collection of images (and an easy way to create an attractive title slide)

44. Active voice

Better than passive voice in most cases

45. Fast Fourier transform

An algorithm that makes much of modern technology possible

46. Pseudoscience

Statements and methods purportedly grounded in science but obviously flawed

47. ORCID

A unique digital identifier for a researcher that is linked with their scholarly works

48. Transistors

Your phone likely has billions of them

49. RAM

A computer’s short-term memory in a sense (distinct from the long-term memory that is the hard drive)

50. Effective cost of a night’s worth of observation from a large telescope

About \$50 000

51. Functions in programming

One of the building blocks of any programming language that, typically, (i) takes one or more inputs, (ii) does whatever to those inputs, and (iii) returns an output

52. Argument from authority fallacy

The incorrect assumption that a claim is true because it is coming from an authority figure

53. Preregistered studies

Studies in which the methodology and hypothesis are published before data are obtained

54. Strawman argument

Misrepresenting a claim or changing its context so as to make it easier to argue against

55. The amount of freely available satellite data

NASA, for example, currently has about 30 active earth-observing satellites producing about 30 TB of data each day

56. For loops

The simplest way in most programming languages to make a computer do something again and again

57. Illusion of explanatory depth

Most people are overconfident in their understanding of a complex phenomenon or procedure until they try to explain it step by step

58. How regression works

Calculating a line of best fit is one of those things everyone should do manually at least once to understand the procedure that can otherwise be a black box
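If doing it by hand feels like too much, the next best thing is to write out ordinary least squares yourself rather than call a library. A minimal sketch, with made-up data and a function name of my own choosing:

```python
def line_of_best_fit(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations, and sum of squared deviations in x
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # the line passes through the means
    return slope, intercept


# Hypothetical data
x = [1, 2, 3, 4]
y = [2.1, 3.9, 6.2, 7.8]
slope, intercept = line_of_best_fit(x, y)
print(f"y ≈ {slope:.2f} x + {intercept:.2f}")  # y ≈ 1.94 x + 0.15
```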

59. Bayesian statistics

An approach to statistics in which probabilities are continually updated as new information is obtained

60. System 1 and 2 thinking

Two distinct ways of thinking: system 1 is fast and driven by intuition and emotion, whereas system 2 is slower and more deliberate

61. Pasteur’s quadrant

Use-inspired basic research, or the view that basic and applied research aren’t mutually exclusive

62. Sayre’s law

In any dispute the intensity of feeling is inversely proportional to the value of the issues at stake

63. Planning fallacy

The tendency to underestimate the time needed to complete a task (e.g., writing a scientific paper) even with prior experience in the same or similar tasks

64. Floating point numbers

The system used by computers that allows a small number of bits (a zero or one) to represent a wide range of numbers (e.g., 64 bits can be used to closely approximate any number, positive or negative, up to 1.8×10³⁰⁸)
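A few lines of Python show both the range and the word “approximate” in action. This sketch assumes, as is the case on essentially every modern platform, that Python’s `float` is a 64-bit number; the variable names are mine.

```python
import math
import sys

largest = sys.float_info.max  # largest finite 64-bit float, about 1.8e308
print(largest)
print(largest * 10)  # beyond the range, so the result is inf

# Most decimal fractions are stored as close approximations
total = 0.1 + 0.2
print(total == 0.3)  # False
print(math.isclose(total, 0.3))  # True
```

The practical lesson is to compare floating point numbers with a tolerance rather than with `==`.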

65. The Dobzhansky Template

A format coined by scientist-turned-filmmaker Randy Olson that aims to drill down to the essence of an idea: Nothing in ___ makes sense except in light of ___ (e.g., nothing in biology makes sense except in light of evolution)

66. Newman design squiggle

A visual metaphor for the design process that works equally well for the process of doing science

67. Gestalt Laws

Design laws, grounded in psychology, for how humans perceive combinations of objects or elements

68. Einstellung effect

An inefficient habit in problem solving where you rely on approaches that worked in the past even when better methods are available

69. Bike-shedding

Also known as the Law of Triviality, bike-shedding is placing undue emphasis on minor matters such as the design of bike sheds to be included within the development of a nuclear power plant

70. Simpson’s Paradox

Subsets of a dataset, all of which have a negative statistical trend, can still produce a positive trend in the overall dataset

71. Parkinson’s law

Work expands to fill the time available for its completion

72. Epistemic trespassing

When an expert in a given field trespasses into another and makes claims where they lack expertise

73. Decline effect

The strength or effect size of a scientific result tends to decline over successive replications

74. Base rate neglect

Misjudging the probability of an event due to more intuitive individuating information (e.g., thinking it’s more likely than not that someone who is 6-foot-8 plays basketball professionally, except that the chances are a fraction of 1%)

75. Identifiable victim effect

The desire to assist a specific individual facing a certain hardship but not a large, unknown group of people facing the same hardship

76. John Ioannidis

A somewhat controversial physician/scientist perhaps best known for his claim that most published research findings are false

77. Texas sharpshooter fallacy

Deriving incorrect conclusions by overly focusing on clusters of data points that may have arisen by chance

78. Survivorship bias

A type of selection bias in which the dataset contains only people who made it past some hurdle

79. BANs: big-ass numbers

One of the simplest ways to visualise data in which, in place of graphs, a few select metrics are displayed as numbers in large text

80. Anchoring bias

The tendency (and salesperson’s boon) for people to focus on relative changes from an initial value rather than the absolute amount

81. The difference between science and engineering

Scientists aim to generate new knowledge and engineers aim to apply knowledge to solve real-world problems

82. Researcher degrees of freedom

A measure of the flexibility a scientist has in developing, analysing, and publishing an experiment

83. Banking to 45°

As a rule of thumb, the aspect ratio of a line graph should be one in which the changes to be emphasised have a slope of ~45°

84. The linear model of innovation

The conjecture that basic research informs applied research, which promotes development and production, which ultimately leads to economic growth

85. Daryl Bem’s precognition paper

An infamous study—that passed peer-review—that purportedly shows that people can essentially see briefly into the future

86. Starting with the cake

A teaching philosophy that starts with the big picture rather than tedious fundamentals

87. arXiv

One of the original preprint servers (now 30 years old)

88. Principle of least astonishment

A guideline that encourages a design (say, an interface or piece of software) to be built to behave in a way that most users expect it to

89. Germanic vs Latinate words

Words with a German heritage tend to be simpler and less pretentious than those from Latin

90. Altmetrics

A type of citation measure that counts mentions in blogs, tweets, and other social media rather than standard citations in scientific papers

91. scite.ai

A (now expensive) AI service that summarises the different ways a paper is cited (supported, contrasted, or mentioned) rather than merely counting the number of citations

92. Root cause analysis

A problem solving technique that looks to solve the underlying issue rather than the immediate (and possibly superficial) problem

93. Complex vs complicated

Something that is complicated may involve a tedious number of straightforward steps, whereas something that is complex may have multiple nonlinear interactions and emergent behaviour

94. WEIRD subjects

People from Western, Educated, Industrialized, Rich, and Democratic societies who are over-represented in scientific studies involving human subjects

95. Inkscape

A vector graphics editor that is more than sufficient for a scientist’s needs

96. Donald Knuth

A computer scientist notable for, among many things, the creation of the TeX typesetting language and his decision to forgo email as of Jan 1, 1990

97. Oblique, isometric, and one- and two-point perspective

Four standard ways to project a three-dimensional object into two dimensions

98. The second law of thermodynamics

Entropy of an isolated system cannot decrease or, more simply, heat flows from hot to cold

99. The rate of sea level rise

The current global average is about 4 mm/yr, but this varies regionally depending on the vertical movement of land

100. The Oxford comma

The comma placed before “and” or “or” in a list of three or more items

## Science is interesting, but not exciting… according to our papers

I used my laptop to scan the text of 360 scientific papers for use of the word exciting (and excited and other variants). I got 195 matches. That’d suggest that scientists imbue their writing with their own excitement for science. Except that 191 of those matches are physics jargon (as in wind excites ocean waves) rather than the everyday meaning. Remove those and we’re left with ~1% of papers indicating any excitement.

That’s a weird thing to look into is what you’re thinking, so two bits of context. First, there’s lots to be learned about scientific writing by looking at word usage statistics; see my two previous posts. Second, I came across one of these rare uses of exciting with its everyday meaning, and it stood out! Which is messed up. It’s a common word, yet it struck me as out of place in a scientific paper. Not because I think it should be, but because it is.

For comparison, I looked at the words interesting and interestingly. The result: 237 matches, all of which correspond to their everyday usage. (Including interest and interested in my search more than doubles the number of matches.)
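For anyone wanting to try something similar, a scan like this needs little more than a regular expression. The sketch below is not the script I used; the `papers` list is a hypothetical stand-in for the full text of each paper, and the patterns are one plausible way to catch the variants.

```python
import re

# Hypothetical stand-ins for the full text of each paper
papers = [
    "Wind excites ocean waves, and the excited modes decay slowly.",
    "This is an exciting and interesting result.",
    "Interestingly, no trend was found.",
]


def count_matches(texts, pattern):
    """Return (total matches, number of texts with at least one match)."""
    regex = re.compile(pattern, flags=re.IGNORECASE)
    per_text = [len(regex.findall(text)) for text in texts]
    return sum(per_text), sum(1 for n in per_text if n > 0)


EXCITE = r"\bexcit(?:e|es|ed|ing|ingly|ement)\b"
INTEREST = r"\binteresting(?:ly)?\b"

print(count_matches(papers, EXCITE))    # (3, 2)
print(count_matches(papers, INTEREST))  # (2, 2)
```

A pattern match alone can’t separate the physics jargon from the everyday meaning, so that part still needs a human to read each match in context.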


## Concluding a scientific paper with a video

As scientists, we record our findings in perpetuity in PDFs, which are literally simulations of pieces of paper. It’s time to be more dynamic and invoke a proliferation of media types. We don’t need to get rid of the notion of a paper or stop using a PDF as the version of record. But we do need to complement them with something less static. What follows is an approach I recently took using video.

The final sentence of my latest paper (preprint) steers the reader to a video that stands in place of a Conclusion section. And I’m guessing this video is a much more compelling Conclusion than any possible combination of words.

Here’s the gist of the final paragraph (paraphrased to avoid jargon):

Our simulation was made possible by tuning against measurements from a new instrument. This observation-informed simulation depicts instabilities as they evolve throughout the day. It is best appreciated as an animation (doi.org/10.5281/zenodo.4306935).

The link goes to a copy of the video below:


## Ugly scientific figures: Are scientists lazy, indifferent, or oblivious?

Too many scientific figures are ugly. I see three possible reasons:

1. Laziness: scientists could make nice figures, but don’t put in the effort
2. Obliviousness: scientists are unaware their figures are ugly
3. Indifference: scientists care only about the data, not their presentation

Take the following published scientific figure (suitably disguised):

Let’s list the problems: (1) Space is poorly used and data are cramped. (2) Text is bold for no reason. (3) Multiple fonts are used. (4) Tick marks are barely visible. (5) Some labels don’t fit in their respective boxes. (6) Axis values are unnecessarily repeated. (7) Dashed and dash-dotted lines are ugly. (8) Mathematical symbols are not italicised.


## Unintentional entertainment in scientific writing

Save for the occasional pun in the title, scientific papers seldom contain intentional humour. But there’s entertainment to be had if you have the right mindset. Let me show you.

Relatability can be the basis of a good laugh. And as a scientist who routinely uses time series data, I can relate to the struggle of unwanted gaps in a dataset. So I was entertained when I came across the following sentence:

No data are available for 1991 and 1992 because the volcanic eruption of Mt Pinatubo in 1991 contaminated the signal. (ref)

Why, exactly, am I entertained, you ask? Partly, it’s the notion of a very expensive satellite being thwarted by a bit of ash. More so, it’s that the sentence is the epitome of scientific writing. A freakin’ volcanic eruption messes up two years’ worth of data, and yet it’s described in the same matter-of-fact tone as the other technical details like the satellite’s pixel resolution. Good luck finding any other types of writers who recount a long-lived effect of a natural disaster in a single sentence.


## Computers make me a worse mathematician, but a better scientist

“A computer gives the average person, a high school freshman, the power to do things in a week that all the mathematicians who ever lived until thirty years ago couldn’t do.” That’s Ed Roberts quoted in Hackers, a book published in 1984. So let me update his quote with my own: “My laptop gives me the power to run simulations in an afternoon that the fastest computers thirty years ago would have struggled with”.

This power has a downside. Computers are so fast these days that I’ve become lazy—mathematically speaking. A few decades ago, in my field of physical oceanography, it was routine to manipulate partial differential equations and solve complex integrals. I can do these things, if I put my mind to it. But I seldom do; there’s no need. These days, even ordinary differential equations that I learned to solve in undergrad get plugged into Mathematica most of the time or relegated to some less-than-perfect numerical method. And I can’t remember the last time I did multiplication longhand:


## Transit map–style scientific figures

A good map is geographically accurate and to scale, right?

Not always. Transit maps are one exception. They are intentionally distorted in order to be information dense, yet clean, spacious, and organised.

Many of the design decisions that go into a transit map also apply to scientific figures. There’s a lot for us scientists to learn from a careful look at transit maps.

### A typical transit map: Singapore
