A better way to code up scientific figures

I typically write 100–200 lines of code each time I develop a scientific figure that is destined for publication. This is a dangerous length because it’s easy to create a functioning mess. With shorter code fragments, it’s feasible to start over from scratch, and with thousands of lines of code, it makes sense to invest time upfront to organise and plan. But in between these extremes lurks the temptation to write a script that feels coherent at the time but only creates problems for future you.

Let’s say you want to create a moderately complicated figure like this:

A script for this figure could be envisaged as a series of sequential steps:

  1. Read data in from a csv file
  2. Remove any flagged data
  3. Create four subplots
  4. Plot the first line of data against time
  5. Label the y axis
  6. Set the y axis limit
  7. Repeat steps 4–6 for the second and third lines of data
  8. Add the coloured contours and grey contour lines
  9. Label the time axis
  10. Add various annotations

If you’re comfortable with a language like Python, Matlab, or R, it’s easy to flesh out steps 1–10 as a stream-of-consciousness. Things like adding subplots, labelling panels, and setting axis limits require minimal thought. Before you know it, you’ve got a script 100 lines long.

A typical laptop screen or external monitor holds, at best, 40–50 lines. So you can no longer eyeball the script to see all the steps at once. Instead, you’ve got to rely on your short-term memory.

But, wait! Let’s say you want to test a few quick changes. So you temporarily comment out a few lines, temporarily overwrite a few variables, and temporarily add another panel.

You see the foreshadowing, right? Some of these temporary changes stick. Others get reverted. The clean sequence that was steps 1–10 becomes 1, 1b, 2, 2b, 3, 3b, 6, 5, 4, 7, 8, 9, 10, 10b, 10c, 11, 12.

The real problem with this now-messy figure script arises when you’re forced to come back to it months later. (Perhaps reviewer #2 is suggesting some changes.) When you wrote the script, you were banking on your short-term memory to understand how all the pieces fit together. Of course, that’s not going to help months later.

I’ve created plenty of these messy figure scripts in my years as a scientist. And I occasionally still do, especially when trying to work quickly. But, for the most part, I now take a better approach.

A modular approach to scripting figures

Rule 4 of the Ten simple rules for quick and dirty scientific programming tells you to modularize your code. And that’s exactly what I’m going to suggest here.

Each figure script that I write comprises a dozen or so functions. There’s a function to load in the data, a function to draw the panels, one function for each line plot, a function to label all the axes, etc. Here’s how a minimal Python example would look on my screen:
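A minimal sketch of such a script might look like the following. The synthetic data, the quantity names (temperature, salinity), the two-panel layout, and the non-interactive backend are all assumptions made for illustration; a real script would read its data from a csv file instead.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np


def load_data(n_points):
    # Stand-in for reading a csv file: synthetic time series (hypothetical data)
    time = np.linspace(0, 10, n_points)
    return {"time": time, "temperature": np.sin(time), "salinity": np.cos(time)}


def draw_panels():
    # Two stacked panels that share the time axis
    return plt.subplots(nrows=2, sharex=True)


def plot_temperature():
    axs[0].plot(data["time"], data["temperature"], color="C3")


def plot_salinity():
    axs[1].plot(data["time"], data["salinity"], color="C0")


def label_axes():
    axs[0].set_ylabel("Temperature")
    axs[1].set_ylabel("Salinity")
    axs[1].set_xlabel("Time")


# The calls below double as an outline of the script
data = load_data(n_points=200)
fig, axs = draw_panels()
plot_temperature()
plot_salinity()
label_axes()
```

Most of the functions here read `data` and `axs` from the enclosing script rather than taking them as arguments, which keeps the outline at the bottom short.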

You might think that it’s overkill that I’m writing single-line functions, which end up making the script twice as long as it could be. But trust me, this modular approach pays off for anything more complicated than this short example.

There are four benefits to grouping the lines of code into functions:

  1. You are forced to write an outline of the script
    I suggested earlier that it’s easy to envisage a figure script as a sequence of steps. But in my experience, scientists seldom write these steps down anywhere. Whereas if you create a set of functions, you are obliged to create a high-level overview. In the example, the last five lines automatically form an outline.
  2. You can describe your code in plain English
    You don’t need to know Python to understand the steps in my example script. Rather, you look at the function names, which are in plain English. You could, of course, use code comments to reach a similar result, but comments tend to get out of sync with what the code actually does.
  3. It’s easier to locate a particular command
    Suppose you want to change the colour of a few lines on a particular panel. If your figure script is several hundred lines long, it’ll take a while to locate the particular lines you need to change. This task is much quicker when those several hundred lines are subdivided into a small number of functions. It’s the same principle as using a table of contents to find a particular page within a textbook.
  4. You can comment out a single line of code, not a whole block
    In iterating toward the final figure, you’ll likely test several different arrangements, quantities, or graph types. It’s feasible to do this testing by commenting and uncommenting blocks of code. But this is cumbersome, not to mention bad practice. If, instead, you write functions that each do a particular task, you can make adjustments by commenting/uncommenting a single line. If I were to alter my example above, it might look like the following:
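As a hypothetical sketch of the fourth benefit, the outline below switches one step off and swaps another for a trial alternative, each by touching a single line. The function names are placeholders, and the bodies are stubs that only record which steps ran rather than drawing anything.

```python
steps_run = []  # records which steps ran; stands in for real plotting


def load_data():
    steps_run.append("load data")


def draw_panels():
    steps_run.append("draw panels")


def plot_temperature():
    steps_run.append("plot temperature")


def plot_temperature_anomaly():
    steps_run.append("plot temperature anomaly")


def plot_salinity():
    steps_run.append("plot salinity")


def label_axes():
    steps_run.append("label axes")


load_data()
draw_panels()
# plot_temperature()
plot_temperature_anomaly()  # trial alternative to the line above
plot_salinity()
# label_axes()  # switched off while testing the layout
print(steps_run)
```

Reverting either experiment means uncommenting one line, and the outline still shows at a glance which steps are currently active.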

Your functions need not be perfect

Only one of the functions in my original example takes any input parameters, which is arguably bad form. Each function should really have arguments or accept variables. I’m not too worried about this because I know Python will go looking for variables outside the function when it can’t find them within the function itself or within its inputs. Since each function is being used only once, I can get away with not explicitly passing variables. (In case you’re wondering… yes, there is value in general in creating functions that are used only once.)
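The name lookup described above can be seen in a small sketch. The function names and the tiny dataset are made up for illustration: the first function finds `data` at the script level because it is not defined locally, while the second receives it explicitly.

```python
def load_data():
    # Hypothetical stand-in for reading real measurements
    return {"time": [0, 1, 2], "temperature": [10.0, 10.5, 11.0]}


def warmest_time():
    # 'data' is not defined inside this function, so Python looks it up
    # in the enclosing (module-level) scope when the function is called
    idx = data["temperature"].index(max(data["temperature"]))
    return data["time"][idx]


def warmest_time_explicit(data):
    # Same logic, but the input is passed in as an argument
    idx = data["temperature"].index(max(data["temperature"]))
    return data["time"][idx]


data = load_data()
print(warmest_time())
print(warmest_time_explicit(data))
```

Both calls give the same answer. The implicit version works only for reading: assigning to `data` inside a function would create a new local variable unless the name is declared `global`, which is one reason the explicit form is the safer choice once a function is reused.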

I take almost the same approach when using Matlab. The only extra step is that the whole script has to be wrapped in a parent function. Only then are nested functions allowed. My scripts look something like this:

The suggestion to ‘use functions’ is nothing new

The common coding languages used in science (Python, Matlab, R, Julia) are great for interactive use. Type 284*396 into the command window and it’ll reply with 112 464. But the command prompt only takes you so far. You quickly realise that you want to execute many lines in succession. So you move your set of sequential commands into a script and hit Run. Unlike the command window, scripts like this can get you far. (When I say scripts, I’m also alluding to computational notebooks. Just like scripts, notebooks are messy and can foster poor coding practices.)

Many scientists can feasibly make do without ever learning about functions. Conversely, computer programmers wouldn’t get a job without knowing how to use functions. This mismatch means I’m not sure whether I’ve pitched this post at the right level. On one hand, the suggestion to use functions feels like it goes without saying. It’s as if I were advising scientific writers to use headings when writing a paper. On the other hand, I’ve seen enough messy figure scripts created by computer-savvy scientists to believe that “use functions” is worthwhile, insightful advice.

I’m not the only one trying to reconcile the gap between programmers and scientists. As Simon Hettrick of the Software Sustainability Institute put it:

“What does this mean for scientists who code? Do they all have to become software engineers in order to be real coders? I think not. I think that scientists should use computer programming as an exploratory tool to drive discovery in their fields, the way they use other methods and tools, but scientist-coders can benefit from learning about modularity, abstraction, and data structures.”

Author: Ken Hughes

Post-doctoral research scientist in physical oceanography
