Creating realistic fake data

Have you ever noticed the similarities between stock images that convey an increase? More often than not, it’s an arrow initially heading up at about a 30° angle, followed by a downturn, before continuing back up. There’s occasionally a second down-and-up for good measure. It’s sufficiently clichéd that Yale Economics should feel a little embarrassed to have incorporated it into their logo.

Typical results from an image search for “increase” alongside Yale Economics’ logo

In the spirit of an English teacher inferring a lot from a little, I wonder if the downturns are intended to lend some kind of story arc to the progress implied by the ascending arrow. As in, you have to get knocked down before you can get back up. Or maybe the downturns are there to instil a sense of realism?

Absurd as these rhetorical questions sound, they hint at a surprisingly profound issue about fake data and what makes it look real.

Just to be clear, I’m not encouraging anyone to lie by creating fake data (unless you’re in a scenario where doing so actually helps scrutinise your results or avoids sharing confidential information). Treat me more like a stats professor assigning the classic book How to Lie with Statistics as a way to teach what not to do. Let’s pretend for now, though, that you do actually want to lie. To avoid getting caught, buckle up for a lesson involving forged signatures, Charles Ponzi of the eponymous scheme, a hundred-billion-dollar company manipulating random sequences, and a tip on how to catch a liar.

Like writing a good children’s book or painting abstract art, creating realistic-looking fake data is harder than you might think. Take the header image for this website, for example. It shows two random signals that I created mathematically, but that were inspired by real data:

The header image for this website at the top is inspired by the figure beneath, which is one I developed for Langhorne et al. (2015)

My early attempts to create the figure (not shown) were unconvincing because I tried to make them by hand before appreciating a few key aspects for making realistic-looking data. Here, the fake signals share two key attributes with the real signal. First, the raw, unsmoothed signals have noise levels proportional to the level of variation in the smoothed signal: when the smoothed signal is oscillating at the start, the noise is strong, and vice versa. Second, and more important, the “noise” isn’t just any noise. In this case, it’s red noise (more on that later).
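As a rough illustration of those two attributes, here is a minimal Python sketch. It is not the code behind the header image: the decaying oscillation, the function name `noisy_signal`, and the parameters `phi` and `noise_scale` are all made up for this example, and a first-order autoregressive process stands in for red noise.

```python
import math
import random


def noisy_signal(n=400, phi=0.9, noise_scale=0.5, seed=0):
    """Return (smooth, raw): a smooth signal and a noisy version of it.

    Everything here is an illustrative assumption, not the original method.
    """
    rng = random.Random(seed)

    # Hypothetical smooth signal: an oscillation that dies away over time.
    envelope = [math.exp(-3 * t / n) for t in range(n)]
    smooth = [e * math.sin(2 * math.pi * 6 * t / n)
              for t, e in enumerate(envelope)]

    # AR(1) noise: each value remembers a fraction phi of the previous one,
    # which favours slow wiggles over fast ones (a simple red-ish noise).
    red, x = [], 0.0
    for _ in range(n):
        x = phi * x + rng.gauss(0, 1)
        red.append(x)

    # Attribute 1: noise amplitude follows the local variation (the envelope).
    raw = [s + noise_scale * e * r for s, e, r in zip(smooth, envelope, red)]
    return smooth, raw


smooth, raw = noisy_signal()
```

Plotting `raw` over `smooth` should show jitter that is strong where the oscillation is large and fades as the signal flattens.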

My failure to fake data by hand points to a more general rule: humans are bad at creating, recognising, or appreciating randomness. The iPod’s shuffle function illustrates this well. Prior to 2005, as Steven Levy notes, the shuffle function served up a truly random assortment of songs. Yet many people mistook simple coincidences linking consecutive songs, or a disproportionate number of songs by a single artist, as evidence of a non-random order. Ironically, such occurrences are exactly what you expect in a random sequence. In September 2005, Apple debuted a smart shuffle feature that Steve Jobs described as being less random to make it feel more random. Some game designers undertake similar manipulations. It seems obvious if you think about it. A truly random sequence would include the possibility of the same song appearing twice in a row, perhaps even three times. Presumably, no one using shuffle actually wants this to happen.
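A quick simulation makes the point about coincidences. The sketch below shuffles a hypothetical library (10 artists with 10 songs each, numbers invented purely for illustration) and counts how often at least one artist plays twice in a row. It makes no claim about how Apple’s shuffle actually works.

```python
import random


def has_adjacent_same_artist(playlist):
    """True if any two consecutive entries share an artist."""
    return any(a == b for a, b in zip(playlist, playlist[1:]))


def fraction_with_clusters(n_artists=10, songs_per_artist=10,
                           trials=2000, seed=0):
    """Fraction of uniformly random shuffles with a back-to-back artist."""
    rng = random.Random(seed)
    # Each song is represented only by its artist's index.
    library = [artist for artist in range(n_artists)
               for _ in range(songs_per_artist)]
    hits = 0
    for _ in range(trials):
        rng.shuffle(library)  # every ordering is equally likely
        hits += has_adjacent_same_artist(library)
    return hits / trials


print(fraction_with_clusters())
```

With these assumed numbers, an average shuffle contains about nine back-to-back pairs by the same artist, so nearly every shuffle has at least one.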

Who to trust

Quick thought experiment: let’s say you see a commercial for a new toothpaste purporting to whiten your teeth. The ad states that “97% of those tested achieved measurably whiter teeth”. What if, instead, “100% of those tested achieved measurably whiter teeth”? If every single person improves, then surely the product works better. Except, this is a commercial. Perhaps the marketers are taking some artistic license on the efficacy of their product. Alternatively, when they concede that it doesn’t work 3% of the time, you may be less likely to question whether they’re telling the whole truth.

The difference between 97% and 100% is inconsequential in this toothpaste scenario. In other cases, there is such a thing as being too perfect. There’s a story in the book Math on Trial that describes a signature being identified as a forgery because it was too similar to an existing one. The logic was that there are expected levels of variation between different signatures by the same person. An exact replica, or something very close, won’t occur unless an existing signature has been copied.

An overabundance of similarities is also suggestive of lying. Someone giving a false story is more likely to repeat verbatim the details of their carefully rehearsed story. A truthful story, on the other hand, will include variations when told a second time.

The corollary for scientists: if you’re going to, say, duplicate and manipulate photos … just don’t!

Data don’t lie

If you want to cook the books for your company by fabricating financial results, you’d do well to be aware of Benford’s law. Put simply, if you don’t want the numbers to look suspect, the first digits should not be evenly spread across 1 to 9. Financial records are one of several scenarios in which a large list of numbers displays a distinct pattern: about 30% of the numbers will start with 1, whereas only about 5% will start with 9. The title of a Reddit post succinctly illustrates the concept: Microwaves could have non-functional “9” buttons and most people would never notice.
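More precisely, Benford’s law says a leading digit d turns up with probability log10(1 + 1/d), which works out to about 30.1% for 1 and 4.6% for 9. Here is a minimal sketch that compares observed first-digit frequencies with that prediction. The function names are invented for illustration, and powers of 2 serve as a stand-in dataset because they are a standard example of a sequence that follows the law; they are not financial records.

```python
import math
from collections import Counter


def benford_expected(d):
    """Benford probability that the leading digit is d (1 to 9)."""
    return math.log10(1 + 1 / d)


def first_digit(x):
    """Leading non-zero digit of a non-zero number."""
    # Scientific notation always starts with the leading digit.
    return int(f"{abs(x):.15e}"[0])


def first_digit_frequencies(values):
    """Observed share of each leading digit, ignoring zeros."""
    counts = Counter(first_digit(v) for v in values if v != 0)
    total = sum(counts.values())
    return {d: counts.get(d, 0) / total for d in range(1, 10)}


observed = first_digit_frequencies(2 ** n for n in range(1, 1001))
for d in range(1, 10):
    print(d, round(observed[d], 3), round(benford_expected(d), 3))
```

A fabricated ledger whose first digits are spread evenly would show roughly 11% in every row of the observed column, which is a long way from the expected one.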

The same distribution does not apply to the final digits, which should be spread roughly evenly across 0 to 9. As of writing, a scandal has recently broken out in which a scientist appears to have fabricated data in a number of papers. Among the many suspicious pieces of evidence is a non-uniform distribution of terminal digits in one of the datasets. (Other evidence includes observational data falling on straight lines whose slopes and intercepts are exact to one decimal place, and several rows of data that are duplicates or near duplicates.)
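One simple way to check terminal digits is a chi-square test against an even spread. The sketch below is an illustration only, not the analysis used in the case mentioned above. It assumes integer-valued data, and the function name is hypothetical. The threshold of 16.92 is the 5% critical value of the chi-square distribution with 9 degrees of freedom.

```python
from collections import Counter

# 5% critical value for chi-square with 9 degrees of freedom (10 digits - 1).
CHI2_CRITICAL_DF9_5PCT = 16.92


def last_digit_chi2(values):
    """Chi-square statistic of final digits against a uniform spread.

    Assumes the values are integers (or can be truncated to integers).
    """
    digits = [abs(int(v)) % 10 for v in values]
    expected = len(digits) / 10  # each digit should appear equally often
    counts = Counter(digits)
    return sum((counts.get(d, 0) - expected) ** 2 / expected
               for d in range(10))


evenly_spread = list(range(1000))              # final digits perfectly even
rounded_to_five = [5 * k for k in range(1000)]  # final digits only 0 or 5

for name, data in [("even", evenly_spread), ("rounded", rounded_to_five)]:
    stat = last_digit_chi2(data)
    print(name, stat, "suspicious" if stat > CHI2_CRITICAL_DF9_5PCT else "ok")
```

A statistic above the threshold only says the digits are unlikely to be evenly spread. Honest data can fail too, for example when measurements are rounded, so treat it as a prompt to look closer rather than as proof.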

Benford’s law arises in numerous situations. In any case, though, it requires a sufficiently large set of numbers to confirm or deny any manipulation. But even if you want to fake just one number, be like Charles Ponzi. On finding a checkbook and forging a signature, he wrote a check for himself for $423.58. As Coralie Colmez and Leila Schneps suggest, although it’s a large sum for the time (early 1900s), at least it’s authentic-sounding.

The colour of noise

And now for something that seems completely different, yet is actually one of the best ways to characterise a dataset: its colour.

If the idea of either a signal or its noise having a colour seems puzzling or peculiar, then see Wikipedia’s article on the topic. While there, listen to the short audio clips of the different colours of noise, namely, white, red, and blue. (Don’t ask me why a few YouTubers have deemed it necessary to create equivalent clips in video format that are 12 hours long.) Red noise, like red light, has a higher proportion of longer waves (think more bass, less treble). Blue is the opposite. White noise has no such bias: it is evenly distributed in wavelength (or frequency). In other words, “white noise” isn’t just a term that describes, say, the background noise in a coffee shop; to scientists, it actually has a specific definition. (Green or pink noise may actually be more representative of the background noise of the world.)

If you know the expected colour of a signal (and have a good grasp of Fourier analysis), it is straightforward to create a fake signal that looks real. For example, a signature of turbulence in the ocean, atmosphere, or even quantum fluids is that the signal is nearly red. I’ll skip the mathematical details and show the result. Each of the following is a different realisation of the same nearly-red noise:

Fake turbulent signals

There you have it: four examples of realistic yet synthetic data. I’m sure you’ll agree that there’s clearly something similar about them, yet they’re clearly different. Understanding the colour of noise makes it possible to create such signals. It can also add a poetic touch to the title of your next scientific paper.
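For anyone curious about the skipped details, signals of this kind can be sketched as a sum of sinusoids with random phases, where the power at each frequency falls off as a power law. This is a minimal sketch and not necessarily how the figure above was made. The exponent behind the figure is not stated here, so `beta = 5/3` (borrowed from the classic turbulence spectrum) is an assumption, and the function names are invented for illustration.

```python
import math
import random


def coloured_noise(n=512, beta=5 / 3, seed=None):
    """Sum of sinusoids with random phases and power ~ k**(-beta).

    beta = 0 gives white noise and beta = 2 gives red noise. The default
    of 5/3 is an assumption standing in for "nearly red".
    """
    rng = random.Random(seed)
    freqs = range(1, n // 2 + 1)
    amps = [k ** (-beta / 2) for k in freqs]  # amplitude = sqrt(power)
    phases = [rng.uniform(0, 2 * math.pi) for _ in freqs]
    return [
        sum(a * math.cos(2 * math.pi * k * t / n + p)
            for k, a, p in zip(freqs, amps, phases))
        for t in range(n)
    ]


def lag1_autocorrelation(x):
    """Correlation between each value and the next one."""
    mean = sum(x) / len(x)
    dev = [v - mean for v in x]
    return sum(a * b for a, b in zip(dev, dev[1:])) / sum(d * d for d in dev)


# Same colour, different random phases: similar character, different shape.
realisations = [coloured_noise(seed=s) for s in range(4)]
```

Only the random phases change between realisations, which is why the signals look related without being copies. The helper `lag1_autocorrelation` gives a rough measure of redness: neighbouring values in red noise are strongly correlated, whereas in white noise they are nearly independent.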

To be honest, there’s no real call to action or single takeaway from this post. It’s more a superficial meander through a few concepts related primarily to time series data. But if for some weird reason you do actually find yourself in a scenario that calls for a piece of clip art that signifies an increase over time, consider adding some realism. Throw in some flat periods. Include consecutive decreases. Upticks needn’t follow downturns every single time.


Author: Ken Hughes

Post-doctoral research scientist in physical oceanography
