mscroggs.co.uk
mscroggs.co.uk

subscribe

Blog

 2023-02-03 
Imagine a set of 142 points on a two-dimensional graph. The mean of the \(x\)-values of the points is 54.26. The mean of the \(y\)-values of the points is 47.83. The standard deviation of the \(x\)-values is 16.76. The standard deviation of the \(y\)-values is 26.93.
What are you imagining that the data looks like?
Whatever you're thinking of, it's probably not this:
The datasaurus.
This is the datasaurus, a dataset that was created by Alberto Cairo in 2016 to remind people to look beyond the summary statistics when analysing a dataset.

Anscombe's quartet

In 1972, four datasets with a similar aim were publised. Graphs in statistical analysis by Francis J Anscombe [1] contained four datasets that have become known as Anscombe's quartet: they all have the same mean \(x\)-value, mean \(y\)-value, standard deviation of \(x\)-values, standard deviation of \(y\)-values, linear regression line, as well multiple other values related to correlation and variance. But if you plot them, the four datasets look very different:
Plots of the four datasets that make up Anscombe's quartet. For each set of data: the mean of the \(x\)-values is 9; the mean of the \(y\)-values is 7.5; the standard deviation of the \(x\)-values is 3.32; the standard deviation of the \(y\)-values is 2.03; the correlation coefficient between \(x\) and \(y\) is 0.816; the linear regression line is \(y=3+0.5x\); and coefficient of determination of linear regression is 0.667.
Anscombe noted that there were prevalent attitudes that:
The four datasets were designed to counter these by showing that data exhibiting the same statistics can actually be very very different.

The datasaurus dozen

Anscombe's datasets indicate their point well, but the arrangement of the points is very regular and looks a little artificial when compared with real data sets. In 2017, twelve sets of more realistic-looking data were published (in Same stats, different graphs: generating datasets with varied appearance and identical statistics through simulated annealing by Justin Matejka and George Fitzmaurice [2]).
These datasets—known as the datasaurus dozen—all had the same mean \(x\)-value, mean \(y\)-value, standard deviation of \(x\)-values, standard deviation of \(y\)-values, and corellation coefficient (to two decimal places) as the datasaurus.
The twelve datasets that make up the datasaurus dozen. For each set of data (to two decimal places): the mean of the \(x\)-values is 54.26; the mean of the \(y\)-values is 47.83; the standard deviation of the \(x\)-values is 16.76; the standard deviation of the \(y\)-values is 26.93; the correlation coefficient between \(x\) and \(y\) is -0.06.
Creating datasets like this is not trivial: if you have a set of values for the statistical properties of a dataset, it is difficult to create a dataset with those properties—especially one that looks like a certain shape or pattern. But if you already have one dataset with the desired properties, you can make other datasets with the same properties by very slightly moving every point in a random direction then checking that the properties are the same—if you do this a few times, you'll eventually get a second dataset with the right properties.
The datasets in the datasaurus dozen were generated using this method: repeatedly adjusting all the points ever so slightly, checking if the properties were the same, then keeping the updated data if it's closer to a target shape.

The databet

Using the same method, I generated the databet: a collection of datasets that look like the letters of the alphabet. I started with this set of 100 points resembling a star:
My starting dataset
After a long time repeatedly moving points by a very small amount, my computer eventually generated these 26 datasets, all of which have the same means, standard deviations, and correlation coefficient:
The databet. For each set of data (to two decimal places): the mean of the \(x\)-values is 0.50; the mean of the \(y\)-values is 0.52; the standard deviation of the \(x\)-values is 0.17; the standard deviation of the \(y\)-values is 0.18; the correlation coefficient between \(x\) and \(y\) is 0.16.

Words

Now that we have the alphabet, we can write words using the databet. You can enter a word or phrase here to do this:

Given two data sets with the same number of points, we can make a new larger dataset by including all the points in both the smaller sets. It is possible to write the mean and standard deviation of the larger dataset in terms of the means and standard deviations of the smaller sets: in each case, the statistic of the larger set depends only on the statistics of the smaller sets and not on the actual data.
Applying this to the databet, we see that the datasets that spell words of a fixed length will all have the same mean and standard deviation. (The same is not true, sadly, for the correlation coefficient.) For example, the datasets shown in the following plot both have the same means and standard deviations:
Datasets that spell "TRUE☆" and "FALSE". For both sets of (to two decimal places): the mean of the \(x\)-values is 2.50; the mean of the \(y\)-values is 0.52; the standard deviation of the \(x\)-values is 1.42; the standard deviation of the \(y\)-values is 0.18.
Hopefully by now you agree with me that Anscombe was right: it's very important to plot data as well as looking at the summary statistics.
 
If you want to play with the databet yourself, all the letters are available on GitHub in JSON format. The GitHub repo also includes fonts that you can download and install so you can use Databet Sans in your next important document.

Graphs in statistical analysis by Francis J Anscombe. American Statistician, 1973.
Same stats, different graphs: generating datasets with varied appearance and identical statistics through simulated annealing by Justin Matejka and George Fitzmaurice. Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, 2017.
×12      ×4      ×4      ×4      ×4
(Click on one of these icons to react to this blog post)

You might also enjoy...

Comments

Comments in green were written by me. Comments in blue were not written by me.
Very cool! Thanks for sharing ????
Jessica
×6   ×10   ×5   ×5   ×1     Reply
 Add a Comment 


I will only use your email address to reply to your comment (if a reply is needed).

Allowed HTML tags: <br> <a> <small> <b> <i> <s> <sup> <sub> <u> <spoiler> <ul> <ol> <li> <logo>
To prove you are not a spam bot, please type "kite" in the box below (case sensitive):

Archive

Show me a random blog post
 2026 

May 2026

World Cup stickers 2026

Apr 2026

A new puzzle every day
Mixing Wordle with other games

Feb 2026

Christmas (2025) is over
 2025 

Dec 2025

Christmas card 2025

Nov 2025

Christmas (2025) is coming!

Sep 2025

The partridge puzzle

Aug 2025

TMiP 2025 puzzle hunt

Jun 2025

A nonogram alphabet

Mar 2025

How to write a crossnumber

Jan 2025

Christmas (2024) is over
Friendly squares
 2024 

Dec 2024

A regular expression Christmas puzzle
Christmas card 2024

Nov 2024

Christmas (2024) is coming!

Feb 2024

Zines, pt. 2

Jan 2024

Christmas (2023) is over
 2023 
▼ show ▼
 2022 
▼ show ▼
 2021 
▼ show ▼
 2020 
▼ show ▼
 2019 
▼ show ▼
 2018 
▼ show ▼
 2017 
▼ show ▼
 2016 
▼ show ▼
 2015 
▼ show ▼
 2014 
▼ show ▼
 2013 
▼ show ▼
 2012 
▼ show ▼

Tags

exponential growth countdown live stream wave scattering asteroids correlation propositional calculus graphs gather town harriss spiral hannah fry statistics news edinburgh gaussian elimination php curvature regular expressions raspberry pi national lottery zines simultaneous equations matrix of minors the aperiodical latex convergence pizza cutting pokémon wordle mean datasaurus dozen partridge puzzle map projections triangles interpolation logic royal institution quadrilaterals pythagoras guest posts inverse matrices ternary inline code weak imposition realhats coventry logo pi golden ratio coins manchester science festival anscombe's quartet data visualisation alphabets graph theory game show probability warwick world cup craft cross stitch golden spiral draughts errors recursion estimation bots nonograms big internet math-off european cup approximation polynomials mathslogicbot dates nine men's morris chebyshev noughts and crosses programming books mathsjam kings manchester phd mathsteroids a gamut of games puzzles arithmetic python youtube stirling numbers light machine learning pascal's triangle probability error bars weather station menace pac-man trigonometry signorini conditions friendly squares geogebra sorting platonic solids dinosaurs logs turtles crosswords squares pokémon tmip geometry oeis folding tube maps reuleaux polygons kenilworth final fantasy captain scarlet folding paper bubble bobble numbers newcastle preconditioning christmas 24 hour maths matrix of cofactors sport gerry anderson finite group fence posts talking maths in public cambridge wool bempp braiding matrix multiplication london underground wordle fractals numerical analysis standard deviation determinants computational complexity bluesky data crossnumber football rugby crossnumbers frobel games electromagnetic field rust thirteen tennis martin gardner video games hyperbolic surfaces matrices matt parker christmas card flexagons speed databet bodmas chalkdust magazine runge's phenomenon palindromes binary radio 4 advent calendar sound rhombicuboctahedron reddit javascript dataset arrangement puzzles royal baby boundary element methods pi approximation day ucl finite element method go london accuracy people maths dragon curves crochet chess plastic ratio misleading statistics hexapawn sobolev spaces fonts tetris hats stickers game of life

Archive

Show me a random blog post
▼ show ▼
© Matthew Scroggs 2012–2026