Introduction to Data Visualisation

This document introduces why we use data visualisation, what makes a data visualisation good, and how to make one.

This introduction sets the scene for the more in-depth studies of data visualisation topics in the rest of the course.

What is a data visualisation?

A data visualisation is an artificial image that uses simple geometric shapes to represent data values. For example, the line plot below shows the electrical activity of a human heart (an ECG). The electrical activity of the heart is represented by the height of the line relative to the y-axis scale and time is represented by the position along the x-axis. But we see no heart, we feel no electric shock, and we hear no ticking clock. The data visualisation is just an abstract representation that allows us to easily see the pattern of increasing and decreasing electrical activity over time.

A key goal of data visualisation is the faithful representation of data values, without any distortion and certainly without any misdirection.

Line plot of the electrical activity of a heart (ECG)

A data visualisation is different from a photograph or an image of a real object. For example, in the ultrasound of a beating heart shown below, we see the moving valves of an actual heart and the image changes in real time (Wikimedia Commons User Fruehaufsteher2; Creative Commons Attribution, Share-Alike). This sort of image is useful for observing and understanding the sequence of physical activity involved in the beating heart, but it struggles to convey information that cannot be seen directly, such as electrical activity in the heart.

Ultrasound of beating heart

A data visualisation is also different from a scientific visualisation, which is an artificial image that attempts to replicate a real object. For example, the image below is a computer-generated 3D simulation of a beating heart (DocJana; Creative Commons Attribution, Share-Alike). Again, there is great value in being able to see in real-time the coordinated movement of material objects. However, this may not be the best way to represent intangible qualities or the relationships between multiple measurements.

A 3D simulation of a beating heart

Finally, a data visualisation is also different from an image that is purely for art or entertainment. The animation below might bring joy, but it conveys little in the way of information (Behance User Lima Designer; Creative Commons Attribution, Non-Commercial, No-Derivatives).

Exercising heart animation

Why use data visualisation?

The table below comes from a spreadsheet that was used for the 2022 Youth Justice Indicators Summary Report. It shows the youth crime rate (per 100,000) for each year from 2010 to 2020, for youths aged 10 to 13. Just using this table, attempt to answer the following questions: What is the overall trend in crime rates over time (increasing or decreasing)? Is the trend the same for all ages? Is the crime rate changing faster for some ages than others? Is the change in crime rate strictly monotonic for any/all ages?

Table of number of offenders (per 100,000) by age over time

Now consider the data visualisation below. We can now answer all of the questions posed above with very little effort and almost instantaneously. That is not to say that all data visualisations are good and all tables are bad. For example, the table is more useful for answering questions like: What was the exact crime rate for 12-year-olds in the 2015/2016 year? It is also possible to generate a data visualisation that is actually very poor at answering the list of questions above. However, this example demonstrates that, given the right data visualisation, it is possible to clearly communicate a lot of information very rapidly.

Line plot of number of offenders (per 100,000) by age over time
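The questions above can also be checked numerically. The base R sketch below uses a small set of made-up rates (not the figures from the Youth Justice Indicators report) to show what an overall trend and a strictly monotonic change look like in terms of year-to-year differences. The object names are illustrative only.

```r
# Made-up rates (per 100,000) for illustration only;
# these are NOT the Youth Justice Indicators figures
rates <- data.frame(
    Year = rep(2010:2014, times = 2),
    Age  = rep(c("12", "13"), each = 5),
    Rate = c(900, 850, 860, 700, 650,      # age 12: one small rise
             1800, 1600, 1400, 1100, 900)  # age 13: falls every year
)

# Year-to-year changes within each age (rows are already in year order)
changes <- lapply(split(rates$Rate, rates$Age), diff)

# Overall change from the first year to the last year, per age
overall <- sapply(changes, sum)

# Strictly decreasing means that every year-to-year change is negative
strictlyDecreasing <- sapply(changes, function(d) all(d < 0))

overall
strictlyDecreasing
```

For these made-up values, both ages decrease overall, but only age 13 decreases in every single year.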

How do we create a data visualisation?

In a data visualisation, we represent data values using simple geometric shapes. This requires mapping data values to some feature of a shape.

For example, we might map year values to the x-locations and number of offenders to the y-locations of data symbols and lines in a line plot, as shown in the previous section. That data visualisation also maps age values to the colour of the data symbols and lines.

    shape                = points + lines
    year                -> x-location
    number of offenders -> y-location
    age                 -> colour

In Week 2 we will learn about the Grammar of Graphics, which allows us to create a data visualisation by describing the mappings between data values and the features of shapes.

A data visualisation also usually includes scales or legends that show the mapping between data values and shape features. For example, the line plot from the previous section has an x-axis to show the mapping for x-locations, a y-axis to show the mapping for y-locations, and a legend to show the mapping for colours.

Finally, a data visualisation also typically includes some text labels, such as the y-axis label on the line plot from the previous section.

We will look at what information should be contained in text labels in Week 4 and we will look at choosing fonts in Week 6.

How do we choose the right data visualisation?

The data visualisation below shows another way that we can display exactly the same crime rate data. In this case, the number of offenders has been mapped to the heights of the bars in a bar plot.

    shape                = bars
    year                -> x-location
    number of offenders -> height
    age                 -> colour

Bar plot of number of offenders (per 100,000) by age over time
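As a rough sketch of how the bar mapping above might be written with the 'ggplot2' package, the code below uses made-up rates for two ages (not the report's figures). In 'ggplot2', the interior colour of a bar is controlled by the fill aesthetic, so age is mapped to fill. Placing the bars for each age side by side with position="dodge" is an assumption about how the plot was drawn.

```r
library(ggplot2)

# Made-up rates (per 100,000) for illustration only
rates <- data.frame(
    Year = rep(2010:2013, times = 2),
    Age  = rep(c("12", "13"), each = 4),
    Rate = c(900, 850, 800, 700, 1800, 1600, 1400, 1100)
)

# shape = bars
# year -> x-location, number of offenders -> height, age -> (fill) colour
barPlot <- ggplot(rates, aes(x = Year, y = Rate, fill = Age)) +
    geom_col(position = "dodge")

barPlot
```

Only the shape and one aesthetic name change compared with a points-and-lines version of the same plot; the data and the other mappings stay the same.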

Which data visualisation is better? The answer, of course, is that it depends. For example, it depends on what question we want the data visualisation to answer. If the question is "What is the highest number of offenders (per 100,000) in a single year?" then the best option is actually the original spreadsheet table because reading exact values from a data visualisation is usually quite hard. We will often require more than one data visualisation in order to be able to answer all questions of interest. However, for any particular question, we need to be able to choose between alternative data visualisations, in which case there are multiple issues to consider.

Data Visualisation Galleries

For many common types of data and for many common questions, there are well-established types of data visualisation: histograms and box plots for visualising the distributions of single continuous variables, scatter plots for exploring the relationship between two continuous variables, bar plots for visualising the relationship between a categorical variable and a continuous variable, and so on.

The readings in Week 1 include a gallery of common data visualisations for common situations.

For example, if we want to view how the number of youth offenders (per 100,000) in 2010 differs across the various Police Regions in New Zealand, a bar plot like the one shown below is a standard option.

Bar plot showing number of offenders (per 100,000) by region (2010)

In some cases, the nature of the data dictates a very specific sort of data visualisation. For example, Police Regions are geographic areas, which can be represented naturally on a map, like the one below. In this map, the number of youth offenders is represented by different shades of blue rather than by the length of a bar.

We will look at the issues with producing maps in Week 9.

Map showing number of offenders (per 100,000) by region (2010)

Data Visualisation Guidelines

The accumulated wisdom from centuries of generating data visualisations also provides some well-established guidelines for how to produce effective data visualisations. For example, when we generate a bar plot of continuous values for a set of categories, and the categories have no inherent order, ordering the categories from the longest bar to the shortest bar allows for much easier comparisons between categories.

The bar plot below demonstrates this idea for comparing the number of youth offenders across different Police Regions in New Zealand.

Bar plot showing number of offenders (per 100,000) by region (2010)
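The ordering guideline can be sketched in base R with reorder(), which sorts the levels of a factor by the values of another variable. The region names and rates below are made up for illustration (they are not the Police Region figures), and the object names are hypothetical.

```r
# Made-up region names and rates (per 100,000), for illustration only
regions <- data.frame(
    Region = c("North", "South", "East", "West"),
    Rate   = c(420, 910, 655, 530)
)

# Order the categories from the largest rate to the smallest;
# negating the rate makes the largest value sort first
regions$Region <- reorder(regions$Region, -regions$Rate)

levels(regions$Region)

# Base graphics draws bars in row order, so sort the rows by the factor
byRate <- regions[order(regions$Region), ]
barplot(byRate$Rate, names.arg = as.character(byRate$Region),
        ylab = "Offenders (per 100,000)")
```

Because the ordering is stored in the factor levels, plotting functions that respect factor order will draw the categories in the same sequence.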

There are also some established guidelines for creating effective data visualisations that take their authority from the experience and eminence of their author. For example, Edward Tufte, who is revered by many within the data visualisation community, advocates a minimalist, every-pixel-is-sacred approach to data visualisation. The idea is to aim for a high data-ink ratio: avoid putting ink on the page that is not related directly to the data.

The data visualisation below demonstrates a version of the line plot from above with a much higher data-ink ratio. This arguably provides less distraction from the most important elements of the image, which represent the actual data.

We will look at more guidelines for data visualisation in Week 5.

Line plot of number of offenders (per 100,000) by age over time with a high data-ink ratio

Visual Perception

Further important guidelines for producing effective data visualisations come from experimental studies of human visual perception. For example, certain shape features are more effective for representing quantitative differences in data values and other features are more effective for representing qualitative changes. The data visualisation below presents the same set of data as the line plot above: the number of offenders over time for four different ages. We have changed the shape that we are using: we have coloured tiles instead of lines. We have also swapped the mappings of data values to features of the shapes: age values are mapped to the y-location of the tiles and the number of offenders is mapped to the colour of the tiles. Although we can still see overall decreasing trends across time for each age group, it is now much harder to perceive detailed changes and much harder to compare between different ages. Using the y-location of a line is a more effective way to represent a continuous value, compared to using the colour of a tile.

    shape                = tiles
    year                -> x-location
    number of offenders -> colour
    age                 -> y-location

Tile plot of number of offenders (per 100,000) by age over time with number of offenders mapped to colour
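The swapped mapping can be sketched with the 'ggplot2' package as follows. The rates are made up for illustration (not the report's figures) and the object names are hypothetical. As with bars, the colour of a tile is controlled by the fill aesthetic.

```r
library(ggplot2)

# Made-up rates (per 100,000) for illustration only
rates <- data.frame(
    Year = rep(2010:2013, times = 2),
    Age  = rep(c("12", "13"), each = 4),
    Rate = c(900, 850, 800, 700, 1800, 1600, 1400, 1100)
)

# shape = tiles
# year -> x-location, age -> y-location, number of offenders -> (fill) colour
tilePlot <- ggplot(rates, aes(x = Year, y = Age, fill = Rate)) +
    geom_tile()

tilePlot
```

The data are unchanged; only the shape and the assignment of variables to aesthetics differ, which is what makes the comparison of perceptual effectiveness a fair one.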

There are also features of our visual system that will mislead us; these are the basis of visual illusions. For example, the data visualisation below is a deliberately staged variation of the one above. The data values mostly decrease in a similar way, but two data values have been set to be the same: the third value on the top row and the third-to-last value on the bottom row. Those two values should appear the same, but, because of a feature of our visual system called colour contrast, the rectangle in the top row appears to be lighter and less colourful than the rectangle in the bottom row. This happens because the neighbouring rectangles in the top row are darker and more colourful (whereas the neighbours in the bottom row are lighter and less colourful). This data visualisation has deliberately exaggerated the effect, but it shows that our visual system will not only have difficulty comparing shades of colour, it will in some cases completely mislead us.

Tile plot of number of offenders (per 100,000) by age over time with data modified to illustrate colour contrast

Nearly 10% of the male population have some form of colour blindness. The image below simulates how the line plot from above would appear to someone with (a severe form of) deuteranomaly, the most common form of colour-vision deficiency. Some of the coloured lines are now almost indistinguishable from each other, so it is important to take colour-vision deficiencies into account as well.

We will look at data visualisation guidelines that are based on human visual perception in Week 5.

Line plot of number of offenders (per 100,000) by age over time with colours simulated for deuteranomaly
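One common precaution, sketched below, is to choose colours from a palette that was designed with colour-vision deficiencies in mind. The example uses the "Okabe-Ito" palette that ships with R (version 4.0.0 or later) in the 'grDevices' package. The choice of this palette and the name ageColours are assumptions for illustration; they are not the colours used in the plots above.

```r
# Take the first four colours of the "Okabe-Ito" palette,
# one for each of the four ages (requires R >= 4.0.0)
ageColours <- palette.colors(4, palette = "Okabe-Ito")

# Label each colour with the age that it will represent
names(ageColours) <- c("10", "11", "12", "13")

ageColours
```

A named vector like this could then be supplied to a plotting function as the set of colours for the age groups.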

Graphic Design

Although the main purposes of a data visualisation are clarity and veracity, as long as we do not impair the message, there is no harm in making an image more pleasant to look at. Some basic ideas from graphic design can not only improve the appearance of a data visualisation, but may also reduce the clutter and make it easier to see the most important features. For example, the data visualisation below uses ideas from graphic design, like ensuring that every element of the image is aligned with some other element, to produce an image that looks less like it was created by a statistician, while still conveying the important message without distortion.

We will look at some basic ideas of graphic design in Week 6.

Line plot of number of offenders (per 100,000) by age over time with some design guidelines applied

Customising Data Visualisations

Despite the admonitions of the great and the good of the data visualisation world, not all embellishments to a data visualisation are frivolous chart junk. For example, though it depends very much on taste and trends, a subtle background gradient can create a more interesting and sophisticated data visualisation without distracting from the main message. Knowing that this can be done and knowing how to do it are useful extra skills for producing data visualisations.

We will look at some lower-level tools for adding special features and controlling the finer details of data visualisations in Weeks 10 and 11.

Line plot of number of offenders (per 100,000) by age over time with subtle gradient background

Animation and Interaction

An animated data visualisation can be more engaging and it allows us to represent more variables at once (by changing a variable over time). For example, the bar plot below shows the number of offenders across Police Regions and how that relationship differs across years.

Animated bar plot of number of offenders (per 100,000) by region over time

Another way to make a data visualisation more engaging is to add interactivity, so that the visualisation changes in response to user actions. This is also another way to represent more variables because it allows us to switch rapidly between several different data visualisations.

For example, the data visualisation below allows us to click a radio button on the left to highlight the data for a different Police Region in the line plot on the right.

We will look at animations and interactive graphics in Weeks 7 and 8.

Data Visualisation Software

In addition to knowing what sort of data visualisation we should make, we must also know how to make a data visualisation.

There are lots of options, but code-based systems like R have a number of things to recommend them.

There are many good things that come from creating a data visualisation by writing code: we can be very precise about what we want to do (expressiveness); it is easy to record what we did and to track the development of what we did (version control); we can easily recreate what we did several months or years ago, share what we did with others, and allow others to reproduce what we did (reproducibility).

There are high-level interfaces that allow us to describe a data visualisation in terms of mappings between data values and the features of shapes (the 'ggplot2' package).
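```
library(ggplot2)

# 'rates' is assumed to be a data frame with columns Year, Rate, and Age
ggplot(rates,
       aes(x=Year, y=Rate, colour=Age)) +
    geom_line(aes(group=Age)) +   # one line per age
    geom_point()                  # a data symbol at each value
```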
There are low-level interfaces that allow us to control detailed and sophisticated features of a data visualisation, such as colours, fonts, and alignment (the 'grDevices' and 'grid' packages).

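```
library(grid)  # linearGradient() requires R 4.1.0 or later
colours <- hcl(seq(0, 270, 90), 60, 60)  # four hues, equal chroma and luminance
grad <- linearGradient(c("grey95", "white"), y1=.5, y2=.5)  # horizontal gradient
```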
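As a minimal sketch of how these low-level pieces might be used, the code below draws a background rectangle filled with the gradient and one small rectangle for each colour. It assumes R 4.1.0 or later and a graphics device that supports gradient fills (such as the PDF device); the layout values are arbitrary choices for illustration.

```r
library(grid)

# Four hues, 90 degrees apart, with equal chroma and luminance
colours <- hcl(seq(0, 270, 90), 60, 60)

# A horizontal gradient from light grey (left) to white (right)
grad <- linearGradient(c("grey95", "white"), y1 = .5, y2 = .5)

grid.newpage()

# Background rectangle that fills the page with the gradient
grid.rect(gp = gpar(fill = grad, col = NA))

# One small rectangle per colour, spaced evenly across the page
grid.rect(x = 1:4/5, width = .1, height = .1,
          gp = gpar(fill = colours, col = NA))
```

On a device without gradient support, the background fill is simply not drawn, so the four coloured rectangles would appear on a plain page.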

We will introduce 'ggplot2' and 'grid' in Weeks 2 and 3 and we will continue to use them throughout the course.

This work is licensed under a Creative Commons Attribution 4.0 International License.