Effective Data Visualisations

Introduction
Important concepts and terminology
Readings
References
Bibliography

The purpose of this topic is to learn some guidelines for making “good” data visualisations … and to start exploring what “good” means :)

We want to be able to critically appraise a data visualisation in terms of these guidelines and we want to be able to improve a data visualisation using these guidelines.

The guidelines in this topic tend not to be formal, theory-based laws; they are mostly rules of thumb based on common sense and the experience and authority of the person proposing them. These guidelines are widely accepted, but they should not be taken as inviolable.

We will also look at how to apply these guidelines for effective data visualisation when we are using the ‘ggplot2’ package. The ‘ggplot2’ system has reasonable defaults, but we still need to be able to tweak settings and add extra output. This idea of customising a ‘ggplot2’ plot will be important throughout the course. As we saw in the last topic, it may also be necessary, or just more convenient, to add drawing using ‘grid’.

Introduction

When we draw a plot with ‘ggplot2’, by default a lot of decisions are made for us.

library(ggplot2)

For example, in the plot below, ‘ggplot2’ has automatically decided how to label the axes, and where to place the tick marks.

ggplot(mpg) + 
    geom_point(aes(x = displ, y = cty))

These automatic choices save us a lot of time, but they will not always be sufficient or even appropriate. For example, the labelling of plots is important to make clear what data values are being presented; what a plot is about.

The default labels (above) are not very explanatory on their own, so the code below produces an improved plot with better labelling.

ggplot(mpg) + 
    geom_point(aes(x = displ, y = cty)) +
    labs(x = "Engine Displacement (litres)",
         y = "Miles Per Gallon (city driving)",
         title = "Fuel Economy for Popular Models (1999-2008)")
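
The tick marks can be adjusted in a similar way. The following is a minimal sketch, assuming that we want a tick at every whole litre and at every 5 miles per gallon; these break values are our own choice for illustration, not something the Readings prescribe. It uses the scale_x_continuous() and scale_y_continuous() functions from ‘ggplot2’.

```r
library(ggplot2)

# Assumed break positions: whole litres on the x-axis, multiples of 5 mpg on the y-axis
p_ticks <- ggplot(mpg) + 
    geom_point(aes(x = displ, y = cty)) +
    scale_x_continuous(breaks = 2:7) +
    scale_y_continuous(breaks = seq(10, 35, by = 5)) +
    labs(x = "Engine Displacement (litres)",
         y = "Miles Per Gallon (city driving)",
         title = "Fuel Economy for Popular Models (1999-2008)")
p_ticks
```

Whether more tick marks actually help depends on the plot; the point is only that the automatic choice can be overridden.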

Another choice that ‘ggplot2’ has made for us is the data symbol, which by default is an opaque circle. Look what happens (below) when we make the data symbols semitransparent. Some circles are darker than others because there is overplotting; there are multiple data points with exactly the same values of engine displacement and miles per gallon. The default plot was actually misleading, because a single opaque circle could be hiding several data points!

ggplot(mpg) + 
    geom_point(aes(x = displ, y = cty), alpha = .2) +
    labs(x = "Engine Displacement (litres)",
         y = "Miles Per Gallon (city driving)",
         title = "Fuel Economy for Popular Models (1999-2008)")
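
We can also check for overplotting numerically, and there are alternatives to semitransparency. The following is a rough sketch, not the only way to do this: it counts the rows of mpg that repeat an earlier (displ, cty) combination, then draws a jittered version of the plot with geom_jitter() and a version with geom_count(), where symbol size shows the number of coincident points. The jitter amounts and the random seed are arbitrary values chosen for illustration.

```r
library(ggplot2)

# Number of rows that repeat an earlier (displ, cty) combination
n_dup <- sum(duplicated(mpg[, c("displ", "cty")]))
n_dup

# Jittering: add a small amount of random noise to each position
# (width and height are assumed values; tune them to the data)
set.seed(1)
p_jitter <- ggplot(mpg) + 
    geom_jitter(aes(x = displ, y = cty), 
                width = .05, height = .3, alpha = .5)
p_jitter

# Counting: one symbol per location, sized by the number of points there
p_count <- ggplot(mpg) + 
    geom_count(aes(x = displ, y = cty), alpha = .5)
p_count
```

Jittering changes the plotted positions slightly, so it trades a little accuracy for visibility; geom_count() keeps the positions exact but uses symbol size to carry the extra information.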

The Readings from Wilke describe several general issues, like overplotting and adequate labelling, and provide guidelines for how to address these common problems.

For each set of guidelines, we will need to be able to implement the recommendations. The References include chapters from the ‘ggplot2’ book that cover ‘ggplot2’ annotations and ‘ggplot2’ themes, which allow us to modify the default decisions that ‘ggplot2’ makes for us.

For example, the idea of a data-ink ratio (see Wilke) argues for less clutter in a plot (less unnecessary ink on the page) and the code below uses a ‘ggplot2’ theme to produce a less cluttered plot.

ggplot(mpg) + 
    geom_point(aes(x = displ, y = cty), alpha = .2) +
    labs(x = "Engine Displacement (litres)",
         y = "Miles Per Gallon (city driving)",
         title = "Fuel Economy for Popular Models (1999-2008)") + 
    theme_classic()

This theme is arguably a little too plain (for example, the title does not stand out as much), but the following code demonstrates that we can tweak individual elements of the theme as well. In this case, we make the plot title bold and larger.

ggplot(mpg) + 
    geom_point(aes(x = displ, y = cty), alpha = .2) +
    labs(x = "Engine Displacement (litres)",
         y = "Miles Per Gallon (city driving)",
         title = "Fuel Economy for Popular Models (1999-2008)") + 
    theme_classic() + 
    theme(plot.title = element_text(face = "bold", size = 18))

The plain theme also removed ALL grid lines, but there are situations where a background grid can provide useful context. The code below adds customised horizontal lines to represent historical US fuel efficiency standards.

CAFE <- data.frame(year = c(1980, 1990, 2020),  
                   limit = c(20.0, 27.5, 42.4))

ggplot(mpg) + 
    geom_hline(data = CAFE, aes(yintercept = limit), 
               linetype = "dashed", colour = "grey") +
    geom_point(aes(x = displ, y = cty), alpha = .2) +
    labs(x = "Engine Displacement (litres)",
         y = "Miles Per Gallon (city driving)",
         title = "Fuel Economy for Popular Models (1999-2008)") + 
    theme_classic() + 
    theme(plot.title = element_text(face = "bold", size = 18))
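
If we just want a conventional background grid back, rather than custom reference lines, we can restore individual grid elements on top of the plain theme. The following is a minimal sketch, assuming that faint horizontal grid lines are enough; the colour "grey90" is an arbitrary choice for illustration.

```r
library(ggplot2)

p_grid <- ggplot(mpg) + 
    geom_point(aes(x = displ, y = cty), alpha = .2) +
    labs(x = "Engine Displacement (litres)",
         y = "Miles Per Gallon (city driving)",
         title = "Fuel Economy for Popular Models (1999-2008)") + 
    theme_classic() + 
    theme(plot.title = element_text(face = "bold", size = 18),
          # theme_classic() blanks the grid; turn the major horizontal lines back on
          panel.grid.major.y = element_line(colour = "grey90"))
p_grid
```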

The following code combines ‘grid’ and ‘ggplot2’ (via ‘gggrid’) to add some carefully positioned labels to the horizontal lines. Much of this topic is about being able to customise the details of a ‘ggplot2’ plot like this.

library(grid)    # textGrob(), unit(), and gpar() come from 'grid'
library(gggrid)

lineLabel <- function(data, coords) {
    textGrob(paste("CAFE", data$label), 
             x = unit(1, "npc") - unit(1, "mm"),
             y = unit(coords$y, "npc") + unit(1, "mm"),
             just = c("right", "bottom"),
             gp = gpar(col = "grey"))
}

ggplot(mpg) + 
    geom_hline(data = CAFE, aes(yintercept = limit), 
               linetype = "dashed", colour = "grey") +
    geom_point(aes(x = displ, y = cty), alpha = .2) +
    labs(x = "Engine Displacement (litres)",
         y = "Miles Per Gallon (city driving)",
         title = "Fuel Economy for Popular Models (1999-2008)") + 
    theme_classic() + 
    theme(plot.title = element_text(face = "bold", size = 18)) +
    grid_panel(data = CAFE, mapping = aes(y = limit, label = year),
               grob = lineLabel)
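
For comparison, similar labels can be approximated without ‘grid’. The following is a rough sketch using geom_text() with x = Inf to anchor each label at the right-hand edge of the panel. The hjust and vjust values are guesses that are relative to the size of the text, so this gives less precise control than the millimetre offsets in the ‘gggrid’ version.

```r
library(ggplot2)

# Same values as the CAFE data frame used in the examples in this topic
CAFE <- data.frame(year = c(1980, 1990, 2020),  
                   limit = c(20.0, 27.5, 42.4))

p_labels <- ggplot(mpg) + 
    geom_hline(data = CAFE, aes(yintercept = limit), 
               linetype = "dashed", colour = "grey") +
    geom_point(aes(x = displ, y = cty), alpha = .2) +
    # x = Inf places the text at the right edge of the panel;
    # hjust and vjust (assumed values) nudge it inside and above the line
    geom_text(data = CAFE, 
              aes(x = Inf, y = limit, label = paste("CAFE", year)),
              hjust = 1.1, vjust = -.5, colour = "grey") +
    labs(x = "Engine Displacement (litres)",
         y = "Miles Per Gallon (city driving)",
         title = "Fuel Economy for Popular Models (1999-2008)") + 
    theme_classic() + 
    theme(plot.title = element_text(face = "bold", size = 18))
p_labels
```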

Your job is to read and understand the guidelines from Wilke and dip into the References as needed to implement the guidelines in the data visualisation tasks set out in the lab questions.

Important concepts and terminology

Make sure that you understand the meanings of the following terms:

The principle of proportional ink
Overplotting
Jittering
Direct labelling
Redundant coding
Small multiples
The data-ink ratio
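
To make two of these terms concrete, here is a small sketch using the mpg data. The choice of the drv variable (type of drive train) is our own, purely for illustration. The first plot uses redundant coding, with colour and shape both mapped to the same variable; the second uses small multiples, with one panel per value of that variable.

```r
library(ggplot2)

# Redundant coding: drv is shown by both colour and shape
p_redundant <- ggplot(mpg) + 
    geom_point(aes(x = displ, y = cty, colour = drv, shape = drv), 
               alpha = .5)
p_redundant

# Small multiples: one panel per value of drv, sharing the same axes
p_multiples <- ggplot(mpg) + 
    geom_point(aes(x = displ, y = cty), alpha = .2) +
    facet_wrap(~ drv)
p_multiples
```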

Readings

Chapters 17, 18, 19, 20, 21, 22, and 23 from “Fundamentals of Data Visualization” by Claus O. Wilke

Some caveats: we will spend more time on colour in later topics; we will not pay much attention to the layout of tables; we will only consider labelling that is incorporated within a data visualisation (not a separate caption); and, as mentioned previously, we will not always obey all of these guidelines!

References

Chapter 28: Graphics for communication from “R for Data Science” by Hadley Wickham and Garrett Grolemund.
Useful reference for labelling, annotations, customising scales, and themes.

Section 5.5 of “ggplot2: Elegant Graphics for Data Analysis” by Hadley Wickham.
Dealing with overplotting in ‘ggplot2’.

Chapter 8 of “ggplot2: Elegant Graphics for Data Analysis” (3rd Ed) by Hadley Wickham.
Annotations (labels) in ‘ggplot2’.

Chapter 17 of “ggplot2: Elegant Graphics for Data Analysis” (3rd Ed) by Hadley Wickham.
Themes in ‘ggplot2’.

‘ggplot2’ function documentation, in particular the help page for the theme() function.

Bibliography

“Creating More Effective Graphs” by Naomi Robbins.
The University of Auckland library has one physical copy.
“Fundamental Statistical Concepts in Presenting Data” by Rafe Donahue. This used to be online, but the Vanderbilt University link is now dead.
Edward Tufte’s books: “The Visual Display of Quantitative Information”, “Visual Explanations”, “Envisioning Information”, and “Beautiful Evidence” (all physical copies in the University of Auckland library).

This work is licensed under a Creative Commons Attribution 4.0 International License.