Effective Data Visualisations


The purpose of this topic is to learn some guidelines for making “good” data visualisations … and to start exploring what “good” means :)

We want to be able to critically appraise a data visualisation in terms of these guidelines and we want to be able to improve a data visualisation using these guidelines.

The guidelines in this topic tend not to be formal, theory-based laws; they are mostly rules of thumb based on common sense and the experience and authority of the person proposing them. These guidelines are widely accepted, but they should not be taken as inviolable.

We will also look at how to apply these guidelines when we are using the ‘ggplot2’ package. The ‘ggplot2’ system has reasonable defaults, but we still need to be able to tweak settings and add extra output. This idea of customising a ‘ggplot2’ plot will be important throughout the course. As we saw in the last topic, it may also be necessary, or just more convenient, to add drawing with ‘grid’.

Introduction

When we draw a plot with ‘ggplot2’, by default a lot of decisions are made for us.

library(ggplot2)

For example, in the plot below, ‘ggplot2’ has automatically decided how to label the axes, and where to place the tick marks.

ggplot(mpg) + 
    geom_point(aes(x = displ, y = cty))

These automatic choices save us a lot of time, but they will not always be sufficient or even appropriate. For example, the labelling of a plot is important to make clear which data values are being presented and what the plot is about.

The default labels (above) are not very explanatory on their own, so the code below produces an improved plot with better labelling.

ggplot(mpg) + 
    geom_point(aes(x = displ, y = cty)) +
    labs(x = "Engine Displacement (litres)",
         y = "Miles Per Gallon (city driving)",
         title = "Fuel Economy for Popular Models (1999-2008)")
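
The other default mentioned above, the placement of tick marks, can be overridden in a similar way. The following is only a sketch: the tick positions in 'xBreaks' and 'yBreaks' are assumed values chosen for illustration, not recommendations from the Readings.

```r
library(ggplot2)

# Assumed tick positions, chosen only for illustration
xBreaks <- 2:7
yBreaks <- seq(10, 35, by = 5)

p <- ggplot(mpg) +
    geom_point(aes(x = displ, y = cty)) +
    # The 'breaks' argument of a scale controls where the ticks are drawn
    scale_x_continuous(breaks = xBreaks) +
    scale_y_continuous(breaks = yBreaks) +
    labs(x = "Engine Displacement (litres)",
         y = "Miles Per Gallon (city driving)",
         title = "Fuel Economy for Popular Models (1999-2008)")
p
```

Setting the breaks on the scale, rather than drawing extra annotations, keeps the ticks and their labels consistent with each other.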

Another choice that ‘ggplot2’ has made for us is the data symbol, which by default is an opaque circle. The code below makes the data symbols semitransparent instead. Some circles are now darker than others because there is overplotting: several data points have exactly the same values of engine displacement and miles per gallon. The default plot hid this, so it gave a misleading impression of where most of the data lie.

ggplot(mpg) + 
    geom_point(aes(x = displ, y = cty), alpha = .2) +
    labs(x = "Engine Displacement (litres)",
         y = "Miles Per Gallon (city driving)",
         title = "Fuel Economy for Popular Models (1999-2008)")
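
Semitransparency is only one response to overplotting. As a hedged sketch, the code below first counts how many rows share each combination of values, and then shows two alternatives: jittering with geom_jitter() and sizing symbols by count with geom_count(). The names 'pairCounts', 'overplotted', 'pJitter', and 'pCount', and the jitter amounts, are assumptions made for this illustration.

```r
library(ggplot2)

# Count the rows that share each (displ, cty) combination
pairCounts <- aggregate(list(n = rep(1, nrow(mpg))),
                        by = list(displ = mpg$displ, cty = mpg$cty),
                        FUN = sum)
# Combinations that occur more than once are the overplotted locations
overplotted <- pairCounts[pairCounts$n > 1, ]
nrow(overplotted)

# Jittering: add a small random offset to each point
# (the width and height here are assumed values, not tuned ones)
pJitter <- ggplot(mpg) +
    geom_jitter(aes(x = displ, y = cty),
                width = 0.05, height = 0.3, alpha = .4)
pJitter

# Alternative: make the symbol size reflect the number of coincident points
pCount <- ggplot(mpg) +
    geom_count(aes(x = displ, y = cty), alpha = .4)
pCount
```

Jittering changes the plotted positions slightly, so the offsets should stay small relative to the spacing between distinct data values.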

The Readings from Wilke describe several general issues, like overplotting and inadequate labelling, and provide guidelines for how to address these common problems.

For each set of guidelines, we will need to be able to implement the recommendations. The References include chapters from the ‘ggplot2’ book that cover ‘ggplot2’ annotations and ‘ggplot2’ themes, which allow us to modify the default decisions that ‘ggplot2’ makes for us.

For example, the concept of a data-ink ratio (discussed by Wilke) motivates less clutter in a plot (less unnecessary ink on the page), and the code below uses a ‘ggplot2’ theme to produce a less cluttered plot.

ggplot(mpg) + 
    geom_point(aes(x = displ, y = cty), alpha = .2) +
    labs(x = "Engine Displacement (litres)",
         y = "Miles Per Gallon (city driving)",
         title = "Fuel Economy for Popular Models (1999-2008)") + 
    theme_classic()

This theme is arguably a little too plain (for example, the title does not stand out as much), but the following code demonstrates that we can tweak individual elements of the theme as well. In this case, we make the plot title bold and larger.

ggplot(mpg) + 
    geom_point(aes(x = displ, y = cty), alpha = .2) +
    labs(x = "Engine Displacement (litres)",
         y = "Miles Per Gallon (city driving)",
         title = "Fuel Economy for Popular Models (1999-2008)") + 
    theme_classic() + 
    theme(plot.title = element_text(face = "bold", size = 18))

The plain theme also removes all of the grid lines, but there are situations where lines in the background can provide useful context. The code below adds customised horizontal lines to represent US fuel efficiency standards (Corporate Average Fuel Economy, or CAFE) for selected years.

CAFE <- data.frame(year = c(1980, 1990, 2020),  
                   limit = c(20.0, 27.5, 42.4))
ggplot(mpg) + 
    geom_hline(data = CAFE, aes(yintercept = limit), 
               linetype = "dashed", colour = "grey") +
    geom_point(aes(x = displ, y = cty), alpha = .2) +
    labs(x = "Engine Displacement (litres)",
         y = "Miles Per Gallon (city driving)",
         title = "Fuel Economy for Popular Models (1999-2008)") + 
    theme_classic() + 
    theme(plot.title = element_text(face = "bold", size = 18))
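
If a regular background grid is wanted instead of custom reference lines, individual grid elements can be restored through theme(). This is a minimal sketch; the choice of horizontal major lines only, and the colour "grey90", are assumptions made for illustration.

```r
library(ggplot2)

p <- ggplot(mpg) +
    geom_point(aes(x = displ, y = cty), alpha = .2) +
    labs(x = "Engine Displacement (litres)",
         y = "Miles Per Gallon (city driving)",
         title = "Fuel Economy for Popular Models (1999-2008)") +
    theme_classic() +
    theme(plot.title = element_text(face = "bold", size = 18),
          # Restore only the horizontal major grid lines, in a light grey
          panel.grid.major.y = element_line(colour = "grey90"))
p
```

The call to theme() has to come after theme_classic(), because a complete theme replaces any earlier element settings.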

The following code combines ‘grid’ and ‘ggplot2’ (via ‘gggrid’) to add some carefully positioned labels to the horizontal lines. Much of this topic is about being able to customise the details of a ‘ggplot2’ plot like this.

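library(grid)    # textGrob(), unit(), and gpar() come from 'grid'
library(gggrid)
# Draw one label per row of 'data', just above the right-hand end of each line;
# 'coords$y' holds the line heights converted to panel ("npc") units
lineLabel <- function(data, coords) {
    textGrob(paste("CAFE", data$label), 
             x = unit(1, "npc") - unit(1, "mm"),
             y = unit(coords$y, "npc") + unit(1, "mm"),
             just = c("right", "bottom"),
             gp = gpar(col = "grey"))
}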
ggplot(mpg) + 
    geom_hline(data = CAFE, aes(yintercept = limit), 
               linetype = "dashed", colour = "grey") +
    geom_point(aes(x = displ, y = cty), alpha = .2) +
    labs(x = "Engine Displacement (litres)",
         y = "Miles Per Gallon (city driving)",
         title = "Fuel Economy for Popular Models (1999-2008)") + 
    theme_classic() + 
    theme(plot.title = element_text(face = "bold", size = 18)) +
    grid_panel(data = CAFE, mapping = aes(y = limit, label = year),
               grob = lineLabel)         
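
For comparison, a similar result can be approximated with ‘ggplot2’ alone. The sketch below uses geom_text() with x = Inf to place each label at the right-hand edge of the panel; the 'hjust' and 'vjust' values are assumed nudges chosen by eye. Unlike the ‘grid’ version, these offsets are relative to the size of the text rather than exact distances in millimetres.

```r
library(ggplot2)

CAFE <- data.frame(year = c(1980, 1990, 2020),
                   limit = c(20.0, 27.5, 42.4))

p <- ggplot(mpg) +
    geom_hline(data = CAFE, aes(yintercept = limit),
               linetype = "dashed", colour = "grey") +
    geom_point(aes(x = displ, y = cty), alpha = .2) +
    # x = Inf puts the text at the right edge of the panel;
    # hjust and vjust then pull it inside the panel and above the line
    geom_text(data = CAFE,
              aes(x = Inf, y = limit, label = paste("CAFE", year)),
              hjust = 1.1, vjust = -0.5, colour = "grey") +
    labs(x = "Engine Displacement (litres)",
         y = "Miles Per Gallon (city driving)",
         title = "Fuel Economy for Popular Models (1999-2008)") +
    theme_classic() +
    theme(plot.title = element_text(face = "bold", size = 18))
p
```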

Your job is to read and understand the guidelines from Wilke and dip into the References as needed to implement the guidelines in the data visualisation tasks set out in the lab questions.

Important concepts and terminology

Make sure that you understand the meanings of the following terms:

  • The principle of proportional ink
  • Overplotting
  • Jittering
  • Direct labelling
  • Redundant coding
  • Small multiples
  • The data-ink ratio
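
Two of these terms can be illustrated with the same data. The following is a rough sketch rather than a model answer: it draws small multiples with facet_wrap() (one panel per vehicle class) and uses redundant coding by mapping the drive train to both colour and shape. The choice of variables and the legend title are assumptions made for illustration.

```r
library(ggplot2)

p <- ggplot(mpg) +
    # Redundant coding: 'drv' is shown by both colour and shape
    geom_point(aes(x = displ, y = cty, colour = drv, shape = drv),
               alpha = .5) +
    # Small multiples: one panel per vehicle class, with shared axes
    facet_wrap(~ class) +
    labs(x = "Engine Displacement (litres)",
         y = "Miles Per Gallon (city driving)",
         colour = "Drive train", shape = "Drive train",
         title = "Fuel Economy by Vehicle Class (1999-2008)")
p
```

Giving the colour and shape legends the same title lets ‘ggplot2’ merge them into a single legend.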

Readings

  • Chapters 17, 18, 19, 20, 21, 22, and 23 from “Fundamentals of Data Visualization” by Claus O. Wilke

    We will spend more time on colour in later topics; we will not pay much attention to the layout of tables; we will only consider labelling that is incorporated within a data visualisation (not a separate caption); and, as mentioned previously, we will not always obey all of these guidelines!

References

Bibliography


Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.