The Data Set

The data set is a CSV file, nzpolice-proceedings.csv, which was derived from “Dataset 5” of Proceedings (offender demographics) on the policedata.nz web site.

We can read the data into an R data frame with read.csv().

crime <- read.csv("nzpolice-proceedings.csv")
crime$Month <- as.Date(crime$Date)
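
The conversion with as.Date() matters because read.csv() does not produce Date values by itself. The following is a minimal sketch of that step using two invented rows in place of the real file (the values are illustrative only and are not taken from nzpolice-proceedings.csv).

```r
# Toy stand-in for the CSV contents; these rows are invented for illustration.
toy <- read.csv(text = "Age.Lower,Date
15,2015-12-01
20,2016-01-01")

# The Date column arrives as text (a character vector in R 4.0.0 or later).
class(toy$Date)

# as.Date() parses ISO-style "YYYY-MM-DD" strings without a format argument.
toy$Month <- as.Date(toy$Date)
class(toy$Month)
range(toy$Month)
```

Once the column is a Date, functions such as range() and the date scales in 'ggplot2' treat it as a point in time rather than as a label.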

We will focus on youth crime (aged 15-19 inclusive).

youth <- subset(crime, Age.Lower == 15)
head(youth)
   Age.Lower Police.District                                 ANZSOC.Division    SEX       Date
1         15          Tasman                   Acts Intended to Cause Injury Female 2015-12-01
5         15   Auckland City                   Acts Intended to Cause Injury Female 2015-12-01
6         15   Auckland City                   Acts Intended to Cause Injury Female 2015-12-01
22        15   Auckland City Dangerous or Negligent Acts Endangering Persons Female 2015-12-01
23        15   Auckland City Dangerous or Negligent Acts Endangering Persons Female 2015-12-01
41        15   Auckland City                           Illicit Drug Offences Female 2015-12-01
        Month
1  2015-12-01
5  2015-12-01
6  2015-12-01
22 2015-12-01
23 2015-12-01
41 2015-12-01

The following code reorders the levels of the ANZSOC.Division factor according to the overall counts for each type of crime. It also generates typeLabels, which are line-wrapped versions of the reordered levels (divLabels holds the same line-wrapped labels for the levels in their default alphabetical order).

divLevels <- levels(factor(youth$ANZSOC.Division))
divLabels <- unlist(lapply(strwrap(divLevels, width=30, simplify=FALSE),
                           function(x) {
                               if (length(x) < 3)
                                   x <- c(x, rep(" ", 3 - length(x)))
                               paste(x, collapse="\n")
                           }))
types <- table(youth$ANZSOC.Division)
typeLevels <- names(types)[order(types, decreasing=TRUE)]
typeLabels <- unlist(lapply(strwrap(typeLevels, width=30, simplify=FALSE),
                            function(x) {
                                if (length(x) < 3)
                                    x <- c(x, rep(" ", 3 - length(x)))
                                paste(x, collapse="\n")
                            }))
youth$Type <- factor(youth$ANZSOC.Division, levels=typeLevels)
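
The same wrap-and-pad step appears for each set of labels, so it could be factored into a helper. The following is only a sketch: wrapLabels() is a hypothetical name that is not part of the original lab code, and its width and lines arguments simply mirror the constants used above.

```r
# Hypothetical helper (not in the original lab code): wrap each label with
# strwrap() and pad with blank lines so that every label spans `lines` lines.
wrapLabels <- function(labels, width = 30, lines = 3) {
    wrapped <- strwrap(labels, width = width, simplify = FALSE)
    vapply(wrapped, function(x) {
        if (length(x) < lines)
            x <- c(x, rep(" ", lines - length(x)))
        paste(x, collapse = "\n")
    }, character(1))
}

# Two of the ANZSOC.Division values shown earlier, one short and one long.
wrapped <- wrapLabels(c("Theft and Related Offences",
                        "Dangerous or Negligent Acts Endangering Persons"))
cat(wrapped, sep = "\n---\n")
```

With this helper, typeLabels could be produced by wrapLabels(typeLevels). Padding every label to the same number of lines keeps the legend entries evenly spaced.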

The following code generates a table of the number of youth incidents, broken down by type of crime.

youthType <- as.data.frame(table(youth$Type))
head(youthType)
                                             Var1  Freq
1                      Theft and Related Offences 23280
2                   Acts Intended to Cause Injury 17456
3         Traffic and Vehicle Regulatory Offences 15737
4                           Public Order Offences 12358
5 Dangerous or Negligent Acts Endangering Persons 11144
6     Property Damage and Environmental Pollution  9534

The following code reorders the levels of the ANZSOC.Division factor according to the counts in the first month for each type of crime. It also generates monthTypeLabels, which are line-wrapped versions of the reordered levels.

monthTypes <- table(youth$ANZSOC.Division, youth$Month)[,1]
monthTypeLevels <- names(monthTypes)[order(monthTypes, decreasing=TRUE)]
monthTypeLabels <- unlist(lapply(strwrap(monthTypeLevels, width=30, 
                                         simplify=FALSE),
                          function(x) {
                              if (length(x) < 3)
                                  x <- c(x, rep(" ", 3 - length(x)))
                              paste(x, collapse="\n")
                          }))
youth$monthType <- factor(youth$ANZSOC.Division, levels=monthTypeLevels)
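
Ordering by the first month can give a different result from ordering by the overall counts. The following sketch shows this on invented toy data (three made-up types, "A", "B" and "C", over two months). It takes the names from the first-month counts themselves, whereas the lab code relies on both tables listing the types in the same order.

```r
# Invented toy data: three crime types observed over two months.
toy <- data.frame(
    Division = c("A", "A", "A", "B", "C", "C", "B", "B", "B"),
    Month = as.Date(c(rep("2014-07-01", 6), rep("2014-08-01", 3)))
)

# Two-way table of type by month; the columns are sorted by date,
# so column 1 holds the counts for the earliest month.
counts <- table(toy$Division, toy$Month)
firstMonth <- counts[, 1]

# Levels ordered by first-month count versus by overall count.
firstLevels <- names(firstMonth)[order(firstMonth, decreasing = TRUE)]
overall <- table(toy$Division)
overallLevels <- names(overall)[order(overall, decreasing = TRUE)]
firstLevels
overallLevels

toy$Type <- factor(toy$Division, levels = firstLevels)
levels(toy$Type)
```

In this toy data "B" is the most common type overall but the least common in the first month, so the two orderings disagree.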

The following code generates a table of counts for the number of incidents per month, broken down by type of crime.

youthMonthType <- as.data.frame(table(youth$Month, youth$monthType))
youthMonthType$Month <- as.Date(youthMonthType$Var1)
head(youthMonthType)
        Var1                       Var2 Freq      Month
1 2014-07-01 Theft and Related Offences  393 2014-07-01
2 2014-08-01 Theft and Related Offences  369 2014-08-01
3 2014-09-01 Theft and Related Offences  343 2014-09-01
4 2014-10-01 Theft and Related Offences  367 2014-10-01
5 2014-11-01 Theft and Related Offences  369 2014-11-01
6 2014-12-01 Theft and Related Offences  348 2014-12-01
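
The Var1, Var2 and Freq columns come from as.data.frame() applied to a two-way table. The following sketch reproduces that shape on three invented rows (the type names "Theft" and "Injury" are shorthand for illustration only).

```r
# Invented toy data: three incidents across two months and two types.
toy <- data.frame(
    Month = c("2014-07-01", "2014-07-01", "2014-08-01"),
    Type  = c("Theft", "Injury", "Theft")
)

# One row per month-by-type combination, including combinations with no incidents.
long <- as.data.frame(table(toy$Month, toy$Type))
long

# Var1 is a factor, so it is converted back to a Date for plotting.
long$Month <- as.Date(long$Var1)
str(long)
```

Because every combination gets a row, a month with no incidents of a given type appears with a Freq of zero rather than being missing, which keeps each line in a time series plot continuous.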

Questions of Interest

For youth crimes, we are interested in comparisons between different types of crime.

Amongst the three most common crimes:

We will also make a specific comparison between “Public Order Offences” and “Dangerous or Negligent Acts Endangering Persons”:

Data Visualisations

library(ggplot2)
  1. The following code produces a pie chart of the overall number of incidents for each type of crime.

    We can fairly easily see that Theft is the most common crime, though it helps a lot that the crimes have already been placed in order. We pretty much have to rely on the order of the legend to tell that Homicide is the least common type of crime. It is very hard to judge how much bigger the number 1 crime is compared to numbers 2 and 3, though it appears that numbers 2 and 3 are quite similar, while number 1 is significantly larger than the other two. Comparing Public Order with Dangerous or Negligent Acts is much harder, partly because it is not completely straightforward to match the colours from the legend with the colours of the wedges and partly because it is very difficult to perceive the relative sizes of the wedges. There is no time component in this data visualisation so we cannot say anything about what is happening over time.

    The number of incidents is being represented by the tilt or angle and/or area of the wedges in the pie chart and the type of crime is being represented by colour (hue). Colour is an appropriate channel for the type of crime because that is a categorical variable, but angle/area are not good channels for numeric or continuous values like counts. This is what makes it difficult to answer the questions of interest using this data visualisation.

    There are two colour contrast effects in this data visualisation. First, the colour within each wedge is affected by the colour of neighbouring wedges; the blue-green wedges appear greener at the boundary with a bluer neighbour and bluer at the boundary with a greener neighbour. Second, the colours in the legend have a white border, so they appear different again, which does not help with matching wedge colours to legend colours.

    ggplot(youthType) +
        geom_col(aes(x=1, y=Freq, fill=Var1), position="stack") + 
        coord_polar(theta="y") +
        labs(title="Overall Number of Incidents") +
        scale_fill_hue(labels=typeLabels, 
                       name="ANZSOC.Division") +
        theme(legend.text=element_text(size=6),
              axis.title.x=element_blank(),
              axis.title.y=element_blank(),
              axis.ticks.y=element_blank(),
              axis.text.y=element_blank())

    The following code produces a small variation with white borders on the wedges. The effect is quite dramatic and it resolves both contrast effect issues.

    ggplot(youthType) +
        geom_col(aes(x=1, y=Freq, fill=Var1), position="stack",
                 colour="white") + 
        coord_polar(theta="y") +
        labs(title="Overall Number of Incidents") +
        scale_fill_hue(labels=typeLabels, 
                       name="ANZSOC.Division") +
        theme(legend.text=element_text(size=6),
              axis.title.x=element_blank(),
              axis.title.y=element_blank(),
              axis.ticks.y=element_blank(),
              axis.text.y=element_blank())

  2. The following code produces a different visualisation of the same data as in the previous question. It uses position on a common scale to represent the number of incidents and position in space to represent the type of crime. These are the best possible channels to use for continuous and categorical data, respectively.

    This data visualisation makes it much easier to see which types of crime are the most common and the least common. We can also see that the difference between number 1 and number 2 is larger than the difference between number 2 and number 3. It is also very clear that Public Order crimes are more common than Dangerous or Negligent Acts.

    ggplot(youthType) +
        geom_point(aes(x=Var1, y=Freq)) + 
        coord_flip() +
        labs(title="Overall Number of Incidents") +
        theme(axis.title.x=element_blank(),
              axis.title.y=element_blank())
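
    The comparisons that the dot plot makes visible can also be checked numerically. The following sketch copies the three largest counts from the head(youthType) output shown earlier; the short names are shorthand introduced here, not values from the data.

    ```r
    # Counts for the three most common types, copied from head(youthType).
    top3 <- c(Theft = 23280, Injury = 17456, Traffic = 15737)

    # Gaps between consecutive ranks: 1st minus 2nd, then 2nd minus 3rd.
    gaps <- -diff(top3)
    gaps

    # How many times larger the top count is than the second and third.
    ratios <- top3[1] / top3[-1]
    round(ratios, 2)
    ```

    The first gap (5824) is more than three times the second (1719), which agrees with what the positions of the points suggest.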

    The following code produces another visualisation, this time using colour (mostly luminance) to represent the number of incidents, while still using position in space to represent the type of crime. Colour is one of the least effective channels for continuous data, which makes it virtually impossible to compare different types of crime (although we can tell the most and least common just from the vertical ordering of the circles).

    ggplot(youthType) +
        geom_point(aes(x=Var1, colour=Freq), y=1, size=3) + 
        coord_flip() +
        labs(title="Overall Number of Incidents") +
        theme(axis.title.x=element_blank(),
              axis.title.y=element_blank())

  3. The following code produces a plot of the number of incidents per month for each type of crime.

    We can infer some things about the most and least common types of crime overall from this data visualisation, but it is most useful for looking at changes over time. In particular, the highlighting of Public Order and Dangerous or Negligent crimes allows us to see that the former has declined consistently over time and the latter has slowly increased.

    The use of colour to identify different types of crime is an example of preattentive pop out - we can instantly tell the different types of crimes apart - although the similarity of neighbouring colours and the overlapping of lines makes this less effective.

    Examples of Gestalt Rules are: the lines joining data for each crime type (connection), which help to identify different crime types on the plot; the white lines beneath the lines for Public Order and Dangerous or Negligent (figure and ground), which help to isolate those crime types from the others; and the fact that we can still identify Public Order as it crosses beneath Dangerous or Negligent (continuity). You could argue that the consistent use of colours between the legend and the main plot is an example of similarity, but again this is weakened by the fact that the colours are hard to differentiate.

    public <- subset(youthMonthType, Var2 == "Public Order Offences")
    dangerous <- 
        subset(youthMonthType, 
               Var2 == "Dangerous or Negligent Acts Endangering Persons")
    ggplot(youthMonthType) +
        geom_line(aes(Month, Freq, colour=Var2)) +
        geom_line(aes(Month, Freq), data=public,
                  colour="white", linewidth=3) +
        geom_line(aes(Month, Freq, colour=Var2),  data=public) +
        geom_line(aes(Month, Freq), data=dangerous,
                  colour="white", linewidth=3) +
        geom_line(aes(Month, Freq, colour=Var2), data=dangerous) +
        scale_colour_hue(labels=monthTypeLabels, 
                         name="ANZSOC.Division") +
        theme(legend.text=element_text(size=6)) +
        labs(title="Number of Incidents per Month") +
        theme(axis.title.x=element_blank(),
              axis.title.y=element_blank())

  4. The following code produces a data visualisation with most colours desaturated. Only two lines are highlighted by retaining their saturation.

    hues <- scales::hue_pal()(16)
    retain <- 
        monthTypeLevels %in% 
        c("Public Order Offences",
          "Dangerous or Negligent Acts Endangering Persons")
    cols <- colorspace::desaturate(hues, .7)
    cols[retain] <- hues[retain]
    ggplot(youthMonthType) +
        geom_line(aes(Month, Freq, colour=Var2)) +
        geom_line(aes(Month, Freq, colour=Var2), data=public, linewidth=1.2) +
        geom_line(aes(Month, Freq, colour=Var2), data=dangerous, linewidth=1.2) +
        scale_colour_manual(values=cols, 
                            labels=monthTypeLabels, 
                            name="ANZSOC.Division") +
        theme(legend.text=element_text(size=6)) +
        labs(title="Number of Incidents per Month") +
        theme(axis.title.x=element_blank(),
              axis.title.y=element_blank())

  5. The following code produces a version of the previous data visualisation that shows what it would look like for a viewer with deuteranomaly.

    This demonstrates that one of the highlighted lines would become indistinguishable from the other lines, so the highlighting would no longer be effective.

    deutanCols <- colorspace::deutan(cols)
    ggplot(youthMonthType) +
        geom_line(aes(Month, Freq, colour=Var2)) +
        geom_line(aes(Month, Freq, colour=Var2), data=public, linewidth=1.2) +
        geom_line(aes(Month, Freq, colour=Var2), data=dangerous, 
                  linewidth=1.2) +
        scale_colour_manual(values=deutanCols, 
                            labels=monthTypeLabels, 
                            name="ANZSOC.Division") +
        theme(legend.text=element_text(size=6)) +
        labs(title="Number of Incidents per Month") +
        theme(axis.title.x=element_blank(),
              axis.title.y=element_blank())

  6. The following code produces a facetted plot of the number of incidents per month, with a separate facet for each type of crime.

    This finally solves the overlapping lines problem, though we still have the issue of matching colours between the panels and the legend. This plot is the best so far for comparing Public Order with Dangerous or Negligent crimes, thanks to the grey lines in each panel. It also allows us to compare those crimes with all of the others. For example, we can see that Dangerous or Negligent has increased to be at a similar level to three of the four most common types of crime. The exception is Public Order, which has switched from well above Dangerous or Negligent to well below.

    In this data visualisation, Month and number of incidents are each represented by position on a common scale, and type of crime is represented by a combination of colour (hue) and position in space (facetting). These are very effective visual channels: we can clearly see progression in time (left to right) and changes in the number of incidents (up and down). We can also clearly distinguish between different types of crime (facets). The only remaining issue is that, with so many categories, it is difficult to identify (based on hue) exactly which panel is which type of crime.

    library(gggrid)
    Loading required package: grid
    highlight <- rbind(public, dangerous)
    highlight$Type <- highlight$Var2
    highlight$Var2 <- NULL
    shadow <- function(data, coords) {
        polylineGrob(coords$x, coords$y, id.lengths=rep(nrow(coords)/2, 2),
                     gp=gpar(col="grey"))
    }
    ggplot(youthMonthType) +
        grid_panel(shadow, aes(Month, Freq), data=highlight) +
        geom_line(aes(Month, Freq, colour=Var2)) +
        scale_colour_hue(labels=monthTypeLabels, 
                         name="ANZSOC.Division") +
        theme(legend.text=element_text(size=6)) +
        labs(title="Number of Incidents per Month") +
        facet_wrap("Var2") +
        scale_x_continuous(breaks=seq(as.Date("2016-01-01"), 
                                      as.Date("2022-01-01"), 
                                      length.out=4), 
                           labels=c(16, 18, 20, 2022)) +
        theme(strip.text=element_blank(),
              axis.title.x=element_blank(),
              axis.title.y=element_blank()) 
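
    The shadow() function relies on the id.lengths argument of polylineGrob() to draw two separate lines from one set of coordinates. The following sketch shows that mechanism in isolation, using invented coordinates. It assumes, as shadow() does, that the two series are stacked one after the other and have the same number of points.

    ```r
    library(grid)

    # Invented coordinates: two series of three points each, stacked end to end,
    # mimicking `coords` when `highlight` holds two crime types.
    x <- rep(c(0.1, 0.5, 0.9), 2)
    y <- c(0.2, 0.4, 0.3, 0.8, 0.6, 0.7)

    # id.lengths says how many consecutive points belong to each line;
    # without it, all six points would be joined into a single zig-zag.
    twoLines <- polylineGrob(x, y, id.lengths = rep(length(x) / 2, 2),
                             gp = gpar(col = "grey"))
    grid.newpage()
    grid.draw(twoLines)
    ```

    If one crime type had fewer months than the other, splitting the rows in half would join points from different series, so the equal-length assumption is worth checking before reusing this approach.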

Challenge

  1. The following code produces a version of the data visualisation with direct labelling and no grey background.

    The direct labelling makes use of the Gestalt Rules of proximity and similarity (same colour) to associate the crime type labels with the relevant line. This is arguably the best visualisation for identifying which line corresponds to which type of crime.

    library(gggrid)
    gridLine <- function(data, coords) {
        col <- data$colour[1]
        lwd <- 1
        face <- "plain"
        if (data$type[1] %in% 
            c("Public Order Offences", 
              "Dangerous or Negligent Acts Endangering Persons")) {
            lwd <- 2
            face <- "bold"
        } else {
            col <- colorspace::desaturate(col, .7)
        }
        line <- linesGrob(coords$x, coords$y, gp=gpar(col=col, lwd=lwd))
        text <- textGrob(data$type[1],
                         unit(-2, "mm"), mean(coords$y[1:2]), just="right",
                         gp=gpar(col=col, lwd=lwd, fontsize=8, fontface=face),
                         vp=viewport(clip="off"))
        gTree(children=gList(line, text))
    }
    gg <- ggplot(youthMonthType) +
        grid_group(gridLine, aes(Month, Freq, colour=Var2, type=Var2)) +
        scale_y_continuous(position="right") +
        scale_x_date(expand=expansion(c(0, NA))) +
        theme(axis.title.x=element_blank(),
              axis.title.y=element_blank(),
              axis.ticks.y=element_blank(),
              panel.background=element_blank(),
              panel.grid.major.x=element_blank(),
              panel.grid.minor.x=element_blank(),
              panel.grid.major.y=element_line(colour="black", linewidth=.1),
              panel.grid.minor.y=element_blank())
    grid.newpage()
    pushViewport(viewport(x=unit(3.5, "in"), width=unit(1, "npc") - unit(3.5, "in"),
                          just="left"))
    print(gg, newpage=FALSE)

Summary

In this lab we have practised assessing data visualisations in terms of visual perception. What visual channels are being employed and are they good or bad choices? What Gestalt Rules are being employed to help identify groupings? Are any visual illusions interfering with the interpretation of the images?

We have also gained more experience with ‘ggplot2’ (and possibly ‘grid’), particularly in terms of controlling the details of a data visualisation and modifying the default choices that ‘ggplot2’ makes for us.