The Data Set

The data set is a CSV file, nzpolice-proceedings.csv, which was derived from “Dataset 5” of Proceedings (offender demographics) on the policedata.nz web site.

We can read the data into an R data frame with read.csv().

crime <- read.csv("nzpolice-proceedings.csv")
head(crime)
  Age.Lower Police.District                                                   ANZSOC.Division
1        15          Tasman                                     Acts Intended to Cause Injury
2        20   Auckland City Abduction, Harassment and Other Related Offences Against a Person
3        40   Auckland City Abduction, Harassment and Other Related Offences Against a Person
4        10   Auckland City                                     Acts Intended to Cause Injury
5        15   Auckland City                                     Acts Intended to Cause Injury
6        15   Auckland City                                     Acts Intended to Cause Injury
     SEX       Date
1 Female 2015-12-01
2 Female 2015-12-01
3 Female 2015-12-01
4 Female 2015-12-01
5 Female 2015-12-01
6 Female 2015-12-01

The following code reorders the levels of the ANZSOC.Division factor by the height of the tallest age-group bar for each type of crime, from tallest to shortest. It also generates newlabels, which are line-wrapped versions of the ANZSOC.Division levels.

types <- apply(table(crime$ANZSOC.Division, crime$Age.Lower), 1, max)
newlevels <- names(types)[order(types, decreasing=TRUE)]
newlabels <- unlist(lapply(strwrap(newlevels, width=30, simplify=FALSE),
                           function(x) {
                               if (length(x) < 3)
                                   x <- c(x, rep(" ", 3 - length(x)))
                               paste(x, collapse="\n")
                           }))
crime$ANZSOC.Division <- factor(crime$ANZSOC.Division, levels=newlevels)
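
As a minimal sketch of the two steps above, the following code applies the same ideas to a tiny made-up data set. The data frame toy, the helper pad_label() and all of the counts are hypothetical and are not part of the police data. The reason for padding every label to three lines is not stated in the lab; a plausible reading, assumed here, is that it keeps all legend entries the same height.

```r
# Hypothetical toy data: three made-up crime types across a few age groups.
toy <- data.frame(
    type = c(rep("Theft", 5), rep("Fraud", 2), rep("Homicide", 1)),
    age  = c(15, 15, 15, 20, 20, 20, 20, 40)
)

# Height of the tallest age-group bar for each type.
peaks <- apply(table(toy$type, toy$age), 1, max)

# Types sorted from tallest peak to shortest.
ordered_types <- names(peaks)[order(peaks, decreasing = TRUE)]
print(ordered_types)

# Wrap a label at 30 characters and pad it with blank lines up to n lines.
pad_label <- function(x, n = 3) {
    lines <- strwrap(x, width = 30)
    if (length(lines) < n)
        lines <- c(lines, rep(" ", n - length(lines)))
    paste(lines, collapse = "\n")
}
padded <- pad_label("Theft")
```

Here Theft has the tallest bar (three incidents at age 15), so it becomes the first level, and the short label "Theft" gains two blank lines.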

The following code generates a table of counts for the number of incidents per age group, broken down by type of crime.

crimeAgeType <- as.data.frame(table(crime$Age.Lower, crime$ANZSOC.Division))
crimeAgeType$Age <- as.numeric(as.character(crimeAgeType$Var1))
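
The conversion via as.character() matters because as.data.frame(table(...)) stores the age groups as a factor. The following small sketch, using made-up values rather than the police data, shows the column names that table() produces and what goes wrong if the factor is converted to numeric directly.

```r
# Hypothetical values: as.data.frame() on a two-way table yields the
# columns Var1, Var2 and Freq, with Var1 and Var2 stored as factors.
counts <- as.data.frame(table(c(10, 15, 15), c("a", "b", "b")))
print(names(counts))

# Hypothetical age-group lower bounds stored as a factor.
ages <- factor(c("10", "15", "20", "40"))

# Direct conversion returns the internal level codes, not the ages.
codes <- as.numeric(ages)

# Converting to character first recovers the actual ages.
values <- as.numeric(as.character(ages))
```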

Questions of Interest

We have already seen the distribution of incidents across age groups (across all crimes); there is a highly skewed distribution with a peak of crimes in the 15-20 age group.

In this lab we will focus on the distribution of incidents across age groups, broken down by type of crime (ANZSOC.Division).

Data Visualisations

library(ggplot2)
  1. The following code produces a bar plot of the number of incidents in each age group broken down by the type of crime.

    We can clearly see that some types of crime are more common than others (it helps a lot that the levels of ANZSOC.Division have been ordered by max peak already - otherwise there would be even more overlapping of bars!). It is also not too hard to see that some peaks are at higher age groups than 15-20, although it is not easy to tell exactly which crime types those correspond to because of the colour scale problems (see below). It also looks like the distributions all have a single peak, although it is hard to be sure because of the overplotting.

    ggplot(crimeAgeType) +
        geom_col(aes(Age, Freq, color=Var2, fill=Var2), position="identity",
                 just=0)

    In terms of overplotting, there is a problem with this plot because, for all but the “Homicide” category, the bars are at least partially obscured. In some cases, bars are completely obscured so it is impossible to see the full distribution of incidents across age groups for all crime types.

    In terms of colour use, we have a good example of using too many different colours. This makes it very difficult to distinguish the different crime types from each other AND very difficult to match the bars in the main plot to the square regions in the legend.

    In terms of labelling, there are several problems: there is no title, the count on the y-axis is unexplained, and the legend title is uninformative. This makes it difficult to interpret what the bar plot is showing. In addition, the legend labels are too long, so the legend takes up too much of the plot.

    In terms of the principle of proportional ink, the bars all start at zero, which is appropriate. However, because the bars overlap each other, the visible part of the bars is not at all representative of the frequency of each age group, so we could say that the principle is not being upheld.
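
The proportional ink argument can be made concrete with a rough calculation. The sketch below uses made-up counts and assumes that bars for later crime types are drawn on top of bars for earlier ones, which is how the plot appears to behave but is an assumption here. Under that assumption, the visible height of a bar is whatever sticks out above the tallest bar drawn after it.

```r
# Hypothetical counts: rows are age groups, columns are crime types in
# drawing order (assumed: later columns are drawn on top of earlier ones).
counts <- matrix(c(50, 80, 30,
                   40, 60, 45,
                    5,  8,  3),
                 nrow = 3,
                 dimnames = list(age = c("10", "15", "20"),
                                 type = c("A", "B", "C")))

# Visible height = bar height minus the tallest bar drawn later (never < 0).
visible <- counts
for (i in seq_len(ncol(counts) - 1)) {
    later <- counts[, (i + 1):ncol(counts), drop = FALSE]
    visible[, i] <- pmax(0, counts[, i] - apply(later, 1, max))
}

# Fraction of each bar that remains visible.
print(round(visible / counts, 2))
```

In this toy example, type A has a count of 30 in the oldest age group but none of that bar is visible, while the last type drawn, C, is fully visible. That mirrors the situation in the plot above, where only the last category escapes being obscured.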

  2. The following code attempts to improve the plot by addressing several of the issues:

    • We can set alpha=.3 to make the fill of the bars semitransparent. However, semitransparency does not improve things much because there are so many overlapping bars. It is impossible to determine what most areas mean. If anything, the situation is worse.

      ggplot(crimeAgeType) +
          geom_col(aes(Age, Freq, color=Var2, fill=Var2), position="identity",
                   just=0, alpha=.3)

    • We can change to position="jitter" to jitter the bars. However, jittering does not help either. The jittering does not make it any easier to see bars that were previously overlapped.

      ggplot(crimeAgeType) +
          geom_col(aes(Age, Freq, color=Var2, fill=Var2), position="jitter")

      Using position="dodge" does reduce the overlap, but because there are so many bars and the legend takes up too much room, we cannot make out individual crime types at all.

      ggplot(crimeAgeType) +
          geom_col(aes(Age, Freq, color=Var2, fill=Var2), position="dodge")

      Using position="stack" also reduces overlap. However, now most of the bars do not have a common baseline, so it is actually harder to see peaks and overall shapes of distributions for different types of crime.

      ggplot(crimeAgeType) +
          geom_col(aes(Age, Freq, color=Var2, fill=Var2), position="stack",
                   just=0)

    • Using a different geom (line) reduces overlap, but because there are so many lines and so many colours, it is difficult to isolate different crime types.

      ggplot(crimeAgeType) +
          geom_line(aes(Age, Freq, color=Var2))

      Filling the area beneath the lines, by using an area geom, does make it easier to identify different crime types, and so to see the shape of some of the distributions, but we are back to the original overplotting problem.

      ggplot(crimeAgeType) +
          geom_area(aes(Age, Freq, color=Var2, fill=Var2), position="identity",
                    alpha=.5)
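
One way to see why semitransparency struggles with many overlapping bars is to work out how quickly layers build up to near-solid colour. The sketch below assumes standard "over" compositing of layers that all share the same alpha; the function name stacked_opacity() is hypothetical, and the exact rendering will depend on the graphics device.

```r
# Combined opacity of n overlapping layers, each drawn with the same alpha,
# assuming standard "over" compositing.
stacked_opacity <- function(alpha, n) 1 - (1 - alpha)^n

# Opacity after 1 to 5 overlapping layers with alpha = 0.3.
opacity <- stacked_opacity(0.3, 1:5)
print(round(opacity, 3))
```

Each extra layer adds less and less contrast, so regions covered by many bars all end up looking similarly dark and the individual layers cannot be told apart.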

  3. The following code attempts to improve the plot from Question 1 by addressing the labelling. We use the line-wrapped labels in the legend and shrink the label font. The legend is also given a title. An overall title has been added and the y-axis label has been removed.

    Possibly the largest improvement is the fact that the legend now takes up less room. The larger plot panel makes it much easier to see some of the shapes of some of the distributions for some crime types. However, we still have some pretty severe overlapping.

    The legend title and the overall title are just general improvements that make it easier to know what the plot is showing (so that we know that the data visualisation is relevant to the questions of interest in the first place).

    ggplot(crimeAgeType) +
        geom_col(aes(Age, Freq, color=Var2, fill=Var2), position="identity",
                 just=0) + 
        scale_fill_hue(labels=newlabels, 
                       aesthetics=c("colour", "fill"),
                       name="ANZSOC.Division") +
        theme(legend.text=element_text(size=6)) +
        labs(title="The Number of Crimes per Age Group by Crime Type",
             subtitle="Counts are totals from July 2014 to December 2022") +
        ylab("")

  4. The following code makes use of small multiples (facetting) to produce a separate panel for each crime type (and retains the labelling from the previous question).

    This completely eliminates the overlapping, so we are finally able to clearly see the distributions for each crime type. Apart from the fact that some crime types are much more common than others, we can see that all age distributions are basically unimodal. We can also see that the peaks for some crime types are in slightly older age groups (e.g., Offences against Justice), though this takes a bit of effort to figure out.

    One problem is that for uncommon crimes, the bars are so small that we cannot see the shape of the distribution very well. The problem of matching colours of bars to the legend also remains, and the panel strips do not cope at all well with the long category labels.

    ggplot(crimeAgeType) +
        geom_col(aes(Age, Freq, color=Var2, fill=Var2), just=0) + 
        scale_fill_hue(labels=newlabels, 
                       aesthetics=c("colour", "fill"),
                       name="ANZSOC.Division") +
        theme(legend.text=element_text(size=6)) +
        labs(title="The Number of Crimes per Age Group by Crime Type",
             subtitle="Counts are totals from July 2014 to December 2022") +
        ylab("") +
        facet_wrap("Var2")
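
Because reading the peak age group off each panel takes some effort, it can help to check the visual impression numerically. The following sketch uses a made-up data frame, toy, with the same column names as crimeAgeType (Var2, Age, Freq); the counts are hypothetical.

```r
# Hypothetical counts in the same shape as crimeAgeType.
toy <- data.frame(
    Var2 = rep(c("Theft", "Fraud"), each = 4),
    Age  = rep(c(10, 15, 20, 25), times = 2),
    Freq = c(20, 90, 60, 30,   # Theft: largest count at age 15
              1,  5,  9,  7)   # Fraud: largest count at age 20
)

# For each crime type, find the age group with the largest count.
peak_age <- sapply(split(toy, toy$Var2),
                   function(d) d$Age[which.max(d$Freq)])
print(peak_age)
```

The same call with crimeAgeType in place of toy should give the peak age group for each real crime type, assuming the columns are named as in this lab.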

  5. The following code produces a version of the previous plot with a higher data-ink ratio.

    • Remove the panel strips (and rely on colour to encode type of crime).
    • Remove panel background.
    • Remove most grid lines (just leave horizontal majors, in grey).
    • NO fill (in either bars or legend).

    It is not clear whether it is any easier to see features of interest in this plot, although removing the panel strips has allowed a little more vertical space for the bars, so we can see the distribution shapes a little better in the bottom row. Overall, the plot is perhaps less “busy” and distracting, but there is an element of personal preference involved. The problem of matching colours of bars to the legend remains.

    ggplot(crimeAgeType) +
        geom_col(aes(Age, Freq, color=Var2), fill=NA, just=0) + 
        scale_colour_hue(labels=newlabels, 
                         name="ANZSOC.Division") +
        theme(legend.text=element_text(size=6)) +
        labs(title="The Number of Crimes per Age Group by Crime Type",
             subtitle="Counts are totals from July 2014 to December 2022") +
        ylab("") +
        facet_wrap("Var2") +
        theme(panel.background=element_blank(),
              panel.grid.major.x=element_blank(),
              panel.grid.minor.x=element_blank(),
              panel.grid.major.y=element_line(colour="grey"),
              panel.grid.minor.y=element_blank(),
              strip.text=element_blank(),
              legend.key=element_rect(fill=NA))

  6. The following code produces a facetted plot that drops the colour mapping and the legend altogether and directly labels each panel, but with a label inside the panel rather than in a strip on top. This allows us to use the entire ANZSOC.Division label, at a reasonable font size. Note that the bars are semitransparent so they do not fully obscure the labels where they overlap them.

    This is arguably the clearest view of the data yet. There is no overlapping. The lack of legend provides even more room for the panels (although the distributions on the bottom row are still hard to see). And the whole colour matching problem has disappeared entirely!

    library(grid)    # provides textGrob(), unit() and gpar() used below
    library(gggrid)
    wrapLabel <- function(x) {
        sapply(strwrap(x, width=20, simplify=FALSE),
               function(y) paste(y, collapse="\n"))
    }
    plotLabel <- function(data, coords) {
        textGrob(wrapLabel(data$label), 
                 unit(2, "mm"), unit(1, "npc") - unit(2, "mm"),
                 just=c("left", "top"), gp=gpar(fontsize=8))
    }
    ggplot(crimeAgeType) +
        grid_panel(plotLabel, aes(label=Var2)) +
        geom_col(aes(Age, Freq), fill=4, just=0, alpha=.7) + 
        facet_wrap("Var2") +
        scale_y_continuous(expand=expansion(c(0, .05))) +
        theme(panel.background=element_rect(fill="grey90"),
              panel.grid.major.x=element_blank(),
              panel.grid.minor.x=element_blank(),
              panel.grid.minor.y=element_blank(),
              strip.text=element_blank(),
              legend.position="none")
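
The label position in plotLabel() relies on 'grid' unit arithmetic, which lets us mix relative and absolute units. As a small sketch (the label text is just an example taken from the data shown earlier), the following code builds the same kind of text grob outside of a plot: "npc" units run from 0 to 1 across the panel, so the y location is the top of the panel minus 2mm.

```r
library(grid)  # base R graphics package that provides unit() and textGrob()

# 2 mm in from the left edge and 2 mm down from the top edge of the panel.
x <- unit(2, "mm")
y <- unit(1, "npc") - unit(2, "mm")

# A text grob anchored by its top-left corner at that location.
label <- textGrob("Acts Intended to\nCause Injury", x, y,
                  just = c("left", "top"), gp = gpar(fontsize = 8))
```

Anchoring the text by its top-left corner means that labels with different numbers of lines all start at the same place in every panel.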

Challenge

  1. The following code produces a variation on the previous plot with two additions:

    • There is a vertical white grid line in the middle of the 15-20 age group so that we can easily compare where the peaks fall relative to that age group.

    • There are grey bars in each panel where the highest bar is half the height of the panel regardless of the absolute counts.

    This allows us to see the shapes of the age distributions for less common crime types. For example, we can now easily see that the peaks for Homicide and Fraud are in noticeably older age groups than the peaks for Theft and Public Order.

    I think the grey bars are easier to do with ‘gggrid’. I don’t think the grey bars are easy with an additional ‘ggplot2’ layer because ‘ggplot2’ does not like multiple y-scales on the same plot. On the other hand, the vertical grid line is possibly just as easy with the ‘ggplot2’ layer because it has to align with the x-axis scale.

    shadow <- function(data, coords) {
        props <- .5*coords$y/max(coords$y)
        rectGrob(coords$x, 0, resolution(coords$x)*.9, props,
                 just=c("left", "bottom"),
                 gp=gpar(col=NA, fill="grey"))
    }
    ggplot(crimeAgeType) +
        geom_vline(data=data.frame(x=17.5), mapping=aes(xintercept=x),
                   colour="white") +
        grid_panel(shadow, aes(x=Age, y=Freq)) +
        geom_col(aes(Age, Freq), fill=4, just=0) + 
        grid_panel(plotLabel, aes(label=Var2)) +
        facet_wrap("Var2") +
        scale_y_continuous(expand=expansion(c(0, .05))) +
        theme(panel.background=element_rect(fill="grey90"),
              panel.grid.major.x=element_blank(),
              panel.grid.minor.x=element_blank(),
              panel.grid.minor.y=element_blank(),
              strip.text=element_blank(),
              legend.position="none")
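
The rescaling inside shadow() can be tried on its own with plain numbers. This sketch uses made-up counts and a hypothetical helper, half_height(), and assumes the heights are measured from zero at the bottom of the panel (which the zero lower expansion on the y-axis is there to ensure).

```r
# Hypothetical counts for a common and a rare crime type (four age groups).
common <- c(200, 900, 600, 300)
rare   <- c(  2,   5,   9,   7)

# Rescale so that the tallest bar is always half the panel height.
half_height <- function(y) 0.5 * y / max(y)

print(round(half_height(common), 3))
print(round(half_height(rare), 3))
```

After rescaling, both types have a tallest bar of 0.5, so the shape of the rare type is as easy to read as the shape of the common one, and it becomes clear that their peaks fall in different age groups.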

Summary

We began with a bar plot that had serious problems with overlapping and colour use and labelling (and violated the principle of proportional ink). Improving the plot required using small multiples (facetting) plus some serious surgery on the labels, particularly the legend, or even better removing the legend entirely and using direct labelling of the panels.

In order to make some of the important changes, particularly for labelling and the legend, we have had to customise the default ‘ggplot2’ settings substantially and even resort to ‘grid’ in some cases to get fine control of the final result.

We have learned that some types of crime are much more common than others and, although the distribution of the number of incidents across age groups is roughly similar (unimodal and right skewed with a peak around younger age groups), there are some crime types with a peak in an older age group than others (e.g., Homicide and Fraud versus Theft and Public Order Offences).