The data set is a CSV file, nzpolice-proceedings.csv,
which was derived from “Dataset 5” of Proceedings
(offender demographics) on the policedata.nz
web site.
We can read the data into an R data frame with
read.csv().
crime <- read.csv("nzpolice-proceedings.csv")
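The following code generates a data frame, crimeSex, containing the number of incidents per month for both male and female offenders. It also computes crimeProp, the proportion of incidents each month that involve female offenders (the first column of the table).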
crimeTable <- table(crime$Date, crime$SEX)
crimeProp <- as.data.frame(apply(crimeTable, 1, function(x) x[1]/sum(x)))
names(crimeProp) <- "Prop"
crimeProp$Month <- as.Date(rownames(crimeProp))
crimeSex <- as.data.frame(crimeTable)
names(crimeSex) <- c("Date", "Sex", "Freq")
crimeSex$Month <- as.Date(crimeSex$Date)
head(crimeSex)
        Date    Sex Freq      Month
1 2014-07-01 Female 2755 2014-07-01
2 2014-08-01 Female 2671 2014-08-01
3 2014-09-01 Female 2663 2014-09-01
4 2014-10-01 Female 2701 2014-10-01
5 2014-11-01 Female 2586 2014-11-01
6 2014-12-01 Female 2667 2014-12-01
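To see what the table() and as.data.frame() steps above are doing, here is a minimal sketch on a tiny made-up data frame. The toy values are invented for illustration and are not the police data; only the column names Date and SEX mirror the real file.

```r
# Toy data (made up for illustration; not the real police data)
toy <- data.frame(
  Date = c("2014-07-01", "2014-07-01", "2014-07-01",
           "2014-08-01", "2014-08-01"),
  SEX  = c("Female", "Male", "Male", "Female", "Male")
)

# Cross-tabulate: one row per date, one column per sex
toyTable <- table(toy$Date, toy$SEX)

# Proportion of Female offenders per date; Female is column 1
# because table() orders the levels alphabetically
toyProp <- apply(toyTable, 1, function(x) x[1] / sum(x))

# Long format: one row per date/sex combination
toySex <- as.data.frame(toyTable)
names(toySex) <- c("Date", "Sex", "Freq")
toySex$Month <- as.Date(toySex$Date)
toySex
```

The long format is the shape that ggplot2 expects, with one column per aesthetic (Month, Freq, Sex).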
The following code generates a data frame containing the number of
incidents per month for both male and female offenders, broken down by
the type of crime. The Type factor has levels in decreasing
order of the number of incidents for males in the first month of
data.
crimeTypeSexList <- lapply(split(crime,
                                 list(crime$Date,
                                      crime$SEX,
                                      crime$ANZSOC.Division)),
                           nrow)
crimeTypeSex <- cbind(as.data.frame(do.call(rbind, crimeTypeSexList)),
                      as.data.frame(do.call(rbind,
                                            strsplit(names(crimeTypeSexList),
                                                     "[.]"))))
rownames(crimeTypeSex) <- NULL
colnames(crimeTypeSex) <- c("Count", "Month", "Sex", "Type")
crimeTypeSex$Month <- as.Date(crimeTypeSex$Month)
monthFirst <- subset(crimeTypeSex, Month == "2014-07-01" & Sex == "Male")
monthLevels <- monthFirst$Type[order(monthFirst$Count, decreasing=TRUE)]
monthLabels <- unlist(lapply(strwrap(monthLevels, width=30, simplify=FALSE),
                             function(x) {
                                 if (length(x) < 3)
                                     x <- c(x, rep(" ", 3 - length(x)))
                                 paste(x, collapse="\n")
                             }))
crimeTypeSex$Type <- factor(crimeTypeSex$Type, levels=monthLevels)
head(crimeTypeSex)
Count Month Sex Type
1 121 2014-07-01 Female Abduction, Harassment and Other Related Offences Against a Person
2 89 2014-08-01 Female Abduction, Harassment and Other Related Offences Against a Person
3 129 2014-09-01 Female Abduction, Harassment and Other Related Offences Against a Person
4 112 2014-10-01 Female Abduction, Harassment and Other Related Offences Against a Person
5 110 2014-11-01 Female Abduction, Harassment and Other Related Offences Against a Person
6 116 2014-12-01 Female Abduction, Harassment and Other Related Offences Against a Person
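As an alternative sketch of the same counting step (not the approach used above), a three-way table() can produce the counts and the grouping columns in one go, which avoids splitting names on "." afterwards. The toy data below is made up; only the column names mirror the real file.

```r
# Toy data (made up for illustration; column names mirror the real data)
toy <- data.frame(
  Date = rep("2014-07-01", 6),
  SEX = c("Male", "Male", "Male", "Female", "Male", "Female"),
  ANZSOC.Division = c("Theft", "Theft", "Fraud", "Fraud", "Theft", "Theft")
)

# Count rows for every Month/Sex/Type combination in one step
toyTypeSex <- as.data.frame(table(Month = toy$Date,
                                  Sex = toy$SEX,
                                  Type = toy$ANZSOC.Division),
                            responseName = "Count",
                            stringsAsFactors = FALSE)
toyTypeSex$Month <- as.Date(toyTypeSex$Month)

# Order the Type levels by the male counts in the first month,
# largest first
first <- subset(toyTypeSex, Month == as.Date("2014-07-01") & Sex == "Male")
typeLevels <- first$Type[order(first$Count, decreasing = TRUE)]
toyTypeSex$Type <- factor(toyTypeSex$Type, levels = typeLevels)
levels(toyTypeSex$Type)
```

Because facet_wrap() lays panels out in factor-level order, ordering the levels this way puts the most common crime types (for males, in the first month) in the first panels.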
We already know that there are many more incidents involving Male offenders than Female offenders.
library(ggplot2)
ggplot(crime) +
    geom_bar(aes(SEX)) +
    labs(title="Number of Incidents (thousands)",
         subtitle="July 2014 to Dec 2022") +
    scale_y_continuous(breaks=seq(0, 600000, 200000), labels=seq(0, 600, 200)) +
    theme(axis.title.y=element_blank(),
          axis.title.x=element_blank())
In this lab, we will focus on differences between Males and Females over time:
The following code produces a line plot of the number of incidents over time, broken down by the sex of the offender.
We can see that there are more than twice as many incidents for male offenders as for female offenders and, while both groups show a decreasing trend, female offending appears to be decreasing at a slightly slower rate.
In terms of CRAP guidelines:
ggplot(crimeSex) +
    geom_line(aes(Month, Freq, colour=Sex)) +
    labs(title="Number of Incidents")
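One possible way to put a number on the "slower rate" impression from the plot is to fit a straight line to each series and compare the slopes relative to the starting level. This is only a sketch on made-up counts (the real values are in crimeSex), and a relative slope is just one of several reasonable definitions of "rate".

```r
# Toy monthly counts (made up; exactly linear to keep the example simple)
idx <- 0:11  # months since the start of the series
toySex <- data.frame(
  Sex = rep(c("Female", "Male"), each = 12),
  Index = rep(idx, 2),
  Freq = c(2700 - 5 * idx, 6000 - 20 * idx)
)

# Fit a straight line per sex and express the slope as a fraction
# of the fitted starting level (negative means decreasing)
relSlope <- sapply(split(toySex, toySex$Sex), function(d) {
  fit <- lm(Freq ~ Index, data = d)
  unname(coef(fit)["Index"] / coef(fit)["(Intercept)"])
})
relSlope
```

In this toy data the female series loses a smaller fraction of its starting level per month than the male series, by construction.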
The following code produces a modified version of the previous plot by applying the following CRAP design guidelines:
The grey background has been removed and the text labels and the remaining grid lines have been drawn grey, which produces a much stronger contrast between the data lines and other elements of the plot.
The legend has been replaced with direct labels and the labels are drawn in the same colour as the lines (repetition).
There is much more alignment, particularly of text labels: in addition to the aligned title, the sex labels are right-aligned with the right edge of the plot panel, and the y-axis labels have been bottom-aligned with the horizontal grid lines.
The direct labels have been placed right next to the ends of the lines (proximity).
The end result is arguably cleaner and simpler than the original (less clutter), with a stronger emphasis on the data lines. This makes it easier to concentrate on the most important elements of the plot, which are the trends in the data lines.
library(gggrid)
Loading required package: grid
label <- function(data, coords) {
    textGrob(data$sex, 1, just="right",
             unit(coords$y[nrow(coords)], "npc") + unit(5, "mm"),
             gp=gpar(col=data$colour[1]))
}
ggplot(crimeSex) +
    geom_line(aes(Month, Freq, colour=Sex), linewidth=1) +
    grid_group(label, aes(Month, Freq, colour=Sex, sex=Sex)) +
    labs(title="Number of Incidents") +
    scale_x_date(expand=expansion(0)) +
    theme(panel.background=element_blank(),
          panel.grid.major.y=element_line(colour="grey", linewidth=.1),
          legend.position="none",
          text=element_text(colour="grey"),
          plot.title=element_text(colour="black", size=16, face="bold"),
          axis.title=element_blank(),
          axis.text=element_text(colour="grey", vjust=0),
          axis.ticks=element_blank(),
          plot.margin=unit(rep(1, 4), "cm"))
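If 'gggrid' is not available, a rougher version of the direct labels can be sketched with 'ggplot2' alone by drawing text at the last observation of each line. This is an alternative under stated assumptions, not the method used above: the data is made up, and the labels are positioned in data coordinates rather than being aligned with the right edge of the panel.

```r
library(ggplot2)

# Toy data (made up for illustration)
toySex <- data.frame(
  Month = rep(seq(as.Date("2014-07-01"), by = "month", length.out = 6), 2),
  Sex = rep(c("Female", "Male"), each = 6),
  Freq = c(2700, 2650, 2660, 2600, 2580, 2550,
           6000, 5900, 5950, 5800, 5700, 5650)
)

# Last observation for each sex: this is where the direct labels go
ends <- do.call(rbind, lapply(split(toySex, toySex$Sex),
                              function(d) d[which.max(as.numeric(d$Month)), ]))

p <- ggplot(toySex, aes(Month, Freq, colour = Sex)) +
  geom_line(linewidth = 1) +
  # Right-align each label at the end of its line, nudged just above it;
  # the label inherits the line colour (repetition, proximity)
  geom_text(aes(label = Sex), data = ends, hjust = 1, vjust = -0.5) +
  theme(legend.position = "none")
p
```

The 'gggrid' version gives finer control, because the label position can mix data-based coordinates with absolute offsets such as 5 mm.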
The following code produces a facetted plot with one panel for each type of crime and separate lines for males and females within each panel.
We can see that the overall trend of many more incidents involving Male offenders than Female offenders repeats across all crimes (though it is difficult to see with the low numbers in the bottom row). There are also some remarkable consistencies in the details of the trends over time for several types of crime: where there are pronounced peaks or troughs for Male offenders, there are similar peaks or troughs for Female offenders. Whatever is driving those variations appears to have no regard for sex. Looking in more detail, the separation between the Male and Female lines is noticeably different for some crimes. For example, Abduction is more dominated by Males than Fraud is. It would be useful to plot the proportions of Males vs Females to explore this further.
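The proportions suggested above could be computed along the following lines. This is a sketch on made-up counts in the same shape as crimeTypeSex (columns Month, Sex, Type, Count); the numbers are invented and do not reflect the real data.

```r
# Toy counts in the same shape as crimeTypeSex (values made up)
toy <- data.frame(
  Month = as.Date("2014-07-01"),
  Sex = rep(c("Female", "Male"), times = 2),
  Type = rep(c("Abduction", "Fraud"), each = 2),
  Count = c(10, 90, 40, 60)
)

# Total count within each Month/Type group, repeated on every row
key <- paste(toy$Month, toy$Type)
total <- ave(toy$Count, key, FUN = sum)
toy$Prop <- toy$Count / total

# Proportion of Male offenders for each month and type
subset(toy, Sex == "Male", select = c(Month, Type, Prop))
```

Plotting Prop against Month, one panel per Type, would put all crime types on a common 0 to 1 scale, so the panels with low counts become comparable with the rest.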
The following examples of CRAP design are being employed:
NOTE that we need the ‘ggtext’ package to achieve the colours in the title.
library(gggrid)
library(ggtext)
typeLabel <- function(data, coords) {
    subset <- subset(coords, sex == "Male")
    label <- paste(strwrap(data$type[1], width=25), collapse="\n")
    textGrob(label,
             0,
             unit(subset$y[1], "npc") + unit(4, "mm"),
             just=c("left", "bottom"),
             gp=gpar(col="grey50", fontsize=9))
}
cols <- scales::hue_pal()(2)
title <- paste0('Number of Incidents for <span style="color: ', cols[2],
                '">Males</span> and <span style="color: ', cols[1],
                '">Females</span>')
ggplot(crimeTypeSex) +
    grid_panel(typeLabel, aes(y=Count, type=Type, sex=Sex)) +
    grid_panel(segmentsGrob(0, 0, 1, 0, gp=gpar(col="grey50"))) +
    geom_line(aes(Month, Count, colour=Sex)) +
    scale_x_date(expand=expansion(0)) +
    scale_y_continuous(expand=expansion(0),
                       breaks=seq(0, 900, 300)) +
    coord_cartesian(ylim=c(0, 1500)) +
    labs(title=title, subtitle=" ") +
    facet_wrap("Type") +
    theme(strip.text=element_blank(),
          axis.text=element_text(colour="grey50"),
          axis.title=element_blank(),
          axis.text.y=element_text(vjust=0),
          plot.background=element_rect(fill=grey(.1)),
          panel.background=element_rect(fill=grey(.1)),
          panel.grid=element_blank(),
          panel.spacing.x=unit(5, "mm"),
          legend.position="none",
          plot.title=element_markdown(size=16, colour="grey50"),
          plot.margin=unit(rep(1, 4), "cm"))
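The coloured title in the code above works because the default discrete colour scale assigns hues to factor levels in order, so the first colour belongs to Female and the second to Male. The sketch below makes that pairing explicit; the helper span() and the names sexLevels and toyTitle are hypothetical, introduced only for this illustration.

```r
# Default ggplot2 hues for a two-level discrete colour scale
cols <- scales::hue_pal()(2)

# Colours are matched to levels in order; with the levels
# "Female", "Male" the first colour is Female, the second Male
sexLevels <- sort(c("Male", "Female"))
names(cols) <- sexLevels

# Hypothetical helper: wrap text in an HTML span with a colour,
# which element_markdown() from 'ggtext' can render in a title
span <- function(text, colour) {
  paste0('<span style="color: ', colour, '">', text, '</span>')
}

toyTitle <- paste("Number of Incidents for",
                  span("Males", cols["Male"]), "and",
                  span("Females", cols["Female"]))
toyTitle
```

Naming the colours by level avoids having to remember which index goes with which sex, at the cost of a couple of extra lines.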
The following paragraph and modified version of the plot above
make use of the same font: the Google
Font Gruppo. See the .Rmd file for the code that sets
up the font in the paragraph of text.
typeLabelGruppo <- function(data, coords) {
    subset <- subset(coords, sex == "Male")
    label <- paste(strwrap(data$type[1], width=25), collapse="\n")
    textGrob(label,
             0,
             unit(subset$y[1], "npc") + unit(4, "mm"),
             just=c("left", "bottom"),
             gp=gpar(col="grey50", fontsize=9,
                     fontfamily="Gruppo", lineheight=1, fontface="bold"))
}
ggGruppo <- ggplot(crimeTypeSex) +
    grid_panel(typeLabelGruppo, aes(y=Count, type=Type, sex=Sex)) +
    grid_panel(segmentsGrob(0, 0, 1, 0, gp=gpar(col="grey50"))) +
    geom_line(aes(Month, Count, colour=Sex)) +
    scale_x_date(expand=expansion(0)) +
    scale_y_continuous(expand=expansion(0),
                       breaks=seq(0, 900, 300)) +
    coord_cartesian(ylim=c(0, 1500)) +
    labs(title=title, subtitle=" ") +
    facet_wrap("Type") +
    theme(strip.text=element_blank(),
          axis.text=element_text(colour="grey50"),
          axis.title=element_blank(),
          axis.text.y=element_text(vjust=0, family="Gruppo", face=2),
          plot.background=element_rect(fill=grey(.1)),
          panel.background=element_rect(fill=grey(.1)),
          panel.grid=element_blank(),
          panel.spacing.x=unit(5, "mm"),
          legend.position="none",
          plot.title=element_markdown(size=16, colour="grey50",
                                      family="Gruppo", face=2),
          plot.margin=unit(rep(1, 4), "cm"))
This paragraph of text and the plot below BOTH use the same font (the Google Font Gruppo). The plot is the same as in the last question above, just with the Gruppo font applied. The width of the paragraph has also been set to be the same as the width of the plot (8 inches).
In this lab we have practised assessing data visualisations in terms of CRAP design guidelines. We have also attempted to modify data visualisations using CRAP design guidelines. Hopefully we have demonstrated that graphic design is not just about making an image pretty, but it can have a genuine impact on the clarity and focus of a data visualisation.
We have explored the crime data some more as well, learning that the differences between Males and Females do depend on the type of crime.
We have also gained more experience with ‘ggplot2’ and ‘grid’, particularly in terms of controlling the details of a data visualisation and modifying the default choices that ‘ggplot2’ makes for us.