The data set is a CSV file, nzpolice-proceedings.csv,
which was derived from “Dataset 5” of Proceedings
(offender demographics) on the policedata.nz
web site.
We can read the data into an R data frame with
read.csv().
crime <- read.csv("nzpolice-proceedings.csv")
head(crime)
Age.Lower Police.District ANZSOC.Division
1 15 Tasman Acts Intended to Cause Injury
2 20 Auckland City Abduction, Harassment and Other Related Offences Against a Person
3 40 Auckland City Abduction, Harassment and Other Related Offences Against a Person
4 10 Auckland City Acts Intended to Cause Injury
5 15 Auckland City Acts Intended to Cause Injury
6 15 Auckland City Acts Intended to Cause Injury
SEX Date
1 Female 2015-12-01
2 Female 2015-12-01
3 Female 2015-12-01
4 Female 2015-12-01
5 Female 2015-12-01
6 Female 2015-12-01
Each row contains information on a single incident that the Police
handled. The Age.Lower column gives the lower bound of a
5-year age band of the offender, the SEX column gives the
sex of the offender, and the Date column gives the year and
month of the incident (all incidents are marked as occurring on the
first day of the month). There are over 800,000 incidents recorded
between 2014 and 2022.
Our main interest is in trends in youth offending (up to age 19), particularly at the end of 2021 and the beginning of 2022.
We are also interested in the a comparison of youth offending versus adult offending and any differences between males and females.
library(ggplot2)
The code below produces a bar plot of the number of incidents for each age group.
We are using a “bar” geom. We are mapping the age group lower bound to the x-location of the bar, we are using a “count” stat to count the number of incidents in each age group, and we are mapping that count the height of the bar.
This bar plot shows that youth offending is a significant issue because there are more incidents occurring in younger age groups than in older age groups (apart from very young children). It is also interesting to see that there are crimes being committed by very old people (over 80) and very young people (less than 10). This plot tells us nothing about trends over time (we have collapsed across multiple years). It also may not provide a completely fair comparison of the number of incidents between different age groups because, for example, there will be fewer people at older ages (e.g., see the New Zealand population pyramid).
The age lower bound is used for the left location of the bar
because we have specified just=0.
ggplot(crime) +
geom_bar(aes(Age.Lower), just=0)
The following code produces that same bar plot as above, but using tabulated data. We can achieve this by changing the geom that we are using (from a “bar” to a “col”) or by changing the stat that we are using (from a “count” to “identity”). Both options are shown below.
crimeTab <- as.data.frame(table(crime$Age.Lower))
## Change geom
ggplot(crimeTab) +
geom_col(aes(Var1, Freq), just=0)
## Change stat
ggplot(crimeTab) +
geom_bar(aes(Var1, Freq), just=0, stat="identity")
The following code produces a bar plot of the number of incidents in each age group broken down by the sex of the offender.
We are still using a “bar” geom and we are still mapping the age
group lower bound to the x-location of the bar and we are still using a
“count” stat to count the number of incidents in each age group and each
SEX group and we are still mapping those counts to the
heights of the bars. There is also an additional mapping of the
SEX variable to the fill colour of the bars. One additional
complication is in the calculation of the y-location of the bars. The
male bars are still anchored at zero, but the female bars start at the
top of the male bars. This is enforced by the default
"stack" position setting.
We can see that there are fewer female offenders than male offenders and it looks like there is a roughly similar pattern across age groups but it is difficult to see details because the bars representing females do not have a common baseline.
ggplot(crime) +
geom_bar(aes(Age.Lower, fill=SEX), just=0)
The following code produces three variations of the bar plot from the previous question, using “dodge”, “identity”, and “fill” positioning of the bars.
The “dodge” positioning is quite helpful for showing that the the number of incidents in each age group is a little different for females than males, with a less pronounced peak at lower ages.
The “identity” positioning is completely useless because the female bars are completely obscured by the male bars. We would have to use a semitransparent fill for the bars to be able to see the female bars as well.
The “fill” positioning allows us to see differences in the split between male and female offenders within each age group. For example, females make up a slightly greater proportion of over-80 offenders. However, this may reflect the fact that females live longer on average than males, which me wonder whether that fairer comparisons might have to take into account the populations for males and females within each age group.
ggplot(crime) +
geom_bar(aes(Age.Lower, fill=SEX), just=0, position="dodge")
ggplot(crime) +
geom_bar(aes(Age.Lower, fill=SEX), just=0, position="identity")
ggplot(crime) +
geom_bar(aes(Age.Lower, fill=SEX), just=0, position="fill")
The following code produces a plot of the number of incidents per month.
This plot uses a line geom and identity stat and maps the month and the number of incidents to the x- and y-locations of the line.
This plot shows us that there is a general downward trend in incidents over the past 8 years. On the other hand, there is some evidence of a slight up-tick in early 2022. Once again, those observations assume a relatively stable population; in reality the population has increased if anything, so the decrease in offending is likely to be a little stronger than what this plot shows.
The y-axis scale is set explicitly so that it includes zero.
crimeTrend <- as.data.frame(table(crime$Date))
crimeTrend$Date <- as.Date(crimeTrend$Var1)
ggplot(crimeTrend) +
geom_line(aes(Date, Freq)) +
coord_cartesian(ylim=c(0, NA))
The following code produces a plot of the number of incidents per month with separate lines for males and females.
This plot uses a line geom and identity stat. It maps the month and the number of incidents to the x- and y-locations of the line and the sex of the offender to the colour of the lines.
This plot shows that the number of incidents is decreasing for both males and females, though the decrease is a little slower for females (though there are fewer than half the number of incidents involving females). Both males and females show the slight up-tick in early 2022. Once again, we have not included populations here, but at least the male/female split is likely to be around 50/50, so the lower female offending message is very real.
The colour of the lines is explicitly controlled by mapping males to
blue and females to pink. The linewidth aesthetic is
set (rather than mapped) to 1 to make the lines a little
thicker than the default.
crimeTrendSex <- as.data.frame(table(crime$Date, crime$SEX))
crimeTrendSex$Date <- as.Date(crimeTrendSex$Var1)
ggplot(crimeTrendSex) +
geom_line(aes(Date, Freq, colour=Var2), linewidth=1) +
scale_colour_manual(values=c(Male=4, Female="pink")) +
coord_cartesian(ylim=c(0, NA))
The following code produces a plot of the number of incidents per month with a separate line for each age group.
This plot uses a line geom and identity stat. It maps month and the
number of incidents to the x- and y-location of the lines and maps the
age group to the colour of the lines. Note that, because
Age is numeric, ‘ggplot2’ thinks Age is a
continuous variable, so we also need to specify group=Age
to let ‘ggplot2’ know that we want a separate line for each distinct
value of Age.
This plot shows that the general pattern of decreasing incidents over time is followed by all age groups. There are some differences, but it is difficult to see which age group each line corresponds to, partly because there are so many lines and partly because it is difficult to distinguish between the colours. We cannot, for example, be sure which lines correspond to youth offending.
crimeTrendAge <- as.data.frame(table(crime$Date, crime$Age.Lower))
crimeTrendAge$Date <- as.Date(crimeTrendAge$Var1)
crimeTrendAge$Age <- as.numeric(as.character(crimeTrendAge$Var2))
ggplot(crimeTrendAge) +
geom_line(aes(Date, Freq, colour=Age, group=Age))
The following code produces a plot of the number of incidents per month with a different facet for each age group.
This plot uses a line geom and identity stat. It maps month and the number of incidents to the x- and y-location of the lines and produces a facet for each age group. The facetting is effectively mapping age group to the location in space of the facets.
This plot in some ways provides a much clearer view of the trends over time in different age groups. It is MUCH easier to tell which line corresponds to which age group, though it is harder to compare lines between age groups (particularly when the age groups are not adjacent to each other). It is clear that the decrease in incidents is greater in the 15-20 age group, although that age group also shows the clearest indication of an up-tick in early 2022. The lines for very old and very young offenders are not particularly informative, at least in terms of trends over time.
ggplot(crimeTrendAge) +
geom_line(aes(Date, Freq)) +
facet_wrap("Age")
The following code produces a bar plot of the number of incidents in each age group, with a separate facet for each year.
This plot uses a bar geom (and a count stat), with the age group mapped to the x-location of the bars, the count of incidents in each age group to the height of the bars. The facetting effectively maps the year to the location in space of the facets.
This plot shows an overall decrease in incidents over time as well as a slight change in the distribution of incidents across age groups so that the peak around younger ages is less extreme.
The year 2014 appears unusual, but this just reflects the fact that we only have 6 months of data from 2014.
crime$Year <- as.POSIXlt(crime$Date)$year + 1900
ggplot(crime) +
geom_bar(aes(Age.Lower), just=0) +
facet_wrap("Year")
The following code produces a bar plot of the number of incidents broken down by the type of incident, with a panel for each age group. The bars are ordered from most common to least common in the 15-20 age group. Some effort has gone into formatting the legend labels; otherwise the legend occupies more than half of the image! The colours scale repeats, but this is deliberate so that the colours are easily distinguishable. The bars are always in the same order so it is easy to tell, for example, which light blue bar is which.
We can see that traffic offences become more common in older age groups (once people can drive!) and drugs become more of an issue as well.
## Order crime types by their order for 15-20s
types <- table(crime$ANZSOC.Division[crime$Age.Lower == 15])
newlevels <- names(types)[order(types, decreasing=TRUE)]
## Abbreviate crime types
newlabels <- unlist(lapply(strwrap(newlevels, width=25, simplify=FALSE),
function(x) {
if (length(x) < 4)
x <- c(x, rep(" ", 4 - length(x)))
paste(x, collapse="\n")
}))
crime$ANZSOC.Division <- factor(crime$ANZSOC.Division, levels=newlevels)
ggplot(crime) +
geom_bar(aes(ANZSOC.Division, fill=ANZSOC.Division)) +
scale_x_discrete(labels=newlabels) +
scale_fill_manual(values=rep(palette(), 2), labels=newlabels) +
facet_wrap("Age.Lower", nrow=4) +
theme(axis.title=element_blank(),
axis.text=element_blank(),
axis.ticks=element_blank(),
legend.key.size=unit(3, "mm"),
legend.text=element_text(size=6))
We have produced a variety of ‘ggplot2’ plots, using different geoms, different aesthetic mappings, positioning, and facetting. The plots are quite different in terms of what they tell us about our questions of interest.
We have discovered that: youth offending is a significant component of overall offending; offending is trending downwards generally, but there is some small evidence of an up-tick in offending in early 2022; these trends are pretty similar across age groups and between males versus females; the fastest decrease is occurring in youth offending, but this is also where the evidence is strongest for an up-tick in early 2022.