The Data Set

In this lab, we have more code than usual for setting up the data set. This is because we need data about the map we want to draw (the outline of New Zealand Police regions) in addition to the crime data that we have been using throughout the course. We then need to create data sets that combine the map data and the crime data.

Crime data

The data set is a CSV file, nzpolice-proceedings.csv, which was derived from “Dataset 5” of Proceedings (offender demographics) on the policedata.nz web site.

We can read the data into an R data frame with read.csv().

crime <- read.csv("nzpolice-proceedings.csv")

The following code generates a column of real dates, generates a Year column, and makes a tweak to the Police.District column (which will be useful later when we merge this crime data with the map outline data).

crime$Month <- as.Date(crime$Date)
crime$Year <- as.POSIXlt(crime$Date)$year + 1900
crime$Police.District <- gsub("Of", "of", crime$Police.District)

For this lab we will drop the year 2014 (for which we only have partial data).

crime <- subset(crime, Year >= 2015)

The following code generates total counts per Police District.

library(dplyr)
crimePerDistrict <- count(crime, Police.District)

The following code generates total counts for each year per Police District.

crimeYearPerDistrict <- count(crime, Police.District, Year)

The following code generates total counts for each type of crime per Police District.

crimeTypePerDistrict <- count(crime, Police.District, ANZSOC.Division)

Map data

Map data for the Police Districts was obtained from (Koordinates)[https://koordinates.com/layer/105480-nz-police-district-boundaries-29-april-2021/].

library(sf)
## districts <- st_read("SHP/nz-police-district-boundaries-29-april-2021.shp")
## districts <- st_read("CSV/nz-police-district-boundaries-29-april-2021.csv")
districts <- 
    st_as_sf(read.csv("CSV/nz-police-district-boundaries-29-april-2021.csv"),
             wkt=1)
st_crs(districts) <- 2193

The following code adds centroids per region (as X and Y).

centroids <- st_coordinates(st_centroid(st_geometry(districts)))
districts$X <- centroids[,1]
districts$Y <- centroids[,2]

Combined data

The following code combines the map data with the different crime counts.

crimeDistricts <- inner_join(districts, crimePerDistrict,
                             by=join_by(DISTRICT_N == Police.District))

crimeTypeDistricts <- inner_join(districts, crimeTypePerDistrict,
                                 by=join_by(DISTRICT_N == Police.District))

crimeYearDistricts <- inner_join(districts, crimeYearPerDistrict,
                                 by=join_by(DISTRICT_N == Police.District))

Questions of Interest

Each data visualisation in this lab will address at least one of the following questions:

Data Visualisations

library(ggplot2)
  1. The following code produces a map of the New Zealand Police Districts, with a label for each district. The only tricky bit is the use of hjust in geom_sf_text() to shift the labels out of each others way.

    ggplot(districts) +
        geom_sf() +
        coord_sf() +
        geom_sf_text(aes(label=DISTRICT_N), 
                     hjust=c(0, .5, .5, .5, 1, .5,
                             .5, 1, .5, .5, .5, .5))

  2. The following code produces a map of the New Zealand Police Districts, with each region coloured to represent the number of incidents.

    We can clearly see which regions had more incidents than others, though detailed comparisons are not easy because colour is not a good visual channel for representing count data. On the other hand, it is easy to make a north-versus-south comparison because location in space is a good visual channel for representing categories (like “North” and “South”) and map conventions like north-equals-up are very familiar to us.

    ggplot(crimeDistricts) +
        geom_sf(aes(fill=n)) +
        coord_sf()

    The major substantive issue with this map is that we have counts of crimes, not rate of crimes. This means that a region that has more crime could just be a region that has a higher population. I could not find population per district data - the data in the Youth Justice Indicators Report is for different age groups and only for specific age groups. To give an idea of how much of an issue this is, here are the differences in populations for the districts in 2020/21 for “Youth” and the region with the highest number of incidents (Counties/Manukau) does have the highest population (for this age group):

    read.csv("district-population.csv")
               District X2020.21
    1         Northland    9,232
    2         Waitematā   30,694
    3     Auckland City   21,459
    4  Counties Manukau   34,373
    5           Waikato   21,138
    6     Bay Of Plenty   20,026
    7           Eastern   11,953
    8           Central   19,640
    9        Wellington   26,625
    10           Tasman    8,827
    11       Canterbury   30,256
    12         Southern   17,866
  3. The following code produces a map of the New Zealand Police Districts, with a dot in each region, sized to represent the number of incidents.

    We can again clearly see which regions had the most incidents and, because we are using area rather than colour, it is easier to get a sense of the sizes of the differences between regions, which do not seem all that great (even though, according to the legend, some regions have twice as many incidents as others). Area is still not the best visual channel for representing count data.

    ggplot(crimeDistricts) +
        geom_sf() +
        geom_point(aes(X, Y, size=n), colour=rgb(0,0,0,.5)) +
        scale_size_area() +
        coord_sf()    

  4. The following code produces an animated plot of the number of incidents in each Police District with one frame per year.

    This animation gives an indication that crime is decreasing over time in most regions, though it appears that drops in crime are happening mostly in regions that start with higher crime.

    library(magick)
    drawFrame <- function(i) {
        print(
        ggplot(subset(crimeYearDistricts, Year == i)) +
            geom_sf(aes(fill=n)) +
            coord_sf() +
            scale_fill_gradient(limits=c(3000, 14000)) +
            labs(title=paste0("Year: ", i))
        )
    }
    years <- 2015:2022
    img <- image_graph()
    for (i in years) {
        drawFrame(i)
    }
    dev.off()
    anim <- image_animate(img, fps=1)
    image_write(anim, "anim.gif")

  5. The following code produces a line plot of the number of incidents in each region over time. It is MUCH easier to see in this plot that the decrease in crime over time actually happen in all districts. However, it is hard to mentally map districts to geographic location. For example, it is very hard to make any assessment of differences between southern districts versus northern districts.

    ggplot(crimeYearPerDistrict) +
        geom_line(aes(Year, n, colour=Police.District))

  6. The following code produces a facetted plot of regions with one facet for each type of crime.

    We can really only see more common vs less common types of crime. If we look hard, there is a suggestion that the Bay of Plenty is a bit of a hot spot for lots of crime types, but for less common types of crime the entire map is too dark to see any differences between regions.

    It might help to have a “free” colour scale in this data visualisation, though that might require quite a lot more code (and some information on arranging multiple plots that we will only look at in later topics).

    ggplot(crimeTypeDistricts) +
        geom_sf(aes(fill=n)) +
        coord_sf() +
        facet_wrap("ANZSOC.Division", nrow=3) +
        theme(legend.position="none",
              axis.text=element_blank(),
              axis.ticks=element_blank(),
              strip.text=element_text(hjust=0))

  7. The following code produces a facetted bar plot of the number of incidents in each region for each type of crime.

    This code just sets the order of the levels of the Police.District factor so that the bars are ordered from smallest to largest within the “Acts Intended to Cause Injury” crime type. The x-axis is also allowed to vary between panels so that we can see the distribution for less common crimes.

    sub <- subset(crimeTypePerDistrict, 
                  ANZSOC.Division == "Acts Intended to Cause Injury")
    regionOrder <- sub$Police.District[order(sub$n, decreasing=TRUE)]
    crime$Police.District <- 
        factor(crime$Police.District, levels=regionOrder)

    This data visualisation makes it MUCH easier to see subtle differences in the distribution of crimes across regions between crime types. We can see that the region with the most crime is NOT the same for all crime types (and that is impossible to see in the previous facetted map). This is because the number of incidents is represented by the length of bars rather than by colour (i.e., the visual channel is better). The only problem with this data visualisation compared to the map is the geographic mapping of regions, but in this case the maps do not really show us anything, so this is not a great loss.

    ggplot(crime) +
        geom_bar(aes(Police.District)) +
        coord_flip() +
        facet_wrap("ANZSOC.Division", nrow=3, scales="free_x") +
        theme(axis.text.y=element_text(size=6),
              axis.title.y=element_blank(),
              strip.text=element_text(hjust=0))

Challenge

  1. The following code produces a map of Police Districts with an embedded line plot per district.

    The code is a bit long because it makes some fine tuning of the positioning of the line plots so that they do no overlap, but otherwise the idea is relatively straightforward: push a viewport for each region and draw a rectangle, a text label, and a set of lines, where the lines are based on the number of incidents over time for that region.

    This gives potentially the best of both worlds. We get to clearly see the trends in number of incidents over time AND we can associate the different trends with their geographic locations. We are using location (or location in space) over and over again to represent the location of regions within New Zealand, the number of incidents, and time.

    One problem is the unaligned x- and y-axes for comparison between panels. This could be slightly improved by placing the panels on a more regular grid, while retaining their approximate geographic locations.

    library(gggrid)
    subplot <- function(data, coords) {
        sub <- subset(crimeYearPerDistrict, Police.District == data$name)
        xoff <- unit(0, "npc")
        yoff <- unit(0, "npc")
        xlab <- 0
        hjust <- "left"
        if (data$name == "Northland") {
            xoff <- unit(-4, "mm")
            yoff <- unit(5, "mm")
        }
        if (data$name == "Waitemata") {
            yoff <- unit(3, "mm")
        }
        if (data$name ==  "Auckland City") {
            xoff <- unit(7, "mm")
        }
        if (data$name ==  "Waikato") {
            xoff <- unit(-4, "mm")
            yoff <- unit(-2, "mm")
        }
        if (data$name == "Counties/Manukau") {
            xoff <- unit(-12, "mm")
            xlab <- 1
            hjust <- "right"
        }
        if (data$name ==  "Eastern") {
            xoff <- unit(2, "mm")
            yoff <- unit(-6, "mm")
        }
        if (data$name ==  "Wellington") {
            yoff <- unit(-1, "mm")
        }
        if (data$name == "Tasman") {
            yoff <- unit(2, "mm")
        }
        vp <- viewport(unit(coords$x, "npc") + xoff, 
                       unit(coords$y, "npc") + yoff, 
                       width=unit(1, "cm"), height=unit(1, "cm"),
                       xscale=range(crimeYearPerDistrict$Year), 
                       yscale=range(crimeYearPerDistrict$n))
        gTree(children=gList(rectGrob(gp=gpar(fill="white")),
                             textGrob(data$name, x=xlab, y=1, 
                                      just=c(hjust, "bottom"),
                                      gp=gpar(fontsize=8)),
                             linesGrob(sub$Year, sub$n, 
                                       default.units="native")), 
              vp=vp)
    }
    ggplot(crimeDistricts) +
        geom_sf() +
        grid_group(subplot, aes(X, Y, group=DISTRICT_N, name=DISTRICT_N)) +
        coord_sf()    

Summary

We have produced some basic maps of Police Districts and attempted to add representations of crime data per District. We have demonstrated that, because the map uses up the x-location and y-location visual channels, it can be difficult to effectively represent per-region data, especially more than one variable. However, when it works, having the geographic locations of the Districts represented implicitly on a map is very effective. Unfortunately, for this particular data set, having numbers of crimes rather than crime rates (per 1,000 population) rather undermines our efforts because we may just be seeing large numbers of crimes in regions with larger populations.