The purpose of this lab is to practice identifying problems with data visualisations and to experiment with improvements based on established guidelines for more effective data visualisations.
NOTE that for some questions there may not be a unique correct answer; what is important is that you can provide a reason for what you are doing, based on the concepts and terminology used in the readings for this topic.
The data set is a CSV file, nzpolice-proceedings.csv, which was derived from “Dataset 5” of Proceedings (offender demographics) on the policedata.nz web site.
We can read the data into an R data frame with read.csv().
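As a minimal sketch, assuming the CSV file sits in the current working directory and that the data frame is called crime (the name used by the code later in this lab); the file-existence check and the csv_path variable are additions for illustration only:

```r
# Assumption: the file has been downloaded into the working directory
csv_path <- "nzpolice-proceedings.csv"

if (file.exists(csv_path)) {
    # Read the CSV into a data frame and show the first six rows
    crime <- read.csv(csv_path)
    print(head(crime))
} else {
    # Report where R looked, rather than failing with a less helpful error
    message("Could not find ", csv_path, " in ", getwd())
}
```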
  Age.Lower Police.District                                                   ANZSOC.Division
1        15          Tasman                                     Acts Intended to Cause Injury
2        20   Auckland City Abduction, Harassment and Other Related Offences Against a Person
3        40   Auckland City Abduction, Harassment and Other Related Offences Against a Person
4        10   Auckland City                                     Acts Intended to Cause Injury
5        15   Auckland City                                     Acts Intended to Cause Injury
6        15   Auckland City                                     Acts Intended to Cause Injury
     SEX       Date
1 Female 2015-12-01
2 Female 2015-12-01
3 Female 2015-12-01
4 Female 2015-12-01
5 Female 2015-12-01
6 Female 2015-12-01
The following code reorders the levels of the ANZSOC.Division factor according to the highest age group count for each type of crime. It also generates newlabels, which are line-wrapped versions of the ANZSOC.Division levels.
# Highest age-group count for each type of crime
types <- apply(table(crime$ANZSOC.Division, crime$Age.Lower), 1, max)
# Types of crime ordered from largest to smallest peak count
newlevels <- names(types)[order(types, decreasing=TRUE)]
# Wrap each level at 30 characters; pad shorter labels with blank lines up to three lines
newlabels <- unlist(lapply(strwrap(newlevels, width=30, simplify=FALSE),
                           function(x) {
                               if (length(x) < 3)
                                   x <- c(x, rep(" ", 3 - length(x)))
                               paste(x, collapse="\n")
                           }))
crime$ANZSOC.Division <- factor(crime$ANZSOC.Division, levels=newlevels)

The following code generates a table of counts for the number of incidents per age group, broken down by type of crime.
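A minimal sketch of one way to build such a table, assuming the same table() call that is used above for reordering the levels. It runs on a toy data frame (toy, a hypothetical name) holding only the six rows printed earlier, so the counts are illustrative only; the names counts and countsDF and the Count column are also assumptions.

```r
# Toy data: only the six rows shown in the head(crime) output
toy <- data.frame(
    Age.Lower = c(15, 20, 40, 10, 15, 15),
    ANZSOC.Division = c(
        "Acts Intended to Cause Injury",
        "Abduction, Harassment and Other Related Offences Against a Person",
        "Abduction, Harassment and Other Related Offences Against a Person",
        "Acts Intended to Cause Injury",
        "Acts Intended to Cause Injury",
        "Acts Intended to Cause Injury"
    )
)

# Cross-tabulate type of crime (rows) against age group (columns)
counts <- table(toy$ANZSOC.Division, toy$Age.Lower)
print(counts)

# Long format: one row per (type of crime, age group) pair,
# which is the shape that plotting functions usually expect
countsDF <- as.data.frame(counts, responseName = "Count")
names(countsDF)[1:2] <- c("ANZSOC.Division", "Age.Lower")
print(countsDF)
```

With the full data set, the same two calls would be applied to crime instead of toy.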
We have already seen the distribution of incidents across age groups (across all crimes); there is a highly skewed distribution with a peak of crimes in the 15-20 age group.
In this lab we will focus on the distribution of incidents across age groups, broken down by type of crime (ANZSOC.Division):
Run the following code to produce a bar plot of the number of incidents in each age group broken down by the type of crime.
Comment on what this data visualisation tells us about the questions of interest.
Comment on each of the following issues with this data visualisation:
You should write at least one sentence for each issue: is there a problem? how is the problem affecting our ability to answer the questions of interest?
Write R code to produce three modifications of the data visualisation from Question 1 that address overplotting by using each of the following techniques (one at a time):
position of the bars).
Describe what changes you have made in each case and comment on whether you have improved the data visualisation in each case (is it easier or harder to answer the questions of interest?).
Do NOT use facetting in this question.
Write R code to produce a modification of the data visualisation from Question 1 that attempts to improve the labelling.
Describe each change that you have made and comment on whether you have improved the data visualisation (is it easier or harder to answer the questions of interest?).
Do NOT use facetting in this question.
Write R code to produce a modification of the data visualisation from the previous question that uses small multiples.
Comment on whether this is an improvement on the previous data visualisation (is it easier or harder to answer the questions of interest?).
Write R code to produce a modification of the data visualisation from the previous question that attempts to increase the data-ink ratio.
Describe each change that you have made and comment on whether you have improved the data visualisation (is it easier or harder to answer the questions of interest?).
Write R code that uses ‘grid’ (in combination with ‘ggplot2’) to produce the data visualisation below.
NOTE that the ANZSOC.Division labels are top-left justified 2mm in from the top-left corner of each panel.
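One way to express that placement in grid units is sketched below. This is an illustration of the positioning only, not the intended solution: the viewport here is a stand-in for a plot panel, its size is arbitrary, and the names label and lab are hypothetical.

```r
library(grid)

# One of the ANZSOC.Division levels, wrapped by hand for the example
label <- "Acts Intended to\nCause Injury"

# x: 2mm in from the left edge; y: 2mm down from the top edge (1npc).
# just = c("left", "top") anchors the top-left corner of the text there.
lab <- textGrob(label,
                x = unit(2, "mm"),
                y = unit(1, "npc") - unit(2, "mm"),
                just = c("left", "top"))

# Draw the label inside a stand-in "panel" to check the placement
grid.newpage()
pushViewport(viewport(width = 0.8, height = 0.8))
grid.rect()
grid.draw(lab)
popViewport()
```

Mixing "npc" and "mm" units keeps the 2mm offset fixed however large the panel is drawn.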
Comment on whether this plot makes it easier or harder to answer the questions of interest.
Can you produce the data visualisation shown below? Does this help with answering the questions of interest at all?
NOTE that there is a vertical grid line in the middle of the 15-20 age group and the grey bars in each panel are the blue bars scaled up (or down) so that the highest bar is 0.5 the height of the panel.
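The scaling described in the NOTE can be sketched as simple arithmetic. This is a sketch under stated assumptions: all panels share a y-axis that runs from 0 to the overall maximum count (so that maximum stands in for the panel height, ignoring any axis expansion), and the data are a toy subset (the six rows printed earlier). The names toy, counts, panelHeight, divisionMax and Scaled are hypothetical.

```r
# Toy data: only the six rows shown in the head(crime) output
toy <- data.frame(
    Age.Lower = c(15, 20, 40, 10, 15, 15),
    ANZSOC.Division = c(
        "Acts Intended to Cause Injury",
        "Abduction, Harassment and Other Related Offences Against a Person",
        "Abduction, Harassment and Other Related Offences Against a Person",
        "Acts Intended to Cause Injury",
        "Acts Intended to Cause Injury",
        "Acts Intended to Cause Injury"
    )
)

# Counts per (type of crime, age group) in long format
counts <- as.data.frame(table(ANZSOC.Division = toy$ANZSOC.Division,
                              Age.Lower = toy$Age.Lower),
                        responseName = "Count")

# Assumption: the panel height equals the largest count overall
panelHeight <- max(counts$Count)

# Largest count within each type of crime, repeated for every row of that type
divisionMax <- ave(counts$Count, counts$ANZSOC.Division, FUN = max)

# Rescale so the tallest bar of each type reaches half the panel height
counts$Scaled <- counts$Count / divisionMax * 0.5 * panelHeight
print(counts)
```

After scaling, every type of crime has the same maximum, so the grey bars show the shape of each distribution regardless of how many incidents there are.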
Your submission should consist of a knitted R Markdown document, in HTML format, submitted via Canvas.
Your report should include:
Don’t forget to also complete the Canvas Quiz!
Marks will be lost for:
This work is licensed under a Creative Commons Attribution 4.0 International License.