# "Data to Insight" Videos

## Chris Wild, Department of Statistics, The University of Auckland

[Next start date for Data to Insight: 8 October 2018 at https://www.futurelearn.com/courses/data-to-insight]

In addition to being a MOOC that introduces its students to statistical data analysis, "Data to Insight" (https://www.futurelearn.com/courses/data-to-insight) prototypes a much-further-much-faster, more-data-more-quickly introductory statistics course. The acceleration should be evident from the course outline below, particularly since the material is "covered" in about 10% of the "class time" of a standard introductory course.

### Strategies used for acceleration

The most novel strategies used are:
1. Being intensely visual and driving all argument off things you can see, supplemented by metaphor;

2. Building software solutions (iNZight, VIT) that prevent "how do I get this out of the software?" from limiting the speed at which students can encounter new situations and new ideas;

3. Finding some powerful, conceptually-undemanding "extender-capabilities" that immediately open much wider horizons.

Other strategies are more obvious: limiting messages to just those most critical for real-world learning from data; stripping concepts back to their barest bones; and exploiting, feeding and reshaping primary intuition. Additionally, we use vivid images (verbal as well as visual) to make key messages linger.

This page has been created for university and college teachers of statistics and research methods, so that they can dip in and see "what is being done and how" by jumping directly to a particular video.

### Week 1: GETTING STARTED

• Introduction to the Course [3:25]
• How do we use data analysis? (Interviews with 4 statisticians working in a range of fields) [4:36]
• Data Organisation [5:15]
Review Questions ...
• What does a row in a rectangular data set correspond to?
• What does a column in a rectangular data set correspond to?
• What is an entity?
• What is a variable?
• What are the two variable types we are distinguishing between?
• Why do we make these distinctions?

### Week 2: BOOT CAMP

• Statistics Boot Camp (Introduction to the week) [1:17]
• Categorical Variables (Frequency tables and bar charts) [4:59]
Review Questions ...
• What is the basic graph that we use for displaying the data for a categorical variable?
• Why do we use this in preference to pie charts and segmented bar charts?
• What features of the data does a bar chart highlight? (What sorts of things can we easily see?)
• Ordering Categories (Alphanumeric, natural and frequency) [2:20]
Review Questions ...
• How does most software order categories by default?
• What other orderings should we consider and what are their advantages?
• Numeric variables (Dot plots & histograms; Centre (means & medians), where and why different) [5:44]
Review Questions ...
• How is a stacked dot plot constructed?
• What are the four types of things we look for in dot plots?
• What is the common name for the mean?
• How does the mean relate to the dot plot?
• What property defines the median?
• When do the mean and median tend to be very similar, and when can they be quite different?
• Why might we prefer the median to the mean for personal incomes?
• Features of Numeric variables (Article, pdf, not a video)
• Feature Spotting (Spread, shapes & oddities; Colouring by a 2nd variable) [5:26]
Review Questions ...
• What sort of "oddities" should we look for and what sort of questions should we ask when we see them?
• What is the defining property of a median? the first quartile? and third quartile?
• What information does the box plot add to the dot plot?
• What is the interquartile range and how does it relate to the box plot?
• What is an "outlier"?
• What does "positively skewed" mean?
• What does "bimodal" mean?
• Comparing groups (For a numeric variable: comparing features; Shifts & background variability; Colouring by a 3rd variable) [7:53]
Review Questions ...
• When we compared children per woman, between regions, how many variables were involved and what were they?
• What are the main things we look for when comparing groups using dot plots?
• Why, as a way of transmitting group-comparison information, is simply quoting group averages often misleading?
• What is the natural way of measuring the extent of a shift?
• When looking at the extent of a difference between group centres, what else should we be relating that to when trying to gauge how important a change is?
• When do we pay attention to changes in spread or variability?
• How should we measure a "change" in spreads?
• When do we pay attention to changes in shape?
• What can we do with sets of dot plots to explore the effect of another variable?
• Time Travel (Case study subsetting over a 3rd variable using tiled plots and movement) [3:22]
Review Questions ...
• In the last video, we investigated the effect of a third variable by using it to colour the points. What technique was introduced here for investigating the effect of a third variable?
• Why is it useful to play through a set of graphs (like a movie)?
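
The Week 2 review questions about when the mean and median differ, and why the median might be preferred for personal incomes, can be illustrated with a small sketch. The income figures below are invented for illustration and are not taken from the course; the point is only that one very large value pulls the mean far more than the median.

```python
import statistics

# Hypothetical annual incomes: seven typical earners and one very high earner.
incomes = [32_000, 35_000, 38_000, 41_000, 45_000, 48_000, 52_000, 1_500_000]

mean_income = statistics.mean(incomes)      # the "average": total divided by count
median_income = statistics.median(incomes)  # middle value: half below, half above

print(f"mean   = {mean_income:,.0f}")
print(f"median = {median_income:,.0f}")
```

Here the mean is several times larger than every income except the largest, while the median stays close to what a typical person in the list earns.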

### Week 3: RELATIONSHIPS

• Introduction to Relationships (Why we care; Outcome & predictor variables) [2:52]
Review Questions ...
• Why do people care about relationships between variables?
• We distinguished between an outcome variable and predictor variables. What is an outcome variable and what are predictor variables?
• Relationships between Categorical Variables (Exploration using separate bar charts and side-by-side bar charts) [6:22]
Review Questions ...
• We distinguished between an outcome variable and predictor variables. What is an outcome variable and what are predictor variables?
• What are the strengths and weaknesses of using separate bar graphs of an outcome variable for each predictor group?
• What are the strengths and weaknesses of using side-by-side bar graphs to show the relationship between an outcome variable and a predictor variable?
• Which type of graph should you look at?
• In terms of separate bar graphs for each predictor group, what would you expect to see if there was no relationship between the outcome variable and the predictor variable?
• What are you looking for when you use side by side bar charts?
• Changes across subgroups (Exploring effects of a 3rd and 4th variable on a relationship via subsetting, tiling & movement) [4:52]
Review Questions ...
• What basic ideas allow us to explore how a relationship changes with a 3rd (and perhaps a 4th) variable?
• Why is stepping through a set of graphs (like playing a movie) useful?
• Relationships between numeric variables (Scatter plots; Trend, scatter & outliers; Clustering) [5:17]
Review Questions ...
• What is the standard way of displaying the relationship between 2 numeric variables?
• What sort of variable is plotted against the vertical scale and what against the horizontal scale?
• It is often useful to think of patterns in such plots in terms of 3 components. What are they?
• What type of question should separate clusters of dots suggest to us?
• Trend, Scatter & Outliers (Examples; Prediction & prediction intervals; Training the eye) [6:42]
Review Questions ...
• How do we use a scatter plot to predict a new outcome at a given value of the predictor variable?
• How can we find a range of values that is likely to contain the new outcome?
• When will the prediction intervals/ranges be wide and when will they be narrow?
• What must we assume when we use existing data to predict a new outcome?
• In what type of region is it particularly dangerous to make predictions?
• How can we check visually whether a trend line or curve is positioned properly?
• What should the trend value corresponding to a particular predictor-variable value (the value 5 was used in the video) be telling us about?

### Week 4: MORE RELATIONSHIPS

• More Relationships (Introduction to week's coverage) [1:32]
• Lines, curves and smoothers (Lines, curves & smoothers; Least squares) [4:01]
Review Questions ...
• What shapes can be captured by each of linear, quadratic and cubic trend curves?
• What advantage does a smoother have over quadratic or cubic curves?
• Why shouldn’t we take the behaviour of flexible trend curves at the right and left-hand edges of the plot too seriously?
• If you ask iNZight, or most standard software, to put a curve of a particular type onto the scatterplot (e.g. a quadratic), how does it choose which curve of this type to draw?
• Overcoming Perceptual Problems (Problems with large datasets; Overprinting; Jitter; Varying transparency & point size; Running quantiles; Tile-density plots) [7:05]
Review Questions ...
• What is overprinting and why does it cause problems for us?
• What is jittering and how can the use of jittering help us?
• What is transparency and how does transparency help us?
• Why should we vary the transparency setting?
• In large data sets, what is emphasized visually by a low transparency setting? by a high transparency setting?
• What are running quantiles and why are they useful for large data sets?
• Why might we want to look at a version of the plot that uses very small plotting symbols?
• Why does iNZight switch its default from drawing normal scatterplots to drawing tile density plots when the data set gets large?
• Diving deeper with more variables (Additional variables using colour, subsetting & movement; different trends per group) [5:20]
Review Questions ...
• What are we looking for when we colour by a (third) numeric variable?
• What are we looking for when we colour by a (third) categorical variable?
• What are we looking for when we subset by a (third) variable?
• Our Changing Health and Wealth (Case study using up to 6 variables at once by employing colour, size, subsetting, matrices of tiled plots and movement) [5:41]

### Week 5: WHY WHAT WE SEE IS NEVER QUITE THE WAY IT REALLY IS

• Why what I see is never quite the way it really is (Intro to week; Facts & artefacts) [3:04]
Review Questions ...
• What is the first law of data analysis?
• Can sophisticated data analysis turn bad data into reliable conclusions?
• In terms of the patterns we see in data, what is the difference between facts and artefacts?
• Bad Data ("Measurement" issues/ biases; "Selection" biases; missingness) [7:02]
Review Questions ...
• What are artefacts?
• What are the two main ways that systematic biases get into data?
• Why can missing values cause biases?
• What is the best way investigators can protect themselves against bad data?
• Causes and Confounding Variables, Part I (Confounding & adjustment) [6:38]
Review Questions ...
• When is a variable a cause of changes in the outcome?
• What is an observational study?
• When do we have positive association between variables? negative association?
• When can a causal conclusion be justified on the basis of observational data alone?
• What is a randomised experiment?
• What is a confounder?
• What is a lurking variable?
• How can we adjust for a lack of balance on a known confounder?
• Causes and Confounding Variables, Part II (Confounding & adjustment) [3:57]
Review Questions ...
• What is a lurking variable?
• We have methods for adjusting for confounders, so why can we still not reliably draw causal conclusions from observational data alone?
• Random Error, Part I (Random variation/error; effect of sample size) [7:06]
Review Questions ...
• Do the problems caused by bad measurement systems and biased selection mechanisms go away when we get huge amounts of data?
• Do the problems caused by confounding go away when we get huge amounts of data?
• Do the problems caused by random error go away when we get huge amounts of data?
• What is a sampling error?
• Do we ever know how big the actual sampling error we incurred was? What do we try to do about that?
• Random Error, Part II (Random variation/error; effect of sample size; biases) [6:05]
Review Questions ...
• What effect does sample size have on sampling error?
• For what two reasons are non-random selection mechanisms worse than random selection mechanisms?
• What were the 5 "take home messages" from this movie?

### Week 6: ESTIMATION WITH CONFIDENCE

• Estimation With Confidence (Motivation, background, intuition) [4:52]
Review Questions ...
• What is the most reliable way we know of obtaining data about populations without misleading biases? Why is this method not perfect?
• What happens whenever we use data from a sample to estimate a population quantity?
• What do we do about that?
• If you have an estimate of a population quantity and know its margin of error, within what limits should the true value lie?
• Why does estimating the extent of sampling error in a practical application seem like an impossible problem?
• Confidence Intervals from Bootstrap Resampling (Bootstrap idea; Comparing BS re-sampling variation & sampling variation; Construction of bootstrap CI) [8:27]
Review Questions ...
• What is bootstrap re-sampling? How do we generate a bootstrap-resample?
• What would happen if we took our re-samples using the ordinary way of sampling (without replacement)?
• What do we notice from the animations shown about bootstrap re-sampling error and sampling error?
• Why don’t we just use sampling error to make our confidence intervals? What do we do instead?
• Does the development in this video demonstrate that bootstrap confidence intervals will work as a way of addressing the “how wrong could we be” problem?
• Do Bootstrap Confidence Intervals work? (Simulation coverage) [5:41]
Review Questions ...
• What is the basic idea of how a bootstrap confidence interval is constructed to capture the true population value of some quantity (e.g. a mean, a median, a percentage, ..)?
• What do we do to find out whether a method for constructing confidence intervals works?
• What do we mean by "our interval has covered the true value"?
• What do we know about bootstrapping as a way of constructing confidence intervals?
• What is meant when we say that a particular interval is a "95% confidence interval"?
• Numeric Outcomes (Use of CIs and comparison intervals in anova-type situation) [5:41]
Review Questions ...
• What is the basic idea of how we construct a bootstrap confidence interval to capture the true population value of some quantity?
• Why don’t we use the confidence intervals drawn on our plots to make comparisons between groups?
• What is the difference between a confidence interval and a comparison interval?
• How do we interpret non-overlapping comparison intervals? overlapping comparison intervals?
• Can we rely completely on the intervals drawn on our plots? What are they there for? When and how do we get more reliable answers?
• Sketch 2 diagrams showing how you can “read” a confidence interval for a difference between 2 groups using comparison intervals. (A diagram for the non-overlapping case, and one for the overlapping case.)
• Categorical Outcomes (Use of CIs and comparison intervals; [At end: bootstrapping trends on scatter plot]) [5:39]
Review Questions ...
• How is a "nest of trend curves" obtained for putting onto a scatterplot?
• How do we interpret such a "nest of trend curves"?
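
The bootstrap idea behind this week's videos (re-sample the observed data with replacement, recompute the estimate each time, and use the spread of the re-sampled estimates to build an interval) can be sketched as below. This is a minimal illustration, not the method implemented in iNZight or VIT: the function name `bootstrap_ci`, the percentile construction of the interval, the number of re-samples and the data are all assumptions made for the example.

```python
import random
import statistics

def bootstrap_ci(sample, stat=statistics.mean, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for `stat` (an illustrative sketch)."""
    rng = random.Random(seed)
    n = len(sample)
    # Each re-sample has the same size as the original and is drawn WITH replacement.
    estimates = sorted(stat(rng.choices(sample, k=n)) for _ in range(n_resamples))
    alpha = (1 - level) / 2
    lo_idx = round(alpha * n_resamples)
    hi_idx = round((1 - alpha) * n_resamples) - 1
    return estimates[lo_idx], estimates[hi_idx]

# Hypothetical sample of 12 measurements.
sample = [12, 15, 9, 20, 14, 17, 11, 13, 16, 18, 10, 19]
low, high = bootstrap_ci(sample)
print(f"sample mean = {statistics.mean(sample):.2f}")
print(f"approximate 95% bootstrap interval: ({low:.2f}, {high:.2f})")
```

Sampling without replacement at the full sample size would simply reproduce the original data every time, which is why the re-samples are drawn with replacement. Passing `stat=statistics.median` gives an interval for the median instead.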

### Week 7: RANDOMISED EXPERIMENTS AND STATISTICAL TESTS

• Randomised Experiments [5:24]
Review Questions ...
• Why do we want to do randomised experiments? What is the point of them?
• What are the two elements we need to have in place to be able to have a fair test?
• What are treatments?
• What is a treatment group?
• What is the point of random assignment to treatment groups?
• Why is it unethical to do some experiments?
• What does blocking mean? What principle governs the use of blocking and randomisation in designing an experiment?
• Randomisation Variation (Effects of pure randomisation variation) [6:01]
Review Questions ...
• What is randomisation variation and why is it a problem?
• How can we reduce the levels of randomisation variation?
• The Randomisation Test [7:45]
Review Questions ...
• When we look at a plot of experimental data that compares two treatment groups, why is it not always just obvious which treatment is better? What question goes through our minds?
• What is the basic idea governing when we will conclude that there is a real treatment difference?
• Try to bring to mind the basic steps of the VIT visualisation of the randomisation test (and visualise the sorts of pictures that were being generated at each step).
• What is the formal name for the tail proportions the visualisations produce?
• Percentages and Multiple groups (Applied to proportions/percentages in 2- and 3-group situations) [7:48]
• Out of the Fog and into Power (Power, Type I error and "sample size") [8:27]
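
The logic of the randomisation test can be sketched in a few lines: repeatedly re-assign the observed values to the two groups at random, record the difference each time, and see how often chance alone produces a difference at least as large as the one observed. This is an illustrative sketch only and does not reproduce the VIT visualisation; the function name `randomisation_test`, the use of a difference in means with a two-sided tail proportion, and the data are assumptions made for the example.

```python
import random
import statistics

def randomisation_test(group_a, group_b, n_shuffles=5000, seed=0):
    """Return the observed difference in means and the tail proportion."""
    rng = random.Random(seed)
    observed = statistics.mean(group_a) - statistics.mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        # Randomly re-assign the group labels, keeping the group sizes fixed.
        rng.shuffle(pooled)
        diff = statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            at_least_as_extreme += 1
    return observed, at_least_as_extreme / n_shuffles

# Hypothetical outcomes for two treatment groups.
treatment = [1, 2, 3, 4, 5, 6, 7, 8]
control = [11, 12, 13, 14, 15, 16, 17, 18]
observed_diff, tail_proportion = randomisation_test(treatment, control)
print(f"observed difference = {observed_diff}")
print(f"tail proportion     = {tail_proportion}")
```

A small tail proportion says that random assignment alone would rarely produce a difference this large, which is the basis for concluding that there is a real treatment difference.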

### Week 8: TIME SERIES

• Introducing Time Series [4:38]
Review Questions ...
• What is time-series data?
• Why are people interested in time-series data?
• What is quarterly data?
• Why do people plot time-series data with points joined up by lines instead of using normal scatterplots?
• What, besides trends, is another form of pattern that is very common in time-series data?
• Seasonal Decomposition & Forecasting, Part I [6:01]
Review Questions ...
• What is the basic idea behind an additive model (or additive seasonal decomposition)?
• Why do we want to find stable structures in our time series?
• Seasonal Decomposition & Forecasting, Part II [5:12]
Review Questions ...
• What is the basic idea behind an additive model (or additive seasonal decomposition)?
• What is the basic idea behind a multiplicative model (or multiplicative seasonal decomposition)?
• Why do we want to find stable structures in our time series?
• When is an additive model likely to work better and when is a multiplicative model likely to work better?
• When we have an additive decomposition, how do we adjust the trend value to incorporate the seasonal effect at each time point?
• When we have a multiplicative decomposition, how do we adjust the trend value to incorporate the seasonal effect at each time point?
• What is the no-change value when we are making additive adjustments?
• What is the no-change value when we are making multiplicative adjustments?
• When does the Holt-Winters forecasting method usually work well?
• Can time-series forecasting predict the future reliably? Why?
• Comparing Series [5:16]
Review Questions ...
• When we are plotting several related series so that we can compare the patterns in them, what are the strengths and the weaknesses of a plot that puts all of the series on the same graph?
• When we are plotting several related series so that we can compare the patterns in them, what are the strengths and the weaknesses of a plot that puts all of the series on their own separate graphs?
• What types of feature of each series can we compare using the iNZight graphs for comparing series?
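
The Week 8 review questions on adjusting a trend value for the seasonal effect can be made concrete with a short sketch. In an additive decomposition the seasonal effect is added to the trend, so an effect of 0 means no change; in a multiplicative decomposition the trend is multiplied by a seasonal factor, so a factor of 1 means no change. The function names and the quarterly numbers below are hypothetical and chosen only for illustration.

```python
def additive_adjust(trend, seasonal_effects):
    """Trend plus seasonal effect at each time point (0 means no change)."""
    return [t + s for t, s in zip(trend, seasonal_effects)]

def multiplicative_adjust(trend, seasonal_factors):
    """Trend times seasonal factor at each time point (1 means no change)."""
    return [t * f for t, f in zip(trend, seasonal_factors)]

# Hypothetical trend values for four quarters.
trend = [100, 102, 104, 106]
effects = [-5, 0, 10, -5]          # additive seasonal effects, in the data's units
factors = [0.95, 1.0, 1.10, 0.95]  # multiplicative seasonal factors

additive_values = additive_adjust(trend, effects)
multiplicative_values = multiplicative_adjust(trend, factors)
print(additive_values)
print(multiplicative_values)
```

In the second quarter both versions return the trend value unchanged, because the additive effect is 0 and the multiplicative factor is 1.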