Chris Wild, Department of Statistics, The University of Auckland

Videos from "Data to Insight: An Introduction to Data Analysis"

("Data to Insight" is a free online course on FutureLearn; next start date: 13 Apr 2020)

Week 2: BOOT CAMP

• Statistics Boot Camp (Introduction to the week) [1:17]
• Categorical Variables (Frequency tables and bar charts) [4:59]
Review Questions ...
• What is the basic graph that we use for displaying the data for a categorical variable?
• Why do we use this in preference to pie charts and segmented bar charts?
• What features of the data does a bar chart highlight? (What sorts of things can we easily see?)
• Ordering Categories (Alphanumeric, natural and frequency) [2:20]
Review Questions ...
• How does most software order categories by default?
• What other orderings should we consider and what are their advantages?
• Numeric variables (Dot plots & histograms; Centre (means & medians), where and why different) [5:44]
Review Questions ...
• How is a stacked dot plot constructed?
• What are the four types of things we look for in dot plots?
• What is the common name for the mean?
• How does the mean relate to the dot plot?
• What property defines the median?
• When do the mean and median tend to be very similar, and when can they be quite different?
• Why might we prefer the median to the mean for personal incomes?
• Features of Numeric variables (Article, pdf, not a video)
• Feature Spotting (Spread, shapes & oddities; Colouring by a 2nd variable) [5:26]
Review Questions ...
• What sort of "oddities" should we look for and what sort of questions should we ask when we see them?
• What is the defining property of the median? the first quartile? the third quartile?
• What information does the box plot add to the dot plot?
• What is the interquartile range and how does it relate to the box plot?
• What is an "outlier"?
• What does "positively skewed" mean?
• What does "bimodal" mean?
• Comparing groups (For a numeric variable: Comparing features; Shifts & background variability; Colouring by a 3rd variable) [7:53]
Review Questions ...
• When we compared children per woman, between regions, how many variables were involved and what were they?
• What are the main things we look for when comparing groups using dot plots?
• Why, as a way of transmitting group-comparison information, is simply quoting group averages often misleading?
• What is the natural way of measuring the extent of a shift?
• When looking at the extent of a difference between group centres, what else should we be relating that to when trying to gauge how important a change is?
• When do we pay attention to changes in spread or variability?
• How should we measure a "change" in spreads?
• When do we pay attention to changes in shape?
• What can we do with sets of dot plots to explore the effect of another variable?
• Time Travel (Case study subsetting over a 3rd variable using tiled plots and movement) [3:22]
Review Questions ...
• In the last video, we investigated the effect of a third variable by using it to colour the points. What technique was introduced here for investigating the effect of a third variable?
• Why is it useful to play through a set of graphs (like a movie)?

Week 3: RELATIONSHIPS

• Introduction to Relationships (Why we care; Outcome & predictor variables) [2:52]
Review Questions ...
• Why do people care about relationships between variables?
• We distinguished between an outcome variable and predictor variables. What is an outcome variable and what are predictor variables?
• Relationships between Categorical Variables (Exploration using separate bar charts and side-by-side bar charts) [6:22]
Review Questions ...
• We distinguished between an outcome variable and predictor variables. What is an outcome variable and what are predictor variables?
• What are the strengths and weaknesses of using separate bar graphs of an outcome variable for each predictor group?
• What are the strengths and weaknesses of using side-by-side bar graphs to show the relationship between an outcome variable and a predictor variable?
• Which type of graph should you look at?
• In terms of separate bar graphs for each predictor group, what would you expect to see if there was no relationship between the outcome variable and the predictor variable?
• What are you looking for when you use side by side bar charts?
• Changes across subgroups (Exploring effects of a 3rd and 4th variable on a relationship via subsetting, tiling & movement) [4:52]
Review Questions ...
• What basic ideas allow us to explore how a relationship changes with a 3rd (and perhaps a 4th) variable?
• Why is stepping through a set of graphs (like playing a movie) useful?
• Relationships between numeric variables (Scatter plots; Trend, scatter & outliers; Clustering) [5:17]
Review Questions ...
• What is the standard way of displaying the relationship between 2 numeric variables?
• What sort of variable is plotted against the vertical scale and what against the horizontal scale?
• It is often useful to think of patterns in such plots in terms of 3 components. What are they?
• What type of question should separate clusters of dots suggest to us?
• Trend, Scatter & Outliers (Examples; Prediction & prediction intervals; Training the eye) [6:42]
Review Questions ...
• How do we use a scatter plot to predict a new outcome at a given value of the predictor variable?
• How can we find a range of values that is likely to contain the new outcome?
• When will the prediction intervals/ranges be wide and when will they be narrow?
• What must we assume when we use existing data to predict a new outcome?
• In what type of region is it particularly dangerous to make predictions?
• How can we check visually whether a trend line or curve is positioned properly?
• What should the trend value corresponding to a particular predictor-variable value (the value 5 was used in the video) be telling us about?

Week 4: MORE RELATIONSHIPS

• More Relationships (Introduction to week's coverage) [1:32]
• Lines, curves and smoothers (Lines, curves & smoothers; Least squares) [4:01]
Review Questions ...
• What shapes can be captured by each of linear, quadratic and cubic trend curves?
• What advantage does a smoother have over quadratic or cubic curves?
• Why shouldn’t we take the behaviour of flexible trend curves at the right and left-hand edges of the plot too seriously?
• If you ask iNZight, or most standard software, to put a curve of a particular type onto the scatterplot (e.g. a quadratic), how does it choose which curve of this type to draw?
• Overcoming Perceptual Problems (Problems with large datasets; Overprinting; Jitter; Varying transparency & point size; Running quantiles; Tile-density plots) [7:05]
Review Questions ...
• What is overprinting and why does it cause problems for us?
• What is jittering and how can the use of jittering help us?
• What is transparency and how does transparency help us?
• Why should we vary the transparency setting?
• In large data sets, what is emphasized visually by a low transparency setting? by a high transparency setting?
• What are running quantiles and why are they useful for large data sets?
• Why might we want to look at a version of the plot that uses very small plotting symbols?
• Why does iNZight switch its default from drawing normal scatterplots to drawing tile density plots when the data set gets large?
• Diving deeper with more variables (Additional variables using colour, subsetting & movement; different trends per group) [5:20]
Review Questions ...
• What are we looking for when we colour by a (third) numeric variable?
• What are we looking for when we colour by a (third) categorical variable?
• What are we looking for when we subset by a (third) variable?
• Our Changing Health and Wealth (Case study using up to 6 variables at once by employing colour, size, subsetting, matrices of tiled plots and movement) [5:41]

Week 5: WHY WHAT WE SEE IS NEVER QUITE THE WAY IT REALLY IS

• Why what I see is never quite the way it really is (Intro to week; Facts & artefacts) [3:04]
Review Questions ...
• What is the first law of data analysis?
• Can sophisticated data analysis turn bad data into reliable conclusions?
• In terms of the patterns we see in data, what is the difference between facts and artefacts?
• Bad Data ("Measurement" issues/ biases; "Selection" biases; missingness) [7:02]
Review Questions ...
• What are artefacts?
• What are the two main ways that systematic biases get into data?
• Why can missing values cause biases?
• What is the best way investigators can protect themselves against bad data?
• Measurement, validity and reliability (Article, pdf, not a video)
• Selection biases (Article, pdf, not a video)
• Data cleaning (Article, pdf, not a video)
• Causes and Confounding Variables, Part I (Confounding & adjustment) [6:38]
Review Questions ...
• When is a variable a cause of changes in the outcome?
• What is an observational study?
• When do we have positive association between variables? negative association?
• When can a causal conclusion be justified on the basis of observational data alone?
• What is a randomised experiment?
• What is a confounder?
• What is a lurking variable?
• How can we adjust for a lack of balance on a known confounder?
• Causes and Confounding Variables, Part II (Confounding & adjustment) [3:57]
Review Questions ...
• What is a lurking variable?
• We have methods for adjusting for confounders, so why can we still not reliably draw causal conclusions from observational data alone?
• Random Error, Part I (Random variation/error; effect of sample size) [7:06]
Review Questions ...
• Do the problems caused by bad measurement systems and biased selection mechanisms go away when we get huge amounts of data?
• Do the problems caused by confounding go away when we get huge amounts of data?
• Do the problems caused by random error go away when we get huge amounts of data?
• What is a sampling error?
• Do we ever know how big the actual sampling error we incurred was? What do we try to do about that?
• Random Error, Part II (Random variation/error; effect of sample size; biases) [6:05]
Review Questions ...
• What effect does sample size have on sampling error?
• For what two reasons are non-random selection mechanisms worse than random selection mechanisms?
• What were the 5 "take home messages" from this movie?

Week 6: ESTIMATION WITH CONFIDENCE

• Estimation With Confidence (Motivation, background, intuition) [4:52]
Review Questions ...
• What is the most reliable way we know of obtaining data about populations without misleading biases? Why is this method not perfect?
• What happens whenever we use data from a sample to estimate a population quantity?
• What do we do about that?
• If you have an estimate of a population quantity and know its margin of error, within what limits should the true value lie?
• Why does estimating the extent of sampling error in a practical application seem like an impossible problem?
• Confidence Intervals from Bootstrap Resampling (Bootstrap idea; Comparing BS re-sampling variation & sampling variation; Construction of bootstrap CI) [8:27]
Review Questions ...
• What is bootstrap re-sampling? How do we generate a bootstrap-resample?
• What would happen if we took our re-samples using the ordinary way of sampling (without replacement)?
• What do we notice from the animations shown about bootstrap re-sampling error and sampling error?
• Why don’t we just use sampling error to make our confidence intervals? What do we do instead?
• Does the development in this video demonstrate that bootstrap confidence intervals will work as a way of addressing the “how wrong could we be” problem?
• Do Bootstrap Confidence Intervals work? (Simulation coverage) [5:41]
Review Questions ...
• What is the basic idea of how a bootstrap confidence interval is constructed to capture the true population value of some quantity (e.g. a mean, a median, a percentage, ..)?
• What do we do to find out whether a method for constructing confidence intervals works?
• What do we mean by "our interval has covered the true value"?
• What do we know about bootstrapping as a way of constructing confidence intervals?
• What is meant when we say that a particular interval is a "95% confidence interval"?
• Numeric Outcomes (Use of CIs and comparison intervals in anova-type situation) [5:41]
Review Questions ...
• What is the basic idea of how we construct a bootstrap confidence interval to capture the true population value of some quantity?
• Why don’t we use the confidence intervals drawn on our plots to make comparisons between groups?
• What is the difference between a confidence interval and a comparison interval?
• How do we interpret non-overlapping comparison intervals? overlapping comparison intervals?
• Can we rely completely on the intervals drawn on our plots? What are they there for? When and how do we get more reliable answers?
• Sketch 2 diagrams showing how you can “read” a confidence interval for a difference between 2 groups using comparison intervals. (A diagram for the non-overlapping case, and one for the overlapping case.)
• Categorical Outcomes (Use of CIs and comparison intervals; [At end: bootstrapping trends on scatter plot]) [5:39]
Review Questions ...
• How is a "nest of trend curves" obtained for putting onto a scatterplot?
• How do we interpret such a "nest of trend curves"?
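
The bootstrap idea from this week's videos can be sketched in a few lines of code. The course itself uses iNZight and VIT; the Python below is only an illustration, and the function name `bootstrap_ci`, its parameters and the sample data are all hypothetical. It assumes a simple percentile interval: re-sample the original sample with replacement many times, recompute the estimate each time, and take the middle 95% of the re-sampled estimates.

```python
import random
import statistics


def bootstrap_ci(sample, stat=statistics.mean, n_resamples=1000,
                 level=0.95, seed=0):
    """Percentile bootstrap interval for `stat` (illustrative sketch only)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(sample)
    # Each re-sample has the same size as the original sample and is
    # drawn WITH replacement, so values can appear more than once.
    estimates = sorted(
        stat(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    # Number of re-sampled estimates to cut off in each tail.
    k = int(n_resamples * (1 - level) / 2)
    return estimates[k], estimates[-k - 1]


# Made-up data, purely for illustration.
heights = [152, 158, 160, 161, 163, 165, 166, 168, 170, 171, 174, 180]
lower, upper = bootstrap_ci(heights)
print(f"sample mean = {statistics.mean(heights):.1f}")
print(f"95% bootstrap interval = ({lower:.1f}, {upper:.1f})")
```

Passing `stat=statistics.median` gives an interval for the median instead. Sampling without replacement at the full sample size would simply reproduce the original sample every time, which is why replacement is essential here.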

Week 7: RANDOMISED EXPERIMENTS AND STATISTICAL TESTS

• Randomised Experiments [5:24]
Review Questions ...
• Why do we want to do randomised experiments? What is the point of them?
• What are the two elements we need to have in place to be able to have a fair test?
• What are treatments?
• What is a treatment group?
• What is the point of random assignment to treatment groups?
• Why is it unethical to do some experiments?
• What does blocking mean? What principle governs the use of blocking and randomisation in designing an experiment?
• Randomisation Variation (Effects of pure randomisation variation) [6:01]
Review Questions ...
• What is randomisation variation and why is it a problem?
• How can we reduce the levels of randomisation variation?
• The Randomisation Test [7:45]
Review Questions ...
• When we look at a plot of experimental data that compares two treatment groups, why is it not always just obvious which treatment is better? What question goes through our minds?
• What is the basic idea governing when we will conclude that there is a real treatment difference?
• Try to bring to mind the basic steps of the VIT visualisation of the randomisation test (and visualise the sorts of pictures that were being generated at each step).
• What is the formal name for the tail proportions the visualisations produce?
• Percentages and Multiple groups (Applied to proportions/percentages in 2- and 3-groups situations) [7:48]
• Out of the Fog and into Power (Power, Type I error and "sample size") [8:27]
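
The logic of the randomisation test can also be sketched in code. This is not the VIT implementation; the function name `randomisation_test`, the two-sided comparison and the data are assumptions made for illustration. The sketch repeatedly re-assigns the observed values to two groups at random and records how often pure chance produces a difference in group means at least as large as the one actually observed.

```python
import random
import statistics


def randomisation_test(group_a, group_b, n_shuffles=1000, seed=0):
    """Return (observed difference in means, tail proportion).

    Illustrative sketch: two-sided, using the difference in group means.
    """
    rng = random.Random(seed)
    observed = statistics.mean(group_a) - statistics.mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        # Re-randomise: shuffle all values, then split them into two
        # groups of the original sizes, as if group labels were
        # assigned purely by chance.
        rng.shuffle(pooled)
        diff = statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            at_least_as_extreme += 1
    return observed, at_least_as_extreme / n_shuffles


# Made-up outcome values for two treatment groups.
treatment = [10, 11, 12, 13, 14, 15]
control = [1, 2, 3, 4, 5, 6]
difference, tail_proportion = randomisation_test(treatment, control)
print(f"observed difference = {difference}")
print(f"tail proportion = {tail_proportion}")
```

A small tail proportion means that chance re-assignment alone rarely produces a difference as large as the observed one.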

Week 8: TIME SERIES

• Introducing Time Series [4:38]
Review Questions ...
• What is time-series data?
• Why are people interested in time-series data?
• What is quarterly data?
• Why do people plot time-series data with points joined up by lines instead of using normal scatterplots?
• What, besides trends, is another form of pattern that is very common in time-series data?
• Seasonal Decomposition & Forecasting, Part I [6:01]
Review Questions ...
• What is the basic idea behind an additive model (or additive seasonal decomposition)?
• Why do we want to find stable structures in our time series?
• Seasonal Decomposition & Forecasting, Part II [5:12]
Review Questions ...
• What is the basic idea behind an additive model (or additive seasonal decomposition)?
• What is the basic idea behind a multiplicative model (or multiplicative seasonal decomposition)?
• Why do we want to find stable structures in our time series?
• When is an additive model likely to work better and when is a multiplicative model likely to work better?
• When we have an additive decomposition, how do we adjust the trend value to incorporate the seasonal effect at each time point?
• When we have a multiplicative decomposition, how do we adjust the trend value to incorporate the seasonal effect at each time point?
• What is the no-change value when we are making additive adjustments?
• What is the no-change value when we are making multiplicative adjustments?
• When does the Holt-Winters forecasting method usually work well?
• Can time-series forecasting predict the future reliably? Why?
• Comparing Series [5:16]
Review Questions ...
• When we are plotting several related series so that we can compare the patterns in them, what are the strengths and the weaknesses of a plot that puts all of the series on the same graph?
• When we are plotting several related series so that we can compare the patterns in them, what are the strengths and the weaknesses of a plot that puts all of the series on their own separate graphs?
• What types of feature of each series can we compare using the iNZight graphs for comparing series?
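
The difference between additive and multiplicative seasonal adjustments can be illustrated with a little arithmetic. The sketch below is not how iNZight computes its decompositions; the function names and numbers are hypothetical, and it assumes the trend values and seasonal effects are already known and that the first time point falls in the first season.

```python
def additive_fitted(trend, seasonal_effects):
    """Trend plus seasonal effect at each time point (illustrative)."""
    period = len(seasonal_effects)
    return [t + seasonal_effects[i % period] for i, t in enumerate(trend)]


def multiplicative_fitted(trend, seasonal_factors):
    """Trend times seasonal factor at each time point (illustrative)."""
    period = len(seasonal_factors)
    return [t * seasonal_factors[i % period] for i, t in enumerate(trend)]


# Made-up half-yearly example: two seasons per year, four time points.
trend = [100, 110, 120, 130]

# Additive: the seasonal swing stays the same size as the trend rises.
print(additive_fitted(trend, [5, -5]))

# Multiplicative: the seasonal swing grows in proportion to the trend.
print(multiplicative_fitted(trend, [1.5, 0.5]))
```

An additive effect of 0 or a multiplicative factor of 1 leaves the trend value unchanged.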

Introductory videos for iNZight

(iNZightVIT is free data visualization and analysis software)

Introductory videos for iNZight Lite

(iNZight Lite is a free online data visualization and analysis application)