- Videos from *"Data to Insight: An Introduction to Data Analysis"*
- Introductory videos for *iNZight*
- Introductory videos for *iNZight Lite*

- Publicity trailer for "Data to Insight"
- Week 1 Videos: Getting Started: *Introduction, data analysis, data organisation*
- Week 2 Videos: Boot Camp: *Basic statistical concepts*
- Week 3 Videos: Relationships: *between categorical and numeric variables and changes across subgroups*
- Week 4 Videos: More Relationships: *Diving deeper into relationships*
- Week 5 Videos: Why what we see is never quite the way it really is: *Biases, confounding and random errors*
- Week 6 Videos: Estimation With Confidence: *Confidence intervals and bootstrapping*
- Week 7 Videos: Randomised Experiments and Statistical Tests *(including "significance" and randomisation tests)*
- Week 8 Videos: Time Series: *Basic ideas, seasonal decomposition and forecasting*
- Glossary for Data to Insight

**Course Trailer** [2:40]

**Introduction to the Course** [3:25]

**How do we use data analysis?** (*Interviews with 4 statisticians working in a range of fields*) [4:36]
- In the course we also use one external video, **Hans Rosling's** inspired and inspirational **"200 Countries, 200 Years, 4 Minutes"**

**The place of data analysis in problem solving** (*Article, pdf, not a video*)

**Datasets for Data to Insight** (*Article, pdf, not a video*)

**Data Organisation** [5:15]

Review Questions ...
- What does a row in a rectangular data set correspond to?
- What does a column in a rectangular data set correspond to?
- What is an entity?
- What is a variable?
- What are the two variable types we are distinguishing between?
- Why do we make these distinctions?

**Statistics Boot Camp** (*Introduction to the week*) [1:17]

**Categorical Variables** (*Frequency tables and bar charts*) [4:59]

Review Questions ...
- What is the basic graph that we use for displaying the data for a categorical variable?
- Why do we use this in preference to pie charts and segmented bar charts?
- What features of the data does a bar chart highlight? (What sorts of things can we easily see?)

**Ordering Categories** (*Alphanumeric, natural and frequency*) [2:20]

Review Questions ...
- How does most software order categories by default?
- What other orderings should we consider and what are their advantages?

**Numeric variables** (*Dot plots & histograms; Centre (means & medians), where and why different*) [5:44]

Review Questions ...
- How is a stacked dot plot constructed?
- What are the four types of things we look for in dot plots?
- What is the common name for the mean?
- How does the mean relate to the dot plot?
- What property defines the median?
- When do the mean and median tend to be very similar, and when can they be quite different?
- Why might we prefer the median to the mean for personal incomes?

**Features of Numeric variables** (*Article, pdf, not a video*)

**Feature Spotting** (*Spread, shapes & oddities; Colouring by a 2nd variable*) [5:26]

Review Questions ...
- What sort of "oddities" should we look for and what sort of questions should we ask when we see them?
- What is the defining property of a median? the first quartile? and third quartile?
- What information does the box plot add to the dot plot?
- What is the interquartile range and how does it relate to the box plot?
- What is an "outlier"?
- What does "positively skewed" mean?
- What does "bimodal" mean?
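
As a rough companion to the quartile and interquartile-range questions above, here is a minimal Python sketch (not taken from the course or from iNZight) that computes the values a box plot is drawn from. The function name `five_number_summary` and the example data are hypothetical, and the "inclusive" quartile convention is an assumption: different software packages use slightly different quartile rules.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return min, quartiles, max and IQR for a list of numbers (illustrative helper)."""
    ordered = sorted(values)
    # Assumption: the 'inclusive' method; other quartile conventions
    # can give slightly different first and third quartiles.
    q1, median, q3 = quantiles(ordered, n=4, method="inclusive")
    return {
        "min": ordered[0],
        "q1": q1,          # about a quarter of the values lie below this
        "median": median,  # half the values lie below, half above
        "q3": q3,          # about three quarters of the values lie below this
        "max": ordered[-1],
        "iqr": q3 - q1,    # the width of the box in a box plot
    }


# Hypothetical data, chosen only to keep the arithmetic easy to follow.
summary = five_number_summary([1, 2, 3, 4, 5, 6, 7, 8, 9])
print(summary)
```

The interquartile range is simply the distance between the first and third quartiles, which is why it corresponds to the length of the box.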

**Comparing groups** (*(for numeric variable) Comparing features; Shifts & background variability; Colouring by a 3rd variable*) [7:53]

Review Questions ...
- When we compared children per woman, between regions, how many variables were involved and what were they?
- What are the main things we look for when comparing groups using dot plots?
- Why, as a way of transmitting group-comparison information, is simply quoting group averages often misleading?
- What is the natural way of measuring the extent of a shift?
- When looking at the extent of a difference between group centres, what else should we be relating that to when trying to gauge how important a change is?
- When do we pay attention to changes in spread or variability?
- How should we measure a "change" in spreads?
- When do we pay attention to changes in shape?
- What can we do with sets of dot plots to explore the effect of another variable?

**Time Travel** (*Case study subsetting over a 3rd variable using tiled plots and movement*) [3:22]

Review Questions ...
- In the last video, we investigated the effect of a third variable by using it to colour the points. What technique was introduced here for investigating the effect of a third variable?
- Why is it useful to play through a set of graphs (like a movie)?

**Introduction to Relationships** (*Why we care; Outcome & predictor variables*) [2:52]

Review Questions ...
- Why do people care about relationships between variables?
- We distinguished between an outcome variable and predictor variables. What is an outcome variable and what are predictor variables?

**Relationships between Categorical Variables** (*Exploration using separate bar charts and side-by-side bar charts*) [6:22]

Review Questions ...
- We distinguished between an outcome variable and predictor variables. What is an outcome variable and what are predictor variables?
- What are the strengths and weaknesses of using separate bar graphs of an outcome variable for each predictor group?
- What are the strengths and weaknesses of using side-by-side bar graphs to show the relationship between an outcome variable and a predictor variable?
- Which type of graph should you look at?
- In terms of separate bar graphs for each predictor group, what would you expect to see if there was no relationship between the outcome variable and the predictor variable?
- What are you looking for when you use side by side bar charts?

**Changes across subgroups** (*Exploring effects of a 3rd and 4th variable on a relationship via subsetting, tiling & movement*) [4:52]

Review Questions ...
- What basic ideas allow us to explore how a relationship changes with a 3rd (and perhaps a 4th) variable?
- Why is stepping through a set of graphs (like playing a movie) useful?

**Relationships between numeric variables** (*Scatter plots; Trend, scatter & outliers; Clustering*) [5:17]

Review Questions ...
- What is the standard way of displaying the relationship between 2 numeric variables?
- What sort of variable is plotted against the vertical scale and what against the horizontal scale?
- It is often useful to think of patterns in such plots in terms of 3 components. What are they?
- What type of question should separate clusters of dots suggest to us?

**Trend, Scatter & Outliers** (*Examples; Prediction & prediction intervals; Training the eye*) [6:42]

Review Questions ...
- How do we use a scatter plot to predict a new outcome at a given value of the predictor variable?
- How can we find a range of values that is likely to contain the new outcome?
- When will the prediction intervals/ranges be wide and when will they be narrow?
- What must we assume when we use existing data to predict a new outcome?
- In what type of region is it particularly dangerous to make predictions?
- How can we check visually whether a trend line or curve is positioned properly?
- What should the trend value corresponding to a particular predictor-variable value (the value 5 was used in the video) be telling us about?

**More Relationships** (*Introduction to week's coverage*) [1:32]

**Lines, curves and smoothers** (*Lines, curves & smoothers; Least squares*) [4:01]

Review Questions ...
- What shapes can be captured by each of linear, quadratic and cubic trend curves?
- What advantage does a smoother have over quadratic or cubic curves?
- Why shouldn’t we take the behaviour of flexible trend curves at the right and left-hand edges of the plot too seriously?
- If you ask iNZight, or most standard software, to put a curve of a particular type onto the scatterplot (e.g. a quadratic), how does it choose which curve of this type to draw?

**Overcoming Perceptual Problems** (*Problems with large datasets; Overprinting; Jitter; Varying transparency & point size; Running quantiles; Tile-density plots*) [7:05]

Review Questions ...
- What is overprinting and why does it cause problems for us?
- What is jittering and how can the use of jittering help us?
- What is transparency and how does transparency help us?
- Why should we vary the transparency setting?
- In large data sets, what is emphasized visually by a low transparency setting? by a high transparency setting?
- What are running quantiles and why are they useful for large data sets?
- Why might we want to look at a version of the plot that uses very small plotting symbols?
- Why does iNZight switch its default from drawing normal scatterplots to drawing tile density plots when the data set gets large?

**Diving deeper with more variables** (*Additional variables using colour, subsetting & movement; different trends per group*) [5:20]

Review Questions ...
- What are we looking for when we colour by a (third) numeric variable?
- What are we looking for when we colour by a (third) categorical variable?
- What are we looking for when we subset by a (third) variable?

**Our Changing Health and Wealth** (*Case study using up to 6 variables at once by employing colour, size, subsetting, matrices of tiled plots and movement*) [5:41]

**Why what I see is never quite the way it really is** (*Intro to week; Facts & artefacts*) [3:04]

Review Questions ...
- What is the first law of data analysis?
- Can sophisticated data analysis turn bad data into reliable conclusions?
- In terms of the patterns we see in data, what is the difference between facts and artefacts?

**Bad Data** (*"Measurement" issues/biases; "Selection" biases; missingness*) [7:02]

Review Questions ...
- What are artefacts?
- What are the two main ways that systematic biases get into data?
- Why can missing values cause biases?
- What is the best way investigators can protect themselves against bad data?

**Measurement, validity and reliability** (*Article, pdf, not a video*)

**Selection biases** (*Article, pdf, not a video*)

**Data cleaning** (*Article, pdf, not a video*)

**Causes and Confounding Variables, Part I** (*Confounding & adjustment*) [6:38]

Review Questions ...
- When is a variable a cause of changes in the outcome?
- What is an observational study?
- When do we have positive association between variables? negative association?
- When can a causal conclusion be justified on the basis of observational data alone?
- What is a randomised experiment?
- What is a confounder?
- What is a lurking variable?
- How can we adjust for a lack of balance on a known confounder?

**Causes and Confounding Variables, Part II** (*Confounding & adjustment*) [3:57]

Review Questions ...
- What is a lurking variable?
- We have methods for adjusting for confounders, so why can we still not reliably draw causal conclusions from observational data alone?

**Random Error, Part I** (*Random variation/error; effect of sample size*) [7:06]

Review Questions ...
- Do the problems caused by bad measurement systems and biased selection mechanisms go away when we get huge amounts of data?
- Do the problems caused by confounding go away when we get huge amounts of data?
- Do the problems caused by random error go away when we get huge amounts of data?
- What is a sampling error?
- Do we ever know how big the actual sampling error we incurred was? What do we try to do about that?

**Random Error, Part II** (*Random variation/error; effect of sample size; biases*) [6:05]

Review Questions ...
- What effect does sample size have on sampling error?
- For what two reasons are non-random selection mechanisms worse than random selection mechanisms?
- What were the 5 "take home messages" from this movie?

**Estimation With Confidence** (*Motivation, background, intuition*) [4:52]

Review Questions ...
- What is the most reliable way we know of obtaining data about populations without misleading biases? Why is this method not perfect?
- What happens whenever we use data from a sample to estimate a population quantity?
- What do we do about that?
- If you have an estimate of a population quantity and know its margin of error, within what limits should the true value lie?
- Why does estimating the extent of sampling error in a practical application seem like an impossible problem?

**Confidence Intervals from Bootstrap Resampling** (*Bootstrap idea; Comparing BS re-sampling variation & sampling variation; Construction of bootstrap CI*) [8:27]

Review Questions ...
- What is bootstrap re-sampling? How do we generate a bootstrap re-sample?
- What would happen if we took our re-samples using the ordinary way of sampling (without replacement)?
- What do we notice from the animations shown about bootstrap re-sampling error and sampling error?
- Why don’t we just use sampling error to make our confidence intervals? What do we do instead?
- Does the development in this video demonstrate that bootstrap confidence intervals will work as a way of addressing the “how wrong could we be” problem?
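
To make the re-sampling idea behind these questions concrete, here is a minimal Python sketch of a bootstrap confidence interval. It is not the course's or iNZight's implementation: the function name `bootstrap_ci`, the default of 1000 re-samples, the example data, and the choice of a percentile interval (the central 95% of the re-sample estimates) are all assumptions made for illustration.

```python
import random
from statistics import mean


def bootstrap_ci(sample, statistic=mean, n_resamples=1000, level=0.95, seed=None):
    """Percentile bootstrap interval for `statistic` (illustrative sketch)."""
    rng = random.Random(seed)
    n = len(sample)
    # Each bootstrap re-sample has the same size as the original sample
    # and is drawn WITH replacement from it.
    estimates = sorted(
        statistic(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    # Keep the central `level` proportion of the re-sample estimates.
    tail = (1 - level) / 2
    lower_idx = int(tail * n_resamples)
    upper_idx = min(n_resamples - 1, int((1 - tail) * n_resamples))
    return estimates[lower_idx], estimates[upper_idx]


# Hypothetical sample, used only to exercise the function.
sample = [2, 4, 4, 5, 7, 9, 10, 12, 13, 15]
low, high = bootstrap_ci(sample, seed=1)
print(f"sample mean = {mean(sample)}, interval = ({low}, {high})")
```

Sampling without replacement at the full sample size would simply return the original sample in a different order every time, so every re-sample estimate would be identical; drawing with replacement is what produces re-sampling variation.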

**Do Bootstrap Confidence Intervals work?** (*Simulation coverage*) [5:41]

Review Questions ...
- What is the basic idea of how a bootstrap confidence interval is constructed to capture the true population value of some quantity (e.g. a mean, a median, a percentage, ...)?
- What do we do to find out whether a method for constructing confidence intervals works?
- What do we mean by "our interval has covered the true value"?
- What do we know about bootstrapping as a way of constructing confidence intervals?
- What is meant when we say that a particular interval is a "95% confidence interval"?

**Numeric Outcomes** (*Use of CIs and comparison intervals in anova-type situation*) [5:41]

Review Questions ...
- What is the basic idea of how we construct a bootstrap confidence interval to capture the true population value of some quantity?
- Why don’t we use the confidence intervals drawn on our plots to make comparisons between groups?
- What is the difference between a confidence interval and a comparison interval?
- How do we interpret non-overlapping comparison intervals? overlapping comparison intervals?
- Can we rely completely on the intervals drawn on our plots? What are they there for? When and how do we get more reliable answers?
- Sketch 2 diagrams showing how you can “read” a confidence interval for a difference between 2 groups using comparison intervals. (A diagram for the non-overlapping case, and one for the overlapping case.)

**Categorical Outcomes** (*Use of CIs and comparison intervals; [At end: bootstrapping trends on scatter plot]*) [5:39]

Review Questions ...
- How is a "nest of trend curves" obtained for putting onto a scatterplot?
- How do we interpret such a "nest of trend curves"?

**Randomised Experiments** [5:24]

Review Questions ...
- Why do we want to do randomised experiments? What is the point of them?
- What are the two elements we need to have in place to be able to have a fair test?
- What are treatments?
- What is a treatment group?
- What is the point of random assignment to treatment groups?
- Why is it unethical to do some experiments?
- What does blocking mean? What principle governs the use of blocking and randomisation in designing an experiment?

**Randomisation Variation** (*Effects of pure randomisation variation*) [6:01]

Review Questions ...
- What is randomisation variation and why is it a problem?
- How can we reduce the levels of randomisation variation?

**The Randomisation Test** [7:45]

Review Questions ...
- When we look at a plot of experimental data that compares two treatment groups, why is it not always just obvious which treatment is better? What question goes through our minds?
- What is the basic idea governing when we will conclude that there is a real treatment difference?
- Try to bring to mind the basic steps of the VIT visualisation of the randomisation test (and visualise the sorts of pictures that were being generated at each step).
- What is the formal name for the tail proportions the visualisations produce?
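
The logic of a randomisation test can be sketched in a few lines of Python. This is an illustration only, not the VIT visualisation or the course's code: the function name `randomisation_test`, the use of the difference in group means, the two-sided tail proportion, and the example data are all assumptions.

```python
import random
from statistics import mean


def randomisation_test(group_a, group_b, n_shuffles=1000, seed=None):
    """Return (observed difference in means, tail proportion). Illustrative sketch."""
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        # Re-randomise: reassign the observed values to the two groups
        # purely by chance, as if the treatment made no difference.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        # Assumption: a two-sided comparison on the absolute difference.
        if abs(diff) >= abs(observed):
            at_least_as_extreme += 1
    return observed, at_least_as_extreme / n_shuffles


# Hypothetical treatment groups with a clear separation.
treated = [10, 11, 12, 13, 14, 15, 16, 17]
control = [1, 2, 3, 4, 5, 6, 7, 8]
observed_diff, tail_proportion = randomisation_test(treated, control, seed=0)
print(observed_diff, tail_proportion)
```

A small tail proportion says that chance re-assignment alone rarely produces a difference as large as the one observed, which is the basis for concluding that the treatment difference is real.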

**Percentages and Multiple groups** (*Applied to proportions/percentages in 2- and 3-group situations*) [7:48]

**Out of the Fog and into Power** (*Power, Type I error and "sample size"*) [8:27]

**Introducing Time Series** [4:38]

Review Questions ...
- What is time-series data?
- Why are people interested in time-series data?
- What is quarterly data?
- Why do people plot time-series data with points joined up by lines instead of using normal scatterplots?
- What, besides trends, is another form of pattern that is very common in time-series data?

**Seasonal Decomposition & Forecasting, Part I** [6:01]

Review Questions ...
- What is the basic idea behind an additive model (or additive seasonal decomposition)?
- Why do we want to find stable structures in our time series?

**Seasonal Decomposition & Forecasting, Part II** [5:12]

Review Questions ...
- What is the basic idea behind an additive model (or additive seasonal decomposition)?
- What is the basic idea behind a multiplicative model (or multiplicative seasonal decomposition)?
- Why do we want to find stable structures in our time series?
- When is an additive model likely to work better and when is a multiplicative model likely to work better?
- When we have an additive decomposition, how do we adjust the trend value to incorporate the seasonal effect at each time point?
- When we have a multiplicative decomposition, how do we adjust the trend value to incorporate the seasonal effect at each time point?
- What is the no-change value when we are making additive adjustments?
- What is the no-change value when we are making multiplicative adjustments?
- When does the Holt-Winters forecasting method usually work well?
- Can time-series forecasting predict the future reliably? Why?
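
As a small illustration of the additive and multiplicative adjustments these questions refer to, the following Python sketch combines a trend value with a seasonal effect in both ways. The function names, the quarterly trend values and the seasonal effects are hypothetical; this is not the decomposition or forecasting code used by iNZight.

```python
def additive_value(trend, seasonal_effect):
    """Additive model: add the seasonal effect to the trend value."""
    return trend + seasonal_effect


def multiplicative_value(trend, seasonal_factor):
    """Multiplicative model: scale the trend value by the seasonal factor."""
    return trend * seasonal_factor


# Hypothetical quarterly trend values and seasonal effects.
trend = [100.0, 102.0, 104.0, 106.0]
additive_effects = [-5.0, 0.0, 8.0, -3.0]         # an effect of 0 means "no change"
multiplicative_factors = [0.95, 1.0, 1.08, 0.97]  # a factor of 1 means "no change"

additive_series = [
    additive_value(t, s) for t, s in zip(trend, additive_effects)
]
multiplicative_series = [
    multiplicative_value(t, f) for t, f in zip(trend, multiplicative_factors)
]
print(additive_series)
print(multiplicative_series)
```

In the second quarter both models leave the trend value unchanged, because the additive effect is 0 and the multiplicative factor is 1.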

**Comparing Series** [5:16]

Review Questions ...
- When we are plotting several related series so that we can compare the patterns in them, what are the strengths and the weaknesses of a plot that puts all of the series on the same graph?
- When we are plotting several related series so that we can compare the patterns in them, what are the strengths and the weaknesses of a plot that puts all of the series on their own separate graphs?
- What types of feature of each series can we compare using the iNZight graphs for comparing series?


**iNZight Basics 1: From importing data to graphics**

**iNZight Basics 2:** Working with **single** numeric and categorical **variables** (1 variable at a time)

**iNZight Basics 3:** Working with **pairs of variables** (2 variables at a time)

**iNZight Basics 4:** Working with **subsetting variables**
- (Misc) iNZight Basics 0: Unzipping the iNZightVIT-latest-zipfile folder

**iNZight Lite 1: A first look at iNZight Lite**

**iNZight Lite 2: Loading Example data sets** (using the "FutureLearn" data files from the "Data to Insight" MOOC as an example)

You can revert to the old behaviour using the "