Department of Statistics


Postgraduate research topics

Postgraduate students at the Department of Statistics have many research areas to choose from. The list below gives the MSc and BSc(Hons) research topics available for 2020.

  1. Data quality rating framework for large national pipe dataset
  2. Inference for Queues with Approximate Bayesian Computation
  3. Galileo, Hamilton, and Metropolis walk into a bar
  4. Disease risk prediction using multi-level Omics data
  5. Natural language processing of clinical data
  6. Bayesian inference for failure times of load-sharing systems with damage accumulation
  7. Bayesian state-space approach to spectral density estimation for gravitational wave data
  8. Multiple data sources in Bayesian inference
  9. Is my approximation safe to use?
  10. Modern method for high dimensional data
  11. Métier analysis and catch profiles in New Zealand fisheries
  12. Optimising trawl survey spatial stratification
  13. A comparison of spatial capture-recapture and random encounter models for camera trap data
  14. Unloading the spin cycle
  15. Adaptive control of stochastic queuing networks
  16. Stochastic modelling of patient’s trajectories in a hospital
  17. Parameter and state estimation for queueing systems
  18. Moment matching problem for truncated multivariate distributions
  19. Scheduling for a processor sharing system
  20. Using routinely collected data to predict the risk of hip and knee replacement in patients with arthritis (Masters)
  21. Statistics Education: Student understanding of distributions
  22. Statistics Education: Joint and conditional probability statements
  23. Generally-Altered, -Inflated and -Truncated (GAIT) Regression
  24. Understanding behaviour change in a complex participatory community intervention to improve maternal and child survival in rural Malawi: self-esteem, empowerment, co-coverage and equity
  25. Modelling the scale of spatial interactions between key sandflat species
  26. Simulating non-random community assemblage processes
  27. Investigating the efficiency of multivariate species distribution models to represent biodiversity
  28. Stochastic models for biodiversity
  29. Mathematical Population Genetics
  30. Inhomogeneous branching Brownian motions
  31. Visualisation and analysis of longitudinal data
  32. Text analytics
  33. Analytics for date and time-stamped data
  34. Predictive analytics
  35. Interactive graphics with R
  36. Data wrangling tools
  37. Visualisation of Multiple Response data from complex surveys
  38. Other topics

1. Data quality rating framework for large national pipe dataset


Description:

The national pipe data portal provides a significant opportunity for asset analytics across New Zealand. Through the portal, participating councils can have their data mapped to a common data standard. However, an immediate challenge is that different councils maintain varying levels of data integrity. The objectives of the project are to:

  1. understand the context in which the data will be used;
  2. define the data test matrix that determines overall data quality;
  3. develop statistical algorithms that can be run across large pipe datasets;
  4. develop a reporting framework that summarises the outcomes.

Please note:
a) A Research Masters (90 points+ dissertation) student in Statistics or Data Science is essential, given the amount of work required.
b) This Masters project is jointly supervised with the Civil Engineering department, which will provide critical subject-matter expertise for understanding the pipe asset data.

Contact: Dr Lisa Chen

2. Inference for Queues with Approximate Bayesian Computation


Abstract:

Queueing systems can typically be modelled analytically in simple cases, but more realistic assumptions tend to require simulations. However, that makes it difficult to compare the models to data and infer the values of the queueing system’s parameters. In this project you will develop ways of inferring queueing parameters from data using “Approximate Bayesian Computation”, which essentially involves simulating datasets until you find ones that match the observed dataset.
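As a rough illustration of the idea described above, the following sketch applies ABC rejection sampling to a single-server queue. Everything specific here is an assumption made for the example and not part of the project description: the M/M/1 model, the uniform prior on the service rate, the use of the mean waiting time as the only summary statistic, and the tolerance value.

```python
import random
import statistics


def simulate_mean_wait(arrival_rate, service_rate, n_customers, rng):
    """Mean time spent waiting in an M/M/1 queue, via the Lindley recursion."""
    wait = 0.0
    total = 0.0
    for _ in range(n_customers):
        total += wait
        service = rng.expovariate(service_rate)  # this customer's service time
        gap = rng.expovariate(arrival_rate)      # time until the next arrival
        wait = max(0.0, wait + service - gap)    # waiting time of the next customer
    return total / n_customers


def abc_rejection(observed, arrival_rate, n_customers, n_draws, tolerance, rng):
    """Keep prior draws whose simulated summary lies close to the observed one."""
    accepted = []
    for _ in range(n_draws):
        service_rate = rng.uniform(1.0, 5.0)  # assumed prior on the service rate
        simulated = simulate_mean_wait(arrival_rate, service_rate, n_customers, rng)
        if abs(simulated - observed) <= tolerance:
            accepted.append(service_rate)
    return accepted


rng = random.Random(1)
# "Observed" data are simulated here with a known service rate of 2.0.
observed = simulate_mean_wait(1.0, 2.0, 1000, rng)
posterior = abc_rejection(observed, 1.0, 1000, 1000, 0.05, rng)
print(len(posterior), statistics.mean(posterior))
```

The accepted draws approximate the posterior for the service rate. Shrinking the tolerance makes the approximation sharper but rejects more simulations, which is the basic trade-off in ABC.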

Prerequisites: STATS 320, STATS 331, and excellent programming skills.
Supervisors: Brendon Brewer (bj.brewer@auckland.ac.nz), Azam Asanjarani (azam.asanjarani@auckland.ac.nz), and Ilze Ziedins (i.ziedins@auckland.ac.nz)

3. Galileo, Hamilton, and Metropolis walk into a bar


Abstract:

The Metropolis Algorithm is the simple foundation of Markov Chain Monte Carlo (MCMC), and can be implemented very easily. More advanced methods such as Hamiltonian Monte Carlo can improve performance significantly on strongly dependent target distributions, provided the derivatives of the log-density can be evaluated. However, the fundamentals of Hamiltonian MCMC have been criticised by John Skilling, who proposed a “Galilean” method instead. In this project you would implement Galilean, Hamiltonian, and Metropolis sampling and test them on a variety of difficult MCMC examples to see which are the most efficient and why. We’ll probably do this within Nested Sampling.
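To show how little code the basic Metropolis algorithm needs, here is a minimal random-walk sketch. The target (a bivariate Gaussian with correlation 0.95), the step size and the function names are assumptions chosen for illustration; the Hamiltonian and Galilean variants are not shown.

```python
import math
import random


def log_target(x, y, rho=0.95):
    """Log-density (up to a constant) of a strongly correlated bivariate Gaussian."""
    return -(x * x - 2.0 * rho * x * y + y * y) / (2.0 * (1.0 - rho * rho))


def metropolis(n_steps, step_size, rng):
    """Random-walk Metropolis; returns the samples and the acceptance rate."""
    x, y = 0.0, 0.0
    logp = log_target(x, y)
    samples = []
    accepted = 0
    for _ in range(n_steps):
        x_new = x + rng.gauss(0.0, step_size)
        y_new = y + rng.gauss(0.0, step_size)
        logp_new = log_target(x_new, y_new)
        # Accept with probability min(1, p_new / p_old).
        if rng.random() < math.exp(min(0.0, logp_new - logp)):
            x, y, logp = x_new, y_new, logp_new
            accepted += 1
        samples.append((x, y))
    return samples, accepted / n_steps


def sample_correlation(samples):
    """Pearson correlation of the two coordinates."""
    n = len(samples)
    mx = sum(s[0] for s in samples) / n
    my = sum(s[1] for s in samples) / n
    sxy = sum((s[0] - mx) * (s[1] - my) for s in samples)
    sxx = sum((s[0] - mx) ** 2 for s in samples)
    syy = sum((s[1] - my) ** 2 for s in samples)
    return sxy / math.sqrt(sxx * syy)


samples, rate = metropolis(20000, 0.5, random.Random(0))
print(rate, sample_correlation(samples))
```

On a target like this the isotropic proposal must stay small to fit the narrow direction, so the chain moves slowly along the long direction. That is the kind of inefficiency the gradient-based methods aim to reduce.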

Prerequisites: STATS 331, solid calculus, and strong programming skills.

Supervisor: Brendon Brewer (bj.brewer@auckland.ac.nz)

4. Disease risk prediction using multi-level Omics data


Biotechnological advances have expanded the breadth of available high-dimensional data from genomic data, to extensive epigenomic, transcriptomic and proteomic data. Effectively analyzing these data can facilitate the identification of accurate models that predict outcomes and generate important insights into the underlying genetic mechanisms of various complex traits. There is a strong need for powerful and advanced analysis strategies that can fully make use of these comprehensive high-dimensional data. This project will explore the emerging approaches for data integration and disease risk prediction.

Prerequisites: R programming and STATS 310

Supervisor: Yalu Wen (y.wen@auckland.ac.nz)

 

5. Natural language processing of clinical data


Intraoperative records for some 140,000 patients are available, including clinician-entered comorbidity and event data. Some of the event data have identifying descriptors for clinicians. This project will apply natural language techniques to anonymise the event data and to identify the comorbidities. Care must be taken to account for abbreviations (e.g. “T2DM” for diabetes) and negations (e.g. “not smoker”).

Prerequisites: R/Python programming

Supervisor: Yalu Wen (y.wen@auckland.ac.nz)

6. Bayesian inference for failure times of load-sharing systems with damage accumulation


In many engineering applications, it is of interest to test the reliability of systems which are composed of parallel components. In a load-sharing system, the stress is redistributed to the surviving components after one of the components fails.

This project aims to develop a Bayesian alternative to the existing maximum-likelihood approach for modeling the intensity function of a point process and apply this to failure time data of pre-stressed concrete beams that are each made up of several tension wires.

Requirements: Good knowledge of applied Bayesian methods (e.g. from STATS 331 or STATS 731), good programming skills, and knowledge of R and JAGS/WinBUGS are essential.

Supervisor: Associate Professor Renate Meyer (renate.meyer@auckland.ac.nz)

7. Bayesian state-space approach to spectral density estimation for gravitational wave data


Abstract:

Advanced LIGO made the very first direct measurement of gravitational waves in September 2015. MCMC techniques were used for posterior computation of the signal parameters of the binary inspiralling black hole system. The current time series models used for parameter estimation of gravitational wave signals assume that the time-varying dimensionless strain at the detector is decomposed into a signal plus additive noise, assumed to be Gaussian, stationary with known power spectral density. However, the detector background is not only slowly varying but also subject to short-duration large-amplitude transient noise events.

This project aims to explore a novel Bayesian non-linear, non-Gaussian state-space approach that gives a more realistic nonparametric estimate of the spectral density.

Requirements: A good knowledge of and interest in Bayesian inference, MCMC techniques, and time series as well as good programming skills and knowledge of R are essential.

Supervisor: Associate Professor Renate Meyer (renate.meyer@auckland.ac.nz)

8. Multiple data sources in Bayesian inference


Abstract:

When multiple data sources are used in modelling, they may not all be equally important in explaining the observations of interest (response variables). Each source may carry different information, and some may be of poor quality or less reliable. An obvious approach is to build these features into the model. For example, a reliable source can be fed directly into the likelihood while an unreliable source is fed in through a less direct route. This project will study the properties of some known approaches and apply them to problems of interest.

STATS 331 and STATS 210 will be useful, and good programming skills are needed.

Contact: Kate Lee (kate.lee@auckland.ac.nz)

9. Is my approximation safe to use?


Abstract:

When exact inference is not feasible, approximation is an attractive choice, in the hope that the resulting approximate inference is adequate. Despite the popularity of approximation, how to assess the quality of the resulting inference has not been well investigated. Some approaches have recently been proposed. In this project, those approaches will be examined for well-known approximations such as ABC and variational Bayes.

STATS 331 and STATS 210 will be useful, and good programming skills are needed.

Contact: Kate Lee (kate.lee@auckland.ac.nz)

10. Modern method for high dimensional data


Abstract:

When a dataset contains a very large number of observations, it is often called “tall data”. Even for data of moderate dimension, analysing such a large volume of observations is challenging. One approach is to find a subset of the data that adequately represents the original dataset. In this project, recently proposed methods will be studied and applied to problems of interest.

STATS 210, 20x will be useful, and R programming skills are needed.

Contact: Kate Lee (kate.lee@auckland.ac.nz)

11. Métier analysis and catch profiles in New Zealand fisheries


Abstract:

The notion of a métier is widely used to describe fisheries at a level incorporating the largest part of their heterogeneity. Each métier is a group of fishing operations defined by the combination of fishing gear, target species, catch composition, area and season. Métier-based approaches are particularly useful for investigating situations where multiple fleets exploit an overlapping mix of species. This project would use information on the New Zealand commercial fishing industry to define métiers, and examine catch data to determine catch profiles and how and why these may have changed over time. Knowledge of multivariate and clustering techniques would be an advantage.

Contact: Ian Tuck (ian.tuck@niwa.co.nz)

 

12. Optimising trawl survey spatial stratification


Abstract:

This project will examine spatial stratification within New Zealand trawl surveys, and use different stratification approaches to determine whether current stratification is optimal, or whether it can be improved, given the survey objectives. The project will involve species distribution modelling on the basis of trawl survey observations and environmental data, and survey simulations.

Contact: Ian Tuck (ian.tuck@niwa.co.nz)

 

13. A comparison of spatial capture-recapture and random encounter models for camera trap data


Abstract:

Camera-trap surveys are commonly used to estimate density of wildlife populations. Over the last decade, spatial capture-recapture (SCR) and random encounter models (REMs) have gained traction in their application to the resulting data. They each require slightly different information: for example, SCR usually needs individuals to be recognised when they are detected, while REMs usually require a priori knowledge of average animal speeds. The two methods also make different assumptions about the way animals move and behave. In this project, we will assess the performance of SCR and REM estimators in terms of properties such as bias, precision, and robustness to assumption violations.
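To make the role of animal speed concrete, the sketch below evaluates the density formula commonly given for the REM, D = (y/t) · π / (v · r · (2 + θ)). The function name, argument names and numerical inputs are illustrative assumptions, and the formula is stated here as background, not taken from the project description.

```python
import math


def rem_density(detections, camera_days, speed, radius, angle):
    """Random encounter model density estimate.

    detections   number of independent detections (y)
    camera_days  total camera effort in days (t)
    speed        average animal speed, km per day (v)
    radius       camera detection radius, km (r)
    angle        camera detection arc, radians (theta)

    Returns animals per square km.
    """
    trap_rate = detections / camera_days
    return trap_rate * math.pi / (speed * radius * (2.0 + angle))


# Hypothetical survey: 100 detections over 1000 camera-days.
density = rem_density(100, 1000, speed=2.0, radius=0.01, angle=0.7)
print(round(density, 2))
```

The estimate is inversely proportional to the assumed speed, so an error in the speed estimate carries straight through to the density estimate. SCR avoids this input but needs individually recognisable animals.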

Requirements: An interest in programming, with good grades in statistical computing papers.

Contact: Ben Stevenson (ben.stevenson@auckland.ac.nz)

14. Unloading the spin cycle


Abstract:

Many of us get our science and health information from news media channels and sometimes this information results in us changing our behaviour. However, we all know that claims made in the mainstream media and our social media channels can be misleading, and that news headlines and media articles regularly contain spin [1]. In order to critically evaluate claims made in the media, we need to be statistically literate. Statistical literacy is much more than statistical knowledge; it is a complex construct involving many elements (see Gal, 2002).

This is a statistics education project and will involve conducting a literature review on the impact of media spin on health and science communication. It will also involve developing and administering a short questionnaire aiming to explore perceptions of, and attitudes to, statistical literacy.

[1] Spin is defined as “the presentation of information in a particular way; a slant, especially a favourable one” (English Oxford Living Dictionaries, 2019).

Requirements: A genuine interest in statistics education. Skills in reading and critiquing research papers. Proficiency in essay writing. STATS 150 and/or STATS 708 would also be useful.

If you are interested in this project, please come and talk to Stephanie Budgett (s.budgett@auckland.ac.nz).

Gal, I. (2002). Adults' statistical literacy: Meanings, components, responsibilities. International Statistical Review, 70(1), 1-25.

15. Adaptive control of stochastic queuing networks


Abstract:

The evolution of queueing systems is often random, and key variables or parameters of the system may be unknown or only partially observed. Building a stochastic model for these systems, with the aim of improving efficiency or forecasting and bringing them under online control, leads to shorter customer waiting times, better server utilisation and greater stability. The main idea of this project is to devise an appropriate and optimal model for a network of queues that fits practical applications in biology, health services, energy, manufacturing, traffic and communication networks.

Contact: Azam Asanjarani (azam.asanjarani@auckland.ac.nz)

16. Stochastic modelling of patient’s trajectories in a hospital


Abstract:

The aim of this project is to construct a stochastic model for predicting an individual’s progression through the various stages of a disease, or to contribute to a model that predicts the risks and likelihoods of the expected trajectories of patients through intensive care units.

Contact: Azam Asanjarani (azam.asanjarani@auckland.ac.nz)

17. Parameter and state estimation for queueing systems


Abstract:

Queueing systems come up in a variety of situations in the real world. For instance, machines at a factory, telephone lines, traffic lights, runways at an airport, cashiers in a supermarket or doctors may not immediately supply their customers with the amount or kind of service they require. The evolution of queueing systems is often random. To aid in understanding queueing systems, mathematical queueing models have been studied and employed for over a century. However, there are few papers dealing with parameter and state estimation for queueing systems. The aim of this project is to fill this gap in the literature.

Contact: Azam Asanjarani (azam.asanjarani@auckland.ac.nz)

18. Moment matching problem for truncated multivariate distributions


Abstract:

Queueing systems come up in a variety of situations in the real world. For instance, machines at a factory, telephone lines, traffic lights, runways at an airport, cashiers in a supermarket or doctors may not immediately supply their customers with the amount or kind of service they require. The evolution of queueing systems is often random. To aid in understanding queueing systems, mathematical queueing models have been studied and employed for over a century. However, there are few papers dealing with parameter and state estimation for queueing systems. The aim of this project is to fill this gap in the literature.

Contact: Azam Asanjarani (azam.asanjarani@auckland.ac.nz)

19. Scheduling for a processor sharing system


Abstract:

In a variety of real-life queueing systems, such as those in manufacturing, telecommunications, transportation, supermarkets or hospitals, job requests arrive continuously and the servers (e.g. machines, cashiers, doctors, …) may not immediately supply their customers with the amount or kind of service they require. In these situations, to reduce the waiting time in the queue and to treat each request fairly, we apply scheduling policies that determine which requests in the queue are serviced at any point in time, how much time is spent on each, and what happens when a new request arrives. The aim of this project is to solve the problem of scheduling arrivals to a congestion system (such as a traffic intersection) with a finite number of users having identical deterministic demand sizes.

Contact: Azam Asanjarani (azam.asanjarani@auckland.ac.nz)

20. Using routinely collected data to predict the risk of hip and knee replacement in patients with arthritis (Masters)


Abstract:

This project will analyse routinely collected data of patients with arthritis. For each of these patients, we will search future hospital discharges to identify those with knee and hip replacements and potential risk factors. Sampling weights will need to be incorporated in order to account for missingness. The project aims to use proportional hazards models (survival models) to study the risk of hip/knee replacement in patients with arthritis.

Requirements: R programming and survival data knowledge.

Supervisor: Dr Claudia Rivera Rodriguez (c.rodriguez@auckland.ac.nz)

21. Statistics Education: Student understanding of distributions


This project would suit someone with a genuine interest in Statistics education. The notion of distribution underpins both data analysis and probability theory. The project would include conducting a literature review on how students develop an understanding of distribution. This means that skills in reading research papers and essay writing are important.

Requirements: Good grades in STATS 125 and STATS 210. It is also desirable that you have marking and/or tutoring experience in at least one of STATS 125 or STATS 210.

Supervisors: Stephanie Budgett (s.budgett@auckland.ac.nz) and Marie Fitch (m.fitch@auckland.ac.nz)

22. Statistics Education: Joint and conditional probability statements


This project would suit someone with a genuine interest in Statistics education. It would include conducting a literature review on how students develop an understanding of both joint probability and conditional probability. This means that skills in reading research papers and essay writing are important.

Requirements: Good grades in STATS 125 and STATS 210. It is also desirable that you have marking and/or tutoring experience in at least one of STATS 125 or STATS 210.

Supervisors: Stephanie Budgett (s.budgett@auckland.ac.nz) and Marie Fitch (m.fitch@auckland.ac.nz)

 

23. Generally-Altered, -Inflated and -Truncated (GAIT) Regression


This project is to obtain more experience with GAIT regression on real data. Heaped data is one such application, and fisheries data is another. Zero-altered (hurdle), -inflated and -truncated (positive) count distributions are now well-established, especially in Poisson and binomial distribution forms. Recently generally-altered, -inflated and -truncated (GAIT) distributions have been developed whereby any finite set of support values are 'special' (rather than just 0). Two variants of the GAIT models have been proposed: a nonparametric method based on the multinomial logit model and a parametric method based on finite mixtures of the parent distribution on differing support. Work has been done developing the above for count distributions, e.g., the R package VGAM now has several new functions such as gatpoisson.mix() that can be applied to heaped data.
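For readers unfamiliar with the zero-based special cases that GAIT generalises, the sketch below writes out the zero-truncated and zero-altered (hurdle) Poisson probability functions. It is an illustration only: the function names are hypothetical, it does not use or mimic the VGAM interface, and it treats 0 as the single special value.

```python
import math


def poisson_pmf(y, lam):
    """Ordinary Poisson probability P(Y = y)."""
    return math.exp(-lam) * lam ** y / math.factorial(y)


def zero_truncated_pmf(y, lam):
    """Poisson conditioned on Y > 0 (the 'positive' Poisson)."""
    if y < 1:
        return 0.0
    return poisson_pmf(y, lam) / (1.0 - math.exp(-lam))


def zero_altered_pmf(y, lam, p_zero):
    """Hurdle model: P(Y = 0) = p_zero, positive counts follow the truncated Poisson."""
    if y == 0:
        return p_zero
    return (1.0 - p_zero) * zero_truncated_pmf(y, lam)


# The probabilities still sum to one (the tail beyond 60 is negligible for lam = 2.5).
total = sum(zero_altered_pmf(y, 2.5, 0.3) for y in range(61))
print(total)
```

In the hurdle form the probability of a zero is modelled separately from the positive counts. The GAIT idea described above extends that treatment from the single value 0 to any finite set of special values, such as the round numbers that are over-reported in heaped data.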

Prerequisites: STATS 330; good grades in STATS 310 and STATS 380 would help.

Supervisors: Thomas Yee (t.yee@auckland.ac.nz) and Ian Tuck (ian.tuck@niwa.co.nz)

This project should ideally be completed during Semester 1, as Thomas will be on Research & Study Leave during Semester 2.

24. Understanding behaviour change in a complex participatory community intervention to improve maternal and child survival in rural Malawi: self-esteem, empowerment, co-coverage and equity


This project is co-supervised by Dr Sonia Lewycka.

Abstract:

A large cluster randomized trial of a participatory community intervention with women’s groups was conducted in a rural district of Malawi (https://www.ncbi.nlm.nih.gov/pubmed/23683639). This intervention demonstrated an impact on maternal and child mortality, but the mechanism of behaviour change is still not understood. This project will develop the trial analysis by exploring two additional areas: 1) the impact of intervention participation on the “co-coverage” of key maternal, newborn and child health behaviours; and 2) the mediating role of empowerment and self-esteem.

1) There is a growing literature on the measurement of “co-coverage” for complex maternal and child health interventions [1]. This approach acknowledges that improvements in health and reductions in mortality due to complex public health interventions may arise from the adoption of multiple health behaviours. Indeed, public health interventions that target multiple behaviours simultaneously may have the most powerful effects. However, the analysis of trials typically focuses on the impact of interventions on separate individual behaviours. This project will involve a secondary analysis of data from the trial conducted in Malawi. The student will create a “co-coverage” index relevant to this study, and use this to measure the impact of the intervention on health behaviour. Further analyses will explore other ways of grouping health behaviour variables, such as calculation of a composite index.

2) There is a well-established link between self-esteem and mental health, and evidence for the role of empowerment in family planning and maternal health [2-4], but less is known about how self-esteem and empowerment may impact upon uptake of health services and childcare behaviours. The student will review the literature on measures of self-esteem and empowerment, and the mediating role of these on health behaviours. They will construct a self-esteem/empowerment scale using the questionnaire data collected, and use this to build a multivariable model looking at the relationship between group membership and health behaviour, then explore how self-esteem mediates this effect.

The specific objectives of the study are to:

  • Review literature on co-coverage and composite indices and develop a co-coverage index
  • Review literature on empowerment, self-esteem and maternal and child health
  • Build multivariable models to explore the impact of intervention participation on co-coverage and composite indices
  • Conduct equity analyses to explore the distribution of co-coverage by household wealth, maternal education and intervention participation
  • Evaluate the ‘equity impact’ of the intervention on co-coverage by household wealth
  • Explore the study data on empowerment and self-esteem and use grouping methods, such as principal components analysis or factor analysis, to identify important dimensions of self-esteem
  • Use these findings to inform the development of an empowerment/self-esteem scale
  • Build a multivariable model to explore the impact of intervention participation on empowerment/self-esteem
  • Build a multivariable model to explore the impact of empowerment/self-esteem on health behaviours
  • Build a multivariable model to explore the mediating role of empowerment/self-esteem in the relationship between intervention participation and health behaviour
  • Write a research paper

 

Requirements: R programming, modelling and principal components analysis/factor analysis.

References:

1. Barros AJD, Victora CG. Measuring Coverage in MNCH: Determining and Interpreting Inequalities in Coverage of Maternal, Newborn, and Child Health Interventions. PLoS medicine 2013; 10(5).

2. Yaya S, Uthman OA, Ekholuenetale M, Bishwajit G. Women empowerment as an enabling factor of contraceptive use in sub-Saharan Africa: a multilevel analysis of cross-sectional surveys of 32 countries. Reprod Health 2018; 15(1): 214.

3. Mandal M, Muralidharan A, Pappa S. A review of measures of women's empowerment and related gender constructs in family planning and maternal health program evaluations in low- and middle-income countries. BMC pregnancy and childbirth 2017; 17(Suppl 2): 342.

4. Ewerling F, Lynch JW, Victora CG, van Eerdewijk A, Tyszler M, Barros AJD. The SWPER index for women's empowerment in Africa: development and validation of an index based on survey data. The Lancet Global Health 2017; 5(9): e916-e23.

25. Modelling the scale of spatial interactions between key sandflat species


Inter-species and species-habitat interactions are usually modelled as if they occur at a fixed scale; however, the strength of any interaction is likely to be a function of the scale at which different species experience their environment and also the scale at which observations are made. Aspects of the spatial scale that affect observations include grain (the size of the sampling unit), lag (the distance between samples) and extent (the area over which the observations are made). This study will assess the interactions observed between key sandflat species in different-sized windows of observation by exploring the effect of varying lag and extent on simple correlations and analysing cross-correlograms.

Prerequisites: R programming to extract different sized windows, basic GIS ability, at least one Biology course.

Contact: Professor Judi Hewitt (judi.hewitt@niwa.co.nz)

26. Simulating non-random community assemblage processes


Present meta-community theory suggests that the factors controlling how communities are assembled range from environmental filtering, source-sink dynamics (dispersal) and species interactions to stochastic demographic processes. This project would integrate the use of biological traits that represent dispersal capability, environmental selection and adult-juvenile-adult interactions into community assemblage theory. Monte Carlo simulations driven by biological traits would be used to build communities. The Metacom package would be used to assess whether the communities can be appropriately assigned to particular species assemblage mechanisms.

Prerequisites: understanding of Monte Carlo simulations, at least one biology course.

Contact: Professor Judi Hewitt (judi.hewitt@niwa.co.nz)

27. Investigating the efficiency of multivariate species distribution models to represent biodiversity


Correlative models (e.g. GLM, GAM, Boosted Regression Tree model, Random Forest model) are routinely used to link biodiversity observations at specific sites to the prevailing environmental conditions at those sites. Once biodiversity-environment relationships have been quantified, these can be used to make predictions in space and in time by projecting the model onto available environmental layers (termed Species Distribution Models (SDMs)). A recently developed multivariate extension to Random Forest modelling called Gradient Forest modelling predicts species turnover (the differentiation of species over space) rather than individual species’ distributions.

It is hypothesised that Gradient Forest models may better represent biological communities when sampling is limited compared to individual SDMs where taxonomic surrogacy would have to be assumed, i.e. that one species captures variation in another. This project would compare the ability of Gradient Forest models and individual SDMs to represent demersal fish communities under different sampling regime scenarios using a large biological dataset (> 60,000 trawls targeting demersal fish) and high-resolution environmental predictor variables (1 km grid resolution) across the New Zealand Exclusive Economic Zone.

Prerequisites: R programming (correlative models, GLM, GAM, BRT, RF; stratified and random sampling), basic GIS ability.

Contact: Professor Judi Hewitt (judi.hewitt@niwa.co.nz)

28. Stochastic models for biodiversity


This project will consider some stochastic models for biodiversity. In particular, some neutral models for extinctions and speciations using branching processes will be investigated mathematically, including the genealogical structure of reconstructed phylogenetic trees.
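A small simulation can give a feel for the simplest neutral model of this kind, a linear birth-death process in which each lineage speciates and goes extinct at constant rates. The sketch below is an assumption-laden illustration: the rates, the population cap used to declare survival, and the function names are all chosen for the example. It estimates the probability that a clade started from one lineage dies out, which for a birth rate above the death rate is known to be the ratio of death rate to birth rate.

```python
import random


def goes_extinct(birth_rate, death_rate, rng, cap=100):
    """Follow the jump chain of a linear birth-death process from one lineage.

    Every lineage has the same rates, so the next event is a birth with
    probability birth_rate / (birth_rate + death_rate) whatever the current size.
    A clade that reaches `cap` lineages is treated as having survived.
    """
    p_birth = birth_rate / (birth_rate + death_rate)
    n = 1
    while 0 < n < cap:
        n += 1 if rng.random() < p_birth else -1
    return n == 0


def extinction_probability(birth_rate, death_rate, n_runs, rng):
    """Monte Carlo estimate of the extinction probability."""
    extinct = sum(goes_extinct(birth_rate, death_rate, rng) for _ in range(n_runs))
    return extinct / n_runs


rng = random.Random(42)
estimate = extinction_probability(1.0, 0.5, 4000, rng)
print(estimate, "theory:", 0.5 / 1.0)
```

Comparing the simulated value with the theoretical ratio is an example of using simulation to exhibit a theoretical result.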

The project will include directed reading for any necessary background material in probability, such as Markov chains, branching processes, and Poisson processes. Computer simulations can optionally be used to exhibit typical behaviours and theoretical results.

Requirements: Basic probability (e.g. STATS 125) and some mathematics (e.g. proofs, limits) are essential. Some knowledge of stochastic processes is recommended (e.g. STATS 325, STATS 320).

Supervisor: Associate Professor Simon Harris

29. Mathematical Population Genetics


This project will present an introduction to mathematical population genetics and coalescent processes, including the Kingman coalescent. Coalescent processes provide models for the genealogies (family trees) that are constructed backwards in time from samples of the present population, with ancestral lineages merging together whenever they first share a common ancestry.
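The backwards-in-time merging described above is easy to simulate. In the standard Kingman coalescent, while k lineages remain, the waiting time to the next merger is exponential with rate k(k-1)/2, in suitably scaled time units. The sketch below, with an illustrative function name and sample size, simulates the time to the most recent common ancestor (TMRCA) and compares its average with the known expectation 2(1 - 1/n).

```python
import random


def kingman_tmrca(n, rng):
    """Time to the most recent common ancestor of n sampled lineages."""
    time = 0.0
    k = n
    while k > 1:
        rate = k * (k - 1) / 2.0       # each pair of lineages merges at rate 1
        time += rng.expovariate(rate)  # waiting time until the next merger
        k -= 1                         # two lineages become one
    return time


rng = random.Random(7)
n = 10
times = [kingman_tmrca(n, rng) for _ in range(20000)]
mean_tmrca = sum(times) / len(times)
print(mean_tmrca, "theory:", 2.0 * (1.0 - 1.0 / n))
```

Most of the expected time is spent waiting for the last two lineages to merge, which is why the expected TMRCA stays below 2 however large the sample.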

The student will follow a course of directed reading covering the necessary background material in probability (such as Markov chains), the Kingman coalescent, some related models, and some applications.

Requirements: Basic probability (e.g. STATS 125) and some mathematics (e.g. proofs, limits) are essential. Some knowledge of stochastic processes is recommended (e.g. STATS 325, STATS 320).

Supervisor: Associate Professor Simon Harris

30. Inhomogeneous branching Brownian motions


Brownian motion is a fundamental model in modern probability theory for the random diffusion of a particle, and can be thought of as the natural scaling limit of the well known probabilist’s simple random walk. Branching Brownian motions are population models in which each particle currently alive independently moves around in space as a diffusion, but also gives birth to offspring at random during its lifetime.

This project will investigate some inhomogeneous branching Brownian motions, where the motion, branching rates and death rates depend on current spatial position (or time) of the particles. Some fundamental questions include survival probabilities and how quickly the population colonises space given it survives. Probabilistic results about Branching Brownian motions can also yield results in mathematical analysis about corresponding reaction-diffusion equations (non-linear partial differential equations). For example, see Harris & Harris (2008), Berestycki, Brunet, Harris et al. (2010, 2017).

Requirements: Basic probability (e.g. STATS 125) and good mathematics (e.g. proofs, limits, calculus, differential equations) are essential. Good knowledge of stochastic processes or Markov chains is also strongly recommended (e.g. STATS 325, STATS 320).

Supervisor: Associate Professor Simon Harris

31. Visualisation and analysis of longitudinal data


In longitudinal data we have data on a set of entities (e.g. people) recorded at multiple time points (“repeated measures”). This allows us, for example, to explore how effects change over time. An example is Growing Up in New Zealand, a longitudinal study tracking the development of approximately 7,000 New Zealand children from before birth until they are young adults.

This project will investigate ways of visualising and analysing longitudinal data with a view to creating a Shiny app for these purposes. This is a good project for building data-science skills that should suit curious students with interests in computing and statistics. It is also an opportunity to learn valuable skills from the experienced members of the iNZight team.

Requirements: Good grades in STATS 330 and STATS 380

Contacts: Professor Chris Wild (c.wild@auckland.ac.nz) and Avinesh Pillai (a.pillai@auckland.ac.nz)

32. Text analytics


This project will involve researching and building easy-to-use tools for obtaining and analysing data that comes in the form of text, whether it be data scraped from Twitter feeds, blog posts and other social media, collections of emails, fan fiction, reports and other documents, or even novels. Some of the core questions people ask of data like this are: what are these people talking about, how do they feel about the issues under discussion, and how are these sentiments changing?

This is a good project for learning to analyse text data and for building data-science skills by leveraging the extensive capabilities already in R. It should suit students with interests in computing and statistics. It is also an opportunity to learn valuable skills from the experienced members of the iNZight team.

Requirements: Good skills in R programming

Supervisor: Professor Chris Wild (c.wild@auckland.ac.nz)

33. Analytics for date and time-stamped data


The subject of this project is working with data that has date and time fields telling us when things happened. The project involves developing wrangling, plotting and analysis capabilities for such data. Can we sensibly automate how plots handle times and dates in conjunction with other variables across many useful situations?

This is a good project for learning to work with time-stamped data and building data-science skills that should suit students with interests in computing and statistics. It is also an opportunity to learn valuable skills from the experienced members of the iNZight team.

Requirements: Good skills in R programming

Supervisor: Professor Chris Wild (c.wild@auckland.ac.nz)

34. Predictive analytics


This project will involve working with the iNZight development team on developing capabilities for building predictive models using automated ensemble tools such as TensorFlow and Caret, and simpler, more understandable tools such as regression models and classification & regression trees.

The project will also involve model diagnostics for generalised linear models and survival models. This is a good project for building data-science skills that should suit students with interests in computing and statistics. It is also an opportunity to learn valuable skills from the experienced members of the iNZight team.

Requirements: Very good grades in STATS 310 and 330 and good R-programming skills.

Supervisor: Professor Chris Wild (c.wild@auckland.ac.nz)

35. Interactive graphics with R


This project will involve researching and implementing interactive web graphs for R-generated plots, including 3-D plots. This includes calling back to R from the webpage for new information and updating plots without a complete redraw.

This is a good project for building data-science skills that should suit students with interests in computing and statistics. It is also an opportunity to learn valuable skills from the experienced members of the iNZight team.

Requirements: Good skills in R and JavaScript programming.

Supervisor: Professor Chris Wild (c.wild@auckland.ac.nz)

36. Data wrangling tools


This project will involve investigating and drawing lessons from the available interactive data wrangling systems and the R tidyverse tools in order to scope and develop a Shiny app. The app should not only make data wrangling easy for users but also write out the R code it is using, both to provide an audit trail and to aid reproducibility and the learning of data wrangling in R.

This is a good project for building data-science skills that should suit students with interests in computing and statistics. It is also an opportunity to learn valuable skills from the experienced members of the iNZight team.

Requirements: Very good R-programming skills. 

Supervisor: Professor Chris Wild (c.wild@auckland.ac.nz)

37. Visualisation of Multiple Response data from complex surveys


iNZight already has tools for displaying and analysing multiple-response data from random samples. This project will involve generalising these methods so that they work with data obtained from more general survey sampling designs.

Requirements: Very good grades in STATS 310 and 340 and very good R-programming skills.

Supervisor: Professor Chris Wild (c.wild@auckland.ac.nz)

38. Other topics


A student may wish to explore other data science topics with a view to later incorporation in iNZight, such as: analytics for images; developing tools for survival and longitudinal data analysis; tools for web scraping; providing capabilities for working with colour, including using colour metrics to choose palettes with maximally different colours; and developing Shiny apps that allow sensitive data to be analysed and displayed while the data itself is kept secure.

Requirements: Very good R-programming skills.

Supervisor: Professor Chris Wild (c.wild@auckland.ac.nz)
