Department of Statistics

Postgraduate research topics

Postgraduate students at the Department of Statistics have many research areas to choose from.

New MSc and BSc(Hons) research topics available for this year (2019)

For MSc and BSc(Hons) research topics that are not listed, contact Dr Brendon Brewer or see an updated list here. For other staff members, refer to the staff directory page.



Honours: Comparison of regression estimators for cluster-level data under two-phase designs

This project is a large simulation study comparing different estimation methods for clustered data under two-phase designs. We will generate a large population of interest with clustered, correlated data (for example, schools or clinics), then select samples using different sampling schemes. The estimation methods to be compared include weighted likelihood, conditional likelihood, maximum likelihood, and weighted likelihood with calibrated weights.
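As a minimal illustration of the kind of simulation involved (the cluster structure, the phase-two sampling rates, and the stratifier here are all invented for the sketch), clustered data can be generated and subsampled in two phases, with an inverse-probability weighted estimator correcting for unequal phase-two inclusion probabilities:

```python
import random
from statistics import mean

random.seed(1)

# Toy population: 50 clusters (e.g. schools) of 20 units each, with a
# shared cluster effect inducing within-cluster correlation.
population = []
for c in range(50):
    u = random.gauss(0, 1)                      # cluster effect
    for _ in range(20):
        population.append((c, u + random.gauss(0, 1)))

# Phase 1: a cheap stratifier is observed for everyone (here, sign of y).
# Phase 2: subsample with different probabilities by stratum.
def phase2_prob(y):
    return 0.1 if y > 0 else 0.5

sample = [(y, phase2_prob(y)) for _, y in population
          if random.random() < phase2_prob(y)]

# Inverse-probability (Horvitz-Thompson style) weighted mean.
est = sum(y / p for y, p in sample) / sum(1 / p for _, p in sample)
true = mean(y for _, y in population)
print(round(est, 2), round(true, 2))
```

A full study would replace the weighted mean with the competing likelihood-based estimators and repeat the sampling many times to compare their variances.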

By comparing these methods, the student will gain insight into the current state of this research field.

This project requires experience with R, simulation, and regression.

Supervisor: Claudia Rivera-Rodriguez


Honours: Optimal sample allocation for regression parameters

It is well known that the efficiency of parameter estimates depends largely on the sampling design, the information available in the population, and the criterion used for optimisation.

There are a number of reasons why investigators may want to stratify before a sample is collected. One is that stratification can offer gains in efficiency (smaller variances) when the target variable behaves differently between strata. Another is that estimates may be required for each stratum.

In both cases, investigators want an answer to the same question: what sample size for each stratum yields the smallest variance, given that the budget allows a total sample of size $n$?

The answer depends on several aspects, such as the target of estimation (here, regression parameters) and the stratifying variable. Most survey-sampling research has focused on estimating totals of variables, not regression parameters.
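For context, the classical answer when the target is a stratum-weighted mean or total is Neyman allocation, which samples each stratum in proportion to its size times its standard deviation; part of this project is asking how that answer changes when the target is a regression parameter. A minimal sketch (stratum sizes and standard deviations invented):

```python
def neyman_allocation(n, stratum_sizes, stratum_sds):
    """Neyman allocation: n_h proportional to N_h * S_h."""
    weights = [N * S for N, S in zip(stratum_sizes, stratum_sds)]
    total = sum(weights)
    return [round(n * w / total) for w in weights]

# Two equal-sized strata; the second is three times as variable,
# so it receives three times the sample.
print(neyman_allocation(100, [1000, 1000], [1.0, 3.0]))  # [25, 75]
```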

In this project we will evaluate the efficiency of different allocation methods for regression parameters. The project requires experience with R, simulation, and regression.


Supervisor: Claudia Rivera-Rodriguez

Honours or Masters: Comparing different estimation methods for misspecified models

This project consists of comparing the following estimation methods under correctly specified and misspecified models:

- Weighted likelihood

- Conditional maximum likelihood

- Weighted likelihood with estimated/calibrated weights

- Conditional maximum likelihood with estimated weights

* If time allows (Masters), the ultimate goal is to incorporate clustered correlated data and to propose robust variance estimators for some of the methods. This has already been done for weighted likelihood and weighted likelihood with calibrated/estimated weights.

The student should have good knowledge of R, sampling, and modelling.

Supervisor: Claudia Rivera-Rodriguez



Masters: Estimating dementia prevalence in South Auckland using survey sampling methods

Abstract: Dementia is a progressive syndrome that affects memory, thinking, orientation, comprehension, calculation, learning capacity, language, and judgement. It is caused mostly by neurodegenerative diseases such as Alzheimer's disease, or by stroke (WHO).

Dementia can be life changing for those who suffer from it and their families. The number of people with it has increased in recent years, and it is estimated that more than 50 million people worldwide have dementia (WHO). The estimated number of cases in New Zealand in 2016 was 62,287 (Deloitte report, 2017), but these figures were extrapolated from international and Australian data and do not take account of the unique ethnic diversity of New Zealand's population.
There is an urgent need to estimate dementia prevalence and to identify the main risk factors in New Zealand. Identifying these factors is crucial to inform health policies and planning, leading to better quality of life for New Zealanders and lower associated costs.

In Auckland, for example, Counties Manukau District Health Board (CMDHB) established The Memory Service in 2013 to improve services for people with memory problems, cognitive impairment and/or dementia. The service is offered at Middlemore Hospital in South Auckland, and data on different risk factors and demographic information are recorded for each registrant. Data from this service might be helpful in estimating the prevalence of dementia, albeit restricted to those people who are referred (estimated at only 50% of the total in countries such as the UK and USA) and therefore not fully representative of the community-based population.

The goal of this project is to use data collected at the CMDHB memory clinic at Middlemore Hospital from 2013 to 2016 to estimate the risk of dementia in South Auckland. This is statistically challenging for several reasons. First, people attending the service are likely to have some cognitive disorder, so dementia prevalence could be overestimated if the data are not used properly. Second, the clinic does not take referrals from the whole Counties Manukau geographical area: people who attend the memory service mostly live in areas close to Middlemore Hospital, not in Franklin or the Eastern localities. Third, some people with dementia are diagnosed in other services, such as inpatient wards or outpatient geriatric clinics at Middlemore Hospital, by community geriatric nurses, and occasionally by general practitioners; however, all people with dementia who require homecare services must have a formal diagnosis recorded in InterRai (available from CMDHB Health Informatics), so this may be a useful source of diagnostic data to supplement the memory service data.

We aim to collate the data from the different diagnostic services to estimate the recorded prevalence of dementia in South Auckland. We will use census data to make the memory service data representative of South Auckland, via calibration and estimation of weights using recent survey sampling methods. We also aim to compute measures of uncertainty and to construct informative confidence intervals that properly convey the results.

This project requires experience with R, the survey package, and complex survey sampling.

Supervisor: Claudia Rivera-Rodriguez

Stochastic models for pharmacodynamics

Intraoperative records for some 140,000 patients are available, including physiologic time-series data (heart rate, blood pressure, etc.) and clinician-entered drug administration data.
This project would aim to develop a stochastic model for the effect of an administered drug on physiology. An extension of the project could apply the same model to simulated data (mathematically modelled in commercially available software) and compare the simulation to real data. One possible area of interest is the effect of induction agents (propofol, for example) on blood pressure and heart rate. Covariates might include weight and dose of drug, patient acuity, resting blood pressure (before administration), and any other drugs administered concurrently.

Supervisor: Peter Mullins

Natural language processing of clinical data

Intraoperative records for some 140,000 patients are available, including clinician-entered comorbidity and event data. Some of the event data contain identifying descriptors for clinicians. This project will apply natural language processing techniques to anonymise the event data and to identify the comorbidities. Care must be taken to account for abbreviations (e.g., "T2DM" is diabetes) and negations (e.g., "not smoker").
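A toy sketch of the normalisation step involved (the abbreviation map and the one-token negation rule here are invented placeholders; a real project would need clinical vocabularies and a proper negation-scope model):

```python
import re

# Hypothetical lookup tables, for the sketch only.
ABBREVIATIONS = {"t2dm": "diabetes", "htn": "hypertension"}
NEGATORS = {"no", "not", "denies"}

def extract_comorbidities(note):
    """Map abbreviations to conditions, skipping negated mentions."""
    tokens = re.findall(r"[a-z0-9]+", note.lower())
    found, negated = [], False
    for tok in tokens:
        if tok in ABBREVIATIONS and not negated:
            found.append(ABBREVIATIONS[tok])
        negated = tok in NEGATORS   # one-token negation scope
    return found

print(extract_comorbidities("HTN, no T2DM, not smoker"))  # ['hypertension']
```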

Supervisor: Yalu Wen


Multiple imputation through statistical learning

This project aims to use statistical/machine learning techniques to build models for large-scale multiple imputation. At the straightforward end: boosted trees and random forests. A more ambitious student could look at a neural network technique, the variational autoencoder. Good R programming skills are required.
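The project targets machine-learning imputers, but the multiple-imputation mechanics can be sketched with a deliberately simple stochastic regression imputer standing in for the learned model (data and model invented for illustration): fit a predictor on the observed cases, create several completed datasets by drawing from the fitted predictive distribution, and pool the resulting estimates.

```python
import random
from statistics import mean

random.seed(7)
# Toy data: y depends linearly on x; about 30% of y is missing at random.
x = [random.gauss(0, 1) for _ in range(200)]
y = [2 * xi + random.gauss(0, 1) for xi in x]
miss = [random.random() < 0.3 for _ in x]

# Fit a simple regression of y on x using the observed cases only.
obs = [(a, b) for a, b, m in zip(x, y, miss) if not m]
xb = mean(a for a, _ in obs)
yb = mean(b for _, b in obs)
slope = (sum((a - xb) * (b - yb) for a, b in obs)
         / sum((a - xb) ** 2 for a, _ in obs))
sd = mean((b - yb - slope * (a - xb)) ** 2 for a, b in obs) ** 0.5

# Several completed datasets, each filling the gaps with a draw from
# the fitted predictive model; pool the per-dataset estimates.
pooled = mean(
    mean(b if not m else yb + slope * (a - xb) + random.gauss(0, sd)
         for a, b, m in zip(x, y, miss))
    for _ in range(5)
)
print(round(pooled, 2))   # pooled estimate of E[y]
```

Swapping the linear predictor for a boosted tree, random forest, or autoencoder changes only the fit-and-draw step; the surrounding multiple-imputation logic is the same.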

Supervisor: Thomas Lumley

Bus headway

The Twitter bot @tuureiti keeps track of how close Auckland's buses are to their timetabled schedules. For frequent services (such as the Inner Link, the Northern Express, the 70 along Gt South Rd, or the 27 along Mt Eden Rd) it is more important to keep track of the time between buses: if there is a bus every 15 minutes, people will typically just go to the bus stop when they are ready, and they care only about how long they have to wait. Having every bus 10 minutes late is better than having alternate buses on time and 10 minutes late. Transport nerds call this problem "maintenance of headway". The goal of the project is to work out a way to summarise headway from Auckland Transport's real-time data and to write software that computes and displays these summaries. Good R programming skills are required.
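The basic quantity is easy to compute from observed arrival times. A sketch (arrival times invented) showing why regular headways beat merely on-average-correct ones, using the standard result that a passenger arriving uniformly at random waits E[H²]/(2E[H]) on average:

```python
def headways(arrivals):
    """Gaps between consecutive bus arrivals (same units as input)."""
    return [b - a for a, b in zip(arrivals, arrivals[1:])]

def mean_wait(arrivals):
    """Mean wait for a passenger arriving uniformly at random:
    E[H^2] / (2 E[H]), which penalises irregular headways."""
    h = headways(arrivals)
    return sum(g * g for g in h) / (2 * sum(h))

even = [0, 15, 30, 45, 60]      # a bus every 15 minutes
bunched = [0, 5, 30, 35, 60]    # same average headway, irregular gaps
print(mean_wait(even), mean_wait(bunched))  # 7.5 vs ~10.8 minutes
```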

Supervisor: Thomas Lumley

Finding periodic climate cycles in a mud core

We have data from about 100,000 years of mud in Orakei Basin, and we would like to see whether we can detect periodic climate cycles such as the 11-year sunspot cycle. This project is not very well defined and would involve methods that are not available in prepackaged software. Knowledge of Fourier methods for time series would be helpful, and good programming skills would be important.

Supervisor: Thomas Lumley

Implementing count models for survey data

The R survey package does not have many models for count data implemented. This project would add negative binomial and zero-inflated Poisson models, and perhaps some others. You would need good R programming skills. Both STATS 330 and 340 would be good background for the project, though 340 is not strictly necessary. This blog post ( gives an idea of the sort of analyses the code would be implementing.

Supervisor: Thomas Lumley


The information cost of the Whittle Likelihood

In some time-series situations, we can fit a model to the data directly and obtain the posterior distribution for the parameters. However, sometimes it is easier and computationally cheaper to compute the periodogram of the data instead, and fit a model to that, using the “Whittle Likelihood”. In principle, information is lost when we do this, but how much? When does it matter? In this project you will answer these questions using the frameworks of Bayesian inference, information theory, and decision theory. You can also try some examples from the astrophysics of stellar oscillations.
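A minimal sketch of the pipeline (plain DFT, no external libraries, data invented): under the Whittle approximation the periodogram ordinates are treated as independent exponentials with mean equal to the spectral density, which for white noise is flat at the innovation variance.

```python
import cmath
import math
import random

def periodogram(y):
    """I_j = |sum_t y_t exp(-2*pi*i*j*t/n)|^2 / n at the non-zero,
    non-Nyquist Fourier frequencies (naive O(n^2) DFT)."""
    n = len(y)
    out = []
    for j in range(1, n // 2):
        d = sum(y[t] * cmath.exp(-2j * math.pi * j * t / n)
                for t in range(n))
        out.append(abs(d) ** 2 / n)
    return out

def whittle_loglik(I, spec):
    """Whittle log-likelihood: I_j ~ independent Exp(mean = spec_j)."""
    return -sum(math.log(s) + i / s for i, s in zip(I, spec))

random.seed(1)
y = [random.gauss(0, 2) for _ in range(256)]   # white noise, variance 4
I = periodogram(y)
sigma2_hat = sum(I) / len(I)   # Whittle MLE under a flat spectrum
print(round(sigma2_hat, 2))    # should land near the true variance, 4
```

The project's question is what is lost, relative to a likelihood on the raw series, by summarising the data this way.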

Requirements: Good grades in STATS 331, comfortable programming (I might get you to learn C++) and learning some mathematics which might be new to you.

Supervisor: Brendon Brewer


Magnifying the differences between statistical frameworks

In many circumstances, Bayesian outputs such as a posterior mean and standard deviation are numerically similar or identical to frequentist outputs such as an estimate and the standard deviation of its sampling distribution. This “coincidence” occurs when the model assumptions have certain symmetries which are often met in standard situations. Ed Jaynes suggested that when deciding between frameworks, we ought to zoom in on the cases that disagree, so that we see the differences more starkly. To do this, it would be helpful to have a systematic procedure for generating these situations (in the form of (model, parameter, dataset) triples), so we don’t just have to hope to stumble across them. In this project you will develop and test such a procedure based on the Nested Sampling algorithm.

Requirements: Good grades in STATS 310 and 331, and you should be comfortable programming.

Supervisor: Brendon Brewer



Statistics Education: The relationship between student engagement and student achievement

Students who are engaged are active participants in their learning. It seems reasonable to assume that students who are engaged throughout the semester are more successful than those who are disengaged.  However, how is student engagement defined? How is student engagement measured? This project would involve analysing data from a small undergraduate statistics course, over a number of years, to determine factors affecting student achievement.

It will be necessary to conduct a literature review on student engagement and achievement so skills in reading and critiquing research papers and in essay writing are important.

If you are interested in this project, please come and talk to me.

Requirements: A genuine interest in Statistics Education, good data-wrangling skills, proficiency in R.

Supervisor: Stephanie Budgett


Detecting ecological change along environmental gradients

This project would use skills in R programming to create a large ecological-environmental dataset from numerous New Zealand sources. Data would need to be extracted from regional Excel spreadsheets, a national marine environmental database, and climate data. Change-point analysis techniques would need to be adapted to determine break points in pairwise species co-occurrences (rather than a single species' abundance) along single and multiple environmental gradients.

45-point MSc topic attracting a fees scholarship

Contact Prof. Judi Hewitt


Simulating non-random community assemblage processes

Present meta-community theory suggests that the factors controlling how communities are assembled range from environmental filtering, source-sink dynamics (dispersal), and species interactions to stochastic demographic processes. This project would integrate biological traits that represent dispersal capability, environmental selection, and adult-juvenile-adult interactions into community assemblage theory. Monte Carlo simulations driven by biological traits would be used to build communities. The Metacom package would be used to assess whether the communities can be appropriately assigned to particular species assemblage mechanisms.

45-point MSc topic attracting a fees scholarship

Contact Prof. Judi Hewitt

A comparison of spatial capture-recapture and random encounter models for camera trap data

Camera-trap surveys are commonly used to estimate density of wildlife populations. Over the last decade, spatial capture-recapture (SCR) and random encounter models (REMs) have gained traction in their application to the resulting data. They each require slightly different information; for example, SCR usually needs individuals to be recognised when they are detected, while REMs usually require a priori knowledge of average animal speeds. The two methods also make different assumptions about the way animals move and behave. This project will aim to assess the performance of SCR and REM estimators in terms of properties such as precision and robustness to assumption violations.

Requirements: An interest in programming, with good grades in statistical computing papers.

Supervisor: Ben Stevenson


Parameter estimation for void point processes

Spatial point processes model the observed locations of objects or events. They can explain patterns in the locations of stars, animals, earthquakes, criminal acts, and terrorism attacks, to name just a few applications. Jones-Todd et al. (in press) described a new type of spatial point process, the void process, which was used to model spatial patterns formed by cancer and stroma cell nuclei on tissue slides from colon cancer patients. A void process comprises two types of points, Type A and Type B, which are uniformly scattered with intensities λA and λB, respectively. Type B points are not observed at all. Only Type A points that are greater than distance τ from all Type B points are observed. In other words, each hidden Type B point 'deletes' any Type A point within its 'void' of radius τ. Jones-Todd et al. (in press) proposed estimation of λA, λB, and τ via maximum Palm likelihood. Although this is a computationally efficient approach, it does not appear to be particularly statistically efficient. This project will develop a Bayesian (or possibly a maximum likelihood) estimator for void processes.
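The generative mechanism is straightforward to simulate, which is useful both for understanding the model and for testing candidate estimators. A sketch on the unit square (intensity and radius values invented):

```python
import math
import random

def poisson_sample(mu):
    """Knuth's method for a Poisson(mu) draw (fine for moderate mu)."""
    L = math.exp(-mu)
    k, p = 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

def simulate_void_process(lam_a, lam_b, tau):
    """Void process on [0,1]^2: Type A and (hidden) Type B points are
    independent Poisson processes; a Type A point is observed only if
    it lies farther than tau from every Type B point."""
    def scatter(lam):
        return [(random.random(), random.random())
                for _ in range(poisson_sample(lam))]
    a_pts, b_pts = scatter(lam_a), scatter(lam_b)
    return [p for p in a_pts
            if all(math.dist(p, q) > tau for q in b_pts)]

random.seed(3)
no_voids = simulate_void_process(100, 0, 0.1)      # nothing deleted
random.seed(3)                                     # same Type A pattern
with_voids = simulate_void_process(100, 200, 0.1)  # most A points deleted
print(len(no_voids), len(with_voids))
```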

Requirements: A good grade in STATS 331 or STATS 731. A good grade in STATS 310 might also be an advantage.

Reference: Jones-Todd, C. M., Caie, P., Illian, J. B., Stevenson, B. C., Savage, A., Harrison, D. J., and Bown, J. L. (in press) Identifying prognostic structural features in tissue sections of colon cancer patients using point pattern analysis. Statistics in Medicine.




Who's skating on thin ice? Evaluating the performance of hockey goaltenders in the NHL

North American sports are well known for their meticulous recordkeeping, and ice hockey is no different. Vast quantities of data are collected from the performance of every player who steps onto the ice during an NHL match. The goaltender (or 'goalie') is an important player who attempts to prevent the opposition from scoring by blocking shots directed at their goal. The save percentage (SP) is a key summary statistic that is commonly used to measure goaltender performance. However, comparison of the SP either between players (to assess who is better) or within a single player (to assess how performance changes over time) is typically achieved simply by seeing which is bigger, without any regard for sampling variation. This project aims to model goaltending performance in order to answer various questions of interest.

Requirements: At least two of the following would be beneficial, but a particularly good student could get away with fewer: (1) Experience with hierarchical models (or 'latent variable models' or 'mixed effects models'). (2) A good grade in statistical inference papers, such as STATS 310 and/or STATS 730. (3) Some experience using Template Model Builder (TMB). (4) A good understanding of ice hockey.

Supervisor: Ben Stevenson




“The genealogy of individuals chosen at random from a population”

The aim of this probability project is to investigate the genealogy of individuals who are picked at random at a large time from within a population of a branching process. In particular, the project will consider a Galton-Watson branching process with geometric offspring distribution. If a fixed number of individuals are picked at random from the population conditioned to survive then tracing their family tree back in time leads to a coalescent process. We aim to describe this coalescent process as far as possible.

The project will investigate the existing literature and recent progress in this area, as well as considering some open problems. Some computer simulations may also be produced to demonstrate typical behaviour and support any theoretical results. Any contrast between super-critical and sub-critical processes should be highlighted where possible, as should any interesting large time behaviour, especially near criticality.
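As a warm-up for such simulations (all settings invented), extinction of the geometric-offspring Galton-Watson process can be checked against the known extinction probability, which for offspring distribution P(k) = p(1-p)^k in the supercritical case is q = p/(1-p):

```python
import random

def offspring(p):
    """Geometric(p) on {0, 1, 2, ...}: mean (1 - p) / p."""
    k = 0
    while random.random() > p:
        k += 1
    return k

def extinct_by(generations, p, cap=1000):
    """Run one Galton-Watson process from a single ancestor and report
    whether it dies out; populations above `cap` are treated as
    surviving, since their extinction probability is negligible."""
    n = 1
    for _ in range(generations):
        n = sum(offspring(p) for _ in range(n))
        if n == 0:
            return True
        if n > cap:
            return False
    return False

random.seed(42)
p = 0.4   # mean offspring 1.5: supercritical
runs = 1000
q_hat = sum(extinct_by(50, p) for _ in range(runs)) / runs
print(round(q_hat, 3))   # theory: q = p / (1 - p) = 2/3
```

Recording the family tree of surviving lineages, rather than just population sizes, is the extra step needed to study the coalescent behaviour described above.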

The project will involve probability, branching processes, generating functions, and limits. It should be suitable for a student familiar with stochastic processes and with proofs. Stats 325 would be recommended.

Supervisor: Simon Harris


Text analytics

This project will involve researching and building easy-to-use tools for obtaining and analysing data that come in the form of text, whether scraped from Twitter feeds, blog posts and other social media, collections of emails, fan fiction, reports and other documents, or even novels. Core questions for data like this include: what are these people talking about, how do they feel about the issues under discussion, and how are these sentiments changing? This is a good project for learning to analyse text data and building data-science skills, leveraging extensive capabilities already in R, and should suit students with interests in computing and statistics. It is also an opportunity to learn valuable skills from the experienced members of the iNZight team.

Requirements: Good skills in R programming.

Supervisor: Chris Wild

Analytics for date and time-stamped data

This project involves working with data that have date and time fields telling us when things happened, and developing wrangling, plotting, and analysis capabilities for such data. Can we sensibly automate how times and dates are handled by plots, in conjunction with other variables, in many useful situations? This is a good project for learning to work with time-stamped data and building data-science skills that should suit students with interests in computing and statistics. It is also an opportunity to learn valuable skills from the experienced members of the iNZight team.

Requirements: Good skills in R programming.

Supervisor: Chris Wild

Predictive analytics

This project will involve working with the iNZight development team on developing capabilities for building predictive models using automated ensemble tools such as TensorFlow and Caret, and simpler more understandable tools such as regression models and classification & regression trees. The project will also involve model diagnostics for generalised linear models and survival models. This is a good project for building data-science skills that should suit students with interests in computing and statistics. It is also an opportunity to learn valuable skills from the experienced members of the iNZight team.

Requirements: Very good grades in Stats 310 and 330 and good R-programming skills.

Supervisor: Chris Wild

Interactive graphics with R

This project will involve researching and implementing interactive web graphics for R-generated plots, including calling back to R from the webpage for new information and updating plots without complete redrawing, as well as 3-D plotting. This is a good project for building data-science skills that should suit students with interests in computing and statistics. It is also an opportunity to learn valuable skills from the experienced members of the iNZight team.

Requirements: Good skills in R and JavaScript programming.

Supervisor: Chris Wild

Data wrangling tools

This project will involve investigating and drawing lessons from available interactive data-wrangling systems and the R tidyverse tools, to scope and develop a Shiny app that not only makes data wrangling easy for users but also writes the R code it is using, to provide an audit trail and to aid reproducibility and learning data wrangling in R. This is a good project for building data-science skills that should suit students with interests in computing and statistics. It is also an opportunity to learn valuable skills from the experienced members of the iNZight team.

Requirements: Very good R-programming skills.

Supervisor: Chris Wild

Visualisation of Multiple Response data from complex surveys

iNZight already has tools for displaying and analysing multiple-response data from random samples. This project will involve generalising these methods so that they work with data obtained from more general survey sampling designs.

Requirements: Very good grades in Stats 310 and 340 and very good R-programming skills.

Supervisor: Chris Wild

Other topics

A student may wish to explore other data science topics with a view to later incorporation in iNZight such as: analytics for images; developing tools for survival and longitudinal data analysis; tools for webscraping; providing capabilities with colour including using colour metrics to choose colour palettes with maximally-different colours; developing Shiny apps for allowing sensitive data to be analysed and displayed with the data itself kept secure.

Requirements: Very good R-programming skills.

Supervisor: Chris Wild


New PhD research topics available

For newly available PhD topics, please contact our PhD enrolment officers, as supervisors are willing to consider different topics to match suitable PhD applicants:

Rachel Fewster

Renate Meyer

Geoffrey Pritchard

Other PhD enrolment officers


Previous or on-going PhD research topics

Below are some examples of PhD research topics that have been or are still being offered.

Topic (Supervisor)

- Modelling detection counts from wildlife surveys using spatial capture-recapture (Ben Stevenson)
- Using statistics to test cricketing superstitions (Brendon Brewer)
- A picture tells a thousand words, and so can statistical graphs (Paul Murrell)
- Exponentially weighted moving average control charts (Arden Miller)
- Modelling and optimal control of queueing networks (Ilze Ziedins)
- Monte Carlo methods for LISA data analysis (Renate Meyer)
- Making the link between investigative questions and conclusions in the statistical enquiry cycle (Maxine Pfannkuch)
- Multivariate predictors of diabetes and CVD (Patricia Metcalf)
- Mixture density estimation under shape restrictions (Yong Wang)
- Efficient analysis with biased samples (Chris Wild)
- Quantitative models for ecological data (Russell Millar)
- Some problems concerning the generalized hyperbolic distribution (David Scott)
- Reconstruction of probability distributions in population genetics (Rachel Fewster)
- A statistical computation environment built on Common Lisp (Ross Ihaka)
- Wind power in the New Zealand electricity market (Geoffrey Pritchard)
- Statistics applied to statistical biology, multivariate modelling and inference (Thomas Yee)
- Analysis and design of outcome-dependent sampling studies using semiparametric maximum likelihood (Alan Lee)
- Assessing biological and statistical violations in the analysis of conservation datasets (James Russell)