Department of Statistics

Postgraduate research topics

Postgraduate students at the Department of Statistics have many research areas to choose from.

New MSc and BSc(Hons) research topics available for this year

For MSc research topics that are not listed, contact Dr Brendon Brewer. For other staff members, refer to the staff directory page.


Exploring fishing power

Fishing power provides a measure of vessel efficiency and can be used to explain why some vessels catch more than others. Commercial catch and fishing effort data are often used by fisheries scientists to estimate fish abundance indices (catch per unit effort), but these analyses often struggle to explain the vessel effects they identify. This project will examine a range of commercial fisheries catch data to identify the factors that affect a vessel's fishing power.


Dr Ian Tuck


Who's skating on thin ice? Evaluating the performance of hockey goaltenders in the NHL

In ice hockey, the goaltender (or 'goalie') is an important player who attempts to prevent the opposition from scoring by blocking shots directed at their goal. The save percentage (SP) is a key summary statistic that is commonly used to measure goaltender performance. However, comparison between players (to assess who is better) or within a single player (to assess how performance changes over time) is typically achieved simply by looking at the relevant SPs without any regard for sampling variation. This project aims to model goaltender data in order to gain better insight into performance. A considerable amount of the project may involve the development of an automated way to obtain the relevant data from online sources.
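As a toy illustration of the sampling-variation point (a Python sketch with invented save counts, not real NHL data), a simple Wilson score interval shows that two goaltenders with different raw SPs can be statistically indistinguishable:

```python
import math

def wilson_interval(saves, shots, z=1.96):
    """Wilson score interval for a save percentage (a binomial proportion)."""
    p = saves / shots
    denom = 1 + z ** 2 / shots
    centre = (p + z ** 2 / (2 * shots)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / shots + z ** 2 / (4 * shots ** 2))
    return centre - half, centre + half

# Hypothetical goaltenders: the raw SPs differ (0.922 vs 0.911), but the
# intervals overlap, so the naive comparison is inconclusive.
lo_a, hi_a = wilson_interval(830, 900)
lo_b, hi_b = wilson_interval(820, 900)
print(lo_a < hi_b and lo_b < hi_a)  # True: the intervals overlap
```

A model-based analysis, as proposed in this project, would quantify such uncertainty properly rather than comparing point estimates.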

Prerequisites: Some of the following would be advantageous: (1) Some experience in scraping data from the web, (2) good grades in statistical inference papers (e.g., STATS 310 and/or STATS 730), (3) a decent understanding of ice hockey.


Ben Stevenson


Optimising trawl survey spatial stratification

This project will examine species distribution and spatial stratification within New Zealand trawl surveys, and use different stratification approaches to determine whether current stratification is optimal, or whether it can be improved, given the survey objectives.


Dr Ian Tuck

Modelling the scale of spatial interactions between key sandflat species

Inter-species and species-habitat interactions are usually modelled as if they occur at a fixed scale; however, the strength of any interaction is likely to be a function of the scale at which different species experience their environment and the scale at which observations are made. Aspects of spatial scale that affect observations include grain (the size of the sampling unit), lag (the distance between samples) and extent (the area over which the observations are made). This study will assess the interactions observed between key sandflat species in different sized windows of observation by exploring the effect of varying lag and extent on simple correlations and analysing cross-correlograms.


Dr Judi Hewitt


q-type functions

The VGAM R package estimates many distributions by full maximum likelihood estimation (see the R CRAN task view on distributions). Associated with these distributions are d-, p-, q-, and r-type functions, e.g., dunif(), punif(), qunif(), runif(). The aim of this project is to write many q-type functions that are currently unimplemented. Special attention is needed to make the computations fast and reliable, without excessive memory requirements. This project would be of interest to those with a numerical analysis background and strong R programming skills.
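As a rough illustration of what a generic q-type implementation involves (a Python sketch, not VGAM code): when no closed form for the quantile exists, the p-type function can be inverted numerically, for example by bisection on a bracketing interval:

```python
import math

def qnorm_bisect(p, tol=1e-10):
    """Quantile of the standard normal via bisection on its CDF.
    A slow but reliable fallback for q-type functions when only
    the p-type (CDF) function is available in closed form."""
    def pnorm(x):
        return 0.5 * (1 + math.erf(x / math.sqrt(2)))
    lo, hi = -10.0, 10.0  # brackets the quantile for any non-extreme p
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if pnorm(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(round(qnorm_bisect(0.975), 4))  # 1.96, matching R's qnorm(0.975)
```

Production implementations typically use faster schemes (Newton steps seeded by an approximation), which is precisely where the speed and reliability concerns of this project arise.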

Prerequisites: STATS 310, numerical computing skills and good R programming skills


Thomas Yee


Multinomial Logit Model

The aim of this project is to improve the multinomial() family function in the VGAM R package. The multinomial logit model is the standard model for regressing a nominal categorical response against a set of explanatory variables. It can suffer from numerical problems with sparse data; however, bias reduction can be a solution for this (Ding and Gentleman, JCGS, 2005). One task is to implement this within the function. We could also write functions to conduct a score test, as well as the Hausman-McFadden test for independence of irrelevant alternatives (IIA). Time permitting, another useful feature would be to handle the nested multinomial logit model, although this would be quite a challenge. This project would suit a student with good R programming skills who has done STATS 310 and STATS 330.

Prerequisites: STATS 310, STATS 330 and R programming skills


Thomas Yee


Analysis of Fly Fishing Data

This project ideally needs somebody who is familiar with (freshwater) fly fishing and has good R programming skills. Much data processing is first required to get the data into shape, and it is essential to know about the main fly fishing techniques such as nymphing, wetlining, dry fly, etc. The data were collected by the Department of Conservation over a long period of time, and this work is in collaboration with a DOC scientist.

Prerequisites: knowledge about fly fishing, STATS 330 and good R programming skills


Thomas Yee


Expected Information Matrices

This project is suitable for a student interested in mathematical statistics. The aim is to derive the expected information matrices (EIMs) for various statistical distributions and/or to search the literature for existing derivations. One can also derive EIMs for certain variants of some distributions, such as those allowing for 0-inflation or 1-inflation. We can also look at survival distributions having certain censoring mechanisms. Ideally the student has a good grade in STATS 310 or equivalent, and has a sound mathematical background.

Prerequisites: STATS 310 and some maths papers


Thomas Yee


Bias Reduction

The VGAM R package implements bias reduction for some special cases of generalized linear models (GLMs). They appear as family functions with an argument called 'bred'. This project will involve simulations and developing new implementations of bias reduction for other types of GLMs, including the multinomial logit model. This project could be of interest to those with a STATS 310 and 330 bent. Some maturity in mathematical statistics is needed.

Prerequisites: STATS 310 and some R programming


Thomas Yee

How healthy are NZ diets? Using sales and nutrient data to examine the nutrient quality of food purchases made by New Zealand households

Ultra-processed foods are formulations made from cheap industrial sources of energy and nutrients plus additives, using a series of processes. Such foods are generally very palatable, encourage overconsumption, and contain fewer beneficial nutrients than fresh foods. As such, categorising foods by their level of processing has recently become popular as a way to describe their contribution to health. Reducing intake of processed foods is starting to become a focus of dietary guidelines, including in New Zealand.

This project will complete a systematic review of existing literature, work on unique sources of data, develop an analysis plan, and potentially contribute to a peer-reviewed journal article.


Prerequisites: SAS or Stata

Primary Supervisor: Dr Helen Eyles

Secondary Supervisors: Wilma Waterlander and Yannan Jiang


Robustness of spatial capture-recapture models

Spatial capture-recapture (SCR) models can be used to estimate animal density from various kinds of detection data. SCR is known to be remarkably robust to various model misspecifications, but this has usually been shown for surveys that deploy a large number of traps. Acoustic detection is becoming a popular means of monitoring animal populations, and SCR methods can be applied to the resulting data. However, these surveys often only deploy around 2--6 recorders. This project will investigate whether the same robustness results hold once we consider detector arrays of this scale. This work has potential applications to acoustic data of frogs in South Africa and/or whales in the Pacific Ocean.

Prerequisites: An interest in programming.


Ben Stevenson


Penalty Analysis

In market research, consumers are frequently used in the sensory analysis of consumer products. Take beer as an example. Each subject considers a small subset of the total number of beers being tested. For each beer that they taste, however, they have to score several attributes, e.g. colour, sweetness, bitterness, aftertaste, overall liking. Often a subject will be asked to score the sample of beer on 15 to 20 attributes. Some of these attributes are purely subjective, for example overall liking, but for others, e.g. sweetness, the subject will be asked to score not only the sweetness of the particular sample of beer being tasted, but also their perception of the ideal sweetness for that type of beer. There is substantial variation between tasters, so it is necessary to use a properly designed experimental protocol, which takes into account how often each beer is tasted, the number of times it is tasted first, second, third, and so on, as well as the number of times each pair of beers is tasted together. The data for the sensory attributes and the ideals require full statistical analysis to adjust for between-taster variability.

In penalty analysis, for each attribute with an ideal (e.g. sweetness), the ideal scores are also analysed and a confidence interval around the mean ideal sweetness is developed. For each beer, the tasters are split into three groups: too sweet, just right, and not sweet enough. Within each group the mean overall liking score is found, and the means of the three groups are compared. If the mean overall liking of the too sweet group is not very different from that of the just right group, the brewer will not experience too great a penalty if the beer is judged to be too sweet.
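A minimal unadjusted version of this calculation can be sketched in Python (the panel data are invented for illustration, and tasters are classified simply by comparing their ideal to their own sweetness score; the project's real analysis would also adjust for between-taster variation):

```python
import statistics

# Hypothetical panel data for one beer: each tuple is
# (taster's ideal sweetness, taster's sweetness score for this beer,
#  taster's overall liking).
panel = [
    (6, 7, 7), (5, 7, 6), (7, 7, 8), (6, 6, 8),
    (8, 7, 9), (5, 6, 6), (7, 8, 7), (6, 8, 5),
]

groups = {"too sweet": [], "just right": [], "not sweet enough": []}
for ideal, scored, liking in panel:
    if scored > ideal:
        groups["too sweet"].append(liking)
    elif scored < ideal:
        groups["not sweet enough"].append(liking)
    else:
        groups["just right"].append(liking)

for name, likings in groups.items():
    if likings:
        print(f"{name}: n={len(likings)}, mean liking={statistics.mean(likings):.2f}")
```

Comparing these raw group means is exactly the step whose sensitivity to between-taster variation this project would measure.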

At present these three liking scores are not adjusted for between-taster variation. The purpose of this project is to measure the magnitude of the effect of between-taster variation on penalty analysis. The project will require a student who is a confident R user, who has passed STATS 330, and preferably STATS 340.


Chris Triggs


The identification of key biological processes for cancer progression using bioinformatics tools

Benefiting from high-throughput biotechnologies, high-dimensional genetic data sets that include molecular profiles and clinical outcomes for various types of cancer have become available. However, translating these multi-dimensional resources into clinical practice remains challenging. Moreover, emerging evidence suggests that there are cancer-specific prognostic variations that can be used to classify cancers and guide therapeutic strategies. This project aims to identify key biological processes driving cancer development and progression using various bioinformatics tools and databases.

Prerequisites: R programming


Yalu Wen


Multi-level Omic data analysis

Biotechnological advances have expanded the breadth of available high-dimensional data from genomic data to extensive epigenomic, transcriptomic and proteomic data. Effectively analysing these data can facilitate the identification of accurate models that predict outcomes and generate important insights into the underlying genetic mechanisms of various complex traits. There is a strong need for powerful and advanced analysis strategies that can fully make use of these comprehensive high-dimensional data. This project will explore the emerging approaches for data integration and use the appropriate method to analyse data from the Alzheimer's Disease Neuroimaging Initiative.

Prerequisites: R programming, STATS 20x, and STATS 310.


Yalu Wen


Zoning in on pressure: making the measurements count

Between 30% and 60% of women will experience urinary incontinence during their lifetime. This condition can lead to depression, social isolation and loss of productivity. The solution is well-understood: pelvic floor muscle exercise, when performed correctly and for long enough, is a conservative yet effective treatment. However, the problem is that women do not do these exercises, nor do they engage with their pelvic floor health. Members of the Pelvic Floor Research Group at the Auckland Bioengineering Institute are developing innovative medical devices in order to help address this problem.

The team has recently developed a prototype sensor array (FemFit) which is capable of:

• Measuring the pressure profile

• Distinguishing between abdominal pressure and pelvic floor muscle pressure

• Providing the user with real-time feedback about the effectiveness of their activity via a smartphone app




The next step is clinical validation of the device. Clinical trials in women with and without urinary incontinence are about to commence in Montreal, Canada. Appropriate statistical analysis will be crucial to determine the effectiveness of the FemFit in measuring pressure changes pre- and post-pelvic floor muscle training. If you are interested in working with real data, interacting with people in other disciplines, and have a strong statistical and computing skillset, then this is the project for you. This is a long-term study and suitable as an initial honours year project, but with the potential to develop into a master's project.


Stephanie Budgett (Department of Statistics)

and Jenny Kruger (ABI)


Towards better measures of income in social surveys: An empirical investigation of measurement error and missing data using the Integrated Data Infrastructure.

This project involves working with the New Zealand Census and the Integrated Data Infrastructure. The Integrated Data Infrastructure is a collection of de-identified administrative data sets that have been linked at the person level for the whole of the New Zealand population, and made available for use by researchers under strict conditions which protect individuals' privacy and confidentiality. The accuracy of income estimates from social surveys is compromised by low response rates to questions about income, and by response error in the sources of income and the amount of income reported.

This project involves making comparisons between income recorded in the Inland Revenue tax data, and self-reported income recorded in the census. The project involves highlighting discrepancies between self-reported income and income measured on tax records, testing estimates from multiple imputation models in the census against observed income in the Inland Revenue tax data, and creating bias adjustments to align the self-reported income in the Census to income reported on tax records.

The results of this project will be: 1) estimates of the discrepancies between self-reported and collected data on income, with the magnitude and direction of discrepancies investigated across social groups; 2) best practice for imputing missing self-reported income, based on tests of different imputation models and methodologies; and 3) bias-adjustment weights that align reports of income in the census with official records from the Inland Revenue. Time permitting, this will develop into a paper to be submitted to an academic journal with the student named as an author.

The student working on this project will need to have excellent data analysis skills, be willing to learn Stata, and be prepared to work under Statistics New Zealand's strict privacy and confidentiality conditions.


Nichola Shackleton, Barry Milne


Creating intergeneration data using the Integrated Data Infrastructure

The Integrated Data Infrastructure (IDI) is a collection of de-identified administrative datasets (e.g., on health events, justice contacts, education enrolments, tax paid) that have been linked at the person-level for the whole New Zealand population, and made available for use by researchers under strict conditions which protect individuals' privacy and confidentiality. The datasets linked cover different time frames, most going back only as far as the 1980s or 1990s. However, the Department of Internal Affairs data includes birth information dating from the 1840s, with details about the child (e.g., gender, birth weight, place of birth) and their parents (e.g., age, occupation). Intriguingly, both the child and their parents are given IDs, allowing parents born in New Zealand to be linked back to their own birth records and to details of their parents, and (potentially) to the parents of their parents, etc. This project aims to make use of these intergenerational links to answer the following questions: (i) how many generations can be determined, and what is the total number at each generation?; (ii) what is the pattern of intergenerational geographic mobility, does this differ by ethnicity, and how has it changed over time?; and (iii) for disorders known to be affected by paternal age (e.g., autism, schizophrenia), is there an intergenerational effect (e.g., an effect of grandfather's age, or of the age of generations further back)?

The student working on this project will need to have excellent SAS skills, be willing to learn Stata and occupation coding, and be prepared to work under Statistics New Zealand's strict privacy and confidentiality conditions.


Dr Barry Milne, Dr Nichola Shackleton, Dr Andrew Sporle, Dr Dan Exeter.


How many glasses of wine can you taste?

For taste-testing experiments (or for other human senses as well as taste, think of smell or touch), subjects can only assess a small number of products. This number is often three or four, but may be as low as two: if you want to assess perfumes at the same time, you have only two wrists on which perfume can be sprayed. When designing experiments for sensory analysis where we need to compare several different products, incomplete block designs have to be used. The subjects are the blocks, and we need to take account of subject-to-subject differences.

This project will study incomplete block designs for factorial experiments, that is, where the different treatments depend on several factors. Here is an example from wine making. We have chardonnay grapes grown on five vineyards in the same district and harvested at the same ripeness. Each of the five batches of grapes is split and fermented under six slightly different conditions. After fermentation the thirty (five times six) different wines are stored for the same length of time before tasting. Each taster will assess only three of the wines. How do we choose the sets of three wines so that the differences between vineyard and fermentation technique, and their interaction, are estimated as precisely as possible? How do we minimise the effect of taster-to-taster variation?

To take on this project you will need to have passed STATS 310, or equivalent, e.g. STATS 732, and be able to program in R and use R packages.


Professor Chris Triggs


Crash-testing Nested Sampling

Nested Sampling is a powerful Monte Carlo algorithm used in Bayesian inference and statistical mechanics. However, like all such algorithms, it can get quite slow on big problems with lots of unknown parameters and complex structure in the shape of the posterior distribution. In this project you will investigate how the accuracy of Nested Sampling depends on the properties of the problem, and what tweaks to the algorithm might help or hurt. The goal is to come up with a clear set of recommendations for practitioners.
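A toy version of the algorithm (a Python sketch under strong simplifying assumptions: a one-dimensional problem, a uniform prior, and rejection sampling for the constrained draws, which real implementations replace with MCMC) shows the basic machinery whose accuracy this project would probe:

```python
import math
import random

random.seed(1)

def loglike(theta):
    # Unit-Gaussian likelihood centred at 0.
    return -0.5 * theta ** 2 - 0.5 * math.log(2 * math.pi)

def prior_sample():
    return random.uniform(-5.0, 5.0)  # Uniform(-5, 5) prior

def logaddexp(a, b):
    if a == -math.inf:
        return b
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

N, steps = 100, 600
live = [prior_sample() for _ in range(N)]
logZ, log_X_prev = -math.inf, 0.0
for i in range(1, steps + 1):
    worst = min(live, key=loglike)
    log_Lstar = loglike(worst)
    log_X = -i / N  # deterministic approximation to the prior-mass shrinkage
    log_w = log_X_prev + math.log1p(-math.exp(log_X - log_X_prev))
    logZ = logaddexp(logZ, log_Lstar + log_w)
    log_X_prev = log_X
    # Replace the worst live point with a new prior draw satisfying L > L*.
    while True:
        cand = prior_sample()
        if loglike(cand) > log_Lstar:
            live[live.index(worst)] = cand
            break

# Add the contribution of the remaining live points.
log_Ls = [loglike(t) for t in live]
m = max(log_Ls)
log_mean_L = m + math.log(sum(math.exp(l - m) for l in log_Ls) / N)
logZ = logaddexp(logZ, log_mean_L + log_X_prev)

print(logZ)  # true log-evidence here is log(0.1), about -2.30
```

How the estimate's accuracy degrades as the dimension grows and the posterior becomes multimodal is exactly the kind of question the project would study systematically.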

Prerequisites: You’ll need good grades in 331 and/or 731, and be comfortable with programming, including learning a bit about some languages other than R.


Dr Brendon Brewer


Lazy Ambiguous Single Transferable Vote

The Single Transferable Vote system used in NZ local elections allows voters to rank all the candidates or only their most-preferred subset, with ballots being valid up until the first missing or ambiguous ranking. Sometimes voters care more about their *last* preferences -- for example, opposing anti-fluoridation or anti-vaccination candidates for District Health Boards -- so it would be nice to allow incomplete last preferences. This project will look at the properties of a voting system designed to include top and bottom preferences and maintain the "don't care" property for candidates who are not ranked. It will also compare a full implementation to simple approximations feasible for hand counting.

Prerequisites: None, though some programming ability will be important


Thomas Lumley


Approximations to the distribution of a large quadratic form

For an important family of hypothesis tests in modern genomics, the (asymptotic) null distribution is a quadratic form in Normal random variables. There are several published methods for computing tail probabilities, but they have mostly been evaluated for moderate p-values and small quadratic forms. In modern DNA sequence analysis the quadratic form may involve thousands or tens of thousands of variables, and p-values near machine epsilon are of interest. This project will compare some of the approximations on both accuracy and computational time.
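For small problems and moderate p-values, the approximations can be checked against brute-force simulation; a Python sketch (using an arbitrary two-term example where the exact answer is known) illustrates the setup, and also why Monte Carlo cannot reach tail probabilities near machine epsilon:

```python
import math
import random

random.seed(0)

# Q = sum of lambda_i * Z_i^2 with Z_i iid N(0,1). With both lambda_i = 1,
# Q is chi-squared with 2 df, so P(Q > q) = exp(-q/2) exactly, giving a
# check on the Monte Carlo estimate.
lambdas = [1.0, 1.0]
q = 6.0
n = 200_000
hits = sum(
    sum(lam * random.gauss(0, 1) ** 2 for lam in lambdas) > q
    for _ in range(n)
)
mc = hits / n
exact = math.exp(-q / 2)        # about 0.0498
print(abs(mc - exact) < 0.005)  # True: simulation agrees with the exact tail
```

For a tail probability near machine epsilon (around 1e-16), this approach would need on the order of 1e18 simulations, which is why fast and accurate analytic approximations are the subject of the project.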

Prerequisites: 310, a course in linear algebra, programming experience

[This project could be extended to a research MSc for a suitable student. Alternatively, it would also be suitable for a single semester]


Thomas Lumley


How are descriptive statistics for a social network affected by ascertainment error?

There are many popular summary statistics for whole social networks and individual nodes: degree distribution, connected component size, centrality and eigenvalue centrality, betweenness, clique size.  This project will look at how much these are affected by false positives or false negatives in whether two nodes are actually linked.
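A small simulation of the kind of perturbation involved (a Python sketch with arbitrary error rates and a random graph standing in for an observed network; the project would examine the full set of summary statistics):

```python
import random

random.seed(2)

n, p = 50, 0.1
# An Erdos-Renyi random graph as a stand-in for an observed social network.
true_edges = {(i, j) for i in range(n) for j in range(i + 1, n)
              if random.random() < p}

def perturb(edges, fp=0.02, fn=0.10):
    """Apply false positives/negatives independently to each node pair."""
    noisy = set()
    for i in range(n):
        for j in range(i + 1, n):
            if (i, j) in edges:
                if random.random() >= fn:    # edge kept (no false negative)
                    noisy.add((i, j))
            elif random.random() < fp:       # spurious edge (false positive)
                noisy.add((i, j))
    return noisy

def mean_degree(edges):
    return 2 * len(edges) / n

observed = perturb(true_edges)
print(mean_degree(true_edges), mean_degree(observed))
```

Even a small false-positive rate inflates the mean degree noticeably, because non-edges vastly outnumber edges in a sparse network; the project would quantify analogous effects on centrality, betweenness, component sizes and clique sizes.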

Prerequisites: 310, a course in linear algebra, programming experience


Thomas Lumley


Sampling bias in exponential random graph models

Exponential Random Graph Models are popular for describing the structure of social networks.  However, maximum likelihood for these models is inconsistent under even simple random sampling.  This project will investigate the bias and attempt to correct it with weighted estimation.

[This project could be extended to a research MSc for a suitable student. Or extended even further to a PhD]

Prerequisites: 310, 340 or 740, substantial programming experience


Thomas Lumley


Evaluating Methods of Assessing "Optimism" in Regression Models

A fitted regression model usually fits the data it was based on better than new data. This problem is referred to as model "optimism" and is exaggerated when a model selection process has been used to identify the fitted model. A number of procedures (including cross-validation and bootstrapping) have been suggested to measure and correct for optimism. This project will evaluate the effectiveness of these procedures using an extensive simulation study.
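The effect itself is easy to demonstrate by simulation; assuming simple linear regression with Gaussian noise (an illustrative Python sketch, not the project's intended study), training error is systematically smaller than error on fresh data:

```python
import random
import statistics

random.seed(3)

def fit_line(xs, ys):
    """Ordinary least squares for a straight line."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

def mse(a, b, xs, ys):
    return statistics.mean((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

def simulate(n=30):
    xs = [random.uniform(0, 1) for _ in range(n)]
    ys = [2 * x + random.gauss(0, 1) for x in xs]
    a, b = fit_line(xs, ys)                       # fit on training data
    xs2 = [random.uniform(0, 1) for _ in range(n)]
    ys2 = [2 * x + random.gauss(0, 1) for x in xs2]
    return mse(a, b, xs2, ys2) - mse(a, b, xs, ys)  # the optimism

gaps = [simulate() for _ in range(2000)]
print(statistics.mean(gaps) > 0)  # True: new-data error exceeds training error on average
```

Cross-validation and bootstrap procedures aim to estimate this gap from the training data alone, and the project would assess how well they do so, especially after model selection.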

This project would be suitable for a student who is interested in regression models and simulation studies. The simulations will be done using R and will require writing R programs.

Requirements: Should have obtained a good grade (B+ or better) in either STATS 330 or STATS 762.


Arden Miller


Mobile detectors and point responses in Rapid Eradication Assessment

Rapid eradication assessment (Russell et al., Journal of Applied Ecology) allows pest eradication managers to rapidly assess the probability of successful eradication on islands using statistical models. This project will develop the basic mathematical model to allow for greater flexibility, including the incorporation of mobile detectors through line transect sampling (conservation dogs) and point responses (known incursion detections).

Requirements: Students should be competent programming in R, familiar with Shiny, independent and motivated, and have interests in survey design, animal detection models and conservation. The student needs a GPA above 7, and can expect a high-impact scientific publication as an outcome of the project, as well as opportunities to join ecological fieldwork. This would suit a one-year project rather than a six-month one.


James Russell


Mathematical annotation in 'gridSVG'

The 'gridSVG' package can be used to convert an R plot to SVG format, which is useful for embedding the plot in a web page, adding interactive features to the plot, and improving the accessibility of the plot. The conversion to SVG can be tricky for any mathematical formulas in the R plot (see ?plotmath in R).  This project will assess the current support in 'gridSVG' for converting mathematical annotation to SVG (including web browser support), which is far from perfect, and investigate ways to improve the situation.

Pre-requisites: This project requires an interest in programming and web technologies, plus demonstrated competence in these areas (e.g., good grades in statistical computing papers such as 220, 380, 769, 779, 782).


Paul Murrell


Implementation of a Bayesian nonparametric time series approach in BUGS/JAGS


The aim of this project is to implement a Bayesian nonparametric time series analysis in BUGS/JAGS. Choudhuri et al. (2004) suggest a nonparametric approach based on the Whittle likelihood and Dirichlet-Bernstein prior. They specify a Gibbs sampler to sample from the posterior distribution. To achieve an implementation in the software package BUGS/JAGS, it will be necessary to study other implementations of nonparametric methods first and explore whether these can be adjusted to allow the implementation of Choudhuri's model.

Familiarity with Bayesian inference from STATS 331 or 731 and some experience with R and WinBUGS/JAGS is required. Knowledge of time series analysis from STATS 326 would be beneficial.


Renate Meyer


Iterative proportional fitting

Iterative proportional fitting (IPF) is a commonly used algorithm for maximum likelihood estimation in loglinear models, first proposed by Deming and Stephan in 1940. In this project, an overview should be given about the analysis of contingency tables, the IPF algorithm should be introduced, contrasted to other algorithms such as Newton-Raphson, and a convergence proof given.

Its application should be demonstrated using loglinear models for two- and higher-dimensional contingency tables. Familiarity with loglinear or generalised linear models is an advantage.
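The core algorithm itself is short; a Python sketch for a two-way table (illustrative only, since the project itself concerns the theory, comparisons with Newton-Raphson, and a convergence proof):

```python
def ipf(table, row_targets, col_targets, tol=1e-10, max_iter=1000):
    """Iterative proportional fitting: alternately rescale rows and
    columns of a seed table until both margins match the targets."""
    t = [row[:] for row in table]
    for _ in range(max_iter):
        for i, target in enumerate(row_targets):      # row step
            s = sum(t[i])
            t[i] = [x * target / s for x in t[i]]
        for j, target in enumerate(col_targets):      # column step
            s = sum(row[j] for row in t)
            for i in range(len(t)):
                t[i][j] *= target / s
        if all(abs(sum(row) - rt) < tol
               for row, rt in zip(t, row_targets)):
            return t
    return t

fitted = ipf([[1.0, 2.0], [3.0, 4.0]], [40.0, 60.0], [50.0, 50.0])
print([round(sum(row), 6) for row in fitted])                    # [40.0, 60.0]
print([round(sum(r[j] for r in fitted), 6) for j in range(2)])   # [50.0, 50.0]
```

Note that IPF matches the target margins while preserving the cross-product (odds) ratios of the seed table, which is what connects it to maximum likelihood estimation in loglinear models.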


Renate Meyer



New PhD research topics available

For new available PhD topics, please contact our PhD enrolment officers, as supervisors are willing to consider different topics to match suitable PhD applicants:

Rachel Fewster

Renate Meyer

Geoffrey Pritchard

Other PhD enrolment officers


Previous or on-going PhD research topics

Below are some examples of PhD research topics that have been or are still being offered.

Topic (Supervisor)
Modelling detection counts from wildlife surveys using spatial capture-recapture (Ben Stevenson)
Using statistics to test cricketing superstitions (Brendon Brewer)
A picture tells a thousand words, and so can statistical graphs (Paul Murrell)
Exponentially weighted moving average control charts (Arden Miller)
Modeling and optimal control of queueing networks (Ilze Ziedins)
Monte Carlo methods for LISA data analysis (Renate Meyer)
Making the link between investigative questions and conclusions in the statistical enquiry cycle (Maxine Pfannkuch)
Multivariate predictors of diabetes and CVD (Patricia Metcalf)
Mixture density estimation under shape restrictions (Yong Wang)
Efficient analysis with biased samples (Chris Wild)
Quantitative models for ecological data (Russell Millar)
Some problems concerning the generalized hyperbolic distribution (David Scott)
Reconstruction of probability distributions in population genetics (Rachel Fewster)
A statistical computation environment built on Common Lisp (Ross Ihaka)
Wind power in the New Zealand electricity market (Geoffrey Pritchard)
Statistics applied to statistical biology, multivariate modeling and inference (Thomas Yee)
Analysis and design of outcome-dependent sampling studies using semiparametric maximum likelihood (Alan Lee)
Assessing biological and statistical violations in the analysis of conservation datasets (James Russell)