Department of Statistics



Past seminars

Investigating linkage bias in the Integrated Data Infrastructure

Speaker: Eileen Li

Affiliation: The University of Auckland

When: Tuesday, 14 December 2021, 2:00 pm to 3:00 pm


Linked administrative data can provide rich information on a wide range of outcomes, and its usage is on the rise both in New Zealand and internationally. The Integrated Data Infrastructure (IDI) is a database maintained by Statistics New Zealand (Stats NZ) that contains linked administrative data at the individual level. In the absence of a unique personal identifier, probabilistic record linkage is performed, which unavoidably introduces linkage errors. However, the majority of IDI analyses are completed without understanding, measuring or correcting for potential linkage bias. We aim to quantify linkage errors in the IDI and provide feasible approaches to adjust for linkage biases in IDI analysis. In this talk, I will briefly explain how linkage errors (false links and missed links) may occur in the IDI, followed by approaches for identifying false links and missed links. Some key limitations will also be addressed.
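Probabilistic record linkage of the kind described above is commonly framed in Fellegi-Sunter terms: each candidate record pair gets a log-likelihood-ratio weight, and pairs scoring above a threshold are declared links (a source of both false links and missed links). A minimal generic sketch, with hypothetical field probabilities and not the IDI's actual linker:

```python
import math

def match_weight(agreements, m_probs, u_probs):
    # Fellegi-Sunter style log-likelihood-ratio weight for a candidate
    # record pair: add log(m/u) for each agreeing field and
    # log((1-m)/(1-u)) for each disagreeing field.
    w = 0.0
    for agree, m, u in zip(agreements, m_probs, u_probs):
        w += math.log(m / u) if agree else math.log((1 - m) / (1 - u))
    return w

# Hypothetical fields: name, birth date, address.
m_probs = [0.95, 0.90, 0.70]   # P(field agrees | records truly match)
u_probs = [0.01, 0.001, 0.05]  # P(field agrees | records do not match)

# Name and birth date agree, address differs: strongly positive weight,
# so this pair would typically be linked.
print(match_weight([True, True, False], m_probs, u_probs) > 0)
```

Thresholding such weights is exactly where the two error types arise: too low a cut-off admits false links, too high a cut-off creates missed links.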

Estimating the approximation error for the saddlepoint maximum likelihood estimate

Speaker: Godrick Maradona Oketch

Affiliation: The University of Auckland

When: Wednesday, 8 December 2021, 2:00 pm to 3:00 pm


Saddlepoint approximation to a density function is increasingly being used, primarily because of its remarkable accuracy. A common application of this approximation is to interpret it as a likelihood function, especially when the true likelihood function does not exist or is intractable, with the aim of obtaining parameter estimates from the saddlepoint-based likelihood. This study examines the likelihood function (based on first- and second-order saddlepoint approximations) to estimate the difference between the true but unknown maximum likelihood estimates (MLEs) and the saddlepoint-based MLEs. We propose an expression that estimates this difference (error) by computing the gradient of the neglected term in the first-order saddlepoint approximation. Then, using common distributions whose true likelihood functions are known to perform confirmatory tests on the proposed error expression, we show that the results are consistent with the difference between the true MLEs and the saddlepoint MLEs. These tests indicate that the proposed formula could complement the simulation studies that have been widely used to justify the accuracy of such saddlepoint MLEs.
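For reference, the standard first-order saddlepoint approximation to a density, whose neglected higher-order term is the target of the error expression described above, takes the familiar form:

```latex
% First-order saddlepoint approximation to a density f(x), where
% K(s) is the cumulant generating function and \hat{s} = \hat{s}(x)
% solves the saddlepoint equation K'(\hat{s}) = x:
\hat{f}(x) = \frac{1}{\sqrt{2\pi\, K''(\hat{s})}}
             \exp\!\left\{ K(\hat{s}) - \hat{s}\,x \right\}
```

Interpreting \(\hat{f}\) as a likelihood in the parameters and maximising it yields the saddlepoint MLEs compared against the true MLEs in the talk.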

Disease risk prediction using deep neural networks

Speaker: Xiaowen Li

Affiliation: The University of Auckland

When: Thursday, 2 December 2021, 2:00 pm to 3:00 pm


Accurate disease risk prediction is an essential step towards precision medicine, an emerging model of health care that tailors treatment strategies to individuals' profiles. The recent abundance of genome-wide data provides unprecedented opportunities to systematically investigate complex human diseases. However, the ultra-high dimensionality and the complex relationships between biomarkers and outcomes have brought tremendous analytical challenges; dimension reduction is therefore crucial for analysing high-dimensional genomic data. Deep learning models are promising approaches for modelling features of high complexity, and thus have the potential to offer a unified approach to efficiently modelling diseases with different underlying genetic architectures. The overall objective of this project is to develop a hybrid deep neural network incorporating the multi-kernel Hilbert-Schmidt Independence Criterion Lasso (MK-HSIC-lasso) to efficiently select important predictors from ultra-high-dimensional genomic data and model their complex relationships, for risk prediction analysis on high-dimensional genomic data.

Accessing 'grid' from 'ggplot2'

Speaker: Paul Murrell

Affiliation: The University of Auckland

When: Thursday, 18 November 2021, 3:00 pm to 4:00 pm


The 'ggplot2' package for R is a very popular package for producing statistical plots (in R). 'ggplot2' provides a high-level interface that makes it easy to produce complex images from small amounts of R code. The 'grid' package for R is an unpopular package for producing arbitrary images (in R). 'grid' provides a low-level interface that requires a lot of work to produce complex images. However, 'grid' provides complete control over the fine details of an image. 'ggplot2' uses the low-level package 'grid' to do its drawing so, in theory, users should be able to get the best of both worlds. This talk will discuss the surprising fact that 'ggplot2' users cannot easily get the best of both worlds and it will introduce the 'gggrid' package, which is here to save the day (and both worlds).

Applications of scoring rules

Speaker: Matthew Parry

Affiliation: University of Otago

When: Thursday, 21 October 2021, 3:00 pm to 4:00 pm


Suppose you publicly express your uncertainty about an unobserved quantity by quoting a distribution for it. A scoring rule is a special kind of loss function intended to measure the quality of your quoted distribution when an outcome is actually observed. In statistical decision theory, you seek to minimise your expected loss. A scoring rule is said to be proper if the expected loss under your quoted distribution is minimised by quoting that distribution. In other words, you cannot game the system!

In addition to having a rich theoretical structure – for example, associated with every scoring rule is an entropy and a divergence function – scoring rules can be tailored to the problem at hand and consequently have a wide range of applications. They are used in statistical inference, for evaluating and ranking forecasters, for assessing the quality of predictive distributions, and in exams.

I will talk about a range of scoring rules and discuss their application in areas such as classification and time series. In addition to so-called local scoring rules that do not depend on the normalisation of the quoted distribution, I will also discuss recently discovered connections between scoring rules and the Whittle likelihood.
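The properness condition described above can be checked numerically: under a strictly proper rule such as the log score, your expected loss is minimised by quoting the true distribution. A minimal sketch with hypothetical toy distributions:

```python
import math

def log_score(q, outcome):
    # Log score: negative log probability the quoted distribution q
    # assigns to the observed outcome (lower is better).
    return -math.log(q[outcome])

def expected_score(p_true, q_quoted, score):
    # Expected loss when outcomes follow p_true but q_quoted is quoted.
    return sum(p * score(q_quoted, k) for k, p in enumerate(p_true))

p = [0.7, 0.2, 0.1]   # true distribution over three outcomes
q = [0.5, 0.3, 0.2]   # a mis-quoted distribution

# Honesty wins: quoting p gives strictly lower expected loss than
# quoting q, so the system cannot be gamed.
print(expected_score(p, p, log_score) < expected_score(p, q, log_score))
```

The gap between the two expected scores is exactly the Kullback-Leibler divergence from p to q, illustrating the divergence function that the abstract says accompanies every scoring rule.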

Current state and prospects of R-package for the design of experiments

Speaker: Emi Tanaka

Affiliation: Monash University

When: Thursday, 14 October 2021, 3:00 pm to 4:00 pm


The critical role of data collection is well captured in the expression "garbage in, garbage out" -- in other words, if the collected data are rubbish then no analysis, however complex it may be, can make something of them. The gold standard for data collection is through well-designed experiments. Re-running an experiment is generally expensive, in contrast to statistical analysis, where re-doing it is generally low-cost; the stakes of getting an experimental design wrong are therefore higher. But how do we design experiments in R? In this talk, I will review the current state of R packages for the design of experiments and present my prototype R package {edibble}, which implements a framework that I call the "grammar of experimental design".


Dr. Emi Tanaka is a lecturer in statistics at Monash University whose primary interest is to develop impactful statistical methods and tools that can readily be used by practitioners. Her research areas include data visualisation, mixed models and experimental design, motivated primarily by problems in bioinformatics and the agricultural sciences. She is currently the President of the Statistical Society of Australia Victorian Branch and the recipient of the Distinguished Presenter's Award from the Statistical Society of Australia for her delivery of a wide range of R workshops.

Highly comparative time-series analysis

Speaker: Ben Fulcher

Affiliation: School of Physics, The University of Sydney

When: Thursday, 7 October 2021, 3:00 pm to 4:00 pm


Over decades, an interdisciplinary scientific literature has contributed myriad methods for quantifying patterns in time series. These methods can be encoded as features that summarize different types of time-series structure as interpretable real numbers (e.g., the shape of peaks in the Fourier power spectrum, or the estimated dimension of a time-delay reconstructed attractor). In this talk, I will show how large libraries of time-series features (>7k, implemented in the hctsa package) and time series (>30k, in the CompEngine database) enable new ways of analyzing time-series datasets, and of assessing the novelty and usefulness of time-series analysis methods. I will highlight new open tools that we’ve developed to enable these analyses, and discuss specific applications to neural time series.
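The idea of encoding time-series structure as interpretable real numbers can be sketched with a few classic features; this is a toy illustration in the spirit of feature-based analysis, not hctsa's actual (much larger) feature set:

```python
def ts_features(x):
    # Map a time series to a few interpretable scalar features:
    # its mean, variance, and lag-1 autocorrelation.
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    # Lag-1 autocorrelation: positive for slowly varying series,
    # near zero for this alternating example.
    ac1 = sum((x[i] - mean) * (x[i + 1] - mean) for i in range(n - 1)) / (n * var)
    return {"mean": mean, "variance": var, "ac1": ac1}

feats = ts_features([1.0, 2.0, 3.0, 2.0, 1.0, 2.0, 3.0, 2.0])
print(sorted(feats))  # ['ac1', 'mean', 'variance']
```

Stacking thousands of such features across thousands of series gives the feature-by-series matrix on which the comparative analyses in the talk operate.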

Merging Modal Clusters via Significance Assessment

Speaker: Yong Wang

Affiliation: The University of Auckland

When: Thursday, 19 August 2021, 3:00 pm to 4:00 pm


In this talk, I will describe a new procedure that merges modal clusters step by step and produces a hierarchical clustering tree. This is useful for dealing with superfluous clusters and for reducing the number of clusters, as is often desired in practice. Based on some new properties we establish for Morse functions, the procedure merges clusters in a sequential manner without causing unnecessary density distortion. Each cluster is evaluated for its significance relative to the other clusters, using the Kullback-Leibler divergence or its log-likelihood approximation, by truncating the density for the cluster at an appropriate level. The least significant cluster is then merged into one of its adjacent clusters, using the novel concept of cluster adjacency we define. The resulting hierarchical clustering tree is useful for determining the number of clusters, as may be preferred by a specific user or in a general, meaningful manner. Numerical studies show that the new procedure handles difficult clustering problems well and often produces intuitively appealing and numerically more accurate clustering results, as compared with several other popular clustering methods in the literature.
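The significance measure above is built on the Kullback-Leibler divergence; a minimal discrete sketch of that quantity (generic, and not the truncated-density construction specific to the talk):

```python
import math

def kl_divergence(p, q):
    # Discrete Kullback-Leibler divergence D(p || q) between two
    # probability vectors: zero iff p == q, positive otherwise.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# A peaked "cluster" distribution diverges from a flat reference,
# so it would register as significant; an identical pair would not.
peaked = [0.8, 0.1, 0.1]
flat = [1 / 3, 1 / 3, 1 / 3]
print(kl_divergence(peaked, flat) > 0)
```

In the merging procedure, the cluster whose (truncated-density) divergence is smallest is deemed least significant and absorbed into an adjacent cluster.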

Generally-altered, -inflated and -truncated regression, with application to heaped and seeped counts

Speaker: Thomas Yee

Affiliation: University of Auckland

When: Thursday, 22 July 2021, 3:00 pm to 4:00 pm

Where: MLT3/303-101

A very common aberration in retrospective self-reported survey data is digit preference (heaping), whereby multiples of 10 or 5 are measured in excess upon rounding, creating spikes in spikeplots. Handling this problem requires great flexibility. To this end, and for seeped data also, we propose GAIT regression to unify truncation, alteration and inflation simultaneously, applied to general sets rather than just {0}. Models such as the zero-inflated and zero-altered Poisson are special cases. Parametric and nonparametric alteration and inflation mean our combo model has five types of 'special' values. Consequently it spawns a novel method for overcoming underdispersion through general truncation by expanding out the support. Full estimation details involving Fisher scoring/iteratively reweighted least squares are presented, as well as working implementations for three 1-parameter distributions: Poisson, logarithmic and zeta. Previous methods for heaped data have been found wanting; GAIT regression, however, holds great promise by allowing the joint flexible modelling of counts having absences, deficiencies and excesses at arbitrary multiple special values. It is now possible to analyze the joint effects of alteration, inflation and truncation on under- and over-dispersion. The methodology is implemented in the VGAM R package, available on CRAN.

Does It Add Up? Hierarchical Bayesian Analysis of Compositional Data

Speaker: Em Rushworth

Affiliation: University of Auckland

When: Thursday, 22 July 2021, 2:00 pm to 3:00 pm

Where: 303-B05

Compositional data are everywhere: in mineral analysis, demographics, and species abundance, for example. However, despite the well-known difficulties in analysing such data, research into expanding the existing methods or exploring alternative approaches has been limited. The past five years have seen a resurgence of interest in the field popularised by Aitchison (1982), which uses a family of log-ratio transformations with traditional statistical methodology to analyse compositional data. Most recent publications using this methodology focus solely on application, despite the innate limitations of log-ratios preventing wider adoption. This research aims to fill in many of the blanks, including studying approaches outside the log-ratio transformation family, by proposing a consistent definition of compositional data regardless of approach, methodological developments in the treatment of zeroes and sparse data, and demonstrations of applications across multiple domains.

Bayesian hierarchical models are prominent in many of the domains considered in this research, such as ecology and movement studies, and provide a useful framework for considering compositional data. Although the crossover between these two fields is consistently mentioned in both Bayesian and compositional data modelling papers, there is very little literature on it and it remains largely unexplored. Leininger et al. (2013) successfully used a Bayesian hierarchical framework to model the presence of zeroes as a separate hierarchical level, but there has not been any research outside of the log-ratio transformation. This research will seek to present a Bayesian approach to compositional data analysis using hierarchical models and, hopefully, help make the field more accessible for future researchers.
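The Aitchison log-ratio family referred to above can be illustrated by its simplest member, the centred log-ratio (CLR) transform, which maps a composition to unconstrained coordinates. A minimal sketch (generic, with a hypothetical three-part composition):

```python
import math

def clr(composition):
    # Centred log-ratio transform: divide each part by the geometric
    # mean of the parts and take logs. The resulting coordinates are
    # unconstrained apart from summing to zero.
    g = math.exp(sum(math.log(x) for x in composition) / len(composition))
    return [math.log(x / g) for x in composition]

parts = [0.2, 0.3, 0.5]          # hypothetical composition (sums to 1)
coords = clr(parts)
print(abs(sum(coords)) < 1e-12)  # CLR coordinates sum to zero
```

The transform's requirement that every part be strictly positive is precisely the zero-handling limitation that motivates the alternative approaches this research explores.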

