Department of Statistics


Modern Variable Selection for Vector Generalized Linear Models

Speaker: Wenqi Zhao

Affiliation: UoA

When: Monday, 27 May 2024, 1:00 pm to 2:00 pm

Where: 303-257


The generalized linear model (GLM) is a standard framework in statistics for modeling the relationship between a response variable and one or more predictor variables; it is typically used to fit regression models and predict observations. While GLMs offer relatively straightforward interpretation of coefficients, they may not capture complex interactions or nonlinear relationships in the data. Vector generalized linear models (VGLMs) and vector generalized additive models (VGAMs) greatly extend GLMs: VGAM currently implements over 150 family functions and offers a large, flexible framework for varying model elements. Variable selection is a crucial step in statistical modeling: identifying the most relevant predictors of the response variable. In the VGLM/VGAM framework, this is usually done by minimizing some information criterion (IC), of which the Akaike IC (AIC) and Bayesian IC (BIC) are the most common. VGAMs can also penalize regression splines using P-spline smoothers, which we term ‘P-spline VGAMs’. However, fitting VGAMs with penalized regression splines can be computationally intensive, particularly when dealing with large datasets or high-dimensional predictor spaces. When the variables are greater than the


In this project, we propose combining the elastic net with the VGLM/VGAM framework to create a new model selection method. Elastic net regularization can help prevent overfitting and handle multicollinearity, and it can yield sparser models with fewer predictors. The regularization path helps in identifying and handling multicollinearity by favoring models with fewer predictors in the VGLM/VGAM framework.
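
As a rough illustration of the elastic net idea only, the sketch below fits an ordinary Gaussian linear model by coordinate descent in NumPy. It is not the VGLM/VGAM method proposed in the talk; the function names, the penalty parameterisation (`lam`, `alpha`) and the simulated data are all assumptions made for this example.

```python
import numpy as np


def soft_threshold(z, t):
    """Soft-thresholding operator used by the lasso part of the penalty."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def elastic_net(X, y, lam=0.1, alpha=0.5, n_iter=500):
    """Coordinate descent for the (assumed) objective
    (1/2n)||y - X b||^2 + lam * (alpha * ||b||_1 + (1 - alpha)/2 * ||b||_2^2).
    No intercept is fitted: centre y and the columns of X beforehand."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = np.array(y, dtype=float)  # residual for beta = 0
    for _ in range(n_iter):
        for j in range(p):
            resid += X[:, j] * beta[j]          # remove predictor j's contribution
            rho = X[:, j] @ resid / n           # correlation with partial residual
            beta[j] = soft_threshold(rho, lam * alpha) / (col_sq[j] + lam * (1 - alpha))
            resid -= X[:, j] * beta[j]          # add the updated contribution back
    return beta


# Hypothetical data: six predictors, two of them strongly collinear,
# and only predictors 0 and 3 truly related to the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=200)
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.5, size=200)

# A small regularization path: larger lam shrinks more coefficients to zero.
for lam in (0.01, 0.1, 1.0):
    print(lam, np.round(elastic_net(X, y, lam=lam, alpha=0.9), 2))
```

Printing the coefficients along a grid of `lam` values gives a crude regularization path; in practice `lam` and `alpha` would be chosen by cross-validation or an information criterion.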

This is the PYR seminar.

Designing to Support Doing Data Science and Statistics in Schools

Speaker: Hollylynne Lee

Affiliation: NC State University

When: Tuesday, 16 July 2024, 4:00 pm to 5:00 pm

Where: 303-310

Abstract: The U.S. often looks to New Zealand for resources and research related to teaching and learning statistics. In this talk, Hollylynne will discuss two recent projects situated in the U.S. that are advancing the teaching and learning of statistics and data science for secondary schools. These projects have designed curricula and online professional learning experiences for teachers at all stages of their career, from undergraduate education through life-long learning as a practicing teacher. We collaborate with a team at CODAP to integrate advanced data experiences into classrooms. The presentation will have something for everyone related to research, design of educational materials, and ideas for secondary classrooms.

Bio: Dr. Hollylynne Lee is a Distinguished University Professor of Mathematics and Statistics Education in the STEM Education department at NC State University, Raleigh NC, USA. She is also a Senior Faculty Fellow at the Friday Institute for Educational Innovation where she directs the Hub for Innovation and Research in Statistics and Data Science Education. With experience teaching in elementary, middle, and high school classrooms, she brings a depth of practical perspectives to her research, and ensures her research and designs of educational resources are directly applicable to teachers and students. Her current work includes a focus on teachers’ professional learning for teaching with data using tools like CODAP and transforming undergraduate teacher preparation related to teaching statistics and data science. She loves reading, kayaking, watching volleyball, spending time with family, and her dog and cat.

Childhood Risk and Resilience Factors for Pasifika Youth Respiratory Health: Accounting for Attrition and Missingness

Speaker: Dawson Zhai

Affiliation: UoA

When: Friday, 24 May 2024, 1:00 pm to 2:00 pm

Where: 303-310


In New Zealand, 7% of deaths are related to respiratory diseases, with Pacific people at higher risk. Based on knowledge of lung development, lung function can be damaged in two ways: 1) Lung function reduction: early insults may lower the maximum lung function and/or accelerate its decline after the peak; 2) Predisposition to later respiratory disease: early disease raises the risk of later disease occurring. Conversely, some resilience factors can create beneficial effects on respiratory function and/or provide protection to stop subsequent respiratory diseases; among these factors are childhood levels of physical activity, smoke exposure, immunisation, housing conditions, and breastfeeding.

Using Pacific Island Family Study (PIFS) cohort data, this work will investigate the causal effects of identified early-life factors on early-adulthood lung function, quality of life and comorbidities. The PIFS cohort is a longitudinal cohort, the participants of which were enrolled at birth in Middlemore Hospital (n=1398) between March and December 2000. A respiratory assessment (n=466) was conducted within the cohort when participants were 18 years old. In this PIFS birth cohort respiratory study, the primary respiratory outcome was the z-score of the Forced Expiratory Volume in 1 second (FEV1). Secondary outcomes consisted of FEV1 adjusted for height and sex; the healthy lung function (HLF) indicator, defined as the z-score exceeding -1.64; health-related and respiratory-health-related quality of life scores; and respiratory condition indicators. The attrition and missingness present in the group undergoing respiratory assessment will inform much of the analysis plan, as will the longitudinal character of the risk and protective factors and their confounders.

This is the PYR seminar.

Statistical Methods and Designs for Multi-Wave Validation Studies

Speaker: Gustavo, Guimaraes DeCastro Amorim

Affiliation: Vanderbilt University Medical Center

When: Thursday, 23 May 2024, 11:00 am to 12:00 pm

Where: 303-310


Measurement errors are present in many data collection procedures and can harm analyses by biasing estimates. To correct for measurement errors, researchers often validate a subsample of records and then incorporate the information learned from this validation sample into the estimation. In practice, the validation sample is often selected using simple random sampling (SRS). However, SRS leads to inefficient estimates because it ignores information on the error-prone variables, which can be highly correlated to the unknown truth. Applying and extending ideas from the two-phase sampling literature, we propose optimal and nearly-optimal designs for selecting the validation sample in the classical measurement-error framework. We also present novel extensions of estimators that make use of all available data collected in two or more waves. We show through simulations that incorporating information from intermediate steps can lead to substantial gains in efficiency. These works are motivated by and illustrated in Multi-National HIV Research Cohorts.

About the speaker :

Dr Amorim is an Assistant Professor of Biostatistics. His research interests include developing novel statistical methods for problems arising in public health studies, semiparametric models for model misspecification, two-phase designs, measurement-error problems and ordinal data analysis.

COVID-19 vaccine fatigue in Scotland: How do the trends in attrition rates for the second and third doses differ by age, sex, and council area?

Speaker: Robin Muegge

Affiliation: University of Glasgow

When: Thursday, 16 May 2024, 11:00 am to 12:00 pm

Where: 303-310

Abstract: Vaccine fatigue is the propensity for individuals to start but not finish a vaccination program with several doses, which thus means they are less protected. This is an especially important topic for COVID-19, where the vaccination program commonly consists of two doses followed by a booster vaccine dose to get full protection. COVID-19 vaccine hesitancy (the delay in acceptance or refusal of the first dose of the vaccine) has been studied extensively, and a few studies investigated the willingness to receive the booster vaccine dose. In contrast, attrition rates across subsequent doses caused by evolving vaccine fatigue have yet to be examined, which is the novel contribution of this paper. Our study focuses on Scotland, where the vaccine rollout began on 8th December 2020. We model vaccine attrition rates in the first transition (from doses one to two) and the second transition (from doses two to three) for the 32 council areas in Scotland. We estimate the effects of sex and transition, examine trends and patterns in the attrition rates by age group and council area and evaluate if these differ by sex or transition. We model the attrition rates with a hierarchical binomial logistic regression model that allows for flexible autocorrelation estimation for the corresponding neighbourhood and age group structures via correlated random effects models. Inference is based on a Bayesian paradigm, using integrated nested Laplace approximation (INLA). Our main findings are that attrition rates smoothly decrease with increasing age, that they are much higher in the second transition than in the first, that they are generally higher for males than females, and that the variation in attrition rates between age groups is greater for males than females.

At the end of the seminar, I will introduce my current work on outlier detection in areal data, titled “Disease mapping: What if Tobler's First Law of Geography doesn't hold?”

Bio: Robin Muegge is a PhD student in statistics from the University of Glasgow, UK. His research is in spatial and spatio-temporal areal data modelling under the supervision of Duncan Lee, Nema Dean, and Eilidh Jack. Robin completed his B.Sc. Mathematics at the Leibniz University of Hanover in Germany, and his M.Sc. Statistics at Portland State University, USA. He spent 11 weeks at the University of Wollongong, collaborating with Andrew Zammit Mangion, and is visiting the University of Auckland from the 13th to the 17th of May before returning to Glasgow.

Advanced methods for time series data applied to prediction of operating modes and detection of anomalies for wind turbines

Speaker: Hannah Yun

Affiliation: UoA

When: Monday, 13 May 2024, 10:00 am to 11:00 am

Where: 303-257


Wind turbines can be characterised by distinct operating modes that reflect production efficiency. In this talk, we focus on the forecasting problem for univariate discrete-valued time series of operating modes of a wind turbine. We define three prediction strategies to overcome the difficulties associated with missing data. These strategies are evaluated through experiments using five forecasting methods across two real-life datasets. Two of the forecasting methods have been introduced in the statistical literature as extensions of the well-known context algorithm: variable length Markov chains and Bayesian context tree. Additionally, we consider a Bayesian method based on conditional tensor factorisation and two different smoothers from the classical tools for time series forecasting. Each prediction strategy/forecasting method pair is evaluated in terms of prediction accuracy versus computational complexity. We provide guidance on the methods that are suitable for forecasting the time series of operating modes. The prediction results demonstrate that high accuracy can be achieved with reduced computational resources.

We will also briefly discuss how recent advances in the field of dictionary learning can be tailored to detect equipment health deterioration in the case of wind turbines.

This is the PYR seminar.

New Methods for Fitting Hawkes Models with Large Data

Speaker: Conor Kresin

Affiliation: University of Otago

When: Thursday, 2 May 2024, 11:00 am to 12:00 pm

Where: 303-310

Abstract :

Hawkes processes are concise mathematical representations of diverse point process data, ranging from disease spread and wildfire occurrences to non-physical phenomena such as financial asset price movements. Models for point process data are often fit using maximum likelihood estimation (MLE) or Markov Chain Monte Carlo (MCMC), but such methods are slow or computationally intractable for data with large n. In this talk, I will present a novel estimator based on the Stoyan-Grabarnik (sum of inverse intensity) statistic. Unlike MLE or MCMC approaches, the proposed estimator does not require approximation of a computationally expensive integral. I will show that under quite general conditions, this estimator is consistent for estimating parameters governing spatial-temporal point processes such as the Hawkes process and present simulations demonstrating the performance of the estimator. In the second portion of the talk, I will discuss increasingly flexible parametric Hawkes models, culminating in Continuous Long Short Term Memory (cLSTM) recurrent neural networks.
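
To give a feel for the sum-of-inverse-intensity idea, the toy sketch below uses the fact that, at the true parameter, the sum of 1/λ(t_i) over the observed points has expectation equal to the length of the observation window. It is not the estimator presented in the talk and does not involve a Hawkes process: it assumes a simple inhomogeneous Poisson process with intensity θ·g(t), where g, θ and the window length T are hypothetical choices for illustration. Matching the statistic to T gives θ without integrating the intensity.

```python
import numpy as np


def simulate_poisson(theta, g, g_max, T, rng):
    """Simulate an inhomogeneous Poisson process with intensity theta * g(t)
    on [0, T] by thinning a homogeneous process of rate theta * g_max."""
    lam_max = theta * g_max
    n = rng.poisson(lam_max * T)
    t = rng.uniform(0.0, T, size=n)
    keep = rng.uniform(size=n) < theta * g(t) / lam_max
    return np.sort(t[keep])


def sg_estimate(times, g, T):
    """Solve sum_i 1 / (theta * g(t_i)) = T for theta.
    No integral of the intensity over [0, T] is needed."""
    return np.sum(1.0 / g(times)) / T


def g(t):
    """Hypothetical known shape of the intensity, bounded above by 1.5."""
    return 1.0 + 0.5 * np.sin(t)


rng = np.random.default_rng(0)
T = 500.0
times = simulate_poisson(theta=2.0, g=g, g_max=1.5, T=T, rng=rng)
print("estimated theta:", sg_estimate(times, g, T))
```

For a one-parameter model a single equation suffices; with several parameters one would need more than one such equation, for example by applying the same idea over a partition of the window.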

About the speaker :

Conor Kresin is a lecturer in the Department of Mathematics and Statistics, University of Otago. His research interests include point process theory and applications, stochastic geometry, disease modelling, information theory, and causal inference.

Reproducible inference and model selection using bagged posteriors

Speaker: Jeffrey Miller

Affiliation: Harvard University

When: Thursday, 18 April 2024, 11:00 am to 12:00 pm

Where: 303-310


Under model misspecification, it is known that Bayesian posteriors often do not properly quantify uncertainty about true or pseudo-true parameters. Even more fundamentally, misspecification leads to a lack of reproducibility in the sense that the same model will yield contradictory posteriors on independent data sets from the true distribution. To improve reproducibility, an easy-to-use and widely applicable approach is to apply bagging to the Bayesian posterior ("BayesBag"); that is, to use the average of posterior distributions conditioned on bootstrapped datasets. To define a criterion for reproducible uncertainty quantification under misspecification, we consider the probability that two confidence sets constructed from independent data sets have nonempty overlap, and we establish a lower bound on this overlap probability that holds for any valid confidence sets. We prove that credible sets from the standard posterior can strongly violate this bound, indicating that it is not internally coherent under misspecification, whereas the bagged posterior typically satisfies the bound. We demonstrate on simulated and real data.
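
A minimal sketch of the bagging step described above, assuming a deliberately simple conjugate model (normal data with known variance and a normal prior on the mean) so that each posterior is available in closed form. The model, function names and settings are assumptions for illustration, not the speaker's implementation. Averaging the bootstrap posteriors gives a mixture of normals, whose mean and variance follow from the law of total variance.

```python
import numpy as np


def normal_posterior(x, sigma2=1.0, prior_var=100.0):
    """Posterior mean and variance of mu for x_i ~ N(mu, sigma2) with a
    N(0, prior_var) prior on mu (sigma2 is assumed known by the model)."""
    n = len(x)
    post_var = 1.0 / (n / sigma2 + 1.0 / prior_var)
    post_mean = post_var * np.sum(x) / sigma2
    return post_mean, post_var


def bayesbag(x, n_boot=200, sigma2=1.0, prior_var=100.0, seed=0):
    """Mean and variance of the bagged posterior: the average of the
    posteriors computed on bootstrap resamples of the data."""
    rng = np.random.default_rng(seed)
    means = np.empty(n_boot)
    post_var = 0.0
    for b in range(n_boot):
        xb = rng.choice(x, size=len(x), replace=True)
        means[b], post_var = normal_posterior(xb, sigma2, prior_var)
    # The within-component variance is the same for every resample here,
    # so the mixture variance is post_var plus the spread of the means.
    return means.mean(), post_var + means.var()


# Misspecified example: the model assumes variance 1, the data have variance 9.
rng = np.random.default_rng(1)
x = rng.normal(loc=1.0, scale=3.0, size=100)
print("standard posterior:", normal_posterior(x))
print("bagged posterior:  ", bayesbag(x))
```

In this toy case the standard posterior variance depends only on the sample size and the assumed variance, while the bagged posterior also picks up the spread of the data through the bootstrap.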

About the speaker:

Jeff Miller is an Associate Professor of Biostatistics, Harvard University. He is interested in using statistics to understand the molecular mechanisms of diseases of aging. His methodological research focuses on robustness to model misspecification, nonparametric Bayesian models, frequentist analysis of Bayesian methods, and efficient algorithms for inference in complex models.

Practical Functions: Practically Magic

Speaker: Nicholas Tierney

Affiliation: Telethon Kids Institute

When: Thursday, 21 March 2024, 11:00 am to 12:00 pm

Where: 303-310

Abstract : I think the highest value skillset in statistical programming is knowing how to write good functions. Functions are often taught as a tool to avoid repetition using the mnemonic DRY: Don't Repeat Yourself. Whilst DRY is both true and real, I think functions are at their best when they encapsulate expression and are easy to reason with. That is, DRY is sufficient, but not necessary. Writing good functions is more than esoteric aesthetics. We need to be able to reason with our code in statistics. We often don't have the capacity to write tests to show our code is "correct". Instead, we need to rely on our ability to reason with, trust, and verify that the code works as it should. I believe writing good functions that encapsulate expressions and are able to be reasoned with are how we can ensure our code, and therefore our methods, and our analyses, work as they should. In this talk I will discuss some practical ideas on writing a good function, how to identify bad ones, and how to move between the two states.

About the speaker : I work as a research software engineer, with Nick Golding on the greta R package for statistical modelling, and implementing novel statistical methods for infectious diseases like COVID19 and malaria. I work at the Telethon Kids Institute, which is based in Perth, Western Australia, but I work remotely in Launceston, Tasmania. I am a strong advocate for free and open source software, and have written several R packages to improve data analysis.

Sampling older populations: methods and challenges in the IDEA Programme

Speaker: Ngaire Kerse

Affiliation: UoA

When: Wednesday, 20 March 2024, 11:00 am to 12:00 pm

Where: 303-257

Abstract: Dementia is a global health priority. The IDEA programme is a dementia prevalence study aiming to establish the true prevalence of dementia among older adults in Aotearoa New Zealand, with a particular emphasis on diverse ethnic groups. In this seminar, we will provide an overview of the methods and challenges associated with sampling older adults. We will discuss the various sampling strategies employed for each setting, including the community, retirement villages, and aged residential care.

Speaker: Ngaire Kerse is the Joyce Cook Chair in Ageing Well, a GP, and Professor of General Practice and Primary Health Care at the University of Auckland. With over 350 publications and 50 research grants, she is an international expert in falls prevention, bi-cultural ageing, and primary health care. Leading multiple research teams, Ngaire spearheads projects such as LiLACS NZ, focusing on equity, health service use, and well-being in advanced age. Her work on fall prevention includes studies on older individuals post-stroke and in residential care. Currently, she heads the IDEA programme, investigating the prevalence and impact of dementia in Aotearoa.

Two Applications of Regression Averaging

Speaker: Norman Matloff

Affiliation: University of California

When: Thursday, 7 March 2024, 3:00 pm to 4:00 pm

Where: 303-310


My term "regression averaging" refers to first running a regression estimation procedure, be it a linear model, k-Nearest Neighbors or whatever, then averaging the fitted values over some region. I will present two applications of this. The first is on the topic of dealing with missing values, specifically in a context of prediction rather than effect estimation. The second is in the area of removing bias with respect to sensitive variables, say race or gender in a prediction model.
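
One way to picture the first application is sketched below, under assumptions that are mine rather than the speaker's: fit a linear model on complete training data, then predict for a new case with a missing predictor by averaging the fitted values of the training rows closest to it on the predictors that were observed. The helper names, the choice of `k`, and the simulated data are hypothetical.

```python
import numpy as np


def fit_linear(X, y):
    """Least-squares linear fit with an intercept; returns the fitted values."""
    A = np.column_stack([np.ones(len(X)), X])
    coef = np.linalg.lstsq(A, y, rcond=None)[0]
    return A @ coef


def predict_with_missing(X, fitted, x_new, k=200):
    """Average the fitted values over the k training rows nearest to x_new,
    measuring distance only on the components of x_new that are not NaN."""
    observed = ~np.isnan(x_new)
    dist = np.linalg.norm(X[:, observed] - x_new[observed], axis=1)
    nearest = np.argsort(dist)[:k]
    return fitted[nearest].mean()


# Hypothetical training data with two predictors.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 2))
y = 1.0 + 2.0 * X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=5000)
fitted = fit_linear(X, y)

# New case: first predictor observed, second predictor missing.
x_new = np.array([0.5, np.nan])
print(predict_with_missing(X, fitted, x_new))
```

The region being averaged over here is a neighbourhood in the observed predictors; other choices of region would suit other uses of the same idea.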

About the speaker:

Norman Matloff is a professor in the Department of Computer Science at the University of California. Professor Matloff’s research areas include parallel processing (especially software distributed shared memory), statistical computing, and predictive analytics.

An Overview: Data Analysis for Space-based Gravitational Wave Observations

Speaker: Ollie Burke

Affiliation: Laboratoire des 2 infinis - Toulouse (L2IT)

When: Thursday, 7 March 2024, 2:00 pm to 3:00 pm

Where: 303S-561


Current observations through ground-based detectors of gravitational waves (GWs) are having a pronounced effect on the understanding of our universe. Due to the presence of the earth, ground-based detectors are limited in sensitivity to lower frequency GWs, losing access to the rich science that can be reaped from higher mass black hole coalescences. The proposed space-based detector, the Laser Interferometer Space Antenna (LISA), eliminates sources of noise from the earth and will provide access to observations of GWs in the rich mHz frequency band, and thus to higher mass binaries. The aim of this talk is to be pedagogical in nature: reviewing GWs up to the first detection, GW150914, and providing an overview of LISA-specific sources with a simple example of Bayesian inference applied to a toy GW model. We will finish on the prospects for the LISA instrument by discussing both current work and future challenges in the context of data analysis.

