## Department of Statistics

# Seminars

**Unsupervised Statistical Tools for the Detection of Anomalies in Populations**

Speaker: Prof. Fabrizio Ruggeri

Affiliation: CNR-IMATI Milano, Italy

When: Wednesday, 8 March 2023, 2:00 pm to 3:00 pm

Where: 303-310

The research is motivated by the increased interest in detecting possible fraud in healthcare systems. We propose some unsupervised statistical tools (Lorenz curve, concentration function, sum of ranks, Gini and Pietra indices) to provide efficient and easy-to-use methods aimed at signalling possible anomalous behaviours. A more sophisticated method, based on Bayesian co-clustering, is presented as well.
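As a rough illustration of two of the tools named in this abstract, the sketch below computes the empirical Lorenz curve, the Gini index, and the Pietra index for a list of non-negative amounts. It is not the speaker's method: the function names and the billing figures are hypothetical, and the Gini index is taken as one minus twice the trapezoidal area under the empirical Lorenz curve.

```python
def lorenz_curve(values):
    """Return the empirical Lorenz curve as a list of (p, L(p)) points."""
    xs = sorted(values)
    total = sum(xs)
    n = len(xs)
    points = [(0.0, 0.0)]
    running = 0.0
    for i, x in enumerate(xs, start=1):
        running += x
        points.append((i / n, running / total))
    return points


def gini_index(values):
    """Gini index: 1 minus twice the trapezoidal area under the Lorenz curve."""
    pts = lorenz_curve(values)
    area = sum((p1 - p0) * (l0 + l1) / 2
               for (p0, l0), (p1, l1) in zip(pts, pts[1:]))
    return 1 - 2 * area


def pietra_index(values):
    """Pietra index: largest vertical gap between the line of equality
    and the Lorenz curve (attained at a vertex of the piecewise-linear curve)."""
    return max(p - l for p, l in lorenz_curve(values))


# Hypothetical amounts billed by four providers (illustrative data only).
evenly_spread = [5, 5, 5, 5]
concentrated = [0, 0, 0, 10]

print(gini_index(evenly_spread), pietra_index(evenly_spread))
print(gini_index(concentrated), pietra_index(concentrated))
```

Both indices are zero when amounts are spread evenly and grow as the total concentrates in a few units, which is the kind of signal an unsupervised screen could flag for closer inspection.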

**The propensity score for the analysis of observational studies**

Speaker: Prof Markus Neuhaeuser

Affiliation:

When: Tuesday, 28 February 2023, 1:00 pm to 2:00 pm

Where: 303-148

In observational, non-randomized studies, groups usually differ in some baseline covariates. Propensity scores are increasingly being used in the statistical analysis to adjust for those between-group differences. There is great flexibility in how the propensity score can be appropriately used. One possible strategy is stratification, also called subclassification. We present examples and discuss the question of how many strata are useful.

Moreover, the flexibility might encourage p-value hacking, where several alternative uses of propensity scores are explored and the one yielding the lowest p-value is selectively reported. Although such an approach is not scientifically acceptable, it might occur, and therefore we simulate the extent of type I error inflation.
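To give a feel for why reporting only the smallest p-value inflates the type I error, here is a minimal simulation sketch. It is not the simulation from the talk: it assumes that the alternative analyses yield independent p-values that are uniform under the null hypothesis, whereas analyses of the same data set are correlated, so the inflation seen in practice would likely be smaller. The function name and its parameters are hypothetical.

```python
import random


def min_p_rejection_rate(n_analyses, alpha=0.05, n_sim=20000, seed=1):
    """Estimate how often the smallest of `n_analyses` null p-values falls
    below `alpha`, assuming the p-values are independent (an assumption)."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sim):
        # Under the null hypothesis a valid p-value is uniform on (0, 1).
        smallest = min(rng.random() for _ in range(n_analyses))
        if smallest < alpha:
            rejections += 1
    return rejections / n_sim


for k in (1, 3, 5):
    simulated = min_p_rejection_rate(k)
    theoretical = 1 - (1 - 0.05) ** k  # closed form under independence
    print(k, round(simulated, 3), round(theoretical, 3))
```

Under the independence assumption, the rejection rate follows 1 - (1 - alpha)^k, so even a handful of alternative analyses pushes the nominal 5% level well above its intended value.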

**K-12 Data Science or Statistics? Is a distinction needed?**

Speaker: Professor Rob Gould

Affiliation: Vice-chair Undergraduate Studies, Department of Statistics, UCLA.

When: Friday, 10 February 2023, 11:00 am to 12:00 pm

Where: 303-G14

For decades now, statistics educators have worked to achieve widespread statistical literacy. And now, well before the task is accomplished, along comes Data Science Education. I’ll explain why, from my perspective, this term is more than just a new label for an old thing, describe updates to the American Statistical Association’s Guidelines for Assessment and Instruction in Statistics Education (GAISE) Pre-K-12 report, and give a brief overview of a high school data science course that I helped design and propagate. I’ll also discuss currents in the US pushing back against data science (and statistics) education.

**Adversarial Risk Analysis for Bi-Agent Influence Diagrams: An Algorithmic Approach**

Speaker: Javier Cano

Affiliation:

When: Wednesday, 2 November 2022, 2:00 pm to 3:00 pm

Where: 303-310

Authors: Jorge González-Ortega, David Ríos Insua, Javier Cano

Abstract: We describe how to support a decision maker who faces an adversary. To that end, we consider general interactions entailing sequences of both agents’ decisions, some of them possibly being simultaneous or repeated across time. We model their joint problem as a bi-agent influence diagram. Unlike previous solutions framed under a standard game-theoretic perspective, we provide a decision-analytic methodology to support the decision maker based on an adversarial risk analysis paradigm. This allows the avoidance of unrealistic strong common knowledge assumptions typical of non-cooperative game theory as well as a better apportionment of uncertainty sources. We illustrate the methodology with a schematic critical infrastructure protection problem.

DOI: https://doi.org/10.1016/j.ejor.2018.09.015

**Generally-altered, -inflated, -truncated and deflated concrete regression**

Speaker: Willow Shi

Affiliation: The University of Auckland

When: Wednesday, 31 August 2022, 4:00 pm to 5:00 pm

Where: 303-G14

Zero-altered, -inflated and -truncated count regression is now well established in modern statistical modelling, especially with Poisson and binomial parents. Recently Yee and Ma (2022) have proposed generally-altered, -inflated, -truncated and deflated (GAITD) regression that extends these models in almost a maximal way. The main idea is to combine the four operators into a single supermodel. GAITD regression includes parametric and nonparametric forms, the latter based on the multinomial logit models (MLM). A special application is heaped measurement error data, e.g., spikes at multiples of 10 or 5 upon rounding. The VGAM R package currently implements three GAITD distributions. The next stage of this work is to develop GAITD continuous distributions. However, when handling spikes and seeping (dips), these width-zero support values must be treated as in a discrete distribution. We thus call the distribution 'concrete': both CONtinuous and disCRETE. The resulting GAITD Mix-MLM combo model has thirteen special value types. Some potential applications of GAITD concrete distributions will be presented as well as some preliminary results.

**Networks of Accumulating Priority Queues**

Speaker: Yan Chen

Affiliation: University of Auckland

When: Friday, 29 July 2022, 2:00 pm to 3:00 pm

Where: 303-310

Priority queueing systems have applications in many different fields. In accumulating priority queues (APQs), the priority of a new arrival increases with time spent in the system, at a rate that depends on the class of the arrival. Thus, an arrival from a lower priority class who has been waiting sufficiently long may be seen before a more recent arrival from a higher class.
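The accumulating priority rule can be sketched in a few lines. The example assumes that priority starts at zero on arrival and grows linearly at the class rate; the class names, rates, and arrival times are hypothetical and chosen only to show a long-waiting low-class customer overtaking a recent high-class one.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    name: str
    arrival_time: float
    rate: float  # priority accumulation rate of the customer's class


def priority(customer, now):
    """Accumulated priority at time `now` (linear growth from zero, assumed)."""
    return customer.rate * (now - customer.arrival_time)


def next_to_serve(waiting, now):
    """Pick the waiting customer with the highest accumulated priority."""
    return max(waiting, key=lambda c: priority(c, now))


# Hypothetical scenario: a low-class customer arrives early,
# a high-class customer arrives much later.
low = Customer("low", arrival_time=0.0, rate=1.0)
high = Customer("high", arrival_time=8.0, rate=3.0)

# If the server frees up at time 10, the low-class customer has priority
# 1 * 10 = 10 and the high-class customer 3 * 2 = 6.
print(next_to_serve([low, high], now=10.0).name)

# If the server frees up at time 13 instead, the priorities are 13 and 15.
print(next_to_serve([low, high], now=13.0).name)
```

Which customer is selected therefore depends on the moment the server becomes free, not only on class, which is what distinguishes this discipline from a strict priority queue.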

Most of the research on APQs has been for single queues, while many processes function like a network of queues, particularly in a medical setting. We will apply coupling methods to explore the sample path behaviour and performance of queueing networks under the accumulating priority discipline. Natural performance measures to consider here are waiting and departure times. This talk will describe initial results for sequential queues, as well as discussing proposed further work.

**On computationally efficient methods for testing multivariate distributions with unknown parameters**

Speaker: Sara Algeri

Affiliation: School of Statistics, University of Minnesota

When: Wednesday, 20 July 2022, 11:00 am to 12:00 pm

Where: 303-310

Despite the popularity of classical goodness-of-fit tests such as Pearson’s chi-squared and Kolmogorov-Smirnov, their applicability often faces serious challenges in practical applications. For instance, in a binned data regime, low counts may affect the validity of the asymptotic results. Excessively large bins, on the other hand, may lead to loss of power. In the unbinned data regime, tests such as Kolmogorov-Smirnov and Cramér-von Mises do not enjoy distribution-freeness if the models under study are multivariate and/or involve unknown parameters. As a result, one needs to simulate the distribution of the test statistic on a case-by-case basis. In this talk, I will discuss a testing strategy that allows us to overcome these shortcomings and equips experimentalists with a novel tool to perform goodness-of-fit testing while substantially reducing the computational costs.

**Joint Modelling of Medical Cost and Survival in Complex Sample Surveys**

Speaker: Seong Hoon Yoon

Affiliation: The University of Auckland

When: Wednesday, 25 May 2022, 2:00 pm to 3:00 pm

Where: Zoom

Joint modelling of longitudinal and time-to-event data is a method that recognises the dependency between the two data types, and combines the two outcomes into a single model, which leads to more efficient estimates. These models are applicable when individuals are followed over a period of time, generally to monitor the progression of a disease or a medical condition, and also when longitudinal covariates related to the time-to-event variable are available. The present project consists in developing and applying joint models of medical cost and survival. However, medical cost datasets are usually obtained using a complex sampling design rather than simple random sampling and this design needs to be considered in the statistical analysis. This project aims to develop a novel approach to joint modelling of complex data by combining survey calibration with standard joint modelling, which will be achieved by incorporating a new set of equations to calibrate the sampling weights. The newly developed methods will be applied to jointly model survival and cost data taken from the 'Analyzing clustered epidemiological studies from complex surveys: Using big data to estimate dementia prevalence in New Zealand' data from the Integrated Data Infrastructure (IDI).

https://auckland.zoom.us/j/96860004649?pwd=WDRHR00zSmpTOFUrMTluc21WOHNaQT09

**Mixed Proportional Hazard Models with Complex Samples**

Speaker: Brad Drayton

Affiliation: The University of Auckland

When: Friday, 25 February 2022, 2:00 pm to 3:00 pm

Where: https://auckland.zoom.us/j/94433887376?pwd=RHhTLzBvYnYvSkJKOVpMamkrS1FzZz09

Large amounts of data are collected by the administrative parts of government. These data can potentially be useful for research, but we need valid analysis methods that can account for their complex characteristics. This project focuses on two characteristics of time-to-event data – cluster correlation and complex samples. The use of auxiliary data to improve estimates will also be investigated. Time-to-event data are traditionally modelled with the famous Cox proportional hazards model (PHM). This has been extended to account for cluster correlation – the mixed-effect PHM, and to work with complex samples. A current gap in both theory and software is handling both cluster correlation and complex sampling at once. In this talk I will review key parts of the theory around Cox models, the mixed-effect and complex sample extensions, and propose plans to integrate these theories. I will also review the use of auxiliary information, and how our new theory may be extended to incorporate this.

**Learning symmetries of regression functions**

Speaker: Louis Christie

Affiliation: University of Cambridge

When: Tuesday, 22 February 2022, 2:00 pm to 3:00 pm

Where: https://auckland.zoom.us/j/99560912672?pwd=cEIyR2taOFpPNTNRRmk4TmgvUVFLdz09

Models that incorporate the symmetry of an object to be estimated (in this talk, non-parametric regression functions) perform better. One way to build symmetry into a model is feature averaging: taking an average of the model's output over the orbit of the input under some group action G. This averaging operator is a projection on L^2 functions into the space of invariant functions, and so the generalisation error under L^2 loss of the averaged model is guaranteed to be small if the regression function f is invariant to the action. These symmetrised models are increasingly being used in practice through invariant and equivariant neural networks. In this talk we present a consistent hypothesis test for H_0: f is G-invariant against H_1: f is not G-invariant to give confidence to the use of such models when the symmetry is not known a priori. This is based on a family of test statistics derived from the test errors of the symmetrised and unsymmetrised model. Further, we present a method for estimating the maximal symmetry of a regression function from within a specified class of symmetries.
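Feature averaging over a finite group can be illustrated with a small sketch. It assumes the simplest non-trivial case, the two-element group acting on the real line by sign flips, and a hypothetical fitted model that is not itself invariant. The names `symmetrise`, `raw_model`, and `GROUP` are illustrative and do not come from the talk.

```python
def symmetrise(model, group_actions):
    """Return the feature-averaged model: the mean of `model` over the orbit
    of the input under a finite list of group actions."""
    def averaged(x):
        return sum(model(g(x)) for g in group_actions) / len(group_actions)
    return averaged


def identity(x):
    return x


def negate(x):
    return -x


# Assumed group G = {identity, sign flip} acting on real inputs.
GROUP = [identity, negate]


def raw_model(x):
    """Hypothetical fitted model; the 0.5 * x term breaks sign-flip invariance."""
    return x ** 2 + 0.5 * x


sym_model = symmetrise(raw_model, GROUP)

# The averaged model takes the same value on x and -x ...
print(sym_model(2.0), sym_model(-2.0))

# ... and averaging a second time changes nothing, as expected of a projection.
twice = symmetrise(sym_model, GROUP)
print(twice(2.0))
```

Here the averaging removes the odd term and leaves x squared, so the symmetrised model is invariant by construction; the hypothesis test described in the abstract addresses whether imposing such invariance is justified by the data.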