## Department of Statistics

# Seminars

**Bursty Markovian Arrival Processes**

Speaker: Azam Asanjarani

Affiliation: The University of Auckland

When: Wednesday, 23 January 2019, 3:00 pm to 4:00 pm

Where: 303-610

We consider stationary Markovian Arrival Processes (MAPs) where both the squared coefficient of variation of inter-event times and the asymptotic index of dispersion of counts are greater than unity. We refer to such MAPs as bursty. The simplest bursty MAP is a Hyperexponential Renewal Process (H-renewal process). Applying Matrix Analytic Methods (MAM), we establish further classes of MAPs as bursty MAPs: the Markov Modulated Poisson Process (MMPP), the Markov Transition Counting Process (MTCP) and the Markov Switched Poisson Process (MSPP). Of these, MMPP has been used most often in applications, but as we illustrate, MTCP and MSPP may serve as alternative models of bursty traffic. Hence understanding MTCPs, MSPPs, and MMPPs and their relationships is important from a data modelling perspective. We establish a duality in terms of first and second moments of counts between MTCPs and a rich class of MMPPs which we refer to as slow-MMPPs (modulation is slower than the events).
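As a small illustration of the first burstiness condition, the sketch below computes the squared coefficient of variation (SCV) of hyperexponential inter-event times, the inter-event distribution of an H-renewal process. The function name `hyperexp_scv` and the example parameters are illustrative assumptions and are not taken from the talk.

```python
def hyperexp_scv(probs, rates):
    """SCV of a hyperexponential mixture: with probability probs[i] the
    inter-event time is exponential with rate rates[i]."""
    if abs(sum(probs) - 1.0) > 1e-12:
        raise ValueError("mixing probabilities must sum to 1")
    # First and second moments of the mixture of exponentials.
    mean = sum(p / lam for p, lam in zip(probs, rates))
    second_moment = sum(2.0 * p / lam**2 for p, lam in zip(probs, rates))
    variance = second_moment - mean**2
    return variance / mean**2


print(hyperexp_scv([1.0], [2.0]))             # single phase: plain exponential
print(hyperexp_scv([0.5, 0.5], [1.0, 10.0]))  # two phases with different rates
```

A single exponential phase has SCV equal to 1, while mixing phases with different rates pushes the SCV above 1 (a consequence of the Cauchy-Schwarz inequality). The second condition in the abstract, on the asymptotic index of dispersion of counts, is not checked here.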

**Critical issues in recent guidelines**

Speaker: Prof Markus Neuhaeuser

Affiliation: Dept. of Mathematics and Technology, Koblenz University of Applied Sciences, Remagen, Germany

When: Tuesday, 12 February 2019, 11:00 am to 12:00 pm

Where: 303-B09

To increase rigor and reproducibility, some medical journals provide detailed guidelines for experimental design and statistical analysis. Although this development is positive, quite a few recommendations are problematic because they reduce power or are indefensible from a statistical point of view. This is shown using two current examples, namely the checklist of the journal Circulation Research published in 2017 [Circulation Research. 2017; 121:472-9] and the guideline of the British Journal of Pharmacology published in 2018 [British Journal of Pharmacology. 2018; 175:987-93]. Topics discussed are the analysis of variance in case of heteroscedasticity, the question of balanced sample sizes, the power calculation including so-called post-hoc power analyses, minimum group sizes, and the t test for small samples.

https://www.hs-koblenz.de/en/profilepages/profil/neuhaeuser/

**Data depth and generalized sign tests**

Speaker: Christine Mueller

Affiliation: Fakultät Statistik Technische Universität Dortmund

When: Wednesday, 20 February 2019, 11:00 am to 12:00 pm

Where: 303-310

Data depth is one of the approaches to generalizing the outlier-robust median to more complex situations. In this talk, I show how this can be done for multivariate data and regression. The concepts of half-space depth and simplicial depth are crucial for the generalization to multivariate data, while the concept of nonfit is used to define regression depth and simplicial regression depth. Simplicial regression depth often leads to a so-called sign depth and corresponding generalized sign tests. These generalized sign tests can be used as soon as residuals are available and are much more powerful than the classical sign test. I demonstrate this for

**Variable Selection and Dimension Reduction methods for high dimensional and Big-Data Set.**

Speaker: Benoit Liquet

Affiliation: University of Pau & Pays de L'Adour, ACEMS (QUT)

When: Wednesday, 27 February 2019, 11:00 am to 12:00 pm

Where: 303-310

It is well established that incorporating prior knowledge of the structure existing in the data, such as a potential grouping of the covariates, is key to more accurate prediction and improved interpretability. In this talk, I will present new multivariate methods that incorporate grouping structure in Bayesian and frequentist methodology for variable selection and dimension reduction, to tackle the analysis of high-dimensional and big data sets.

**Bayesian demographic estimation and forecasting**

Speaker: John Bryant

Affiliation: New Zealand Treasury

When: Wednesday, 27 March 2019, 11:00 am to 12:00 pm

Where: 303-310

Demographers face some tricky statistical problems. Users of demographic statistics are demanding estimates and forecasts that are as disaggregated as possible, including not just age and sex, but also ethnicity, area, income, and much else besides. Administrative data, and big data more generally, offer new possibilities. But these new data sources tend to be noisy, incomplete, and mutually inconsistent. The talk will describe a long-term project to develop Bayesian methods, and the associated software, to address these problems. We will look at some concrete examples, including a probabilistic demographic account for New Zealand.

About the Speaker:

John Bryant has worked at universities in New Zealand and Thailand, at the New Zealand Treasury, and (until February 2019) at Stats NZ. He has a PhD in Demography from the Australian National University. He is the author, with Junni Zhang, of the book Bayesian Demographic Estimation and Forecasting, published by CRC Press in 2018.

**Data Science on Music Data**

Speaker: Prof. Claus Weihs

Affiliation: TU Dortmund University, Germany

When: Wednesday, 17 April 2019, 11:00 am to 12:00 pm

Where: 303-310

The talk discusses the structure of the field of data science and substantiates the key premise that statistics is one of the most important disciplines in data science and the most important discipline to analyze and quantify uncertainty. As an application, the talk demonstrates data science methods on music data for automatic transcription and automatic genre determination, both on the basis of signal-based features from audio recordings of music pieces.

Literature:

Claus Weihs and Katja Ickstadt (2018): Data Science: The Impact of Statistics; International Journal of Data Science and Analytics 6, 189–194

Claus Weihs, Dietmar Jannach, Igor Vatolkin and Günter Rudolph (Eds.) (2017): Music Data Analysis: Foundations and Applications; CRC Press, Taylor & Francis, 675 pages

**Critical two-point function for long-range models with power-law couplings: The marginal case for $d\ge d_c$**

Speaker: Akira Sakai

Affiliation: Department of Mathematics, Hokkaido University, Japan

When: Wednesday, 16 January 2019, 3:00 pm to 4:00 pm

Where: 303-610

Consider the long-range models on $\mathbb{Z}^d$ of random walk, self-avoiding walk, percolation and the Ising model, whose translation-invariant 1-step distribution/coupling coefficient decays as $|x|^{-d-\alpha}$ for some $\alpha>0$. In the previous work (Ann. Probab., 43, 639--681, 2015), we have shown in a unified fashion for all $\alpha\ne2$ that, assuming a bound on the "derivative" of the $n$-step distribution (the compound-zeta distribution satisfies this assumed bound), the critical two-point function $G_{p_c}(x)$ decays as $|x|^{\alpha\wedge2-d}$ above the upper-critical dimension $d_c\equiv(\alpha\wedge2)m$, where $m=2$ for self-avoiding walk and the Ising model and $m=3$ for percolation.

In this talk, I will show in a much simpler way, without assuming a bound on the derivative of the $n$-step distribution, that $G_{p_c}(x)$ for the marginal case $\alpha=2$ decays as $|x|^{2-d}/\log|x|$ whenever $d\ge d_c$ (with a large spread-out parameter $L$). This solves the conjecture in the previous work, extended all the way down to $d=d_c$, and confirms a part of predictions in physics (Brezin, Parisi, Ricci-Tersenghi, J. Stat. Phys., 157, 855--868, 2014). The proof is based on the lace expansion and new convolution bounds on power functions with log corrections.

**A hidden Markov open population model for spatial capture-recapture surveys**

Speaker: Prof David Borchers

Affiliation: University of St Andrews

When: Wednesday, 12 December 2018, 11:00 am to 12:00 pm

Where: 303-310

Open population capture-recapture models are widely used to estimate population demographics and abundance over time. The only published open population methods for spatial capture-recapture surveys use Markov chain Monte Carlo methods for inference, and have relatively high computational cost. We formulate open population spatial capture-recapture surveys as a hidden Markov model, allowing inference by maximum likelihood for both Cormack-Jolly-Seber and Jolly-Seber models with and without animal activity centre movement. The method is applied to a twelve-year survey of male jaguars (Panthera onca) in the Cockscomb Wildlife Sanctuary Basin, Belize, to estimate survival probability and population abundance over time. For this application, inference is shown to be biased when assuming activity centres are fixed over time, while including a model for activity centre movement provides negligible bias and nominal confidence interval coverage, as demonstrated by a simulation study. The hidden Markov model approach is compared with a Bayesian method and a series of closed population models applied to the same data. The method is much more efficient than the Bayesian approach and provides a lower root-mean-square error in predicting population density compared to a series of closed population models.

**Visualization and directional measures of population differentiation based on the saddlepoint approximation method**

Speaker: Louise McMillan

Affiliation: The University of Auckland

When: Wednesday, 5 December 2018, 3:00 pm to 4:00 pm

Where: 303-310

We propose a method for visualizing genetic assignment data by characterizing the distribution of genetic profiles for each candidate source population (McMillan & Fewster, 2017). This method enhances the assignment method of Rannala & Mountain (1997) by calculating appropriate graph positions for individuals for which some genetic data are missing. An individual with missing data is positioned in the distributions of genetic profiles for a population according to its estimated quantile based on its available data. The quantiles of the genetic profile distribution for each population are calculated by approximating the cumulative distribution function (CDF) using the saddlepoint method, and then inverting the CDF to get the quantile function. The saddlepoint method also provides a way to visualize assignment results calculated using the leave-one-out procedure. We call the resulting plots GenePlots.

This new method offers an advance upon assignment software such as GeneClass2, which provides no visualization method, and is biologically more interpretable than the bar charts provided by the widely-used genetic software STRUCTURE.

We show results from simulated data and apply the methods to microsatellite genotype data from ship rats (Rattus rattus) captured on the Great Barrier Island archipelago, New Zealand. The visualization method makes it straightforward to detect features of population structure and to judge the discriminative power of the genetic data for assigning individuals to source populations.

We then advance these techniques further by proposing methods for quantifying population genetic structure, and associated tests of significance. The measures we propose are closely related to GenePlots, and enable visual features obvious from the plots to be expressed more formally. One measure is the interloper detection probability: for two random genotypes arising from populations A and B, the probability that the one from A has the better fit to A and thus the genotype from B would be correctly identified as the 'interloper' in A. Another measure is the correct assignment probability: this corresponds to the probability that a random genotype arising from A would be correctly assigned to A rather than B.

Using permutation tests, we can test two populations for significant population structure. These permutation tests are sensitive to subtle population structure, and are particularly useful for eliciting asymmetric features of the populations being studied, e.g. where one population has undergone extensive genetic drift but the other population has remained large enough to retain greater genetic diversity.

McMillan, L. F. and Fewster, R. M. (2017), Visualizations for genetic assignment analyses using the saddlepoint approximation method. Biometrics, 73:1029–1041. doi:10.1111/biom.12667

**The mixture of Markov jump processes: distributional properties and statistical estimation**

Speaker: Budhi Surya

Affiliation: Victoria University of Wellington

When: Wednesday, 21 November 2018, 3:00 pm to 4:00 pm

Where: 303-310

In this talk, I will discuss a tractable development and statistical estimation of the finite mixture of right-continuous Markov jump processes moving at different speeds on the same finite state space, while the speed regimes are assumed to be unobservable. Unlike the underlying processes, the mixture itself does not have the Markov property. The mixture model was first proposed by Frydman (JASA 2005) as a generalization of the mover-stayer model of Blumen et al. (Cornell Stud. Ind. Labor Relat., 1955) and was recently generalized by Surya (Stochastic Systems 2018 and Surya 2018), in which further distributional properties and explicit identities of the process are given. Statistical estimation is performed under complete and incomplete observations of the sample paths of the mixture process. Under complete observations, maximum likelihood estimates are given in explicit form in terms of sufficient statistics of the process. Estimation under incomplete observation is performed using the EM algorithm. The estimation results completely characterize the mixture process in terms of the initial probability of the process, intensity matrices of the underlying Markov processes, and the switching probability matrix. The new method generalizes the existing statistical inferences for the Markov model (Albert, Ann. Math. Statist., 1961), the mover-stayer model (Frydman, JASA 1984), and the Markov mixture model (Frydman, JASA 2005). Some numerical examples are given to illustrate the results.

**Estimation of inbreeding, kinship and population structure parameters**

Speaker: Professor Bruce Weir

Affiliation: Department of Biostatistics, University of Washington, Washington, US

When: Friday, 16 November 2018, 1:00 pm to 2:00 pm

Where: 303-G15

Knowledge of inbreeding, kinship and population structure is important for many aspects of population, quantitative, ecological, human and forensic genetics. These quantities can all be framed in terms of probabilities of pairs of alleles being identical by descent, within and between individuals, or within and between populations.

Values of the various parameters can be predicted if the individual or population pedigrees are known. With information about only the observed individuals or populations, however, it can be difficult to estimate the parameters. Many current estimators use functions involving squares and products of sample allele frequencies and this can result in rankings of kinship estimates, for example, that do not reflect the rankings of the pedigree values for a set of individuals.

Professor Weir and his colleagues are having success with approaches based on allele matching proportions, without explicit use of sample allele frequencies. In this seminar, he will present results from recent papers in Genetics and Molecular Ecology and some unpublished work with simulated, human and natural population data.

**The split-and-drift random graph, a null model for speciation**

Speaker: Francois Bienvenu

Affiliation: Sorbonne Universite, Paris

When: Wednesday, 14 November 2018, 3:00 pm to 4:00 pm

Where: 303-310

We introduce a new random graph model motivated by biological questions relating to speciation. This random graph is defined as the stationary distribution of a Markov chain on the space of graphs on {1,... ,n}. The dynamics of this Markov chain is governed by two types of events: vertex duplication, where at constant rate a pair of vertices is sampled uniformly and one of these vertices loses its incident edges and is rewired to the other vertex and its neighbors; and edge removal, where each edge disappears at constant rate. Besides the number of vertices n, the model has a single parameter r_n.

Using a coalescent approach, we obtain explicit formulas for the first moments of several graph invariants such as the number of edges or the number of complete subgraphs of order k. These are then used to identify five non-trivial regimes depending on the asymptotics of the parameter r_n. We derive an explicit expression for the degree distribution, and show that under appropriate rescaling it converges to classical distributions when the number of vertices goes to infinity. Finally, we give asymptotic bounds for the number of connected components, and show that in the sparse regime the number of edges is Poissonian.
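The dynamics described in this abstract can be sketched as a short simulation of the chain's jump sequence. This is a minimal sketch under stated assumptions, not the speaker's implementation: the rates (1 per ordered pair of vertices for duplication, `r` per edge for removal), the function names, and the choice to start from an empty graph are all assumptions made here for illustration.

```python
import random


def split_and_drift_event(adj, r, rng):
    """Apply one event to the graph `adj` (dict: vertex -> set of neighbours).

    Assumed rates: 1 per ordered pair of vertices for duplication,
    r per edge for removal.
    """
    vertices = sorted(adj)
    if len(vertices) < 2:
        raise ValueError("need at least two vertices")
    edges = [(u, v) for u in vertices for v in adj[u] if u < v]
    dup_rate = len(vertices) * (len(vertices) - 1)
    rem_rate = r * len(edges)
    if rng.random() * (dup_rate + rem_rate) < dup_rate:
        # Vertex duplication: v loses its incident edges and is rewired
        # to u and to all of u's neighbours.
        u, v = rng.sample(vertices, 2)
        for w in adj[v]:
            adj[w].discard(v)
        adj[v] = set(adj[u]) | {u}
        for w in adj[v]:
            adj[w].add(v)
    else:
        # Edge removal: one uniformly chosen edge disappears.
        u, v = rng.choice(edges)
        adj[u].discard(v)
        adj[v].discard(u)


def simulate(n, r, n_events, seed=0):
    """Run n_events jumps starting from the empty graph on {1, ..., n}."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(1, n + 1)}
    for _ in range(n_events):
        split_and_drift_event(adj, r, rng)
    return adj


graph = simulate(n=8, r=2.0, n_events=500)
print(sum(len(nb) for nb in graph.values()) // 2, "edges")
```

Running many events only approximates a draw from the stationary distribution; the explicit formulas mentioned in the abstract come from a coalescent argument rather than from simulation.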

**All New Zealand Acute Coronary Syndrome - Quality Improvement (ANZACS-QI) Programme: National Coverage and Applications**

Speaker: Yannan Jiang and Rachel Chen

Affiliation: Statistical consulting centre, Department of Statistics, Auckland University

When: Wednesday, 14 November 2018, 11:00 am to 12:00 pm

Where: 303-310

Governance of the ANZACS-QI programme is on behalf of the NZ branch of the Cardiac Society of Australia and New Zealand (CSANZ) and the NZ Cardiac Network, and funded by the NZ Ministry of Health. The goal of this programme is to support appropriate, evidence-based management of ACS and drive improvements in the quality of cardiac care delivered in New Zealand’s hospitals. It also aims to reduce regional variation in regard to the assessment, investigation and treatment of patients with ACS.

ANZACS-QI utilises two main sources of anonymised heart disease related patient data: the ANZACS-QI web-based registry deployed nationally to 41 public hospitals and 6 private hospitals, and the MOH national hospitalisation and mortality datasets. This talk will provide an overview of the ANZACS-QI programme and data collections, discuss its coverage and agreement with the national administrative datasets, and present some applications in the national healthcare system and clinical research studies.

**Mixture-based Nonparametric Density Estimation**

Speaker: Yong Wang

Affiliation: The University of Auckland

When: Wednesday, 7 November 2018, 11:00 am to 12:00 pm

Where: 303-310

In this talk, I will describe a general framework for nonparametric density estimation that uses nonparametric or semiparametric mixture distributions. Similar to kernel-based estimation, the proposed approach uses bandwidth to control the density smoothness, but each density estimate for a fixed bandwidth is determined by likelihood maximization, with bandwidth selection carried out as model selection. This leads to much simpler models than the kernel ones, yet with higher accuracy. Results of simulation studies and real-world data in both the univariate and the multivariate situation will be given, all suggesting that these mixture-based estimators outperform the kernel-based ones.

**Version control: An introduction to git and GitHub using RStudio**

Speaker: Blake Seers

Affiliation: Data analyst, Department of Statistics Consultant Statistical Consulting Centre (SCC), Auckland University

When: Wednesday, 10 October 2018, 11:00 am to 12:00 pm

Where: 303-310

Reproducibility is the hallmark of good science. Git is becoming increasingly popular with researchers as it can facilitate greater reproducibility and transparency in science [1]. In this talk I will introduce the git version control system, the GitHub hosting service, and how to get started on using git for version control in RStudio.

[1]: Ram, K 2013. Git can facilitate greater reproducibility and increased transparency in science. Source Code for Biology and Medicine 2013 8:7. https://scfbm.biomedcentral.com/articles/10.1186/1751-0473-8-7

**Maximum likelihood estimation for latent count models**

Speaker: Wei Zhang

Affiliation: Department of Statistics, University of Auckland

When: Wednesday, 26 September 2018, 11:00 am to 12:00 pm

Where: 303-310

Latent count models constitute an important modeling class in which a latent vector of counts, z, is summarized or corrupted for reporting, yielding observed data y=Tz where T is a known but non-invertible matrix. The observed vector y generally follows an unknown multivariate distribution with a complicated dependence structure. Latent count models arise in diverse fields, such as estimation of population size from capture-recapture studies; inference on multi-way contingency tables summarized by marginal totals; or analysis of route flows in networks based on traffic counts at a subset of nodes. Currently, inference under these models relies primarily on stochastic algorithms for sampling the latent vector z, typically in a Bayesian data-augmentation framework. These schemes involve long computation times and can be difficult to implement. Here, we present a novel maximum-likelihood approach using likelihoods constructed by the saddlepoint approximation. We show how the saddlepoint likelihood may be maximized efficiently, yielding fast inference even for large problems. For the case where z has a multinomial distribution, we validate the approximation by applying it to a specific model for which an exact likelihood is available. We implement the method for several models of interest, and evaluate its performance empirically and by comparison with other estimation approaches. The saddlepoint method consistently gives fast and accurate inference, even when y is dominated by small counts.
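To make the setting y=Tz concrete, the sketch below uses the contingency-table case mentioned in the abstract: a 2x2 table of latent counts z is reported only through its row and column totals y. The matrix `T`, the helper names, and the example counts are illustrative assumptions, and the brute-force enumeration is not the saddlepoint method of the talk.

```python
from itertools import product

# Rows of T: the two row totals, then the two column totals, of a 2x2 table
# stored as z = (z11, z12, z21, z22). T is square but not invertible,
# because the row totals and the column totals have the same sum.
T = [
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
]


def apply_T(z):
    """Observed summary y = Tz for a latent count vector z."""
    return tuple(sum(t * zi for t, zi in zip(row, z)) for row in T)


def latent_vectors(y):
    """All non-negative integer z with Tz = y, found by brute force."""
    bound = max(y)  # no cell can exceed the largest marginal total
    return [z for z in product(range(bound + 1), repeat=4)
            if apply_T(z) == tuple(y)]


y = apply_T((1, 1, 0, 1))
print(y)
print(latent_vectors(y))
```

More than one latent table is compatible with the same margins, so the likelihood of y has to account for every compatible z. Enumeration like this is feasible only for tiny examples.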