## Department of Statistics

# 2014 Seminars

**EM-test and the related issues**

Speaker: Jiahua Chen

Affiliation: UBC

When: Monday, 15 December 2014, 11:00 am to 12:00 pm

Where: 303.310

In scientific investigations, a population is often suspected of containing several more homogeneous sub-populations. Such a population structure is most accurately described by a finite mixture model, but the model should only be adopted when a rigorous hypothesis test provides statistically significant evidence for it. Developing valid and effective statistical inference methods for the mixing distribution is an important research topic, yet it poses serious technical challenges. Classical procedures, when applied to mixture models, often have sophisticated asymptotic properties which render them useless in applications. For a large number of finite mixture models, we have successfully designed corresponding EM-tests whose limiting distributions are easier to derive mathematically and simple to implement in data analysis.

In this talk, we will first give a general introduction to EM-tests. We will illustrate their elegant asymptotic properties and present a new approach to the tuning of ancillary parameters. We will also present some results on sample size determination.

http://www.stat.ubc.ca/~jhchen/

**Probabilistic consensus via polling and majority rules**

Speaker: Richard Cruise

Affiliation: Heriot Watt

When: Thursday, 20 November 2014, 3:00 pm to 4:00 pm

Where: 303.310

We consider lightweight decentralised algorithms for achieving consensus in distributed systems. Each member of a distributed group has a private value from a fixed set consisting of, say, two elements, and the goal is for all members to reach consensus on the majority value. We explore variants of the voter model applied to this problem. In the voter model, each node polls a randomly chosen group member and adopts its value. The process is repeated until consensus is reached. We generalize this so that each member polls a (deterministic or random) number of other group members and changes opinion only if a suitably defined super-majority has a different opinion. We show that this modification greatly speeds up the convergence of the algorithm, as well as substantially reducing the probability of it reaching consensus on the incorrect value.
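The polling rule described in this abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions: the function name `poll_consensus`, the asynchronous update order, polling with replacement, and the parameters `poll_size` and `threshold` (standing in for the "suitably defined super-majority") are illustrative choices, not the speaker's algorithm. Setting `poll_size=1` and `threshold=1` gives the classical voter model, in which a node simply adopts the value of one randomly chosen member.

```python
import random


def poll_consensus(values, poll_size=3, threshold=2, rng=None, max_steps=1_000_000):
    """Run asynchronous polling on a non-empty list of 0/1 values.

    Returns (consensus_value, steps), or (None, steps) if max_steps is hit first.
    All parameter names are hypothetical, chosen for this illustration.
    """
    rng = rng or random.Random()
    values = list(values)
    n = len(values)
    ones = sum(values)
    steps = 0
    while 0 < ones < n and steps < max_steps:
        i = rng.randrange(n)  # node that updates at this step
        # Poll `poll_size` members uniformly at random (with replacement).
        polled = [values[rng.randrange(n)] for _ in range(poll_size)]
        disagree = sum(1 for v in polled if v != values[i])
        # Change opinion only if enough polled members hold the other value.
        if disagree >= threshold:
            ones += 1 - 2 * values[i]  # 0 -> 1 adds one, 1 -> 0 removes one
            values[i] = 1 - values[i]
        steps += 1
    if 0 < ones < n:
        return None, steps
    return values[0], steps


if __name__ == "__main__":
    start = [1] * 30 + [0] * 20  # value 1 is the initial majority
    print(poll_consensus(start, poll_size=3, threshold=2, rng=random.Random(0)))
    print(poll_consensus(start, poll_size=1, threshold=1, rng=random.Random(0)))
```

Running both settings from the same starting configuration over many seeds is one way to compare how often each rule settles on the initial majority value and how many updates it needs.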

**Three "multiscale" problems**

Speaker: Jesse Goodman

Affiliation: U. Auckland

When: Thursday, 20 November 2014, 11:00 am to 12:00 pm

Where: 303.310

Run a Brownian motion on the surface of a sphere for a long time. What are the shapes and sizes of the regions that have not been visited?

Answering this question turns out to be closely related to finding the maximum of many correlated Gaussian variables. This talk aims to explain, in relatively non-technical language, the common features of these problems, and how "multi-scale analysis" enables us to understand some systems with strong, long-range correlation.

https://www.stat.auckland.ac.nz/people/jgoo625

**Urns & Balls and Communications**

Speaker: Emina Soljanin

Affiliation: Bell Labs

When: Thursday, 13 November 2014, 3:00 pm to 4:00 pm

Where: 303.310

Urns and balls models refer to basic probabilistic experiments in which balls are thrown randomly into urns, and we are interested in various patterns of urn occupancy (e.g., the number of empty urns). These models are central in many disciplines such as combinatorics, statistics, analysis of algorithms, and statistical physics. After covering the fundamentals, we will show how some modern network and traffic communications scenarios give rise to problems that are related to the classical urns and balls questions. We will also describe how some new models and problems emerge in content delivery because information packets can be processed (e.g., by using finite field arithmetic) in a way their physical counterparts, urns and balls, cannot.
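As a concrete instance of the occupancy questions mentioned above, the following sketch compares the exact expected number of empty urns with a simulated average. It assumes the simplest setting, with each ball thrown independently and uniformly at random; the function names are hypothetical and not taken from the talk.

```python
import random


def expected_empty_urns(n_urns, n_balls):
    """Exact expectation: a given urn is missed by all balls with probability (1 - 1/n)^m."""
    return n_urns * (1 - 1 / n_urns) ** n_balls


def simulate_empty_urns(n_urns, n_balls, trials=2000, rng=None):
    """Average number of empty urns over repeated independent experiments."""
    rng = rng or random.Random()
    total = 0
    for _ in range(trials):
        # The set of distinct urns hit by at least one ball.
        occupied = {rng.randrange(n_urns) for _ in range(n_balls)}
        total += n_urns - len(occupied)
    return total / trials


if __name__ == "__main__":
    print(expected_empty_urns(20, 20))
    print(simulate_empty_urns(20, 20, rng=random.Random(1)))
```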

http://ect.bell-labs.com/who/emina/

**Modelling evidential structures in linked crimes using Bayesian networks**

Speaker: Jacob de Zoete

Affiliation: University of Amsterdam

When: Thursday, 23 October 2014, 4:00 pm to 5:00 pm

Where: 303.310 Statistics Seminar Room

When two or more crimes show specific similarities, such as a very distinct modus operandi, the probability that they were committed by the same offender becomes relevant. This probability depends on the degree of similarity and distinctiveness. We show how Bayesian networks can be used to model different evidential structures that can occur when linking crimes, and how they assist in understanding the complex underlying dependencies, that is, how evidence obtained in one case can be used in another and vice versa. The flip side of this is that the intuitive decision to "unlink" a case in which exculpatory evidence is obtained leads to serious overestimation of the strength of the remaining cases.

The examples that will be discussed include Bayesian networks for different numbers of cases and different numbers of offenders. Although the examples discussed are about crime linkage, they generalise to other types of problems where multiple items are combined, e.g. multiple stain problems in forensic practice. The same questions are relevant and similar caution is needed.

**The openapi Project**

Speaker: Paul Murrell

Affiliation: U. Auckland

When: Thursday, 18 September 2014, 4:00 pm to 5:00 pm

Where: 303.310

The global movement towards Open Government Data has led to a wealth of data being made available to the general public, with huge potential benefits for social, economic, and political empowerment.

However, few individuals possess the knowledge and skills to make use of these data by themselves. The openapi project is developing a flow-based framework that is primarily aimed at lowering the barriers to use of Open Data by the general public. This talk will present an overview of the openapi project, present some early prototype software that has been developed, and discuss some of the future goals and challenges for the project.

https://www.stat.auckland.ac.nz/people/pmur002

**Semiparametric generalized linear models for time-series data**

Speaker: Thomas Fung

Affiliation: Macquarie University

When: Thursday, 4 September 2014, 4:00 pm to 5:00 pm

Where: 303.B11

We introduce a semiparametric generalized linear models framework for time-series data that does not require specification of a working distribution or variance function for the data. Rather, the conditional response distribution is treated as an infinite-dimensional parameter, which is then estimated simultaneously with the usual finite-dimensional parameters via a maximum empirical likelihood approach. A general consistency result for the resulting estimators is shown. Simulations suggest that both estimation and inference using the proposed method can perform as well as a correctly specified parametric model even for moderate sample sizes, but are much more robust than parametric methods under model misspecification. The method is used to analyse the Polio dataset from Zeger (1988). This talk represents joint research with Dr Alan Huang.

**Adjusting for linkage bias in the New Zealand Longitudinal Census cohort**

Speaker: Dr Barry Milne

Affiliation: COMPASS

When: Thursday, 21 August 2014, 4:00 pm to 5:00 pm

Where: 303.310

The recent development of the New Zealand Longitudinal Census cohort - which links data for individuals across the NZ Censuses from 1981 - 2006 - creates an opportunity to answer a number of innovative research questions. However, as there is incomplete linkage across censuses (i.e. some individuals are able to be linked while others are not), there is the potential for bias if associations among those linked differ from associations in the full population. I will describe the New Zealand Longitudinal Census cohort and some attempts to identify and adjust for bias. I will show that there is evidence of association biases, and that biases tend to be greater for unadjusted vs covariate adjusted associations. Weighting by the inverse probability of linkage reduced but did not eliminate these biases. The implications of this remaining bias will be discussed, and I will also discuss other methods for bias-adjustment that might be considered.
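The idea of weighting by the inverse probability of linkage can be shown on a toy example. Everything below is hypothetical: the outcomes, the linkage probabilities, and the function name `weighted_mean` are made up for illustration and have nothing to do with the census data. In this toy case the linkage probabilities are known exactly, so the weighting removes the bias completely; with estimated probabilities, as in the talk, some bias can remain.

```python
def weighted_mean(outcomes, link_probs):
    """Mean of `outcomes`, weighting each linked record by 1 / P(linked)."""
    weights = [1.0 / p for p in link_probs]
    return sum(w * y for w, y in zip(weights, outcomes)) / sum(weights)


# Hypothetical population: 100 people with outcome 1 and 100 with outcome 0,
# so the true mean is 0.5. Those with outcome 1 are linked with probability 0.5
# (50 records observed); those with outcome 0 are always linked (100 records).
outcomes = [1] * 50 + [0] * 100
link_probs = [0.5] * 50 + [1.0] * 100

naive = sum(outcomes) / len(outcomes)           # biased towards the well-linked group
adjusted = weighted_mean(outcomes, link_probs)  # recovers the population mean here

print(naive, adjusted)
```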

http://www.arts.auckland.ac.nz/people/bmil080

**Random Projections as Regularizers: Learning a Linear Discriminant from Fewer Observations than Dimensions**

Speaker: Dr Bob Durrant

Affiliation: Statistics, University of Waikato

When: Wednesday, 6 August 2014, 4:00 pm to 5:00 pm

Where: 303.310

'Random projection' is a simple non-adaptive dimensionality reduction scheme which is finding increasing use in applications as a result of its computational cheapness, effectiveness, and pleasing theoretical properties.

In this talk I will present some theoretical guarantees for the performance of an averaging-ensemble of randomly projected Fisher Linear Discriminant (FLD) classifiers, focusing on the case when there are far fewer training observations than data dimensions. I will also discuss the attractive computational properties of the algorithm this theory implies.

The specific form and simplicity of this ensemble permit a direct and detailed analysis of its performance. In particular, one can show that the randomly projected ensemble is equivalent to applying a sophisticated regularization scheme to the linear discriminant learned in the original data space. This helps prevent overfitting in conditions of small sample size, where pseudo-inverse FLD learned in the data space is provably poor.

I will also present some experimental results, which confirm our theoretical findings and demonstrate the utility of our approach on some very high-dimensional datasets from the bioinformatics and drug discovery domains, both settings in which fewer observations than dimensions are the norm.

http://www.stats.waikato.ac.nz/~bobd/

**Survival Probability of a Random Walk Among a Poisson System of Moving Traps**

Speaker: Alex Drewitz

Affiliation: Columbia

When: Thursday, 3 July 2014, 11:00 am to 12:00 pm

Where: 303.310

We consider the survival probability of a random walk among a Poisson system of moving traps on the lattice, which can also be interpreted as the solution of a parabolic Anderson model with a random time-dependent potential.

We prove that the averaged survival probability decays sub-exponentially, with known dimension-dependent rates. In addition, we show that the quenched survival probability always decays exponentially.

A key ingredient in the proofs is what is known in the physics literature as the Pascal principle, which asserts that the annealed survival probability is maximised if the random walk stays at a fixed position. As a corollary, we obtain that the expected cardinality of the range of a continuous time symmetric random walk increases under perturbation by a deterministic path.

http://www.math.columbia.edu/~drewitz/

**Selfish routing in a system of parallel queues**

Speaker: Alex Wang

Affiliation: U. Auckland

When: Thursday, 29 May 2014, 3:00 pm to 4:00 pm

Where: 303.310

**PhD provisional talk**

Adding capacity to queues doesn't always yield better performance when individuals choose their own route through a network. In this talk I will give results for a system of parallel queues, with one batch-service queue and several first-come, first-served queues. A brief outline of plans for future research will also be presented.

**Wiki New Zealand: Democratising data so that everyone can know their country and make informed decisions.**

Speaker: Lillian Grace

Affiliation: Wiki New Zealand

When: Thursday, 8 May 2014, 4:00 pm to 5:00 pm

Where: 303.310

There is a vast amount of data available on New Zealand's performance and position globally, but extracting this data into useable form requires both time and skill. Wiki New Zealand brings data together in one place, in visual formats that allow information to be easily digested.

As the name indicates, Wiki New Zealand enables contributors to register and upload data to be published through an audited process. This redefines the audience that can be reached by researchers, organisations and domain experts with data about New Zealand, by enabling them to produce accessible information as well as valuable data.

The venture has been live since December 2012 and wikinewzealand.org currently has about 8,000 visitors each month. The aim is to grow this number to about 200,000 per month and become a household name. This will be done by partnering with organisations to increase auditing and content creation capacity, by targeting specific user groups like education and media, and by running national campaigns.

----

Watch Lillian Grace at TEDx Auckland

https://www.youtube.com/watch?v=17eTVC_cOJI

---

https://www.stat.auckland.ac.nz/seminar/recorded/lilian2

**Human-assisted spread of a maladaptive behavior in a critically endangered bird: sequential Monte Carlo on population pedigrees for rare events**

Speaker: Raazesh Sainudiin

Affiliation: U. Canterbury

When: Thursday, 1 May 2014, 4:00 pm to 5:00 pm

Where: 303.310

Conservation management often focuses on counteracting the adverse effects of human activities on threatened populations. However, conservation measures may unintentionally relax selection by allowing the 'survival of the not-so-fit', increasing the risk of fixation of maladaptive traits. Here, we report such a case in the critically-endangered Chatham Island black robin (Petroica traversi) which, in 1980, was reduced to a single breeding pair. Following this bottleneck, some females were observed to lay eggs on the rims of their nests. Rim eggs left in place always failed to hatch. To expedite population recovery, rim eggs were repositioned inside nests, yielding viable hatchlings. Repositioning resulted in rapid growth of the black robin population, but by 1989 over 50% of all females were laying rim eggs. We used an exceptional, species-wide pedigree to consider both recessive and dominant models of inheritance over all plausible founder genotype combinations at a biallelic and possibly sex-linked locus. The pattern of rim laying is best fitted as an autosomal dominant Mendelian trait. Using a phenotype permutation test we could also reject the null hypothesis of non-heritability for this trait in favour of our best-fitting model of heritability. Data collected after intervention ceased shows that the frequency of rim laying has strongly declined, and that this trait is maladaptive. This episode yields an important lesson for conservation biology: fixation of maladaptive traits could render small threatened populations completely dependent on humans for reproduction, irreversibly compromising the long term viability of populations humanity seeks to conserve.

The methodological part of the talk will outline a sequential Monte Carlo (SMC) algorithm over an increasing sequence of the black robin pedigree to obtain the likelihood of the observed female phenotypes under a given model of inheritance, founder genotypes and the observed population pedigree. This algorithm is a novel adaptation of the SMC-based rare event simulation of Del Moral and Doucet to nested sequences of combinatorial pedigree structures.

If time permits, we will look at the problem of proposing a new class of population pedigree models that can integrate field observations in behavioural ecology with classical theories in evolutionary population genetics.

This is joint work with Melanie Massaro, Don Merton, James V. Briskie, Anthony M. Poole and Marie L. Hale.

**Asymptotics for survey data**

Speaker: Leshun Xu

Affiliation: U. Auckland

When: Wednesday, 30 April 2014, 4:00 pm to 5:00 pm

Where: 303.B07

PhD provisional talk

Central limit theorems (CLTs) play a fundamental role in research on asymptotic properties. They provide theoretical guidance for estimators in practical applications. In this talk, we will briefly present how CLTs work in two inference frameworks, design-based and model-based. Under the model-based framework, we will concisely introduce two existing works and outline our planned further research, which will involve a CLT for a finite population generated by sampling from a random field.

**Bridging physiological and evolutionary time scales in a gene regulatory network**

Speaker: Dr. Matthieu Vignes

Affiliation: Massey University, and INRA - Toulouse

When: Tuesday, 15 April 2014, 4:00 pm to 5:00 pm

Where: 303.310

Both physiological response and evolutionary adaptation modify the phenotype, but they act at different time scales. Because gene regulatory networks (GRN) govern phenotypic adaptations, they reflect the trade-offs between these different forces. To identify patterns of molecular function and genetic diversity in GRNs, we studied the drought response of the common sunflower, Helianthus annuus, and how the underlying GRN has influenced its evolution. We examined the responses of 32,423 expressed sequences to drought and to the hormone abscisic acid and selected 145 co-expressed transcripts. We characterized their regulatory relationships in nine kinetic studies based on different hormones. From this, we inferred a GRN by meta-analyses of a Gaussian Graphical model and a Random Forest algorithm and studied the genetic diversity of its nodes.

Two main hubs in the network were identified as nitrate transporters in guard cells. This suggests that this function is key in the sunflower's physiological response to drought. Among Helianthus populations, we observed that more highly connected nodes in the GRN had lower genetic diversity. The systems biology approach combined molecular data at different time scales and identified important physiological processes. At the evolutionary level, we propose that network topology constrained adaptation to dry environments and thus speciation.

This is joint work between Marchand, G.; Huynh-Thu, V.-A.; Kane, N.; Arribat, S.; Vares, D.; Rengel, D.; Balzergue, S.; Rieseberg, L.; Vincourt, P.; Geurts, P.; Vignes, M. and Langlade, N. (affiliations are complex: INRA Toulouse, INRA Evry, Liege university, University of Colorado and University of British Columbia). My main roles were to discuss and propose a methodology and to implement GRN inference, and I was also part of the network evolution analysis. I can also explain some of the biological content, although I am afraid I would be very limited in answering specific questions... This would really be a biological story related by a statistician... who can't resist talking about the models he used!

http://carlit.toulouse.inra.fr/wikiz/index.php/Matthieu_VIGNES

**May the gravitational force be with us: using gravitational lensing to find missing galaxies.**

Speaker: David Huijser

Affiliation: U. Auckland

When: Wednesday, 9 April 2014, 10:00 am to 11:00 am

Where: 303.310

**PhD provisional talk**

According to the theory of hierarchical galaxy formation, mergers play an important role in the formation and evolution of galaxies.

Theory predicts that a vast number of small satellites, consisting mostly of dark matter, exist around massive galaxies. However, so far only a few have been observed. The discovery of approximately 30 satellite galaxies around our closest neighbouring galaxy (Andromeda) gives us reason to believe that they exist, but direct observation would not allow us to find satellites that are totally dark.

Observing these dark galaxies is challenging, because they can only be detected indirectly (e.g. through gravitational lensing).

Gravitational lensing uses the fact that the gravitational field of a galaxy deflects light on its path toward us, creating highly distorted images. From these distorted images, a matter distribution can be deduced.

In order to do this, a model with a large number of parameters must be fitted to the image, which involves a vast amount of computation, or some simplifying assumptions.

Because of the complexity of the problem, it is impossible to evaluate the model for all possible parameter values, so we will apply MCMC methods to find the posterior distribution. We will use a combination of different methods: Slice Sampling, Nested Sampling and Reversible Jump MCMC, each of which brings its own virtues to help us solve this problem.

**Pickands polynomial functions**

Speaker: Prof. Francois Perron

Affiliation: U. Montreal

When: Thursday, 3 April 2014, 4:00 pm to 5:00 pm

Where: 303.310

In the theory of bivariate extreme value distributions the joint distribution can be expressed in terms of the marginal distributions and the copula. The copula depends on a special function called the Pickands dependence function. This function is convex and lives in a triangle. In this talk we will find all of the Pickands functions that are polynomials. We will derive approximation results. We will discuss the problem of finding the maximum likelihood estimator (MLE). We will compare the MLE with other nonparametric approaches.

**What in the heavens? and Improved Models for Stutter Prediction in Forensic DNA Analysis**

Speaker: Matt Edwards and Sampath Fernando

Affiliation: U. Auckland

When: Thursday, 27 March 2014, 4:00 pm to 5:00 pm

Where: 303.310

**PhD provisional talk**

Albert Einstein's general theory of relativity predicts the existence of gravitational waves - ripples in the fabric of space-time caused by asymmetries in catastrophic and highly accelerated events throughout the cosmos. Important astrophysical information travels through the universe via gravitational waves, and direct detection of these minuscule distortions will open a new window through which to view the universe. This will have profound consequences in astrophysics and cosmology. Using the latest numerical simulations of rotating core collapse supernovae, we present a Bayesian framework to extract the physical information encoded in noisy gravitational wave signals.

**PhD provisional talk**

Analysing the effect of stutter products is essential in forensic DNA analysis. Therefore, proper selection of statistical models for the interpretation of stutter is vital. Bright et al. (2013) studied five statistical models for the behaviour of the stutter ratio (SR). These were two log-normal models and two gamma models on SR and a two-component normal mixture model on log(SR). Each model's performance was assessed by calculating the log-likelihood, and the Akaike information criterion (AIC) was used for comparisons. As both the log-normal and gamma distributions are positively skewed, testing the appropriateness of symmetric distributions would be valuable. This study tests normal and Student's t models with both common variance and locus-specific variance. A two-component normal mixture model and a two-component t mixture model were also tested on SR. The six proposed models and the five existing models were studied with the same Identifiler and NGM SElect DNA profiles. In this talk, I will describe the statistical methodology that we adopted to improve the models developed by Bright and others. I will also compare the performance of the proposed and existing models using AIC and present the interesting findings related to the improvements in the new models. I will also outline some future research directions for my PhD.

**Great tit genomics - gene mapping goes wild**

Speaker: Dr. Anna Santure

Affiliation: U. Auckland

When: Thursday, 20 March 2014, 4:00 pm to 5:00 pm

Where: 303.310

Great tits (Parus major) are a model wild study system in evolutionary biology, and their long term study has enabled the examination of areas as diverse as the evolution of life history traits in response to climate change and the heritability of personality. However, identifying the genomic regions underlying these traits has been restricted by a lack of genetic resources. Great tits have been monitored in Wytham Woods, Oxfordshire since the 1940s, with extensive morphological, life history and pedigree information and blood samples available, providing tremendous opportunity to locate the regions of the genome contributing to variation in the morphological and life history traits. I will talk about the development of genomic resources for the great tit, and present some exciting results using three different marker-based statistical approaches (partitioning of additive genetic variance across genomic regions, pedigree-based quantitative trait locus mapping and genome wide association scans) to investigate the genetic basis of heritable morphological and life history traits in the Wytham population. For the majority of these traits, there is little evidence of genes of major effect, supporting the hypothesis that variation in key traits in natural populations is likely to be determined by many loci of small effect spread throughout the genome, which are subject to continued input of variation by mutation and migration.

**Importance sampling schemes for evidence approximation in mixture models**

Speaker: Kate Lee

Affiliation: AUT

When: Friday, 14 March 2014, 4:00 pm to 5:00 pm

Where: 303.B09

The marginal likelihood is a central tool for drawing Bayesian inference about the number of components in mixture models. It is often approximated, since the exact form is unavailable, and the approximation may be biased because of incomplete exploration by a simulated Markov chain, also known as a lack of label switching. Borrowing an existing idea of averaging over all permutations, we propose dual importance sampling using a Rao-Blackwellised importance function. To increase the computational efficiency, a preliminary step is proposed to avoid permutations which do not contribute to the Rao-Blackwell estimate and so reduce the number of relevant permutations considered in the averaging.

**Web-Based Graphics with WeBIPP - Bridging the GUI-Code Chasm**

Speaker: Jimmy Oh

Affiliation: U. Auckland

When: Thursday, 20 February 2014, 4:00 pm to 5:00 pm

Where: 303.B11

There currently exist two broad options when trying to produce a Web-Based Statistical Graphic. The easier option is to choose a tool like Tableau, a GUI-based tool that is relatively easy to use. The price of easy entry? Limited flexibility and strangled creativity. But the alternative is hardly appealing. Learning a programming language and coding your original graphic, from scratch.

WeBIPP is a tool that bridges this divide. Possessing a GUI front-end that is relatively easy to use, WeBIPP allows the user to quickly build new graphics from scratch with a few clicks of the mouse. More, the underlying mechanism for WeBIPP ensures that the GUI front-end does not limit any programming input, enabling any programming knowledge in the relevant fields (HTML, CSS, SVG, JS, D3js) to easily be employed to further enhance the graphic.

The talk will elaborate on the conceptual basis of WeBIPP, and its implications, with an analogy that involves cookies. It will also include a demonstration of the WeBIPP prototype, constructing a population pyramid for NZ Census data, from scratch, during the talk.

**Efficient Estimation of Some Elliptical Copula Regression Models through Scale Mixtures of Normal**

Speaker: Nuttanan Wichitaksorn

Affiliation: U. Canterbury

When: Thursday, 13 February 2014, 4:00 pm to 5:00 pm

Where: 303.B09

One major problem in implementing elliptical copulas is correlation matrix estimation, especially in high-dimensional cases. We propose the efficient estimation of some elliptical copula regression models by expressing both copula and marginal densities as scale mixtures of normals and standardizing conditional normal random variables. Because copulas are invariant to the location and scale of the marginal distributions, and all elliptical distributions have the same correlation structure (some of them can be expressed as scale mixtures of multivariate normals), we can then estimate their correlation matrix with ease. Two simulation studies show favorable performance of our proposed models and methods. We also conduct two empirical studies on U.S. excess returns and Thai wage earnings.

**Auckland Mini-workshop in probability**

When: Tuesday, 11 February 2014, 10:00 am to 11:00 am

Where: 303.B11

Title: The components of the Brownian complement are sort of round.

Speaker: Tom Salisbury (York University)

Abstract: Run a 2-dimensional Brownian motion for unit time, and then excise its path from the plane. This leaves an open set, with infinitely many components $C_k$. We're interested in the question of their shape, specifically whether they are long and thin or reasonably round. In other words, are their areas $A_k$ comparable to the square of their diameters $d_k$? We will show that they are indeed round, in the sense that for any exponent $z$, we have $\sum A_k^z<\infty$ if and only if $\sum d_k^{2z}<\infty$. This is joint work with Yuval Peres and Serban Nacu.

Title: Extinction Times in the Stochastic Logistic Epidemic

Speaker: Malwina Luczak (QMU London)

Abstract: The stochastic logistic process is a well-known birth-and-death process, often used to model the spread of an epidemic within a population of size $N$. We survey some of the known results about the time to extinction for this model. Our focus is on new results for the "subcritical" regime, where the recovery rate exceeds the infection rate by more than $N^{-1/2}$, and the epidemic dies out rapidly with high probability. Previously, only a first-order estimate for the mean extinction time of the epidemic was known, even in the case where the recovery rate and infection rate are fixed constants: we derive precise asymptotics for the distribution of the extinction time throughout the subcritical regime. In proving our results, we illustrate a technique for approximating certain types of Markov chain by differential equations over long time periods.

This is joint work with Graham Brightwell.
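For readers unfamiliar with the model, the extinction time can be simulated directly. The sketch below assumes the usual parametrisation of the stochastic logistic (SIS) process, in which, with $X$ infected individuals, infections occur at rate $\lambda X (N - X)/N$ and recoveries at rate $\mu X$; the abstract does not state the rates, so this form, the function name, and the parameter names are assumptions. It requires `recovery_rate > 0`.

```python
import random


def extinction_time(n, infection_rate, recovery_rate, initial_infected, rng=None):
    """Simulate one path of the stochastic logistic process until no one is infected.

    Assumed rates with x infected: up at infection_rate * x * (n - x) / n,
    down at recovery_rate * x. Returns the time at which x first hits 0.
    """
    rng = rng or random.Random()
    x, t = initial_infected, 0.0
    while x > 0:
        up = infection_rate * x * (n - x) / n
        down = recovery_rate * x
        total = up + down
        t += rng.expovariate(total)        # exponential holding time in state x
        if rng.random() * total < up:      # choose which transition fires
            x += 1
        else:
            x -= 1
    return t


if __name__ == "__main__":
    rng = random.Random(0)
    # Subcritical example: recovery rate exceeds infection rate.
    times = [extinction_time(100, 1.0, 2.0, 10, rng=rng) for _ in range(1000)]
    print(sum(times) / len(times))
```

Averaging over many simulated paths, as in the usage lines, gives an empirical mean extinction time that can be compared with asymptotic estimates.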

**Exploring the Sampling Universe of RNA-seq**

Speaker: Dr. Arndt von Haeseler

Affiliation: Center for Integrative Bioinformatics Vienna

When: Thursday, 30 January 2014, 4:00 pm to 5:00 pm

Where: 303.B11

How deep is deep enough? While RNA-sequencing represents a well-established technology, the required sequencing depth for detecting all expressed genes is not known. If we leave the entire biological overhead and meta-information behind, we are dealing with a classical sampling process. Such sampling processes are well known from population genetics and thoroughly investigated. Here we use the Pitman Sampling Formula to model the sampling process of RNA-sequencing. By doing so we characterize the sampling by means of two parameters which capture the conglomerate of different sequencing technologies, protocols and their associated biases. We distinguish between two levels of sampling: the number of reads per gene and the number of reads starting at each position of a specific gene. The latter approach allows us to evaluate the theoretical expectation of uniform coverage and the performance of sequencing protocols in that respect. Most importantly, given a pilot sequencing experiment we provide an estimate for the size of the underlying sampling universe and, based on these findings, evaluate an estimator for the number of newly detected genes when sequencing an additional sample of arbitrary size.