Angling for the best models

13 September 2018
Dr Thomas Yee, Department of Statistics

Scan down Thomas Yee’s list of publications, and you notice something fishy: titles like “Analysis of the 2008 World Fly Fishing Championships” (2014) and “VGLMs and VGAMs: An overview for applications in fisheries research” (2010).

Yes, Thomas, a senior lecturer in the Department of Statistics, is a keen fly fisherman, and he has been able to marry his statistical nous with his hobby.

As a PhD student in the Department of Statistics in the early 1990s, he “dabbled” in trout fishing. But his skills really took off after a chance meeting with a teenager at the Coromandel who had grown up on the bank of a “trout-infested” stream.

“He guided me for a few hours, and I learned more about fly fishing in the first two hours than in the previous two years!” Thomas recalls. Since then, he’s been reeling in the fish. No, he won’t say where, except that it’s in ... the North Island.

When not chasing trout, Thomas works on nonparametric regression and generalized linear models, with particular applications to medical studies and ecology.

Regression is not the angler’s reaction when a big one gets away; rather, it’s a technique that statisticians use a great deal. It usually involves some response variable, often called y, and a set of explanatory variables, called x. The aim is to model y with x using a statistical model.

“So, y might be whether somebody will develop lung cancer or not (yes or no) and the vector x might contain variables such as age, sex and whether they smoke or not,” Thomas explains.

Nonparametric regression and generalized linear models are different topics in regression. The first refers to a technique called smoothing, which allows one to fit curves to the data rather than insisting on a straight line whether or not the underlying relationship is actually linear.

“In a nutshell, nonparametric regression allows the data to speak for themselves, which is very important in exploratory data analysis.”  
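To make the idea of smoothing concrete, here is a minimal sketch in Python. It is only an illustration and not Thomas's own method: the data are made up, and the kernel smoother shown (a Nadaraya-Watson estimator with a Gaussian kernel and an arbitrarily chosen bandwidth) is just one of many possible smoothers. It compares a straight-line fit with a smooth curve on data that follow a parabola.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares straight line; returns a prediction function."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def kernel_smoother(xs, ys, bandwidth):
    """Nadaraya-Watson smoother: a locally weighted average of the responses."""
    def predict(x0):
        # Gaussian weights: observations near x0 count the most.
        weights = [math.exp(-0.5 * ((x - x0) / bandwidth) ** 2) for x in xs]
        return sum(w * y for w, y in zip(weights, ys)) / sum(weights)
    return predict

def mean_squared_error(predict, xs, ys):
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Made-up data (an assumption for illustration): y is a parabola in x.
xs = [i / 10 for i in range(-20, 21)]
ys = [x * x for x in xs]

line = fit_line(xs, ys)
smooth = kernel_smoother(xs, ys, bandwidth=0.3)  # bandwidth chosen by eye

line_mse = mean_squared_error(line, xs, ys)
smooth_mse = mean_squared_error(smooth, xs, ys)
print("straight line MSE:", line_mse)
print("kernel smoother MSE:", smooth_mse)
```

On these data the straight line can do no better than a flat line through the average, while the smoother follows the curve. The bandwidth controls how wiggly the fitted curve is allowed to be.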

Generalized linear models, or GLMs, unify about half a dozen very important statistical models; in the past, each model was treated separately. For example, it was discovered that the linear model, logistic regression and Poisson regression could all be fitted using one algorithm, and that there was an overarching theory for them.

“These three types of models allow one to handle continuous, binary and count responses respectively,” says Thomas, “so they are extremely useful.”
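The "one algorithm" idea can be sketched in a few lines of Python. The sketch below assumes the algorithm in question is iteratively reweighted least squares (IRLS), the standard way of fitting GLMs; the article itself does not name it. The function `fit_glm`, the `FAMILIES` table and the tiny data sets are all hypothetical and made up for illustration, with a single explanatory variable and each family's canonical link.

```python
import math

# Each family supplies its link, inverse link, variance function and a
# starting value for the mean. Only these ingredients differ between models.
FAMILIES = {
    "gaussian": {  # linear model, continuous response
        "link": lambda mu: mu,
        "inverse": lambda eta: eta,
        "variance": lambda mu: 1.0,
        "start": lambda y: y,
    },
    "binomial": {  # logistic regression, binary response
        "link": lambda mu: math.log(mu / (1.0 - mu)),
        "inverse": lambda eta: 1.0 / (1.0 + math.exp(-eta)),
        "variance": lambda mu: mu * (1.0 - mu),
        "start": lambda y: (y + 0.5) / 2.0,
    },
    "poisson": {  # Poisson regression, count response
        "link": math.log,
        "inverse": math.exp,
        "variance": lambda mu: mu,
        "start": lambda y: y + 0.1,
    },
}

def fit_glm(xs, ys, family, max_iter=100, tol=1e-10):
    """Fit intercept b0 and slope b1 by IRLS (canonical links only)."""
    fam = FAMILIES[family]
    mu = [fam["start"](y) for y in ys]
    eta = [fam["link"](m) for m in mu]
    b0 = b1 = 0.0
    for _ in range(max_iter):
        # With a canonical link, the working weight equals the variance.
        w = [fam["variance"](m) for m in mu]
        # Working response: the linear predictor plus a scaled residual.
        z = [e + (y - m) / wi for e, y, m, wi in zip(eta, ys, mu, w)]
        # Weighted least squares of z on x, solved in closed form.
        sw = sum(w)
        swx = sum(wi * x for wi, x in zip(w, xs))
        swz = sum(wi * zi for wi, zi in zip(w, z))
        swxx = sum(wi * x * x for wi, x in zip(w, xs))
        swxz = sum(wi * x * zi for wi, x, zi in zip(w, xs, z))
        new_b1 = (sw * swxz - swx * swz) / (sw * swxx - swx ** 2)
        new_b0 = (swz - new_b1 * swx) / sw
        converged = abs(new_b0 - b0) + abs(new_b1 - b1) < tol
        b0, b1 = new_b0, new_b1
        eta = [b0 + b1 * x for x in xs]
        mu = [fam["inverse"](e) for e in eta]
        if converged:
            break
    return b0, b1

# Made-up data: the same function handles all three kinds of response.
linear_fit = fit_glm([0, 1, 2, 3, 4], [1, 3, 5, 7, 9], "gaussian")
logistic_fit = fit_glm([0, 1, 2, 3, 4, 5], [0, 0, 1, 0, 1, 1], "binomial")
poisson_fit = fit_glm([0, 1, 2, 3, 4], [1, 2, 4, 8, 16], "poisson")
print(linear_fit, logistic_fit, poisson_fit)
```

The loop is identical for every model; switching from a continuous to a binary or count response only changes which entry of the table is used. Real software, such as R's `glm` function, handles many explanatory variables and other details that this sketch leaves out.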

Like many of his colleagues in the Department of Statistics, Thomas is adept in statistical computing, R in particular; he is a firm believer that any new statistical methodology is best implemented in software. One of his R packages, VGAM (the acronym stands for vector generalized additive models), is used worldwide in fields as diverse as psychology, ecology, language studies, and climate-change detection.

And fishing. In 2008, the World Fly Fishing Championships took place in New Zealand, and Thomas volunteered as a controller. He enjoyed watching some of the world’s best fly fishers in action, and organisers agreed to give him competition data.

“I started to analyse it in R and found that some regressions could answer some natural questions, such as how depleted the beats became after being hammered by successive fishing sessions, and how large a role chance played relative to the competitors’ skill.”

The resulting 2014 paper is, he thinks, the first paper on fly fishing to appear in an academic journal, but he is happy to be corrected.

There may be another paper on the way – Thomas has been talking with fellow fly fisher Len Cook, a former chief Government statistician in both this country and the UK, about a new fly fishing statistics project.

In the meantime, Thomas is teaching two papers this semester – statistical computing (STATS 782) and data mining (STATS 784).

Looking for a research project? Thomas has several available for honours and masters students: bias reduction, expected information matrices, analysis of fly fishing data, the multinomial logit model, and q-type functions.