Statistics
Te Tari Pāngarau me te Tatauranga
Department of Mathematics & Statistics

Archived seminars in Statistics

Seminars 101 to 150
Fast computation of spatially adaptive kernel smooths

Tilman Davies

Department of Mathematics and Statistics

Date: Thursday 9 March 2017

Kernel smoothing of spatial point data can often be improved using an adaptive, spatially-varying bandwidth instead of a fixed bandwidth. However, computation with a varying bandwidth is much more demanding, especially when edge correction and bandwidth selection are involved. We propose several new computational methods for adaptive kernel estimation from spatial point pattern data. A key idea is that a variable-bandwidth kernel estimator for d-dimensional spatial data can be represented as a slice of a fixed-bandwidth kernel estimator in (d+1)-dimensional "scale space", enabling fast computation using discrete Fourier transforms. Edge correction factors have a similar representation. Different values of global bandwidth correspond to different slices of the scale space, so that bandwidth selection is greatly accelerated. Potential applications include estimation of multivariate probability density and spatial or spatiotemporal point process intensity, relative risk, and regression functions. The new methods perform well in simulations and real applications.
Joint work with Professor Adrian Baddeley, Curtin University, Perth.
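As a rough illustration of the "scale space" idea described above, the sketch below builds a stack of fixed-bandwidth kernel estimates on a grid and assembles an adaptive estimate by picking, at each pixel, the slice whose global bandwidth matches a local (pilot-based) bandwidth. The data, the Abramson-style bandwidth rule and the grid settings are all made up for illustration; this is not the authors' FFT-based implementation or its edge correction.

```python
# Conceptual sketch only: adaptive kernel estimate as pixel-wise slices of a
# stack of fixed-bandwidth estimates ("scale space").  Hypothetical data and
# bandwidth rule; not the method's actual implementation.
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(1)
pts = rng.normal(size=(500, 2))                      # toy point pattern
grid_n, lim = 128, 4.0
counts, _, _ = np.histogram2d(pts[:, 0], pts[:, 1],
                              bins=grid_n, range=[[-lim, lim], [-lim, lim]])
cell = 2 * lim / grid_n

bandwidths = np.linspace(0.2, 1.0, 9)                # grid of global bandwidths
stack = np.array([gaussian_filter(counts, sigma=h / cell) / (len(pts) * cell**2)
                  for h in bandwidths])              # fixed-bandwidth slices

# Abramson-style local bandwidths derived from a pilot estimate (one slice)
pilot = stack[len(bandwidths) // 2] + 1e-12
local_h = 0.5 * np.sqrt(pilot.mean() / pilot)        # larger bandwidth where pilot is small
local_h = np.clip(local_h, bandwidths[0], bandwidths[-1])

# pick, pixel by pixel, the slice closest to the local bandwidth
idx = np.abs(bandwidths[:, None, None] - local_h[None]).argmin(axis=0)
adaptive = np.take_along_axis(stack, idx[None], axis=0)[0]
print(adaptive.shape, adaptive.sum() * cell**2)      # roughly integrates to 1
```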
170110164649
Detection and replenishment of missing data in the observation of point processes with independent marks

Jiancang Zhuang

Institute of Statistical Mathematics, Tokyo

Date: Thursday 2 March 2017

Records of geophysical events such as earthquakes and volcanic eruptions, usually modelled as marked point processes, often have missing data that result in underestimation of the corresponding hazards. This study presents a fast approach for replenishing missing data in the record of a temporal point process with time-independent marks. The basis of the method is that, if such a point process is completely observed, it can be transformed into a homogeneous Poisson process on the unit square $[0,1]^2$ by a biscale empirical transformation. The method is tested on a synthetic dataset and applied to the record of volcanic eruptions at Hakone Volcano, Japan, and to several aftershock sequences following large earthquakes. In particular, by comparing the analyses of the original and replenished aftershock datasets, we find that both the Omori-Utsu formula and the ETAS model are stable, and that the variation in estimated parameters across magnitude thresholds reported in past studies is caused by short-term missing of small events.
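One simple reading of the biscale empirical transformation is sketched below: each event's time and mark are passed through their empirical distribution functions, so that a completely observed process with time-independent marks maps to points that are roughly uniform on the unit square, and systematic gaps flag missing data. The toy catalogue and the uniformity check are illustrative assumptions, not the speaker's actual procedure.

```python
# Minimal sketch of a biscale empirical transformation on a toy catalogue.
import numpy as np
from scipy.stats import rankdata

rng = np.random.default_rng(0)
times = np.sort(rng.uniform(0, 1000, size=2000))      # toy event times
marks = rng.exponential(scale=0.5, size=2000) + 4.0   # toy magnitudes

u = rankdata(times) / (len(times) + 1)                # empirical CDF of times
v = rankdata(marks) / (len(marks) + 1)                # empirical CDF of marks

# crude check of uniformity on [0,1]^2: compare cell counts with expectation
H, _, _ = np.histogram2d(u, v, bins=10, range=[[0, 1], [0, 1]])
expected = len(times) / 100
print("max relative deviation from uniformity:",
      np.abs(H - expected).max() / expected)
```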
170110164432
A new multidimensional stress release statistical model based on coseismic stress transfer

Shiyong Zhou

Peking University

Date: Tuesday 14 February 2017

NOTE venue is not our usual
Following the stress release model (SRM) proposed by Vere-Jones (1978), we developed a new multidimensional SRM, a space-time-magnitude version based on multidimensional point processes. First, we interpreted the exponential hazard functional of the SRM as the mathematical expression of static fatigue failure caused by stress corrosion. Then, we reconstructed the SRM in multiple dimensions by incorporating four independent submodels: the magnitude distribution function, the space weighting function, the loading rate function and the coseismic stress transfer model. Finally, we applied the new model to historical earthquake catalogues from North China. An expanded catalogue, containing origin time, epicentre, magnitude, strike, dip angle, rupture length, rupture width and average dislocation, was compiled for the new model. The estimated model can simulate the variation of seismicity with space, time and magnitude. Compared with previous SRMs fitted to the same data, the new model yields much smaller values of the Akaike information criterion and the corrected Akaike information criterion. We also compared the predicted earthquake rates at the epicentres just before the related earthquakes with the mean spatial seismic rate: among the 37 earthquakes in the expanded catalogue, the epicentres of 21 lie in regions of higher predicted rate.
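For orientation, the classical temporal stress release model that the talk generalises has a simple conditional intensity: stress accumulates linearly and drops at each event, and the hazard is exponential in the stress level. The sketch below uses illustrative parameter values and a common stress-drop proxy (10^(0.75 m)); it is not the multidimensional model presented here.

```python
# Classical (temporal) stress release model: lambda(t) = exp(mu + nu * X(t)),
# X(t) = rho*t - sum of stress drops of past events.  Illustrative values only.
import numpy as np

def srm_intensity(t, event_times, event_mags, mu=-2.0, nu=0.05, rho=1.0):
    """Conditional intensity of the simple temporal SRM at time t."""
    past = event_times < t
    stress_drop = np.sum(10 ** (0.75 * event_mags[past]))   # assumed stress-drop proxy
    X_t = rho * t - stress_drop
    return np.exp(mu + nu * X_t)

times = np.array([3.0, 10.5, 22.0])     # toy catalogue (years)
mags = np.array([6.1, 5.8, 6.5])
print([round(srm_intensity(t, times, mags), 4) for t in (5.0, 15.0, 30.0)])
```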
170110163941
Next generation ABO blood type genetics and genomics

Keolu Fox

University of California San Diego

Date: Wednesday 1 February 2017

The ABO gene encodes a glycosyltransferase, which adds sugars (N-acetylgalactosamine for A and α-D-galactose for B) to the H antigen substrate. Single nucleotide variants in the ABO gene affect the function of this glycosyltransferase at the molecular level by altering the specificity and efficiency of this enzyme for these specific sugars. Characterizing variation in ABO is important in transfusion and transplantation medicine because variants in ABO have significant consequences with regard to recipient compatibility. Additionally, variation in the ABO gene has been associated with cardiovascular disease risk (e.g., myocardial infarction) and quantitative blood traits (von Willebrand factor (VWF), Factor VIII (FVIII) and Intercellular Adhesion Molecule 1 (ICAM-1)). Relating ABO genotypes to actual blood antigen phenotype requires the analysis of haplotypes. Here we will explore variation (single nucleotide variants, insertions and deletions, and structural variation) in blood cell trait gene loci (ABO) using multiple datasets enriched for heart, lung and blood-related diseases (including both African-Americans and European-Americans) from multiple NGS datasets (e.g. the NHLBI Exome Sequencing Project (ESP) dataset). I will also describe the use of a new ABO haplotyping method, ABO-seq, to increase the accuracy of ABO blood type and subtype calling using variation in multiple NGS datasets. Finally, I will describe the use of multiple read-depth based approaches to discover previously unsuspected structural variation (SV) in genes not previously shown to harbor SV, such as the ABO gene, by focusing on understudied populations, including individuals of Hispanic and African ancestry.

Keolu has a strong background in using genomic technologies to understand human variation and disease. Throughout his career he has made it his priority to focus on the interface of minority health and genomic technologies. Keolu earned a Ph.D. in Debbie Nickerson's lab in the University of Washington's Department of Genome Sciences (August 2016). In collaboration with experts at Bloodworks Northwest (Seattle, WA), he focused on the application of next-generation genome sequencing to increase compatibility for blood transfusion therapy and organ transplantation. Currently Keolu is a postdoc in Alan Saltiel's lab at the University of California San Diego (UCSD) School of Medicine, Division of Endocrinology and Metabolism and the Institute for Diabetes and Metabolic Health. His current project focuses on using genome editing technologies to investigate the molecular events involved in chronic inflammatory states resulting in obesity and catecholamine resistance.
170125161950
To be or not to be (Bayesian) Non-Parametric: A tale about Stochastic Processes

Roy Costilla

Victoria University of Wellington

Date: Tuesday 24 January 2017

Thanks to advances in theory and computation over the last decades, Bayesian Non-Parametric (BNP) models are now used in many fields including Biostatistics, Bioinformatics, Machine Learning, Linguistics and many others.

Despite their name, however, BNP models are actually massively parametric. A parametric model uses a function with a finite-dimensional parameter vector as a prior; Bayesian inference then proceeds to approximate the posterior of these parameters given the observed data. In contrast, a BNP model is defined on an infinite-dimensional probability space through the use of a stochastic process as a prior. In other words, the prior for a BNP model is a space of functions with an infinite-dimensional parameter vector. Rather than avoiding parametric forms, BNP inference uses a large number of them to gain flexibility.

To illustrate this, we present simulations and a case study of life satisfaction in NZ over 2009-2013. We estimate the models using a finite Dirichlet Process Mixture (DPM) prior. We show that this BNP model is tractable, i.e. easily computed using Markov chain Monte Carlo (MCMC) methods, allowing us to handle data with large sample sizes and to estimate the model parameters correctly. Coupled with a post-hoc clustering of the DPM locations, the BNP model also allows an approximation of the number of mixture components, a very important parameter in mixture modelling.
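The stick-breaking construction behind a (truncated) Dirichlet Process Mixture prior can be sketched in a few lines: component weights come from breaking a unit stick with Beta(1, alpha) proportions, and each component draws its own parameters from a base measure. The concentration, truncation level and normal components below are illustrative assumptions, not the model fitted in the talk.

```python
# Stick-breaking sketch of a truncated DPM prior and data simulated from it.
import numpy as np

rng = np.random.default_rng(42)
alpha, K = 1.5, 20                       # concentration, truncation level (assumed)

betas = rng.beta(1.0, alpha, size=K)
remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
w = betas * remaining                    # stick-breaking weights
w /= w.sum()                             # renormalise after truncation

mu = rng.normal(0.0, 3.0, size=K)        # component means drawn from a base measure
z = rng.choice(K, size=1000, p=w)        # component labels
y = rng.normal(mu[z], 1.0)               # simulated responses
print("non-negligible components:", (np.bincount(z, minlength=K) > 10).sum())
```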
170116145247
Computational methods and statistical modelling in the analysis of co-occurrences: where are we now?

Jorge Navarro Alberto

Universidad Autónoma de Yucatán (UADY)

Date: Wednesday 9 November 2016

NOTE day and time of this seminar
The subject of this talk is statistical methods (both theoretical and applied) and computational algorithms for the analysis of binary data, as applied in ecology to the study of species composition in systems of patches, with the ultimate goal of uncovering ecological patterns. As a starting point, I review Gotelli and Ulrich's (2012) six statistical challenges in null model analysis in ecology. I then illustrate the most recent research carried out by me and other statisticians and ecologists to address those challenges, and applications of the algorithms outside the biological sciences. Several topics of research are proposed, seeking to motivate statisticians and computer scientists to venture into and, eventually, specialize in the analysis of co-occurrences.
Reference: Gotelli, N.J. and Ulrich, W. (2012). Statistical challenges in null model analysis. Oikos 121: 171-180.
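As a concrete example of the kind of null-model analysis discussed above, the sketch below computes the C-score of a toy presence-absence matrix and compares it with a null distribution generated by checkerboard swaps that preserve row and column totals. The matrix, swap counts and test statistic are illustrative; the specific algorithms covered in the talk may differ.

```python
# Null-model sketch: C-score with a fixed-fixed (checkerboard swap) null.
import numpy as np
from itertools import combinations

def c_score(M):
    """Mean number of checkerboard units over all pairs of species (rows)."""
    units = []
    for i, j in combinations(range(M.shape[0]), 2):
        ri, rj = M[i], M[j]
        shared = np.sum(ri & rj)
        units.append((ri.sum() - shared) * (rj.sum() - shared))
    return np.mean(units)

def swap_randomise(M, n_swaps=2000, rng=None):
    """Randomise M by 2x2 checkerboard swaps, preserving row/column totals."""
    rng = rng or np.random.default_rng()
    M = M.copy()
    for _ in range(n_swaps):
        r = rng.choice(M.shape[0], 2, replace=False)
        c = rng.choice(M.shape[1], 2, replace=False)
        sub = M[np.ix_(r, c)]
        if sub[0, 0] == sub[1, 1] and sub[0, 1] == sub[1, 0] and sub[0, 0] != sub[0, 1]:
            M[np.ix_(r, c)] = sub[::-1]          # swap the checkerboard
    return M

rng = np.random.default_rng(7)
obs = (rng.random((12, 20)) < 0.35).astype(int)     # toy species x sites matrix
null = [c_score(swap_randomise(obs, rng=rng)) for _ in range(100)]
p = (np.sum(np.array(null) >= c_score(obs)) + 1) / (len(null) + 1)
print("observed C-score:", round(c_score(obs), 2), " p-value:", round(p, 3))
```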
161101160727
Extensions of the multiset sampler

Scotland Leman

Virginia Tech, USA

Date: Tuesday 8 November 2016

NOTE day and time of this seminar
In this talk I will primarily discuss the Multiset Sampler (MSS): a general ensemble based Markov Chain Monte Carlo (MCMC) method for sampling from complicated stochastic models. After which, I will briefly introduce the audience to my interactive visual analytics based research.

Proposal distributions for complex structures are essential for virtually all MCMC sampling methods. However, such proposal distributions are difficult to construct so that their probability distribution matches that of the true target distribution, which in turn hampers the efficiency of the overall MCMC scheme. The MSS entails sampling from an augmented distribution that has more desirable mixing properties than the original target model, while utilizing simple independent proposal distributions that are easily tuned. I will discuss applications of the MSS for sampling from tree-based models (e.g. Bayesian CART; phylogenetic models), and for general model selection, model averaging and predictive sampling.

In the final 10 minutes of the presentation I will discuss my research interests in interactive visual analytics and the Visual To Parametric Interaction (V2PI) paradigm. I'll discuss the general concepts in V2PI with an application to Multidimensional Scaling, its technical merits, and the integration of such concepts into core statistics undergraduate and graduate programs.
161011102333
New methods for estimating spectral clustering change points for multivariate time series

Ivor Cribben

University of Alberta

Date: Wednesday 19 October 2016

NOTE day and time of this seminar
Spectral clustering is a computationally feasible and model-free method widely used in the identification of communities in networks. We introduce a data-driven method, namely Network Change Points Detection (NCPD), which detects change points in the network structure of a multivariate time series, with each component of the time series represented by a node in the network. Spectral clustering allows us to consider high dimensional time series where the number of time series is greater than the number of time points. NCPD allows for estimation of both the time of change in the network structure and the graph between each pair of change points, without prior knowledge of the number or location of the change points. Permutation and bootstrapping methods are used to perform inference on the change points. NCPD is applied to various simulated high dimensional data sets as well as to a resting state functional magnetic resonance imaging (fMRI) data set. The new methodology also allows us to identify common functional states across subjects and groups. Extensions of the method are also discussed. Finally, the method promises to offer a deep insight into the large-scale characterisations and dynamics of the brain.
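The building block behind NCPD can be illustrated by spectrally clustering the network estimated from a multivariate series via its (absolute) correlation matrix, and comparing the partitions before and after a structural change. The toy factor model, the use of |correlation| as affinity and the two-cluster setting are assumptions for illustration; the change-point search and permutation/bootstrap inference of the actual method are not reproduced.

```python
# Sketch: spectral clustering of correlation networks on two halves of a series.
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(3)
p, n_half = 12, 200

def two_group_series(n, groups, p, rng, noise=0.6):
    """Each group of variables shares a latent factor plus independent noise."""
    X = noise * rng.normal(size=(n, p))
    for g in groups:
        X[:, g] += rng.normal(size=(n, 1))
    return X

# network structure changes half-way through: group memberships are reshuffled
X1 = two_group_series(n_half, [np.arange(6), np.arange(6, 12)], p, rng)
X2 = two_group_series(n_half, [np.arange(0, 12, 2), np.arange(1, 12, 2)], p, rng)

def cluster_labels(X, k=2):
    A = np.abs(np.corrcoef(X, rowvar=False))      # affinity = |correlation|
    np.fill_diagonal(A, 0.0)
    return SpectralClustering(n_clusters=k, affinity="precomputed",
                              random_state=0).fit_predict(A)

ari = adjusted_rand_score(cluster_labels(X1), cluster_labels(X2))
print("partition agreement before vs after the change (ARI):", round(ari, 2))
```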
161007094119
Inverse prediction for paleoclimate models

John Tipton

Colorado State University

Date: Tuesday 18 October 2016

NOTE day and time of this seminar
Many scientific disciplines have strong traditions of developing models to approximate nature. Traditionally, statistical models have not included scientific models and have instead focused on regression methods that exploit correlation structures in data. The development of Bayesian methods has generated many examples of forward models that bridge the gap between scientific and statistical disciplines. The ability to fit forward models using Bayesian methods has generated interest in paleoclimate reconstructions, but there are many challenges in model construction and estimation that remain.

I will present two statistical reconstructions of climate variables using paleoclimate proxy data. The first example is a joint reconstruction of temperature and precipitation from tree rings using a mechanistic process model. The second reconstruction uses microbial species assemblage data to predict peat bog water table depth. I validate predictive skill using proper scoring rules in simulation experiments, providing justification for the empirical reconstruction. Results show that forward models which leverage scientific knowledge can improve paleoclimate reconstruction skill and increase understanding of the latent natural processes.
161007103441
Ultrahigh dimensional variable selection for interpolation of point referenced spatial data

Benjamin Fitzpatrick

Queensland University of Technology

Date: Monday 17 October 2016

NOTE day and time of this seminar
When making inferences concerning the environment, ground truthed data will frequently be available as point referenced (geostatistical) observations accompanied by a rich ensemble of potentially relevant remotely sensed and in-situ observations.
Modern soil mapping is one such example characterised by the need to interpolate geostatistical observations from soil cores and the availability of data on large numbers of environmental characteristics for consideration as covariates to aid this interpolation.

In this talk I will outline my application of Least Absolute Shrinkage and Selection Operator (LASSO) regularized multiple linear regression (MLR) to build models for predicting full cover maps of soil carbon when the number of potential covariates greatly exceeds the number of observations available (the p > n or ultrahigh dimensional scenario). I will outline how I have applied LASSO regularized MLR models to data from multiple (geographic) sites and discuss investigations into treatments of site membership in models and the geographic transferability of models developed. I will also present novel visualisations of the results of ultrahigh dimensional variable selection and briefly outline some related work in ground cover classification from remotely sensed imagery.

Key references:
Fitzpatrick, B. R., Lamb, D. W., & Mengersen, K. (2016). Ultrahigh Dimensional Variable Selection for Interpolation of Point Referenced Spatial Data: A Digital Soil Mapping Case Study. PLoS ONE, 11(9): e0162489.
Fitzpatrick, B. R., Lamb, D. W., & Mengersen, K. (2016). Assessing Site Effects and Geographic Transferability when Interpolating Point Referenced Spatial Data: A Digital Soil Mapping Case Study. https://arxiv.org/abs/1608.00086
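A compact illustration of the p > n setting described in the abstract: LASSO-regularised multiple linear regression with the penalty chosen by cross-validation. The synthetic covariates below stand in for the remotely sensed layers; this is not the papers' soil-carbon analysis.

```python
# LASSO regression with far more covariates than observations (p >> n).
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p = 80, 2000                              # many more covariates than soil cores
X = rng.normal(size=(n, p))
true_beta = np.zeros(p)
true_beta[:5] = [2.0, -1.5, 1.0, 0.8, -0.6]  # only a handful of covariates matter
y = X @ true_beta + rng.normal(scale=0.5, size=n)

fit = LassoCV(cv=5, n_alphas=50, max_iter=5000).fit(X, y)
selected = np.flatnonzero(fit.coef_)
print("chosen penalty:", round(fit.alpha_, 4),
      " covariates retained:", selected[:10])
```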
161007111343
New Zealand master sample using balanced acceptance sampling

Paul van Dam-Bates

Department of Conservation

Date: Thursday 13 October 2016

Environmental monitoring for management organisations like the Department of Conservation is critical. Without good information about outcomes, poor management actions may persist much longer than they should, or initial intervention may occur too late. The Department currently conducts focused research at key natural heritage sites (Tier 3) as well as long-term national monitoring (Tier 1). The link between the two tiers of investigation to assess the impact of management across New Zealand (Tier 2) is yet to be implemented, but faces unique challenges in working at many different spatial scales and coordinating with multiple agencies. The proposed solution is to implement a Master Sample using Balanced Acceptance Sampling (BAS). To do this, some practical aspects of the sample design are addressed, such as stratification, unequal probability sampling, rotating panel designs and regional intensification. Incorporating information from Tier 1 monitoring directly is also discussed.

Authors: Paul van Dam-Bates[1], Ollie Gansell[1] and Blair Robertson[2]
1 Department of Conservation, New Zealand
2 University of Canterbury, Department of Mathematics and Statistics
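A bare-bones sketch of BAS in two dimensions: a random-start Halton sequence (bases 2 and 3) is mapped onto the bounding box of the study area, and points are accepted, in order, if they fall inside the region. The unit disc below stands in for a real study region, and the master-sample bookkeeping, stratification and unequal-probability features discussed in the talk are not shown.

```python
# Balanced Acceptance Sampling sketch using a random-start Halton sequence.
import numpy as np

def radical_inverse(i, base):
    """Van der Corput radical inverse of integer i in the given base."""
    f, r = 1.0, 0.0
    while i > 0:
        f /= base
        r += f * (i % base)
        i //= base
    return r

def bas_sample(in_region, bbox, n, seed=None):
    rng = np.random.default_rng(seed)
    (x0, y0), (x1, y1) = bbox
    start = rng.integers(0, 10**6, size=2)        # random-start indices
    pts, i = [], 0
    while len(pts) < n:
        u = radical_inverse(start[0] + i, 2)
        v = radical_inverse(start[1] + i, 3)
        x, y = x0 + u * (x1 - x0), y0 + v * (y1 - y0)
        if in_region(x, y):
            pts.append((x, y))                    # accept points falling in the region
        i += 1
    return np.array(pts)

disc = lambda x, y: x**2 + y**2 <= 1.0            # toy "study region"
sample = bas_sample(disc, bbox=((-1, -1), (1, 1)), n=50, seed=1)
print(sample[:5])
```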
160525145234
How robust are capture–recapture estimators of animal population density?

Murray Efford

Department of Mathematics and Statistics

Date: Thursday 6 October 2016

Data from passive detectors (traps, automatic cameras etc.) may be used to estimate animal population density, especially if individuals can be distinguished. However, the spatially explicit capture–recapture (SECR) models used for this purpose rest on specific assumptions that may or may not be justified, and uncertainty regarding the robustness of SECR methods has led some to resist their use. I consider the robustness of SECR estimates to deviations from key spatial assumptions – uniform spatial distribution of animals, circularity of home ranges, and the shape of the radial detection function. The findings are generally positive, although variance estimates are sensitive to over-dispersion. The method is also somewhat robust to transience and other misspecifications of the detection model, but it is not foolproof, as I show with a counter example.
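SECR models rest on a spatial detection function linking capture probability to the distance between a detector and an animal's home-range centre; the half-normal form below is a common default. The parameter values, detector grid and animal location are illustrative assumptions, not values from any model discussed in the talk.

```python
# Half-normal detection function and expected detections on a toy detector grid.
import numpy as np

def halfnormal_detect(d, g0=0.2, sigma=25.0):
    """Per-occasion capture probability at distance d (metres)."""
    return g0 * np.exp(-d**2 / (2 * sigma**2))

detectors = np.array([(x, y) for x in range(0, 200, 50) for y in range(0, 200, 50)])
centre = np.array([80.0, 60.0])                  # hypothetical home-range centre
d = np.linalg.norm(detectors - centre, axis=1)
p = halfnormal_detect(d)
n_occasions = 5
p_any = 1 - (1 - p) ** n_occasions               # P(at least one capture per detector)
print("expected number of detectors with a capture:", round(p_any.sum(), 2))
```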
160527115814
Bootstrapped model-averaged confidence intervals

Jimmy Zeng

Department of Preventive and Social Medicine

Date: Thursday 29 September 2016

Model-averaging is commonly used to allow for model uncertainty in parameter estimation. In the frequentist setting, a model-averaged estimate of a parameter is a weighted mean of the estimates from the individual models, with the weights being based on an information criterion, such as AIC. A Wald confidence interval based on this estimate will often perform poorly, as its sampling distribution will generally be distinctly non-normal and estimation of the standard error is problematic. We propose a new method that uses a studentized bootstrap approach. We illustrate its use with a lognormal example, and perform a simulation study to compare its coverage properties with those of existing intervals.
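The frequentist model-averaged estimate described above can be sketched in a few lines: fit the candidate models, convert AIC differences to weights, and average the per-model estimates. The lognormal-style toy data and two-model candidate set are assumptions; the studentized bootstrap interval proposed in the talk is not reproduced here.

```python
# AIC-weighted model-averaged estimate on a toy lognormal-style example.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 60
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = np.exp(0.5 + 0.4 * x1 + rng.normal(scale=0.6, size=n))
logy = np.log(y)

designs = {"x1": np.column_stack([np.ones(n), x1]),
           "x1+x2": np.column_stack([np.ones(n), x1, x2])}
fits = {k: sm.OLS(logy, X).fit() for k, X in designs.items()}

aic = np.array([f.aic for f in fits.values()])
delta = aic - aic.min()
w = np.exp(-0.5 * delta) / np.exp(-0.5 * delta).sum()   # AIC weights
est = np.array([f.params[1] for f in fits.values()])    # coefficient of x1 in each model
print("weights:", w.round(3), " model-averaged slope:", round((w * est).sum(), 3))
```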
160520152426
N-mixture models vs Poisson regression

Richard Barker

Department of Mathematics and Statistics

Date: Thursday 22 September 2016

N-mixture models describe count data replicated in time and across sites in terms of abundance N and detectability p. They are popular because they allow inference about N while controlling for factors that influence p, without the need for marking animals. Using a capture-recapture perspective, we show that the loss of information that results from not marking animals is critical, making reliable statistical modeling of N and p problematic using count data alone. We are unable to fit a model in which the detection probabilities are distinct among repeat visits, as this model is overspecified; this makes uncontrolled variation in p problematic. By counterexample we show that even if p is constant after adjusting for covariate effects (the 'constant p' assumption), scientifically plausible alternative models in which N (or its expectation) is non-identifiable, or does not even exist, lead to data that are practically indistinguishable from data generated under an N-mixture model. This is particularly the case for sparse data, as is commonly seen in applications. We conclude that under the constant p assumption, reliable inference is only possible for relative abundance, in the absence of questionable and/or untestable assumptions or better quality data than is seen in typical applications. Relative abundance models for counts can be readily fitted using Poisson regression in standard software such as R and are sufficiently flexible to allow controlling for p through the use of covariates while simultaneously modeling variation in relative abundance. If users require estimates of absolute abundance, they should collect auxiliary data that help with estimation of p.
160829124021
Single-step genomic evaluation of New Zealand's sheep

Mohammad Ali Nilforooshan

Department of Mathematics and Statistics

Date: Thursday 15 September 2016

Quantitative genetics is the study of the inheritance of quantitative traits, which are generally continuously distributed. It uses biometry to study the expression of quantitative differences among individuals, taking account of genetic relatedness and environment. In the past, determining the genetic structure of individuals was too expensive for commercial use. However, in the last decade the price of genotyping has fallen rapidly, and there are now commercial genotype chips available for most livestock species. Currently, dense marker maps are used to predict the genetic merit of animals early in life. Several methods are available for genomic evaluation; however, because they do not consider all the available information at the same time, bias or loss of accuracy may occur. Single-step GBLUP is a method that uses all the genomic, pedigree and phenotypic data on all animals simultaneously, and is reported to limit bias and in some cases increase the accuracy of prediction. Preliminary results of this approach for New Zealand sheep will be presented.
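The single-step idea combines the pedigree relationship matrix A with the genomic relationship matrix G for the genotyped subset into one matrix H; a widely used form of its inverse modifies only the genotyped block of A-inverse. The tiny matrices below are made up for illustration and are not from the New Zealand evaluation.

```python
# Sketch of the commonly used single-step H-inverse:
#   H^{-1} = A^{-1} + block(0, G^{-1} - A_gg^{-1}) on the genotyped block.
import numpy as np

def h_inverse(A, G, genotyped):
    Ainv = np.linalg.inv(A)
    Agg = A[np.ix_(genotyped, genotyped)]
    Hinv = Ainv.copy()
    Hinv[np.ix_(genotyped, genotyped)] += np.linalg.inv(G) - np.linalg.inv(Agg)
    return Hinv

A = np.array([[1.0, 0.5, 0.5],
              [0.5, 1.0, 0.25],
              [0.5, 0.25, 1.0]])          # toy pedigree relationship matrix
G = np.array([[1.02, 0.48],
              [0.48, 0.99]])              # toy genomic relationships (animals 1 and 2, 0-based)
print(h_inverse(A, G, genotyped=[1, 2]).round(3))
```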
160525125408
Clinical trial Data Monitoring Committees - aiding science

Katrina Sharples

Department of Mathematics and Statistics

Date: Thursday 8 September 2016

The goal of a clinical trial is to obtain reliable evidence regarding the benefits and risks of a treatment while minimising the harm to patients. Recruitment and follow-up may take place over several years, accruing information over time, which allows the option of stopping the trial early if the trial objectives have been met or the risks to patients become too great. It has become standard practice for trials with significant risk to be overseen by an independent Data Monitoring Committee (DMC). These DMCs have sole access to the accruing trial data; they are responsible for safeguarding the rights of the patients in the trial, and for making recommendations to those running the trial regarding trial conduct and possible early termination. However, interpreting the accruing evidence and making optimal recommendations is challenging. As the number of trials with DMCs has grown, there has been increasing discussion of how to train new DMC members. Some DMCs have published papers describing their decision-making processes for specific trials, and workshops are held fairly frequently. However, it is recognised that DMC expertise is best acquired through apprenticeship. Opportunities for this are rare internationally, but in New Zealand the Health Research Council established, in 1996, a unique system for monitoring clinical trials which incorporates apprenticeship positions. This talk will describe our system, discuss some of the issues and insights that have arisen along the way, and the effects it has had on the NZ clinical trial environment.
160524142027
A statistics-related guest seminar in Preventive and Social Medicine: A researcher's guide to understanding modern statistics

Sander Greenland

University of California

Date: Monday 5 September 2016

Note day, time and venue of this special seminar
Sander Greenland is Research Professor and Emeritus Professor of Epidemiology and Statistics at the University of California, Los Angeles. He is a leading contributor to epidemiological statistics, theory, and methods, with a focus on the limitations and misuse of statistical methods in observational studies. He has authored or co-authored over 400 articles and book chapters in epidemiology, statistics, and medical publications, and co-authored the textbook Modern Epidemiology.

Professor Greenland has played an important role in the recent discussion following the American Statistical Association’s statement on the use of p values.[1-3] He will discuss lessons he took away from the process and how they apply to properly interpreting what is ubiquitous but rarely interpreted correctly by researchers: Statistical tests, P-values, power, and confidence intervals.

1. Wasserstein, R.L. and Lazar, N.A. (2016). The ASA's statement on p-values: context, process, and purpose. The American Statistician, 70, 129-133. DOI: 10.1080/00031305.2016.1154108
2. Greenland, S., Senn, S.J., Rothman, K.J., Carlin, J.C., Poole, C., Goodman, S.N., and Altman, D.G. (2016). Statistical tests, confidence intervals, and power: A guide to misinterpretations. The American Statistician, 70, online supplement 1 at http://www.tandfonline.com/doi/suppl/10.1080/00031305.2016.1154108; reprinted in the European Journal of Epidemiology, 31, 337-350.
3. Greenland, S. (2016). The ASA guidelines and null bias in current teaching and practice. The American Statistician, 70, online supplement 10 at http://www.tandfonline.com/doi/suppl/10.1080/00031305.2016.1154108
160804123414
Sugars are not all the same to us! Empirical investigation of inter- and intra-individual variabilities in responding to common sugars

Mei Peng

Department of Food Science

Date: Thursday 25 August 2016

Given the collective interest in sugar from food scientists, geneticists, neurophysiologists, and many others (e.g., health professionals, food journalists, and YouTube experimenters), one would expect the picture of human sweetness perception to be reasonably complete by now. This is unfortunately not the case. Some seemingly fundamental questions have not yet been answered: is one's sweetness sensitivity generalisable across different sugars? Can people discriminate between sugars when they are equally sweet? Do common sugars have similar effects on people's cognitive processing?

Answers to these questions have a close relevance to illuminating the sensory physiology of sugar metabolism, as well as to practical research of sucrose substitution. In this seminar, I would like to present findings from a few behavioural experiments focused on inter-individual and intra-individual differences in responding to common sugars, using methods from sensory science and cognitive psychology. Overall, our findings challenged some of the conventional beliefs about sweetness perception, and provided some insights into future research about sugar.
160517112909
New models for symbolic data

Scott Sisson

University of New South Wales

Date: Thursday 18 August 2016

Symbolic data analysis is a fairly recent technique for the analysis of large and complex datasets, based on summarising the data into a number of "symbols" prior to analysis. Inference is then based on analysis of the data at the symbol level (modelling symbols, predicting symbols, etc.). In principle this idea works; however, it would be more advantageous and natural to fit models at the level of the underlying data rather than the symbol. Here we develop a new class of models for the analysis of symbolic data that are fitted directly to the data underlying the symbols, allowing for a more intuitive and flexible approach to analysis using this technique.
160520142124
Estimation of relatedness using low-depth sequencing data

Ken Dodds

AgResearch

Date: Thursday 11 August 2016

Estimates of relatedness are used for traceability, parentage assignment, estimating genetic merit and for elucidating the genetic structure of populations. Relatedness can be estimated from large numbers of markers spread across the genome. A relatively new way of obtaining genotypes is to derive them directly from sequencing data. Often the sequencing protocol is designed to interrogate only a subset of the genome (but one spread across it); one such method is known as genotyping-by-sequencing (GBS). A genotype consists of the pair of genetic types (alleles) at a particular position. Each read delivers only one allele of the pair, so even two or more reads at a position do not guarantee that both alleles are seen. A method of estimating relatedness which accounts for this feature of GBS data is given. The method depends on the number of reads (the depth) at a particular position and also accommodates zero reads (missing data). The theory for the method, simulations and some applications to real data are presented, along with further related research questions.
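The key feature of GBS data mentioned above has a simple consequence worth spelling out: if each read shows one of the two alleles at random, a true heterozygote read at depth k is observed as heterozygous only with probability 1 - (1/2)^(k-1), which is what depth-aware estimators must correct for. The short calculation below just tabulates that probability.

```python
# Probability that both alleles of a heterozygote are seen at a given read depth,
# assuming each read samples one of the two alleles independently at random.
import numpy as np

depth = np.arange(1, 11)
p_both_seen = 1 - 0.5 ** (depth - 1)
for k, p in zip(depth, p_both_seen):
    print(f"depth {k:2d}: P(heterozygote observed as heterozygous) = {p:.3f}")
```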
160517113606
The replication "crisis" in psychology, medicine, etc.: what should we do about it?

Jeff Miller

Department of Psychology

Date: Thursday 4 August 2016

Recent large-scale replication studies and meta-analyses suggest that about 50–95% of the positive “findings” reported in top scientific journals are false positives, and that this is true across a range of fields including Psychology, Medicine, Neuroscience, Genetics, and Physical Education. Some causes of this alarmingly high percentage are easily identified, but what is the appropriate cure? In this talk I describe a simple model of the research process that researchers can use to identify the optimal attainable percentage of false positives and to plan their experiments accordingly.
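The arithmetic behind figures like "50–95% false positives" follows from Bayes' rule applied to the significance level, the power, and the prior probability that a tested hypothesis is true. The inputs below are illustrative assumptions, not estimates from the talk.

```python
# Proportion of "significant" results that are false positives, for a few priors.
alpha, power = 0.05, 0.5          # assumed significance level and average power

for prior_true in (0.5, 0.2, 0.05):
    p_sig = power * prior_true + alpha * (1 - prior_true)
    false_pos_share = alpha * (1 - prior_true) / p_sig
    print(f"P(hypothesis true) = {prior_true:.2f} -> "
          f"{100 * false_pos_share:.0f}% of positive findings are false")
```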
160517113415
Do density-dependent processes structure biodiversity?

Jon Waters

Department of Zoology

Date: Thursday 28 July 2016

New Zealand’s marine ecosystems have experienced rapid and dramatic changes over recent centuries. Notably, DNA comparisons of New Zealand’s archaeological versus modern pinniped and penguin assemblages have revealed sudden spatio-temporal genetic shifts, apparently in response to human-mediated extirpation events. These rapid biological changes in our marine environment apparently underscore the role of ‘founder takes all’ processes in shaping biogeographic distributions. Specifically, established high-density populations are seemingly able to exclude individuals dispersing from distant sources. Conversely, extirpation events, including those driven by human pressure and environmental change, can provide opportunities for range expansion of surviving lineages. The recent self-introductions of penguins and pinniped lineages from trans-oceanic sources highlight the dynamic biological history of coastal New Zealand.
160517113246
Musings of a statistical consultant

Tim Jowett

Department of Mathematics and Statistics

Date: Thursday 21 July 2016

I will talk about my role as a consultant statistician and briefly discuss some interesting applied statistics projects that I have been working on.
160209092604
Quantitative genetics in the modern genomics era: new challenges, new opportunities

Phil Wilcox

Department of Mathematics and Statistics

Date: Thursday 14 July 2016

Modern genomics technologies have led to vast amounts of data being generated on an increasingly wide range of species, particularly non-human and non-model organisms. In this talk I will describe how these technological advances impact the training of students, and how we've responded by developing a new course offering in quantitative genetics. I will also describe some of the analytical challenges in my current research projects, and from the wider Virtual Institute of Statistical Genetics research programme, including opportunities for further statistical method development.
160524143246
Project presentations

Honours and PGDip students

Department of Mathematics and Statistics

Date: Friday 27 May 2016

STATISTICS
Michel de Lange: Deep learning
Georgia Anderson: Probabilistic linear discriminant analysis
Nick Gelling: Automatic differentiation in R

15-MINUTE BREAK 2.40-2.55

MATHEMATICS
Alex Blennerhassett: Toeplitz algebra of a directed graph
Zoe Luo: Wavelet models for evolutionary distance
Xueyao Lu: Making sense of the λ-coalescent
Terry Collins-Hawkins: Reactive diffusion in systems with memory
Josh Ritchie: Linearisation of hyperbolic constraint equations

Also
CJ Marland: Extending matchings of graphs: a survey
This mathematics project presentation takes place at 12 noon on Thursday 26 May, room 241
160520092655
The US obesity epidemic: evidence from the Economic Security Index

Trent Smith

Department of Economics

Date: Thursday 26 May 2016

A growing body of research supports the "economic insecurity" theory of obesity, which posits that uncertainty with respect to one's material well-being may be an important root cause of the modern obesity epidemic. This literature has been limited in the past by a lack of reliable measures of economic insecurity. In this paper we use the newly developed Economic Security Index to explain changes in US adult obesity rates as measured by the National Health and Nutrition Examination Surveys (NHANES) from 1988-2012, a period capturing much of the recent rapid rise in obesity. We find a robust positive and statistically significant relationship between obesity and economic insecurity that holds for nearly every age, gender, and race/ethnicity group in our data, both in cross-section and over time.
160215153330
Spline-based approach to infer farm vehicle trajectories

Jerome Cao

Department of Mathematics and Statistics

Date: Thursday 19 May 2016

GPS units mounted on a vehicle record its position, speed and bearing. The time series of positions then represents the trajectory of the vehicle. Noisy measurements and infrequent sampling, however, mean simplistic trajectory reconstruction will have unrealistic features, like sharp turns. Smoothing spline methods can efficiently build smoother, more realistic trajectories. In a conventional smoothing spline, the objective function includes a term for errors in position and also a penalty term, which has a single parameter that controls the smoothness of reconstruction. An adaptive smoothing spline extends the single parameter to a function that varies in different domains and performs local smoothing. In this talk, I will introduce a tractor spline that incorporates both position and velocity information but penalizes excessive accelerations. The penalty term is also dependent on the operational status of the tractor. The objective function now includes a term for errors in velocity that is controlled by a new parameter and an adjusted penalty term for better control of trajectory curvature. We develop cross validation techniques to find three parameters of interest. A short discussion of the relationship between spline methods and Gaussian process regression will be given. A simulation study and real data example are presented to demonstrate the effectiveness of this new method.
160217090057
Bayes of Thrones: Bayesian prediction for Game of Thrones

Richard Vale

WRUG

Date: Tuesday 17 May 2016

I will describe an analysis from 2014 of the appearances of characters in a popular series of novels. Treating the number of appearances of each character as longitudinal data, a mixed model can be used to make probabilistic predictions of their appearances in future novels. I will discuss how the model is fitted and how its predictions should be evaluated.
160503133837
Designing a pilot study using adaptive DP-optimality

Stephen Duffull

School of Pharmacy

Date: Thursday 12 May 2016

Managing the balance between developing a life-threatening blood clot and the risk of major bleeding is a complicated clinical problem. When patients are at risk of a blood clot it is usual clinical practice to administer an anticoagulant, in this case enoxaparin, to reduce this risk. Anticoagulants, however, increase the risk of a major bleed. To help reduce the risks it has become common clinical practice to measure a biomarker that provides a measure of the overall risk profile. In the absence of a suitable biomarker the clinician is essentially wearing a blindfold. Recently a mathematical model has been developed that provides a description of the coagulation processes in the human body. This model was used to identify a target that may provide a suitable biomarker of the risk for enoxaparin treatment. While the model is reasonably complicated (77 ODEs), it is not able to describe the exact analytical conditions necessary to establish the experimental conditions from which to develop a suitable biomarker. In this work, an adaptive DP-optimal design method is used in conjunction with a simplified coagulation model to define the experimental conditions that can be used in clinical practice. These conditions were later tested in a clinical study and found to perform well.
160215133140
Decoupled shrinkage and selection for Gaussian graphical models

Beatrix Jones

Massey University

Date: Thursday 5 May 2016

Even when a Bayesian analysis has been carefully constructed to encourage sparsity, conventional posterior summaries with good predictive properties (e.g. the posterior mean) are typically not sparse. An approach called Decoupled Shrinkage and Selection (DSS), which uses a loss function that penalizes both poor fit and model complexity, has been used to address this problem in regression. This talk extends that approach to Gaussian graphical models. In this context, DSS not only provides a sparse summary of a posterior sample of graphical models, but allows us to obtain a sparse graphical structure that summarises the posterior even when the (inverse) covariance model fit is not a graphical model at all. This potentially offers huge computational advantages. We will examine simulated cases where the true graph is non-decomposable, a posterior is computed over decomposable graphical models, and DSS is able to recover the true non-decomposable structure. We will also consider computing the initial posterior based on a Bayesian factor model, and then recovering the graph structure using DSS. Finally, we illustrate the approach by creating a graphical model of dependencies across metabolites in a metabolic profile: in this case, a data set from the literature containing simultaneous measurements of 151 metabolites (small molecules in the blood) for 1020 subjects.
Joint work with Gideon Bistricer (Massey Honors student), Carlos Carvalho (U Texas) and Richard Hahn (U Chicago)
160209092512
Is research methodology a latent subject of data science?

Ben Daniel

HEDC

Date: Thursday 28 April 2016

As a field of study, research methodology is concerned with the utilisation of systematic approaches and procedures to investigate well-defined problems, underpinned by particular epistemological and ontological orientations. For many years, research methodology has occupied a central role in postgraduate education, with courses taught at all levels and across a wide range of disciplinary contexts.

In this seminar, I will first present findings from a large scale research project examining the concept of research methodology among academic staff involved in teaching methods courses from 139 universities in 9 countries. I will then discuss how this has ultimately influenced the way academics relate to and approach teaching of the subject.

Secondly, I will share key findings from another study exploring postgraduate students' views on the value of research methodology, and outline the challenges they face in learning the subject. To conclude, I will address the question of whether the recognition of research methodology as an independent field of study within data science can contribute to a better understanding of current and future challenges associated with the increasing availability of data from vast interconnected and loosely coupled systems within the higher education sector.
160211094606
Focussed model averaging in GLMs

Chuen Yen Hong

Department of Mathematics and Statistics

Date: Thursday 21 April 2016

Parameter estimation is often based on a single model, usually chosen by a model selection process. However, this ignores the uncertainty in model selection. Model averaging takes this uncertainty into account. In the frequentist framework, the model-averaged point estimate is a weighted mean of the estimates obtained from each model. The weights are often based on an information criterion such as AIC or BIC, but can also be chosen to minimize an estimate of the mean squared error of the model-averaged estimator. The latter type of weight is focussed on the parameter of interest. We present an approach for deriving focussed weights for generalised linear models (GLMs), and compare it with existing approaches.
160209092422
Development of a next generation genetic evaluation system for the New Zealand sheep industry

Benoit Auvray

Department of Mathematics and Statistics

Date: Thursday 14 April 2016

The Otago Quantitative Genetics Group, part of the Mathematics and Statistics Department of the University of Otago, is working in collaboration with Beef+Lamb New Zealand Genetics and other organisations to develop a new genetic evaluation system for New Zealand sheep. The new system will seamlessly combine traditional genetic evaluation data, typically millions of pedigree and phenotype records, along with new DNA data, that may include billions of data points, to give estimates of animal genetic merit for a wide variety of economically important traits, for selection and breeding purposes.

In this seminar, we will present the work undertaken by our group and compare it with the existing genetic evaluation system.
160218151236
Real-time updating for decision-making in emergency response to outbreaks of foot-and-mouth disease

Will Probert

University of Nottingham

Date: Thursday 7 April 2016

During infectious disease outbreaks there may be uncertainty regarding both the extent of the outbreak and the optimality of alternative control interventions. As an outbreak progresses, and information accrues, so too does the level of confidence upon which decisions regarding control response are based. However, the longer the delay in making a decision the larger the potential opportunity cost of inaction.

We examine this trade-off using data from the UK 2001 outbreak of foot-and-mouth disease (FMD) by fitting a dynamic epidemic model to the observed infection data available at several points throughout each outbreak and compare forward simulations of the impact of alternative culling and vaccination interventions. For comparison, we repeat these forward simulations at each time point using the model fitted to data from the complete outbreak.

Results illustrate the impact of the accrual of knowledge on both model predictions and on the evaluation of candidate control actions, and highlight the importance of control policies that permit both rapid response and adaptive updating of control actions in response to additional information.
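To make the idea of "forward simulation under a candidate control action" concrete, the sketch below runs a generic discrete-time stochastic SIR model with and without a reduction in transmission. This is emphatically not the spatially explicit FMD model used in the work above; the population size, rates and the way control is represented are all illustrative assumptions.

```python
# Generic chain-binomial SIR forward simulation, with and without a control effect.
import numpy as np

def sir_forward(beta, gamma=0.2, N=10_000, I0=20, days=120, seed=0):
    rng = np.random.default_rng(seed)
    S, I, R, peak = N - I0, I0, 0, I0
    for _ in range(days):
        new_inf = rng.binomial(S, 1 - np.exp(-beta * I / N))   # new infections
        new_rec = rng.binomial(I, 1 - np.exp(-gamma))          # new recoveries/removals
        S, I, R = S - new_inf, I + new_inf - new_rec, R + new_rec
        peak = max(peak, I)
    return {"peak_infected": peak, "final_size": R}

print("no intervention  :", sir_forward(beta=0.5))
print("reduced transmission:", sir_forward(beta=0.3))   # stand-in for culling/vaccination
```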
160209092652
Mixed graphical models with applications to integrative cancer genomics

Genevera Allen

Rice University, Texas

Date: Thursday 24 March 2016

"Mixed Data'' comprising a large number of heterogeneous variables (e.g. count, binary, continuous, skewed continuous, among others) is prevalent in varied areas such as imaging genetics, national security, social networking, Internet advertising, and our particular motivation - high-throughput integrative genomics. There have been limited efforts at statistically modeling such mixed data jointly. In this talk, we address this by introducing several new classes of Markov Random Fields (MRFs), or graphical models, that yield joint densities which directly parameterize dependencies over mixed variables. To begin, we present a novel class of MRFs arising when all node-conditional distributions follow univariate exponential family distributions that, for instance, yield novel Poisson graphical models. Next, we present several new classes of Mixed MRF distributions built by assuming each node-conditional distribution follows a potentially different exponential family distribution. Fitting these models and using them to select the mixed graph in high-dimensional settings can be achieved via penalized conditional likelihood estimation that comes with strong statistical guarantees. Simulations as well as an application to integrative cancer genomics demonstrate the versatility of our methods.
Joint work with Eunho Yang, Pradeep Ravikumar, Zhandong Liu, Yulia Baker, and Ying-Wooi Wan
160209091439
Some interesting challenges in finite mixture and extreme modelling

Kate Lee

Auckland University of Technology

Date: Wednesday 23 March 2016

Note day and time; not the usual
The goal of Bayesian inference is to infer a parameter and a model in a Bayesian setup. In this talk I will discuss some well-known problems in finite mixture and extreme modelling, and I will present my recent work.
The finite mixture model is a flexible tool for modelling multimodal data and has been used in many applications of statistical analysis. The model evidence must often be approximated, which is computationally demanding owing to a well-known lack of identifiability; I will present a dual importance sampling scheme for evidence approximation and show how it reduces the computational workload. Lastly, an extreme event is often described by modelling exceedances over a threshold, and the threshold value plays a key role in the statistical inference. I will demonstrate that a suitable threshold can be determined using a Bayesian measure of surprise, and that this approach is easily implemented for both univariate and multivariate extremes.
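The peaks-over-threshold step referred to above is easy to sketch: exceedances over a chosen threshold are modelled with a generalised Pareto distribution, and the fitted shape and scale depend on where the threshold is placed. The heavy-tailed simulated data and the candidate thresholds are assumptions; the Bayesian measure-of-surprise approach itself is not shown.

```python
# Generalised Pareto fits to exceedances over several candidate thresholds.
import numpy as np
from scipy.stats import genpareto, t as student_t

x = student_t.rvs(df=4, size=5000, random_state=1)     # heavy-tailed toy data

for q in (0.90, 0.95, 0.99):
    u = np.quantile(x, q)
    exceed = x[x > u] - u
    shape, loc, scale = genpareto.fit(exceed, floc=0)   # location fixed at 0
    print(f"threshold at {q:.0%} quantile: xi = {shape:.2f}, "
          f"sigma = {scale:.2f}, n = {exceed.size}")
```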
160311140558
Improved estimation of intrinsic growth $r_{max}$ for long-lived species: mammals, birds, and sharks

Peter Dillingham

University of New England, New South Wales

Date: Tuesday 22 March 2016

Note day and time; not the usual
Intrinsic population growth rate ($r_{max}$) is an important parameter for many ecological applications, such as population risk assessment and harvest management. However, $r_{max}$ can be a difficult parameter to estimate, particularly for long-lived species, for which appropriate life table data or abundance time series are typically not obtainable. We developed a method for improving estimates of $r_{max}$ for long-lived species by integrating life-history theory (allometric models) and population-specific demographic data (life table models). Broad allometric relationships, such as those between life history traits and body size, have long been recognized by ecologists. These relationships are useful for deriving theoretical expectations for $r_{max}$, but $r_{max}$ for real populations varies from simple allometric estimators for “archetypical” species of a given taxon or body mass. Meanwhile, life table approaches can provide population-specific estimates of $r_{max}$ from empirical data, but these may have poor precision owing to imprecise and missing vital rate parameter estimates. By integrating the two approaches, we provide estimates that are consistent with both life-history theory and population-specific empirical data. Ultimately, this yields estimates of $r_{max}$ that are likely to be more robust than estimates provided by either method alone.
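The life-table route to $r_{max}$ mentioned above boils down to solving the Euler-Lotka equation, $\sum_x l_x m_x e^{-r x} = 1$, for the growth rate r given survivorship and fecundity schedules. The schedule below is a made-up long-lived-species example, not data from the talk.

```python
# Solve the Euler-Lotka equation for r given toy survivorship/fecundity schedules.
import numpy as np
from scipy.optimize import brentq

ages = np.arange(1, 31)                                   # ages 1..30 (years)
survival = 0.92 ** ages                                   # l_x: cumulative survivorship
fecundity = np.where(ages >= 8, 0.45, 0.0)                # m_x: female offspring per female

def euler_lotka(r):
    return np.sum(survival * fecundity * np.exp(-r * ages)) - 1.0

r_max = brentq(euler_lotka, -0.5, 0.5)                    # root of the Euler-Lotka equation
print(f"estimated r_max = {r_max:.3f} per year")
```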
160311135949
How to put together data science and sports science to understand expertise in sport

Ludovic Seifert

University of Rouen

Date: Thursday 17 March 2016

Expertise is mostly analysed in terms of performance outcomes through speed, accuracy, and economy criteria. But understanding expertise goes beyond the questions "how fast can you swim?", "how far can you jump?" or "how fluently can you climb?". Adaptability - the capacity of an expert to modify their behaviour in response to subtle modifications in the constraints acting on them - might also be a key concept to investigate (Seifert, Button, & Davids, 2013). According to Newell (1986), individuals interact with three types of constraints: environmental (e.g. wind, wave, temperature), task (e.g. required speed or frequency) and organismic (e.g. the impact of size, shape and density of the body and its segments). By artificially generating perturbations during practice, we can explore how performers adapt their limb movements and limb coordination patterns to constraints, brought about by a subtle blend of behavioural stability and flexibility. Stability corresponds to the capability and the time an individual takes to resist a perturbation or to recover their initial motor behaviour after perturbation (Seifert et al., 2014). Flexibility relates to the fluctuations within a coordinative pattern that continually adapt it to a given set of constraints (Davids, Araújo, Seifert, & Orth, 2015). Adaptability corresponds to the ratio between behavioural stability and flexibility, in the sense that an adaptive performer is stable and flexible as required, supporting functional movement and coordination variability. The aim of this talk is to present how data science methods such as data mining and machine learning can help to examine behavioural dynamics and variability within and between individuals in order to understand expertise in sport.

Davids, K., Araújo, D., Seifert, L., & Orth, D. (2015). Expert performance in sport: An ecological dynamics perspective. In J. Baker & D. Farrow (Eds.), Handbook of Sport Expertise (pp. 273–303). London, UK: Taylor & Francis.
Newell, K. M. (1986). Constraints on the development of coordination. In M. G. Wade & H. T. A. Whiting (Eds.), Motor development in children. Aspects of coordination and control (pp. 341–360). Dordrecht, Netherlands: Martinus Nijhoff.
Seifert, L., Button, C., & Davids, K. (2013). Key properties of expert movement systems in sport : an ecological dynamics perspective. Sports Medicine, 43(3), 167–78.
Seifert, L., Komar, J., Barbosa, T., Toussaint, H., Millet, G., & Davids, K. (2014). Coordination pattern variability provides functional adaptations to constraints in swimming performance. Sports Medicine, 44(10), 1333–45.
160218151340
Hidden Markov models for sparse time series from non-volcanic tremor observations

Ting Wang

Department of Mathematics and Statistics

Date: Thursday 10 March 2016

Tremor activity has recently been detected in various tectonic areas worldwide, and is spatially segmented and temporally recurrent. In this seminar I will present a class of hidden Markov models (HMMs) that we designed to investigate this phenomenon, in which each state represents a distinct segment of tremor sources. A mixture distribution of a Bernoulli variable and a continuous variable was introduced into the HMM to deal with the fact that tremor clusters are very sparse in time. We applied our model to tremor data from the Tokai region in Japan to identify distinct segments of tremor source regions and the spatiotemporal migration pattern among these segments. I will also discuss ways to check the model fit.
160217095305
Efficient computation of likelihoods of physical traits of species

Gordon Hiscott

Department of Mathematics and Statistics

Date: Thursday 3 March 2016

We present new methods for efficiently computing likelihoods of visible physical traits (phenotypes) of species which are connected by an evolutionary tree. These methods combine an existing dynamic programming algorithm for likelihood computation with methods for numerical integration. We have already applied these particular methods to a dataset on extrafloral nectaries (EFNs) across a large evolutionary tree connecting species of Fabales plants. We compare the different numerical integration techniques that can be applied to the dynamic programming algorithm. In addition, we compare our numerical integration results to the published results of a “precursor” model applied to the same EFN dataset. These results include not only likelihood approximations, but also changes in phenotype along the tree and the Akaike Information Criterion (AIC), which is used to determine the relative quality of a statistical model.
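The dynamic programming referred to above is essentially Felsenstein-style pruning: partial likelihoods are propagated from the tips of the tree towards the root. Below is a minimal two-state (trait absent/present) version on a small hard-coded tree with assumed rates and root distribution; the numerical-integration layer and the EFN analysis itself are not reproduced.

```python
# Minimal pruning-algorithm sketch for a two-state trait on a tiny fixed tree.
import numpy as np
from scipy.linalg import expm

Q = np.array([[-0.3, 0.3],       # illustrative gain rate 0.3
              [0.1, -0.1]])      # illustrative loss rate 0.1

# tree: ((A:1.0, B:1.0):0.5, C:1.5);  observed tip states: A=1, B=1, C=0
def tip(state):
    v = np.zeros(2)
    v[state] = 1.0
    return v

def combine(children):
    """Partial likelihood at a node from (child_partial_likelihood, branch_length) pairs."""
    L = np.ones(2)
    for child_L, t in children:
        L *= expm(Q * t) @ child_L   # sum over child states weighted by transition probs
    return L

node_AB = combine([(tip(1), 1.0), (tip(1), 1.0)])
root = combine([(node_AB, 0.5), (tip(0), 1.5)])

pi = np.array([0.25, 0.75])          # assumed root state distribution
print("log-likelihood:", np.log(pi @ root))
```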
160211094704
Scaring the bejesus into people: adventures in religion and death

Jamin Halberstadt

Department of Psychology

Date: Thursday 15 October 2015

The claims that fear of death motivates religious belief, and that belief assuages such fear, feature prominently in historical and contemporary theories of religion. However, the evidence for these claims, both correlational and experimental, is ambivalent, in part due to uncertainty about the meanings of "belief" and "fear", as well as a failure to consider both conscious and unconscious cognition. In this talk I present the results of recent studies on the effect of death anxiety on acute explicit and implicit religious belief, as well as the effectiveness of belief in ameliorating both explicit and implicit anxiety. Initial studies suggest dual and complementary processes: ostensibly nonreligious individuals report greater explicit disbelief, but also greater implicit belief, when experiencing death anxiety. One interpretation is that this discrepancy permits nonreligious individuals to enjoy "worldview defense" while implicitly allowing for the immortality that supernatural agents afford.
150414161644
Marginal and conditional sampling in hierarchical models

Colin Fox

Department of Physics

Date: Thursday 8 October 2015

Gibbs sampling is a popular technique for exploring hierarchical stochastic models, and is commonly the algorithm of first choice for statisticians. However, the Gibbs sampler scales poorly with model size or data size. We follow up on the observation by Håvard Rue that in most circumstances where Gibbs sampling is feasible, the more efficient sampling over marginal distributions is also feasible. The resulting "marginal then conditional" sampler scales very well with problem size and even allows inference directly over function spaces. I will demonstrate these ideas with several examples from inverse problems and statistics, including the nuclear power plant pump failure data presented by Gelfand and Smith that kicked off the MCMC revolution.

Joint work with Richard Norton and Andres Christen.
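A toy version of "marginal then conditional" sampling, for a two-level Gaussian model with known variances and a flat prior on the mean: the hyperparameter is drawn from its marginal posterior (latent effects integrated out), then the latent effects are drawn from their conditionals. The model, variances and data below are assumptions chosen to keep the sketch short; the function-space samplers and the pump-failure example from the talk are not reproduced.

```python
# Marginal-then-conditional sampler for y_i | theta_i ~ N(theta_i, s2),
# theta_i | mu ~ N(mu, t2), flat prior on mu (variances assumed known).
import numpy as np

rng = np.random.default_rng(0)
s2, t2 = 1.0, 0.5
y = rng.normal(loc=2.0, scale=np.sqrt(s2 + t2), size=50)   # toy observations
n = len(y)

mu_draws = []
for _ in range(2000):
    # marginal: y_i | mu ~ N(mu, s2 + t2)  =>  mu | y ~ N(ybar, (s2 + t2)/n)
    mu = rng.normal(y.mean(), np.sqrt((s2 + t2) / n))
    # conditional: theta_i | mu, y_i is Gaussian with precision-weighted mean
    prec = 1 / s2 + 1 / t2
    theta = rng.normal((y / s2 + mu / t2) / prec, np.sqrt(1 / prec))
    mu_draws.append(mu)

print("posterior mean of mu:", round(np.mean(mu_draws), 3),
      " (data mean:", round(y.mean(), 3), ")")
```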
150416160327

Project presentations

Statistics Honours and PGDip students

Department of Mathematics and Statistics

Date: Friday 2 October 2015

Yunan Wang: Binary segmentation for change-point detection in GPS time series
Patrick Brown: Investigating dynamic time series models to predict future tourism demand in New Zealand
Alastair Lamont: Hierarchical modelling approaches to estimate genetic breeding values
150922114126
Earthquake clustering: modelling, testing and extension

Jiancang Zhuang

Institute of Statistical Mathematics, Japan

Date: Thursday 1 October 2015

The Epidemic-Type Aftershock Sequence (ETAS) model developed by Ogata has become a de facto standard model, or null hypothesis, for testing other models and hypotheses related to seismicity. This seminar introduces the history of the ETAS model and some other general topics, including: 1. stochastic separation of earthquake clusters from the catalogue; 2. techniques for testing hypotheses or seismicity anomalies against the ETAS model; 3. modelling the role of earthquake fault geometry in seismicity triggering, and inversion of earthquake fault geometry based on seismicity. Throughout the seminar I will focus on how to find which features in the earthquake data are not described by the model, rather than on those already built into it. The methods discussed in this seminar can also be applied to other point process models.
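For readers unfamiliar with ETAS, its conditional intensity is a background rate plus contributions from all previous events, with magnitude-dependent productivity and an Omori-Utsu time decay. The parameter values and toy catalogue below are purely illustrative.

```python
# ETAS conditional intensity:
# lambda(t) = mu + sum_{t_i < t} A * exp(alpha*(m_i - m0)) * (t - t_i + c)^(-p)
import numpy as np

def etas_intensity(t, times, mags, mu=0.2, A=0.5, alpha=1.2, c=0.01, p=1.1, m0=4.0):
    past = times < t
    dt = t - times[past]
    return mu + np.sum(A * np.exp(alpha * (mags[past] - m0)) * (dt + c) ** (-p))

times = np.array([1.0, 1.5, 4.0, 4.02])        # toy catalogue (days)
mags = np.array([5.0, 4.2, 6.1, 4.5])
print([round(etas_intensity(t, times, mags), 3) for t in (2.0, 4.1, 10.0)])
```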
150908130713
Surfing with wavelets

Ingrid Daubechies, AMS-NZMS 2015 Maclaurin Lecturer

Duke University

Date: Monday 28 September 2015

This talk gives an overview of wavelets: what they are, how they work, why they are useful for image analysis and image compression. Then it will go on to discuss how they have been used recently for the study of paintings by e.g. Van Gogh, Goossen van der Weyden, Gauguin and Giotto.
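To give a quick sense of why wavelets suit signal and image compression, the sketch below transforms a piecewise-smooth signal, discards all but the largest 5% of detail coefficients, and reconstructs it with little error. It uses the PyWavelets package on a synthetic signal, not a painting, and the wavelet and threshold choices are arbitrary.

```python
# Wavelet compression sketch: keep only the largest detail coefficients.
import numpy as np
import pywt

t = np.linspace(0, 1, 1024)
signal = np.sin(6 * np.pi * t) + (t > 0.6) * 0.8          # smooth curve plus a jump

coeffs = pywt.wavedec(signal, "db4", level=6)             # multilevel wavelet transform
detail = np.concatenate([np.abs(c) for c in coeffs[1:]])
thr = np.quantile(detail, 0.95)                           # keep the largest 5% of details
sparse = [coeffs[0]] + [pywt.threshold(c, thr, mode="hard") for c in coeffs[1:]]
recon = pywt.waverec(sparse, "db4")

kept = sum(int(np.count_nonzero(c)) for c in sparse)
total = sum(c.size for c in coeffs)
err = np.max(np.abs(recon[:signal.size] - signal))
print(f"kept {kept} of {total} coefficients; max reconstruction error = {err:.3f}")
```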
150623133358
Sparsity in Data Analysis and Computation

Ingrid Daubechies, AMS-NZMS 2015 Maclaurin Lecturer

Duke University

Date: Monday 28 September 2015

Numerical computation has long exploited the fact that sparse matrices are special, and that there exist very fast algorithms to deal with them. In the last 15 years or so, mathematicians, engineers, computer scientists and statisticians have come to realize that "sparsity" can buy us much more: using it correctly is now crucially important in many computational contexts, and we will review a few.
150623140016
Obesity, breast cancer, and something Shiny

Mik Black

Department of Biochemistry

Date: Thursday 24 September 2015

There is growing evidence that obesity not only increases an individual's lifetime risk of developing cancer, but also that it may modify both patient prognosis, and the molecular biology of the disease. In this seminar, I will explore the molecular aspects of this hypothesis using publicly available genomic data from a collection of international breast cancer cohorts. As a side topic, I will also demonstrate the use of the recently developed Shiny package for R, using examples involving these genomics data sets, and also breast cancer data from the NZ Cancer Registry. The Shiny package allows users to quickly generate flexible web-based interfaces for basic data visualization and analysis, and thus provides a powerful mechanism for making complex data sets more accessible to a wide range of researchers.
150413104830
Rank tests for survey data

Alastair Scott

University of Auckland

Date: Thursday 17 September 2015

Analysing survey data has become big business in recent years, driven in part by public access to the results of large medical and social surveys such as NHANES and the British Household Panel Survey. Typically, the researchers carrying out this analysis would like to use exactly the same statistical techniques that they would use with experimental data. This is now broadly possible; all the major packages have survey modules for implementing most standard statistical methods. However there are still a couple of significant gaps. In particular, there are no survey analogues of nonparametric procedures such as the Wilcoxon or Kruskal-Wallis tests. In their absence, many researchers just use the standard versions designed for random sampling instead. We show how to construct general rank tests under complex sampling, both for comparing groups within a survey and for using a national survey as a reference distribution. We look at their performance in simulations and in some examples from the literature where ordinary rank tests have been applied naively.
150409163506
The development, use and future of free interactive web apps for training in official statistics

John Harraway

Department of Mathematics and Statistics

Date: Thursday 10 September 2015

Users of official statistics, including government and other public decision makers, have varying levels of knowledge and skill in interpreting statistics for appropriate decision making. Interactive Web Apps providing relevant training have been developed through our Department. Based on the content of the NZ Certificate in Official Statistics, these Apps are now freely available for use on a variety of IT platforms such as desktops, laptops, tablets and phones. The initial planning for the Apps will be discussed, encompassing their relevance for New Zealand, Pacific nations and developing countries. The structure and content of three Apps will be illustrated: the first on Measuring Price Change, the second on Comparing Populations both within and between countries, and the third on Data Visualisation. Hosted on the International Statistical Literacy Project web site, the Apps were released at the recent World Statistics Congress in Brazil, and this week they are being used in a United Nations workshop in Bangkok. National statistics offices in Vietnam, Angola, Mozambique, Malawi, Gambia and Egypt have already expressed interest. There is a need to translate the Apps into other languages, make the data more country specific, and develop stand-alone versions.
150410100623
Evaluating ingenious instruments for fundamental determinants of long-run economic growth and development

Dorian Owen

Department of Economics

Date: Thursday 3 September 2015

The empirical literature on the determinants of cross-country differences in long-run economic development is characterized by widespread application of instrumental variables (IV) estimation methods and the ingenious nature of many of the instruments used. However, scepticism remains about their ability to provide a valid basis for causal inference. This study examines the extent to which more emphasis on the statistical dimensions of selecting IVs can usefully complement economic theory as a basis for assessing instrument choice. The statistical underpinnings of IV estimation are highlighted by explicit consideration of the statistical adequacy of the underlying reduced form (RF). Most fundamental determinants studies do not explicitly report the full RF and it is not evident that they test for misspecification of the RF. Diagnostic testing of RFs in influential studies reveals widespread evidence of model misspecification. This feature, surprisingly not previously identified, potentially undermines inferences about the structural parameters, such as the quantitative and statistical significance of different fundamental determinants.
150423105209