15 Satellite Transmitters and Tītī
The Mutton bird (tītī) has one of the longest migration routes in the world, and breeds in the Southern Islands of New Zealand. The chicks are harvested, so sustainability of the population is an important issue.
This lesson investigates the possible adverse effects of attaching transmitters to tītī. This research was carried out by David Fletcher (Department of Mathematics & Statistics, University of Otago).
Data
There is 1 data file associated with this presentation. It contains the data you will need to complete the lesson tasks.
Video
Objectives
Tasks
In this study, tītī attendance at burrows was monitored over 30 days. The tītī were randomly allocated into two groups - those with satellite transmitters attached, and those without.
0. Read and Format data
0a. Read in the data
First check you have installed the package readxl
(see Section 2.6) and set the working directory (see Section 2.1), using instructions in Getting started with R.
Load the data into R.
The code has been hidden initially, so you can try to load the data yourself first before checking the solutions.
Code
#loads readxl package
library(readxl)
#loads the data file and names it transmitter
<-read_xls("Titi Data.xls",range="A2:C32")
transmitter
#view beginning of data frame
head(transmitter)
Code
#loads the data file and names it notransmitter
<-read_xls("Titi Data.xls",range="D2:E32")
notransmitter
#copies the Day column to include in this data frame
$Day<-transmitter$Day
notransmitter
#view beginning of data frame
head(notransmitter)
Code
#loads readxl package
library(readxl)
Warning: package 'readxl' was built under R version 4.2.2
Code
#loads the data file and names it transmitter
<-read_xls("Titi Data.xls",range="A2:C32")
transmitter
#view beginning of data frame
head(transmitter)
# A tibble: 6 × 3
Day Burrows `Attended Number`
<dbl> <dbl> <dbl>
1 1 3 0
2 2 4 2
3 5 5 3
4 6 5 2
5 7 6 4
6 8 8 4
Code
#loads the data file and names it notransmitter
<-read_xls("Titi Data.xls",range="D2:E32")
notransmitter
#copies the Day column to include in this data frame
$Day<-transmitter$Day
notransmitter
#view beginning of data frame
head(notransmitter)
# A tibble: 6 × 3
Burrows `Attendede Number` Day
<dbl> <dbl> <dbl>
1 2 1 1
2 4 0 2
3 6 2 5
4 6 2 6
5 7 1 7
6 8 1 8
The data recorded is Group (transmitter or no transmitter), Day, Burrows (the number of burrows monitored on each day), and Attended Number (the number of burrows for which tītī attendance was observed).
0b. Format the data
Rename the Attended Number variable in both groups to be a single word Attendance, for easier reference later.
Code
colnames(transmitter)<-c("Day","Burrows","Attendance")
Repeat for the notransmitter
data frame.
Code
colnames(transmitter)<-c("Day","Burrows","Attendance")
colnames(notransmitter)<-c("Burrows","Attendance","Day")
1. New Variables
Tītī attendance was measured with bounded counts (i.e. counts out of a possible total). Therefore we have binomial data and we should calculate the attendance proportions. These are the fraction of Burrows that were monitored where we observed a certain Attendance.
Calculate this for each data frame and add it as a new column.
Code
$Attended_Proportion<-transmitter$Attendance/transmitter$Burrows transmitter
Repeat for the notransmitter
data frame.
Code
$Attended_Proportion<-transmitter$Attendance/transmitter$Burrows
transmitter
$Attended_Proportion<-notransmitter$Attendance/notransmitter$Burrows notransmitter
2. Scatter Plot
To analyse the Attended_Proportion, begin by looking at the transmitter group data.
A good starting point for this analysis of the proportions is to draw a scatter plot of the data, with Attended_Proportion on the y axis and Day on the x axis for the transmitter group. Make sure to set axes and title labels, you can also investigate some different point characters and colours.
What patterns are visible from the graph - is there an overall trend visible for the transmitter group, and are there any outliers?
Code
plot(transmitter$Day,transmitter$Attended_Proportion,pch=17,col="cadetblue",
xlab="Day",ylab="Attended Proportion",
main = "Proportion of Burrows attended by Tītī each Day")
Code
plot(transmitter$Day,transmitter$Attended_Proportion,pch=17,col="cadetblue",
xlab="Day",ylab="Attended Proportion",
main = "Proportion of Burrows attended by Tītī each Day")
There appears to be a negative relationship between Day and Attended Proportion of burrows, for increasing days the proportion of attended burrows generally decreases. On day 40 nearly 50% of the burrows are attended, this departs from the overall trend and may be an outlier. Potentially due to recording error - 0.05 mis-recorded as 0.5.
3. Linear Regression, Plot Fitted Line and Confidence Interval Limits (Base R vs. GGplot), ANOVA, Residual Plots
3a. Simple Linear Regression
Despite the variability in the data exhibited in the scatter plot, there seemed to be an overall trend in the Attended_Proportion for the transmitter group over the six week period of the study.
Fit a linear regression of Attended_Proportion against Day to capture this trend.
Code
#fitting linear model
<-lm(Attended_Proportion~Day,data=transmitter)
transModel
#view model output
summary(transModel)
#anova
anova(transModel)
Code
#fitting linear model
<-lm(Attended_Proportion~Day,data=transmitter)
transModel
#view model output
summary(transModel)
Call:
lm(formula = Attended_Proportion ~ Day, data = transmitter)
Residuals:
Min 1Q Median 3Q Max
-0.47026 -0.10508 0.00459 0.08414 0.32657
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.478579 0.067408 7.100 1e-07 ***
Day -0.008321 0.002814 -2.957 0.00625 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1718 on 28 degrees of freedom
Multiple R-squared: 0.2379, Adjusted R-squared: 0.2107
F-statistic: 8.742 on 1 and 28 DF, p-value: 0.006251
Code
#anova
anova(transModel)
Analysis of Variance Table
Response: Attended_Proportion
Df Sum Sq Mean Sq F value Pr(>F)
Day 1 0.25816 0.258163 8.742 0.006251 **
Residuals 28 0.82688 0.029531
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
3b. Interprete ANOVA
The F statistic in the ANOVA has a p-value of 0.006 - what does this tell us?
A p-value of 0.006 indicates that if there is truly no relationship between the response Attended_Proportion and the predictor Day, the probability of observing the data that we have is only 0.6%.
This presents significant evidence that there is in fact a relationship between Day and Attended_Proportion. That is, Attended_Proportion changes according to Day.
A cut-off of 0.05 is typically implemented for interpreting p-values, as a 5% chance of observing the data (if there is in fact no true relationship between the predictor and the response) is considered sufficiently unlikely.
3c. Plot Fitted Line (Base R)
Include the fitted linear model on the scatter plot from Task 2.
Calculate 95% confidence limits for this fitted line and add them to the plot.
Code
#repeat original plot
plot(transmitter$Day,transmitter$Attended_Proportion,pch=17,col="cadetblue",
xlab="Day",ylab="Attended Proportion",
main = "Proportion of Burrows attended by Tītī each Day")
#add fitted regression line. lwd= controls the thickness of the line
abline(transModel,col="red3",lwd=2)
Calculate 95% confidence limits
Code
#add lower and upper ci boundaries
#generate a fine sequence of values for Day
<-seq(0,max(transmitter$Day)+5,length.out=200)
daySeq
#predict fitted values, along with lower and upper limits of confidence interval
#for each daySeq value using the fitted model
<-predict(transModel,newdata=data.frame(Day=daySeq),interval="confidence")
ci
head(ci) #have a look at what is being calculated
Add confidence limits to the plot using the lines()
function. Complete with a legend to distinguish the lines.
Code
#repeat plot with regression line
plot(transmitter$Day,transmitter$Attended_Proportion,pch=17,col="cadetblue",
xlab="Day",ylab="Attended Proportion",
main = "Proportion of Burrows attended by Tītī each Day")
abline(transModel,col="red3",lwd=2)
#lty= controls the line type (e.g dashed)
lines(daySeq,ci[,2],lty=2,col="red") #lower limit
lines(daySeq,ci[,3],lty=2,col="red") #upper limit
#add legend
legend("topright",c("Fitted line","Lower 95% CI","Upper 95% CI"),lwd=c(2,1,1),lty=c(1,2,2),col=c("red","red3","red3"))
Code
#repeat original plot
plot(transmitter$Day,transmitter$Attended_Proportion,pch=17,col="cadetblue",
xlab="Day",ylab="Attended Proportion",
main = "Proportion of Burrows attended by Tītī each Day")
#add fitted regression line. lwd= controls the thickness of the line
abline(transModel,col="red3",lwd=2)
#add lower and upper ci boundaries
#generate a fine sequence of values for Day
<-seq(0,max(transmitter$Day)+5,length.out=200)
daySeq
#predict fitted values, along with lower and upper limits of confidence interval
#for each daySeq value using the fitted model
<-predict(transModel,newdata=data.frame(Day=daySeq),interval="confidence")
ci
#lty= controls the line type (e.g dashed)
lines(daySeq,ci[,2],lty=2,col="red") #lower limit
lines(daySeq,ci[,3],lty=2,col="red") #upper limit
#add legend
legend("topright",c("Fitted line","Lower 95% CI","Upper 95% CI"),lwd=c(2,1,1),lty=c(1,2,2),col=c("red","red3","red3"))
The fitted regression line confirms our earlier observation of a downwards trend in the proportion of attended burrows for increasing days in the transmitter data. The confidence limits indicate we can be confident the true relationship is negative, as any potential slope within these limits would be downwards.
3d. Plot Fitted Line (GGplot)
It is possible to construct the plot similar to the one above with only a single line of code. However, this involves installing a new package and thinking about plotting in a different way.
The ggplot2
package was designed for easily creating attractive plots. The information to be included on the plot is specified by layers of functions, which are joined by + signs. These functions correspond to certain components of the plot. We need to consider 3 components: data, aesthetics, and geometric objects, to replicate our confidence interval plot.
The data component is the data we wish to include in the plot. The aesthetics component maps this data to the graphical objects, for example specifying the predictor (x) and response (y) variables. These first 2 components are both included in the function ggplot(data= ,aes=)
.
The geometric object component indicates the actual plot to be created. Each type of plot or graphical object has its own function, beginning with geom_
. We will use 2 geometric objects, geom_point()
to create the basic scatter plot, and geom_smooth()
to add the fitted line. Specifyingmethod="lm"
indicates that we want to fit a linear model to the data (as we did above), and by default geom_smooth
includes 95% confidence limits around the fitted line.
Code
library(ggplot2)
#ggplot scatter plot with fitted line
ggplot(data=transmitter,aes(x=Day,y=Attended_Proportion))+geom_point()+
geom_smooth(method="lm")
Code
library(ggplot2)
#ggplot scatter plot with fitted line
ggplot(data=transmitter,aes(x=Day,y=Attended_Proportion))+geom_point()+
geom_smooth(method="lm")
`geom_smooth()` using formula 'y ~ x'
This plot is identical to the earlier one constructed using base R.
You can see that once you understand the syntax of ggplot, it is straightforward to create a scatter plot with a fitted line and 95% confidence interval. However, in many ways ggplot is less flexible than base R. It can be harder to tweak certain automatically specified elements of your plots when using this package. Both methods have costs and benefits, depending the particular application.
You can read more about ggplot by looking at the links in the ggplot2
help file.
3e. Prediction from Regression Equation
By looking back at the regression output we can see the fitted equation is:
\[\hat{y}=0.4786-0.00832 Day\]
Use this to predict the Attended_Proportion at the \(20^{th}\) and \(50^{th}\) days - are these predictions reliable?
Code
#day 20
=0.4786-0.00832*20
haty20 haty20
Modify to calculate for the \(50^{th}\) day.
Code
#day 20
=0.4786-0.00832*20
haty20 haty20
[1] 0.3122
Code
#day 50
=0.4786-0.00832*50
haty50 haty50
[1] 0.0626
Attended proportion on the 20th day is predicted to be 0.3122. This is a fairly reliable prediction, the value is in the middle of the sample data used to fit the model.
Attended proportion on the 50th day is predicted to be 0.0626. This is not a reliable prediction, as the value is well outside our sample data. It is an extrapolation, and these should be interpreted with care.
3f. Residual Plots
An important step in model fitting is to check the residuals. Generate the relevant plots and check for any indicators of concern.
The relevant residual plots are created by passing the fitted model object to the plot()
function
Code
plot(transModel)
Code
plot(transModel)
There are some slight curves in the red mean line through the residuals, but these are not dramatic enough to cause concern regarding non-linearity. There appears to be constant variance across the fitted values.
The standardised residuals on the normal qq plot relatively closely follow the standard normal quantiles, with departure only at the extreme quantiles.
There are a few outliers with square root standardised residuals around 1.5 (corresponding to 2.25 when squared).
The fourth plot indicates several influential points - observations 1, 5 and 29. All are influential due to a combination of large residual values and high leverage positions. Their large residuals may be due to measurement or transcription errors or simply additional natural variation that has not been accounted for by our model. 5 has the lowest leverage of the three so is slightly less influential. Observations 1 and 29 have the two biggest residual values and positions of high leverage so are considerably influential.
4. Extension: Scatter Plot, Linear Regression, ANOVA, Plot Fitted Line and Confidence Interval Limits (Base R vs. GGplot), Residual Plots
Analyse the notransmitter data (remember this is a separate data frame).
Create a scatter plot.
Perform a linear regression and ANOVA. Add the fitted line along with 95% confidence interval limits to your plot, to see if there is evidence of a trend for the noransmitter group.
If there is a trend, is it different from that shown by the transmitter group? Are there any outliers?
Construct and interpret residual plots to check model fit.
5. Confidence Intervals by Day
Another way to analyse the data is to directly compare the two groups on each of the 30 nights by looking at the differences in the Attended_Proportion.
This allows us to see which group had a higher rate of attendance on each day, and how large this difference was.
Add a new column to the transmitter data frame, calculated as the Attended_Proportion from the transmitter group minus the Attended_Proportion from the notransmitter group.
Following this we can plot Prop_Difference against Day and add 95% confidence limits to these differences, producing the plot below.
What can we conclude from this graph - what is the overall pattern, and what do the 95% confidence intervals tell us?
Calculate difference variable
Code
$Prop_Difference<-transmitter$Attended_Proportion-notransmitter$Attended_Proportion transmitter
If you wish to recreate the plot of confidence limits for the difference in proportions by day, work through the following code:
First we calculate the components of our 95% confidence interval - the estimate, the standard error, and the multiplier.
Code
#estimated difference in proportion
<-transmitter$Prop_Difference
est
#shortening the variable names for easier reference
<-transmitter$Attended_Proportion
p1<-notransmitter$Attended_Proportion
p2
#calculating the standard error for the difference in proportions
<-sqrt(((p1*(1-p1))/transmitter$Burrows) + ((p2*(1-p2))/notransmitter$Burrows))
se
#multiplier for 95% ci for difference in proportion uses standard normal distribution
<-qnorm(0.975) m
Plot the scatter plot of differences.
We construct the 95% confidence limits for each Day, then use these values to specify the minimum and maximum of vertical lines around the corresponding estimated Prop_Difference.
Finally we add a line at 0, so we can check if these confidence intervals include a 0 difference.
Code
#scatter plot
plot(transmitter$Day,transmitter$Prop_Difference,xlab="Day",
ylab="Difference in Proportion",pch=15,col="slateblue2",ylim=c(-1,1),
main="Attendance Proportion Difference (transmitter- no transmitter)")
#calculate lower and upper ci limits for each day
<-est-m*se
ci1<-est+m*se
ci2
#add lines to plot by specifying the day and lower and upper limits (coordinates)
segments(x0=transmitter$Day,y0=ci1,y1=ci2)
#horizontal line at 0, lty=3 is dotted
abline(h=0,lty=3)
Code
#difference variable
$Prop_Difference<-transmitter$Attended_Proportion-notransmitter$Attended_Proportion transmitter
Code
#estimated difference in proportion
<-transmitter$Prop_Difference
est
#shortening the variable names for easier reference
<-transmitter$Attended_Proportion
p1<-notransmitter$Attended_Proportion
p2
#calculating the standard error for the difference in proportions
<-sqrt(((p1*(1-p1))/transmitter$Burrows) + ((p2*(1-p2))/notransmitter$Burrows))
se
#multiplier for 95% ci for difference in proportion uses standard normal distribution
<-qnorm(0.975) m
Code
#scatter plot
plot(transmitter$Day,transmitter$Prop_Difference,xlab="Day",
ylab="Difference in Proportion",pch=15,col="slateblue2",ylim=c(-1,1),
main="Attendance Proportion Difference (transmitter- no transmitter)")
#calculate lower and upper ci limits for each day
<-est-m*se
ci1<-est+m*se
ci2
#add lines to plot by specifying the day and lower and upper limits (coordinates)
segments(x0=transmitter$Day,y0=ci1,y1=ci2)
#horizontal line at 0, lty=3 is dotted
abline(h=0,lty=3)
The difference in attended proportion between transmitter and non-transmitter birds suggests that initially the fitting of a transmitter appears to increase burrow attendance, with several entirely positive confidence intervals (this is counter-intuitive to what we might expect). The difference appears to decrease over time, although there is considerable overlap in the confidence intervals of successive days so we cannot be confident of a true decrease. The majority of confidence intervals for the difference in attended proportion include 1, so there is no conclusive evidence of greater attendance by either transmitter or non-transmitter birds.
6. Issues with Small Datasets
What are some of the problems we should be aware of when dealing with proportions from small samples?
A small sample size gives us less information than a larger sample and has a greater risk of misrepresenting the characteristics of the wider population. For example the ages of Tītī, presence of injuries etc. in this sample may not be an accurate reflection of the entire population.
If multiple samples were taken (this should never be done in practice) and estimates of a value of interest are compared, these results will vary much more between small samples (that are easily influenced by extreme values) than larger ones.
A small sample size reduces the power of any statistical tests we perform, it limits our ability to detect a true effect if there is one. The plotted confidence intervals in Task 5 indicate no difference in burrow attendance for transmitter and non-transmitter birds, but this may be due to the small sample size rather than the lack of a true difference.
A small sample size can also increase the risk of concluding a true effect when one does not exist, as if extreme values happen to be sampled they will not be moderated by more common values (as they would in a large sample).
7. Bootstrap Confidence Interval by Day
An alternate method for creating confidence intervals for Prop_Difference is to use the bootstrap technique. In previous lessons we have used a non-parametric bootstrap, which involves directly re-sampling from the data. In this task we will perform a parametric bootstrap.
A parametric bootstrap involves resampling from a fitted model. By doing this several times, we get an estimate of the amount of variation that we might have seen had we performed the experiment multiple times. This gives us an estimate of the distribution of the statistic of interest.
Create a new data frame that contains both the transmitter and notransmitter Attended_Proportion for each Day.
Code
<-data.frame("Day"=transmitter$Day,"Attended_Trans"=transmitter$Attended_Proportion,"Attended_Notrans"=notransmitter$Attended_Proportion,"Burrows_Trans"=transmitter$Burrows,"Burrows_Notrans"=notransmitter$Burrows) transmitterExpand
Write a function that draws samples from two binomial distributions (one for the transmitter and one for the notransmitter data) and calculates the difference in proportion of attended burrows.
Code
#function that calculates the difference in proportions for each day drawn from binomial distributions
<-function(data,indices){
propdiffFun
<-rbinom(1,data$Burrows_Trans,data$Attended_Trans)
dataInd1
<-rbinom(1,data$Burrows_Notrans,data$Attended_Notrans)
dataInd2
<-(dataInd1/data$Burrows_Trans)-(dataInd2/data$Burrows_Notrans)
propdiff
return(propdiff)
}
Perform the bootstrap. Note that we must do this separately for each day, so enclose the relevant code in a for loop.
Code
#set seed so you get the same results when running again
set.seed(170)
library(boot)
#loop over each day within the data frame, perform the bootstrap and plotting the results
for(i in 1:nrow(transmitterExpand)){
#bootstrap difference proportion of burrows attended for Day i, note sim="parametric"
<-boot(transmitterExpand[i,],sim="parametric",statistic=propdiffFun,R=10000)
propdiffBoot
#percentile confidence interval
<-boot.ci(propdiffBoot,type="perc")
percCi
#extract day for plot title
<-transmitterExpand[i,]$Day
d#plot density histogram of bootstrap results
hist(propdiffBoot$t,breaks=20,freq=F,xlab="Difference in Attended Proportion (Transmitter - Non-Transmitter)", main=paste("Day",d,"Difference in Attended Proportion of Burrows"))
#add a smoothed line to show the distribution
lines(density(propdiffBoot$t,na.rm=T), lwd = 2, col = "slategray3")
#add confidence limits
abline(v=percCi$percent[4:5],col="slategray2",lty=2)
}
Create a new data frame that contains both the transmitter and notransmitter Attended_Proportion for each Day.
Code
<-data.frame("Day"=transmitter$Day,"Attended_Trans"=transmitter$Attended_Proportion,"Attended_Notrans"=notransmitter$Attended_Proportion,"Burrows_Trans"=transmitter$Burrows,"Burrows_Notrans"=notransmitter$Burrows) transmitterExpand
Write a function that draws samples from two binomial distributions (one for the transmitter and one for the notransmitter data) and calculates the difference in proportion of attended burrows.
Code
#function that calculates the difference in proportions for each day drawn from binomial distributions
<-function(data,indices){
propdiffFun
<-rbinom(1,data$Burrows_Trans,data$Attended_Trans)
dataInd1
<-rbinom(1,data$Burrows_Notrans,data$Attended_Notrans)
dataInd2
<-(dataInd1/data$Burrows_Trans)-(dataInd2/data$Burrows_Notrans)
propdiff
return(propdiff)
}
Perform the bootstrap. Note that we must do this separately for each day, so enclose the relevant code in a for loop.
Code
#set seed so you get the same results when running again
set.seed(170)
library(boot)
Warning: package 'boot' was built under R version 4.2.2
Code
#loop over each day within the data frame, perform the bootstrap and plotting the results
for(i in 1:nrow(transmitterExpand)){
#bootstrap difference proportion of burrows attended for Day i, note sim="parametric"
<-boot(transmitterExpand[i,],sim="parametric",statistic=propdiffFun,R=10000)
propdiffBoot
#percentile confidence interval
<-boot.ci(propdiffBoot,type="perc")
percCi
#extract day for plot title
<-transmitterExpand[i,]$Day
d#plot density histogram of bootstrap results
hist(propdiffBoot$t,breaks=20,freq=F,xlab="Difference in Attended Proportion (Transmitter - Non-Transmitter)", main=paste("Day",d,"Difference in Attended Proportion of Burrows"))
#add a smoothed line to show the distribution
lines(density(propdiffBoot$t,na.rm=T), lwd = 2, col = "slategray3")
#add confidence limits
abline(v=percCi$percent[4:5],col="slategray2",lty=2)
}