13 Rotavirus Burden
We have a survey open right now that we invite you to fill out.
This lesson follows a study for the World Health Organisation investigating the estimation of global burden of disease attributable to rotavirus. That is, counting the number of deaths globally due to rotavirus. The original research is presented by Julie Legler (St Olaf College).
The data discussed in the video was retrieved from the World Health Organisation (WHO) in 2004, and due to its estimated nature there have been some revisions between now and then. There has also been widespread implementation of the rotavirus vaccine that Julie discusses, so it is interesting and relevant for us to consider some more current data. Data for the year 2016 is the most recently available at the time of writing and provided for download on this page. Estimated proportions of diarrhoea deaths attributable to rotavirus were retrieved from Troeger et al. (2018), and all other data was sourced from the WHO website. The reported statistics will not align with the video, rather we can focus on comparing them to see how the vaccine has impacted the number of rotavirus deaths. The methodology used to obtain these values is directly translatable from the video.
Data
There is 1 file associated with this presentation. It contains the 2016 data you will need to complete the lesson tasks.
Video
Objectives
Tasks
0. Read and Format data
0a. Read in the data
First check you have installed the package readxl
(see Section 2.6) and set the working directory (see Section 2.1), using instructions in Getting started with R.
Load the data into R.
The code has been hidden initially, so you can try to load the data yourself first before checking the solutions.
Code
#loads readxl package
library(readxl)
#loads the data file and names it rotavirus
<-read_xlsx("Rotavirus Data.xlsx")
rotavirus
#view beginning of data frame
head(rotavirus)
Code
#loads readxl package
library(readxl)
Warning: package 'readxl' was built under R version 4.2.2
Code
#loads the data file and names it rotavirus
<-read_xlsx("Rotavirus Data.xlsx")
rotavirus
#view beginning of data frame
head(rotavirus)
# A tibble: 6 × 13
`WHO Region` World…¹ Count…² Country Under…³ Under…⁴ Diarr…⁵ TotFe…⁶ Femal…⁷
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Eastern Medit… Low in… AFG Afghan… 5529. 79812 5.84e3 4.8 62.4
2 Europe Upper … ALB Albania 177. 331 2.58e0 1.66 79.7
3 Africa Lower … DZA Algeria 4741. 25016 1.01e3 3.05 77.5
4 Africa Lower … AGO Angola 5319. 98196 1.28e4 5.69 64.5
5 Americas High i… ATG Antigu… 7.36 11 0 2.00 77.8
6 Americas Upper … ARG Argent… 3740. 8219 8.02e1 2.29 79.3
# … with 4 more variables: TotHealthExpend <dbl>, InfantMortality <dbl>,
# EstimatedRotavirusProp <dbl>, ImmunisationPerc <dbl>, and abbreviated
# variable names ¹`World Bank Income Group`, ²`Country ISO Code`, ³Under5Pop,
# ⁴Under5Deaths, ⁵DiarrhoeaDeaths, ⁶TotFertility, ⁷FemaleLifeExpect
# ℹ Use `colnames()` to see all variable names
This data frame contains information for 183 countries (Country), including their WHO Region, World Bank Income Group, and Country ISO Code. For each country in the year 2016 this includes the population under 5 years in thousands (Under5Pop), the number of deaths under 5 years (Under5Deaths), the number of deaths attributed to diarrhoea (DiarrhoeaDeaths), the total fertility rate (TotFertility), the life expectancy at birth for females (FemaleLifeExpect), the total health expenditure as a percentage of GDP (TotHealthExpend), the infant mortality rate per 1000 live births (InfantMortality), and the percentage of 1-year olds who had received a completed rotavirus vaccine dose in 2014 (ImmunisationPerc). The proportion of diarrhoea deaths due to rotavirus are given for 105 countries (EstimatedRotavirusProp).
1. Confidence Interval (mean proportion), Missing Data (overall mean)
1a. Confidence Interval (mean proportion)
Obtain the overall mean proportion of diarrhoea deaths due to rotavirus (EstimatedRotaVirusProp) and the corresponding 95% confidence interval.
How does our estimate for the mean proportion of diarrhoea deaths due to rotavirus (2016) compare to the mean proportion (2004) discussed in the video?
We calculate a confidence interval for the mean by using t.test()
for a single set of data.
Code
t.test(rotavirus$EstimatedRotavirusProp)
Code
t.test(rotavirus$EstimatedRotavirusProp)
One Sample t-test
data: rotavirus$EstimatedRotavirusProp
t = 11.711, df = 104, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
0.1999397 0.2814555
sample estimates:
mean of x
0.2406976
We can say with 95% confidence that the true mean proportion of diarrhoea deaths due to rotavirus in 2016 was between 0.1999 and 0.2815, with a best estimate of 0.2407. This is less than the mean proportion of 0.36 recorded in 2004.
1b. Replace Missing Data (overall mean)
Use the mean proportion as an estimate for missing EstimatedRotaVirusProp values.
Create a new variable ConstantRotavirusProp containing existing EstimatedRotavirusProp values, and replace the missing proportions with the mean proportion estimate.
To calculate the total estimated rotavirus deaths in 2016, multiply the number of DiarrhoeaDeaths by the proportion attributed to rotavirus for each country (where missing data is the estimated by the mean proportion). sum()
these to find the total.
Code
#new variable, pre-fill with available rotavirus proportions
$ConstantRotavirusProp<-rotavirus$EstimatedRotavirusProp
rotavirus
#assign missing entries of this variable to be equal to the mean rotavirus proportion
$ConstantRotavirusProp[which(is.na(rotavirus$ConstantRotavirusProp))]<-0.2407 rotavirus
Code
#total estimated deaths
sum(rotavirus$DiarrhoeaDeaths*rotavirus$ConstantRotavirusProp)
Code
#new variable, pre-fill with available rotavirus proportions
$ConstantRotavirusProp<-rotavirus$EstimatedRotavirusProp
rotavirus
#assign missing entries of this variable to be equal to the mean rotavirus proportion
$ConstantRotavirusProp[which(is.na(rotavirus$ConstantRotavirusProp))]<-0.2407 rotavirus
Code
#total estimated deaths
sum(rotavirus$DiarrhoeaDeaths*rotavirus$ConstantRotavirusProp)
[1] 119376.8
2. Bootstrap Confidence Interval, Histogram (sum of rotavirus deaths)
2a. Bootstrap Confidence Interval
Construct a bootstrap confidence interval for the sum of diarrhoea deaths attributable to rotavirus, continuing to use the mean rotavirus proportion as an estimate for countries where data is missing (ConstantRotavirusProp).
Perform bootstrap for the sum.
Code
library(boot)
#set seed so you get the same results when running again
set.seed(402)
#function that calculates the sum from the bootstrapped samples
<-function(data,indices){
sumFun<-data[indices,]
dataInd<-sum(dataInd$DiarrhoeaDeaths*dataInd$ConstantRotavirusProp)
sumdatareturn(sumdata)
}
#bootstrap sum of diarrhoea deaths attributable to rotavirus.
#create object called sumBoot with the output of the bootstrapping procedure
<-boot(rotavirus,statistic=sumFun, R=10000) sumBoot
Use boot.ci()
to calculate a percentile confidence interval.
Code
#percentile (type="perc") confidence interval
<-boot.ci(sumBoot,type="perc")
percCi
#check lower and upper limits from percentile interval object
$percent[4:5] percCi
Code
library(boot)
Warning: package 'boot' was built under R version 4.2.2
Code
#set seed so you get the same results when running again
set.seed(402)
#function that calculates the sum from the bootstrapped samples
<-function(data,indices){
sumFun<-data[indices,]
dataInd<-sum(dataInd$DiarrhoeaDeaths*dataInd$ConstantRotavirusProp)
sumdatareturn(sumdata)
}
#bootstrap sum of diarrhoea deaths attributable to rotavirus.
#create object called sumBoot with the output of the bootstrapping procedure
<-boot(rotavirus,statistic=sumFun, R=10000) sumBoot
Code
#percentile (type="perc") confidence interval
<-boot.ci(sumBoot,type="perc")
percCi
#check lower and upper limits from percentile interval object
$percent[4:5] percCi
[1] 51464.73 236036.51
2b. Histogram
Create a graph showing the distribution of the sum of the bootstrap samples with a density curve and 95% confidence interval limits plotted on top.
How does this bootstrap confidence interval for the number of rotavirus deaths compare to the bootstrap confidence interval for the number of rotavirus deaths discussed in the video?
What factors could have contributed to this change?
Code
#plot density histogram of bootstrap results
hist(sumBoot$t,breaks=20,freq=F, xlab="Total Rotavirus Deaths",
main="Sum of Rotavirus Deaths in 2016")
#add a smoothed line to show the distribution
lines(density(sumBoot$t),col="skyblue3",lwd=2)
#95% ci limits
abline(v=percCi$percent[4:5],col="skyblue2",lty=2)
Code
#plot density histogram of bootstrap results
hist(sumBoot$t,breaks=20,freq=F, xlab="Total Rotavirus Deaths",
main="Sum of Rotavirus Deaths in 2016")
#add a smoothed line to show the distribution
lines(density(sumBoot$t),col="skyblue3",lwd=2)
#95% ci limits
abline(v=percCi$percent[4:5],col="skyblue2",lty=2)
The bootstrapped sums follow a right-skewed distribution with mean of 119376.8. This mean of sums matches the estimated sum of diarrhoea deaths attributable to rotavirus we calculated earlier, and the distribution of bootstrapped sum estimates around it give an indication of the amount of variability in our estimate due to sampling. This 95% percentile interval is from 51,464.73 to 236,036.51, which is a very large range.
The corresponding bootstrap confidence interval in the video is from 362,363.04 to 1,157,640.34. This interval encompasses much greater sums of rotavirus deaths and is entirely above our interval for 2016. The reduction in estimated rotavirus deaths may be due to improved awareness, better sanitation, nutrition and healthcare, and the implementation of the rotavirus vaccine.
3. Missing Data (stratification)
3a. Stratification by Female Life Expectancy
Use female life expectancy (FemaleLifeExpect) to define the five strata.
Code
#define object with FemaleLifeExpect values in decreasing order
<-rotavirus$FemaleLifeExpect[order(-rotavirus$FemaleLifeExpect)] rotavirusFLE
Code
#calculate difference between consecutive female life expectancies
<-rotavirusFLE[1:182]-rotavirusFLE[2:183]
diff
#establish cut-off between strata
#difference vector entry number
which(diff>0.60)
#actual difference in life expectancy
which(diff>0.60)] diff[
Ignoring differences 178, 179, 180, 181 and 182 as change points due to their lack of separation leaves four change points to define five strata.
Confirm this decision by visualising the ordered FemaleLifeExpect with change points indicated.
Code
#x axis is index from 1 to 183, y axis is female life expectancy.
#type="o" plots points joined with lines
plot(1:length(rotavirusFLE),rotavirusFLE,type="o",col="orange",
pch=20,ylab="Female Life Expectancy (years)", xlab="Country Index")
#add lines at proposed change points between strata
abline(v=c(31,103,142,175),col="skyblue2",lty=2)
If you open this plot in a new window (button in the top right of the plot), you should see the proposed change points line up with the biggest drops female life expectancy. This provides a justification for grouping according to these categories.
Extract the FemaleLifeExpect ages corresponding to these cut-offs, and define a new variable StrataFemaleLifeExpect that groups FemaleLifeExpect into categories.
Code
#subset entries corresponding to cut-offs
c(31,103,142,175)] rotavirusFLE[
Code
#initially define new variable as all 0s, same number of rows as data frame
$StrataFemaleLifeExpect<-rep(0,nrow(rotavirus))
rotavirus
#life expectancy greater than or equal to 82.34 in first strata (1)
$StrataFemaleLifeExpect[which(rotavirus$FemaleLifeExpect>=82.34)]<-1
rotavirus
#life expectancy greater than or equal to 74.78 and less than 82.34 in second strata (2)
$StrataFemaleLifeExpect[which(rotavirus$FemaleLifeExpect>=74.78&rotavirus$FemaleLifeExpect<82.34)]<-2
rotavirus
#life expectancy greater than or equal to 67.26 and less than 74.78 in third strata (3)
$StrataFemaleLifeExpect[which(rotavirus$FemaleLifeExpect>=67.26&rotavirus$FemaleLifeExpect<74.78)]<-3
rotavirus
#life expectancy greater than or equal to 60.81 and less than 67.26 in fourth strata (4)
$StrataFemaleLifeExpect[which(rotavirus$FemaleLifeExpect>=60.81&rotavirus$FemaleLifeExpect<67.26)]<-4
rotavirus
#life expectancy less than 60.81 in fifth strata (5)
$StrataFemaleLifeExpect[which(rotavirus$FemaleLifeExpect<60.81)]<-5 rotavirus
Code
#define object with FemaleLifeExpect values in decreasing order
<-rotavirus$FemaleLifeExpect[order(-rotavirus$FemaleLifeExpect)] rotavirusFLE
Code
#calculate difference between consecutive female life expectancies
<-rotavirusFLE[1:182]-rotavirusFLE[2:183]
diff
#establish cut-off between strata
#difference vector entry number
which(diff>0.60)
[1] 1 31 103 142 175 178 179 180 181 182
Code
#actual difference in life expectancy
which(diff>0.60)] diff[
[1] 1.29 0.63 0.69 0.64 0.74 0.74 0.78 0.62 4.28 2.45
Ignoring differences 178, 179, 180, 181 and 182 as change points due to their lack of separation leaves four change points to define five strata.
Confirm this decision by visualising the ordered FemaleLifeExpect with change points indicated.
Code
#x axis is index from 1 to 183, y axis is female life expectancy.
#type="o" plots points joined with lines
plot(1:length(rotavirusFLE),rotavirusFLE,type="o",col="orange",
pch=20,ylab="Female Life Expectancy (years)", xlab="Country Index")
#add lines at proposed change points between strata
abline(v=c(31,103,142,175),col="skyblue2",lty=2)
If you open this plot in a new window (button in the top right of the plot), you should see the proposed change points line up with the biggest drops female life expectancy. This provides a justification for grouping according to these categories.
Extract the FemaleLifeExpect ages corresponding to these cut-offs, and define a new variable StrataFemaleLifeExpect that groups FemaleLifeExpect into categories.
Code
#subset entries corresponding to cut-offs
c(31,103,142,175)] rotavirusFLE[
[1] 82.34 74.78 67.26 60.81
Code
#initially define new variable as all 0s, same number of rows as data frame
$StrataFemaleLifeExpect<-rep(0,nrow(rotavirus))
rotavirus
#life expectancy greater than or equal to 82.34 in first strata (1)
$StrataFemaleLifeExpect[which(rotavirus$FemaleLifeExpect>=82.34)]<-1
rotavirus
#life expectancy greater than or equal to 74.78 and less than 82.34 in second strata (2)
$StrataFemaleLifeExpect[which(rotavirus$FemaleLifeExpect>=74.78&
rotavirus$FemaleLifeExpect<82.34)]<-2
rotavirus
#life expectancy greater than or equal to 67.26 and less than 74.78 in third strata (3)
$StrataFemaleLifeExpect[which(rotavirus$FemaleLifeExpect>=67.26&
rotavirus$FemaleLifeExpect<74.78)]<-3
rotavirus
#life expectancy greater than or equal to 60.81 and less than 67.26 in fourth strata (4)
$StrataFemaleLifeExpect[which(rotavirus$FemaleLifeExpect>=60.81&
rotavirus$FemaleLifeExpect<67.26)]<-4
rotavirus
#life expectancy less than 60.81 in fifth strata (5)
$StrataFemaleLifeExpect[which(rotavirus$FemaleLifeExpect<60.81)]<-5 rotavirus
3b. Strata Mean for Missing Data
Calculate the proportion average of EstimatedRotavirusProp in each stratum using tapply()
.
Then calculate the sum of rotavirus deaths overall using these values for the missing data.
Use the mean strata proportion as an estimate for missing EstimatedRotaVirusProp values.
Code
tapply(rotavirus$EstimatedRotavirusProp,INDEX=rotavirus$StrataFemaleLifeExpect,FUN=mean,na.rm=T)
Create a new variable FLERotavirusProp containing existing EstimatedRotavirusProp values, and replace the missing proportions with the relevant mean proportion estimate.
Code
#new variable, pre-fill with available rotavirus proportions
$FLERotavirusProp<-rotavirus$EstimatedRotavirusProp
rotavirus
#assign missing entries of this variable that correspond to a value of 1 in the strata variable,
#equal to the mean proportion for that strata
$FLERotavirusProp[which(is.na(rotavirus$EstimatedRotavirusProp)&
rotavirus$StrataFemaleLifeExpect==1))]<-0.2144838
(rotavirus
#assign missing entries of this variable that correspond to a value of 2 in the strata variable,
#equal to the mean proportion for that strata
$FLERotavirusProp[which(is.na(rotavirus$EstimatedRotavirusProp)&
rotavirus$StrataFemaleLifeExpect==2))]<-0.2592195
(rotavirus
#assign missing entries of this variable that correspond to a value of 3 in the strata variable,
#equal to the mean proportion for that strata
$FLERotavirusProp[which(is.na(rotavirus$EstimatedRotavirusProp)&
rotavirus$StrataFemaleLifeExpect==3))]<-0.2236106
(rotavirus
#assign missing entries of this variable that correspond to a value of 4 in the strata variable,
#equal to the mean proportion for that strata
$FLERotavirusProp[which(is.na(rotavirus$EstimatedRotavirusProp)&
rotavirus$StrataFemaleLifeExpect==4))]<-0.2474089
(rotavirus
#assign missing entries of this variable that correspond to a value of 5 in the strata variable,
#equal to the mean proportion for that strata
$FLERotavirusProp[which(is.na(rotavirus$EstimatedRotavirusProp)&
rotavirus$StrataFemaleLifeExpect==5))]<-0.1972870 (rotavirus
Calculate the sum of rotavirus deaths as before, using the FLERotavirusProp values where missing data is estimated using female life expectancy strata means.
Code
sum(rotavirus$DiarrhoeaDeaths*rotavirus$FLERotavirusProp)
Use the mean strata proportion as an estimate for missing EstimatedRotaVirusProp values.
Code
tapply(rotavirus$EstimatedRotavirusProp,INDEX=rotavirus$StrataFemaleLifeExpect,FUN=mean,na.rm=T)
1 2 3 4 5
0.2144838 0.2592195 0.2236106 0.2474089 0.1972870
Create a new variable FLERotavirusProp containing existing EstimatedRotavirusProp values, and replace the missing proportions with the relevant mean proportion estimate.
Code
#new variable, pre-fill with available rotavirus proportions
$FLERotavirusProp<-rotavirus$EstimatedRotavirusProp
rotavirus
#assign missing entries of this variable that correspond to a value of 1 in the strata variable,
#equal to the mean proportion for that strata
$FLERotavirusProp[which(is.na(rotavirus$EstimatedRotavirusProp)&
rotavirus$StrataFemaleLifeExpect==1))]<-0.2144838
(rotavirus
#assign missing entries of this variable that correspond to a value of 2 in the strata variable,
#equal to the mean proportion for that strata
$FLERotavirusProp[which(is.na(rotavirus$EstimatedRotavirusProp)&
rotavirus$StrataFemaleLifeExpect==2))]<-0.2592195
(rotavirus
#assign missing entries of this variable that correspond to a value of 3 in the strata variable,
#equal to the mean proportion for that strata
$FLERotavirusProp[which(is.na(rotavirus$EstimatedRotavirusProp)&
rotavirus$StrataFemaleLifeExpect==3))]<-0.2236106
(rotavirus
#assign missing entries of this variable that correspond to a value of 4 in the strata variable,
#equal to the mean proportion for that strata
$FLERotavirusProp[which(is.na(rotavirus$EstimatedRotavirusProp)&
rotavirus$StrataFemaleLifeExpect==4))]<-0.2474089
(rotavirus
#assign missing entries of this variable that correspond to a value of 5 in the strata variable,
#equal to the mean proportion for that strata
$FLERotavirusProp[which(is.na(rotavirus$EstimatedRotavirusProp)&
rotavirus$StrataFemaleLifeExpect==5))]<-0.1972870 (rotavirus
Calculate the sum of rotavirus deaths as before, using the FLERotavirusProp values where missing data is estimated using female life expectancy strata means.
Code
sum(rotavirus$DiarrhoeaDeaths*rotavirus$FLERotavirusProp)
[1] 118973.9
Use total health expenditure as a percentage of GDP (TotHealthExpend) to define the five strata.
Compare the estimates of rotavirus deaths using FemaleLifeExpect and TotHealthExpend to fill in missing values with the estimate gained using an overall average proportion for missing values.
4. Scatter Plots, Correlations (predictors and response)
4a. Fertility, Female Life Expectancy, Health Expenditure, Infant Mortality
Investigate individually the four predictors (TotFertility, FemaleLifeExpect, TotHealthExpend and InfantMortality).
Examine scatter plots and the correlation coefficients between these predictors and the known proportions (EstimatedRotavirusDeaths) for the 105 countries.
Describe the relationship between each of these predictors and EstimatedRotavirusProp. Is this what you would expect, and if not, what could explain it?
Code
#create new data frame where rows with missing estimated rotavirus proportions are removed
<-rotavirus[!is.na(rotavirus$EstimatedRotavirusProp),]
rotavirusFull
#scatter plot
plot(rotavirusFull$FemaleLifeExpect,rotavirusFull$EstimatedRotavirusProp,xlab="Female Life Expectancy (years)",pch=15,col="purple",ylab="Estimated Diarrhoea Deaths due to Rotavirus (proportion)")
#correlation between values
cor(rotavirusFull$FemaleLifeExpect,rotavirusFull$EstimatedRotavirusProp)
Repeat these analyses for TotFertility, TotHealthExpend and InfantMortality. Make sure to remove any rows containing missing data for these variables, otherwise the corr()
function will return NA.
Code
#create new data frame where rows with missing estimated rotavirus proportions are removed
<-rotavirus[!is.na(rotavirus$EstimatedRotavirusProp),]
rotavirusFull
#scatter plot
plot(rotavirusFull$FemaleLifeExpect,rotavirusFull$EstimatedRotavirusProp,xlab="Female Life Expectancy (years)",pch=15,col="purple",ylab="Estimated Diarrhoea Deaths due to Rotavirus (proportion)")
Code
#correlation between values
cor(rotavirusFull$FemaleLifeExpect,rotavirusFull$EstimatedRotavirusProp)
[1] -0.001615984
Very slight negative relationship, as female life expectancy increases the estimated proportion of diarrhoea deaths from rotavirus decrease. This is unlikely to be significant however.
Code
plot(rotavirusFull$TotFertility,rotavirusFull$EstimatedRotavirusProp,xlab="Total Fertility Rate",pch=15,col="pink",ylab="Estimated Diarrhoea Deaths due to Rotavirus (proportion)")
Code
cor(rotavirusFull$TotFertility,rotavirusFull$EstimatedRotavirusProp)
[1] 0.03986769
Slight positive relationship, as total fertility increases the estimated proportion of diarrhoea deaths from rotavirus increase.
Code
<-rotavirusFull[!is.na(rotavirusFull$TotHealthExpend),]
rotavirusFull2
plot(rotavirusFull2$TotHealthExpend,rotavirusFull2$EstimatedRotavirusProp,xlab="Total Health Expenditure (% GDP)",pch=15,col="blue",ylab="Estimated Diarrhoea Deaths due to Rotavirus (proportion)")
Code
cor(rotavirusFull2$TotHealthExpend,rotavirusFull2$EstimatedRotavirusProp)
[1] 0.1527748
Positive relationship, as total health expenditure increases the estimated proportion of diarrhoea deaths from rotavirus increase.
Code
plot(rotavirusFull$InfantMortality,rotavirusFull$EstimatedRotavirusProp,xlab="Infant Mortality (per 1000 live births)",pch=15,col="red", ylab="Estimated Diarrhoea Deaths due to Rotavirus (proportion)")
Code
cor(rotavirusFull$InfantMortality,rotavirusFull$EstimatedRotavirusProp)
[1] -0.04307308
Slight negative relationship, as infant mortality increases the estimated proportion of diarrhoea deaths from rotavirus decrease.
4b Immunisation Rates.
Examine the relationship between rotavirus vaccination rates (ImmunisationPerc) and the estimated proportion of rotavirus deaths (EstimatedRotavirusProp).
What kind of relationship can you see? Is it what you would expect, if not, why might we be observing this relationship?
Code
#remove rows where immunisation percentage is missing
<-rotavirusFull[!is.na(rotavirusFull$ImmunisationPerc),]
rotavirusFull3
#scatter plot
plot(rotavirusFull3$ImmunisationPerc,rotavirusFull3$EstimatedRotavirusProp,xlab="Rotavirus Immunisation (% at 1 year old)",pch=15,col="green", ylab="Estimated Diarrhoea Deaths due to Rotavirus (proportion)")
#correlation between variables
cor(rotavirusFull3$ImmunisationPerc,rotavirusFull3$EstimatedRotavirusProp)
Code
#remove rows where immunisation percentage is missing
<-rotavirusFull[!is.na(rotavirusFull$ImmunisationPerc),]
rotavirusFull3
#scatter plot
plot(rotavirusFull3$ImmunisationPerc,rotavirusFull3$EstimatedRotavirusProp,xlab="Rotavirus Immunisation (% at 1 year old)",pch=15,col="green", ylab="Estimated Diarrhoea Deaths due to Rotavirus (proportion)")
Code
#correlation between variables
cor(rotavirusFull3$ImmunisationPerc,rotavirusFull3$EstimatedRotavirusProp)
[1] 0.1295355
There is a positive relationship, as immunisation percentage goes up so does the estimated proportion of diarrhoea deaths due to rotavirus.
This seems counter-intuitive as we would expect the proportion of diarrhoea deaths due to rotavirus to decrease with increasing immunisation coverage. Some reasons we do not see a negative relationship between immunisation rates and proportion of diarrhoea deaths attributable to rotavirus are:
Limited data
The greatest amount of data is available for countries which have high immunisation rates, there are very few observations where immunisation rates are less than 40% and the relationship may be different here.
The immunisation statistics are for 1 year olds in 2014. Children older than 2 years in 2016 would have been due for these immunisations earlier and the coverage was lower then.
Countries with high proportions of diarrhoea deaths due to rotavirus virus were likely prioritised in the vaccine roll-out.
5. Missing Data (linear regression)
5a. Simple Linear Regression
Predict all the proportions using FemaleLifeExpect as a predictor in a simple regression. (105 countries have a proportion).
Compare the result from this procedure with the result obtained by using strata based on this variable in Task 3.
Use the function lm()
to fit this model.
Code
#linear model
<-lm(EstimatedRotavirusProp~FemaleLifeExpect,data=rotavirus)
FLEMod
#view fitted model
summary(FLEMod)
Once the model is fitted we can calculate the predicted values
Code
#use fitted model and values in rotavirus data frame to predict estimated rotavirus proportions
<-predict(FLEMod,newdata=rotavirus)
predFLE
#total number of rotavirus deaths
sum(rotavirus$DiarrhoeaDeaths*predFLE,na.rm=T)
#residual checks
plot(FLEMod)
Code
#linear model
<-lm(EstimatedRotavirusProp~FemaleLifeExpect,data=rotavirus)
FLEMod
#view fitted model
summary(FLEMod)
Call:
lm(formula = EstimatedRotavirusProp ~ FemaleLifeExpect, data = rotavirus)
Residuals:
Min 1Q Median 3Q Max
-0.24060 -0.16119 -0.04814 0.11020 0.74942
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.440e-01 2.019e-01 1.208 0.230
FemaleLifeExpect -4.458e-05 2.718e-03 -0.016 0.987
Residual standard error: 0.2116 on 103 degrees of freedom
(78 observations deleted due to missingness)
Multiple R-squared: 2.611e-06, Adjusted R-squared: -0.009706
F-statistic: 0.000269 on 1 and 103 DF, p-value: 0.9869
Code
#residual checks
plot(FLEMod)
Code
#use fitted model and values in rotavirus data frame to predict estimated rotavirus proportions
<-predict(FLEMod,newdata=rotavirus)
predFLE
#total number of rotavirus deaths
sum(rotavirus$DiarrhoeaDeaths*predFLE,na.rm=T)
[1] 138583.7
Using female life expectancy as a continuous predictor in a linear regression results in a higher estimated sum of diarrhoea deaths attributable to rotavirus compared to using it categorically to define strata. 138,583.7 against 118,973.9.
However, there appear to be some problems with this fitted model. The residuals are not standard normally distributed - as shown by their departure from the quantiles on the QQ-plot and the increased number of large positive residuals relative to negative residuals.
5b. Multiple Linear Regression
Include the remaining predictors TotFertility, TotHealthExpend and InfantMortality in the regression, then using this model to predict the missing proportions.
How does this compare to other estimates of total rotavirus deaths in 2016?
Fit linear model.
Code
#linear model
<-lm(EstimatedRotavirusProp~FemaleLifeExpect+TotFertility+TotHealthExpend+InfantMortality,data=rotavirus)
ALLMod
#view fitted model
summary(ALLMod)
#anova
anova(ALLMod)
#residuals
plot(ALLMod)
Use results to calculated predicted values.
Code
#use fitted model and values in rotavirus data frame to predict estimated rotavirus proportions
<-predict(ALLMod,newdata=rotavirus)
predALL
#total number of rotavirus deaths
sum(rotavirus$DiarrhoeaDeaths*predALL,na.rm=T)
Code
#linear model
<-lm(EstimatedRotavirusProp~FemaleLifeExpect+TotFertility+TotHealthExpend+InfantMortality,data=rotavirus)
ALLMod
#view fitted model
summary(ALLMod)
Call:
lm(formula = EstimatedRotavirusProp ~ FemaleLifeExpect + TotFertility +
TotHealthExpend + InfantMortality, data = rotavirus)
Residuals:
Min 1Q Median 3Q Max
-0.28560 -0.15282 -0.05648 0.12420 0.72064
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.333086 0.640359 0.520 0.604
FemaleLifeExpect -0.003262 0.007529 -0.433 0.666
TotFertility 0.055618 0.038076 1.461 0.147
TotHealthExpend 0.012985 0.008281 1.568 0.120
InfantMortality -0.003872 0.002934 -1.320 0.190
Residual standard error: 0.2098 on 98 degrees of freedom
(80 observations deleted due to missingness)
Multiple R-squared: 0.057, Adjusted R-squared: 0.01851
F-statistic: 1.481 on 4 and 98 DF, p-value: 0.2138
Code
#anova
anova(ALLMod)
Analysis of Variance Table
Response: EstimatedRotavirusProp
Df Sum Sq Mean Sq F value Pr(>F)
FemaleLifeExpect 1 0.0021 0.002130 0.0484 0.82635
TotFertility 1 0.0458 0.045761 1.0393 0.31049
TotHealthExpend 1 0.1363 0.136252 3.0946 0.08167 .
InfantMortality 1 0.0767 0.076687 1.7417 0.18999
Residuals 98 4.3149 0.044029
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
#residuals
plot(ALLMod)
Code
#use fitted model and values in rotavirus data frame to predict estimated rotavirus proportions
<-predict(ALLMod,newdata=rotavirus)
predALL
#total number of rotavirus deaths
sum(rotavirus$DiarrhoeaDeaths*predALL,na.rm=T)
[1] 113606.9
This multiple linear regression model predicts the lowest sum of rotavirus deaths out of the methods we have looked at (113,606.9). However, the model has similar problems as the previous linear regression with residuals that do not follow a standard normal distribution. It also appears that many of the predictors we have included are not particularly useful (female life expectancy, total fertility and infant mortality), with the anova indicating a non-significant increase in variance if they are removed.
6. Extension: Scatter Plots, Correlation, Predict Missing Data (predictor and response)
Explore other country health indicators from the WHO website that may be worthwhile to include as predictors in a linear regression, using techniques from earlier tasks in this lesson.
References
Troeger C, Khalil IA, Rao PC, et al. Rotavirus Vaccination and the Global Burden of Rotavirus Diarrhea Among Children Younger Than 5 Years. JAMA Pediatr. 2018;172(10):958–965. doi:10.1001/jamapediatrics.2018.1960
World Health Organisation. Data Collections. https://www.who.int/data/collections