19 Dangerous Driving
Drink-driving is one of the most prevalent causes of road accidents in New Zealand. New Zealand teenagers and young adults are over-represented in motor vehicle traffic injury and death statistics. It is important to investigate what influences the development of dangerous driving behaviours that may cause these accidents. This lesson investigates the different factors that may influence drink-driving, referred to here as driving while impaired (DWI), in New Zealand youths. The research was conducted by Pauline Gulliver (Injury Prevention Research Unit, University of Otago).
Data
There are 2 files associated with this presentation. The first contains the data you will need to complete the lesson tasks, and the second contains descriptions of the variables included in the data file.
Objectives
Tasks
0. Read data
0a. Read in the data
First make sure you have installed the package readxl and set the working directory.
Load the data into R.
The code has been hidden initially, so you can try to load the data yourself first before checking the solutions.
Code
#loads readxl package
library(readxl)
Warning: package 'readxl' was built under R version 4.2.2
Code
#loads the data file and names it accidents
accidents<-read_xls("YouthAccidentData.xls")
#view beginning of data frame
head(accidents)
# A tibble: 6 × 11
sex per_sa…¹ crash18 crash15 drunkad drunk…² drunk…³ drtee18 aggre…⁴ aggre…⁵
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 -2.5 1 0 1 NA 0 0 5 5
2 1 -2.5 0 0 0 NA 1 0 5 5
3 1 -3 0 1 0 0 1 1 5 5
4 1 -3 1 NA NA 0 NA 1 5 5
5 1 -3 0 0 0 NA 1 0 5 5
6 1 0 0 0 1 NA 0 0 5 5
# … with 1 more variable: drnk_dr21 <dbl>, and abbreviated variable names
# ¹per_safe21, ²drunkad18, ³drunktee, ⁴aggress15, ⁵aggress18
# ℹ Use `colnames()` to see all variable names
This opens a data set from a study investigating the effect of alcohol consumption on youth motor accidents, focusing on drivers aged 15 to 21 years.
The variables recorded are:
sex (1=Female, 2=Male)
per_safe21 (the difference between the number of standard drinks the participant perceived to be 'safe' to consume before driving and the number of standard drinks the participant could legally consume before driving*)
crash18 (whether the respondent, at age 18, had been in any traffic accidents in the last 3 years: 0=No and 1=Yes)
crash15 (whether the respondent, at age 15, had been in any traffic accidents in the last 2 years: 0=No and 1=Yes)
drunkad (whether the respondent, at age 15, had been a passenger in a car with an adult driver who was considered to be over the legal limit*: 0=No and 1=Yes)
drunkad18 (the same as drunkad but at age 18)
drunktee (whether the respondent, at age 15, had been a passenger in a car with a youth/teenage driver who was considered to be over the legal limit*)
drtee18 (the same as drunktee but at age 18)
aggress15 (frequency of aggressive behaviour at age 15)
aggress18 (frequency of aggressive behaviour at age 18)
drnk_dr21 (whether the respondent had driven after drinking too much between the ages of 18 and 21: 0=No and 1=Yes)
*The legal limit is referred to as 5 or more glasses of beer or wine; this has been lowered since the study was conducted.
Note that constants have been added to some variables to ensure anonymity for respondents. Additionally, due to confidentiality issues, the original data cannot be used. The dataset for this lesson is simulated data that produces the same results as the original.
1. Subsetting Data Frame, \(\chi^2\) Tests
Carry out chi-square tests for males to investigate the relationship between driving while impaired (DWI, variable name drnk_dr21) behaviour and various factors.
1a. Subsetting
Select the male respondents in the data set. We will keep this restriction for Tasks 1 to 3 (Task 4 uses both sexes, with sex as a predictor), so create a new data frame containing only males (sex=2). This is also a good opportunity to remove rows with NA values in the crash18 and drnk_dr21 variables.
Code
#subsetting the rows of the original data frame where sex=2 (males), removing rows with NAs
accidentsM<-accidents[accidents$sex==2&!is.na(accidents$crash18)&!is.na(accidents$drnk_dr21),]
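The row filter used here can be tried on a small made-up data frame before running it on the lesson data. The sketch below is only an illustration: the data frame `toy` and its values are invented, and `subset()` with `complete.cases()` is shown as one possible equivalent, not as the lesson's own method.

```r
# Toy data frame with the same column names as the lesson data (values are made up)
toy <- data.frame(
  sex       = c(1, 2, 2, 2, 2),
  crash18   = c(0, 1, NA, 0, 1),
  drnk_dr21 = c(0, 0, 1, NA, 1)
)

# Same logic as the lesson code: males only, no NAs in crash18 or drnk_dr21
toyM <- toy[toy$sex == 2 & !is.na(toy$crash18) & !is.na(toy$drnk_dr21), ]

# One possible equivalent using subset() and complete.cases()
toyM2 <- subset(toy, sex == 2 & complete.cases(crash18, drnk_dr21))

nrow(toyM)   # only the male rows with both values present remain
```

Checking `nrow()` and `colSums(is.na(...))` after subsetting is a quick way to confirm that the filter removed what was intended.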
1b. \(\chi^2\) Tests
Construct a table of counts and carry out the chi-square test for DWI (driving while impaired, drnk_dr21) with having crashed in the last 3 years at age 18 (crash18) using the male data.
Report your conclusion from the chi-square value and its associated p-value.
Code
table(accidentsM$crash18,accidentsM$drnk_dr21,dnn=c("Crash18","DWI"))
       DWI
Crash18   0   1
      0 391  98
      1 110  13
Code
chisq.test(accidentsM$crash18,accidentsM$drnk_dr21)
Pearson's Chi-squared test with Yates' continuity correction
data: accidentsM$crash18 and accidentsM$drnk_dr21
X-squared = 5.3176, df = 1, p-value = 0.02111
The chi-squared p-value is 0.0211, so there is a significant relationship between driving while impaired and experiencing a traffic accident between the ages of 15 and 18.
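The test can also be reproduced directly from the table of counts, which makes it easy to inspect the expected counts under independence. This is a minimal sketch that re-enters the four counts reported above by hand; the object names `tab` and `res` are illustrative.

```r
# Counts copied from the table above (rows: crash18 = 0/1, columns: DWI = 0/1)
tab <- matrix(c(391, 98,
                110, 13),
              nrow = 2, byrow = TRUE,
              dimnames = list(Crash18 = c("0", "1"), DWI = c("0", "1")))

# Yates' continuity correction is applied by default for 2x2 tables
res <- chisq.test(tab)

res$expected         # expected counts if crash18 and DWI were independent
prop.table(tab, 1)   # row proportions: share of DWI within each crash18 group
res$statistic        # should agree with the X-squared value reported above
res$p.value
```

Comparing the observed counts with `res$expected` shows which cells drive the chi-square statistic, which helps when describing the direction of the relationship.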
Repeat the chi-squared test for DWI (drnk_dr21) with having travelled with an impaired adult driver at age 15 (drunkad), and for DWI (drnk_dr21) with having travelled with an impaired youth driver at age 18 (drtee18).
Report your conclusions and discuss any results of interest.
2. Confidence Interval, Hypothesis Test (difference in proportions)
Carry out a test for difference in proportions of males involved in a crash from ages 15-18 between those who had driven while impaired from ages 18-21 and those who had not.
Establish null and alternative hypotheses and interpret your result in the context of these. Do your conclusions from the confidence interval for the difference between the two proportions also line up with this?
Code
#first argument is the number of successes (in this case having a crash at age 18)
#second argument is number of trials
prop.test(c(length(which(accidentsM$crash18=="1"&accidentsM$drnk_dr21=="0")),
length(which(accidentsM$crash18=="1"&accidentsM$drnk_dr21=="1"))),
n=c(length(which(accidentsM$drnk_dr21=="0")),length(which(accidentsM$drnk_dr21=="1"))))
2-sample test for equality of proportions with continuity correction
data: c(length(which(accidentsM$crash18 == "1" & accidentsM$drnk_dr21 == "0")), length(which(accidentsM$crash18 == "1" & accidentsM$drnk_dr21 == "1"))) out of c(length(which(accidentsM$drnk_dr21 == "0")), length(which(accidentsM$drnk_dr21 == "1")))
X-squared = 5.3176, df = 1, p-value = 0.02111
alternative hypothesis: two.sided
95 percent confidence interval:
0.02699603 0.17789149
sample estimates:
prop 1 prop 2
0.2195609 0.1171171
We have the null hypothesis that there is no difference in the proportion of males involved in a crash from ages 15-18 between those who had driven while impaired from ages 18-21 and those who had not, and the alternative hypothesis that this difference is not equal to 0.
The p-value is 0.0211, which provides significant evidence at the 5% level to reject the null hypothesis in favour of the alternative. We conclude that the proportion of males who experienced a crash between ages 15 and 18 differs between those who later drove while impaired and those who did not.
With 95% confidence, the true proportion of males who had not driven while impaired that experienced a crash between ages 15 and 18 is estimated to be 0.0270 to 0.1779 higher than the true proportion of males who had driven while impaired that experienced a crash. This interval does not include 0, so it matches the hypothesis test conclusion of a significant difference.
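The same test can be run from the counts alone, which may be easier to read than the nested `length(which(...))` calls. The sketch below assumes the counts from the Task 1b table (110 of 501 males without DWI and 13 of 111 males with DWI had crashed); the object names are illustrative.

```r
crashes <- c(110, 13)    # crashed by age 18: DWI = 0 group, DWI = 1 group
n       <- c(501, 111)   # group sizes: 391 + 110 and 98 + 13

pt <- prop.test(crashes, n)   # continuity correction is on by default

pt$estimate   # sample proportions in each group
pt$conf.int   # 95% confidence interval for prop 1 - prop 2
pt$p.value    # same p-value as the chi-square test on the 2x2 table
```

For a 2x2 table, the two-sided test of equal proportions and the chi-square test of independence are the same test, which is why both report the same X-squared statistic and p-value.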
3. Confidence Interval, Hypothesis Test (difference in means)
Perform t-tests on the aggression variable for age 15 (aggress15) using the DWI (drnk_dr21) categories as the groups.
Report your findings from the t-test, with reference to the p-value and confidence interval.
Code
#first test if variances are equal
var.test(aggress15 ~ drnk_dr21, data=accidentsM, alternative = "two.sided")
F test to compare two variances
data: aggress15 by drnk_dr21
F = 0.24213, num df = 485, denom df = 110, p-value < 2.2e-16
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.1780009 0.3204206
sample estimates:
ratio of variances
0.2421293
Code
#significant evidence against null hypothesis that variances are equal, use var.equal=F in t test
t.test(accidentsM$aggress15[accidentsM$drnk_dr21=="0"],
accidentsM$aggress15[accidentsM$drnk_dr21=="1"],var.equal = F)
Welch Two Sample t-test
data: accidentsM$aggress15[accidentsM$drnk_dr21 == "0"] and accidentsM$aggress15[accidentsM$drnk_dr21 == "1"]
t = -1.5654, df = 122.42, p-value = 0.1201
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.61442934 0.07177557
sample estimates:
mean of x mean of y
5.368313 5.639640
The null hypothesis is that there is no difference in the mean frequency of aggressive behaviour at age 15 between those who had driven while impaired from ages 18-21 and those who had not, and the alternative hypothesis is that this difference is not equal to 0.
The p-value of the t-test is 0.1201, which provides no significant evidence to reject the null hypothesis in favour of the alternative.
The true mean frequency of aggressive behaviour at age 15 in males who had not driven while impaired is estimated with 95% confidence to be between 0.6144 lower and 0.0718 higher than the true mean frequency of aggressive behaviour of males who had driven while impaired. This confidence interval includes 0 so matches the hypothesis test conclusion of no significant difference.
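The two-step workflow (check the variances, then choose the form of the t-test) can be practised on simulated data. The sketch below does not use the lesson data: the data frame `sim`, its group sizes, means and standard deviations are all invented for illustration, and the formula interface is used in place of the two-vector call.

```r
set.seed(1)  # arbitrary seed so the simulated example is reproducible

# Simulated scores (NOT the lesson data): two groups with unequal spread
sim <- data.frame(
  score = c(rnorm(200, mean = 5.4, sd = 1), rnorm(50, mean = 5.6, sd = 2)),
  group = factor(rep(c("0", "1"), times = c(200, 50)))
)

# Step 1: F test of the null hypothesis that the two variances are equal
vt <- var.test(score ~ group, data = sim)

# Step 2: Welch t-test, which does not assume equal variances
wt <- t.test(score ~ group, data = sim, var.equal = FALSE)

vt$p.value
wt$parameter  # Welch degrees of freedom (generally not a whole number)
wt$conf.int   # 95% CI for mean(group 0) - mean(group 1)
```

The Welch degrees of freedom are smaller than the pooled value of n1 + n2 - 2 when the variances differ, which is why the lesson output reports df = 122.42 and not a whole number.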
Repeat the hypothesis test using the drnk_dr21 categories with the aggression variable for age 18 (aggress18), and with the perceived safe number of standard drinks (per_safe21).
Report your findings as above.
4. Logistic Regression, ANOVA
Perform a logistic regression with DWI (drnk_dr21: 0=No, 1=Yes) as the response. Use sex, perceived safe number of drinks (per_safe21) and crash involvement at age 15 (crash15) as the predictors.
Write down the model for the regression of driving while impaired on sex, perceived safe number of drinks to consume (per_safe21) and crash involvement at age 15 (crash15).
State any conclusions you have based on the model estimates and associated chi-squared p-values.
Code
DSCmodel<-glm(drnk_dr21~sex+crash15+per_safe21,data=accidents,family=binomial(link="logit"))
summary(DSCmodel)
Call:
glm(formula = drnk_dr21 ~ sex + crash15 + per_safe21, family = binomial(link = "logit"),
data = accidents)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.2031 -0.5670 -0.4615 -0.3854 2.3343
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.923201 0.354665 -8.242 < 2e-16 ***
sex 0.776756 0.200155 3.881 0.000104 ***
crash15 -0.007918 0.268532 -0.029 0.976476
per_safe21 0.125563 0.022234 5.647 1.63e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 864.51 on 1048 degrees of freedom
Residual deviance: 805.86 on 1045 degrees of freedom
(788 observations deleted due to missingness)
AIC: 813.86
Number of Fisher Scoring iterations: 5
Code
anova(DSCmodel,test="Chisq")
Analysis of Deviance Table
Model: binomial, link: logit
Response: drnk_dr21
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 1048 864.51
sex 1 23.057 1047 841.45 1.573e-06 ***
crash15 1 0.005 1046 841.45 0.9441
per_safe21 1 35.586 1045 805.86 2.441e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The regression equation for the fitted model may be written as follows
\[ \log \frac{p}{1-p} = -2.9232 + 0.7768 \times \mathrm{SEX} - 0.0079 \times \mathrm{CRASH15} + 0.1256 \times \mathrm{PERSAFE21} \] where \(p\) is the probability of having driven while impaired between ages 18 and 21.
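To see what the fitted equation implies on the probability scale, the log odds can be converted back with the inverse logit. This is a rough sketch using the rounded coefficients above; the helper `p_dwi` is hypothetical (not part of the lesson code), and the predictor values chosen are arbitrary, particularly per_safe21 = 0, since constants were added to some variables in this data set.

```r
# Rounded coefficients from the fitted model above
b <- c(intercept = -2.9232, sex = 0.7768, crash15 = -0.0079, per_safe21 = 0.1256)

# Hypothetical helper (not part of the lesson code): predicted probability of DWI
p_dwi <- function(sex, crash15, per_safe21) {
  eta <- b[["intercept"]] + b[["sex"]] * sex +
    b[["crash15"]] * crash15 + b[["per_safe21"]] * per_safe21
  plogis(eta)   # inverse logit: exp(eta) / (1 + exp(eta))
}

# Illustrative comparison: sex = 1 (female) vs sex = 2 (male), other predictors at 0
p_female <- p_dwi(sex = 1, crash15 = 0, per_safe21 = 0)
p_male   <- p_dwi(sex = 2, crash15 = 0, per_safe21 = 0)
c(female = p_female, male = p_male)
```

With the fitted model object available, `predict(DSCmodel, newdata = ..., type = "response")` should give the same kind of result without retyping the coefficients.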
Because sex is coded 1=Female and 2=Male, the odds of driving while impaired for males are \(\exp(0.776756) = 2.1744\) times the odds for females, holding the other predictors fixed. With a p-value of <0.001 this is a significant effect. The difference between the perceived safe and legal number of drinks before driving (per_safe21) is also significantly related to driving while impaired: for each one-unit increase in this difference, the odds of driving while impaired increase by a multiplicative factor of \(\exp(0.125563) = 1.1338\). There is no significant association between the indicator for experiencing a crash between ages 13 and 15 (crash15) and the log odds of driving while impaired (p-value 0.976), so the model should be refitted with this predictor removed.
The analysis of deviance p-values for the sex and per_safe21 variables are much less than 0.05, indicating a strongly significant increase in deviance if these were removed from the model. However, the crash15 variable has a non-significant p-value of 0.9441, so it should be removed from the model in the interests of parsimony (fitting the best model with as few parameters as possible).
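The odds ratios quoted above, together with approximate 95% Wald confidence intervals, can be computed from the coefficient table. The sketch below copies the estimates and standard errors from the summary output by hand; the Wald interval (estimate plus or minus 1.96 standard errors on the log-odds scale) is an assumption of this sketch and is not reported in the lesson output.

```r
# Estimates and standard errors copied from the summary() output above
est <- c(sex = 0.776756, crash15 = -0.007918, per_safe21 = 0.125563)
se  <- c(sex = 0.200155, crash15 = 0.268532, per_safe21 = 0.022234)

z <- qnorm(0.975)   # approximately 1.96

# Exponentiate the log-odds scale quantities to get odds ratios and Wald limits
or_table <- cbind(OR    = exp(est),
                  lower = exp(est - z * se),
                  upper = exp(est + z * se))
round(or_table, 4)
```

An interval that contains 1 corresponds to a non-significant predictor, as is the case for crash15 here. With the model object available, `exp(cbind(coef(DSCmodel), confint.default(DSCmodel)))` is one way to obtain the same Wald intervals without retyping the numbers.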