One of the main benefits of the UK Government's Open Data initiative is the ability to perform exploratory data analysis using a wide range of datasets at no cost. As a researcher, this allows you to explore data and methods before developing complex models.
This document details how you can use open source software and Open Data to explore exposure to road traffic accidents.
To perform this exploratory data analysis we will be using R (https://www.r-project.org/) and RStudio (https://www.rstudio.com/products/rstudio/#Desktop). R and RStudio form a platform-independent suite of software that allows us to build geospatial models, perform statistical analysis and produce reproducible research in one environment. You will need to download and install both on your machine to continue with this tutorial.
Once you have installed R and RStudio, we need to acquire the data that will allow us to explore the aims above. Download the r-project zip file from here.
In the scripts folder you will find a script (DataAquisition.R) that downloads and pre-processes the data, along with an R Markdown PDF document detailing the data sources. Run DataAquisition.R to set up the directories and to download and extract the data.
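From the R console, assuming your working directory is the root of the unzipped project (an assumption about the layout rather than something stated here), that is:
# run the acquisition script to create directories and download the data
source("./scripts/DataAquisition.R")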
The first step is to load the data into R for analysis. Create a new script (Ctrl+Shift+N on Windows, Command+Shift+N on Mac) and use the following code to load your data into R.
pkgs = c("ggplot2", "plyr", "dplyr", "GISTools", "rgdal", "rgeos", "reshape2",
"DT")
# check installed
new.packages <- pkgs[!(pkgs %in% installed.packages()[, "Package"])]
if (length(new.packages)) install.packages(new.packages)
# load libraries
lapply(pkgs, library, character.only = T)
## Load Stats 19 data
# load stats 19 accident into a data frame
Stats19.accidents <- read.csv(file = "./Raw Data/Stats 19/unzipped/Accidents0514.csv")
# load stats 19 casualty data into a dataframe
Stats19.casualties <- read.csv(file = "./Raw Data/Stats 19/unzipped/Casualties0514.csv")
## Load census data
census.age <- read.csv(file = "./Raw Data/Census/unzipped/age.csv")
census.sex <- read.csv(file = "./Raw Data/Census/unzipped/sex.csv")
## Load spatial data
# define projection
bng = "+proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +datum=OSGB36 +units=m +no_defs"
# load spatial data - may take some time depending on your laptop
welsh.lsoa <- readOGR(dsn = "./Raw Data/Spatial Data/LSOA 2011/unzipped/wales_lsoa_2011_gen.shp",
layer = "wales_lsoa_2011_gen", p4s = bng)
# load road summary
welsh.roads <- read.csv(file = "./Raw Data/Spatial Data/OS Open Roads/road_summary.csv")
# cast rows to columns so we have a col for each road type
welsh.roads <- dcast(welsh.roads, LSOA11CD ~ class)
# create total column
welsh.roads <- welsh.roads %>% mutate(Total_km = rowSums(.[, 2:6], na.rm = T))
# replace NAs with 0
welsh.roads[is.na(welsh.roads)] <- 0
You can use head(dataframename) or View(dataframename) to view and explore the data in RStudio.
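For example (str() is an addition here, but a standard base R function):
# first six rows of the accident data
head(Stats19.accidents)
# column names and types for the casualty data
str(Stats19.casualties)
# interactive spreadsheet-style view in RStudio
View(Stats19.accidents)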
The next step is to discard unwanted columns and merge the Stats 19 dataframes to make the data more manageable with the aim of creating counts by small area geographies. As you can see there is a lot of information contained in Stats 19 - you may want to consider how you could use some of the extra information in due course.
# remove unwanted accident data - Windows (read.csv keeps the file's byte
# order mark, so the first column name gains an 'ï..' prefix)
cols.to.keep = c("ï..Accident_Index", "Location_Easting_OSGR", "Location_Northing_OSGR",
    "Accident_Severity", "Date", "Day_of_Week", "Time", "X1st_Road_Class", "LSOA_of_Accident_Location")
# Mac Linux
cols.to.keep = c("Accident_Index", "Location_Easting_OSGR", "Location_Northing_OSGR",
"Accident_Severity", "Date", "Day_of_Week", "Time", "X1st_Road_Class", "LSOA_of_Accident_Location")
Stats19.accidents.filtered <- Stats19.accidents[, cols.to.keep]
# Windows
cols.to.keep = c("ĂŻ..Accident_Index", "Age_of_Casualty", "Sex_of_Casualty")
# Mac Linux
cols.to.keep = c("Accident_Index", "Age_of_Casualty", "Sex_of_Casualty")
Stats19.casualties.filtered <- Stats19.casualties[, cols.to.keep]
# join data frames on the shared accident index column
Stats19.final <- dplyr::left_join(Stats19.accidents.filtered, Stats19.casualties.filtered)
# filter out non Welsh LSOA codes by matching on the W of the LSOA
Stats19.final <- Stats19.final[substr(Stats19.final$LSOA_of_Accident_Location,
1, 1) == "W", ]
# quick plot to check the accidents are where we hope - i.e. Wales
qplot(data = Stats19.final, x = Location_Easting_OSGR, y = Location_Northing_OSGR,
    xlab = "Eastings", ylab = "Northings") + coord_equal()
We are going to do the same to the census data so that we have a more manageable dataset - in this case we will group counts of people into males, females, children and adults.
# Merge Census data
census.age.sex <- left_join(census.sex[, c("mnemonic", "Males", "Females")],
census.age, by = "mnemonic")
# Create age groups by summing columns
census.final <- census.age.sex %>% mutate(children = rowSums(.[, 6:11]), adults = rowSums(.[,
12:21]))
# drop unwanted cols
census.final <- census.final[, c("mnemonic", "All.usual.residents", "Males",
"Females", "children", "adults")]
The aim of this document is to demonstrate how we can use Open Data to model exposure to road traffic accidents using small area geographies. Specifically, we want to explore four different models of exposure and map the outputs to visualise the differences. These models can be broadly split into two categories: basic (crude) rates and adjusted rates.
A principal method in epidemiology is to describe distributions of disease or injury and identify populations at risk. One of the main methods you can use to explore those populations at risk is to calculate a ratio - for example:
\[ \text{Injury Rate} = \frac{\text{number of injuries}}{\text{exposed population}} \]
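For example, with hypothetical figures of 12 injuries in an area with 1,500 exposed residents:
\[ \text{Injury Rate} = \frac{12}{1500} = 0.008 \]
that is, 8 injuries per 1,000 residents.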
In this example we will be using road accident data published by the UK Government (Stats 19), population data in the form of census data and road network data published by the Ordnance Survey to explore the spatial distribution of the risk of being involved in a road traffic accident.
The first step in creating a basic rate is to create counts of accidents by a small area geography - in this case a Lower Layer Super Output Area (LSOA). As there is an LSOA code attached to the Stats 19 data we can just create a summary count by grouping on the LSOA code as follows:
# create a count of accidents by LSOA
stats19.lsoa <- Stats19.final %>% group_by(LSOA_of_Accident_Location) %>% dplyr::summarise(n_accidents = n())
In this example I have grouped the data simply using the LSOA of the accident; you may want to consider how you could use some of the other fields in Stats 19 to create a more refined accident rate, for example accident severity, age of casualty or weather conditions.
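As a sketch of one such refinement, the grouping below adds Accident_Severity (one of the columns we kept earlier; in Stats 19, 1 = fatal, 2 = serious, 3 = slight) to the grouping variables. The object name stats19.lsoa.severity is just an illustrative choice, and the rest of the tutorial continues with the simple count above.
# count accidents by LSOA and accident severity
stats19.lsoa.severity <- Stats19.final %>% group_by(LSOA_of_Accident_Location,
    Accident_Severity) %>% dplyr::summarise(n_accidents = n())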
We can now join the datasets together to create a single table from which we can calculate injury rates:
# rename LSOA id cols so we can join
stats19.lsoa <- plyr::rename(stats19.lsoa, c(LSOA_of_Accident_Location = "id"))
census.final <- plyr::rename(census.final, c(mnemonic = "id"))
welsh.roads <- plyr::rename(welsh.roads, c(LSOA11CD = "id"))
# join census to accidents
injury.rates <- left_join(stats19.lsoa, census.final)
# join road data to previous table
injury.rates <- left_join(injury.rates, welsh.roads)
This results in a final table of data as follows:
# cut down for html display
datatable(head(injury.rates[, 1:8]), rownames = F)
We can map this using some simple code in R to show the distribution of accidents:
# make the shapefile ggplot2 friendly
welsh.lsoa.fortify <- fortify(welsh.lsoa, region = "code")
# join the plot data to the injury.rates data
plot.data <- left_join(welsh.lsoa.fortify, injury.rates)
# map it!
ggplot() + geom_polygon(data = plot.data, aes(x = long, y = lat, group = group,
fill = n_accidents), colour = "grey", size = 0) + coord_equal()
As you can see from the map, some areas have higher numbers of accidents than others, but it is hard to tell from the map how many accidents there are in the more urban South Wales areas. So let's change our study area to Swansea:
# load swansea area
swansea.lsoa <- readOGR(dsn = "./Raw Data/Spatial Data/LSOA 2011/unzipped/swansea.lsoa.2011.shp",
layer = "swansea.lsoa.2011", p4s = bng)
## OGR data source with driver: ESRI Shapefile
## Source: "./Raw Data/Spatial Data/LSOA 2011/unzipped/swansea.lsoa.2011.shp", layer: "swansea.lsoa.2011"
## with 148 features
## It has 5 fields
swansea.lsoa@data <- plyr::rename(swansea.lsoa@data, c(LSOA11CD = "code"))
# fortify
swansea.lsoa.fortify <- fortify(swansea.lsoa, region = "code")
# join injury rates to new geometry
plot.data <- left_join(swansea.lsoa.fortify, injury.rates)
# Map it!
ggplot() + geom_polygon(data = plot.data, aes(x = long, y = lat, group = group,
fill = n_accidents), colour = "grey", size = 0) + coord_equal() + xlab("Eastings") +
ylab("Northings")
As you can see, this map tells us where the accidents have occurred but doesn't tell us anything about the population. From the data table we have created, we can estimate the rate of accidents by population to see whether some areas of Swansea are at higher risk than others. First we will create a basic exposure rate with population as the denominator. To accomplish this we simply create a ratio for each LSOA by dividing the number of accidents by the population exposed.
# create rates
plot.data$basic.rate.pop <- plot.data$n_accidents/plot.data$All.usual.residents
# map rates to show spatial distribution
ggplot() + geom_polygon(data = plot.data, aes(x = long, y = lat, group = group,
fill = basic.rate.pop), colour = "grey", size = 0) + coord_equal() + xlab("Eastings") +
ylab("Northings")
As you can see from the map, not much appears to have changed; this is partly due to the way we have displayed the data. If we bin the data into quintiles we can see any changes in the spatial distribution much more clearly.
# quintile for accidents
plot.data$accident.q <- cut(plot.data$n_accidents, 5, labels = c("1", "2", "3",
"4", "5"))
# quintile for basic pop rate
plot.data$BR.quintile <- cut(plot.data$basic.rate.pop, 5, labels = c("1", "2",
"3", "4", "5"))
# map it! accident
ggplot() + geom_polygon(data = plot.data, aes(x = long, y = lat, group = group,
fill = accident.q), colour = "grey", size = 0) + scale_fill_brewer(type = "seq",
palette = "Blues") + coord_equal() + xlab("Eastings") + ylab("Northings")
# basic rates
ggplot() + geom_polygon(data = plot.data, aes(x = long, y = lat, group = group,
fill = BR.quintile), colour = "grey", size = 0) + scale_fill_brewer(type = "seq",
palette = "Blues") + coord_equal() + xlab("Eastings") + ylab("Northings")
This is a very “crude” measure of injury risk. A simple way of refining it would be to calculate a rate per km of road within an LSOA. Using the Open Roads data we can estimate a rate per kilometre of road, which may give us an indication of whether a road in the area is particularly dangerous.
# create road rates
plot.data$road.rate = plot.data$n_accidents/plot.data$Total_km
# plot standard rate
ggplot() + geom_polygon(data = plot.data, aes(x = long, y = lat, group = group,
fill = road.rate), colour = "grey", size = 0) + coord_equal() + xlab("Eastings") +
ylab("Northings")
# create quintiles
plot.data$road.rate.q <- cut(plot.data$road.rate, 5, labels = c("1", "2", "3",
"4", "5"))
# quintile plot
ggplot() + geom_polygon(data = plot.data, aes(x = long, y = lat, group = group,
fill = road.rate.q), colour = "grey", size = 0) + scale_fill_brewer(type = "seq",
palette = "Blues") + coord_equal() + xlab("Eastings") + ylab("Northings")
The previous section demonstrated how we can create basic rates of accidents per head of population or per km of road. These give an initial overview of the spatial distribution of accidents; however, they don't really tell us anything about the accidents, people or roads involved. We can create a new adjusted risk model where we more precisely define the numerators and denominators:
\[ AR_i = \sum_r{\frac{\text{injuries}_{ir}}{\text{total}_r \times \text{basic rate}}} \]
Where \(AR\) = adjusted rate, \(i\) = geographical unit, \(r\) = adjustment variable (e.g. road type, age, sex)
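To make the formula concrete, take hypothetical numbers: 4 injuries on A roads in an LSOA with 3 km of A road and a basic rate of 0.5 accidents per km. The A road term is then
\[ \frac{4}{3 \times 0.5} \approx 2.67 \]
so the adjusted rate behaves like an observed/expected ratio: values above 1 indicate more accidents on that road type than the basic rate would predict for that length of road.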
Using this formula we can substitute in a variety of variables depending on what we want to explore. For example to look at road classification we could adjust our rates by road class. The first step is to perform some basic data exploration to check that the data is of good quality:
# bar plot of total road length by road class to look at the distribution of
# road type lengths
road.lengths <- melt(welsh.roads[, 1:6])
hist.data <- tapply(road.lengths$value, road.lengths$variable, FUN = sum)
barplot(hist.data)
As we can see from the distribution of road lengths by road class, the data is not of very good quality for road classification: the vast majority of road length is “Not Classified” or “Unclassified”. We should consider using another source of data such as OpenStreetMap or the Ordnance Survey Integrated Transport Network (ITN). In the data folder there is a second road summary file containing road classifications from the ITN summarised by LSOA.
# Load data
welsh.roads.itn <- read.csv(file = "./Raw Data/Spatial Data/OS Open Roads/road_summary_itn.csv")
# bar plot of total length by road type
hist.data <- tapply(welsh.roads.itn$total_length, welsh.roads.itn$descriptiveterm,
FUN = sum)
barplot(hist.data)
We can see that the ITN data is spread more evenly across the classes and is probably a better representation of the road types in an area. The next step is to clean the data to make it more usable, for example by aggregating all the minor road types into one group.
# cross tab so road type in cols
welsh.roads.itn <- dcast(welsh.roads.itn, LSOA11CD ~ descriptiveterm)
welsh.roads.itn[is.na(welsh.roads.itn)] <- 0
# aggregate minor roads into C roads
welsh.roads.itn <- welsh.roads.itn %>% mutate(`C Road` = rowSums(.[, c("Alley",
"Local Street", "Minor Road", "Private Road - Publicly Accessible")]))
# subset to remove unwanted columns
welsh.roads.itn <- welsh.roads.itn[, c("LSOA11CD", "A Road", "B Road", "C Road",
"Motorway")]
# rename LSOA field to join to the Stats 19 data
welsh.roads.itn <- plyr::rename(welsh.roads.itn, c(LSOA11CD = "id"))
Stats 19 has six road classifications which detail where the accident took place:
Code | Label |
---|---|
1 | Motorway |
2 | A(M) |
3 | A |
4 | B |
5 | C |
6 | Unclassified |
These broadly map onto our road types categories extracted from the ITN data - we can aggregate them so we have matching groups.
# regroup Stats 19 road classes so that they match the ITN groups: first
# create counts by road type per LSOA
stats19.lsoa.road <- Stats19.final %>% group_by(LSOA_of_Accident_Location, X1st_Road_Class) %>%
dplyr::summarise(n_accidents = n())
# crosstab so road type is cols
stats19.lsoa.road <- stats19.lsoa.road %>% dcast(., LSOA_of_Accident_Location ~
X1st_Road_Class)
# fill NAs with 0
stats19.lsoa.road[is.na(stats19.lsoa.road)] <- 0
# group road types together using the table above: motorway = codes 1 and 2
# (Motorway and A(M)), C roads = codes 5 and 6 (C and Unclassified)
stats19.lsoa.road <- stats19.lsoa.road %>% mutate(n_motorway = rowSums(.[, c("1",
    "2")]), n_A_Roads = .[, "3"], n_B_Roads = .[, "4"], n_C_Roads = rowSums(.[,
    c("5", "6")]))
# remove unwanted columns
stats19.lsoa.road <- stats19.lsoa.road[, c("LSOA_of_Accident_Location", "n_A_Roads",
"n_B_Roads", "n_C_Roads")]
# rename LSOA col
stats19.lsoa.road <- plyr::rename(stats19.lsoa.road, c(LSOA_of_Accident_Location = "id"))
datatable(head(stats19.lsoa.road))
Now we have created the counts by road type we can create rates by road classification as an adjusted rate.
# join tables
road.class.rates <- left_join(welsh.roads.itn, stats19.lsoa.road)
# basic rate for the adjusted-rate calculation: total accidents per km of
# road (columns selected by name so the calculation does not depend on
# column positions)
road.class.rates$BasicRate <- rowSums(road.class.rates[, c("n_A_Roads", "n_B_Roads",
    "n_C_Roads")], na.rm = TRUE)/rowSums(road.class.rates[, c("A Road", "B Road",
    "C Road", "Motorway")], na.rm = TRUE)
# create rates
road.class.rates$A_Road_Rate <- road.class.rates$n_A_Roads/(road.class.rates$`A Road` *
road.class.rates$BasicRate)
road.class.rates$B_Road_Rate <- road.class.rates$n_B_Roads/(road.class.rates$`B Road` *
road.class.rates$BasicRate)
road.class.rates$C_Road_Rate <- road.class.rates$n_C_Roads/(road.class.rates$`C Road` *
road.class.rates$BasicRate)
# rates are infinite where the road length is zero - change Inf to NA
is.na(road.class.rates) <- sapply(road.class.rates, is.infinite)
# zero out NA
road.class.rates[is.na(road.class.rates)] <- 0
# Map it!
plot.data.ar <- left_join(swansea.lsoa.fortify, road.class.rates)
ggplot() + geom_polygon(data = plot.data.ar, aes(x = long, y = lat, group = group,
fill = A_Road_Rate), colour = "grey", size = 0) + coord_equal() + xlab("Eastings") +
ylab("Northings")
# Quintile
plot.data.ar$A_Road_Rate_Q <- cut(plot.data.ar$A_Road_Rate, 5, labels = c("1",
"2", "3", "4", "5"))
ggplot() + geom_polygon(data = plot.data.ar, aes(x = long, y = lat, group = group,
fill = A_Road_Rate_Q), colour = "grey", size = 0) + scale_fill_brewer(type = "seq",
palette = "Blues") + coord_equal() + xlab("Eastings") + ylab("Northings")
Try mapping the other rates and see how the maps change - this will give us an idea of the spatial distribution of road traffic accidents adjusted for road type.
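For example, the B road rate can be mapped with exactly the same pattern as the A road rate above - only the fill variable changes:
# quintiles for the B road adjusted rate
plot.data.ar$B_Road_Rate_Q <- cut(plot.data.ar$B_Road_Rate, 5, labels = c("1",
    "2", "3", "4", "5"))
ggplot() + geom_polygon(data = plot.data.ar, aes(x = long, y = lat, group = group,
    fill = B_Road_Rate_Q), colour = "grey", size = 0) + scale_fill_brewer(type = "seq",
    palette = "Blues") + coord_equal() + xlab("Eastings") + ylab("Northings")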
This web page has gone through the very basic process of creating rates of exposure by small geographic areas. The data we have used is freely available to download and re-use, and contains a wealth of information that you could use to perform your own data exploration.
For example, think about how you can use the data supplied here alongside other data available on sites like http://www.data.gov.uk, such as traffic counts, travel to work areas and running routes, to create adjusted risk models. Have a go!
Contact: email r.j.fry@swansea.ac.uk, Twitter @richfry