Visualizing UK accident data with Logscape

In my ever onward quest to show the world how easy it is to get up and running with Logscape, today I’m going to use a Logscape Docker container to build visualisations from some publicly available CSV files in no time at all. If you’ve never used the Logscape Docker image, check out my previous blog.

Today we’re going to be analysing data made available via the gov.uk website, which offers statistics for crashes in the UK for the year of 2015. The specific dataset is available for download here.

After downloading the dataset, I extracted it to a directory and set up a data source in order to make the data searchable.

What Day is the most dangerous?

A simple question that the dataset answers is on which day of the week most accidents occur. Sticking with the table format associated with CSV data, I performed a simple count of the days to get a breakdown of the most common day of the week…

Most Dangerous Day of the week (Unsorted)
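
If you want to sanity-check a count like this outside Logscape, here is a minimal Python sketch of the same idea. It is not the Logscape search itself: the sample rows and the column name day_of_week are assumptions for illustration and may not match the header in the real gov.uk file.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows. The column name "day_of_week" is an assumption,
# not necessarily the header used in the real dataset.
SAMPLE_CSV = """accident_id,day_of_week
1,Saturday
2,Monday
3,Saturday
4,Friday
"""

def count_by_day(csv_text):
    """Count accidents per day of the week from CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["day_of_week"] for row in reader)

def shares(counts):
    """Convert raw counts into percentage shares of the total."""
    total = sum(counts.values())
    return {day: 100.0 * n / total for day, n in counts.items()}

counts = count_by_day(SAMPLE_CSV)
percentages = shares(counts)
# most_common() sorts by count, largest first.
for day, n in counts.most_common():
    print(day, n, f"{percentages[day]:.0f}%")
```

Sorting by count, as most_common() does here, is the scripted equivalent of ordering the table by its count column.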

In this format it isn’t immediately obvious which day was the most common. We could order the table by the count column, but it’s going to be much clearer, and more interesting, if we change the visualisation. I opted for a pie and ended up with the following.

Most Dangerous Day of the week (Showing Saturday as most dangerous)

This is a much nicer visualisation: we can still see the individual event counts, but we get a clear pie chart showing us a percentage breakdown. At a glance, we can easily see that Saturday is the most dangerous day, with Wednesday, Thursday and Friday following closely behind. Monday sits a surprising 7116 (5%) reported events below Saturday.

We already know the data covers the 2015 period, and now we know the most dangerous day is a Saturday, so the question is which is the worst month?

Logscape will automatically read the header line of any files using a .csv file extension (Read more on CSV Discovery), and tag the fields with the name the header line defines. As such the month field is readily available for us to use in our analytics, so it’s simple to visualise; for this I opted for the table view.

 

Most Dangerous Months (Showing July as the most dangerous)
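
For comparison, the idea of naming fields after the header line is the same one Python's csv.DictReader uses. The sketch below is only an analogy and says nothing about how Logscape implements CSV discovery; the column names and rows are made up for illustration.

```python
import csv
import io

# Hypothetical sample data. These column names are assumptions for the example.
SAMPLE_CSV = """accident_id,month,speed_limit
1,July,30
2,November,60
"""

def read_tagged_rows(csv_text):
    """Read CSV text, using the header line as the field names for each row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fields = reader.fieldnames  # taken from the first line of the file
    rows = list(reader)         # every later line becomes a dict keyed by header
    return fields, rows

fields, rows = read_tagged_rows(SAMPLE_CSV)
print(fields)
print(rows[0]["month"])
```

Once each value is keyed by its header name, a field such as month can be counted or filtered without any further parsing.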

Tables are getting boring, so let’s introduce some colour with a cluster graph, which ranks data per bucket from smallest to largest.

Most Dangerous Months (Alt)

Having visualised that data, we now know that the most dangerous day of the week is Saturday and that July is the worst month.

If we want a breakdown of the events in a single month, we can limit our search to that month by adding a month.equals clause to our search.

Most Dangerous week day, July (Showing Saturday in July as the most dangerous)
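
Conceptually, an equals filter just drops every row whose month doesn't match before the count runs. Here is a rough Python sketch of that, using made-up rows and assumed field names rather than the real dataset or the Logscape search syntax.

```python
from collections import Counter

# Hypothetical rows. The field names "month" and "day_of_week" are assumptions.
ROWS = [
    {"month": "July", "day_of_week": "Saturday"},
    {"month": "July", "day_of_week": "Friday"},
    {"month": "July", "day_of_week": "Saturday"},
    {"month": "March", "day_of_week": "Monday"},
]

def count_days_in_month(rows, month):
    """Keep only rows for the given month, then count them per weekday."""
    return Counter(r["day_of_week"] for r in rows if r["month"] == month)

july = count_days_in_month(ROWS, "July")
for day, n in july.most_common():
    print(day, n)
```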

This allows us to see that while Saturday being the most dangerous day holds true for July, the month doesn’t entirely conform to the stats for the entire year. Maybe we’ll have more luck looking at which day of the month was the most dangerous. Was it Friday the 13th?

Most Dangerous Day of July (Showing July 3rd as most dangerous day in the year)

Almost. The most dangerous day in July was the 3rd, which is actually a Friday. However, July 13th follows closely, with a staggering 448 incidents reported across the entirety of the UK, even though it falls on a Monday, which overall is the safest day of the week for your trip!

Let’s continue to analyse July. We now know that the most dangerous day was the 3rd, but our dataset also offers speed limit data. So which speed limit had the most accidents on the 3rd?

Accidents by Speed limit, July 3rd (Showing a distinct weighting towards 30MPH zones)

The data shows a huge skew towards 30 mph zones, with around five times as many accidents occurring in this zone as in the next closest. This means either that people are predominantly driving in 30 mph zones, or that 30 mph zones are overwhelmingly dangerous.

We’ve established that 30 mph zones had the most accidents on July 3rd, but let’s take a step back. Looking at the entire year by hour, month and speed limit, let’s see if July being the worst holds true at a finer granularity. Adding in the hour value via concatenation and then grouping our results by month (as can be seen in the pictured search), we can see that the most dangerous combination was 5pm in a 30 mph zone during the month of November. That’s interesting, given that November was only the 3rd most dangerous month overall.

Accidents by Day and Speed limit (Showing Nov 17th, 30Mph zones, to be the most dangerous for the year)
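
The concatenate-then-group step can be sketched in Python as follows. This is an illustration of the technique under assumed field names (month, hour, speed_limit) with invented rows, not a copy of the pictured search.

```python
from collections import Counter

# Hypothetical rows. Field names and values are assumptions for illustration.
ROWS = [
    {"month": "November", "hour": "17", "speed_limit": "30"},
    {"month": "November", "hour": "17", "speed_limit": "30"},
    {"month": "July", "hour": "08", "speed_limit": "30"},
    {"month": "July", "hour": "17", "speed_limit": "60"},
]

def count_by_month_hour_speed(rows):
    """Count rows per month, keyed on a concatenated "hour_speed" value."""
    counts = Counter()
    for r in rows:
        combined = f'{r["hour"]}_{r["speed_limit"]}'  # e.g. "17_30"
        counts[(r["month"], combined)] += 1
    return counts

top_key, top_count = count_by_month_hour_speed(ROWS).most_common(1)[0]
print(top_key, top_count)
```

Joining the two values into one key means a single count covers every hour and speed limit pairing, and grouping by month keeps each month's pairings separate.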

Who and what is involved?

We’ve done a good job of analysing when and where the accidents are occurring, so let’s see who’s involved. The dataset includes both the number of casualties and the number of vehicles involved. In this scenario, casualty simply means involved, rather than injured as you might expect. Rendering these figures as a pie, we see the following.

Casualties per accident (78% one person, 15% two people, about 7% mixed higher values)

This shows that 78% of accidents only affect one person, 15% affect two, and the remainder (about 7%, given that the shares must sum to 100%) is broken up between higher amounts. The above graph doesn’t make it immediately obvious what the max, min and average casualties are, so let’s move that into a table where we can display it more clearly.

 

Min, Max and Average casualties (Max:38, Avg:1.3, Min:1)
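
The same three summary figures are easy to compute by hand. The snippet below is a small Python sketch over an invented list of casualty counts, so its output will not match the real figures in the table above.

```python
from statistics import mean

# Hypothetical casualty counts, one value per accident (not the real data).
CASUALTIES = [1, 1, 2, 1, 3, 1]

def summarise(values):
    """Return the minimum, maximum and arithmetic mean of the values."""
    return {"min": min(values), "max": max(values), "avg": mean(values)}

print(summarise(CASUALTIES))
```

An average close to the minimum, as in the real table, is what you would expect when the vast majority of accidents involve a single casualty and only a few involve many.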

Moving over to look at the number of vehicles, we can see that whilst the spread is wide, accidents involving more than 5 vehicles make up a negligible share, whilst 2 vehicles take the top spot.

Involved Vehicles (2 Vehicle accidents are more than twice as common as 1 Vehicle)

Two vehicles being the most common occurrence raises an interesting question. Thus far, when viewed in isolation, it seems that the average accident includes one person but two vehicles. If we combine the statistics, does it hold true?

Number of Casualties by Number of Cars (Collisions including only 1 person and two cars, are over 4 times more common than 2 people and 2 cars)

Re-adding the number of casualties, we can indeed see that the most common accident is a collision involving two vehicles but only one person. In second place we have collisions involving only one vehicle and one person. Statistically, this means your vehicle is more likely to be damaged by you hitting an unoccupied vehicle or another stationary object than by another driver colliding with you. Looking back at the data, we have to drop to the third spot before a second person enters the scenario, and for every incident of this type there are over 4 involving only one person and two vehicles.
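
Combining two fields like this amounts to counting each (vehicles, casualties) pair and ranking the pairs. Here is a minimal Python sketch with invented pairs, chosen only to mimic the shape of the ranking described above rather than the real counts.

```python
from collections import Counter

# Hypothetical (number_of_vehicles, number_of_casualties) pairs, one per accident.
PAIRS = [(2, 1), (2, 1), (2, 1), (2, 1), (1, 1), (1, 1), (2, 2)]

def rank_combinations(pairs):
    """Count each (vehicles, casualties) pair and rank from most to least common."""
    return Counter(pairs).most_common()

ranked = rank_combinations(PAIRS)
for (vehicles, casualties), n in ranked:
    print(f"{vehicles} vehicle(s), {casualties} casualty(ies): {n}")

# How many times more common the top pair is than the third-placed pair.
print(ranked[0][1] / ranked[2][1])
```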

Let’s introduce more fields

Reintroducing speed limit, we get the following:

Number of casualties by Speed limit and cars involved (Continuing to show a heavy weighting towards 30Mph zones)

In keeping with our previous findings, the top spot belongs to accidents involving two vehicles and only one casualty inside 30 MPH zones. Now let’s re-add time to the equation.

Number of casualties by Speed limit, Cars involved and hour of occurrence. (Groupings are now significantly closer, though 30Mph continues to be prevalent)

Our data is becoming much more closely grouped, but two vehicles with one casualty still holds the top spot for occurrence. The top hours fall between 3pm and 6pm, with one outlier at 8am (which can be seen as the orange on the far right).

Bringing back our previous most dangerous day, of July 3rd, let’s see if those stats hold true.

Accidents by Vehicle, Number of Casualties, Speed Limit and hour for July 3rd (Breaking from the previous trend, for the most dangerous day of the year, 8am is actually the most dangerous hour)

Looking at this data, we see that the previous outlier of 8am actually takes the top spot for accidents, with 22 occurring. Our previous top range doesn’t even make an appearance until the third position, and even then it’s isolated, with most of the top spots belonging to time-frames within the normal working day. A question that comes to mind is where these accidents occurred: was it a set of pile-ups within a small area, or were they spread out across the country? Our dataset does include longitude and latitude, and Logscape does support geo-mapping. However, that’s a question for another blog post, as we’ve answered what we set out to answer.

Conclusion

Breaking down our data has led to some interesting discoveries, and reveals that most of the presumptions I had coming into the analysis were true. Winter months are generally more dangerous than summer months, and “rush hour” times are the worst for accidents. However, July being the worst month comes as a surprise. At first I thought this could be attributed to the school holidays, but the data showed a weighting towards accidents in the first half of the month, before the holidays, rather than the latter half, during the holidays. It was also interesting to see that whilst the most common accidents include two vehicles, the majority of these include only one casualty.

Result: July 3rd was the most dangerous day of 2015, the time between 8am and 9am the most dangerous time on that day, and 30 mph the most dangerous speed limit.

Hopefully, you enjoyed this breakdown as there’s more to come in the future.

If you’re interested in using Logscape, then you can download it free.