In this blog post we’re going to be looking at what some people might call “big” data. No, that doesn’t mean big in the conventional sense; it means big in the sense that the single-file dataset is 10 GB in size, and I wanted to make a “big data” pun.
The data in question is a record of NYC’s 311 complaints since 2010, the 6th most popular dataset on the open data website. “311” is a complaints hotline in NYC. For those interested in following along or investigating the data themselves, it is freely available from the open data website.
Today we’re going to cover:
- Creating a data source and importing the data
- First look at the data to determine interesting fields
- Some basic visualisations of the data
Ingesting the Data
Logscape offers two commonly used methods of getting data into the system. The first option is via the front-end: we can upload our data through the user page by simply selecting a file, entering the tag we want to use for the data, and hitting go.
User Upload Page
The alternative, and the better option for a single file this large, is to use a data source. If you want to fully understand how to set up a data source, then you should follow this tutorial.
Regardless of which method you choose, after your data has been uploaded and indexed it will be made available at search time.
What complaints are being made?
Once your data has been indexed it will be available via the search page. If you created a data source this can be quickly accessed with the search controls on the far right of the page. Otherwise it can be accessed via the search option in the left-hand navigation menu. Once there, you’ll want to run a search for your tag.
Search displaying only partial indexing
We can see our data, but not in a quantity that matches our file. This is due to the file size: indexing could take a few minutes depending on your hardware, so just wait patiently. Once my data was indexed, I wanted to get an overview of the fields that the CSV file was making available to me. A simple search of the data source tag reveals that information.
Snapshot of discovered fields
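If you'd like to sanity-check the field list outside of Logscape, a minimal Python sketch along these lines would do it. It only reads the header row, so the size of the file doesn't matter. The sample rows are made up for illustration, and the column names are just the ones mentioned in this post.

```python
import csv
import io

def list_fields(csv_file):
    """Return the column names from the first row of an open CSV file."""
    reader = csv.reader(csv_file)
    return next(reader)  # only the header line is read

# Tiny in-memory stand-in for the real export (made-up sample rows).
# For the real file you would pass open(path, newline="") instead,
# where `path` is wherever you saved the download (hypothetical).
sample = io.StringIO(
    "Agency,Complaint Type,Location Type,City\n"
    "HPD,HEAT/HOT WATER,RESIDENTIAL BUILDING,BROOKLYN\n"
)

fields = list_fields(sample)
print(fields)
```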
I decided a logical place to start was to see which type of complaint was the most common. This was easy to achieve because all the fields of our CSV data are automatically extracted. Limiting the search to only the tag we’re interested in, finding a breakdown of complaint types was as simple as adding Complaint-Type.count() to my search. It’s worth noting that the field is actually “Complaint Type” in the CSV data, but Logscape automatically substitutes the space with a hyphen.
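For anyone following along without Logscape, the same kind of count can be sketched in plain Python. This is not how Logscape works internally; it just mimics the space-to-hyphen renaming and tallies one field a row at a time. The helper names and sample rows are my own, for illustration only.

```python
import csv
import io
from collections import Counter

def hyphenate(name):
    """Mimic the space-to-hyphen field renaming described above."""
    return name.replace(" ", "-")

def count_field(csv_file, field):
    """Count how often each value of `field` occurs, streaming row by row."""
    reader = csv.DictReader(csv_file)
    # Reading .fieldnames consumes the header; we then swap in renamed keys.
    reader.fieldnames = [hyphenate(n) for n in reader.fieldnames]
    return Counter(row[field] for row in reader)

# Made-up sample rows standing in for the real CSV.
sample = io.StringIO(
    "Agency,Complaint Type\n"
    "HPD,HEAT/HOT WATER\n"
    "NYPD,Blocked Driveway\n"
    "HPD,HEAT/HOT WATER\n"
)

counts = count_field(sample, "Complaint-Type")
for complaint, n in counts.most_common():
    print(complaint, n)
```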
Complaint Type Breakdown across entire dataset
The weighting is obviously heavily in favour of heating complaints, with our dataset including over 257,000 heating/hot water complaints over the past 7 years. That’s over 100 a day. Coming in second, with just over 165,000, is noise complaints. The leader of our tighter groupings is blocked driveway complaints with just over 97,000, followed closely by illegal parking.
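The "over 100 a day" figure is simple arithmetic, shown here as a quick check. It assumes exactly seven 365-day years, which is only an approximation of the period the dataset covers.

```python
# Heating/hot water complaints quoted above.
complaints = 257_000

# Seven years, ignoring leap days (an approximation).
days = 7 * 365

per_day = complaints / days
print(f"{per_day:.1f} complaints per day")
```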
The next question in my mind is also simple to implement: are all of these complaints being made to the same agency? Adding the agency with Agency+Complaint-Type.count() shows us a breakdown of complaints by the agency they were submitted to.
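Conceptually, grouping by agency and complaint type is just counting pairs of values instead of single values. Here is a rough Python equivalent, not Logscape's implementation, with made-up rows for illustration.

```python
from collections import Counter

# Made-up rows; in practice these would come from the parsed CSV.
rows = [
    {"Agency": "HPD", "Complaint-Type": "HEAT/HOT WATER"},
    {"Agency": "HPD", "Complaint-Type": "HEAT/HOT WATER"},
    {"Agency": "HPD", "Complaint-Type": "PAINT/PLASTER"},
    {"Agency": "NYPD", "Complaint-Type": "Blocked Driveway"},
]

# Key each count on the (agency, complaint type) pair.
by_agency_and_type = Counter(
    (row["Agency"], row["Complaint-Type"]) for row in rows
)

for (agency, complaint), n in by_agency_and_type.most_common():
    print(agency, complaint, n)
```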
Complaint by receiving agency
Unsurprisingly, the data breakdown looks almost, if not completely, identical. I think it’s safe to say that this data is recorded after it’s assigned to the correct agency, rather than New Yorkers simply being this efficient at reporting their issues correctly.
So we now know the most common complaint is for heating and water, and we know that the Housing Preservation and Development agency is responsible for handling those complaints. But which agency, overall, deals with the most complaints?
To find out, we’ll keep the .count() analytic, drop the Complaint-Type field, and change over to a table view, which reveals that…
Number of complaints by Agency
The Housing Preservation and Development agency also deals with the most complaints, closely followed by the NYPD. However, whilst the HPD has dealt with over 571k complaints, just over 253k of those, almost half of its total, were for heating and water, whereas the NYPD sees a greater variety of less-reported complaints.
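To put a number on "almost half", here is the share worked out from the rounded figures quoted above. Both inputs are approximate, so treat the result as a ballpark.

```python
# Rounded figures quoted above (both are "just over" / "over" values).
hpd_total = 571_000
hpd_heating = 253_000

heating_share = hpd_heating / hpd_total
print(f"{heating_share:.1%} of HPD complaints were heating/hot water")
```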
Who’s making the complaints?
Now that we know how many requests are made to each department, a logical question is: who’s making them? Doing a count of the City field reveals the answer.
Breakdown of top complaint sources by City
Brooklyn, New York, and the Bronx take the top spots. At a glance that may seem completely reasonable, but it’s actually quite surprising if you look into it.
- Brooklyn – 2.6 million population, 751k complaints
- New York – 8.4 million population, 464k complaints
- The Bronx – 1.4 million population, 455k complaints
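One way to make the comparison fairer is to divide each count by the population. The sketch below uses only the rounded figures from the list above, so the rates are rough.

```python
# (population, complaints) as listed above, both rounded.
cities = {
    "Brooklyn": (2_600_000, 751_000),
    "New York": (8_400_000, 464_000),
    "The Bronx": (1_400_000, 455_000),
}

# Complaints per resident over the whole period covered by the dataset.
rates = {
    city: complaints / population
    for city, (population, complaints) in cities.items()
}

for city, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(f"{city}: {rate:.3f} complaints per resident")
```

On these figures, Brooklyn and the Bronx both come out at roughly 0.3 complaints per resident, against well under 0.1 for New York.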
The outlier here is New York. With such a large population in comparison to the Bronx, you would expect to see a significantly higher number of complaints, but as it stands New York has approximately 60% of Brooklyn’s total. Similarly surprising is the fact that the Bronx has over 4 times as many complaints as Staten Island, despite only having a population around 3 times larger. Since Brooklyn has a disproportionately large number of complaints, let’s focus on it.
Complaint breakdown in Brooklyn
Looking at a complaint breakdown for just Brooklyn, we see that heating and residential noise maintain the top spots that we saw when looking at the entire dataset. But construction complaints, previously towards the bottom of our chart, make a huge leap to the third position. Paint/plaster complaints also make a smaller leap to claim the 4th position. It seems fair to say, based on these complaints, that houses in Brooklyn are in need of some tender loving care.
For the sake of comparison, let’s take a look at complaint types broken down for New York.
Complaint breakdown in New York
Remarkably, New York’s biggest contributors are the same as Brooklyn’s, so maybe Brooklyn isn’t in as much need of repair as we thought? Though evidently noise and heating problems are a terror in both cities.
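The city-by-city comparison above boils down to "filter by City, count Complaint-Type, take the top few". A rough Python sketch of that idea follows. The function name and sample rows are hypothetical, and the counts are made up purely to show the shape of the result.

```python
from collections import Counter, defaultdict

def top_types_by_city(rows, n):
    """Return the n most common complaint types for each city."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row["City"]][row["Complaint-Type"]] += 1
    return {
        city: [complaint for complaint, _ in counter.most_common(n)]
        for city, counter in counts.items()
    }

def make_rows(city, complaint, how_many):
    """Build `how_many` identical sample rows (illustration only)."""
    return [{"City": city, "Complaint-Type": complaint}] * how_many

# Made-up counts, not the real dataset.
rows = (
    make_rows("BROOKLYN", "HEAT/HOT WATER", 3)
    + make_rows("BROOKLYN", "Noise - Residential", 2)
    + make_rows("BROOKLYN", "PAINT/PLASTER", 1)
    + make_rows("NEW YORK", "HEAT/HOT WATER", 3)
    + make_rows("NEW YORK", "Noise - Residential", 2)
    + make_rows("NEW YORK", "Street Condition", 1)
)

top = top_types_by_city(rows, 2)
shared = set(top["BROOKLYN"]) & set(top["NEW YORK"])
print(top)
print("Shared top complaints:", sorted(shared))
```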
So it seems that there are no definitive conclusions to be drawn thus far. It’s probably still a fair assumption that Brooklyn is in a worse state of repair than New York because, whilst they share the same types of complaint, Brooklyn has a significantly higher number of them despite having a population less than 1/3 of the size. Let’s keep looking and see if we can determine a source.
Another field made available to us is “location-type”. Maybe it can shine some more light on the situation. Introducing the location type to the New York data, we get:
Complaints in New York by Location type
This shows a heavy weighting towards residential complaints. Looking at the same search but for Brooklyn…
Complaints in Brooklyn by Location type
Once again our data is strikingly similar, despite the huge disparity in the number of complaints made per person that we saw earlier. Whilst I was incredibly tempted to dive further into the “where” by geo-mapping the longitude and latitude values the dataset includes, that’s something we’ll cover in another blog post dedicated just to that feature.
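When two cities have very different totals, comparing shares rather than raw counts is one way to check whether the breakdowns really have the same shape. A minimal sketch, with made-up location values chosen only to illustrate the idea:

```python
from collections import Counter

def shares(values):
    """Turn a list of category values into each category's fraction of the total."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}

# Made-up samples: Brooklyn has twice as many rows, but the same mix.
brooklyn = ["RESIDENTIAL BUILDING"] * 6 + ["Street/Sidewalk"] * 2
new_york = ["RESIDENTIAL BUILDING"] * 3 + ["Street/Sidewalk"] * 1

brooklyn_shares = shares(brooklyn)
new_york_shares = shares(new_york)

for category in brooklyn_shares:
    print(category, brooklyn_shares[category], new_york_shares[category])
```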
Which leaves me stumped. From looking at the data I can see plain as day that Brooklyn generates a disproportionate number of complaints; however, I simply can’t determine the why. Two possible conclusions that you can draw from the data are that either Brooklyn is in a significantly worse state, or that a larger proportion of the population lives in council-managed affordable housing and thus reports its issues to 311. However, I think this brings up an important point:
Just because your data shows there is a problem, doesn’t mean it will tell you why
I think that’s an important takeaway that can be applied universally. When you ask people if their application has logging, they’ll often smile and nod happily. However, a week later, during a production outage, the same person is screaming that “the logs show nothing”. Just because a dataset is large doesn’t mean it will include the data you need. In this case there is no indication of whether or not the caller lives in affordable housing, or whether the same residence has made the complaint multiple times.
Maybe the answer is hiding in there somewhere, but the dataset includes an incredibly large amount of data, and an equal amount of noise. Having additional datasets where parallels could be drawn could assist in both enriching the original dataset and helping to highlight fields of key importance. Logscape did a great job of cutting through that noise to find the value where there was some. However, I can’t help but feel like someone needs to give our guide to logging a read.
If you want to try Logscape, then download it for free.