07 Jan 2017


There is a pretty fabulous dataset offered up through NYC’s 311 service. 311 is a service started in NYC in 2003 to provide a consolidated end point for non-emergency municipal services, and similar services are offered in many other cities around the world. Noise complaints, trash removal, and towing requests are all examples of requests you might file through 311. NYC offers up an anonymized, near-realtime log of all requests filed through 311, which also includes historical records going back to its inception. The records are surprisingly rich: beyond the basic details of the request and which department handled it, they contain the timing of when it was opened and closed, free-form comments, and most records have some notion of location attached to them. I attempted to use the request type and location to answer this question: Can we differentiate NYC neighborhoods based on the types of requests that are filed within them?


I ran a clustering algorithm over the request types, broken down into census tracts, and came up with the following map:

Clusters in NYC

Note that it makes sense for the clusters to be geographically consolidated, but geography was not part of the clustering algorithm's formulation. The clustering was applied only to the profile of request types in each tract. Naturally you would expect nearby tracts to have similar issues, but the extent to which this is the case is enlightening in its own right. As for what types of requests differentiate each cluster, here is a summary for each:

  • Light Blue: Pretty evenly split between sewage, noise, heating, and plumbing issues
  • Red: Almost half of the issues are sewage related, with the rest pretty evenly split
  • Dark Blue: Noisy area - 76% of issues are noise related
  • Green: Housing issues like heating and plumbing dominate
  • Olive: 77% of issues relate to sewage

Here is a little more detail on the distribution of issue types in each cluster. This distribution is the centroid of the cluster, measured using a median amongst the constituent census tracts and then normalized to 100%. There are almost 300 different issue types, so I bucketed some similar complaints and filtered to the most popular issues here:

Cluster Centroids
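For the curious, here is a minimal sketch of how a centroid like this could be computed: take the median share of each issue type across the tracts in a cluster, then rescale so the shares add up to 100%. The function name `cluster_centroid` and the three tracts are made up for illustration, and treating a missing issue type as a 0% share is my assumption rather than a detail of the original analysis.

```python
from statistics import median


def cluster_centroid(tract_distributions):
    """Median share of each issue type across tracts, rescaled to sum to 100."""
    issue_types = sorted({issue for dist in tract_distributions for issue in dist})
    # Assumption: a tract with no complaints of a given type counts as 0%.
    medians = {
        issue: median(dist.get(issue, 0.0) for dist in tract_distributions)
        for issue in issue_types
    }
    total = sum(medians.values())
    if total == 0:
        return medians
    # Medians taken per issue type need not add up to 100, so renormalize.
    return {issue: 100.0 * value / total for issue, value in medians.items()}


# Hypothetical tracts (made-up percentages, not figures from the 311 data).
tracts = [
    {"noise": 70.0, "heating": 20.0, "sewage": 10.0},
    {"noise": 80.0, "heating": 15.0, "sewage": 5.0},
    {"noise": 60.0, "heating": 10.0, "sewage": 30.0},
]
centroid = cluster_centroid(tracts)
for issue, share in centroid.items():
    print(f"{issue}: {share:.1f}%")
```

The renormalization step matters because the per-issue medians come from different tracts, so they will not generally sum to 100% on their own.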


It is tough to look at the map and not see some familiar patterns. Take Manhattan for instance - it is divided into three clusters. One (in light blue) is sort of a blend of a lot of different issues with no one issue dominating. The area it covers loosely aligns with the residential areas of Manhattan. The second (in dark blue) is the noisy area, which covers a lot of the business districts and the most traffic-congested areas. And finally there is the area that appears to have a lot of housing issues (green), which roughly lines up with Harlem. The way the issues line up with the geography says a lot about what separates Harlem from the rest of Manhattan. And if you look closely, you can see that there are areas of light blue permeating into the southern border of Harlem, perhaps a sign of the changing neighborhood.

The other boros are more of a mixed bag. The Bronx shares much of the same character as Harlem, where housing issues dominate. Queens and Staten Island appear to be somewhat similar in their issue types, and unfortunately they appear to have the most issues with sewage. I hate to say it, being a former Queens resident, but the data suggests it is in the running to be the smelly boro.

Not surprisingly, Brooklyn looks pretty diverse, as measured by different distributions of issue types. Downtown Brooklyn is much like the noisy areas of Manhattan, and the expensive rent areas like Williamsburg, Park Slope, and Cobble Hill look a lot like lower Manhattan. Bed Stuy, Crown Heights, and Flatbush have many of the same issues as Harlem and the Bronx. Interestingly, the further south you go in Brooklyn the more “block by block” it appears to become - with a lot more fragmentation in the types of issues you see in nearby geographic areas.

Further Reading

Here is an interesting paper that profiles other aspects of NYC’s 311 data: paper


The 311 data is big and dense, so it took a little processing to get to this analysis. A couple of notes on the methodology presented here:

  • About half of the 13M records had lat/long coordinates, which I processed through this API in order to convert them into census blocks. I chose to summarize at the census tract level though, as it provided a more statistically significant binning of requests.
  • To compute the clusters, I first computed the by-tract distribution of request types, filtering out outliers in the distribution. Then I computed a distance matrix over the census tracts, where the distance score between any two tracts was the Euclidean distance between their issue distributions. In other words, if one tract had 50% noise complaints and 50% heating issues, and another had 20% noise complaints and 80% heating issues, the distance I used was sqrt( (50% - 20%)^2 + (50% - 80%)^2 ).
  • The actual clustering algorithm was an agglomerative approach using Ward's method. I computed clusters at varying levels, from 2 to 15 different clusters, but got the most stable results at 5, which is what I reported here.
  • The cluster centroids were computed using a median of the constituent tracts' distributions. Using the average gave similar results.
  • Props to DigitalOcean for making it cheap/easy to spin up temporary compute resources.
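To make the distance and clustering steps a bit more concrete, here is a small pure-Python sketch, not the code actually used for the analysis. It computes the Euclidean distance from the example above, then runs a naive version of Ward's method: repeatedly merge the two clusters whose merger adds the least within-cluster variance. The function names and the five sample tracts are hypothetical, and the brute-force search is only practical for small inputs.

```python
import math
from itertools import combinations


def mean_vector(points):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(points) for column in zip(*points)]


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    d = math.dist(mean_vector(a), mean_vector(b))
    return len(a) * len(b) / (len(a) + len(b)) * d ** 2


def ward_clusters(points, k):
    """Naive agglomerative clustering with Ward's criterion, down to k clusters."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # Pick the pair of clusters that is cheapest to merge.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters


# The worked example from the text: (noise, heating) shares for two tracts.
distance = math.dist((0.50, 0.50), (0.20, 0.80))
print(f"distance = {distance:.3f}")  # sqrt(0.3^2 + 0.3^2), about 0.424

# Hypothetical tracts as (noise, heating) shares, not real 311 figures.
tracts = [(0.50, 0.50), (0.20, 0.80), (0.25, 0.75), (0.90, 0.10), (0.85, 0.15)]
result = ward_clusters(tracts, k=2)
for cluster in result:
    print(cluster)
```

With thousands of tracts and hundreds of issue types you would want a library implementation instead, for example SciPy's `scipy.cluster.hierarchy.linkage` with `method="ward"`.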

There is a lot more detailed data that I generated in the course of this analysis. If you want access to any of it, feel free to contact me (my information is in the about page).

nyc 311 cluster opendata

22 Nov 2016


There are animals on the streets! Well, not really - but there are a surprising number of “illegal” pets reported throughout the city. 1592 to be exact, since 2010. So what are these pets, and where are they?

Number 3: Snakes

Apparently there have been 150 reports of snakes slithering about the city, accounting for about 9.4% of all the illegal pets. Ophiophobes might be a little distraught to see that they are fairly well distributed throughout the boroughs. I am overjoyed to see that several snakes have been reported in the area around the Gowanus Canal, near my home.

Snakes in the City

Number 2: Farm Animals

Narrowly beating out snakes, there have been 179 reports of “farm animals” around the city, or about 11.2% of all reports. Presumably this covers a wide range of animals - goats, sheep, cows maybe. There is a conspicuous concentration of reports filed around the Wall Street area - perhaps people are calling in the bull of Wall Street.

Farm Animals in the City

Number 1: Roosters!

Flying away with the lead, roosters have been behind 832 of the complaints, or about 52% of all illegal pet complaints in NYC. Roosters! Who would have thought… But, I suppose there is some selection bias afoot here. A secret pet goat might be easier to hide than a rooster who happily greets the day for all their neighbors to hear. Somewhat surprisingly, it looks like you can find them virtually everywhere in the city.

Roosters in the City

Honorable Mentions

There are other reported pets that did not make the top 3 - including a catchall category of “Other” that accounts for 359 (23%) of all the illegal pets. There are also 44 instances of ferrets, 16 instances of iguanas, 9 instances of monkeys, and 3 instances of turtles (specifically turtles under 4 inches long, illegal since they are a source of salmonella according to the FDA).
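As a quick sanity check, the counts quoted in this post add up to the 1592 total, and the percentages follow directly from them. A small sketch, with the counts copied from the text above:

```python
# Report counts as quoted in this post.
reports = {
    "Roosters": 832,
    "Other": 359,
    "Farm animals": 179,
    "Snakes": 150,
    "Ferrets": 44,
    "Iguanas": 16,
    "Monkeys": 9,
    "Turtles": 3,
}

total = sum(reports.values())
shares = {animal: round(100 * count / total, 1) for animal, count in reports.items()}

print(f"total reports: {total}")
for animal, share in shares.items():
    print(f"{animal}: {share}%")
```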

Who knew there was such an eclectic array of pets in NYC? I am a dog person, but if roosters are more your jam, looks like you are definitely not alone!

nyc 311 animals opendata

06 Nov 2016


I stumbled on a pretty awesome dump of NYC bike stats recently through the NYC OpenData project. Apparently, every year on a particular day in the fall teams go out to specific locations throughout the city and collect data on a bunch of bike related measures - such as how many riders, the split between male and female riders, whether people are wearing helmets, etc. They have been doing this consistently since 2005 in 10 locations throughout the city. With all of those counts, it makes me wonder if I have ever been counted as part of this program! Here are some interesting things that I found.


Ridership Over Time

In general, the data suggests that bike riding is on the rise in NYC. On average, ridership has gone up 10% year on year. I would argue that this is a lower bound, and the real growth is likely to be higher. I say this because the growth is a little erratic, and I am guessing that surveying was done in poor weather in some years (like 2014, where there was a 2.35% drop in ridership despite a 32% growth in 2013 and a 20.84% growth in 2015). Ridership is growing throughout the city, but as you can see below, lower Manhattan has been pretty dominant in the ranking of areas with the most riders.

Ridership Rank Over Time


It is pretty clear that there are more male riders than female riders in NYC, but it is a little surprising to see just how off balance it is. It is about 80% male versus 20% female, and while that gap has narrowed a bit since 2005 (14% female and 86% male), the change is not that significant and women are still vastly under-represented in the bike lanes.

Helmet Usage

Let me just start by saying that there is no good reason not to wear a helmet when riding a bike in NYC. To each their own, but I question anyone’s intelligence and/or sense of self worth when I see them riding around without a lid. You can be the best cyclist in the world, but all it takes is a pedestrian staring at their iPhone stepping into the bike lane, another less coordinated cyclist clipping your wheel, or a passenger opening a door without looking to land you in the hospital. And with a helmet, you are more likely to walk out of that hospital and ride another day. </sermon>

Helmet usage is pretty disappointing in NYC. Based on the stats, it looks like only around 41% of riders wear helmets. Women are better about it than men, with 45% of women versus 40% of men wearing helmets, but even 45% is pretty low. Pro tip: NYC gives away helmets for free, find out more information here.

Helmet Utilization

Bike Lane Usage

There are some good reasons not to use bike lanes sometimes (like cars parking in them), but in general I would argue that riding in the bike lane is safer than riding in traffic. The hierarchy might look something like: bike lane > traffic > against traffic > biking on the sidewalk. Don’t bike on the sidewalk. Seriously.

Looks like NYC does pretty well by this measure, and is getting even better. Most riders were counted in the bike lanes (78.15%), very few on the sidewalk (0.83%), and not many riding against traffic (3.91%). Back in 2005, 4.09% of riders were counted on the sidewalk, so we can see a big improvement in this regard.

Bike Lane Utilization

And here is a handy link to the laws for cyclists in NYC. No biking on the sidewalks!

nyc bike safety opendata

30 Oct 2016


I made a point in my last post about differentiating between a change in the crime rate and a significant change. I want to elaborate on that point a little more, because this is something that is so often overlooked but is vital to understanding any analysis. So if you know all of this already, feel free to skip the next two paragraphs.

When someone quotes a stat, like saying that overall major felonies were down 1.6% in 2015, they are only presenting part of the story. It is not a lie - crime was really down 1.6% in 2015 - but you need to ask whether that is significant, i.e., is that something real or could it be random noise? This is where a more adroit publication would quote a significance level, or p value, but I personally do not find them intuitive. So here is my explanation in a nutshell. Something like the crime rate is going to have a natural fluctuation. Crime stats come out of a complex system, a lot of factors move them, and this all contributes to what amounts to random noise.

Luckily, we have ways of dealing with this randomness, and determining the degree to which stats are part of it or not. We ask the question “what is the chance that this stat is because of noise, versus a real effect” and say something is significant if our confidence that it is a real effect is above a certain threshold. Commonly, that threshold is 95%, but it is important to recognize that it is not binary - the probability exists on a spectrum, even though we use this shorthand of calling something “significant” or not. We do this by calculating the variance of the changes, which is a measure of how much they bounce around. Taking the square root of the variance gives us the standard deviation. Back to our example, the standard deviation in the crime rate in NYC from 2001 to 2014 is 3.97%. Using what is called a normal distribution (which is a pretty awesome math thing in its own right), we can judge how likely particular values are to be noise. The normal distribution tells us that at one standard deviation, i.e. a 3.97% rise or fall, there is a 32% chance that what you are measuring is noise. And at two standard deviations, i.e. a 7.94% rise or fall, there is a 5% chance that what you are measuring is noise, or said differently, you are 95% confident in the measure. That is the usual cutoff - 95% - but as you can see, there is more to it than just a significant or not explanation.

What does this mean for the NYC crime data? The 2015 stats suggest a 1.6% drop in major felonies, however the standard deviation of the crime rate is 3.97%, so this nets out to about a 69% chance that what we are measuring is noise. So that is to say, we cannot really say much. The drop in crime rate for 2015, based on these assumptions, is just not significant.
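If you want to reproduce these numbers, here is a minimal sketch using Python's standard library. It assumes, as above, that year-to-year changes follow a normal distribution centered on zero, and it uses the two-sided tail probability (a change at least this large in either direction) as the "chance of noise". The function name `noise_probability` is just an illustrative choice.

```python
from statistics import NormalDist


def noise_probability(change_pct, std_dev_pct):
    """Two-sided normal tail probability for a change of the given size."""
    z = abs(change_pct) / std_dev_pct  # how many standard deviations out we are
    return 2 * (1 - NormalDist().cdf(z))


STD_DEV = 3.97  # standard deviation of NYC crime rate changes, 2001 to 2014

one_sd = noise_probability(3.97, STD_DEV)    # about 0.32
two_sd = noise_probability(7.94, STD_DEV)    # about 0.05
drop_2015 = noise_probability(1.6, STD_DEV)  # about 0.69

print(f"one standard deviation:  {one_sd:.0%}")
print(f"two standard deviations: {two_sd:.0%}")
print(f"2015 drop of 1.6%:       {drop_2015:.0%}")
```

The same function works for a single precinct, provided you swap in that precinct's own standard deviation instead of the citywide one.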

There are some precincts though which do have a significant change in crime rate in 2015:

Significant Changes in NYC Crime

The picture does not look great. Of the 6 precincts that had a significant change (using the 95% cutoff), 5 took a turn for the worse, and there is some evidence that the one good result is misleading. Here is a quick rundown of what I was able to find for each:

  • 1st Precinct (Tribeca area) - 2.5 stars on Yelp, seems like grand theft auto is on the rise in that area.
  • 32nd and 34th Precincts (Harlem) - looks like they have a good connection with the community, however the head cop transferred out in 2015 (story). It looks like there was a significant rise in murders in these precincts.
  • 40th Precinct (Mott Haven, Bronx) - with a 25% rise in crime in 2015, this raises some serious questions about what is going on. And it looks like the NYT did some in-depth reporting on this. Also, this precinct was caught juicing their 2014 stats, which makes any numbers coming from this precinct highly suspect (and statistically speaking will make it difficult to make any assertions about crime stats moving forward - so the impact of this will last for years).
  • 105th Precinct (Queens Village) - this is a huge precinct, and it looks like it is split to better serve the community.
  • 104th Precinct (Ridgewood, Queens) - the one darling, with a 10.8% drop in major crimes, however a cursory investigation yields articles like this and this, suggesting that the drop in crime rate has more to do with cops not responding to calls (earning them 1.5 stars on Yelp).
nyc crime opendata