07 Oct 2016


One of the datasets available on the amazing NYC bike data website (available here) is information on accident rates, sorted by police precinct and accident type. They make this information available in the dreaded PDF format, and it took me a little while to get over this. However, with the help of Tabula I was able to convert the PDFs into something useful.

There are four different types of crashes, defined by what the bike crashes into: car, pedestrian, themselves, and other bikes. Presumably “themselves” encompasses a miscellaneous assortment of tragic collisions, including street poles, potholes, and the occasional spontaneous combustion.

In 2015 there were 5270 crashes reported, with collisions between cars and bikes being by far the most common at 83.7% of all accidents. I imagine there is a reporting bias here, in that these types of collisions are more likely to get picked up in a police report than other types. Car collisions are followed by self collisions (8%), pedestrian collisions (6.6%), and lastly collisions with other bikes (1.7%). It is interesting to speculate why car crashes have such a high representation - as a cyclist in NYC I would a priori expect pedestrian crash rates to be about as high if not higher - but I do not see a clear reason in the data.

One might think car crashes are reported more often because they cause more damage (both property damage and personal injury). We cannot judge property damage rates from the data; however, the dataset does contain injury rates. For cyclists they are mostly pretty close - in a crash with a car the bicyclist is injured 50% of the time, with another bike 45%, and with themselves 42%. The exception is pedestrian crashes, where the bicyclist is injured only 4% of the time (although pedestrians suffer some sort of injury in those accidents at a high rate - 52% of the time). So it is unclear why bike on car crashes are so highly represented, unless it is just the reality. What is perhaps more clear is that, as expected, in asymmetrical exchanges the lower mass/more vulnerable party is more likely to get hurt (between bike and car, it is 0.6% for car vs 50% for bike, and between bike and pedestrian it is 4% for bike and 52% for pedestrian).

Since this data is indexed by police precinct, it is possible to plot accident rates by geographic area. Below is a plot of bike accidents with cars in 2015, with accident rates normalized by the 2010 census population in each precinct.

Bike on Car Accidents, 2015

It is a little disheartening to note that my normal commute crosses almost all of the most dangerous precincts. Oh well…


The plots in this writeup are done using the basemap toolkit for matplotlib, and a helpful tutorial here. NYC provides the data on collision rates here. NYC also provides all sorts of useful information by census tract, including population, here. Since the collision rates are reported by precinct and the population is by tract, you need a way to translate census tracts into police precincts, for which I found a static mapping that you can pull down here.
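
For anyone wanting to reproduce the normalization, here is a minimal sketch of the tract-to-precinct roll-up. It assumes the three inputs have already been loaded into plain dictionaries; the function name, the dictionary layout, the "per 10,000 residents" scaling, and the numbers at the bottom are all made up for illustration and are not taken from the actual datasets.

```python
from collections import defaultdict


def precinct_rates(crashes_by_precinct, population_by_tract, tract_to_precinct, per=10_000):
    """Return crashes per `per` residents for each precinct."""
    # Roll tract populations up to the precinct that contains them.
    population_by_precinct = defaultdict(int)
    for tract, population in population_by_tract.items():
        precinct = tract_to_precinct.get(tract)
        if precinct is not None:  # skip tracts missing from the mapping
            population_by_precinct[precinct] += population

    rates = {}
    for precinct, crashes in crashes_by_precinct.items():
        population = population_by_precinct.get(precinct, 0)
        if population > 0:  # avoid dividing by zero for unpopulated precincts
            rates[precinct] = per * crashes / population
    return rates


# Made-up numbers, purely for illustration.
population_by_tract = {"A": 4000, "B": 6000, "C": 5000}
tract_to_precinct = {"A": 1, "B": 1, "C": 2}
crashes_by_precinct = {1: 50, 2: 10}

rates = precinct_rates(crashes_by_precinct, population_by_tract, tract_to_precinct)
print(rates)
```

Precincts with no mapped population (parks, for example) are simply dropped here rather than plotted with an undefined rate, which is one of several reasonable choices.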

nyc biking accidents opendata

23 Sep 2016


Of all the services that I have running to pull down data on my life, the survey system is arguably the most important. Ultimately, a major goal is to eventually be able to predict things like my happiness and concentration, and perhaps even help guide them. As such, finding a high quality way to pull down data on my mental state is important.

To achieve this, I have a very simple service that periodically sends me a predefined survey in a text message, to which I respond with measures for how happy, alert, and focused I am at that particular moment. The survey times are somewhat random, but guaranteed to be spread throughout the day.

My analysis is only going to be effective if I get sufficient data, though I need to be realistic in that I cannot expect to answer my survey every 10 minutes. Or even every couple of hours. Realistically, sometimes I just do not want to respond. The survey system understands this, and through a response feedback loop will decrease or increase its question frequency based on my engagement level.
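
To give a feel for what such a feedback loop can look like, here is a minimal sketch. It is not the actual service: the function name, the multiplicative shrink/grow factors, and the one-hour and eight-hour bounds are all assumptions chosen for illustration, and the random jitter in delivery times is left out.

```python
def next_interval(current_minutes, responded, shrink=0.8, grow=1.5,
                  min_minutes=60, max_minutes=480):
    """Shorten the gap between surveys after a response, lengthen it after a miss."""
    factor = shrink if responded else grow
    proposed = current_minutes * factor
    # Clamp so the surveys never become too frequent or too rare.
    return max(min_minutes, min(max_minutes, proposed))


# Simulate a short run: answered, ignored, ignored, answered.
interval = 120
history = []
for responded in [True, False, False, True]:
    interval = next_interval(interval, responded)
    history.append(interval)

print(history)
```

Backing off faster than speeding up (1.5 versus 0.8 here) is one way to keep the system from nagging during a stretch of low engagement.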

So, that said, what does it look like so far? I have some interesting results, though nothing groundbreaking yet. First, when it comes to responding, I am most likely to respond to my surveys in the afternoon - 33% of my responses came between 1pm and 5pm, compared to only 20% coming in the morning. When it comes to the survey itself, ideally I wanted to pick independent attributes, with low correlations. It looks like I mostly achieved this - the correlation between most pairs is around 35%, however the correlation between concentration and happiness is around 50%. That is in some way an interesting result in and of itself, however it probably means I need to redesign the survey to have more independence in the measures.
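
For reference, the pairwise check can be sketched as follows, assuming the correlations are plain Pearson correlations (the post does not say which measure was used). The helper names and the sample scores are invented for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def pairwise_correlations(responses):
    """Correlation for every pair of survey attributes."""
    return {
        (a, b): pearson(responses[a], responses[b])
        for a, b in combinations(sorted(responses), 2)
    }


# Invented survey scores, one entry per answered survey.
responses = {
    "happiness": [3, 4, 2, 5, 4],
    "alertness": [2, 4, 3, 5, 3],
    "concentration": [4, 4, 2, 5, 3],
}

for pair, value in pairwise_correlations(responses).items():
    print(pair, round(value, 2))
```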

There are two early questions that I wanted to answer:

  1. Am I a morning person?
  2. Does biking to work make my day better?

Am I a morning person?

So far it is a little hard to tell, but early indications suggest that I have a minor (1%) lift in concentration in the morning, despite having a negative 3% draw on alertness. I can qualitatively corroborate this, however I will be interested to see how this evolves as more data comes in. Also worth noting - while the data suggest I concentrate the most in the morning, it is clear that I am happiest in the evening, with an 11% lift in happiness coming after 6pm (morning is ok, and then there is a big dip midday).

Does biking to work make my day better?

It looks like there is some immediate impact, but it wears off pretty quickly. On days that I bike to work, I experience about a 3% bump in happiness, and a 6% bump in concentration. That said, over the course of the whole day it averages out, and the only effect at the day level might be a slight uptick in volatility in all the measures. But it appears to be minor and may just be noise at the moment.

So where to go from here? I may need to tweak my survey to measure a more differentiated set of attributes, and I need to find a way to bump up my midday happiness (I am starting a new job soon, and it will be interesting to see what impact that has on this midday lull).

quantlife survey

19 Sep 2016


Take a look at this map of NYC population density, based on the 2010 census:

NYC Population Density

Is it what you might expect? Looking at that map forced me to invalidate some of my previously held assumptions about NYC neighborhoods. I knew that Manhattan was likely to be the densest borough overall, however I did not realize how much more dense it is than Brooklyn! In fact, based on this map, it appears that Brooklyn is a lot less dense than most of Queens, Staten Island, and certainly the Bronx. That was a big surprise to me.

I was also confounded by the dense cluster in the northeast corner of the Bronx. At first I figured this was just a fluke, but I investigated separately here and here. It appears that this census tract (tract 046201) has a very high population count, about 60% more than anything else. I am not sure what that means, but it gives me a little pause since it is such a statistical outlier compared to the other tracts in NYC (much less the rest of the country).


The plots in this writeup are done using the basemap toolkit for matplotlib, and a helpful tutorial here. NYC provides useful shapefiles for all of its census tracts, which can be found here, and you can also pull down the population data. In order to get reasonable midrange differentiation, I winsorized the population density (seemed better than mapping it to a logarithmic scale, since the data is visualized with a linear color scale). But to each his own.
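
Winsorizing just means clipping the extremes to chosen percentile values so that a handful of outliers do not eat up the whole color scale. Here is a minimal sketch; the 10th/90th percentile cutoffs, the nearest-rank percentile rule, and the sample densities are assumptions for illustration, not the values used for the map.

```python
def winsorize(values, lower_pct=10, upper_pct=90):
    """Clip values to the given lower and upper percentiles (nearest-rank)."""
    if not values:
        return []
    ordered = sorted(values)
    last = len(ordered) - 1
    low = ordered[round(lower_pct / 100 * last)]
    high = ordered[round(upper_pct / 100 * last)]
    # Everything below `low` is raised to it; everything above `high` is lowered to it.
    return [min(max(v, low), high) for v in values]


# Invented densities with one extreme outlier.
densities = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]
clipped = winsorize(densities)
print(clipped)  # → [2, 2, 3, 4, 5, 6, 7, 8, 9, 9]
```

Unlike a log transform, this leaves the middle of the distribution untouched, so a linear color scale still reads naturally.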

nyc opendata

07 Sep 2016


I regularly ride my bike to work. I find that it is the best way to get two moderately intense workouts in a day, and besides, it is significantly nicer than riding the subway (no slight on the subways, I just happen to like being outside). I have my own bike, but since Citibike has come onto the scene here in NYC I have been seeing more and more people commuting and/or riding about the city on the blue behemoths (I prefer to call them blue whales). I was poking around the NYC bike ridership information, and found this amazing page of stats available on bikes, including a complete dump of all the Citibike rides ever taken! That is close to 30 million rides so far, thanks Citibike!

The data can be split in a lot of different ways, however I focused on rider age and its relation to geography and time. Some surprising patterns emerged. Most notably, Citibike riders are a lot older than I expected! The weekday median age is 38, while the weekend median age is 35 (the overall median age in NYC is 36). I find it a little surprising that weekend riders are noticeably younger than weekday riders - my hypothesis is that commuters are in general older than joy riders, which is responsible for tipping the scales here.
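
The weekday/weekend split boils down to something like the sketch below. The record layout (`start_time` and `birth_year` keys), the timestamp format, and the sample trips are assumptions for illustration; the real export has its own column names. Age is approximated as ride year minus birth year, and records without a birth year are skipped.

```python
from datetime import datetime
from statistics import median


def median_age_by_day_type(trips):
    """Median rider age for weekday and weekend trips."""
    ages = {"weekday": [], "weekend": []}
    for trip in trips:
        birth_year = trip.get("birth_year")
        if birth_year is None:  # no age information on this record
            continue
        started = datetime.strptime(trip["start_time"], "%Y-%m-%d %H:%M:%S")
        # datetime.weekday(): Monday is 0, so 5 and 6 are Saturday and Sunday.
        day_type = "weekend" if started.weekday() >= 5 else "weekday"
        ages[day_type].append(started.year - birth_year)
    return {day_type: median(values) for day_type, values in ages.items() if values}


# Invented trips, purely for illustration.
trips = [
    {"start_time": "2016-09-05 08:15:00", "birth_year": 1976},  # Monday
    {"start_time": "2016-09-06 08:30:00", "birth_year": 1980},  # Tuesday
    {"start_time": "2016-09-06 18:05:00", "birth_year": None},  # skipped
    {"start_time": "2016-09-10 11:00:00", "birth_year": 1984},  # Saturday
    {"start_time": "2016-09-11 14:20:00", "birth_year": 1990},  # Sunday
]

medians = median_age_by_day_type(trips)
print(medians)  # → {'weekday': 38.0, 'weekend': 29.0}
```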

Another interesting look at the age of Citibike riders is by geography. Below is a choropleth of the overall median age of riders around the city (by where they start their rides; darker is older, and the range is from 33 years old in the lightest areas to 44 in the darkest):

Ridership Age by Census Tract

As expected, it is far from an even distribution. Midtown and the Upper West Side have the oldest riders (median age is 44), whereas Murray Hill, East Village, and Alphabet City tend towards younger riders (median age is 33).

Certain areas (like around midtown or the financial district) are naturally going to have a higher concentration of commute-related rides. By plotting the age of weekday riders versus weekend riders, it is pretty clear that the age distribution in those neighborhoods changes the most, supporting the idea that commuters trend older than the average rider:

Weekday Ridership Age by Census Tract

Weekend Ridership Age by Census Tract

And to further drive the point home, if you take into account time of day, you can see that morning riders (during commute hours) trend much older than night time riders (after 8pm). In fact, night time riders look a lot like the weekend riders (at least age wise).

Weekday Morning Ridership Age by Census Tract

Weekday Evening Ridership Age by Census Tract

NYC is an interesting place partly because it is non-homogeneous, and you can definitely see that with respect to the age of Citibike riders around the city. With the expansion of the Citibike program this summer, particularly into more residential neighborhoods, I am excited to see how it evolves!


The plots in this writeup are done using the basemap toolkit for matplotlib, and a helpful tutorial here. The Citibike data provides lat/long coordinates of the starting and ending stations, which can in turn be translated into FIPS codes (i.e., census tracts/blocks) using this FCC provided API. NYC provides useful shapefiles for all of its census tracts, which can be found here.
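
Once you have a block-level FIPS code for a station, getting to the census tract is just string slicing: a full 15-digit block code is laid out as 2 digits of state, 3 of county, 6 of tract, and 4 of block, so the first 11 digits identify the tract. A minimal sketch follows; the function name is hypothetical and the sample code below is made up for illustration (36 is New York State and 061 is New York County, but the tract and block digits are arbitrary).

```python
def split_block_fips(block_fips):
    """Split a 15-digit census block FIPS code into its components."""
    if len(block_fips) != 15 or not block_fips.isdigit():
        raise ValueError("expected a 15-digit block FIPS code")
    return {
        "state": block_fips[:2],
        "county": block_fips[2:5],
        "tract": block_fips[5:11],
        "block": block_fips[11:],
        # State + county + tract together identify the tract uniquely.
        "tract_geoid": block_fips[:11],
    }


parts = split_block_fips("360610031001000")
print(parts["tract_geoid"])  # → 36061003100
```

Keeping the codes as strings matters, since treating them as integers would drop leading zeros for some states.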

Not all records have age information, so I assume you do not have to provide this data when you sign up. This might introduce a bias in the data (I can argue both ways, younger and older). For this analysis, those records are just filtered out.