Playing Around with Origin-Destination Analysis with NetworkX in Python
Citibikes have proven to be an outstanding way to get around the city for commuters, tourists, and virtually anyone who needs a set of wheels. Consequently, Citibike data provides urbanists with a unique opportunity to investigate human mobility in cities with a level of specificity that is not always easily available. The way in which Citibikes capture data allows urban scholars to see where people are going, and when.
For this exercise, I used Citibike data in New York City (and New Jersey) from November 2021. The goals here are to get some general insights into Citibike utilization, to gather some descriptive information about which Citibike stations see the most traffic, and ultimately to map some of this data using origin-destination analysis with NetworkX.
For starters, there were a total of 2,151,486 Citibike trips taken in November 2021. Over three-fourths of these trips were taken by Citibike members, who pay a monthly fee for access to Citibikes. The rest of the rides were taken by non-members, who pay based on how far they go and how long they ride.
We can also see some interesting trends in ridership throughout the month. Based on the graph below, ridership at the beginning of the month is noticeably higher than it is toward the end of the month, with the number of trips bottoming out around the 26th. This is likely due to the holiday season, as many people in the city opt to spend time with family for Thanksgiving (November 25 in 2021), but the drop-off still feels very large.
On average, in November, people rode their Citibikes for 14.36 minutes. Interestingly enough, the average ride duration peaks around the 26th, at the same time as the lull in the number of rides near the end of the month. I’m not sure what this means, but it would be interesting to explore in future analysis!
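The summary figures above (member share, trips per day, average ride duration) can be reproduced with a few lines of pandas. The following is a minimal sketch, not the exact code behind this post: it assumes the trip file uses the column names `started_at`, `ended_at`, and `member_casual`, and the three rows are made up purely so the example runs on its own.

```python
import pandas as pd

# Made-up stand-in for the November 2021 trip file. The column names are an
# assumption about the public Citi Bike CSV layout; the rows are illustrative.
trips = pd.DataFrame({
    "started_at": ["2021-11-01 08:00:00", "2021-11-01 09:00:00", "2021-11-26 10:00:00"],
    "ended_at": ["2021-11-01 08:10:00", "2021-11-01 09:20:00", "2021-11-26 10:30:00"],
    "member_casual": ["member", "member", "casual"],
})
trips["started_at"] = pd.to_datetime(trips["started_at"])
trips["ended_at"] = pd.to_datetime(trips["ended_at"])

# Ride duration in minutes.
trips["duration_min"] = (trips["ended_at"] - trips["started_at"]).dt.total_seconds() / 60

# Share of trips taken by members.
member_share = (trips["member_casual"] == "member").mean()

# Trips per day and average duration per day.
daily = trips.groupby(trips["started_at"].dt.date).agg(
    n_trips=("duration_min", "size"),
    mean_duration_min=("duration_min", "mean"),
)
print(member_share)
print(daily)
```

With the real file, `trips` would come from `pd.read_csv` instead of the inline frame, and plotting the two columns of `daily` would give the ridership and duration curves described above.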
Origin-destination analysis allows researchers to identify patterns in travel behavior in a certain location, or over a period of time. This type of analysis requires two data sets: node data and connection (edge) data. For this analysis, I first created a data frame containing the node information; for Citibikes, the station locations are the nodes. Next, I created a data frame of weighted edges, built from the start-station and end-station information of each individual trip. Using these two data frames, I can identify the stations that have the most traffic (inbound and outbound). I can also compute a few different measures of centrality to see how these nodes rank against each other under different criteria.
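As a rough illustration of the node and edge data frames described above, here is a minimal sketch. The column names (`start_station_name`, `end_station_name`, `start_lat`, `start_lng`) are assumptions about the trip file, the stations "A", "B", and "C" and their coordinates are hypothetical, and the choice of a directed graph with trip counts as edge weights is one reasonable reading of the approach rather than a confirmed detail.

```python
import pandas as pd
import networkx as nx

# Made-up trips between three hypothetical stations.
trips = pd.DataFrame({
    "start_station_name": ["A", "A", "B", "C", "A"],
    "end_station_name": ["B", "B", "A", "A", "C"],
    "start_lat": [40.70, 40.70, 40.71, 40.72, 40.70],
    "start_lng": [-74.00, -74.00, -73.99, -73.98, -74.00],
})

# Node table: one row per station with its coordinates. For brevity this only
# looks at start stations; the real data would also need the end stations.
nodes = (
    trips[["start_station_name", "start_lat", "start_lng"]]
    .drop_duplicates("start_station_name")
    .rename(columns={"start_station_name": "station",
                     "start_lat": "lat", "start_lng": "lng"})
)

# Edge table: one row per (origin, destination) pair, weighted by trip count.
edges = (
    trips.groupby(["start_station_name", "end_station_name"])
    .size()
    .reset_index(name="weight")
)

# Directed graph so that A -> B and B -> A are kept as separate flows.
G = nx.from_pandas_edgelist(
    edges,
    source="start_station_name",
    target="end_station_name",
    edge_attr="weight",
    create_using=nx.DiGraph,
)

# Attach coordinates to the nodes so they can be used for mapping later.
nx.set_node_attributes(G, nodes.set_index("station")[["lat", "lng"]].to_dict("index"))
print(G.number_of_nodes(), G.number_of_edges())
```

Aggregating trips into weighted edges before building the graph keeps it small: two million trips collapse into one edge per station pair.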
The table below represents the 10 Citibike stations that received the most riders (meaning the most trips ended at these stations).
The table below represents the 10 Citibike stations from which the most riders originated their trip.
The table below represents the 10 Citibike stations with the highest total number of trips (inbound and outbound).
Although there is some slight variation in their order, nearly all of the same 10 stations populate the top spots for all three of these breakdowns. Of note, the Citibike station at 1st Ave & E 68th Street is ranked number one for all three. Frankly, I find this a little surprising, as this station is located on the Upper East Side, which is typically not the neighborhood that comes to mind when I think of high-traffic areas. Perhaps its proximity to Weill Cornell Hospital, Rockefeller University, and parts of Queens is part of what’s contributing to its ranking, but that is purely speculation.
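The three rankings discussed above (inbound, outbound, and total trips) correspond to the weighted in-degree and out-degree of each node. Here is a minimal sketch, assuming a directed NetworkX graph whose edge attribute `weight` holds the trip count; the stations and counts are made up, and `top_n` is a hypothetical helper.

```python
import networkx as nx

# Hypothetical weighted origin-destination graph: (origin, destination, trips).
G = nx.DiGraph()
G.add_weighted_edges_from([
    ("A", "B", 5),
    ("B", "A", 3),
    ("C", "A", 4),
    ("A", "C", 1),
    ("B", "C", 2),
])

# Weighted in-degree = trips ending at a station,
# weighted out-degree = trips starting at a station.
inbound = dict(G.in_degree(weight="weight"))
outbound = dict(G.out_degree(weight="weight"))
total = {station: inbound[station] + outbound[station] for station in G}

def top_n(counts, n=10):
    """Return the n stations with the highest counts, largest first."""
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:n]

print(top_n(inbound))
print(top_n(outbound))
print(top_n(total))
```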
We computed three different measures of centrality: degree centrality, which looks purely at the number of links attached to a node; eigenvector centrality, which measures how “influential” a node is by looking at how many other high-valued nodes it is connected to; and closeness centrality, which measures how central a node is relative to the other nodes in the network. The results can be found below:
Based on these results, the station at 1st Ave & E 68th Street is still number one for all three measures of centrality. In fact, the top 10 stations are the same for all three measures, and in the same order. What’s important to note, however, is that while there is some overlap, the top-ranked stations by centrality are different from the top-ranked stations by total number of inbound and outbound rides!
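For reference, all three measures are available as built-in NetworkX functions. The sketch below runs them on a tiny made-up graph. Whether the original analysis treated the network as directed or weighted is not stated, so the undirected, unweighted graph here is an assumption chosen to keep the example simple.

```python
import networkx as nx

# Hypothetical station network: "A" is connected to every other station.
G = nx.Graph()
G.add_edges_from([("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")])

# Fraction of the other nodes each node is directly connected to.
degree = nx.degree_centrality(G)

# High when a node is connected to other high-scoring nodes.
eigenvector = nx.eigenvector_centrality(G, max_iter=1000)

# Inverse of the average shortest-path distance to all other nodes.
closeness = nx.closeness_centrality(G)

for name, scores in [("degree", degree),
                     ("eigenvector", eigenvector),
                     ("closeness", closeness)]:
    ranked = sorted(scores, key=scores.get, reverse=True)
    print(name, ranked)
```

Because these measures ignore edge weights by default, a station can rank highly on centrality without being at the top for trip counts, which is consistent with the difference between the two sets of rankings noted above.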
We are able to visualize all of the nodes and connections for Citibike rides in November 2021, which can be seen below.
However, because of the number of nodes and connections, it is nearly impossible to derive much meaning from this map. In order to make this more easily digestible, it’s helpful to take a look at one specific station, which can be seen below.
I chose the station E 17 St & Broadway because it was one of the top-ranked stations, but not the top-ranked station. The resulting visual below is much easier to read, and makes it possible to see where riders are coming from when they end up at this station, and where they are going when they depart from it.
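One way to isolate a single station like this is to keep only the edges that start or end at it. The following is a minimal sketch under the same assumptions as before (a directed graph with trip counts stored in `weight`); the station names and counts are placeholders, not values from the data.

```python
import networkx as nx

# Hypothetical weighted origin-destination graph: (origin, destination, trips).
G = nx.DiGraph()
G.add_weighted_edges_from([
    ("A", "B", 5),
    ("B", "A", 3),
    ("C", "A", 4),
    ("B", "C", 2),
    ("C", "D", 1),
])

station = "A"  # placeholder for the station of interest

# Keep only the edges that arrive at or depart from the chosen station.
touching = list(G.in_edges(station)) + list(G.out_edges(station))
sub = G.edge_subgraph(touching)

# Where riders come from, and where they go, with trip counts.
origins = {u: d["weight"] for u, _, d in G.in_edges(station, data=True)}
destinations = {v: d["weight"] for _, v, d in G.out_edges(station, data=True)}

print(sorted(sub.nodes))
print(origins, destinations)
```

The subgraph `sub` could then be passed to `nx.draw` with a `pos` dictionary of station coordinates to produce a map like the one described above.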