Proposal by: Christopher Streich, Streich676, cjs676
Problem Description: I want to determine which explanatory variables
have the greatest impact on whether a person uses the New York City
Citi-Bike bike share program to commute to work. I will explore and
quantify which of the following factors is the greatest determinant of
a person’s choice: distance to work, income level, industry, distance
from subway stations/bus lines, and age.
Data: The major dataset I wish to use is the Longitudinal
Employer-Household Dynamics Origin-Destination Employment Statistics
(LODES) from the US Census for the year 2014. As stated by the US
Census Bureau, “Data files are…organized into three types:
Origin-Destination (OD), Residence Area Characteristics (RAC), and
Workplace Area Characteristics (WAC), all at census block…detail.”
Since the LODES dataset is aggregated at the Census block level, I will
also need to pull in the Census Tracts shapefile. Finally, I will need
to use Citi-Bike use data from the year 2014, the most current year
available for New York in LODES.
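Rolling the block-level LODES records up to tracts can be sketched as below. This is a minimal illustration on toy data, not the final pipeline. It assumes the RAC file is read with the geocode kept as a string (for example with dtype=str) and uses the LODES column names h_geocode (15-digit home block code) and C000 (total jobs); the job counts shown are made up.

```python
import pandas as pd

# Toy stand-in for a LODES RAC extract; real files have many more columns.
# The job counts are invented for illustration.
rac_blocks = pd.DataFrame(
    {
        "h_geocode": [
            "360610001001000",
            "360610001001001",
            "360610002011000",
        ],
        "C000": [12, 8, 30],
    }
)

# A block code is state (2) + county (3) + tract (6) + block (4) digits,
# so the first 11 characters identify the tract.
rac_blocks["tract"] = rac_blocks["h_geocode"].str[:11]

# Sum the block-level job counts within each tract.
rac_tracts = rac_blocks.groupby("tract", as_index=False)["C000"].sum()
print(rac_tracts)
```

The same prefix rule would apply to w_geocode in the WAC and OD files, which gives a tract key that can be matched to the tract shapefile.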
I’ve chosen to limit the datasets to the island of Manhattan for these
reasons:
The population density is relatively high so the Census blocks and
tracts are smaller, which may allow finer granularity.
The combination of a grid-street system on an island makes
straight-line distance calculations less problematic.
There is more overlap in the data between Citi-Bike and LODES for the
year 2014 since the program started in Manhattan.
Methodology/Analysis: I will treat each Citi-Bike station as a point
and intersect it with the Census tracts so that the LODES data can be
joined to the stations. I will perform this intersect twice, once for
the RAC data and once for the WAC data. This process assigns each
station to a Census tract. I will then find aggregate commute distances
between Census tracts and use them as an independent variable to
explain the number of rides starting in each Census tract, and thus at
each station.
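One possible way to compute an aggregate commute distance is a job-weighted mean of straight-line distances from each home tract to its work tracts. The sketch below assumes tract centroids are available as latitude/longitude pairs and that OD flows have already been summed to tract pairs. The function names, tract ids, and coordinates are hypothetical, and the haversine formula stands in for whatever distance measure is finally chosen.

```python
from collections import defaultdict
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = (
        sin(dlat / 2) ** 2
        + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def mean_commute_km(centroids, flows):
    """Job-weighted mean straight-line commute distance per home tract.

    centroids: {tract_id: (lat, lon)}
    flows: iterable of (home_tract, work_tract, jobs)
    """
    weighted = defaultdict(float)
    jobs_total = defaultdict(float)
    for home, work, jobs in flows:
        dist = haversine_km(*centroids[home], *centroids[work])
        weighted[home] += dist * jobs
        jobs_total[home] += jobs
    return {t: weighted[t] / jobs_total[t] for t in weighted if jobs_total[t] > 0}


# Hypothetical tract ids, centroids, and job counts, for illustration only.
centroids = {"A": (40.70, -74.01), "B": (40.75, -73.99), "C": (40.80, -73.96)}
flows = [("A", "B", 10), ("A", "C", 30), ("B", "B", 5)]
print(mean_commute_km(centroids, flows))
```

Weighting by job count keeps a handful of long but rare commutes from dominating a tract's average.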
Parsing out only the commuting trips will be a challenge, as trip purpose is not explicitly captured by the Citi-Bike data. To find the trips that are most likely commuting trips I will need to make some assumptions. First, I will limit my analysis to Citi-Bike subscribers only. Commuting is a repeated activity, and I will assume that those with subscriptions use their subscription more than once. This also assumes that tourists and those who do not plan to make Citi-Bike part of their commute are more likely to buy one-off trips. Additionally, I will limit the analysis to trips that occur on weekdays (Monday through Friday) between the hours of 5:00 AM and 12:00 PM. In this regard I am assuming that the first trip people take in the morning is to work.
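The filtering assumptions above can be sketched with pandas. This is a hedged illustration on toy trips: the column names starttime, usertype, and start_station_id and the "Subscriber" label are assumptions about the trip export, and treating noon as an exclusive upper bound is a choice made here, not something the proposal specifies.

```python
import pandas as pd

# Toy trips; column names and values are assumptions for illustration.
trips = pd.DataFrame(
    {
        "starttime": pd.to_datetime(
            [
                "2014-06-02 08:15:00",  # Monday morning, subscriber
                "2014-06-02 17:30:00",  # Monday evening, subscriber
                "2014-06-07 09:00:00",  # Saturday morning, subscriber
                "2014-06-03 07:45:00",  # Tuesday morning, one-off customer
                "2014-06-04 11:59:00",  # Wednesday, just before noon
            ]
        ),
        "usertype": [
            "Subscriber",
            "Subscriber",
            "Subscriber",
            "Customer",
            "Subscriber",
        ],
        "start_station_id": [72, 72, 79, 79, 82],
    }
)

is_subscriber = trips["usertype"] == "Subscriber"
is_weekday = trips["starttime"].dt.dayofweek < 5  # Monday=0 ... Friday=4
is_morning = trips["starttime"].dt.hour.between(5, 11)  # 5:00 AM to 11:59 AM

# Keep only trips that satisfy all three commuting assumptions.
commute_trips = trips[is_subscriber & is_weekday & is_morning]

# Count likely commuting trips starting at each station.
rides_per_station = commute_trips.groupby("start_station_id").size()
print(rides_per_station)
```

Keeping the three conditions as separate boolean masks makes it easy to relax one assumption at a time and see how sensitive the ride counts are to it.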
I will then use regression analysis to find explanatory variables and
identify which is the most important factor for choosing Citi-Bike as a
means of commuting. Since I am not looking at general trends over time,
I do not believe that time-series analysis is necessary. This may prove
incorrect.
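One way to compare the importance of explanatory variables is to fit an ordinary least squares model on standardised variables and compare the sizes of the coefficients. The sketch below uses synthetic data with only two stand-in predictors. The variable names, the generating equation, and the choice of standardised OLS are all assumptions made for illustration, not the model the analysis will necessarily use.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins for tract-level predictors (not real data).
distance_km = rng.uniform(0.5, 10.0, n)
income_k = rng.uniform(20.0, 200.0, n)
rides = 50.0 - 4.0 * distance_km + 0.05 * income_k + rng.normal(0.0, 1.0, n)

X = np.column_stack([distance_km, income_k])

# Standardise predictors and response so coefficients are comparable in size.
Xz = (X - X.mean(axis=0)) / X.std(axis=0)
yz = (rides - rides.mean()) / rides.std()

# Ordinary least squares; no intercept is needed after centring.
beta, *_ = np.linalg.lstsq(Xz, yz, rcond=None)

for name, b in zip(["distance_km", "income_k"], beta):
    print(f"{name}: {b:+.3f}")
```

In this synthetic setup the distance coefficient is larger in magnitude, which is how the "most important factor" would be read off; with real data, correlated predictors would make that ranking less clear-cut.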
I plan to primarily use a Jupyter Notebook with the geopandas library
to carry out this experiment. I may use ArcGIS to produce
better-quality maps and to perform some of the analysis.
References: I have found the following papers and analyses, which
relate to my proposal.