Problem Description: what is the question you want to answer and how you plan to answer it. State the questions/tasks you want to answer/complete your project (note that these may very well evolve during the course of your project).
Deploying, rebalancing, and servicing bikes are crucial parts of a bike-sharing business. CitiBike presumably has its own methods for rebalancing bikes at the end of each business day, but the details of its rebalancing operation are not available to the public.
I am curious to understand how bikes are distributed across the city throughout the day. If I were a ground operator responsible for balancing bikes, I would want to know the following:
1) What are the most popular departure and destination stations?
2) What are the common origination and arrival area clusters (not just based on Zip Code)?
3) Which 10 stations have the fewest bikes at the end of the day (11 pm) over the month of September?
4) For each of the stations from question 3), which 3 nearby stations have the most bikes?
Data: indicate the data you identified as available and suitable to answer the question and why that data is suitable to answer your question. Include a description of the anticipated processing and transformations you plan to make on this data
I am going to use the September 2017 CitiBike data to support this analysis. Here are some preliminary ideas on data processing and transformation:
- Extract day of week and hour from the timestamp
- Count the number of trips, grouped by starting station and by ending station
- Create a table of bike station, hour, total number of departures, total number of arrivals, number of bikes at the dock, and latitude/longitude, assuming each station has 20 bikes at the beginning of the day (12 a.m.)
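The processing steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the tuple layout for a trip, the function name `hourly_stock`, and the sample trips are all hypothetical and do not reflect the actual CitiBike file format. The 20-bike starting stock comes from the assumption listed above, and because it is a simplification, the computed stock can go negative at busy stations.

```python
from collections import defaultdict
from datetime import datetime

START_STOCK = 20  # assumption from the proposal: 20 bikes per station at 12 a.m.


def hourly_stock(trips, start_stock=START_STOCK):
    """Return {(station, date): [stock after hour 0, ..., stock after hour 23]}.

    `trips` is an iterable of (start_station, start_time, end_station, end_time)
    tuples with datetime values (a hypothetical layout chosen for this sketch).
    """
    departures = defaultdict(int)  # (station, date, hour) -> number of departures
    arrivals = defaultdict(int)    # (station, date, hour) -> number of arrivals
    station_days = set()
    for start_station, start_time, end_station, end_time in trips:
        departures[(start_station, start_time.date(), start_time.hour)] += 1
        arrivals[(end_station, end_time.date(), end_time.hour)] += 1
        station_days.add((start_station, start_time.date()))
        station_days.add((end_station, end_time.date()))

    stock = {}
    for station, day in station_days:
        level = start_stock
        levels = []
        for hour in range(24):
            key = (station, day, hour)
            # Running balance: bikes arriving add to the dock, departures remove.
            level += arrivals.get(key, 0) - departures.get(key, 0)
            levels.append(level)
        stock[(station, day)] = levels
    return stock


# Made-up sample trips for two stations on one day.
trips = [
    ("A", datetime(2017, 9, 1, 8, 5), "B", datetime(2017, 9, 1, 8, 20)),
    ("A", datetime(2017, 9, 1, 8, 40), "B", datetime(2017, 9, 1, 9, 2)),
    ("B", datetime(2017, 9, 1, 17, 30), "A", datetime(2017, 9, 1, 17, 50)),
]
stock = hourly_stock(trips)
day = datetime(2017, 9, 1).date()
print(stock[("A", day)][22])  # stock at station A as of 11 pm
```

In this sketch, index 22 of each list is the stock after all trips in the 10 pm hour, which corresponds to the 11 pm snapshot; day of week is available from `start_time.weekday()` if needed.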
Analysis: what analytical tools and methodology you envision to use to answer the question
A set of simple quantitative analyses, such as aggregation, counting, and ranking, is sufficient to answer questions 1), 3), and 4). I want to try k-means clustering to address question 2). To find common departure and arrival areas, I will perform k-means clustering based on the number of departures and arrivals as well as latitude and longitude.
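As a rough illustration of the clustering idea, here is a minimal k-means sketch in plain Python. Several things are assumptions made for the example and not part of the plan above: the features are standardized first so that trip counts do not dominate latitude and longitude, the initial centroids are picked with a simple deterministic farthest-point rule instead of random seeding, and the station values are made up. The function names are hypothetical.

```python
import math


def standardize(rows):
    """Scale each column to zero mean and unit variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # Fall back to 1.0 when a column is constant to avoid dividing by zero.
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


def init_centroids(points, k):
    """Farthest-point seeding: a deterministic stand-in for random seeding."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points,
                       key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(list(farthest))
    return centroids


def kmeans(points, k, iterations=100):
    """Lloyd's algorithm: alternate assignment and centroid update steps."""
    centroids = init_centroids(points, k)
    labels = None
    for _ in range(iterations):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centroids


# Made-up stations: (latitude, longitude, departures, arrivals).
stations = [
    (40.70, -74.01, 500, 480),
    (40.71, -74.00, 520, 510),
    (40.72, -74.01, 490, 505),
    (40.80, -73.95, 50, 60),
    (40.81, -73.96, 40, 55),
    (40.79, -73.95, 60, 45),
]
labels, centroids = kmeans(standardize(stations), k=2)
print(labels)
```

The choice of k and the relative weighting of location versus trip counts would need to be explored on the real data; this sketch only shows the mechanics.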
References: include information about papers, reports, existing work or other references that are related to your project. At this stage you do not have to have studied these references, but you must be familiar enough with the proposal idea to have identified resources that will support and guide your analysis.
Deliverable: what is the deliverable you expect to produce (a statistical conclusion, a graphical tool, an algorithm that can be used in the future e.g. by agencies, etc.)
My deliverables will include Carto visualizations of the origination and arrival clusters, maps of the stations that have low and high bike stock at the end of the day, and graphical summaries (e.g. bar charts) of the quantitative analyses.