Temporal and Weather Data Analysis of NYC Yellow Taxi Ridership Demand

Abstract:

Since the NYC Taxi & Limousine Commission released its detailed historical dataset covering over 1.1 billion individual taxi trips from January 2009 through June 2016, the data have been studied extensively. Many data scientists have examined this dataset to learn about the city’s neighborhoods, nightlife, airport traffic, and more. This contribution studies the likelihood of long taxi trips during the day and at night, as well as the relationship between weather and taxi demand. The study examines NYC yellow taxi trips over two months of 2016 (January and June) in terms of temporal factors and weather conditions. The results show that long trips were more likely to occur at night than during the day, and that snow depth strongly affects the demand for taxi trips, whereas precipitation shows no evident correlation with demand.

Keywords: NYC, yellow taxi, data, demand

Introduction:

It is generally known that Friday and Saturday nights are the busiest periods of the week, and that bad weather can drive up taxi demand. This paper aims to study how frequently longer taxi trips occur at different times of the day. Studying taxi ridership data could help car services allocate vehicles based on weekly predictions, and could reduce traffic by helping taxi drivers better understand the best times to work. Companies such as Uber or Lyft could use such findings to optimize the allocation of their drivers, maximizing efficiency and improving their personal car services.

Data:

Taxi trip data:

Source: NYC Taxi & Limousine Commission - NYC.gov

Because the study involves time series analysis, it relies on the fields in the NYC.gov taxi data that record when each trip occurred (both the pick-up time and the drop-off time), along with the total fare and the distance of each trip. Pandas data frames were used throughout the data processing.

1. Read the January 2016 taxi trip data into a pandas dataframe using read_csv, then randomly chose a subset of 20000 rows.

2. Cleaned the data by dropping meaningless rows, such as rows with identical pickup and drop-off times or a trip fare of zero.

3. Visualized the data using Seaborn to get a general idea of the distance and fare distributions, and identified outliers.

4. Divided the data into two categories by time of day: daytime (from 6:00 to 18:00) and nighttime (from 18:00 to 6:00). Chose a 2-sigma threshold on the taxi fare amount to identify the longer trips.

5. Grouped and counted the time series data by day of the week. The absolute counts were normalized, taking statistical error into account.

6. Repeated the process for the June 2016 data.
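
The steps above can be sketched in pandas as follows. This is a minimal illustration under stated assumptions, not the code used in the study. The column names (tpep_pickup_datetime, tpep_dropoff_datetime, fare_amount) are assumed to match the TLC CSV files. The helper names, the reading of the 2-sigma threshold as "fare above the mean plus two standard deviations", and the Poisson counting error sqrt(n_long) / n_trips are all assumptions introduced for this example. The Seaborn visualization in step 3 is omitted.

```python
import numpy as np
import pandas as pd

PICKUP = "tpep_pickup_datetime"    # assumed TLC column name
DROPOFF = "tpep_dropoff_datetime"  # assumed TLC column name
FARE = "fare_amount"               # assumed TLC column name


def load_sample(path: str, n: int = 20000, seed: int = 0) -> pd.DataFrame:
    """Step 1: read one month of trips and draw a random subset of n rows."""
    trips = pd.read_csv(path, parse_dates=[PICKUP, DROPOFF])
    return trips.sample(n=min(n, len(trips)), random_state=seed)


def clean_trips(trips: pd.DataFrame) -> pd.DataFrame:
    """Step 2: drop rows with identical pickup/drop-off times or a zero fare."""
    same_time = trips[PICKUP] == trips[DROPOFF]
    zero_fare = trips[FARE] == 0
    return trips.loc[~same_time & ~zero_fare].copy()


def label_trips(trips: pd.DataFrame, n_sigma: float = 2.0) -> pd.DataFrame:
    """Step 4: tag each trip as day/night and flag 'long' trips by fare."""
    out = trips.copy()
    hour = out[PICKUP].dt.hour
    # Daytime is 6:00 up to (not including) 18:00; everything else is night.
    out["period"] = np.where((hour >= 6) & (hour < 18), "day", "night")
    out["weekday"] = out[PICKUP].dt.day_name()
    # Assumed reading of the 2-sigma cut: fare > mean + n_sigma * std.
    threshold = out[FARE].mean() + n_sigma * out[FARE].std()
    out["is_long"] = out[FARE] > threshold
    return out


def long_trip_share(labeled: pd.DataFrame) -> pd.DataFrame:
    """Step 5: per weekday and period, the share of long trips with an error."""
    grouped = labeled.groupby(["weekday", "period"])["is_long"]
    summary = grouped.agg(n_trips="count", n_long="sum")
    summary["share"] = summary["n_long"] / summary["n_trips"]
    # Assumed error model: Poisson counting error on the long-trip count.
    summary["share_err"] = np.sqrt(summary["n_long"]) / summary["n_trips"]
    return summary
```

A call such as long_trip_share(label_trips(clean_trips(load_sample(path)))) would run the sequence for one month, where path is a placeholder for a locally downloaded file. Step 6 then amounts to repeating the call with the June 2016 file.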