Fare Analysis

Data Given!
We had the NYC Taxi data, a huge dataset with information on a large number of trips: around 143 million rows. Each row in the dataset contains

  1. Latitude and Longitude of Pickup location

  2. Latitude and Longitude of Dropoff location

  3. Distance travelled (tripdist), in miles

  4. Duration of trip (duration), in seconds

  5. Fare (totamt), in US dollars

  6. Location (location), the pickup and dropoff pair combined into one string

  7. Timezone (timezone), an hour-of-day index from 0 to 23. For example, timezone 6 = time between 5:30 am and 6:30 am
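The timezone index above can be sketched as a small function. This is only an illustration: the function name is hypothetical, and the boundary handling (5:30 am inclusive, 6:30 am exclusive, with timezone 0 wrapping around midnight) is an assumption that the source does not state.

```python
from datetime import datetime


def timezone_bucket(ts: datetime) -> int:
    """Map a local pickup time to an hour-of-day index in 0..23.

    Assumption: each bucket is centred on the hour, starting at HH-1:30
    (inclusive) and ending at HH:30 (exclusive), so 5:30-6:29 maps to 6
    and 23:30-0:29 maps to 0.
    """
    minutes = ts.hour * 60 + ts.minute
    # Shift by 30 minutes so the bucket boundary falls on the half hour.
    return ((minutes + 30) // 60) % 24
```

For example, a pickup at 5:45 am falls in timezone 6 under this assumption, and a pickup at 6:30 am falls in timezone 7.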

From this dataset we can get a list of pickup-dropoff pairs, and for these pairs we can obtain the fare amount charged by companies like Uber and Lyft.
After this we can compare fares, analyse how Uber earns its revenue, and learn more about dynamic surcharge pricing.

Data to Collect!
The next task was to collect the fare data for all these OD (origin-destination) pairs. We use the Uber API price endpoint “GET /v1/estimates/price” to collect the fare data for all these locations. We found that the Uber API does not take the timezone as a parameter; it takes just four parameters

  1. start_latitude

  2. start_longitude

  3. end_latitude

  4. end_longitude
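A minimal sketch of how a request with these four parameters might be assembled. The endpoint path and parameter names come from the text above; the host, the "Authorization: Token ..." header and the function name are assumptions for illustration, and no request is actually sent here.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.uber.com"  # assumed host, not stated in the text
ENDPOINT = "/v1/estimates/price"   # endpoint named in the text


def build_price_request(start_lat, start_lng, end_lat, end_lng, server_token):
    """Return (url, headers) for one price-estimate request."""
    params = {
        "start_latitude": start_lat,
        "start_longitude": start_lng,
        "end_latitude": end_lat,
        "end_longitude": end_lng,
    }
    url = f"{BASE_URL}{ENDPOINT}?{urlencode(params)}"
    # Assumed authentication scheme: one server token (API key) per request.
    headers = {"Authorization": f"Token {server_token}"}
    return url, headers
```

The returned pair can be passed to any HTTP client. Note that there is no time parameter, which is why the request has to be made at the time of day of interest.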

The Uber API gives the result for the current time, that is, the time at which the request was made. Each request returns a JSON response containing the details for all the different Uber services, namely uberPOOL, uberX, uberXL, uberFAMILY, UberBLACK and UberSUV.

To handle the huge dataset and to run the API requests at the specified time, the dataset was divided into smaller parts and all the code was run in parallel. A large number of Uber API keys (around 100) were used to collect the data. The target was to collect the data for a given timezone in one hour or less.
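The splitting and parallel run described above could look roughly like the sketch below. It assumes one chunk of locations per API key and a thread pool with one worker per key; the function names and the `fetch` callable (which would perform the actual request) are hypothetical, and the original code may have been organised differently.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_chunks):
    """Split items into n_chunks contiguous parts of nearly equal size."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def collect_chunk(locations, api_key, fetch):
    """Query every location in one chunk with a single API key."""
    return {loc: fetch(loc, api_key) for loc in locations}


def collect_parallel(locations, api_keys, fetch):
    """Run one worker per API key, each on its own chunk of locations."""
    chunks = split_into_chunks(locations, len(api_keys))
    results = {}
    with ThreadPoolExecutor(max_workers=len(api_keys)) as pool:
        futures = [
            pool.submit(collect_chunk, chunk, key, fetch)
            for chunk, key in zip(chunks, api_keys)
        ]
        for future in futures:
            results.update(future.result())
    return results
```

Giving each key its own chunk keeps the per-key request rate independent of the total number of locations, which is one plausible reason for using many keys.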
For the initial comparison we collect data for four timezones: 6, 10, 16 and 20. As we are 9 hours 30 minutes ahead of NYC (IST), the code has to be started at the corresponding IST time. We collect the list of locations for each timezone using the following query, which selects only the locations:
  select location from (
    select location, count(location) as cnt, avg(tripdist) avgdist, avg(duration) avgtime, avg(totamt) avgfare
    from nyctaxi
    where pickup != dropoff and duration > 0 and tripdist > 0 and pathdistkey > 0 and timezone = x
    group by location
    having cnt >= 5
    order by cnt desc
  );
(x = 6, 10, 16, 20) Here 5 is the threshold we have chosen: only locations with a frequency of at least 5, that is, trips that occur at least 5 times in the dataset for that timezone, are considered. Some of the statistics are
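To see the effect of the threshold, the selection can be tried on a tiny in-memory table. This is a simplified sketch using Python's sqlite3 module: the table layout, the sample rows and the location strings are made up for illustration, and the query is flattened into a single select that should return the same locations as the query above.

```python
import sqlite3

# Flattened variant of the selection query; ? placeholders are the
# timezone (x) and the frequency threshold.
QUERY = """
SELECT location
FROM nyctaxi
WHERE pickup != dropoff AND duration > 0 AND tripdist > 0
  AND pathdistkey > 0 AND timezone = ?
GROUP BY location
HAVING COUNT(location) >= ?
ORDER BY COUNT(location) DESC
"""


def frequent_locations(conn, timezone, min_count=5):
    """Return locations seen at least min_count times in one timezone."""
    rows = conn.execute(QUERY, (timezone, min_count)).fetchall()
    return [row[0] for row in rows]


def demo_connection():
    """Build a small in-memory table with hypothetical sample trips."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE nyctaxi (location TEXT, pickup TEXT, dropoff TEXT, "
        "tripdist REAL, duration REAL, totamt REAL, "
        "pathdistkey INTEGER, timezone INTEGER)"
    )
    rows = (
        [("A->B", "A", "B", 1.2, 300, 8.5, 1, 6)] * 5    # kept: count 5
        + [("A->C", "A", "C", 2.0, 500, 11.0, 2, 6)] * 4  # dropped: count 4
        + [("A->D", "A", "D", 3.0, 700, 14.0, 3, 6)] * 7  # kept: count 7
        + [("B->B", "B", "B", 0.5, 100, 4.0, 4, 6)] * 6   # dropped: same point
    )
    conn.executemany("INSERT INTO nyctaxi VALUES (?,?,?,?,?,?,?,?)", rows)
    return conn
```

Calling `frequent_locations(demo_connection(), 6)` keeps only the two pairs that reach the threshold, most frequent first.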

  1. Timezone 6: 63409 locations (time approx. 13 minutes) (6 am in NYC = 3:30 PM IST) (start code at 3 PM IST)

  2. Timezone 10: 210065 locations (time approx. 42 minutes) (10 am in NYC = 7:30 PM IST) (start code at 7 PM IST)

  3. Timezone 16: 169463 locations (time approx. 34 minutes) (4 pm in NYC = 1:30 AM IST) (start code at 1 AM IST)

  4. Timezone 20: 251318 locations (time approx. 51 minutes) (8 pm in NYC = 5:30 AM IST) (start code at 5 AM IST)
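The start times and run times in the list above follow from simple arithmetic, sketched below. The 9 hours 30 minutes offset is the one stated in the text; the function names are hypothetical, and the assumption is that each run starts at the beginning of the timezone window (half an hour before the full hour).

```python
IST_AHEAD_MINUTES = 9 * 60 + 30  # offset between NYC and IST stated in the text


def ist_start_time(bucket: int) -> str:
    """IST clock time at which the window of a NYC timezone index begins."""
    nyc_start = bucket * 60 - 30  # the window starts 30 minutes before the hour
    ist_minutes = (nyc_start + IST_AHEAD_MINUTES) % (24 * 60)
    return f"{ist_minutes // 60:02d}:{ist_minutes % 60:02d}"


def locations_per_second(n_locations: int, minutes: float) -> float:
    """Average collection rate implied by a location count and a run time."""
    return n_locations / (minutes * 60)
```

For timezone 6 this gives a start time of 15:00 IST, matching the list, and the four runs all work out to roughly 81 to 83 locations per second across all parallel workers.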

A Python scheduler was used to run the code at the specified time. The result is dumped into a JSON file in which each key is a location (that is, an OD pair) and each value is the JSON response received for it.
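A minimal sketch of the scheduling and dumping step, assuming the standard-library sched module (the text does not say which scheduler was actually used). The function names, the file path and the shape of the stored response are hypothetical.

```python
import json
import sched
import time


def dump_results(results, path):
    """Write {location: api_response} to a JSON file."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(results, fh)


def run_at(start_epoch, job, *args):
    """Block until start_epoch (seconds since the epoch), then run job(*args)."""
    scheduler = sched.scheduler(time.time, time.sleep)
    scheduler.enterabs(start_epoch, 1, job, argument=args)
    scheduler.run()
```

In a real run, `start_epoch` would be the epoch time of the desired IST start, and `job` would be the collection routine followed by `dump_results`.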