Predicting Unmet Trip Demand
- Midterm Progress Report

Urban Science Intensive II - Summer 2017

Team Members: Anita Ahmed, Alexey Kalinin, Pooneh Famili, Xin Tang, and Ziman (Kay) Zhou
Faculty Advisors: Dr. Huy T Vo, Dr. Kaan Ozbay
Project Sponsor: NYC Taxi and Limousine Commission

Problem Definition

The New York City Taxi and Limousine Commission (TLC) declares in its mission statement, “The mission of the Taxi and Limousine Commission is to ensure that New Yorkers and visitors to the City have access to taxicabs, car services, and commuter van services that are safe, efficient, sufficiently plentiful, and provide a good passenger experience.” (TLC2017MissionStatement) However, neither the 100,000 For Hire Vehicles nor the new Borough Taxi program (also known as the Street Hail Livery program), which has licensed thousands of green Borough Taxis to serve areas of New York not commonly served by yellow medallion cabs, has resolved the issue of unmet trip demand. Unmet trip demand is a situation in which New Yorkers or tourists would like to take a taxi but have difficulty doing so and must spend more than 5 minutes finding one. It usually occurs in areas of New York City (NYC) historically underserved by the taxi industry. The major objective of this Capstone project is to determine the contributing factors and develop sound metrics to reveal unmet-demand locations across New York City.

Our client is the Taxi and Limousine Commission, the agency responsible for licensing and regulating New York City’s medallion (yellow) taxicabs, for-hire vehicles (community-based liveries, black cars and luxury limousines), commuter vans, and paratransit vehicles. The client is looking for reproducible and reasonably interpretable metrics that could help identify unmet trip demand using existing taxi data. One of the challenges we faced is that TLC is very cautious about sharing data because of strict internal policies and regulations. Nevertheless, the client is willing to collaborate and has offered some solutions for working with sensitive data. On our side, we will use the CUSP Data Facility “yellow environment” to work with TLC data and ensure that all data handling is secure.

Previous work

Last year this capstone was undertaken by another group of CUSP students. Their approach was to tackle the problem from the observed demand and supply point of view, that is, using the available taxi occupancy and vacancy data to predict unmet demand. They broke the taxi data down to Building Block Level (BBL) and per minute. Then, for each BBL, they calculated the number of pickups and the number of vacant taxis per minute. The ratio of pickups to supply was defined as unmet demand. Their method does not fully answer the question of defining unmet demand, since it only captures demand where taxis are available and ignores the possibility that some NYC neighborhoods have demand but no taxi supply; that demand is not captured in their project.

Literature review

Existing studies that analyzed questions similar to our research agenda explored the relationship between taxi supply and passenger trip demand in cities around the world. A study on the Asian market revealed spatio-temporal patterns that help taxi drivers spend less time cruising (Powell 2011). Another example, a demand prediction at the city-district scale in Munich, produced time-variant demand predictions for individual districts (Jäger 2016). The Mineta National Transit Consortium presented a report on taxi demand across time and space using GPS data from taxis (Yang 2016).

Inspired by several studies that applied simulation methods to find patterns and derive formulae for processes like ours, and based on the characteristics of our data, we developed the second approach by assuming that the number of taxi records marked with free status in a given time interval in a given Census Tract follows a Poisson distribution. Lee et al., in an article in Communications in Statistics - Simulation and Computation, modeled cyclic behavior in a nonhomogeneous Poisson process (NHPP) (Lee 1991). Later, Larry Leemis introduced a way to estimate the mean \(\lambda(t)\) of an NHPP using a linear function (Leemis 2003). In the book Simulation, published in 2013, Sheldon Ross further verified that a Poisson process with a time-dependent arrival rate function \(\lambda(t)\) is appropriate for studies associated with human activities (Ross 2013).

In our case specifically, the data points (records of taxis with free status) satisfy the following two conditions:

  1. the numbers of free-status records in two disjoint time intervals are independent;

  2. the probability of an occurrence of such a taxi record during a small time interval is proportional to the length of that time interval.

Therefore, the free taxi records can be treated as Poisson random variables. A similar method was developed in the study “Where to wait for a taxi?”, which focused more on drivers’ behavior through their driving patterns (Zheng 2012). Because we focus on the counts in each area rather than the exact paths the taxi drivers took, we used the time intervals directly and compared the real average waiting time to the expected waiting time to find underserved zones. Later, if needed, we might use the probabilities as outputs and exclude parked vacant taxis when counting the free records.

Poulsen et al. compared Green Taxi against Uber pickup data to locate the areas of NYC where TLC loses the greatest share of the market (Poulsen 2016). The comparison was performed on data collected between April and September of 2014. The authors performed the following data processing and analysis:

  1. Merging Green Taxi and Uber data with zip code shapefiles through geo-processing, and grouping the pickup counts per zip code;

  2. Splitting the data into time periods: by week of year, by hour of the day, by weekday/weekend, and by demographics;

  3. Plotting choropleth maps and searching for significant differences.

The major findings of the article were:

  • Demand for Green cabs is still growing, but the number of Uber rides in the same areas is growing more rapidly.

  • In relatively poor neighborhoods, Green cabs are performing better than Uber.

  • No differences between Green Taxi and Uber were found in weekday/weekend patterns.

Zhang et al. applied several clustering techniques to GPS data of taxi pickups in order to differentiate areas according to demand (Zhang 2016). The data used in the research were collected in Shanghai from April 1, 2015 to April 30, 2015 by Shanghai Qiangsheng Intelligent Navigation Technology Company and contain GPS coordinates and timestamps of individual taxis. The approach was as follows:

  1. Extract pick-ups and drop-offs, calculate distributions.

  2. Cluster different locations of the city by pick-up distributions.

  3. Analyze clusters, determine hot-spots.

As a result of the research, the authors identified the hot spots of the city and introduced a modification of the DBSCAN clustering algorithm that improved the performance of DBSCAN by 10%. Based on the papers mentioned above, we developed the third approach, which compares Uber pickups with combined Yellow and Green taxi pickups.

The literature reviewed above and previous research studies helped us develop three major approaches to identify unmet trip demand across NYC based on Census Tract properties, using combined data for Medallion, Street Hail Vehicles, and For-hire Vehicles (Uber) taxi services, and to search for metrics that can help the Taxi and Limousine Commission update current policy and make taxis more accessible across NYC.


Identification of potential underserved regions

Three approaches have been derived for this task: finding places with large discrepancies between yellow/green taxi drop-offs and pickups; finding places where the wait-time for a taxi is too long; and finding places where Uber pickups suggest demand that yellow and green taxis do not fully serve. By combining the results, we will be able to spot and rank the specific census tracts with potential unmet taxi demand.

These three metrics use three major datasets: Breadcrumb, Lion Street, and the shapefile of Census Tracts. The study region covers Manhattan and the other four boroughs; the counts are aggregated over different evaluation durations. The evaluation durations reflect several important time ranges across NYC that affect taxi supply and passenger trip demand. Time ranges for Mondays through Saturdays include the daytime rush hours (peaks) from 6am to 10am, midday from 10am to 3:30pm, the taxi shift change from 3:30pm to 4:30pm, the evening rush hours (peaks) from 4:30pm to 6pm and from 6pm to 9pm, and night from 9pm to 6am. For Sundays, several time ranges (12am to 2am, 2am to 2pm, 2pm to 4pm, and 4pm to 12am) have been considered for this study. Note that the term “weekdays” in the report and maps refers to Mondays through Saturdays.
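As an illustration only, the time ranges listed above could be assigned to a timestamp along the following lines. The function name, the labels, and the choice of half-open intervals (start included, end excluded) are assumptions made for this sketch and are not taken from the project code.

```python
from datetime import datetime, time

# Monday-Saturday ranges from the text, as (label, start, end).
# Labels are hypothetical; intervals are assumed half-open [start, end).
WEEKDAY_RANGES = [
    ("wd_0600_1000", time(6, 0), time(10, 0)),
    ("wd_1000_1530", time(10, 0), time(15, 30)),
    ("wd_1530_1630", time(15, 30), time(16, 30)),
    ("wd_1630_1800", time(16, 30), time(18, 0)),
    ("wd_1800_2100", time(18, 0), time(21, 0)),
]

# Sunday ranges from the text; 4pm-12am is handled as the fallback.
SUNDAY_RANGES = [
    ("sun_0000_0200", time(0, 0), time(2, 0)),
    ("sun_0200_1400", time(2, 0), time(14, 0)),
    ("sun_1400_1600", time(14, 0), time(16, 0)),
]


def evaluation_duration(ts: datetime) -> str:
    """Return the label of the evaluation duration that contains ts."""
    is_sunday = ts.weekday() == 6  # Monday is 0, Sunday is 6
    ranges = SUNDAY_RANGES if is_sunday else WEEKDAY_RANGES
    current = ts.time()
    for label, start, end in ranges:
        if start <= current < end:
            return label
    # Anything not matched is the wrap-around range:
    # 4pm-12am on Sundays, 9pm-6am on the other days.
    return "sun_1600_2400" if is_sunday else "wd_2100_0600"
```

For example, under these assumptions a ping at 8:00 on Monday, January 5, 2015 would fall into the 6am-10am range, and a ping at 18:00 on Sunday, January 4, 2015 into the Sunday 4pm-12am range.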

Approach #1 - Comparing Drop-offs and Pick-ups (yellow + green cabs)

This approach aims to identify underserved areas by finding census tracts with a considerable number of taxi trips (yellow and green) in which the number of pickups is significantly lower than the number of drop-offs. For a given area, a certain level of drop-offs suggests a demand for taxis associated with that area, and a much smaller number of pickups shows that more taxis are coming into the place than going out. This indicates that it may be harder for people traveling from this place to get a taxi, which is potentially “unmet demand”. We set several partition criteria (90%, 75%, 50%, and 25%) against which to compare the ratio of pickups to drop-offs in each Census Tract for each evaluation duration. The criterion can be expressed as follows:

\begin{equation} PickupCounts < c \times DropoffCounts \nonumber \end{equation}

As shown, for a Census Tract that reaches a certain number of taxi trips, the smaller the ratio \(c\), the larger the discrepancy and the more likely the area is underserved. As there are many places with small \(c\) ratios, we found it reasonable to use 25% as the partition criterion. The maps in Figure 1.1 show the partition of areas based on whether pickups reach 25% of drop-offs in each CT during weekdays (including Saturdays) in January 2015. Light orange represents areas meeting the criterion, whereas the burgundy regions indicate potential unmet demand. Although there are generally more pickups during rush hours, pickups in regions such as north west Bronx, north Queens, south west/east Brooklyn, and north east Staten Island remain under 25% of drop-offs.

To ensure the comparability of different Census Tracts, we took both the ratio and the absolute difference in counts into consideration. As shown in Figure 1.2, for the 12-6am evaluation duration, the colors in the first map represent the different ratio intervals, except for brown and dark, which represent areas with no information or drop-off count. The darker the color, the more similar the pickup and drop-off counts. In red areas, pickups are more than 90% of drop-offs; in white areas, pickups make up less than 25% of drop-offs.

By taking into account both the taxi activity and the difference between drop-off and pickup counts, we avoid treating a ratio of 1/3 the same as 10/30 (Figure 1.2). If the ratio \(c\) in a CT is small and the difference in counts is substantial, then it is more convincing that unmet demand occurs in that area. Because the map indicates that many areas in the outer boroughs have small differences in counts even though they appear in white in the first map, no conclusion can be drawn for these regions from this approach.

Overlaying the ratio map (with 25% as the partition criterion) and the corresponding difference map (with a count of 30 as the threshold), we obtained the Census Tracts with potential unmet demand for each evaluation duration. Figure 1.3 shows two example output maps illustrating the patterns during 12-6am and 6-10am on weekdays. The areas in red are CTs satisfying both conditions and are the findings of this approach. Assuming that during 12-6am people in most places in NYC request drop-offs rather than pickups, we focused more on the daytime and evening peak hours, when the demand for pickups is expected to be similar to that for drop-offs in each borough, especially the outer ones. Places such as the south west edge of Brooklyn, north Queens, and north Bronx are marked as regions with “potential unmet demand”.
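The combination of the ratio criterion and the difference threshold can be sketched as follows. This is a minimal illustration, not the project implementation: the function name, the input layout (a mapping from tract ID to pickup and drop-off counts), and the use of "at least 30" for the difference threshold are assumptions.

```python
def flag_unmet_demand(counts, c=0.25, min_difference=30):
    """Return the sorted tract IDs that satisfy both conditions.

    counts maps a tract ID to a (pickups, dropoffs) pair for one
    evaluation duration. A tract is flagged when
      pickups < c * dropoffs              (ratio criterion), and
      dropoffs - pickups >= min_difference (difference threshold;
      whether the bound is inclusive is an assumption of this sketch).
    """
    flagged = []
    for tract_id, (pickups, dropoffs) in counts.items():
        small_ratio = pickups < c * dropoffs
        large_gap = (dropoffs - pickups) >= min_difference
        if small_ratio and large_gap:
            flagged.append(tract_id)
    return sorted(flagged)


# Hypothetical counts: tracts A and B have the same ratio (0.2),
# but only B has enough activity to pass the difference threshold.
example = {"A": (1, 5), "B": (10, 50), "C": (40, 45)}
flagged_tracts = flag_unmet_demand(example)  # ["B"]
```

Tract A illustrates why the ratio alone is not sufficient: its ratio matches that of tract B, but the difference of 4 trips is too small to support any conclusion.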

Approach #2 - Average wait-time for free taxis across NYC Census Tracts

In Approach 2, the methodology is to evaluate the average waiting time for a taxi in each census tract. In the yellow/green cab breadcrumb data, each ping was recorded every 2 minutes; it is therefore reasonable to assume that whenever a free taxi status was captured, the taxi had already been vacant for the past 2 minutes (under the ideal condition in which everything is evenly distributed). Each record of a free taxi status thus indicates availability during those 2 minutes. We therefore treated each “free status” record independently, disregarding the associated taxi ID, as if every record represented a different taxi. The free minutes corresponding to each record then equal the record interval, and the total free minutes can be expressed as follows:

\begin{equation} TotalFreeTaxiMinutes = \displaystyle\sum_{t\in T}({FreeCount_{t}}\times{RecordInterval_{t}}) \nonumber \end{equation}

where \(T\) is the evaluation duration examined and \(t\) is the moment of a recorded ping within \(T\). In our case, each record interval is a constant equal to 2 minutes; therefore we derived

\begin{equation} TotalFreeTaxiMinutes = \displaystyle 2\times\sum_{t\in T}{FreeCount_{t}} \nonumber \end{equation}

Under this assumption, in an evaluation duration (e.g., 8-10am), if the average waiting time for a taxi in a census tract is expected to be \(N\) minutes, then for every \(N\) minutes of that evaluation period there should be at least 1 free minute observed. Equivalently, to meet the wait-time expectation, the ratio of the evaluation duration to the total free minutes of all the free states recorded within that duration should be at most \(N\). The real wait-time can be expressed as follows:

\begin{equation} RealAverageWaitingTime = \frac{EvaluationDuration}{TotalFreeTaxiMinutes} \nonumber \end{equation}

Considering that the density of taxis is much higher in Manhattan than in the other boroughs, we compared the wait-times in each Census Tract (CT) and each evaluation duration against different expected values: 2, 4, 6, 8, and 10 minutes.

By comparing the real wait-time to the expected wait-times, we were able to identify the potentially underserved census tracts in which the waiting time was significantly longer than expected, and to rank the census tracts based on different levels of expected wait-time.
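The wait-time metric can be worked through in a few lines. The sketch below follows the formulas of this approach with a constant 2-minute record interval; the function names, the treatment of tracts with no free records (infinite wait), and the ranking rule that returns the smallest expected wait-time a tract satisfies are assumptions made for illustration.

```python
RECORD_INTERVAL_MIN = 2  # breadcrumb pings are recorded every 2 minutes


def real_average_wait(free_counts, duration_min,
                      record_interval=RECORD_INTERVAL_MIN):
    """Evaluation duration divided by total free taxi minutes.

    free_counts holds the number of free-status records per ping time
    within the evaluation duration for one census tract.
    """
    total_free_minutes = record_interval * sum(free_counts)
    if total_free_minutes == 0:
        # Assumption: no free records at all means an unbounded wait.
        return float("inf")
    return duration_min / total_free_minutes


def wait_level(wait, thresholds=(2, 4, 6, 8, 10)):
    """Return the smallest expected wait-time that the real wait
    does not exceed, or None if it exceeds every threshold."""
    for expected in thresholds:
        if wait <= expected:
            return expected
    return None


# Hypothetical tract: 20 free records during a 240-minute duration
# give 40 free minutes, so the real average wait is 240 / 40 = 6.0.
wait = real_average_wait([5, 5, 10], duration_min=240)
level = wait_level(wait)  # 6
```

A tract for which `wait_level` returns `None` would exceed even the 10-minute expectation and would rank among the most underserved under this sketch.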

The expected outcome for Approach #2 is a series of maps indicating underserved and ranked CTs that clearly show areas across NYC where unmet demand possibly exists. Our initial findings are a series of maps that show the ratio of the real average waiting time to the expected waiting time in each CT. These computations were done on data from January 2015. Counts were divided into 11 batches showing the number of records with free status across NYC Census Tracts in each evaluation duration.

First, we plotted the busy/free ratio (Figures 2.1 and 2.2) to evaluate the relationship between free taxis and busy taxis and thereby assess possible unmet demand. However, because most Manhattan areas had higher busy/free ratios in the plots of 10-minute intervals (e.g., 08:20 - 08:30, 08:40 - 08:50, etc.) between 08:00 and 09:10 in January 2015, we refined the metric into the “Average Wait-time” model shown above for more insight.

According to Figure 2.3 and Figure 2.4, the average wait-times in different time intervals, such as 12am to 6am and 6am to 10am, show a similar pattern: in most areas the average waiting time was short, which indicates that passengers generally do not need to wait long to get a taxi in those areas during those time intervals.

Approach #3 - Compare Uber pickups vs Yellow+Green

In Approach 3, the goal is to analyze and compare Uber pickups with combined Yellow and Green pickups for each census tract. Our assumption is that if there are areas that have been underserved by Yellow and Green cabs, that demand has been addressed by Uber. Since Uber relies on a mobile app rather than traditional street hailing, it is more accessible to users.

For Uber trip records we have 6 consecutive months of data available, from April 2014 to September 2014. The data set contains the pickup datetime and the latitude and longitude of the pickup location. First, using geo-processing, the data set is merged with the NYC Census Tract 2010 shapefile, so that each pickup data point is assigned to a census tract. Then the data are aggregated by month and by time segment, as described in the “Identification of potential underserved regions” section, for each census tract. This generates 6 data points (1 for each month) for each time segment in each census tract.

For Yellow and Green trip records we use pre-processed breadcrumb data for 2015. This data set contains the following fields, recorded every 2 minutes for each vehicle: vehicle ID, ping time, X-Y coordinates, occupancy (number of passengers), and the IDs of up to 3 nearest streets within 150 ft of the vehicle location. To make a comparison with the available Uber data, we decided to use 6 consecutive months of data, from April 2015 to September 2015. Because of limited data availability we are using different years, but we kept the same months for consistency. To process the data, we used a Spark filter to select the data for each month. After this initial filtering, we compared the occupancy column for two consecutive records with the same vehicle ID; if the occupancy changed from 0 to any other number, we counted it as a pickup and stored it. After collecting all the pickup records, we merged them with the Lion census tract data based on the closest street ID in the breadcrumb data. Once merged, each pickup record has a right and a left census tract assigned to it. Then we split the data into time segments, as described in the “Identification of potential underserved regions” section, for each census tract and compute the aggregated counts. Finally, for each time segment we combine the counts for all 6 months, which yields an individual file per time segment with pickup counts for each census tract for every month from April to September, giving 6 data points for every census tract.
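The pickup-extraction rule (occupancy changing from 0 to any other number between two consecutive records of the same vehicle) can be illustrated in plain Python. The project itself used Spark; the sketch below is a small standalone stand-in, and the function name and the `(vehicle_id, ping_time, occupancy)` tuple layout are assumptions.

```python
from itertools import groupby
from operator import itemgetter


def extract_pickups(records):
    """Return (vehicle_id, ping_time) for every ping at which a
    vehicle's occupancy changes from 0 to a non-zero value.

    records is an iterable of (vehicle_id, ping_time, occupancy)
    tuples in any order; this layout is hypothetical.
    """
    pickups = []
    # Sort by vehicle, then by time, so that consecutive records of
    # the same vehicle are adjacent.
    ordered = sorted(records, key=itemgetter(0, 1))
    for vehicle_id, pings in groupby(ordered, key=itemgetter(0)):
        previous_occupancy = None  # first ping has no predecessor
        for _, ping_time, occupancy in pings:
            if previous_occupancy == 0 and occupancy != 0:
                pickups.append((vehicle_id, ping_time))
            previous_occupancy = occupancy
    return pickups


# Hypothetical pings: vehicle "a" picks up at t=2 and t=8;
# vehicle "b" starts occupied, so no pickup is counted for it.
pings = [
    ("a", 0, 0), ("a", 2, 1), ("a", 4, 1), ("a", 6, 0), ("a", 8, 2),
    ("b", 0, 1), ("b", 2, 0),
]
found = extract_pickups(pings)  # [("a", 2), ("a", 8)]
```

Note that a vehicle whose first record is already occupied contributes no pickup under this rule, because there is no preceding record with occupancy 0.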

After processing is complete for both the Uber and the Yellow+Green datasets, we fit a regression line to the 6 data points for each census tract. We then identify and map the census tracts that saw a regression over the 6 months relative to Uber. These are the census tracts identified as having unmet demand.
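Fitting a regression line to six monthly counts amounts to an ordinary least-squares slope per census tract. The sketch below computes that slope for both datasets; the function names and input layout are assumptions, and the sketch deliberately stops at reporting the two slopes, since the exact rule for comparing them is not spelled out here.

```python
def ols_slope(values):
    """Least-squares slope of values against month index 0, 1, 2, ..."""
    n = len(values)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    return sxy / sxx


def tract_trends(uber_counts, taxi_counts):
    """Return {tract_id: (uber_slope, taxi_slope)} for tracts that
    appear in both inputs.

    Each input maps a tract ID to its list of monthly pickup counts
    (April to September) for one time segment; layout is hypothetical.
    """
    shared = sorted(set(uber_counts) & set(taxi_counts))
    return {
        tract: (ols_slope(uber_counts[tract]), ols_slope(taxi_counts[tract]))
        for tract in shared
    }


# Hypothetical tract whose Uber pickups grow by 2 per month while
# Yellow+Green pickups shrink by 2 per month.
trends = tract_trends(
    {"t1": [10, 12, 14, 16, 18, 20]},
    {"t1": [30, 28, 26, 24, 22, 20]},
)
```

One possible reading, stated here only as an assumption, is that tracts with a rising Uber slope and a flat or falling Yellow+Green slope are the candidates to map.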

Data Description

In order to conduct our analysis, we used data from various sources, including both open data and restricted data. Access to the restricted data in the restricted green and yellow environments is provided through the CUSP compute remote desktop. Restricted data include the following:

  • Yellow taxicabs (TPEP) – trip records, breadcrumbs, rate4 and shift from two vendors i.e. VTS and CMT.

  • Green taxicabs (LPEP) – trip records, breadcrumbs, rate4 and shift from two vendors i.e. VTS and CMT.

  • For-hire vehicles (FHV) (various file formats)

  • E-Hail requests – requests made for yellow and green taxicab services, along with some information related to Uber trip records.

  • Wheelchair Accessible Vehicle (WAV) – for yellow and green taxicab vehicles specifically

The data acquired from open data sources cover the following:

  • TLC data

  • Subway locations across NYC

  • Weather

  • NYC crime data

  • American Community Survey (ACS) – demographic and socioeconomic information

  • Lion Street, Pluto & MapPluto Data – with land use data

Next steps

So far our team has worked on the three approaches. There are multiple tasks to do in the next 5 weeks. First, we have to dig deeper into the literature to find out how widely each approach has been used by scholars in this field. Based on our understanding of the literature, we will assign a weight to each approach. We will then overlay the weighted results from each approach and select the top ten census tracts with the highest weight for each evaluation duration. After narrowing our study down to those 10 census tracts, we will take into account five features:

  • Employment rate in a census tract (since employees’ commute times are concentrated in rush hours, we will investigate this feature in the “rush hour” time slots)

  • Population density at census tract level

  • Commercial square footage per census tract

  • Car ownership rate

  • Median Rent per census tract

One way we are currently considering to investigate the role of these features in the rate of unmet taxi demand is multivariate regression, using these features as “\(x\)” and comparing the “\(y\)” with the numbers we get from our final calculation. The other task is to estimate the number of taxis demanded in each time slot for the top ten census tracts we focus on. We will develop a model that estimates the number of taxis for each time slot for our focus census tracts and that, based on the features we investigate, could be generalized.

Finally, we plan to produce an interactive map that shows the number of taxis demanded for each census tract for all of the time slots (8 time slots for weekdays and 4 time slots for weekends) on which our study is based.

Analyzing Noise Complaints in New York City

Ziman (Kay) Zhou, Kristin Korsberg, Claire (Xueqi) Huang, Maisha Lopa

Spatial-temporal analysis of New York City’s 311 Noise Complaints revealed that hourly noise complaint trends can be clustered into three distinct groups and identified 11 demographic and land use features which significantly affect the number of complaints New York City receives. In addition, noise complaints exhibit strong spatial-autocorrelation tendencies.






Noise in New York City remains an under-studied phenomenon despite being a key area of concern for many residents. In 2016 alone, the city’s 311 system, which houses all complaints, received 405,282 noise service requests. The data collected are fairly robust, but existing research using those data falls short in several respects. For example, it does not quantify noise using basic standardization methods, such as normalizing for population. Additionally, few publications study noise complaint trends through space and time. There is considerable room for novel and interesting analysis.

Figure 1.1

Figure 1.2

Figure 1.3

Figure 2.1

Figure 2.2

Figure 2.3

Figure 2.4


  1. TLC Taxi and Limousine Commission, 2017.

  2. Jason W. Powell, Yan Huang, Favyen Bastani, Minhe Ji. Towards Reducing Taxicab Cruising Time Using Spatio-Temporal Profitability Maps. In Advances in Spatial and Temporal Databases, 242–260. Springer Berlin Heidelberg, 2011.

  3. Benedikt Jäger, Michael Wittmann, Markus Lienkamp. Analyzing and Modeling a City’s Spatiotemporal Taxi Supply and Demand: A Case Study for Munich. Journal of Traffic and Logistics Engineering. EJournal Publishing, 2016.

  4. Ci Yang, Eric J. Gonzales. Modeling Taxi Demand and Supply in New York City Using Large-Scale Taxi GPS Data. In Springer Geography, 405–425. Springer International Publishing, 2016.

  5. Sanghoon Lee, James R. Wilson, Melba M. Crawford. Modeling and simulation of a nonhomogeneous Poisson process having cyclic behavior. Communications in Statistics - Simulation and Computation 20, 777–809. Informa UK Limited, 1991.

  6. Larry Leemis. Estimating and Simulating Nonhomogeneous Poisson Processes. 2003.

  7. Sheldon Ross. The Discrete Event Simulation Approach. In Simulation, 111–134. Elsevier, 2013.

  8. Xudong Zheng, Xiao Liang, Ke Xu. Where to wait for a taxi?. In Proceedings of the ACM SIGKDD International Workshop on Urban Computing - UrbComp 12. ACM Press, 2012.

  9. Lasse Korsholm Poulsen, Daan Dekkers, Nicolaas Wagenaar, Wesley Snijders, Ben Lewinsky, Raghava Rao Mukkamala, Ravi Vatrapu. Green Cabs vs. Uber in New York City. In Big Data (BigData Congress), 2016 IEEE International Congress on, 222–229. 2016.

  10. Lu Zhang, Cailian Chen, Yiyin Wang, Xinping Guan. Exploiting Taxi Demand Hotspots Based on Vehicular Big Data Analytics. In Vehicular Technology Conference (VTC-Fall), 2016 IEEE 84th, 1–5. 2016.
