Transferability Study of Video Tracking Optimization for Traffic Data Collection and Analysis
Despite extensive studies of the performance of video sensors and computer vision algorithms, these systems are usually calibrated by trial and error using small datasets and incomplete metrics such as raw detection rates. Systematic calibration of tracking parameters remains widely lacking in the literature.
This study proposes to improve automatic traffic data collection by optimizing tracking parameters with a genetic algorithm, comparing tracked road user trajectories to manually annotated ground truth data using Multiple Object Tracking Accuracy (MOTA) and Multiple Object Tracking Precision (MOTP) as the primary measures of performance. The optimization procedure is first performed on training data and then validated by applying the resulting parameters to non-training data. A number of problematic tracking and visibility conditions are tested using five camera views selected for their differences in weather conditions, camera resolution, camera angle, tracking distance, and site properties. The transferability of the optimized parameters is verified by evaluating the performance of the optimization across these data samples.
Results indicate that significant improvements can be obtained through parametrization. Winter weather conditions require a specialized and distinct set of parameters to reach an acceptable level of performance, while higher-resolution cameras are less sensitive to the optimization process and perform well with most sets of parameters. Average spot speeds are found to be insensitive to MOTA, while traffic counts are strongly affected.
The use of video data for automatic traffic data collection and analysis has been on an upward trend as more powerful computational tools and detection and tracking technologies become available. Video sensors have long been able to emulate inductive loops to collect basic traffic variables such as counts and speeds, as in the commercial system Autoscope (Michalopoulos 1991), but they can also provide increasingly accurate higher-level information regarding road user behaviour and interactions. Examples include pedestrian gait parameters (Saunier 2011), crowd dynamics (Johansson 2008) and surrogate safety analysis applied to motorized and non-motorized road users in various road facilities (St-Aubin 2013, Sakshaug 2010, Autey 2012). Video sensors are relatively inexpensive and easy to install, or are already installed, for example by transportation agencies for traffic monitoring: large datasets can therefore be collected for large-scale or long-term traffic analysis. This so-called “big data” phenomenon offers opportunities to better understand transportation systems, while presenting its own set of challenges for data analysis (St-Aubin 2015).
Despite the undeniable progress of video sensors and computer vision algorithms in their varied transportation applications, there persists a distinct lack of large comparisons of the performance of video sensors in varied conditions, defined for example by the complexity of the traffic scene (movements and mix of road users), the characteristics of cameras (Wan 2014) and their installation (height, angle), the environmental conditions (e.g. the weather) (Fu 2015), etc. This is particularly hampered by the poor characterization of the datasets used for performance evaluation and the limited availability of benchmarks and public video datasets for transportation applications (Saunier 2014). Tracking performance is often reported using ad hoc and incomplete metrics such as “detection rates” instead of detailed, standardised, and more suitable metrics such as CLEAR MOT (Bernardin 2008). Finally, the computer vision algorithms are typically adjusted manually by trial and error using a small dataset covering few of the conditions affecting performance, and the performance evaluated on that same dataset is thus over-estimated. Compared to other fields such as machine learning, it should be clear that the algorithms should be systematically optimized on a calibration dataset, while performance should be reported on a separate validation dataset (Ettehadieh 2015).
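To make the CLEAR MOT metrics concrete, the following is a minimal sketch of how MOTA and MOTP are computed from per-frame error counts, following their definitions in (Bernardin 2008); the dictionary keys are illustrative and do not correspond to any particular library's API:

```python
def clear_mot(frames):
    """CLEAR MOT metrics (Bernardin 2008) from per-frame error counts.

    Each frame is a dict with the number of missed objects (FN), false
    positives (FP), identity mismatches (IDSW), ground-truth objects (GT),
    plus the number of matched object-hypothesis pairs and their summed
    position error. Then:
        MOTA = 1 - (FN + FP + IDSW) / GT
        MOTP = total matching distance / total matches
    """
    fn = sum(f["misses"] for f in frames)
    fp = sum(f["false_positives"] for f in frames)
    idsw = sum(f["mismatches"] for f in frames)
    gt = sum(f["ground_truth"] for f in frames)
    dist = sum(f["total_distance"] for f in frames)
    matches = sum(f["matches"] for f in frames)
    mota = 1.0 - (fn + fp + idsw) / gt
    motp = dist / matches if matches else float("nan")
    return mota, motp
```

Note that MOTA is bounded above by 1 (perfect tracking) but is unbounded below, since false positives can exceed the number of ground-truth objects.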
While the performance of video sensors for simpler traffic data collection systems has been extensively studied, not all factors have been systematically analyzed, and issues with parameter optimization and the lack of separate calibration and validation datasets are widespread. Moreover, the relationship between tracking performance and the accuracy of derived traffic parameters has never been fully investigated.
The first objective of this paper is to improve the performance of existing automated detection and tracking methods for video data in terms of tracking accuracy. This is done through the optimization of tracking parameters using a genetic algorithm that compares the tracker output with manually annotated trajectories. The method is applied to a set of traffic videos extracted from a large surrogate safety study of roundabout merging zones (St-Aubin 2015), covering factors such as the distance of road users to the camera, camera type, camera resolution and weather conditions. The second objective is to study the relationship between tracking accuracy, its optimization, and different kinds of traffic data such as counts and speeds. The third and last objective is to explore the transferability of parameters to separate datasets with the same properties (consecutive video samples) and across different properties, by reporting how optimizing tracking for one condition impacts tracking performance for the other conditions. As a follow-up to (Ettehadieh 2015), this paper investigates more factors and how tracking performance is related to the accuracy of traffic parameters. This paper is organized as follows: the next section provides a brief overview of the current state of computer vision and calibration in traffic applications; the methodology is then presented in detail, including the ground truth inventory, measures of performance and calibration procedure; finally, the last two sections discuss the results of the tracking optimization procedure and conclusions regarding ideal tracking conditions and associated parameter sets.
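The calibration procedure described above can be sketched as a standard genetic algorithm loop. In the sketch below, the parameter names, bounds, and the surrogate fitness function are purely illustrative (they are not the actual tracker configuration keys); in the real procedure, `evaluate` would run the tracker with the candidate parameters on the training video and return MOTA against the annotated ground truth:

```python
import random

# Illustrative tracking parameters and bounds (hypothetical names).
BOUNDS = {"connection_distance": (1.0, 10.0),
          "segmentation_distance": (0.5, 5.0),
          "min_feature_time": (2, 50)}

def evaluate(params):
    """Stand-in fitness for illustration: a smooth surrogate peaking at a
    known point. In the actual procedure this would run the tracker and
    return MOTA computed against manually annotated trajectories."""
    target = {"connection_distance": 4.0, "segmentation_distance": 1.5,
              "min_feature_time": 20}
    return -sum((params[k] - target[k]) ** 2 for k in params)

def random_individual():
    return {k: random.uniform(lo, hi) for k, (lo, hi) in BOUNDS.items()}

def mutate(ind, rate=0.2):
    """Perturb each parameter with small Gaussian noise, clamped to bounds."""
    child = dict(ind)
    for k, (lo, hi) in BOUNDS.items():
        if random.random() < rate:
            child[k] = min(hi, max(lo, child[k] + random.gauss(0, (hi - lo) * 0.1)))
    return child

def crossover(a, b):
    """Uniform crossover: each parameter inherited from either parent."""
    return {k: random.choice((a[k], b[k])) for k in a}

def calibrate(generations=20, pop_size=12, elite=2):
    """Evolve a population of parameter sets, keeping the top performers."""
    pop = [random_individual() for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=evaluate, reverse=True)
        parents = scored[:pop_size // 2]
        pop = scored[:elite] + [mutate(crossover(*random.sample(parents, 2)))
                                for _ in range(pop_size - elite)]
    return max(pop, key=evaluate)
```

Validation then consists of running the tracker with the best parameter set on a separate video sample and reporting MOTA/MOTP on that held-out data only.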
Computer vision is used extensively in traffic applications as an instrument of data collection and monitoring. Cameras and computer vision are slowly being implemented on-board motorised vehicles as part of the sensor suite necessary for vehicle automation, including advanced driver assistance systems (e.g. pedestrian-vehicle collision avoidance (Llorca 2009), vehicle overtaking (Milanés 2012)) and optical camera communications systems (Ifthekhar 2015). For traffic engineers, the two primary applications of computer vision using stationary cameras are vehicle presence detection systems (sometimes referred to as virtual loops) and motion tracking. Presence detection has widespread commercial application due to its relatively high reliability, on par with embedded sensor technology such as inductive loops; its primary application is in providing traffic counts, queue lengths, and basic presence detection (Hoose 1990) for traffic engineering tasks ranging from data collection to traffic light control and optimisation.
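The virtual loop concept is simple enough to sketch directly: a detection zone is declared occupied when enough of its pixels are classified as foreground by background subtraction. The function below is a minimal illustration (the zone encoding and threshold are assumptions, not any commercial system's interface); vehicle counts can then be derived from the rising edges of the occupancy signal:

```python
def virtual_loop_occupied(foreground, zone, threshold=0.2):
    """Presence detection over a rectangular 'virtual loop'.

    foreground: 2-D list of 0/1 values from background subtraction,
    zone: (row0, row1, col0, col1) half-open rectangle in pixel coordinates,
    threshold: fraction of foreground pixels needed to declare presence.
    """
    r0, r1, c0, c1 = zone
    total = on = 0
    for row in foreground[r0:r1]:
        for v in row[c0:c1]:
            total += 1
            on += v
    return total > 0 and on / total >= threshold
```

A count is registered each time the signal transitions from unoccupied to occupied, emulating the pulse output of an inductive loop.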
Motion tracking is a more complex application which aims to extract the road users’ trajectories continuously with great precision, i.e. their position for every video frame, within the camera field of view, from which velocity, acceleration, and a number of other traffic behaviour measures may be derived. Due to the increased complexity of tracking, it is generally considered less reliable than presence detection systems. There are three main categories of tracking methods:
tracking by detection, which typically relies on background subtraction to detect foreground objects and appearance-based object classification (Zangenehpour 2015);
tracking using flow, in which distinctive image features are tracked from frame to frame and grouped into road user hypotheses (feature-based tracking);
tracking with probability, based on Bayesian tracking frameworks such as Kalman or particle filtering.
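The association step at the core of tracking by detection can be illustrated with a deliberately simplified sketch: per-frame detections (here reduced to centroids, as if produced by background subtraction) are greedily linked to the nearest active track. Real trackers use more robust data association and motion models; the function and threshold below are illustrative only:

```python
import math
from itertools import count

def track_by_detection(frames, max_distance=30.0):
    """Minimal nearest-neighbour tracking-by-detection sketch.

    frames: list of per-frame detection lists, each detection an (x, y)
    centroid in pixels. Each detection is greedily linked to the closest
    active track within max_distance; unmatched detections start new tracks.
    Returns a dict mapping track id to a list of (frame, x, y) observations.
    """
    next_id = count()
    tracks = {}    # id -> list of (frame, x, y)
    last_pos = {}  # id -> most recent (x, y)
    for t, detections in enumerate(frames):
        unmatched = set(last_pos)  # tracks not yet linked in this frame
        for (x, y) in detections:
            best, best_d = None, max_distance
            for tid in unmatched:
                lx, ly = last_pos[tid]
                d = math.hypot(x - lx, y - ly)
                if d < best_d:
                    best, best_d = tid, d
            if best is None:
                best = next(next_id)  # no match: start a new track
            else:
                unmatched.discard(best)
            tracks.setdefault(best, []).append((t, x, y))
            last_pos[best] = (x, y)
    return tracks
```

Each tracking failure mode penalized by MOTA appears naturally here: a missed detection shortens a track (miss), a spurious foreground blob starts a false track (false positive), and two nearby road users swapping nearest neighbours produces an identity mismatch.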
The NGSIM project was one of the first large-scale video data collection projects making use of semi-automated vehicle tracking from freeway and urban arterial video data to obtain vehicle trajectories for traffic model calibration (Kim 2005). Surrogate safety analysis also makes use of trajectory data, for example in the early SAVEME project (Ervin 2000, Gordon 2012), and more recently in extensive open source projects such as Traffic Intelligence (Saunier 2006, Jackson 2013).
Work on optimizing the parametrization of the various trackers is sparse; parameters are usually set manually based on experimental results. The instances of automated calibration in (Sidla 2006) and (Ali 2009) used Adaboost training strictly for shape detectors, and (Pérez 2006) used evolutionary optimization for the segmentation step. One of the only cases of systematic improvement of the tracking method as a whole through evolutionary algorithms was carried out recently at Polytechnique Montréal (Ettehadieh 2015): the current work shares similarities with it, such as the use of MOTA for optimization, but this paper deals with motorized traffic instead of pedestrians and further investigates the transferability of calibrated parameters, not only for the same camera view, but across different types of cameras, camera views, and visibility/weather conditions. It should be noted that tracking optimisation deals primarily with errors intrinsic to the tracking algorithms. The other major challenge of computer vision accuracy involves potential issues with line-of-sight and other optical effects. Various strategies have been formulated to deal with issues of occlusion. Partial occlusion has been shown to be corrected via object decomposition (Winn 2006,