# Transferability Study of Video Tracking Optimization for Traffic Data Collection and Analysis

Abstract

Despite the extensive studies on the performance of video sensors and computer vision algorithms, calibration of these systems is usually done by trial and error using small datasets and incomplete metrics such as raw detection rates. There is a widespread lack of systematic calibration of tracking parameters in the literature.

This study proposes an improvement in automatic traffic data collection through the optimization of tracking parameters using a genetic algorithm by comparing tracked road user trajectories to manually annotated ground truth data with Multiple Object Tracking Accuracy and Multiple Object Tracking Precision as primary measures of performance. The optimization procedure is first performed on training data and then validated by applying the resulting parameters on non-training data. A number of problematic tracking and visibility conditions are tested using five different camera views selected based on differences in weather conditions, camera resolution, camera angle, tracking distance, and camera site properties. The transferability of the optimized parameters is verified by evaluating the performance of the optimization across these data samples.

Results indicate that there are significant improvements to be made in the parametrization. Winter weather conditions require a specialized and distinct set of parameters to reach an acceptable level of performance, while higher resolution cameras have a lower sensitivity to the optimization process and perform well with most sets of parameters. Average spot speeds are found to be insensitive to MOTA while traffic counts are strongly affected.

# Introduction

The use of video data for automatic traffic data collection and analysis has been on an upward trend as more powerful computational tools, detection and tracking technology become available. Not only have video sensors been able for a long time to emulate inductive loops to collect basic traffic variables such as counts and speed as in the commercial system Autoscope (Michalopoulos 1991), but they can also provide higher-level information regarding road user behaviour and interactions more and more accurately. Examples include pedestrian gait parameters (Saunier 2011), crowd dynamics (Johansson 2008) and surrogate safety analysis applied to motorized and non-motorized road users in various road facilities (St-Aubin 2013, Sakshaug 2010, Autey 2012). Video sensors are relatively inexpensive and easy to install or already installed, for example by transportation agencies for traffic monitoring: large datasets can therefore be collected for large scale or long term traffic analysis. This so-called “big data” phenomenon offers opportunities to better understand transportation systems, presenting its own set of challenges for data analysis (St-Aubin 2015).

Despite the undeniable progress of video sensors and computer vision algorithms in their varied transportation applications, there persists a distinct lack of large comparisons of the performance of video sensors in varied conditions, defined for example by the complexity of the traffic scene (movements and mix of road users), the characteristics of cameras (Wan 2014) and their installation (height, angle), and the environmental conditions (e.g. the weather) (Fu 2015). Such comparisons are particularly hampered by the poor characterization of the datasets used for performance evaluation and the limited availability of benchmarks and public video datasets for transportation applications (Saunier 2014). Tracking performance is often reported using ad hoc and incomplete metrics such as “detection rates” instead of detailed, standardised, and more suitable metrics such as CLEAR MOT (Bernardin 2008). Finally, computer vision algorithms are typically adjusted manually by trial and error using a small dataset covering few of the conditions affecting performance, and the performance reported on that same dataset is therefore over-estimated. As in other fields such as machine learning, the algorithms should be systematically optimized on a calibration dataset, while performance should be reported for a separate validation dataset (Ettehadieh 2015).

While the performance of video sensors for simpler traffic data collection systems has been extensively studied, not all factors have been systematically analyzed, and issues with parameter optimization and the lack of separate calibration and validation datasets are widespread. Besides, the relationship between tracking performance and the accuracy of the resulting traffic parameters has never been fully investigated.

The objective of this paper is first to improve the performance of existing automated detection and tracking methods for video data in terms of the accuracy of tracking. This is done through the optimization of tracking parameters using a genetic algorithm comparing the tracker output with manually annotated trajectories. The method is applied to a set of traffic videos extracted from a large surrogate safety study of roundabout merging zones (St-Aubin 2015), covering factors such as the distance of road users to the camera, the types of cameras, the camera resolution and weather conditions. The second objective is to study the relationship between tracking accuracy, its optimization, and different kinds of traffic data such as counts and speeds. The third and last objective is to explore the transferability of parameters for separate datasets with the same properties (consecutive video samples) and across different properties, by reporting how optimizing tracking for one condition impacts tracking performance for the other conditions. As a follow up on (Ettehadieh 2015), this new paper investigates more factors and how tracking performance is related to the accuracy of traffic parameters. This paper is organized as follows: in the next section a brief overview of the current state of computer vision and calibration in traffic applications is provided; then the methodology is provided in detail including the ground truth inventory, measures of performance and calibration procedure; and finally the last two sections discuss the results of the tracking optimisation procedure and conclusions regarding ideal tracking conditions and associated parameter sets.

# Literature Review

## Computer Vision in Traffic Applications

Computer vision is used extensively in traffic applications as an instrument of data collection and monitoring. Cameras and computer vision are slowly being implemented on-board motorised vehicles as part of the sensor suite necessary for vehicle automation, including advanced driver assistance systems (e.g. pedestrian-vehicle collision avoidance system (Llorca 2009), vehicle overtaking (Milanés 2012)) and optical camera communications systems (Ifthekhar 2015). For traffic engineers, the two primary applications of computer vision using stationary cameras include vehicle presence detection systems (sometimes referred to as virtual loops) and motion tracking. Presence detection has widespread commercial application due to its relatively high degree of reliability which is on par with embedded sensor technology such as inductive loops; its primary application is in providing traffic counts, queue lengths, and basic presence detection (Hoose 1990) for a range of traffic engineering tasks ranging from data collection to traffic light control and optimisation.

Motion tracking is a more complex application which aims to extract the road users’ trajectories continuously with great precision, i.e. their position for every video frame, within the camera field of view, from which velocity, acceleration, and a number of other traffic behaviour measures may be derived. Due to the increased complexity of tracking, it is generally considered less reliable than presence detection systems. There are three main categories of tracking methods:

1. tracking by detection, which typically relies on background subtraction to detect foreground objects and appearance-based object classification (Zangenehpour 2015)

2. tracking using flow, also called feature-based tracking (Saunier 2006), first introduced in (Coifman 1998)

3. tracking with probability based on Bayesian tracking frameworks.

The NGSIM project was one of the first large-scale video data collection projects making use of semi-automated vehicle tracking from freeway and urban arterial video data to obtain vehicle trajectories for traffic model calibrations (Kim 2005). Surrogate safety analysis also makes use of trajectory data, for example with the early SAVEME project (Ervin 2000, Gordon 2012), and now more recently with extensive open source projects such as Traffic Intelligence (Saunier 2006, Jackson 2013).

## Tracking Optimisation and Sensor Calibration

Work on optimizing the parametrization of the various trackers is sparse, and parameters are usually set manually from experimental results. The instances of automated calibration in (Sidla 2006) and (Ali 2009) used Adaboost training strictly for shape detectors and (Pérez 2006) used evolutionary optimization for the segmentation portion. One of the only cases of systematic improvement of the tracking method as a whole through evolutionary algorithms was done recently at Polytechnique Montréal (Ettehadieh 2015): the current work shares similarities with this work such as the use of MOTA for optimization, but this paper deals with motorized traffic instead of pedestrians and investigates further the transferability of calibrated parameters not only for the same camera view, but across different types of cameras, camera views, and visibility/weather conditions. It should be noted that tracking optimisation deals primarily with tracking errors related to artificial intelligence. The other major challenge of computer vision accuracy involves potential issues with line-of-sight and other optical effects. Various strategies have been formulated to deal with issues of occlusion. Partial occlusion has been shown to be corrected via object decomposition (Winn 2006, Tian 2015).

# Methodology

The approach proposed in this paper consists in identifying different conditions that may have an impact on tracking performance and traffic variables such as counts. For each condition, we need two video samples or regions in the same video where the only or primary difference is the change in that condition. The four main steps are as follows:

1. Selection of sites and analysis zones: five different camera views are specifically selected to allow for analysis based on the chosen conditions to be compared. Ten minutes of video are manually annotated for an analysis zone in each camera view to be used as a baseline for the analysis.

2. Optimization of tracking parameters over the whole annotated period (10 min): a subset of the tracking parameters is optimized for each camera view using the chosen measure of performance. These results are used to evaluate traffic data and correlations of parameters with the measure of performance.

3. Optimization of tracking parameters over the first five minute period: the tracking parameters optimized for the first five annotated minutes of each camera view are applied to the whole 10 minute annotated video, as well as to sub-regions of the analysis zones (two sub-regions, one close and one far from the camera), in order to evaluate over-fitting.

4. The optimized tracking parameters from step 3 are applied to the full ten minutes of each camera view to evaluate the transferability between sites, camera types and weather conditions.

The overview of the methodology is presented in FIGURE \ref{fig:optimization_overview}.

## Ground Truth Inventory

The ground truth data is obtained through manual annotation of the source video data using the Urban Tracker annotation application (Jodoin 2014). Five video sequences were selected from three different roundabouts, as presented in Table \ref{tab:gt_inv} (a video sequence corresponds to a camera view and the two terms are used interchangeably). S1S and S1W were recorded with similar fields of view on the first site to compare weather conditions. S1S and S2 show comparable views of two different roundabouts (sites 1 and 2). S3V1 was recorded using the same camera as on S1 and S2, and can be compared to S1S and S2 to evaluate the impact of the resolution. S3V1 and S3V2 allow the impact of the type of camera to be compared on the same site, with two different views. For each camera view, an analysis zone covering a merging zone of the roundabout is defined inside the zone where automated tracking is performed (site analysis mask), as can be seen in FIGURE \ref{fig:analysis_zone_trajectories}. All vehicles going through the analysis zone of each camera view were manually tracked for 10 min (with bounding boxes drawn around each vehicle every 5-10 frames). Manual annotation is labour intensive: the annotations require between half an hour and one hour of manual labour per minute of video, depending on the frame rate and the traffic flow.

\label{tab:gt_inv}

Ground Truth Inventory: S1, S2 and S3 refer to the sites 1 to 3, S1S and S1W refer respectively to the videos recorded on the first site in Summer and Winter, and S3V1 and S3V2 refer respectively to the videos recorded simultaneously on the third site with two different cameras covering complementary zones of the roundabout (resolution is in pixels (pix) and video frame rate in frames per second (fps))

| Site | Time & Date | Number of Annotated Road Users | Camera Type | Conditions | Sample View |
|---|---|---|---|---|---|
| S1S | 12:00pm, July 2012 (Thursday) | 266 | IP Camera 800x600 pix, 15 fps | Sunny, shadows | |
| S1W | 8:00am, February 2013 (Friday) | 209 | IP Camera 800x600 pix, 15 fps | Low visibility, winter | |
| S2 | 7:00am, July 2012 (Wednesday) | 64 | IP Camera 800x600 pix, 15 fps | Sunny, some shadows | |
| S3V1 | 4:00pm, August 2013 (Friday) | 80 | IP Camera 1280x1024 pix, 15 fps | Sunny | |
| S3V2 | 4:00pm, August 2013 (Friday) | 312 | GoPro 1920x1080 pix, 15 fps, corrected for distortion | Sunny | |

## Video Tracking

The video analysis tool used in this work relies on feature-based tracking and is available in the open source “Traffic Intelligence” project (Saunier 2006). Feature-based tracking is composed of two main steps:

1. distinct points such as corners are detected in the whole image and tracked frame after frame until they are lost;

2. a road user will typically have several features on it: the second step consists in grouping the features corresponding to individual road users. Two feature trajectories are grouped if they are close enough (within distance mm-connection-distance) and if the difference between their maximum and minimum distance is small enough (within distance mm-segmentation-distance). A group of features is saved, corresponding to a road user trajectory, if it has on average at least min-nfeatures-group features per frame.
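The two grouping criteria of the second step can be sketched as follows. This is a minimal illustration and not the Traffic Intelligence implementation: the function name `should_group`, the trajectory representation (a dictionary mapping frame numbers to world positions in metres) and the choice to test the connection distance against the smallest distance observed over the common frames are all assumptions made for this example.

```python
from math import hypot


def should_group(traj_a, traj_b, connection_distance, segmentation_distance):
    """Return True if two feature trajectories satisfy both grouping criteria.

    traj_a, traj_b: dict mapping frame number -> (x, y) position in metres
    (hypothetical representation chosen for this sketch).
    """
    common = sorted(set(traj_a) & set(traj_b))
    if not common:
        return False  # never observed together, nothing to compare
    dists = [
        hypot(traj_a[t][0] - traj_b[t][0], traj_a[t][1] - traj_b[t][1])
        for t in common
    ]
    # Criterion 1: the features come close enough to each other
    # (assumption: tested on the smallest distance over the common frames).
    close_enough = min(dists) <= connection_distance
    # Criterion 2: their distance stays nearly constant over time.
    rigid_enough = max(dists) - min(dists) <= segmentation_distance
    return close_enough and rigid_enough


# Two features on the same vehicle: constant 1 m lateral offset.
a = {t: (float(t), 0.0) for t in range(10)}
b = {t: (float(t), 1.0) for t in range(10)}
# A feature on another vehicle that pulls away from `a`.
c = {t: (1.5 * t, 1.0) for t in range(10)}

same_vehicle = should_group(a, b, connection_distance=2.0, segmentation_distance=1.5)
other_vehicle = should_group(a, c, connection_distance=2.0, segmentation_distance=1.5)
print(same_vehicle, other_vehicle)
```

In this toy case the first pair is grouped because its distance never changes, while the second pair is rejected because the gap between the two features grows by more than the segmentation distance.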

The main parameters are listed in TABLE \ref{tab:parameters}, including the grouping parameters mentioned above. The resulting road user positions may be measured in image or world space, depending on whether or not a homography transformation was computed before tracking to project image positions to world positions, on the ground plane. A homography was computed for all camera views used in this paper using a tool provided in Traffic Intelligence. In addition, because the second type of camera used, the GoPro, is subject to strong radial distortion (so called “fish-eye” effect), a processing step is added to correct the distortion (see the sample view of S3V2 in Table \ref{tab:gt_inv}). However, since ground truths are also built from undistorted video data, feature-based tracking quality should in theory be identical regardless of camera distortion; instead distortion affects primarily the quality of the world-space data after homography transformation, be they tracked trajectories or manual annotations. Fortunately, this type of error is easily corrected by examination of the superposition of satellite imagery and does not warrant special optimisation approaches. Error tolerance for homography transformation is no more than 1 metre at a tracking distance of up to 50 metres. On the other hand, since undistortion is applied to image space instead of to the trajectories directly (for a number of technical reasons) some microscopic distortion effects might occur to individual pixels, especially at the edges of image space (far data) which could have a small impact on tracking quality.
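The projection from image space to world space mentioned above can be illustrated with a short sketch. It assumes a 3x3 homography matrix given as nested lists; the function name `project_to_world` and the matrix values are hypothetical and do not correspond to any calibration used in this study.

```python
def project_to_world(H, u, v):
    """Project an image point (u, v) in pixels to the ground plane using H.

    H is a 3x3 homography (list of rows). The point is lifted to homogeneous
    coordinates (u, v, 1), multiplied by H, then normalised by the third
    coordinate.
    """
    x = H[0][0] * u + H[0][1] * v + H[0][2]
    y = H[1][0] * u + H[1][1] * v + H[1][2]
    w = H[2][0] * u + H[2][1] * v + H[2][2]
    if w == 0:
        raise ValueError("point maps to infinity under this homography")
    return x / w, y / w


# Hypothetical homography: a pure scaling of 0.05 m per pixel.
H_example = [
    [0.5, 0.0, 0.0],
    [0.0, 0.5, 0.0],
    [0.0, 0.0, 10.0],
]

world_xy = project_to_world(H_example, 400, 300)
print(world_xy)
```

A real homography would also encode the perspective of the camera through non-zero terms in the last row, which is why distant points are more sensitive to calibration error.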

## Quality Control Routines

A small percentage of the trajectories generated by the tracker (typically fewer than 5 per 10 min of data) have obvious issues and errors, for example obstruction by large vehicles or lamp posts, or ghost trajectories seemingly driving through each other. These errors are too infrequent to optimise algorithmically, but are severe enough to generate strong false alarms. Fortunately, they are also easy to identify and correct. Most of these issues are corrected or eliminated entirely using several quality control routines from the tools developed for the larger project on roundabout safety (St-Aubin 2015). These include the following functions:

• Object integrity verification: verify any corruption in the data structure.

• Warm-up errors at scene edges: vehicles entering image space are only partially tracked until they come within full view, and therefore are lacking in number of tracked features which causes issues with feature grouping.

• Duplicate detection removal: based on proximity and trajectory similarity. Only the most egregious examples are handled. Tracking optimisation should correct most duplicate tracking issues.

• Outlier point split: when two distinct objects within the scene are grouped together, creating a single object which seems to teleport instantly across the scene. These are split at the time of teleportation.

• Stub removal: minimum trajectory dwell time of 0.66 s.

• Alignment filtering: if alignment metadata (lane and sidewalk centerlines) exists, objects which deviate significantly from any typical movements can be flagged for manual review as either a severe traffic infraction or a tracking error.
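Two of the routines listed above, stub removal and outlier point split, can be sketched as follows. This is an illustration only: the function names, the trajectory representation (a list of `(frame, x, y)` tuples with strictly increasing frame numbers and positions in metres) and the `max_speed` threshold used to detect a "teleportation" are assumptions. Only the 0.66 s minimum dwell time and the 15 fps frame rate come from the text.

```python
from math import hypot


def remove_stubs(trajectories, fps=15.0, min_dwell_s=0.66):
    """Keep only trajectories that last at least min_dwell_s seconds."""
    kept = []
    for traj in trajectories:
        duration = (traj[-1][0] - traj[0][0]) / fps
        if duration >= min_dwell_s:
            kept.append(traj)
    return kept


def split_at_jumps(traj, fps=15.0, max_speed=50.0):
    """Split a trajectory wherever the implied speed exceeds max_speed (m/s).

    max_speed is a hypothetical threshold standing in for the detection of
    an instantaneous jump across the scene.
    """
    pieces, current = [], [traj[0]]
    for prev, cur in zip(traj, traj[1:]):
        dt = (cur[0] - prev[0]) / fps
        speed = hypot(cur[1] - prev[1], cur[2] - prev[2]) / dt
        if speed > max_speed:
            pieces.append(current)  # close the piece before the jump
            current = []
        current.append(cur)
    pieces.append(current)
    return pieces


# A 6-frame stub, a 30-frame trajectory, and a trajectory with one jump.
short = [(0, 0.0, 0.0), (5, 1.0, 0.0)]
long_ = [(f, 0.5 * f, 0.0) for f in range(30)]
jump = [(0, 0.0, 0.0), (1, 0.5, 0.0), (2, 40.0, 0.0), (3, 40.5, 0.0)]

kept = remove_stubs([short, long_])
pieces = split_at_jumps(jump)
print(len(kept), [len(p) for p in pieces])
```

At 15 fps, the 0.66 s dwell time corresponds to roughly 10 frames, so the stub spanning 5 frame intervals is discarded while the 30-frame trajectory is kept.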

## Optimizing Tracking Accuracy

\label{tab:parameters}

Tracking parameters considered for tracking accuracy optimization

| Stage | Parameter | Range | Type | Description |
|---|---|---|---|---|
| Feature Tracking | feature-quality | [0-0.4] | Float | Minimum quality of corners to track |
| Feature Tracking | min-feature-distance-klt | [0-6] | Float | Minimum distance between features, in pixels |
| Feature Tracking | window-size | [3-10] | Integer | Distance within which to search for feature in next frame, in pixels |
| Feature Tracking | min-tracking-error | [0.01-0.3] | Float | Minimum error to reach to stop optical flow |
| Feature Tracking | min-feature-time | [2-10] | Integer | Minimum time (in frames) a feature must exist to be saved |
| Feature Grouping | mm-connection-distance | [1.5-3] | Float | Distance to connect features into objects, in world distance unit (m) |
| Feature Grouping | mm-segmentation-distance | [1-3] | Float | Segmentation distance, in world distance unit (m) |
| Feature Grouping | min-nfeatures-group | [2-4] | Float | Minimum number of features per frame to generate a road user |

The tracking parameters listed in TABLE \ref{tab:parameters} are optimized using a genetic algorithm that aims to improve tracking accuracy, comparing the tracker output to the ground truth for a video sequence (see overview in FIGURE \ref{fig:optimization_overview}). Each iteration of the genetic algorithm corresponds to a set (population) of individuals, with each individual representing a complete set of tracking parameters $$\boldsymbol{\theta}$$: the tracker and filtering routines are run on the video sequence for each set of tracking parameters $$\boldsymbol{\theta}$$. The tracker output and the ground truth annotations are compared in the analysis zone, and the genetic algorithm generates a new population of tracking parameters by favouring and combining the best tracking parameters of the previous population.
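The optimization loop described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the procedure used in the study: the search ranges come from the parameter table, and the population size and rates follow the settings reported in the experimental results, but the function names (`optimize`, `next_generation`), the elitist selection, the uniform crossover and the toy fitness function are hypothetical. In the actual procedure the fitness of an individual is the MOTA obtained by running the tracker with that parameter set; a stand-in function replaces it here.

```python
import random

# Search ranges from the parameter table: (lower bound, upper bound, type).
RANGES = {
    "feature-quality": (0.0, 0.4, float),
    "min-feature-distance-klt": (0.0, 6.0, float),
    "window-size": (3, 10, int),
    "min-tracking-error": (0.01, 0.3, float),
    "min-feature-time": (2, 10, int),
    "mm-connection-distance": (1.5, 3.0, float),
    "mm-segmentation-distance": (1.0, 3.0, float),
    "min-nfeatures-group": (2.0, 4.0, float),
}


def sample(name, rng):
    """Draw one value uniformly from the range of a parameter."""
    lo, hi, kind = RANGES[name]
    return rng.randint(lo, hi) if kind is int else rng.uniform(lo, hi)


def next_generation(population, scores, rng, keep=0.2, crossover=0.6, mutation=0.05):
    """Breed a new population from the best `keep` share of the current one."""
    ranked = [ind for _, ind in
              sorted(zip(scores, population), key=lambda p: p[0], reverse=True)]
    parents = ranked[:max(2, int(keep * len(ranked)))]
    children = list(parents)  # assumption: the selected parents survive as-is
    while len(children) < len(population):
        a, b = rng.sample(parents, 2)
        if rng.random() < crossover:
            # Uniform crossover: each parameter comes from either parent.
            child = {n: rng.choice((a[n], b[n])) for n in RANGES}
        else:
            child = dict(a)
        if rng.random() < mutation:
            n = rng.choice(list(RANGES))
            child[n] = sample(n, rng)  # resample one parameter
        children.append(child)
    return children


def optimize(fitness, generations=10, size=20, seed=0):
    """Return the best individual seen and its fitness."""
    rng = random.Random(seed)
    population = [{n: sample(n, rng) for n in RANGES} for _ in range(size)]
    best, best_score = None, float("-inf")
    for _ in range(generations):
        scores = [fitness(ind) for ind in population]
        for ind, score in zip(population, scores):
            if score > best_score:
                best, best_score = ind, score
        population = next_generation(population, scores, rng)
    return best, best_score


def toy_fitness(theta):
    """Stand-in for 'run the tracker and compute MOTA' (hypothetical)."""
    return -abs(theta["feature-quality"] - 0.08)


best, best_score = optimize(toy_fitness, generations=5)
print(best["feature-quality"], best_score)
```

Because each fitness evaluation requires a full tracking run in practice, a small population and few generations are a natural trade-off between search coverage and processing time.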

The metric of tracking performance is the Multiple Object Tracking Accuracy (MOTA) as described in (Bernardin 2008). It is the most common metric for tracking accuracy, i.e. to evaluate the whole trajectory and not just detections in each frame, used in computer vision. MOTA is basically the ratio of the number of correct detections of each object over the number of frames in which the object appears (in the ground truth):

$MOTA = 1 - \frac{\sum_{t} (m_t + fp_t + mme_t)}{\sum_t g_t}$

where $$m_t$$, $$fp_t$$ and $$mme_t$$ are respectively the number of misses, over detections (false positives), and mismatches for frame $$t$$, and $$g_t$$ is the number of ground truth road users present in frame $$t$$. These depend on matching the trajectories produced by the tracker to the ground truth. In this work, a road user is considered to be tracked in a frame if its centroid is within a given distance in world space from the ground truth bounding box centre. Since there may be multiple matches, the Hungarian algorithm is used to associate uniquely the ground truth and tracker output so that over detections (more than one trajectory for the same road user) can be counted. The tracking results depend on these choices and a 5 m distance threshold is used as it is approximately the length of a passenger vehicle. The complementary performance measure of Multiple Object Tracking Precision (MOTP) is reported in the results. It is the average distance between the ground truth and road user trajectories (Bernardin 2008). This is particularly important for traffic variables such as time and distance headway, and safety analysis based on the proximity in time and space of interacting road users as measured for example by the time to collision indicator (St-Aubin 2015).
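The computation of MOTA and MOTP can be illustrated with a simplified sketch. Several points are assumptions: positions are given per frame as dictionaries of identifiers to world coordinates, the one-to-one association is found by brute-force enumeration (usable only for a handful of objects per frame) instead of the Hungarian algorithm, and matches are recomputed independently in every frame, whereas the full CLEAR MOT procedure first keeps previous matches that remain valid. The function names are hypothetical.

```python
from itertools import permutations
from math import hypot


def match_frame(gt, hyp, max_dist):
    """Best one-to-one matching for one frame: most matches, then least distance.

    gt, hyp: dict id -> (x, y) in metres. Returns a list of (gt_id, hyp_id, dist).
    Brute force over assignments; a stand-in for the Hungarian algorithm.
    """
    gids, hids = list(gt), list(hyp)
    swap = len(gids) > len(hids)
    small, large = (hids, gids) if swap else (gids, hids)
    best_key, best_pairs = None, []
    for perm in permutations(large, len(small)):
        pairs = []
        for s, l in zip(small, perm):
            g, h = (l, s) if swap else (s, l)
            d = hypot(gt[g][0] - hyp[h][0], gt[g][1] - hyp[h][1])
            if d <= max_dist:
                pairs.append((g, h, d))
        key = (-len(pairs), sum(d for _, _, d in pairs))
        if best_key is None or key < best_key:
            best_key, best_pairs = key, pairs
    return best_pairs


def clear_mot(gt_frames, hyp_frames, max_dist=5.0):
    """Return (MOTA, MOTP) for two aligned lists of per-frame positions."""
    misses = false_pos = mismatches = n_gt = n_matches = 0
    dist_sum = 0.0
    last_match = {}  # gt id -> tracker id of its most recent match
    for gt, hyp in zip(gt_frames, hyp_frames):
        pairs = match_frame(gt, hyp, max_dist)
        n_gt += len(gt)
        misses += len(gt) - len(pairs)
        false_pos += len(hyp) - len(pairs)
        for g, h, d in pairs:
            if g in last_match and last_match[g] != h:
                mismatches += 1  # identity switch
            last_match[g] = h
            dist_sum += d
            n_matches += 1
    mota = 1 - (misses + false_pos + mismatches) / n_gt
    motp = dist_sum / n_matches if n_matches else float("nan")
    return mota, motp


# Toy example: 2 road users over 3 frames, tracker positions offset by 1 m.
gt_frames = [{1: (0, 0), 2: (10, 0)}, {1: (1, 0), 2: (11, 0)}, {1: (2, 0), 2: (12, 0)}]
hyp_frames = [
    {"a": (0, 1), "b": (10, 1)},                 # both tracked
    {"a": (1, 1)},                               # road user 2 missed
    {"c": (2, 1), "b": (12, 1), "d": (50, 50)},  # id switch a->c, one false positive
]

mota, motp = clear_mot(gt_frames, hyp_frames)
print(mota, motp)  # 0.5 1.0
```

In the toy example there is one miss, one false positive and one mismatch over six ground truth instants, hence a MOTA of 0.5, and every match is exactly 1 m from the ground truth, hence a MOTP of 1.0.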

Once the genetic algorithm finds a local maximum, the optimized or calibrated tracking parameters can be applied to the other video sequences to determine the performance of the tracking parameters under different conditions. The relationship between the tracking parameters and MOTA is evaluated using Spearman’s Rank Correlation Coefficient $$\rho$$, calculated as:

$\rho = 1-\frac{6\sum d_{i}^{2}}{n(n^{2}-1)}$

where $$d_i$$ is the difference between the ranks $$x_i$$ and $$y_i$$ for each corresponding MOTA and tracking parameter values in a sample of size $$n$$.
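The formula above can be applied directly, as in the following sketch. It assumes that there are no tied values (the closed-form expression is only exact in that case), and the sample values are invented for illustration and are not results from this study.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n (largest), assuming no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x, y):
    """Spearman's rank correlation coefficient for two samples without ties."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical sample: MOTA decreases monotonically with feature-quality.
feature_quality = [0.01, 0.05, 0.10, 0.20, 0.30]
mota_values = [0.70, 0.55, 0.15, 0.10, 0.02]

rho = spearman_rho(feature_quality, mota_values)
print(rho)  # -1.0
```

Since the coefficient only uses ranks, it captures any monotonic relationship between a tracking parameter and MOTA, not only a linear one.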

The road user trajectories obtained with the calibrated tracking parameters are analyzed to generate traffic variables. The objective is to identify the relationship of different tracking performance, measured by MOTA, with the traffic variables including traffic flow and spot speeds.

## Over-fitting and Transferability

One of the risks in optimization is that the parameters may be very specific to the video sequence used for optimization. The tracking performance on this video sequence may not be achieved when applied to other video sequences and should not be considered as representing general performance if applied to any other video sequence, even collected in the exact same conditions since traffic will be different. To determine whether tracking parameters optimized for a specific sequence are universally applicable, the performance of the results is examined for each camera view with the following comparisons:

• The analysis area is split into a close section and a far section (see an example in FIGURE \ref{fig:analysis_zone_trajectories}): optimization is done in the whole analysis zone and the MOTA is reported for both zones independently.

• The ground truth is split into two 5-min sequences: optimization is done for the first 5 min and reported for each 5-min sequence.

• The same camera is used on two different sites at a similar height and angle (sequences S1S and S2).

• The same camera view is used in both winter conditions and summer conditions (sequences S1S and S1W).

• The same IP camera is used at two different resolutions: 800x600 and 1280x1024 (sequences S1S and S3V1).

• The same site is used for two different cameras, an IP and a GoPro camera (sequences S3V1 and S3V2).

Transferability is thus verified by applying each set of optimized tracking parameters to each of the other annotated sequences. The full ten minute annotated videos are used to calculate the MOTA, which are then compared to the optimized tracking performance and reported also as a percentage of the maximum MOTA. This is done to avoid potential bias from the site selection and vehicle composition, which may lower the MOTA result compared to other cameras, despite having similar relative tracking performance results.
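The normalisation by the maximum MOTA can be sketched as follows. The nested dictionary layout, the function name `relative_mota` and the two-site example values are hypothetical and chosen only to illustrate the calculation; they are not results from this study.

```python
def relative_mota(mota):
    """Express each MOTA as a share of the best MOTA found for the same view.

    mota[p][e] is the MOTA obtained on camera view e with the parameters
    optimized for camera view p (hypothetical layout).
    """
    sites = list(mota)
    best = {e: max(mota[p][e] for p in sites) for e in sites}
    return {p: {e: mota[p][e] / best[e] for e in sites} for p in sites}


# Invented values for two camera views A and B.
example = {
    "A": {"A": 0.90, "B": 0.72},  # parameters optimized on A
    "B": {"A": 0.81, "B": 0.80},  # parameters optimized on B
}

relative = relative_mota(example)
for p in example:
    for e in example:
        print(f"parameters from {p} applied to {e}: {100 * relative[p][e]:.0f} % of best")
```

Reporting the ratio alongside the absolute value makes camera views with intrinsically lower MOTA comparable to the others.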

\label{fig:optimization_overview} Optimization of Tracking Parameters

\label{fig:analysis_zone_trajectories} Example of an analysis zone and extracted trajectories for a given camera field of view.

# Experimental Results

## The Relationship of Tracking Accuracy with Traffic Data

In the first phase of optimization, the genetic algorithm was run for the full length of each 10-min video sequence. The population size is set to 20 individuals, using a minimum selection of 20 % of the fitness function, a crossover rate of 60 % of individuals, and a mutation rate of 5 % (of individuals). These parameters were based on typical values and adjusted to account for a relatively low number of alleles and the high cost of processing each sequence. All the sets of tracking parameters (individuals) and the corresponding trajectories generated by the tracker are saved over the whole optimization process: the traffic counts and average speeds at the entrance and exit of each lane of the analysis zone are extracted and analyzed with respect to tracking accuracy (see FIGURE \ref{fig:mota-traffic-data}).

Figure panels: Counts and Average Speeds, for S1S, S1W, S2, S3V1 and S3V2.

\label{fig:mota-traffic-data}

As expected, there is a strong correlation between MOTA and the number of road users tracked given the definition of MOTA: if tracking errors are uniformly spread over the analysis zone, since MOTA represents the average percentage of correctly tracked road user instants, one can expect that the resulting counts will be around the true number of road users multiplied by MOTA. MOTA seems therefore to be a good indicator of total counting accuracy. On the other hand, average spot speeds seem relatively insensitive to MOTA, except for lower values of MOTA which tend to be associated with a larger range of average extracted spot speed, especially for S1W. This can be related to the counts: lower MOTA values generate low counts and the spot speeds are a small random sample of all road users with high variability. As the performance increases in lanes that have a significant number of vehicles, the average spot speeds converge (one can still observe a lot of variability independently of MOTA for lane 1 at the exit). The two higher resolution camera views (S3V1 and S3V2) had very few results of poor performance. This suggests a lower sensitivity to the tracking parameters chosen for optimization. S1W presents a special case where only 49 % of generation individuals had any tracked objects. These two cases are discussed in subsequent sections.
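As a rough worked illustration of this reasoning, and assuming that errors are indeed uniformly spread, the 266 road users annotated for S1S combined with the MOTA of 0.9045 reported for the full 10 min of S1S with optimized parameters would give an expected count of (the symbols $$\hat{N}$$ and $$N_{GT}$$ are notation introduced here for the expected and annotated counts):

$\hat{N} \approx N_{GT} \times MOTA = 266 \times 0.9045 \approx 241$

This is only an order-of-magnitude check: misses, false positives and mismatches all lower MOTA, but they do not affect counts in the same direction.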

## Correlation of Tracking Parameters with Tracking Accuracy

All the tested tracking parameters in the optimization history with their associated MOTA are used to compute Spearman’s rank correlation coefficient (see results by parameter and camera view in FIGURE \ref{fig:correlations}). The correlations therefore depend on these specific optimization runs, the number of which is different for each camera view since the processing time for tracking took up to 12 h for the 20 individuals evaluated at each generation of the genetic algorithm. The higher resolution camera also had few data points for lower MOTA values as seen in FIGURE \ref{fig:mota-traffic-data}. The tracking performance of the lower resolution IP camera was largely dependent on feature-quality. In fact, in the case of S1W recorded in Winter, a MOTA above 0.150 was not found unless feature-quality was below 0.10, and the best performance came from a feature-quality of approximately 0.01. This suggests that video sequences recorded in Winter should be calibrated separately from the others, since their most important parameter is directly related to computation time (a lower feature-quality generates more features, which take more time to be grouped). The range of feature-quality should also be adjusted for such a calibration. Another point worth mentioning is that the GoPro camera had most of the lowest correlation coefficients with respect to each parameter. The implication is that the high resolution camera seems to be less sensitive to the choice of tracking parameters to achieve good performance. The third observation is that certain parameters such as min-tracking-error and min-feature-time could be eliminated from the optimization process due to the lack of correlation for all camera views.

## Over-fitting Analysis

The second phase of the validation process was to evaluate the performance of optimized parameters compared to the default parameters. The genetic algorithm was run for the first 5 min of each video file for between 10 and 30 generations over the course of one to three days depending on the site. TABLE \ref{tab:optimized-parameters} presents a summary of the optimal parameters found for each camera view. Some parameters such as min-feature-distance-klt, mm-connection-distance and min-tracking-error seem to have converged towards similar values on three of the camera views. However, it is apparent that in most cases the optimal solution is unique.

\label{tab:optimized-parameters}

| Parameter | Default | S1S | S1W | S2 | S3V1 | S3V2 |
|---|---|---|---|---|---|---|
| window-size | 7 | 9 | 9 | 6 | 10 | 3 |
| feature-quality | 0.1 | 0.08265 | 0.00963 | 0.08123 | 0.11669 | 0.26521 |
| min-feature-distance-klt | 5 | 3.01262 | 3.13741 | 3.54964 | 3.54244 | 2.86454 |
| min-tracking-error | 0.3 | 0.18563 | 0.10376 | 0.18333 | 0.18468 | 0.02839 |
| min-feature-time | 20 | 8 | 10 | 9 | 5 | 3 |
| mm-connection-distance | 3.75 | 2.7478 | 2.00049 | 2.68814 | 2.7306 | 1.85201 |
| mm-segmentation-distance | 1.5 | 2.45605 | 2.18038 | 1.81512 | 2.17617 | 1.28231 |
| min-nfeatures-group | 3 | 2.72307 | 3.40201 | 3.16748 | 2.51106 | 2.30418 |

The results of the genetic algorithm are presented in TABLE \ref{tab:performance1} as a comparison between the performance of the default tracking parameters and that of the parameters optimized on the first 5 min of each camera view. Both the MOTA and MOTP values for every case are improved, in some cases by a large margin. The results for Winter (S1W) in particular express the need for optimization as the MOTA did not reach over 0.05 using default parameters whereas a potential for a MOTA over 0.70 was found. As expected, the sites with a lower correlation with the parameters did not improve as much as the other three sites. A lower MOTA did not necessarily mean that the camera cannot be properly optimized. When comparing S1S and S2, which have similar physical properties and identical cameras, there was a noteworthy difference (0.905 and 0.797 respectively). The source of the difference could be the lower number of vehicles in the site 2 video segment or a difference in traffic composition (i.e. higher volume of trucks).

\label{tab:performance1}

| Parameters | Sample | S1S MOTA | S1S MOTP | S1W MOTA | S1W MOTP | S2 MOTA | S2 MOTP | S3V1 MOTA | S3V1 MOTP | S3V2 MOTA | S3V2 MOTP |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Default | First 5 min | 0.74595 | 1.642 | 0.04254 | 2.666 | 0.64724 | 1.251 | 0.75359 | 1.149 | 0.82001 | 1.101 |
| Default | Last 5 min | 0.67878 | 1.49 | 0.03138 | 3.524 | 0.71861 | 1.009 | 0.76088 | 1.074 | 0.66971 | 1.612 |
| Default | Full 10 min | 0.71905 | 1.581 | 0.04107 | 3.001 | 0.70318 | 1.1 | 0.75976 | 1.114 | 0.75042 | 1.34 |
| Default | Close | 0.70611 | 1.378 | 0.04501 | 2.742 | 0.56925 | 1.119 | 0.76262 | 0.971 | 0.85619 | 1.175 |
| Default | Far | 0.63198 | 1.819 | 0.02898 | 4.594 | 0.70024 | 1.092 | 0.69966 | 1.23 | 0.63395 | 1.519 |
| Optimized | First 5 min | 0.90855 | 1.233 | 0.70759 | 1.974 | 0.81178 | 1.019 | 0.85527 | 1.000 | 0.85092 | 0.736 |
| Optimized | Last 5 min | 0.88377 | 1.233 | 0.69288 | 1.885 | 0.71742 | 0.853 | 0.77248 | 0.842 | 0.69091 | 0.585 |
| Optimized | Full 10 min | 0.9045 | 1.237 | 0.70993 | 1.920 | 0.76673 | 0.918 | 0.81746 | 0.927 | 0.78852 | 0.666 |
| Optimized | Close | 0.87103 | 1.174 | 0.67357 | 1.979 | 0.61695 | 0.856 | 0.78374 | 0.857 | 0.87462 | 0.713 |
| Optimized | Far | 0.82119 | 1.298 | 0.60789 | 1.971 | 0.76047 | 0.978 | 0.73834 | 1.040 | 0.67978 | 0.554 |
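As a quick check of the first-5-min comparison, the sketch below recomputes the MOTA gain and MOTP reduction per camera view from the table values. It assumes that the table columns are ordered S1S, S1W, S2, S3V1, S3V2, that the first block of rows corresponds to the default parameters and the second to the optimized ones, and that a lower MOTP is better; the names `DEFAULT`, `OPTIMIZED` and `improvements` are illustrative only.

```python
# (MOTA, MOTP) on the first 5 min of each camera view, copied from the table.
# Assumption: columns are ordered S1S, S1W, S2, S3V1, S3V2; the first block
# of rows is the default parameters and the second the optimized ones.
DEFAULT = {
    "S1S": (0.74595, 1.642),
    "S1W": (0.04254, 2.666),
    "S2": (0.64724, 1.251),
    "S3V1": (0.75359, 1.149),
    "S3V2": (0.82001, 1.101),
}
OPTIMIZED = {
    "S1S": (0.90855, 1.233),
    "S1W": (0.70759, 1.974),
    "S2": (0.81178, 1.019),
    "S3V1": (0.85527, 1.000),
    "S3V2": (0.85092, 0.736),
}


def improvements(default, optimized):
    """Return {site: (MOTA gain, MOTP reduction)}; positive values are better."""
    return {
        site: (
            optimized[site][0] - default[site][0],
            default[site][1] - optimized[site][1],
        )
        for site in default
    }


gains = improvements(DEFAULT, OPTIMIZED)
for site, (d_mota, d_motp) in gains.items():
    print(f"{site}: MOTA gain {d_mota:+.3f}, MOTP reduction {d_motp:+.3f}")
```

Under these assumptions, every view shows a positive MOTA gain and a positive MOTP reduction on the training sample, with the largest MOTA gain for S1W.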

In the same table, the parameters optimized by the genetic algorithm were evaluated on different conditions for each camera view:

• a validation sample made up by the last 5 min of each camera view

• the whole 10 min of annotated video

• the closer half of the analysis zone (whole 10 min) (see FIGURE \ref{fig:analysis_zone_trajectories})

• the further half of the analysis zone (whole 10 min)

This permits investigating over-fitting. It is found that, while the improvements were always greater for the first 5 min on which the parameters were optimized, there was a systematic upward trend in performance, both for MOTA and MOTP, on the validation sample (last 5 min) and the other conditions. It should be noted that the MOTA values of the close and far zones do not always average to the MOTA for the full analysis zone. This is explained by the small differences created at the borders of the analysis zones as vehicles enter and leave and trajectories are split between the sub-zones.
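The remark about the sub-zones can be illustrated with simple arithmetic. The sketch below compares the full-zone MOTA with the plain mean of the close and far MOTA values for the optimized parameters over the whole 10 min. The values are copied from the table under the assumption that its columns are ordered S1S, S1W, S2, S3V1, S3V2 and that its second block of rows holds the optimized results; `zone_gap` is a hypothetical helper name.

```python
# MOTA with the optimized parameters over the whole 10 min, per camera view.
# Assumption: table columns are ordered S1S, S1W, S2, S3V1, S3V2.
FULL = {"S1S": 0.9045, "S1W": 0.70993, "S2": 0.76673, "S3V1": 0.81746, "S3V2": 0.78852}
CLOSE = {"S1S": 0.87103, "S1W": 0.67357, "S2": 0.61695, "S3V1": 0.78374, "S3V2": 0.87462}
FAR = {"S1S": 0.82119, "S1W": 0.60789, "S2": 0.76047, "S3V1": 0.73834, "S3V2": 0.67978}


def zone_gap(full, close, far):
    """Difference between the full-zone MOTA and the mean of the sub-zone MOTAs."""
    return {site: full[site] - (close[site] + far[site]) / 2.0 for site in full}


gap = zone_gap(FULL, CLOSE, FAR)
for site, value in gap.items():
    print(f"{site}: full minus mean(close, far) = {value:+.4f}")
```

With this reading of the table, the gap is non-zero for every view, which is consistent with trajectories being split at the sub-zone borders.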

## Transferability of Optimized Parameters Across Camera Views

The third phase applies the parameters optimized on the first 5 min of each camera view to all annotated videos for the whole 10 min: the result is presented for MOTA in TABLE XXX. The MOTA was compared both as an absolute value and as a percentage of the highest known MOTA to investigate the relative gains for each camera view (as expected, the best MOTA for each camera view is the one obtained with the parameters optimized for that view). The Winter conditions (S1W) proved to be the most difficult: tracking parameters optimized on a different camera view always produced very few road user trajectories, leading to MOTA values far outside the acceptable range. Conversely, using the tracking parameters optimized for S1W resulted in worse tracking performance than the default parameters for two camera views. This suggests the need for separate tracking parameters for different weather conditions. However, in good weather conditions, a result of around 90 % of the best known MOTA can be expected from any set of optimized parameters.
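The relative comparison described above can be sketched as follows. The function expresses each MOTA as a percentage of the best known MOTA for the same target camera view. The two views `A` and `B` and their MOTA values are hypothetical placeholders chosen for illustration, not results from this study, and `relative_mota` is an assumed helper name.

```python
def relative_mota(mota):
    """Express MOTA as a percentage of the best known MOTA per target view.

    mota[source][target] is the MOTA obtained on view `target` with the
    parameters optimized on view `source`.
    """
    targets = {target for row in mota.values() for target in row}
    best = {
        target: max(row[target] for row in mota.values() if target in row)
        for target in targets
    }
    return {
        source: {target: 100.0 * value / best[target] for target, value in row.items()}
        for source, row in mota.items()
    }


# Hypothetical MOTA values for two camera views (illustration only).
example = {
    "A": {"A": 0.90, "B": 0.72},
    "B": {"A": 0.81, "B": 0.80},
}
relative = relative_mota(example)
for source, row in relative.items():
    for target, pct in row.items():
        print(f"parameters from {source} applied to {target}: {pct:.1f} % of best")
```

Dividing by the best value per target view, rather than per source, keeps the self-optimized entry at 100 % whenever it is the best known result for that view.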

\label{fig:correlations} Spearman’s rank correlation coefficient by tracking parameter and site computed over the calibration history
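For readers unfamiliar with the statistic, the following is a minimal sketch of Spearman's rank correlation, computed as the Pearson correlation of average ranks. The calibration history used here (`min_tracking_error` values and the associated MOTA) is hypothetical and only illustrates the computation; it assumes neither input is constant.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical calibration history: one parameter value and the MOTA it gave.
min_tracking_error = [0.05, 0.10, 0.20, 0.30, 0.40]
mota = [0.88, 0.86, 0.80, 0.71, 0.74]
rho = spearman(min_tracking_error, mota)
print(f"Spearman correlation: {rho:.3f}")
```

A strongly negative coefficient, as in this made-up history, indicates that MOTA tends to decrease as the parameter increases, without assuming a linear relationship.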

# Conclusion

The first finding of this paper is the strong correlation between traffic counts and the measure of tracking performance (MOTA), as well as a correlation as large as -0.879 between MOTA and certain tracking parameters (using Spearman’s rank correlation). The second finding deals with the question of over-fitting. While the results, as expected, are found to be optimal on the sequence on which the calibration was done, there were improvements in all conditions, in separate parts of the video and on both the analysis sub-zones close to and far from the camera. The third finding evaluates the transferability of tracking parameters optimized for a specific camera view to the others. Parameters for summer sequences are demonstrably not applicable to winter conditions, as results for the winter sequence did not surpass a MOTA of 0.150 unless specifically optimized for, whereas the high resolution camera is shown to be very tolerant to different parameters.

The genetic algorithm used with manually annotated video sequences shows that there is room for noticeable improvements over the default tracking parameters. Considering the strong impact that winter conditions had on the performance results, a logical next step would be the evaluation of other meteorological conditions such as precipitation, low visibility (fog), nighttime and high winds (affecting camera stability). High resolution cameras have fewer issues with the choice of parameters and should be studied further to determine the effects of these conditions compared to the effects on lower resolution cameras. There is also additional work to be done on the relationship between tracking accuracy and other traffic data that could not be extracted in sufficient quantities from ten minutes of video to provide meaningful results: gap time, time-to-collision and road user interactions are examples of such data relevant for safety. Different types of optimization algorithms, e.g. the evolutionary algorithm used in (Ettehadieh 2015), and performance metrics could also be evaluated based on both computation time and reliability of the solutions. The accuracy of traffic data (counts and speeds) was not measured in this paper and its relationship with tracking accuracy should be investigated. In particular, can better accuracy for traffic variables be obtained by optimizing it directly rather than by optimizing tracking accuracy? The real world application is to develop a reliable single or dynamic set of tracking parameters that could be calibrated and applied to a network of both fixed and mobile cameras to be used under all conditions. Combined with automated trajectory analysis tools, a wide-scale deployment of automatically calibrated video tracking would provide researchers with very large datasets for traffic monitoring and surrogate safety analysis.

# Acknowledgements

The authors would like to acknowledge the funding of the Québec road safety research program supported by the Fonds de recherche du Québec Nature et technologies, the Ministère des Transports du Québec and the Fonds de recherche du Québec Santé (proposal number 2012-SO-163493), as well as the various municipalities for their logistical support during data collection. The authors also wish to thank Shaun Burns for his help in the collection of video data and Karla Gamboa for annotating two of the five videos used in this work.

### References

1. Irshad Ali, Matthew N. Dailey. Multiple Human Tracking in High-Density Crowds. 540–549 In Advanced Concepts for Intelligent Vision Systems. Springer Science + Business Media, 2009.

2. Jarvis Autey, Tarek Sayed, Mohamed H. Zaki. Safety evaluation of right-turn smart channels using automated traffic conflict analysis. Accident Analysis & Prevention 45, 120–130 Elsevier BV, 2012.

3. Keni Bernardin, Rainer Stiefelhagen. Evaluating Multiple Object Tracking Performance: The CLEAR MOT Metrics. EURASIP Journal on Image and Video Processing 2008, 1–10 Springer Science + Business Media, 2008.

4. Benjamin Coifman, David Beymer, Philip McLauchlan, Jitendra Malik. A real-time computer vision system for vehicle tracking and traffic surveillance. Transportation Research Part C: Emerging Technologies 6, 271–288 Elsevier BV, 1998.

5. R. Ervin, C. MacAdam, J. Walker, S. Bogard, M. Hagan, A. Vayda, E. Anderson. System for Assessment of the Vehicle Motion Environment (SAVME). (2000).

6. D. Ettehadieh, B. Farooq, N. Saunier. Systematic Parameter Optimization and Application of Automated Tracking in Pedestrian-Dominant Situations. In Transportation Research Board Annual Meeting Compendium of Papers. (2015).

7. Ting Fu, Sohail Zangenehpour, Paul St-Aubin, Liping Fu, Luis F. Miranda-Moreno. Using microscopic video data measures for driver behavior analysis during adverse winter weather: opportunities and challenges. Journal of Modern Transportation 23, 81–92 Springer Science + Business Media, 2015.

8. Timothy Gordon, Zevi Bareket, Lidia Kostyniuk, Michelle Barnes, Michael Hagan, Zu Kim, Delphine Cody, Alexander Skabardonis, Alan Vayda. Site-Based Video System Design and Development. (2012).

9. N. Hoose. Computer Vision as a Traffic Surveillance Tool. 57–64 In Control Computers, Communications in Transportation. Elsevier BV, 1990.

10. Md. Shareef Ifthekhar, Nirzhar Saha, Yeong Min Jang. Stereo-vision-based cooperative-vehicle positioning using OCC and neural networks. Optics Communications 352, 166–180 Elsevier BV, 2015.

11. Stewart Jackson, Luis F. Miranda-Moreno, Paul St-Aubin, Nicolas Saunier. Flexible, Mobile Video Camera System and Open Source Video Analysis Software for Road Safety and Behavioral Analysis. Transportation Research Record: Journal of the Transportation Research Board 2365, 90–98 Transportation Research Board, 2013.

12. Jean-Philippe Jodoin, Guillaume-Alexandre Bilodeau, Nicolas Saunier. Urban Tracker: Multiple object tracking in urban mixed traffic. In IEEE Winter Conference on Applications of Computer Vision. IEEE, 2014.

13. A. Johansson, D. Helbing, H. Al-Abideen, S. Al-Bosta. From crowd dynamics to crowd safety: a video-based analysis. Advances in Complex Systems 11, 497-527 World Scientific Publishing Company, 2008.

14. Z. Kim, G. Gomes, R. Hranac, A. Skabardonis. A Machine Vision System for Generating Vehicle Trajectories over Extended Freeway Segments. In 12th World Congress on Intelligent Transportation Systems. (2005).

15. D.R. Llorca, M.A. Sotelo, I. Parra, J.E. Naranjo, M. Gavilan, S. Alvarez. An Experimental Study on Pitch Compensation in Pedestrian-Protection Systems for Collision Avoidance and Mitigation. IEEE Trans. Intell. Transport. Syst. 10, 469–474 Institute of Electrical & Electronics Engineers (IEEE), 2009.

16. P. G. Michalopoulos. Vehicle detection video through image processing: the Autoscope system. IEEE Transactions on Vehicular Technology 40, 21–29 (1991).

17. Vicente Milanés, David F. Llorca, Jorge Villagrá, Joshué Pérez, Carlos Fernández, Ignacio Parra, Carlos González, Miguel A. Sotelo. Intelligent automatic overtaking system using vision for vehicle detection. Expert Systems with Applications 39, 3362–3373 Elsevier BV, 2012.

18. O. Pérez, M. Á. Patricio, J. García, J. M. Molina. Improving the Segmentation Stage of a Pedestrian Tracking Video-Based System by Means of Evolution Strategies. 438–449 In Lecture Notes in Computer Science. Springer Science + Business Media, 2006.

19. Lisa Sakshaug, Aliaksei Laureshyn, Åse Svensson, Christer Hydén. Cyclists in roundabouts - Different design solutions. Accident Analysis & Prevention 42, 1338–1351 Elsevier BV, 2010.

20. N. Saunier, A. El Husseini, K. Ismail, C. Morency, J.-M. Auberlet, T. Sayed. Pedestrian Stride Frequency and Length Estimation in Outdoor Urban Environments using Video Sensors. Transportation Research Record: Journal of the Transportation Research Board 2264, 138–147 (2011).

21. N. Saunier, H. Ardö, J.-P. Jodoin, A. Laureshyn, M. Nilsson, Å. Svensson, L. F. Miranda-Moreno, G.-A. Bilodeau, K. Åström. Public Video Data Set for Road Transportation Applications. In Transportation Research Board Annual Meeting Compendium of Papers. (2014).

22. N. Saunier, T. Sayed. A feature-based tracking algorithm for vehicles in intersections. In The 3rd Canadian Conference on Computer and Robot Vision (CRV06). Institute of Electrical & Electronics Engineers (IEEE), 2006.

23. Oliver Sidla, Yuriy Lypetskyy, Norbert Brandle, Stefan Seer. Pedestrian Detection and Tracking for Counting Applications in Crowded Situations. In 2006 IEEE International Conference on Video and Signal Based Surveillance. Institute of Electrical & Electronics Engineers (IEEE), 2006.

24. Paul St-Aubin, Nicolas Saunier, Luis Miranda-Moreno, Karim Ismail. Use of Computer Vision Data for Detailed Driver Behavior Analysis and Trajectory Interpretation at Roundabouts. Transportation Research Record: Journal of the Transportation Research Board 2389, 65–77 Transportation Research Board, 2013.

25. P. St-Aubin, N. Saunier, L. F. Miranda-Moreno. Large-scale automated proactive road safety analysis using video data. Transportation Research Part C: Emerging Technologies Elsevier BV, 2015.

26. Bin Tian, Ming Tang, Fei-Yue Wang. Vehicle detection grammars with partial occlusion handling for traffic surveillance. Transportation Research Part C: Emerging Technologies 56, 80–93 Elsevier BV, 2015.

27. Yiwen Wan, Yan Huang, Bill Buckles. Camera calibration and vehicle tracking: Highway traffic video analytics. Transportation Research Part C: Emerging Technologies 44, 202–213 Elsevier BV, 2014.

28. J. Winn, J. Shotton. The Layout Consistent Random Field for Recognizing and Segmenting Partially Occluded Objects. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Volume 1 (CVPR06). Institute of Electrical & Electronics Engineers (IEEE), 2006.

29. Sohail Zangenehpour, Luis F. Miranda-Moreno, Nicolas Saunier. Automated classification based on video data at intersections with heavy pedestrian and bicycle traffic: Methodology and application. Transportation Research Part C: Emerging Technologies 56, 161–176 Elsevier BV, 2015.