File size: 8,792 Bytes
47af768
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
Taken from download link found at: http://www.cvlibs.net/datasets/kitti/eval_tracking.php

###########################################################################
#           THE KITTI VISION BENCHMARK SUITE: TRACKING BENCHMARK          #
#              Andreas Geiger    Philip Lenz    Raquel Urtasun            #
#                    Karlsruhe Institute of Technology                    #
#                Toyota Technological Institute at Chicago                #
#                             www.cvlibs.net                              #
###########################################################################

For recent updates see http://www.cvlibs.net/datasets/kitti/eval_tracking.php.

This file describes the KITTI tracking benchmarks, consisting of 21 
training sequences and 29 test sequences. 

Despite the fact that we have labeled 8 different classes, only the classes 
'Car' and 'Pedestrian' are evaluated in our benchmark, as only for those 
classes enough instances for a comprehensive evaluation have been labeled. 

The labeling process has been performed in two steps: First we hired a set 
of annotators, to label 3D bounding boxes for tracklets in 3D Velodyne 
point clouds. Since for a pedestrian tracklet, a single 3D bounding box 
tracklet (dimensions have been fixed) often fits badly, we additionally 
labeled the left/right boundaries of each object by making use of Mechanical
Turk. We also collected labels of the object's occlusion state, and computed 
the object's truncation via backprojecting a car/pedestrian model into the
image plane.

NOTE: WHEN SUBMITTING RESULTS, PLEASE STORE THEM IN THE SAME DATA FORMAT IN
WHICH THE GROUND TRUTH DATA IS PROVIDED (SEE BELOW), USING THE FILE NAMES
0000.txt 0001.txt ... CREATE A ZIP ARCHIVE OF THEM AND STORE YOUR
RESULTS (ONLY THE RESULTS OF THE TEST SET) IN ITS ROOT FOLDER. FOR A 
RE-SUBMISSION, _ONLY_ THE RE-SUBMITTED RESULTS WILL BE SHOWN IN THE TABLE.

Data Format Description
=======================

The data for training and testing can be found in the corresponding folders.
The sub-folders are structured as follows:

  - image_02/%04d/ contains the left color camera sequence images (png)
  - image_03/%04d/ contains the right color camera sequence images  (png)
  - label_02/ contains the left color camera label files (plain text files)
  - calib/ contains the calibration for all four cameras (plain text files)

The label files contain the following information. 
All values (numerical or strings) are separated via spaces, each row 
corresponds to one object. The 17 columns represent:

#Values    Name      Description
----------------------------------------------------------------------------
   1    frame        Frame within the sequence where the object appearers
   1    track id     Unique tracking id of this object within this sequence
   1    type         Describes the type of object: 'Car', 'Van', 'Truck',
                     'Pedestrian', 'Person_sitting', 'Cyclist', 'Tram',
                     'Misc' or 'DontCare'
   1    truncated    Integer (0,1,2) indicating the level of truncation.
                     Note that this is in contrast to the object detection
                     benchmark where truncation is a float in [0,1].
   1    occluded     Integer (0,1,2,3) indicating occlusion state:
                     0 = fully visible, 1 = partly occluded
                     2 = largely occluded, 3 = unknown
   1    alpha        Observation angle of object, ranging [-pi..pi]
   4    bbox         2D bounding box of object in the image (0-based index):
                     contains left, top, right, bottom pixel coordinates
   3    dimensions   3D object dimensions: height, width, length (in meters)
   3    location     3D object location x,y,z in camera coordinates (in meters)
   1    rotation_y   Rotation ry around Y-axis in camera coordinates [-pi..pi]
   1    score        Only for results: Float, indicating confidence in
                     detection, needed for p/r curves, higher is better.
                     

Here, 'DontCare' labels denote regions in which objects have not been labeled,
for example because they have been too far away from the laser scanner. To
prevent such objects from being counted as false positives our evaluation
script will ignore objects tracked in don't care regions of the test set.
You can use the don't care labels in the training set to avoid that your object
detector/tracking algorithm is harvesting hard negatives from those areas, 
in case you consider non-object regions from the training images as negative 
examples.

The reference point for the 3D bounding box for each object is centered on the
bottom face of the box. The corners of bounding box are computed as follows with
respect to the reference point and in the object coordinate system:
x_corners = [l/2, l/2, -l/2, -l/2,  l/2,  l/2, -l/2, -l/2]^T
y_corners = [0,   0,    0,    0,   -h,   -h,   -h,   -h  ]^T
z_corners = [w/2, -w/2, -w/2, w/2, w/2, -w/2, -w/2, w/2  ]^T
with l=length, h=height, and w=width.

The coordinates in the camera coordinate system can be projected in the image
by using the 3x4 projection matrix in the calib folder, where for the left
color camera for which the images are provided, P2 must be used. The
difference between rotation_y and alpha is, that rotation_y is directly
given in camera coordinates, while alpha also considers the vector from the
camera center to the object center, to compute the relative orientation of
the object with respect to the camera. For example, a car which is facing
along the X-axis of the camera coordinate system corresponds to rotation_y=0,
no matter where it is located in the X/Z plane (bird's eye view), while
alpha is zero only, when this object is located along the Z-axis of the
camera. When moving the car away from the Z-axis, the observation angle
(\alpha) will change.

An overview of the coordinate systems, reference point and geometrical 
definitions is given in cs_overview.pdf.

To project a point from Velodyne coordinates into the left color image,
you can use this formula: x = P2 * R0_rect * Tr_velo_to_cam * y
For the right color image: x = P3 * R0_rect * Tr_velo_to_cam * y

Note: All matrices are stored row-major, i.e., the first values correspond
to the first row. R0_rect contains a 3x3 matrix which you need to extend to
a 4x4 matrix by adding a 1 as the bottom-right element and 0's elsewhere.
Tr_xxx is a 3x4 matrix (R|t), which you need to extend to a 4x4 matrix 
in the same way!

The sensors were not moved between the different days while taking footage.
However, the full camera calibration was performed for every day separately.
Therefore, only 'Tr_imu_velo' is constant for all sequences.

Note that while all this information is available for the training data,
only the data which is actually needed for the particular benchmark must
be provided to the evaluation server. However, all 17 values must be provided
at all times, with the unused ones set to their default values (=invalid).
Additionally a 18'th value must be provided
with a floating value of the score for a particular tracked detection, where 
higher indicates higher confidence in the detection. The range of your scores 
will be automatically determined by our evaluation server, you don't have to
normalize it, but it should be roughly linear.

Tracking Benchmark
==================

The goal in the object tracking task is to estimate object tracklets for the 
classes 'Car', 'Pedestrian', and (optional) 'Cyclist'. The tracking
algorithm must provide as output the 2D 0-based bounding boxes in each image in
the sequence using the format specified above, as well as a score, indicating
the confidence in the particular frame for this track. All other values must be
set to their default values (=invalid), see above. One text file per sequence
must be provided in a zip archive, where each file can contain many detections,
depending on the number of objects per sequence. In our evaluation we only
evaluate detections/objects larger than 25 pixel (height) in the image and do
not count Vans as false positives for cars or Sitting Persons as wrong positives
for Pedestrians due to their similarity in appearance. (All ignored objects 
are considered as DontCare areas.) As evaluation criterion we follow the 
HOTA, CLEARMOT and Mostly-Tracked/Partly-Tracked/Mostly-Lost metrics.

Raw Data
========

Raw data is mapped to the tracking benchmark sequences and available for 
download.

The velodyne and positioning data for training and testing can be found in the 
corresponding folders. The sub-folders are structured as follows:

  - velodyne/%04d/ contains the raw velodyne point clouds (binary file)
  - oxts/ contains the raw position (oxts) data (plain text files)