Taken from: https://bdd-data.berkeley.edu/wad-2020.html

BDD100K MOT Dataset

To advance the study of multiple object tracking, we introduce the BDD100K MOT Dataset. We provide 1,400 video sequences for training, 200 video sequences for validation, and 400 video sequences for testing. Each video sequence is about 40 seconds long at 5 FPS, resulting in approximately 200 frames per video. The BDD100K MOT Dataset is not only diverse in visual scale among and within tracks, but also in the temporal range of each track. Objects in the BDD100K MOT Dataset also present complicated occlusion and reappearing patterns: an object may be fully occluded or move out of the frame, and then reappear later. The BDD100K MOT Dataset shows the real challenges of object re-identification for tracking in autonomous driving.

Details about the MOT dataset can be found in the BDD100K paper (https://arxiv.org/abs/1805.04687). Access the BDD100K data website (https://bdd-data.berkeley.edu/) to download the data.

Folder Structure

bdd100k/
├── images/
|   ├── track/
|   |   ├── train/
|   |   |   ├── $VIDEO_NAME/
|   |   |   |   ├── $VIDEO_NAME-$FRAME_INDEX.jpg
|   |   ├── val/
|   |   ├── test/
├── labels-20/
|   ├── box-track/
|   |   ├── train/
|   |   |   ├── $VIDEO_NAME.json
|   |   ├── val/

The frames for each video are stored in a folder in the images directory. The labels for each video are stored in a json file with the format detailed below.

Label Format

Each json file contains a list of frame objects, and each frame object has the format below. The format follows the schema of the BDD100K data format (https://github.com/ucbdrive/bdd100k/blob/master/doc/format.md).

- name: string
- videoName: string
- index: int
- labels: [ ]
  - id: string
  - category: string
  - attributes:
    - Crowd: boolean
    - Occluded: boolean
    - Truncated: boolean
  - box2d:
    - x1: float
    - y1: float
    - x2: float
    - y2: float

There are 11 object categories in this release:

- pedestrian
- rider
- other person
- car
- bus
- truck
- train
- trailer
- other vehicle
- motorcycle
- bicycle

Notes:

- The same instance shares "id" across frames.
- The "pedestrian", "bicycle", and "motorcycle" categories correspond to the "person", "bike", and "motor" classes in the BDD100K Detection dataset.
- We consider "other person", "trailer", and "other vehicle" as distractors, which are ignored during evaluation. We only evaluate multi-object tracking on the other 8 categories.
- For the purpose of evaluation, we set three super-categories: "person" (with classes "pedestrian" and "rider"), "vehicle" ("car", "bus", "truck", and "train"), and "bike" ("motorcycle" and "bicycle").

Submission Format

The submission file for each of the two phases is a json file compressed by zip. Each json file is a list of frame objects with the format detailed below. The format also follows the schema of the BDD100K data format (https://github.com/ucbdrive/bdd100k/blob/master/doc/format.md).

- name: string
- labels: [ ]
  - id: string
  - category: string
  - box2d:
    - x1: float
    - y1: float
    - x2: float
    - y2: float

Note that objects with the same identity share "id" across frames in a given video, and ids should be unique across different videos. Our evaluation matches the category string, so you can assign your own integer IDs to the categories in your model. However, we recommend encoding the 8 relevant categories in the following order to make it easier for the research community to share models:

- pedestrian
- rider
- car
- truck
- bus
- train
- motorcycle
- bicycle
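To make the expected structure concrete, below is a minimal sketch of how per-frame tracker outputs could be packaged into this submission format using only the Python standard library. The `predictions` dictionary, the frame name, and the track ids are hypothetical placeholders for illustration, not part of any official toolkit.

```python
import json
import zipfile

# Hypothetical per-frame predictions produced by a tracker; the frame name
# pattern and track ids below are placeholders for illustration only.
predictions = {
    "VIDEO_NAME-0000001.jpg": [
        # (track_id, category, x1, y1, x2, y2)
        ("1", "car", 100.0, 200.0, 300.0, 400.0),
        ("2", "pedestrian", 50.0, 60.0, 90.0, 180.0),
    ],
}

frames = []
for frame_name, boxes in predictions.items():
    labels = []
    for track_id, category, x1, y1, x2, y2 in boxes:
        labels.append({
            "id": track_id,         # same id across frames of one instance
            "category": category,   # matched as a string during evaluation
            "box2d": {"x1": x1, "y1": y1, "x2": x2, "y2": y2},
        })
    frames.append({"name": frame_name, "labels": labels})

# The submission is a single json file (a list of frame objects) compressed by zip.
with open("submission.json", "w") as f:
    json.dump(frames, f)
with zipfile.ZipFile("submission.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write("submission.json")
```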
The evaluation server will perform evaluation for each category and aggregate the results to compute the overall metrics. Then the server will merge both the ground-truth and predicted labels into super-categories and evaluate each super-category.

Evaluation

Evaluation platform: We host our evaluation server on CodaLab (https://competitions.codalab.org/competitions/24492). There are two phases for the challenge: a val phase and a test phase. The final ranking will be based on the test phase.

Pre-training: It is fair game to pre-train your network with ImageNet or COCO, but if other datasets are used, please note this in the submission description. We will rank the methods that use no external datasets other than ImageNet and COCO.

Ignoring distractors: As a preprocessing step, all predicted boxes are matched, and the ones matched to distractor ground-truth boxes ("other person", "trailer", and "other vehicle") are ignored.

Crowd region: After bounding box matching, we ignore all detected false-positive boxes that have >50% overlap with the crowd region (ground-truth boxes with the "Crowd" attribute).

Super-category: In addition to the evaluation of all 8 classes, we merge ground-truth and prediction categories into the 3 super-categories specified above and evaluate the results for each super-category. The super-category evaluation results are provided for reference only.
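As a rough illustration of the crowd-region rule, the sketch below drops an unmatched (false-positive) predicted box when more than 50% of its area falls inside a crowd ground-truth box. It assumes "overlap" means the fraction of the predicted box covered by the crowd region; the evaluation server's exact matching procedure may differ.

```python
def area(box):
    """Area of a box given as (x1, y1, x2, y2)."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])

def intersection(a, b):
    """Area of the intersection of two boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    return area((x1, y1, x2, y2))

def keep_false_positive(pred_box, crowd_boxes, threshold=0.5):
    """Keep an unmatched prediction only if no more than `threshold`
    of its area lies inside any crowd ground-truth region."""
    pred_area = area(pred_box)
    if pred_area == 0.0:
        return False
    for crowd in crowd_boxes:
        if intersection(pred_box, crowd) / pred_area > threshold:
            return False
    return True

# Example: a prediction mostly inside a crowd region is ignored.
print(keep_false_positive((10, 10, 20, 20), [(0, 0, 18, 18)]))  # False
```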