We require participants to submit their results as a single .zip file. Each .txt file in the .zip file contains the results for the corresponding image or video clip. Notably, the result files for all images/video clips must be stored in the archive’s root folder.

The results file for each task should be stored in the SAME format as the provided ground-truth file, i.e., a CSV (comma-separated values) text file containing one object instance per line. If there is no detection/tracking output, please provide an empty file. We suggest participants review the ground-truth format before proceeding. The content of each line differs between tasks; the text-file format for each task is described in detail below.
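The packaging requirement above (all .txt files at the archive root, no subdirectories) can be sketched in Python; the function name and directory layout are hypothetical, not part of the official toolkit:

```python
import zipfile
from pathlib import Path

def pack_results(result_dir: str, zip_path: str) -> None:
    """Pack every .txt result file into a single .zip archive,
    placing each entry at the archive's root folder as required."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for txt in sorted(Path(result_dir).glob("*.txt")):
            # arcname uses only the file name, so entries sit at the root
            # even if result_dir is nested several levels deep
            zf.write(txt, arcname=txt.name)
```

Passing `arcname` explicitly is the important detail: without it, `zipfile` stores the full relative path and the files would no longer sit in the root folder.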

Object Detection in Images

Both the ground-truth annotations and the submitted results on test data have the same format for object detection in images. That is, each text file stores the detection results of the corresponding image, with each line containing one object instance in the image. The format of each line is as follows:

<bbox_left>,<bbox_top>,<bbox_width>,<bbox_height>,<score>,<object_category>,<truncation>,<occlusion>

Please find the example format of the submission of results for object detection in images here (BaiduYun|Google Drive).

Position  Name               Description
1         <bbox_left>        The x coordinate of the top-left corner of the predicted bounding box
2         <bbox_top>         The y coordinate of the top-left corner of the predicted bounding box
3         <bbox_width>       The width, in pixels, of the predicted bounding box
4         <bbox_height>      The height, in pixels, of the predicted bounding box
5         <score>            In the DETECTION result file, the confidence that the predicted bounding box encloses an object instance. In the GROUNDTRUTH file, the score is 1 or 0: 1 indicates the bounding box is considered in evaluation, while 0 indicates it is ignored.
6         <object_category>  The type of annotated object, i.e., ignored regions (0), pedestrian (1), people (2), bicycle (3), car (4), van (5), truck (6), tricycle (7), awning-tricycle (8), bus (9), motor (10), others (11)
7         <truncation>       In the DETECTION result file, this field should be set to the constant -1. In the GROUNDTRUTH file, it indicates the degree to which the object extends outside the frame: no truncation = 0 (truncation ratio 0%) and partial truncation = 1 (truncation ratio 1% ~ 50%).
8         <occlusion>        In the DETECTION result file, this field should be set to the constant -1. In the GROUNDTRUTH file, it indicates the fraction of the object that is occluded: no occlusion = 0 (occlusion ratio 0%), partial occlusion = 1 (occlusion ratio 1% ~ 50%), and heavy occlusion = 2 (occlusion ratio 50% ~ 100%).
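As a sanity check on the eight-field layout above, a minimal helper can format one detection into a result-file line; the function name is hypothetical, and <truncation>/<occlusion> are fixed to -1 as the table specifies for result files:

```python
def det_line(bbox_left, bbox_top, bbox_width, bbox_height, score, category):
    """Format one predicted object as a detection result-file line.
    <truncation> and <occlusion> are constant -1 in result files."""
    return f"{bbox_left},{bbox_top},{bbox_width},{bbox_height},{score},{category},-1,-1"
```

For example, `det_line(684, 8, 273, 116, 0.9, 4)` yields the line `684,8,273,116,0.9,4,-1,-1` for a car detection with confidence 0.9.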

Single-Object Tracking

Both the ground truth annotations and the submission of results on test data have the same format for single-object tracking. That is, each text file stores the single-object tracking results of the corresponding video clip, with each line containing the location and size of the target in the video frame. The format of each line is as follows:

<bbox_left>,<bbox_top>,<bbox_width>,<bbox_height>

Please find the example format of the submission of results for single-object tracking here (BaiduYun|Google Drive).

Position  Name           Description
1         <bbox_left>    The x coordinate of the top-left corner of the predicted bounding box
2         <bbox_top>     The y coordinate of the top-left corner of the predicted bounding box
3         <bbox_width>   The width, in pixels, of the predicted bounding box
4         <bbox_height>  The height, in pixels, of the predicted bounding box
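Writing the four-field tracking file above is straightforward; the sketch below (function name hypothetical) emits one `left,top,width,height` line per frame, in frame order:

```python
def write_sot_results(path, boxes):
    """Write single-object tracking results: one comma-separated
    `left,top,width,height` line per video frame, in frame order."""
    with open(path, "w") as f:
        for left, top, width, height in boxes:
            f.write(f"{left},{top},{width},{height}\n")
```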

Besides the CSV text file, the OTB-style MAT file of [1] is also accepted. In that case, the results should be saved in a cell structure named “results” with several components. The components of the cell structure are as follows:

<type>,<res>,<fps>,<len>,<annoBegin>,<startFrame>

Please find the example MAT format of the submission of results for single-object tracking here (BaiduYun|Google Drive).

Index  Variable      Description
1      <type>        The representation type of the predicted bounding box. It should be set to ‘rect’.
2      <res>         The tracking results for the video clip. Each row contains the frame index, the x and y coordinates of the top-left corner of the predicted bounding box, and the width and height, in pixels, of the predicted bounding box.
3      <fps>         The running speed of the evaluated tracker, in frames per second
4      <len>         The length of the evaluated sequence
5      <annoBegin>   The start frame index of the annotation. The default value is 1.
6      <startFrame>  The start frame index for tracking. The default value is 1.
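The cell structure above can be mirrored as a plain dictionary before saving (e.g. with `scipy.io.savemat` under a top-level `results` cell array); this is a sketch under that assumption, not the official toolkit code, and the function name is hypothetical:

```python
def make_otb_result(res, fps):
    """Build one OTB-style result entry. `res` rows follow the layout
    described above: frame index, x, y, width, height."""
    return {
        "type": "rect",     # bounding-box representation type
        "res": res,         # one row per tracked frame
        "fps": fps,         # tracker running speed, frames per second
        "len": len(res),    # length of the evaluated sequence
        "annoBegin": 1,     # start frame index of the annotation (default 1)
        "startFrame": 1,    # start frame index for tracking (default 1)
    }
```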

[1] Y. Wu, J. Lim, and M. Yang, “Object tracking benchmark,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 9, pp. 1834–1848, 2015.

Multi-Object Tracking

Both the ground truth annotations and the submission of results on test data have the same format for multi-object tracking. That is, each text file stores the multi-object tracking results of the corresponding video clip, with each line containing an object instance with the assigned identity in the video frame. The format of each line is as follows:

<frame_index>,<target_id>,<bbox_left>,<bbox_top>,<bbox_width>,<bbox_height>,<score>,<object_category>,<truncation>,<occlusion>

Please find the example format of the submission of results for multi-object tracking here (BaiduYun|Google Drive).

Position  Name               Description
1         <frame_index>      The frame index of the video frame
2         <target_id>        In the DETECTION result file, the identity of the target should be set to the constant -1. In the GROUNDTRUTH file, the identity of the target provides the temporal correspondence of the bounding boxes across frames.
3         <bbox_left>        The x coordinate of the top-left corner of the predicted bounding box
4         <bbox_top>         The y coordinate of the top-left corner of the predicted bounding box
5         <bbox_width>       The width, in pixels, of the predicted bounding box
6         <bbox_height>      The height, in pixels, of the predicted bounding box
7         <score>            In the DETECTION result file, the confidence that the predicted bounding box encloses an object instance. In the GROUNDTRUTH file, the score is 1 or 0: 1 indicates the bounding box is considered in evaluation, while 0 indicates it is ignored.
8         <object_category>  The type of annotated object, i.e., ignored regions (0), pedestrian (1), people (2), bicycle (3), car (4), van (5), truck (6), tricycle (7), awning-tricycle (8), bus (9), motor (10), others (11)
9         <truncation>       In the DETECTION result file, this field should be set to the constant -1. In the GROUNDTRUTH file, it indicates the degree to which the object extends outside the frame: no truncation = 0 (truncation ratio 0%) and partial truncation = 1 (truncation ratio 1% ~ 50%).
10        <occlusion>        In the DETECTION result file, this field should be set to the constant -1. In the GROUNDTRUTH file, it indicates the fraction of the object that is occluded: no occlusion = 0 (occlusion ratio 0%), partial occlusion = 1 (occlusion ratio 1% ~ 50%), and heavy occlusion = 2 (occlusion ratio 50% ~ 100%).
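The ten-field layout above extends the detection format with a frame index and an identity; a minimal formatter (function name hypothetical) makes the field order explicit:

```python
def mot_line(frame_index, target_id, left, top, width, height, score, category):
    """Format one multi-object tracking record as a result-file line.
    <truncation> and <occlusion> are constant -1 in result files."""
    return (f"{frame_index},{target_id},{left},{top},"
            f"{width},{height},{score},{category},-1,-1")
```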

Crowd Counting

The submitted results on test data have a different format from the ground-truth annotations for crowd counting. That is, each text file stores the counting results of the corresponding sequence, with each line giving the number of human heads in a frame. The format of each line is as follows:

<frame_index>,<counting_number>

Please find the example format of the submission of results for crowd counting here (BaiduYun|Google Drive).

Position Name Description
1 <frame_index> The frame index of the video frame
2 <counting_number> The number of human heads in the frame
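A per-sequence counting file as described above can be written with a short helper (function name hypothetical), emitting one `frame_index,count` line per frame in ascending frame order:

```python
def write_counts(path, counts_by_frame):
    """Write crowd-counting results: one `frame_index,counting_number`
    line per frame, sorted by frame index."""
    with open(path, "w") as f:
        for frame_index, count in sorted(counts_by_frame.items()):
            f.write(f"{frame_index},{count}\n")
```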