Tracking object with CenterNet, with very small modifications to the original work.

our two main technical contributions: tracking-conditioned detection (Section 4.1) and offset prediction (Section 4.2)

tracking_objects_as_points, page 12

The intro of the paper summarize it pretty well:

We condition the detector on two consecutive frames, as well as a heatmap of prior tracklets, represented as points. We train the detector to also output an offset vector from the current object center to its center in the previous frame. We learn this offset as an attribute of the center point at little additional computational cost. A greedy matching, based solely on the distance be- tween this predicted offset and the detected center point in the previous frame, suffices for object association. The tracker is end-to-end trainable and differentiable.

tracking_objects_as_points, page 2

Pay attention to the heatmap here. Just like CenterNet, this is Gaussian splatted keypoints. In training time, this come from ground truth, with augmentation. In inference time, this come from model output from , very convenient. This can be done because the heatmap both capture model internal state, and is human constructible. The augmentation is needed so that ground truth label may look like wacky model output. They added in jitter, false positives and false negatives. With this, the model can be trained with only static image also too. Just shift the image. See tracking_objects_as_points, page 7.

The offset prediction is used for association. There’s ablation study proving both it and heatmap as input is important.

The paper compare itself to Kalman filter and optical flow based methods. See tracking_objects_as_points, page 13.

On the high-framerate MOT17 dataset, any motion model suffices, and even no motion model at all performs competitively. On KITTI and nuScenes, where the intra-frame motions are non-trivial, the hand-crafted motion rule of the Kalman filter performs significantly worse, and even the performance of optical flow degrades. This emphasizes that our offset model does more than just motion estimation. CenterTrack is conditioned on prior detections and can learn to snap offset predictions to exactly those prior detections. Our training procedure strongly encourages this through heavy data augmentation.

tracking_objects_as_points, page 14