Note: this and the centernet_triplets paper share the same network name (CenterNet).

Unlike the centernet_triplets paper, this one is much more straightforward: no center pooling or cascade corner pooling. It uses a single point, the center of the object, as the anchor, and predicts all the other stuff from it.

Loss

The final layer of the model is a heatmap with an output stride of $R = 4$. Each ground-truth center is splatted with a Gaussian kernel (so the pixels around it are penalized less), and then compared against the predicted heatmap with a focal loss. See centernet_object_as_points, page 3. A local offset is also predicted to recover the discretization error introduced by the stride.
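A minimal PyTorch sketch of this penalty-reduced focal loss, assuming $\alpha = 2$, $\beta = 4$ as in the paper (the function name and `eps` are mine):

```python
import torch

# A minimal sketch of the penalty-reduced pixel-wise focal loss,
# assuming alpha=2, beta=4 as in the paper. `pred` is the predicted
# heatmap after a sigmoid, `gt` the Gaussian-splatted ground truth.
def center_focal_loss(pred, gt, alpha=2, beta=4, eps=1e-6):
    pos = gt.eq(1).float()              # exact center pixels
    neg = 1.0 - pos
    pos_loss = pos * (1 - pred) ** alpha * torch.log(pred + eps)
    # the Gaussian bumps in `gt` reduce the penalty near true centers
    neg_loss = neg * (1 - gt) ** beta * pred ** alpha * torch.log(1 - pred + eps)
    num_pos = pos.sum().clamp(min=1)    # N = number of object centers
    return -(pos_loss.sum() + neg_loss.sum()) / num_pos
```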

The overall objective is $L_{det} = L_k + \lambda_{size} L_{size} + \lambda_{off} L_{off}$, where $L_{size}$ supervises the bounding box params and $L_{off}$ the local offset. The values used in the experiments are $\lambda_{size} = 0.1$ and $\lambda_{off} = 1$.

From center to object

No NMS here. As long as a point is a local maximum on the heatmap (its value is $\geq$ its 8 neighbors), we say it's a center.
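In practice the local-maximum test reduces to a single tensor op. The 3×3 max-pool below is an implementation assumption, but it computes exactly the "value $\geq$ its 8 neighbors" check:

```python
import torch
import torch.nn.functional as F

# Sketch of the NMS-free decoding. The 3x3 max-pool is an implementation
# assumption, but it computes exactly the "value >= its 8 neighbors" test.
def extract_peaks(heatmap: torch.Tensor, k: int = 100):
    # heatmap: (B, C, H, W) per-class center probabilities
    pooled = F.max_pool2d(heatmap, kernel_size=3, stride=1, padding=1)
    peaks = heatmap * (pooled == heatmap).float()  # zero out non-maxima
    # top-k peaks per image, across all classes and locations
    scores, idx = torch.topk(peaks.flatten(1), k)
    return scores, idx
```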

3D detection

However, depth is difficult to regress to directly. We instead use the output transformation of Eigen et al. [13] and $d = 1/\sigma(\hat{d}) - 1$, where $\sigma$ is the sigmoid function.

centernet_object_as_points, page 4
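Decoding metric depth from the raw head output is a one-liner; `d_hat` here is a hypothetical name for that raw regression output:

```python
import torch

# One-liner sketch of the depth decoding quoted above; `d_hat` is a
# hypothetical name for the raw output of the depth head.
def decode_depth(d_hat: torch.Tensor) -> torch.Tensor:
    return 1.0 / torch.sigmoid(d_hat) - 1.0
```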

Orientation prediction is more elaborate.

We use an 8-scalar encoding to ease learning. The 8 scalars are divided into two groups, one per angular bin: one bin covers angles in $B_1 = [-\frac{7\pi}{6}, \frac{\pi}{6}]$ and the other covers $B_2 = [-\frac{\pi}{6}, \frac{7\pi}{6}]$. Thus we have 4 scalars for each bin. Within each bin, 2 of the scalars are used for softmax classification (whether the orientation falls into this bin $i$), and the remaining 2 are the sin and cos of the in-bin offset $\theta - m_i$ to the bin center $m_i$ (with $m_1 = -\frac{\pi}{2}$ and $m_2 = \frac{\pi}{2}$).

The loss is then softmax classification for the bin indicator and L1 regression for the offset values $\sin(\theta - m_i)$ and $\cos(\theta - m_i)$.

centernet_object_as_points, page 11
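A sketch of decoding this 8-scalar head at inference time. The per-bin scalar layout and function names are assumptions; the bin centers follow from the bins above:

```python
import math
import torch

# Sketch of decoding the 8-scalar orientation head. The per-bin scalar
# layout [logit_in, logit_out, sin(theta - m_i), cos(theta - m_i)] is an
# assumption; the bin centers m_1, m_2 follow from the bins above.
BIN_CENTERS = (-math.pi / 2, math.pi / 2)

def decode_orientation(o: torch.Tensor) -> torch.Tensor:
    # o: (N, 8) raw outputs for N detections
    thetas = []
    for i, m in enumerate(BIN_CENTERS):
        s, c = o[:, 4 * i + 2], o[:, 4 * i + 3]
        thetas.append(m + torch.atan2(s, c))   # bin center + in-bin offset
    thetas = torch.stack(thetas, dim=1)        # (N, 2): one angle per bin
    # pick the bin whose "inside this bin" logit wins its 2-way softmax
    in_scores = torch.stack([o[:, 0] - o[:, 1], o[:, 4] - o[:, 5]], dim=1)
    best = in_scores.argmax(dim=1)
    return thetas.gather(1, best[:, None]).squeeze(1)
```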

Human pose

How do you know which keypoints belong to the same human?

  1. Add an output head: offsets from the center to all the keypoints.
  2. Refine the keypoints by also predicting a joint heatmap (obtained just like the center heatmap); see the sketch after the quote below.

We then snap our initial predictions to the closest detected keypoint on this heatmap. Here, our center offset acts as a grouping cue, to assign individual keypoint detections to their closest person instance.

centernet_object_as_points, page 4
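A sketch of that snapping/grouping step for a single joint type; all names and shapes here are assumptions:

```python
import torch

# Sketch of the snapping/grouping step for one joint type. Names and
# shapes are assumptions: `regressed` (P, 2) holds center + offset
# predictions for P people, `detected` (D, 2) the heatmap detections.
def snap_to_heatmap(regressed: torch.Tensor, detected: torch.Tensor) -> torch.Tensor:
    if detected.numel() == 0:
        return regressed                   # nothing on the heatmap to snap to
    d = torch.cdist(regressed, detected)   # (P, D) pairwise distances
    return detected[d.argmin(dim=1)]       # nearest heatmap detection per person
```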

There’s an extra experiment (centernet_object_as_points, page 7) showing that dropping this heatmap refinement makes things worse.

Inference

They run augmented copies of the input at inference time and average the outputs, ensemble-style. It’s the first time I’ve seen this trick (is it even legal?)

We use three levels of test augmentations: no augmentation, flip augmentation, and flip and multi-scale (0.5, 0.75, 1, 1.25, 1.5). For flip, we average the network outputs before decoding bounding boxes. For multi-scale, we use NMS to merge results. These augmentations yield different speed-accuracy trade-off, as is shown in the next section.

centernet_object_as_points, page 5
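A sketch of the flip branch, averaging outputs before decoding; `model` is a hypothetical callable returning just the class heatmap:

```python
import torch

# Sketch of the flip branch: average outputs *before* decoding boxes.
# `model` is a hypothetical callable returning just the class heatmap.
def flip_averaged_heatmap(model, image: torch.Tensor) -> torch.Tensor:
    out = model(image)                             # (B, C, H, W)
    out_flip = model(torch.flip(image, dims=[3]))  # run on flipped input
    out_flip = torch.flip(out_flip, dims=[3])      # un-flip back
    return 0.5 * (out + out_flip)
```

Note that for the offset head the un-flip is slightly more involved (the x-component changes sign); the sketch only covers the heatmap.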