BEVFormer: Learning Bird’s-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers
The network’s core representation is a bird’s-eye-view feature map built from a grid of BEV queries, which are refined with temporal self-attention followed by spatial cross-attention. The authors say they were inspired by Tesla’s video spatial transformer.

BEV Queries
A grid-shaped set of learnable parameters; each grid cell (query) corresponds to a region of real-world space around the ego vehicle.
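As a minimal sketch (the grid size and channel width below are illustrative assumptions, not necessarily the paper’s exact config), the BEV queries are just a learnable tensor plus a positional embedding:

```python
import torch
import torch.nn as nn

# Illustrative sizes (assumed, not necessarily the paper's exact config).
bev_h, bev_w, embed_dim = 200, 200, 256

# BEV queries: one learnable C-dim vector per BEV grid cell; the cell at
# (i, j) is responsible for a fixed patch of ground around the ego vehicle.
bev_queries = nn.Parameter(torch.randn(bev_h * bev_w, embed_dim))

# A learnable positional embedding is typically added so each query
# knows which BEV cell it represents.
bev_pos = nn.Parameter(torch.randn(bev_h * bev_w, embed_dim))
```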
Spatial Cross-Attention
This part resembles the BEV baseline, but instead of just grabbing context from a single CNN embedding, each BEV query attends to the features of multiple cameras. The attention is not global: it uses deformable attention, so each query only interacts with small regions of interest. The reference points are sampled along a pillar lifted from the BEV cell and projected onto the images.
Basically this means: for each BEV query, sum the deformable-attention outputs between the query and the image features around its projected reference points, over all camera views that see them.
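In equation form (written from memory of the paper’s spatial cross-attention; notation may differ slightly):

$$\mathrm{SCA}(Q_p, F_t) = \frac{1}{|\mathcal{V}_{\mathrm{hit}}|} \sum_{i \in \mathcal{V}_{\mathrm{hit}}} \sum_{j=1}^{N_{\mathrm{ref}}} \mathrm{DeformAttn}\big(Q_p,\ \mathcal{P}(p, i, j),\ F_t^i\big)$$

Here $Q_p$ is the BEV query at location $p$, $\mathcal{V}_{\mathrm{hit}}$ is the set of camera views the lifted pillar projects into, $N_{\mathrm{ref}}$ is the number of reference points along the pillar, $\mathcal{P}(p, i, j)$ projects the $j$-th reference point onto the $i$-th view, and $F_t^i$ is that view’s image feature map.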
Temporal Self-Attention
Align the previous frame’s BEV features to the current frame according to ego-motion, then run self-attention (also deformable) between the current BEV queries and the aligned history features.
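A hedged sketch of the alignment step (the function, axis, and sign conventions below are my assumptions, not the paper’s code): warp the previous BEV feature map by the ego translation/rotation so static objects stay in the same cell, then attend.

```python
import math
import torch
import torch.nn.functional as F

def align_prev_bev(prev_bev, delta_x, delta_y, delta_yaw, bev_range_m):
    """Warp the previous frame's BEV features into the current ego frame.

    prev_bev:    (1, C, H, W) BEV features from frame t-1.
    delta_x/y:   ego translation in meters between t-1 and t.
    delta_yaw:   ego rotation in radians between t-1 and t.
    bev_range_m: metric side length covered by the BEV grid.
    """
    # grid_sample works in normalized [-1, 1] coordinates, so convert the
    # metric translation into that range.
    tx = 2.0 * delta_x / bev_range_m
    ty = 2.0 * delta_y / bev_range_m

    cos_y, sin_y = math.cos(delta_yaw), math.sin(delta_yaw)
    # 2x3 affine matrix that undoes the ego motion, so a static object maps
    # to the same BEV cell in both frames (signs depend on axis conventions).
    theta = torch.tensor([[cos_y, -sin_y, -tx],
                          [sin_y,  cos_y, -ty]],
                         dtype=prev_bev.dtype).unsqueeze(0)

    grid = F.affine_grid(theta, size=list(prev_bev.shape), align_corners=False)
    # Aligned history; deformable self-attention then runs over
    # {current BEV queries, aligned previous BEV}.
    return F.grid_sample(prev_bev, grid, align_corners=False)
```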
How well does it perform?
Better than other camera-only methods at the time, but far worse than PointPillars, and even worse than the BEV baseline.