Autonomous systems require continuous and dependable perception of their environment for navigation and decision making, which is best achieved by combining different sensor types. Radar continues to function robustly under degraded conditions in which cameras become impaired, guaranteeing a steady inflow of information. Camera images, however, provide a more intuitive and readily interpretable impression of the world. This work combines the complementary strengths of both sensor types in a self-learning fusion approach for probabilistic scene reconstruction under adverse environmental conditions. After reducing the memory requirements of the synchronized measurements through a decoupled stochastic self-supervised compression technique, the proposed algorithm exploits similarities and establishes correspondences between both domains at different feature levels during training. At inference time, relying exclusively on radio frequencies, the model then successively predicts camera constituents in an autoregressive and self-contained process. These discrete tokens are finally transformed into an instructive view of the respective surroundings, making potential dangers visually perceptible for important downstream tasks.
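The pipeline sketched above, per-modality compression into discrete tokens followed by autoregressive prediction of camera tokens conditioned on radar tokens, can be pictured with the following minimal PyTorch sketch. It assumes that hypothetical VQ-style tokenizers have already turned each modality into integer token sequences; all module names, shapes, and hyperparameters are illustrative assumptions and not the paper's actual implementation.

    # Hypothetical sketch: autoregressive prediction of camera tokens from radar tokens.
    # Names, shapes, and hyperparameters are illustrative assumptions only.
    import torch
    import torch.nn as nn

    class RadarToCameraTokens(nn.Module):
        """Predicts discrete camera tokens conditioned on radar tokens, one step at a time."""

        def __init__(self, vocab_size=1024, dim=512, heads=8, layers=6, max_len=256):
            super().__init__()
            self.radar_emb = nn.Embedding(vocab_size, dim)   # tokens from an assumed radar VQ encoder
            self.cam_emb = nn.Embedding(vocab_size, dim)     # tokens from an assumed camera VQ encoder
            self.pos_emb = nn.Embedding(max_len, dim)
            layer = nn.TransformerDecoderLayer(dim, heads, batch_first=True)
            self.decoder = nn.TransformerDecoder(layer, num_layers=layers)
            self.to_logits = nn.Linear(dim, vocab_size)

        def forward(self, radar_tokens, cam_tokens):
            # radar_tokens: (B, R) conditioning sequence; cam_tokens: (B, C) target prefix
            B, C = cam_tokens.shape
            pos = torch.arange(C, device=cam_tokens.device)
            tgt = self.cam_emb(cam_tokens) + self.pos_emb(pos)
            memory = self.radar_emb(radar_tokens)
            causal = nn.Transformer.generate_square_subsequent_mask(C).to(cam_tokens.device)
            hidden = self.decoder(tgt, memory, tgt_mask=causal)
            return self.to_logits(hidden)                    # next-token logits per position

        @torch.no_grad()
        def generate(self, radar_tokens, length, bos_id=0):
            # Successively sample camera tokens relying only on the radar measurements.
            tokens = torch.full((radar_tokens.size(0), 1), bos_id,
                                device=radar_tokens.device, dtype=torch.long)
            for _ in range(length):
                logits = self.forward(radar_tokens, tokens)[:, -1]
                next_tok = torch.multinomial(logits.softmax(dim=-1), 1)
                tokens = torch.cat([tokens, next_tok], dim=1)
            return tokens[:, 1:]                             # drop the BOS token

In such a sketch, the sampled camera tokens would subsequently be passed through the camera-side decoder of the compression stage to obtain the reconstructed view of the surroundings.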
The author would like to thank the EleutherAI community and members of the EleutherAI discord channels for fruitful and interesting discussions throughout the composition of this paper. Additional thanks to Phil Wang (lucidrains) for his tireless efforts in making attention-based algorithms accessible to the humble deep learning research community.