Goal

This paper aims at high-accuracy 3D object detection in autonomous driving scenarios.

We propose Multi-View 3D networks (MV3D), a sensory-fusion framework that takes both LIDAR point cloud and RGB images as input and predicts oriented 3D bounding boxes.

  • Encode the sparse 3D point cloud with a compact multi-view representation.
  • The network is composed of two subnetworks: one for 3D object proposal generation and another for multi-view feature fusion.
  • The proposal network generates 3D candidate boxes efficiently from the bird's eye view representation of the 3D point cloud (see the encoding sketch after this list).
  • We design a deep fusion scheme to combine region-wise features from multiple views and enable interactions between intermediate layers of different paths.
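As a rough illustration of the bird's eye view encoding mentioned above, the sketch below builds height, density and intensity maps from a LIDAR point cloud with NumPy. The crop ranges, the 0.1 m cell resolution and the number of height slices are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def encode_bev(points, x_range=(0, 70), y_range=(-40, 40), z_range=(-2.5, 1.5),
               res=0.1, n_height_slices=4):
    """Encode a LIDAR point cloud (N, 4: x, y, z, intensity) as bird's-eye-view maps:
    one height map per vertical slice, plus a density map and an intensity map,
    stacked as channels of a 2D grid."""
    H = int((x_range[1] - x_range[0]) / res)
    W = int((y_range[1] - y_range[0]) / res)

    # keep only points inside the crop
    mask = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
            (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]) &
            (points[:, 2] >= z_range[0]) & (points[:, 2] < z_range[1]))
    pts = points[mask]

    # discretize x/y coordinates into grid cells
    ix = ((pts[:, 0] - x_range[0]) / res).astype(np.int32)
    iy = ((pts[:, 1] - y_range[0]) / res).astype(np.int32)

    height_maps = np.zeros((n_height_slices, H, W), dtype=np.float32)
    density = np.zeros((H, W), dtype=np.float32)
    intensity = np.zeros((H, W), dtype=np.float32)

    slice_h = (z_range[1] - z_range[0]) / n_height_slices
    for iz in range(n_height_slices):
        lo = z_range[0] + iz * slice_h
        sel = (pts[:, 2] >= lo) & (pts[:, 2] < lo + slice_h)
        # maximum height of the points falling into each cell of this slice
        np.maximum.at(height_maps[iz], (ix[sel], iy[sel]), pts[sel, 2] - lo)

    # density: normalized point count per cell; intensity: reflectance of the highest point
    np.add.at(density, (ix, iy), 1.0)
    density = np.minimum(1.0, np.log(density + 1) / np.log(64))
    order = np.argsort(pts[:, 2])          # highest point is written last and wins
    intensity[ix[order], iy[order]] = pts[order, 3]

    return np.concatenate([height_maps, density[None], intensity[None]], axis=0)
```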

In the Multi-View 3D object detection network (MV3D), a well-designed model is required to make use of the strengths of multiple modalities.

The main idea for utilizing multimodal information is to perform region-based feature fusion.

  • The 3D proposal network utilizes a bird's eye view representation of the point cloud to generate highly accurate 3D candidate boxes. The benefit of 3D object proposals is that they can be projected to any view in 3D space.

  • The multi-view fusion network extracts region-wise features by projecting 3D proposals to the feature maps from multiple views. We design a deep fusion approach to enable interactions of intermediate layers from different views (see the sketch after this list).

  • Combined with drop-path training and auxiliary loss, our approach shows superior performance over the early/late fusion schemes. Given the multi-view feature representation, the network performs oriented 3D box regression, which predicts the accurate 3D location, size and orientation of objects in 3D space.
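The sketch below illustrates the deep fusion idea from the list above: region-wise features from the bird's eye view, front view and image paths are joined element-wise (mean) after every intermediate layer, so the paths can interact, instead of being concatenated only once at the input (early fusion) or at the output (late fusion). The layer sizes and depth are illustrative, and drop-path training / auxiliary losses are not shown.

```python
import torch
import torch.nn as nn

class DeepFusion(nn.Module):
    """Sketch of a deep fusion scheme over per-view region features."""
    def __init__(self, feat_dim=512, n_views=3, n_layers=3):
        super().__init__()
        # one small per-view transformation at each fusion stage
        self.stages = nn.ModuleList([
            nn.ModuleList([nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU())
                           for _ in range(n_views)])
            for _ in range(n_layers)
        ])

    def forward(self, view_feats):
        # view_feats: list of (num_rois, feat_dim) tensors, one per view
        feats = view_feats
        for stage in self.stages:
            # element-wise mean join lets the paths interact at intermediate layers
            fused = torch.stack(feats, dim=0).mean(dim=0)
            # each path continues from the fused features
            feats = [branch(fused) for branch in stage]
        return torch.stack(feats, dim=0).mean(dim=0)
```

The final fused feature would then feed the classification head and the oriented 3D box regression head that outputs location, size and orientation.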


Related Work

1. 3D Object Detection in Point Cloud

Encoding the point cloud:

  • Most existing methods encode the 3D point cloud with a voxel grid representation. Sliding Shapes [22] and Vote3D [26] apply SVM classifiers on 3D grids encoded with geometry features.
  • Some recently proposed methods [23, 7, 16] improve the feature representation with 3D convolutional networks, which, however, require expensive computations.

  • In addition to the 3D voxel representation, VeloFCN [17] projects the point cloud to the front view, obtaining a 2D point map. They apply a fully convolutional network on the 2D point map and predict 3D boxes densely from the convolutional feature maps.

In this work, we encode 3D point cloud with multi-view feature maps, enabling region-based representation for multimodal fusion.
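As a concrete illustration of the front-view point map mentioned in the related work above (one of the views that can be used alongside the bird's eye view and RGB image), the sketch below projects a point cloud onto a cylindrical 2D map. The angular resolutions and the choice of (height, distance) channels are assumptions for illustration, not values taken from the paper.

```python
import numpy as np

def project_to_front_view(points, d_theta=0.4, d_phi=0.08):
    """Project a LIDAR point cloud (N, 3: x, y, z) onto a cylindrical
    front-view 2D point map with (height, distance) channels."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.sqrt(x ** 2 + y ** 2)                  # range in the ground plane
    theta = np.degrees(np.arctan2(y, x))          # azimuth angle
    phi = np.degrees(np.arctan2(z, r))            # elevation angle

    # discretize the angles into image coordinates
    col = np.floor(theta / d_theta).astype(np.int32)
    row = np.floor(phi / d_phi).astype(np.int32)
    col -= col.min()
    row -= row.min()

    # fill the 2D map; nearer points overwrite farther ones in the same pixel
    fv = np.zeros((row.max() + 1, col.max() + 1, 2), dtype=np.float32)
    dist = np.sqrt(x ** 2 + y ** 2 + z ** 2)
    order = np.argsort(-dist)                     # far points first, near points last
    fv[row[order], col[order], 0] = z[order]
    fv[row[order], col[order], 1] = dist[order]
    return fv
```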
