Baseline and Dataset for Cross-Dataset 3D Visual Grounding on Different RGB-D Scans

Bibliographic Information

Other Title
  • Cross-Dataset 3D Visual Grounding with Different RGB-D Scans (異なるRGB-Dスキャンを用いたデータセット横断3D言語接地)

Abstract

<p>We introduce Cross3DVG, a new task for cross-dataset visual grounding in 3D scenes, which exposes a shortcoming of current 3D visual grounding models: because they are developed on limited datasets, they easily overfit to specific scene sets. For Cross3DVG, we created a new large-scale 3D visual grounding dataset containing over 63k diverse, human-annotated linguistic descriptions of 3D objects in 1,380 RGB-D indoor scans from the 3RScan dataset. It complements the existing 52k descriptions in ScanRefer, the ScanNet-based 3D visual grounding dataset. We perform cross-dataset 3D visual grounding experiments in which a 3D visual grounding model is trained on a source 3D visual grounding dataset and then evaluated on a target 3D visual grounding dataset without target labels (i.e., a zero-shot setting). Extensive experiments using well-established visual grounding models as well as a CLIP-based 2D-3D integration method show that (i) cross-dataset 3D visual grounding performs significantly worse than training and evaluating on a single dataset, (ii) better detectors and transformer-based heads for 3D grounding help, and (iii) fusing 2D and 3D data using CLIP can further improve performance.</p>
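The zero-shot cross-dataset protocol described above can be sketched in a few lines: fit a grounding model on the source dataset (descriptions plus labels), then predict on the target dataset using descriptions only, with the target labels used solely for scoring. This is a minimal illustrative sketch, not the authors' code; `cross_dataset_eval`, `accuracy_at_iou`, and the toy 2D boxes standing in for 3D bounding boxes are all hypothetical.

```python
def accuracy_at_iou(pred_boxes, gt_boxes, threshold=0.5):
    """Fraction of predictions whose IoU with the ground truth meets the
    threshold. Boxes are (x1, y1, x2, y2); a toy 2D stand-in for 3D IoU."""
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        union = area(a) + area(b) - inter
        return inter / union if union else 0.0
    hits = sum(iou(p, g) >= threshold for p, g in zip(pred_boxes, gt_boxes))
    return hits / len(gt_boxes)

def cross_dataset_eval(model, source_train, target_test, train_fn, predict_fn):
    """Zero-shot cross-dataset protocol: train on the source dataset
    (which includes labels), then predict on the target dataset from
    descriptions alone; target labels are only used to score predictions."""
    train_fn(model, source_train)                        # source labels used here
    preds = [predict_fn(model, desc) for desc, _ in target_test]
    gts = [gt for _, gt in target_test]
    return accuracy_at_iou(preds, gts)
```

In practice `train_fn` would fit a 3D grounding model (e.g., a detector plus a transformer head) on ScanRefer/ScanNet, and `target_test` would hold Cross3DVG description-box pairs from 3RScan, or vice versa.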
