Disc3D: Automatic Curation of High-Quality 3D Dialog Data via Discriminative Object Referring
3D Multi-modal Large Language Models (MLLMs) still lag behind their 2D peers, largely because large-scale, high-quality 3D scene–dialogue datasets remain scarce. Prior efforts hinge on expensive human annotation and leave two key ambiguities unresolved: viewpoint ambiguity, where spatial language presupposes camera poses that are unknown, and object-referring ambiguity, where non-exclusive descriptions blur the line between targets and distractors. We therefore present a fully automated pipeline that converts raw 3D scans into unambiguous, high-quality dialogue data at a fraction of the previous cost. By synergizing rule-based constraints with 2D MLLMs and LLMs, the pipeline enables controllable, scalable generation without human intervention. The pipeline comprises four stages: (1) meta-annotation collection, harvesting object-, frame-, and scene-level captions; (2) scene graph construction with relation correction to capture proximal object relations; (3) discriminative object referring, which generates exclusive and compact descriptions; and (4) multi-task data generation, synthesizing diverse dialogues. Our pipeline systematically mitigates inherent flaws in source datasets and produces the final Disc3D dataset: over 2 million samples in 25K hybrid 3D scenes, spanning scene, view, and object captioning, visual grounding, and five object-centric QA tasks. Extensive experiments demonstrate that training with Disc3D yields consistent, significant improvements on both public benchmarks and our multifaceted Disc3D-QA tasks. Code, data, and models will be publicly available.
Figure 1. Overview of our Disc3D dataset. (a) Disc3D comprises millions of dialogue samples across 25K hybrid (real and synthetic) scenes, covering five object-centric QA tasks and three caption tasks for training & benchmarking 3D MLLMs. Object descriptions are color-coded to match the corresponding highlighted objects. (b) Task and scene distributions in the training set. The object counting task is excluded due to category-specific answer bias.
Figure 2. A schematic overview of two key stages in our data curation pipeline. Specifically: a) Scene Graph Construction proceeds in two stages: an initial graph is built automatically, after which an LLM- and MLLM-guided module refines the misjudged relations (highlighted in red). b) Discriminative Object Referring produces exclusive, unambiguous descriptions for each object in the distractor group across five orthogonal axes. The first sub-stage, Comparative Disambiguation, contrasts objects of the same category along appearance, size, and relational cues; the second sub-stage, Spatial Anchoring, injects 3D context by explicitly conditioning every description on the designated anchor object or line of sight.
Environment Setup
Clone the GitHub repo, then set up the environment:

```shell
python3 -m pip install -r requirements.txt
```
Preparation for Data Synthesis
Note:
Due to the inconsistent data structures and annotation types (e.g., instance annotation on mesh, 2D annotation only, or only bounding-box-level annotation) across different raw 3D scan datasets, it is difficult to create a single, unified processing logic.
For most datasets, we recommend lifting the annotation information to mesh-level instance annotations to facilitate uniform processing.
Steps:
Download and decompress the original 3D scan data, which includes scene frame information (RGB, depth, camera intrinsics, camera pose) and annotation information (mesh, 2D map annotation, or 3D bounding-box-level annotation).
Organize the metadata of the original dataset as one JSON file per scene. Each file, named [scene_id].json, should contain the metadata for all frames in that scene: a dictionary mapping frame_id to frame_meta_info, following this schema:
```json
{
  "frame_id_0": {
    "image_path": "Path to the RGB image",
    "depth_path": "Path to the Depth image",
    "image_intrinsic": "Camera intrinsic matrix",
    "pose": "Camera pose matrix",
    "axis_align_matrix": "Transformation matrix from the original scene coordinate system to the gravity-aligned coordinate system. Some datasets provide this matrix, while others do not."
  },
  ...
}
```
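As a concrete illustration, a per-scene metadata file can be assembled like this. This is a minimal sketch, not the repository's actual code: the frame fields and the 3×3/4×4 matrices are placeholders that a real exporter would derive from the source dataset.

```python
import json
import os

def build_scene_meta(scene_id, frames, out_dir="."):
    """Write [scene_id].json mapping frame_id -> frame_meta_info."""
    meta = {}
    for frame in frames:
        meta[frame["frame_id"]] = {
            "image_path": frame["image_path"],
            "depth_path": frame["depth_path"],
            "image_intrinsic": frame["intrinsic"],  # 3x3, as nested lists
            "pose": frame["pose"],                  # 4x4 camera pose
            # Some datasets do not provide this matrix; store null then.
            "axis_align_matrix": frame.get("axis_align_matrix"),
        }
    path = os.path.join(out_dir, f"{scene_id}.json")
    with open(path, "w") as f:
        json.dump(meta, f, indent=2)
    return path
```

Downstream stages can then treat every source dataset uniformly by reading these per-scene JSON files.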
Based on the annotation type of your 3D source dataset, implement a corresponding SourceDataset subclass in dataset/src_dataset.py. Also, implement the routing logic in dataset/src_dataset.py:get_default_dataset to load the 3D source dataset’s metadata. Your SourceDataset subclass should inherit from the JsonFileDataset class to use its frame information processing logic. For mesh-level annotation information, you will need to implement a get_mesh_info method. We recommend starting with Scannet++, a high-quality dataset with rich scenes and comprehensive object annotations, which provides scene meshes and per-point instance annotations. You can use the existing ScannetPPDataset implementation in dataset/src_dataset.py.
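A skeletal subclass might look like the following. This is a hypothetical sketch, not the repository's code: the stand-in `JsonFileDataset` base and the `MyMeshDataset` name are illustrative, and the real implementations live in dataset/src_dataset.py.

```python
# Illustrative stand-in for the real base class in dataset/src_dataset.py,
# which provides the shared per-frame metadata processing logic.
class JsonFileDataset:
    def __init__(self, scene_id, meta_dir):
        self.scene_id = scene_id
        self.meta_dir = meta_dir

# Hypothetical subclass for a source dataset with mesh-level instance labels.
class MyMeshDataset(JsonFileDataset):
    def get_mesh_info(self):
        # In a real implementation: load the scene mesh and the per-point
        # instance annotations, and return them in the pipeline's format.
        return {
            "vertices": [],         # Nx3 coordinates (placeholder)
            "faces": [],            # Mx3 vertex indices (placeholder)
            "instance_labels": [],  # per-vertex instance ids (placeholder)
        }

# Routing logic analogous to dataset/src_dataset.py:get_default_dataset.
def get_default_dataset(dataset_name, scene_id, meta_dir):
    registry = {"my_mesh_dataset": MyMeshDataset}
    if dataset_name not in registry:
        raise ValueError(f"Unknown source dataset: {dataset_name}")
    return registry[dataset_name](scene_id, meta_dir)
```

The registry pattern keeps per-dataset quirks in one subclass each while the rest of the pipeline only talks to the common interface.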
Prepare your label graph to handle hierarchical relationships within the source dataset’s label system during data generation.
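The label graph can be as simple as a child-to-parent mapping over the source dataset's label system. A hedged sketch, with invented example categories:

```python
# Hypothetical label hierarchy: each label maps to its parent category.
LABEL_PARENT = {
    "office chair": "chair",
    "armchair": "chair",
    "chair": "furniture",
    "coffee table": "table",
    "table": "furniture",
}

def label_ancestors(label):
    """Return all ancestor categories of a label, nearest parent first."""
    ancestors = []
    while label in LABEL_PARENT:
        label = LABEL_PARENT[label]
        ancestors.append(label)
    return ancestors
```

Such a lookup lets the generation stages treat, e.g., "office chair" and "armchair" as hierarchically related rather than as unrelated categories.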
Run the Data Generation Pipeline

Set your API key via `export API_KEY=your_api_key`, and define the model name of your model-service client in annotation/gpt_assitant.py:MODEL_NAME_MAP.

| Dataset | Config |
| --- | --- |
| ScanNet++ | configs/scannetpp.yaml |
| ScanNet200 | configs/scannet200.train.yaml |
| 3RScan | configs/3rscan.yaml |
| CA-1M | configs/ca1m.yaml |
| Structured3D | configs/structured3d_room.yaml |

```shell
# step1: metadata collection
# the first run needs a GPU to render images and derive each object's visible views
# subsequent runs use the cached scene meta file by default
python3 collect_metadata.py -c [config_yaml]
# step2: scene graph construction
python3 generate_scene_graph.py -c [config_yaml]
# step3: discriminative object referring (./exclusive_description.py#L51) & multi-task data generation
python3 generate_task_data.py -c [config_yaml]
```
Visualize
During development, we found that Streamlit lacks easy-to-use 3D visualization components. Therefore, we integrated pyviz3d into Streamlit (visualize/stpyviz3d.py: stpyviz3d_visualize), which provides stable and smooth rendering, supports various types of annotations along with a control panel, and can be embedded into Streamlit web applications for multimodal data visualization and debugging.
Acknowledgements
This project borrows and adapts the scene-graph construction implementation from SceneVerse. To suppress noise and ambiguity, we discard multi-object relations (those involving more than two nodes) and all relative directions (e.g., left/right or clock-face positions).
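This filtering rule can be sketched as follows. A minimal illustration only: the edge representation and the relation names are assumptions for this example, not the actual SceneVerse or Disc3D data structures.

```python
# Hypothetical set of view-dependent relation names to discard.
RELATIVE_DIRECTIONS = {"left of", "right of", "at 3 o'clock", "at 9 o'clock"}

def filter_relations(edges):
    """Keep only binary, view-independent relations.

    Each edge is (relation_name, [node_ids...]); edges that involve more
    than two nodes or use a relative direction are dropped.
    """
    return [
        (rel, nodes) for rel, nodes in edges
        if len(nodes) == 2 and rel not in RELATIVE_DIRECTIONS
    ]
```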