DR-MV3D: Dense Reward for Multi-View 3D Reasoning

TL;DR

Problem. Trained with sparse, answer-level supervision, multi-view 3D VQA models often yield inconsistent cross-view reasoning and brittle view selection.
DR-MV3D. A map-grounded framework that supervises the reasoning process with dense, verifiable rewards, decomposing MV3D-VQA into three stages:
- allocentric global-map construction
- question-conditioned view-trajectory planning
- egocentric grounding for answer prediction
Two dense rewards, optimized with GRPO:
- Global consistency reward – aligns the predicted map with geometry-consistent pseudo targets from frozen VFMs (VGGT+SAM3), needing no manual map annotations.
- Local trajectory reward – supervises ordered viewpoint selection.
Results. A 3B model reaches 66.5% on MindCube-Tiny (+28.7 percentage points over the baseline), with further gains on VSI-Bench and BLINK (MV).

Overview

DR-MV3D in one minute

The idea in one figure

Sparse Rewards Guess. Dense Rewards Check.

Multi-view input (Images 1 to 4) + question

Question: If I am standing at the same spot and facing the same direction as shown in image 4, then I turn left and move forward, will I get closer to the cabinet and potted plant?
Choices: A. No · B. Yes (GT)

Two ways to supervise the answer

Existing MLLM
sparse reward (answer only)

The cognitive map the MLLM drew on its own

the same map as a top-down grid with ego directions

The objects land in the wrong cells (the black sofa and the cabinet are swapped front-to-back). Nothing checks this geometry, so the inconsistent map is never corrected.

… In image 4, I can see yellow bottle in front of the light-colored sofa … Through analyzing these perspective changes, … if I move forward, I will be farther to the cabinet and potted plant.

Answer Reward only

✗ “A. No”

DR-MV3D
dense, verifiable rewards

DR-MV3D's predicted map ≈ VFM target (VGGT+SAM3), matched through the Global Reward

cognitive map (allocentric), as a top-down grid

The Global Reward pulls the model's predicted map toward the geometry-consistent VFM target, so the two match and the objects are placed correctly.

Global Reward · accurate cognitive map

ego-centric map (the queried view's frame)

egocentric map with the agent at the queried viewpoint

↶ Left cabinet & potted plant

Rotated into Image 4's frame after turning left: “cabinet and potted plant” is now to the left, so it can be checked in Image 3.

Local Trajectory Reward · proper reasoning trajectory

The question uses Image 4, so we should consider the viewpoint at Image 4 … “cabinet and potted plant” has ego direction as “left”, so it can be checked in Image 3 …

✓ “B. Yes”

Figure 1. Overview of DR-MV3D. Sparse, answer-level supervision (left) builds inconsistent cognitive maps and misreads egocentric directions, so the model lands on a wrong answer even when the reasoning looks plausible. DR-MV3D (right) supervises the process with dense, verifiable rewards: a global reward aligns the predicted allocentric map with a geometry-consistent target from a frozen VFM, and a local trajectory reward guides ordered viewpoint selection.

Abstract

Multi-view 3D Visual Question Answering (MV3D-VQA) asks a model to combine partial observations into a coherent 3D scene and to pick informative viewpoints for multi-step spatial reasoning. Current multimodal LLMs are usually trained with sparse, answer-level supervision, which often produces inconsistent cross-view reasoning and brittle view selection. We present DR-MV3D, a map-grounded framework that supplies dense, verifiable rewards to supervise the reasoning process. The approach splits MV3D-VQA into three steps: allocentric global map construction, question-conditioned view-trajectory planning, and egocentric grounding for answer prediction. To make the intermediate steps learnable without manual annotations, we add two rewards. A global consistency reward aligns the predicted map with geometry-consistent pseudo targets from frozen 3D vision foundation models (VGGT and SAM3). A local trajectory reward supervises ordered viewpoint selection. We optimize the full pipeline with trajectory-level policy optimization (GRPO). On MindCube, VSI-Bench, and BLINK (MV), DR-MV3D consistently improves over strong multi-image baselines. These results indicate that process-level dense supervision is a more effective signal for multi-view 3D reasoning than answer-level optimization alone.

66.5%

MindCube-Tiny
+28.7 percentage points over baseline

37.1

VSI-Bench (Avg)
+6.7 over baseline

56.4

BLINK (MV)
+14.3 over baseline

Motivation

More Views Are Not Enough

The hard part of MV3D-VQA is not getting more images. It is keeping a single, consistent scene representation that supports spatial reasoning. Three problems stand in the way.

Problem 1

Allocentric and egocentric frames do not match

Most methods reason in a world-centered (allocentric) map, but many queries are egocentric: they ask about viewpoint-dependent relations or hypothetical moves. Nothing ties the two frames together, so they drift apart.

Problem 2

Maps from MLLMs are not geometric

MLLMs rarely produce geometrically consistent 3D structure. A frozen 3D VFM (VGGT+SAM3) gives maps that sit closer to the ground truth, so it can act as a reliable structural prior.

Problem 3

Answer-only rewards are too sparse

A single reward on the final answer says little about a long chain of spatial steps. The reasoning trajectory needs feedback at each step.

Figure 2. Cognitive maps from a VFM are closer to ground truth. Across directional (Dir.), facing (Face.), and overall scores, a VFM-derived map (VGGT+SAM3) matches the ground-truth map better than an MLLM-generated one (Qwen2.5-3B).

Takeaway: when ground-truth maps are missing, a frozen VFM supplies a reliable pseudo-structural target to learn from.

Method

Build a Map, Plan the Views, Ground the Answer

Multi-view input + question

Question: If I am standing at the same spot and facing the same direction as shown in Image 1, what is behind me?
Choices: A. Window (GT) · B. Lots of toys · C. Wall · D. Printed glass door

Frozen 3D VFM VGGT + SAM3

Global allocentric map

objects on a top-down grid; the 4 views point at the queried object

Global Reward · align map to VFM

View-trajectory planning

Views 1 to 4 · Image 1 selected · reference V^* = (Image 1)

“Selecting a view: The question uses Image 1, so we should consider the viewpoint at Image 1.”

Local Trajectory Reward

Egocentric grounding

Image 3

“In the EgoMap, ‘Window’ has ego_direction=‘back’, so it matches ‘behind’. Confirming in Image 3: ‘Window’ can be checked in Image 3.”

✓ Therefore, the answer is “A. Window”

Answer + Format Reward

GRPO trajectory-level optimization with all four rewards

Overall framework with VFM-guided dense rewards

Figure 3. The DR-MV3D framework. From multi-view observations and a question, the MLLM builds a global allocentric cognitive map, plans a question-conditioned view trajectory, and grounds the selected views into egocentric maps to predict the answer. GRPO supervises every stage with multi-level verifiable rewards.

Formally, given multi-view images I = {I₁, …, I_N} and a spatial question q, the policy π_θ produces a trajectory that first builds a coarse allocentric map, then selects views, then grounds them before answering. Each stage below carries its own verifiable reward.

Stage 1

Global allocentric cognitive map

The model aggregates the views into one world-centered map C^allo ∼ π_θ(· | I) that stays fixed as the camera moves. The trouble is that language-driven maps are often not geometrically consistent. So we read a pseudo-target off a frozen 3D vision foundation model, C^* = SAM3(VGGT(I)), and align to it.

Global Reward R_global = sim(C^allo, C^*)

Why this matters: the supervision comes from geometry models, not from hand-labeled maps. This is what makes process-level training scale.
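
The page does not spell out the similarity function sim(·, ·), so the following is only a minimal sketch of one plausible choice. It assumes both maps are dictionaries from object names to top-down grid cells and scores the fraction of target objects placed within a small tolerance. The function name `global_reward`, the dictionary representation, and the `tol` parameter are all hypothetical.

```python
from typing import Dict, Tuple

Cell = Tuple[int, int]  # (row, col) on a top-down grid; an assumed representation


def global_reward(pred: Dict[str, Cell], target: Dict[str, Cell], tol: int = 1) -> float:
    """Hypothetical stand-in for sim(C_allo, C*).

    Returns the fraction of target objects that the predicted map places
    within `tol` cells (Chebyshev distance) of the target cell.
    """
    if not target:
        return 0.0
    hits = 0
    for name, (t_row, t_col) in target.items():
        if name not in pred:
            continue  # an object missing from the prediction counts as a miss
        p_row, p_col = pred[name]
        if max(abs(p_row - t_row), abs(p_col - t_col)) <= tol:
            hits += 1
    return hits / len(target)


# Toy maps (made-up coordinates, not taken from the paper).
vfm_target = {"sofa": (2, 2), "cabinet": (0, 4), "plant": (5, 5)}
predicted = {"sofa": (2, 3), "cabinet": (4, 0)}
print(global_reward(predicted, vfm_target))
```

In this toy case only the sofa lands near its target cell, so one of three objects is counted. A reward of this shape is verifiable in the sense used here: it is computed from the pseudo-target alone, with no learned judge.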

Stage 2

Question-conditioned view-trajectory planning

Conditioned on the map and the question, the policy autoregressively selects an ordered trajectory V = (v₁, …, v_T), keeping informative viewpoints and dropping visual noise. A reference order V^* is derived deterministically from benchmark metadata (the views that best see the queried objects).

Local Trajectory Reward R_local = (1/T) ∑_t 𝟙[v_t = v_t^*]
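
The formula above is a position-wise exact match, which can be sketched in a few lines. The function name is hypothetical, and two details are assumptions not stated on this page: T is taken to be the length of the reference trajectory, and a position the prediction never reaches counts as a mismatch.

```python
from typing import Sequence


def local_trajectory_reward(pred: Sequence[str], ref: Sequence[str]) -> float:
    """R_local = (1/T) * sum_t 1[v_t == v_t*].

    Assumption: T = len(ref); positions missing from `pred` are mismatches,
    and extra predicted views beyond the reference are ignored.
    """
    if not ref:
        return 0.0
    matches = sum(
        1 for t, v_ref in enumerate(ref) if t < len(pred) and pred[t] == v_ref
    )
    return matches / len(ref)


# Single-view reference, as when a query is anchored to Image 1.
print(local_trajectory_reward(["Image 1"], ["Image 1"]))  # 1.0
print(local_trajectory_reward(["Image 3"], ["Image 1"]))  # 0.0
```

Because the comparison is positional, selecting the right views in the wrong order is penalized, which is what makes this a reward on the trajectory and not on the set of views.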

Worked example · the “Among” query

Question: If I am standing at the same spot and facing the same direction as shown in Image 1, what is behind me? (A. Window (GT) · B. Lots of toys · C. Wall · D. Printed glass door)

The query is anchored to Image 1, so the reference trajectory is V^* = (Image 1).

Baseline · sparse

<think> … The opposite view of the first view is the third view. As “printed glass door” is observed behind “toy helicopter” in the third view, it means that in the first view, “printed glass door” is positioned behind me. … So the answer is “D. Printed glass door”. </think>

It reasons from Image 3, not the queried view, so v₁ ≠ v₁^* and R_local = 0.

DR-MV3D · dense

<think> (1) Selecting a view: The question uses Image 1, so we should consider the viewpoint at Image 1. (2) Egocentric layout: The asked direction is “behind”. In the EgoMap, “Window” has ego_direction=“back”, so it matches “behind”. (3) Confirming in Image 3: “Window” can be checked in Image 3. The direction decision comes from the egomap ego_direction for the asked viewpoint. (4) Therefore, the answer is “A. Window”. </think>

It plans v₁ = Image 1, matching the reference, so R_local = 1.

The local trajectory reward grades which view the model reads, and in what order. Reading the queried view is exactly what lets the egocentric map resolve “behind” to “Window” instead of “printed glass door”.

Why this matters: it scores the order of looking, not just the final answer, so the model learns to gather the right evidence before committing.

Stage 3

Egocentric grounding and answer

For each selected view the model converts global evidence into a viewpoint-aligned egocentric map C^ego_t ∼ π_θ(· | C^allo, v_t, I, q) through coordinate rotation, spatial filtering, and reference-frame alignment. MLLMs are pretrained mostly on egocentric, view-aligned data, so translating the abstract 3D layout back into that native format is where the answer becomes reliable. The final answer is read from the trajectory-conditioned egocentric evidence.
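
In DR-MV3D the model itself writes the egocentric map as text. The geometry it has to get right can still be illustrated with a small sketch of the coordinate-rotation step. Everything here is an assumption for illustration: 2D allocentric points, a heading angle measured counterclockwise from the +x axis, and the helper names `ego_direction` and `ego_map`. Spatial filtering is not covered.

```python
import math
from typing import Dict, Tuple

Point = Tuple[float, float]  # allocentric (x, y); +y is "north" in this sketch


def ego_direction(obj: Point, viewer: Point, heading_deg: float) -> str:
    """Label `obj` as front/back/left/right relative to a viewer at `viewer`
    whose heading is measured counterclockwise from the +x axis."""
    dx, dy = obj[0] - viewer[0], obj[1] - viewer[1]
    theta = math.radians(heading_deg)
    forward = dx * math.cos(theta) + dy * math.sin(theta)  # component along the gaze
    left = -dx * math.sin(theta) + dy * math.cos(theta)    # component to the viewer's left
    if abs(forward) >= abs(left):
        return "front" if forward > 0 else "back"
    return "left" if left > 0 else "right"


def ego_map(allo: Dict[str, Point], viewer: Point, heading_deg: float) -> Dict[str, str]:
    """Rotate every object of an allocentric map into the viewer's frame."""
    return {name: ego_direction(p, viewer, heading_deg) for name, p in allo.items()}


# Made-up layout: the viewer stands at the origin, first facing north (90 degrees).
scene = {"window": (0.0, 3.0), "bed": (-2.0, 0.0)}
print(ego_map(scene, (0.0, 0.0), 90.0))   # facing north
print(ego_map(scene, (0.0, 0.0), 180.0))  # after a 90-degree left turn (facing west)
```

A left turn adds 90 degrees to the heading, so an object that was in front ends up on the right. This is the kind of frame change that "turn left, what is to my right?" queries require.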

Putting it together with GRPO

Answer-only rewards are too sparse, so we add the two process rewards and optimize the whole trajectory with GRPO (no separate value model):

R = R_ans + R_format + λ_g R_global + λ_l R_local

Because R_global is computed against the VFM pseudo-target rather than an annotated map, training needs no manual map annotations, and the method still beats strong baselines.
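
As a rough sketch of how the combined reward feeds GRPO: each sampled response to a prompt gets a scalar reward R, and its advantage is that reward normalized against the other samples in the same group, which is why no value model is needed. The weights `lam_g` and `lam_l` below are placeholder values, since this page does not report them. Using the population standard deviation is likewise a choice made here, and the function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def total_reward(r_ans: float, r_format: float, r_global: float, r_local: float,
                 lam_g: float = 0.5, lam_l: float = 0.5) -> float:
    """R = R_ans + R_format + lam_g * R_global + lam_l * R_local.

    lam_g and lam_l are placeholder weights, not values from the paper.
    """
    return r_ans + r_format + lam_g * r_global + lam_l * r_local


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward by the mean and standard deviation of its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three sampled responses to the same prompt (made-up reward components).
group = [
    total_reward(1.0, 1.0, 0.9, 1.0),  # right answer, good map, right view
    total_reward(1.0, 1.0, 0.2, 0.0),  # right answer reached through a poor process
    total_reward(0.0, 1.0, 0.3, 0.0),  # wrong answer
]
print(group_relative_advantages(group))
```

With answer-only rewards the first two samples would receive the same advantage. The process terms separate them, so the response with a consistent map and the correct view order is reinforced more strongly.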

Results

Consistent Gains Over the Baseline, Across Benchmarks

MindCube-Tiny

Method	Params	Rotation	Among	Around	Overall
Proprietary models
GPT-5	–	94.5	38.2	68.4	56.3
Gemini-2.5-Pro	–	85.0	49.2	58.3	59.3
Open-weight multi-image models
Qwen2.5-VL-Instruct (baseline)	3B	34.0	36.0	45.2	37.8
Spatial-MLLM	4B	32.5	47.5	35.0	42.8
MindCube-CGMap-SFT	3B	34.5	54.2	70.8	54.4
Think3D (Qwen3-VL-RL)	4B	41.7	35.0	39.2	41.7
DR-MV3D (SFT)	3B	42.0	66.3	69.2	62.4
DR-MV3D (SFT + GRPO)	3B	43.0	71.3	73.6	66.5

Accuracy (%). Our 3B model posts the best overall score in the table, ahead of both the proprietary and the open-weight models, and gains 28.7 percentage points over the Qwen2.5-VL-3B baseline.

VSI-Bench

Method	Route Plan	Rel. Dir.	Rel. Dist.	App. Order	Avg
LLaVA-OneVision-7B	29.4	35.2	42.5	24.4	32.9
InternVL2-8B	28.9	33.4	38.0	46.4	36.7
Qwen2.5-VL-3B-Instruct (baseline)	27.3	43.8	25.9	24.4	30.4
DR-MV3D (SFT)	30.9	43.9	35.1	26.9	34.2
DR-MV3D (SFT + GRPO)	32.4	46.6	37.8	31.4	37.1

Accuracy (%). Dense-reward training carries over to sequential, navigation-style evaluation, raising the average by 6.7 over the 3B baseline.

BLINK (Multi-View)

Method	BLINK (MV)
Proprietary models
Gemini-2.5-Pro	44.9
Doubao-1.5	50.9
Open-weight models
Qwen2.5-VL-3B (baseline)	42.1
Spatial-MLLM	56.0
Think3D (Qwen3-VL-4B-T3RL)	53.4
DR-MV3D (SFT)	54.9
DR-MV3D (SFT + GRPO)	56.4

Accuracy (%). Best result among open-weight models, ahead of the 3B baseline by 14.3.

Ablation: SFT components (Overall)

Config	Allo.	Traj.	Ego.	Overall
Vanilla Qwen2.5-VL-3B	✗	✗	✗	37.8
+ Allocentric map	✓	✗	✗	52.8
+ Trajectory	✗	✓	✗	58.2
Full SFT	✓	✓	✓	62.4

Ablation: GRPO rewards (Overall)

Rewards	Global	Local	Overall
Answer & format only	✗	✗	63.8
+ Global	✓	✗	64.9
+ Global + Local	✓	✓	66.5

Main message: each piece helps. The global reward adds geometry, and the local reward adds the right viewing order. Together they take GRPO from 63.8 to 66.5.

Qualitative

Reading the Right View at the Right Step

Each example shows the real multi-view input, the VFM scene, and the cognitive map. The baseline reasons itself into the wrong object; DR-MV3D selects the right view and grounds the answer in its egocentric map.

among_group646_q0_2_1 · MindCube · Among

(a) Multi-view images (views 1 to 4)

(b) Scene from VFM

(c) Cognitive map with ego-directions

Question: If I am standing at the same spot and facing the same direction as shown in Image 1, what is behind me?

Choices: A. Window (GT) B. Lots of toys C. Wall D. Printed glass door

MindCube-SFT: <think> … The opposite view of the first view is the third view. As “printed glass door” is observed behind “toy helicopter” in the third view, it means that in the first view, “printed glass door” is positioned behind me. … So the answer is “D. Printed glass door”. </think>

DR-MV3D (Ours): <think> (1) Selecting a view: The question uses Image 1, so we should consider the viewpoint at Image 1. (2) Egocentric layout: The asked direction is “behind”. In the EgoMap, “Window” has ego_direction=“back”, so it matches “behind”. (3) Confirming in Image 3: “Window” can be checked in Image 3. The direction decision comes from the egomap ego_direction for the asked viewpoint. (4) Therefore, the answer is “A. Window”. </think>

rotation_group005_q2_3 · MindCube · Rotation

(a) Multi-view images (views 1 to 3)

(b) Scene from VFM

(c) Cognitive map with ego-directions

Question: If I am standing at the same spot and facing the same direction as shown in image 2, then I turn 90 degrees to the left, what is to my right?

Choices: A. Study table and black chair B. Bed C. Window (GT)

MindCube-SFT: <think> … From the perspective of image 2: ‘bed’ is in front, ‘window’ is to the right, ‘study table and black chair’ is behind. After turning 90 degrees to the left: ‘window’ is now to my right, ‘study table and black chair’ is now behind me. … So the answer is “B. Bed”. </think>

DR-MV3D (Ours): <think> (1) Selecting a view: After turning “left” from Image 2, it corresponds to the facing direction of Image 1, so we should consider the viewpoint at Image 1. (2) Egocentric layout: The asked direction is “right”. In the EgoMap, “Window” has ego_direction=“right”, so it matches “right”. (3) Confirming in Image 2: “Window” can be checked in Image 2. (4) Therefore, the answer is “C. Window”. </think>

BibTeX

@inproceedings{choi2026drmv3d,
  title     = {Dense Reward for Multi-View 3D Reasoning with Global Maps and Local Views},
  author    = {Choi, Jiho and Lee, Seonho and Park, Seojeong and Shim, Hyunjung},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026}
}