ECCV 2026

Dense Reward for Multi-View 3D Reasoning
with Global Maps and Local Views

DR-MV3D
1 KAIST AI    2 KRAFTON
* Equal contribution  ·  Corresponding author

TL;DR

  • Problem. Trained with sparse, answer-level supervision, multi-view 3D VQA models often yield inconsistent cross-view reasoning and brittle view selection.
  • DR-MV3D. A map-grounded framework that supervises the reasoning process with dense, verifiable rewards, decomposing MV3D-VQA into three stages:
    • allocentric global-map construction
    • question-conditioned view-trajectory planning
    • egocentric grounding for answer prediction
  • Two dense rewards, optimized with GRPO:
    • Global consistency reward – aligns the predicted map with geometry-consistent pseudo targets from frozen VFMs (VGGT+SAM3), needing no manual map annotations.
    • Local trajectory reward – supervises ordered viewpoint selection.
  • Results. A 3B model reaches 66.5% on MindCube-Tiny (+28.7%p), with further gains on VSI-Bench and BLINK (MV).

Overview

DR-MV3D in one minute

The idea in one figure

Sparse Rewards Guess. Dense Rewards Check.

Multi-view input + question
Images: Image 1, Image 2, Image 3, Image 4
Question: If I am standing at the same spot and facing the same direction as shown in image 4, then I turn left and move forward, will I get closer to the cabinet and potted plant?
Choices: A. No · B. Yes (GT)

Two ways to supervise the answer

Existing MLLM · sparse reward (answer only)
  • Cognitive map: drawn by the MLLM on its own, shown as a top-down grid with ego directions. The objects land in the wrong cells (the black sofa and the cabinet are swapped front-to-back). Nothing checks this geometry, so the inconsistent map is never corrected.
  • Reasoning: “In image 4, I can see yellow bottle in front of the light-colored sofa … Through analyzing these perspective changes, … if I move forward, I will be farther to the cabinet and potted plant.”
  • Reward: Answer Reward only
  • Output: ✗ “A. No”

DR-MV3D · dense, verifiable rewards
  • Cognitive map (allocentric), as a top-down grid: the Global Reward pulls the model's predicted map toward the geometry-consistent VFM target (VGGT+SAM3), so the two match and the objects are placed correctly.
  • Egocentric map (the queried view's frame): rotated into Image 4's frame after turning left, “cabinet and potted plant” is now to the left, so it can be checked in Image 3.
  • Reasoning: “The question uses Image 4, so we should consider the viewpoint at Image 4 … “cabinet and potted plant” has ego direction as “left”, so it can be checked in Image 3”
  • Rewards: Global Reward (accurate cognitive map) · Local Trajectory Reward (proper reasoning trajectory)
  • Output: ✓ “B. Yes”

Figure 1. Overview of DR-MV3D. Sparse, answer-level supervision (left) builds inconsistent cognitive maps and misreads egocentric directions, so the model lands on a wrong answer even when the reasoning looks plausible. DR-MV3D (right) supervises the process with dense, verifiable rewards: a global reward aligns the predicted allocentric map with a geometry-consistent target from a frozen VFM, and a local trajectory reward guides ordered viewpoint selection.

Abstract

Multi-view 3D Visual Question Answering (MV3D-VQA) asks a model to combine partial observations into a coherent 3D scene and to pick informative viewpoints for multi-step spatial reasoning. Current multimodal LLMs are usually trained with sparse, answer-level supervision. This often produces inconsistent cross-view reasoning and brittle view selection. We present DR-MV3D, a map-grounded framework that supplies dense, verifiable rewards to supervise the reasoning process. The approach splits MV3D-VQA into three steps: allocentric global map construction, question-conditioned view-trajectory planning, and egocentric grounding for answer prediction. To make the intermediate steps learnable without manual annotations, we add two rewards. A global consistency reward aligns the predicted map with geometry-consistent pseudo targets from frozen 3D vision foundation models (VGGT and SAM3). A local trajectory reward supervises ordered viewpoint selection. We optimize the full pipeline with trajectory-level policy optimization (GRPO). On MindCube, VSI-Bench, and BLINK (MV), DR-MV3D consistently improves over strong multi-image baselines. Process-level dense supervision is a more effective signal for multi-view 3D reasoning than answer-level optimization alone.

  • MindCube-Tiny: 66.5% (+28.7%p over baseline)
  • VSI-Bench (Avg): 37.1 (+6.7 over baseline)
  • BLINK (MV): 56.4 (+14.3 over baseline)

Motivation

More Views Are Not Enough

The hard part of MV3D-VQA is not getting more images. It is keeping a single, consistent scene representation that supports spatial reasoning. Three problems stand in the way.

Problem 1

Allocentric and egocentric do not match

Most methods reason in a world-centered map. Many queries are egocentric, asking about viewpoint-dependent relations or hypothetical moves. The two frames drift apart.

Problem 2

Maps from MLLMs are not geometric

MLLMs rarely produce geometrically consistent 3D structure. A frozen 3D VFM (VGGT+SAM3) gives maps that sit closer to the ground truth, so it can act as a reliable structural prior.

Problem 3

Answer-only rewards are too sparse

A single reward on the final answer says little about a long chain of spatial steps. The reasoning trajectory needs feedback at each step.

Figure 2. Cognitive maps from a VFM are closer to ground truth. Across directional (Dir.), facing (Face.), and overall scores, a VFM-derived map (VGGT+SAM3) matches the ground-truth map better than an MLLM-generated one (Qwen2.5-3B).

Takeaway: when ground-truth maps are missing, a frozen VFM supplies a reliable pseudo-structural target to learn from.

Method

Build a Map, Plan the Views, Ground the Answer

Multi-view input + question
Views: view 1, view 2, view 3, view 4
Question: If I am standing at the same spot and facing the same direction as shown in Image 1, what is behind me?
Choices: A. Window (GT) · B. Lots of toys · C. Wall · D. Printed glass door

  • Frozen 3D VFM (VGGT + SAM3): VFM 3D reconstruction.
  • Global allocentric map: objects on a top-down grid; the 4 views point at the queried object. Reward: Global Reward · align map to VFM.
  • View-trajectory planning: Image 1 selected · reference V* = (Image 1). “Selecting a view: The question uses Image 1, so we should consider the viewpoint at Image 1.” Reward: Local Trajectory Reward.
  • Egocentric grounding: “In the EgoMap, ‘Window’ has ego_direction=‘back’, so it matches ‘behind’. Confirming in Image 3: ‘Window’ can be checked in Image 3.” ✓ Therefore, the answer is “A. Window”. Reward: Answer + Format Reward.
  • GRPO: trajectory-level optimization with all four rewards.

Figure 3. The DR-MV3D framework. From multi-view observations and a question, the MLLM builds a global allocentric cognitive map, plans a question-conditioned view trajectory, and grounds the selected views into egocentric maps to predict the answer. GRPO supervises every stage with multi-level verifiable rewards.

Formally, given multi-view images $I = \{I_1, \dots, I_N\}$ and a spatial question $q$, the policy $\pi_\theta$ produces a trajectory that first builds a coarse allocentric map, then selects views, then grounds them before answering. Each stage below carries its own verifiable reward.

Stage 1

Global allocentric cognitive map

The model aggregates the views into one world-centered map $C_{\text{allo}} \sim \pi_\theta(\cdot \mid I)$ that stays fixed as the camera moves. Language-driven maps, however, are often not geometrically consistent, so we read a pseudo-target off a frozen 3D vision foundation model, $C^* = \mathrm{SAM3}(\mathrm{VGGT}(I))$, and align the predicted map to it.

Global Reward: $R_{\text{global}} = \mathrm{sim}(C_{\text{allo}}, C^*)$

Why this matters: the supervision comes from geometry models, not from hand-labeled maps. This is what makes process-level training scale.
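
The page does not spell out how the similarity between the predicted map and the VFM pseudo-target is computed. As a minimal sketch of one possible choice, the code below assumes both maps are dictionaries from object name to a top-down grid cell, and scores the fraction of object pairs whose coarse relative direction agrees. The function names, the map format, and the pairwise-direction metric are all hypothetical and are not taken from the paper.

```python
from itertools import combinations


def sign(x: float) -> int:
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def relation(a, b):
    """Coarse relative direction from grid cell a to grid cell b."""
    return (sign(b[0] - a[0]), sign(b[1] - a[1]))


def map_similarity(pred: dict, target: dict) -> float:
    """Hypothetical sim(C_allo, C*): share of target object pairs whose
    coarse relative direction is reproduced by the predicted map.
    Pairs with an object missing from the prediction count as misses."""
    pairs = list(combinations(sorted(target), 2))
    if not pairs:
        return 0.0
    hits = 0
    for a, b in pairs:
        if a in pred and b in pred:
            hits += relation(pred[a], pred[b]) == relation(target[a], target[b])
    return hits / len(pairs)


# Illustrative (made-up) grid cells: the prediction swaps sofa and cabinet.
target_map = {"sofa": (2, 1), "cabinet": (2, 4), "plant": (0, 0)}
swapped_map = {"sofa": (2, 4), "cabinet": (2, 1), "plant": (0, 0)}
r_global_exact = map_similarity(target_map, target_map)
r_global_swapped = map_similarity(swapped_map, target_map)
```

A pairwise score like this is invariant to translating the whole map, which is one reason it might suit pseudo-targets whose absolute coordinates are arbitrary. The paper's actual similarity may differ.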

Stage 2

Question-conditioned view-trajectory planning

Conditioned on the map and the question, the policy autoregressively selects an ordered trajectory $V = (v_1, \dots, v_T)$, keeping informative viewpoints and dropping visual noise. A reference order $V^*$ is derived deterministically from benchmark metadata (the views that best see the queried objects).

Local Trajectory Reward: $R_{\text{local}} = \frac{1}{T} \sum_{t=1}^{T} \mathbb{1}[v_t = v_t^*]$

Worked example · the “Among” query

Question: If I am standing at the same spot and facing the same direction as shown in Image 1, what is behind me?  (A. Window (GT) · B. Lots of toys · C. Wall · D. Printed glass door)

The query is anchored to Image 1, so the reference trajectory is V* = (Image 1).

Baseline · sparse
<think> … The opposite view of the first view is the third view. As “printed glass door” is observed behind “toy helicopter” in the third view, it means that in the first view, “printed glass door” is positioned behind me. … So the answer is “D. Printed glass door”. </think>
It reasons from Image 3, not the queried view, so $v_1 \neq v_1^*$ and $R_{\text{local}} = 0$.

DR-MV3D · dense
<think> (1) Selecting a view: The question uses Image 1, so we should consider the viewpoint at Image 1. (2) Egocentric layout: The asked direction is “behind”. In the EgoMap, “Window” has ego_direction=“back”, so it matches “behind”. (3) Confirming in Image 3: “Window” can be checked in Image 3. The direction decision comes from the egomap ego_direction for the asked viewpoint. (4) Therefore, the answer is “A. Window”. </think>
It plans $v_1 = \text{Image 1}$, matching the reference, so $R_{\text{local}} = 1$.

The local trajectory reward grades which view the model reads, and in what order. Reading the queried view is exactly what lets the egocentric map resolve “behind” to “Window” instead of “printed glass door”.

Why this matters: it scores the order of looking, not just the final answer, so the model learns to gather the right evidence before committing.
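
The local trajectory reward is a position-wise match rate between the predicted and reference view orders. A minimal sketch follows, assuming that T is the length of the reference trajectory, that extra predicted views are ignored, and that missing ones count as misses. The page does not state how length mismatches are handled, so that part is an assumption, as is the function name.

```python
def local_trajectory_reward(pred, ref):
    """R_local = (1/T) * sum_t 1[v_t == v_t*], with T = len(ref) (assumed)."""
    if not ref:
        return 0.0
    # zip stops at the shorter sequence, so missing predictions add no matches.
    matches = sum(p == r for p, r in zip(pred, ref))
    return matches / len(ref)


# The worked example above: the reference trajectory is (Image 1).
baseline_reward = local_trajectory_reward(["Image 3"], ["Image 1"])
dense_reward = local_trajectory_reward(["Image 1"], ["Image 1"])
```

Here `baseline_reward` is 0.0 and `dense_reward` is 1.0, matching the two cases in the worked example.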

Stage 3

Egocentric grounding and answer

For each selected view the model converts global evidence into a viewpoint-aligned egocentric map $C_{\text{ego}}^{t} \sim \pi_\theta(\cdot \mid C_{\text{allo}}, v_t, I, q)$ through coordinate rotation, spatial filtering, and reference-frame alignment. MLLMs are pretrained mostly on egocentric, view-aligned data, so translating the abstract 3D layout back into that native format is where the answer becomes reliable. The final answer is read from the trajectory-conditioned egocentric evidence.
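
In DR-MV3D the egocentric map is generated by the MLLM itself, but the geometric operation it has to emulate is an ordinary change of reference frame. The sketch below illustrates that operation only. It assumes 2D top-down coordinates, a heading angle measured counterclockwise from the +x axis, and a four-way front/back/left/right binning. These conventions and the function names are hypothetical.

```python
import math


def ego_direction(obj_xy, viewer_xy, heading_deg):
    """Bin an object into front/back/left/right relative to a viewer.

    heading_deg is the viewer's facing direction, measured
    counterclockwise from the +x axis of the allocentric map.
    """
    dx = obj_xy[0] - viewer_xy[0]
    dy = obj_xy[1] - viewer_xy[1]
    h = math.radians(heading_deg)
    # Project the offset onto the viewer's forward axis and left axis.
    forward = dx * math.cos(h) + dy * math.sin(h)
    left = -dx * math.sin(h) + dy * math.cos(h)
    if abs(forward) >= abs(left):
        return "front" if forward >= 0 else "back"
    return "left" if left > 0 else "right"


def ego_map(allo_map, viewer_xy, heading_deg):
    """Convert an allocentric {object: (x, y)} map into ego directions."""
    return {name: ego_direction(xy, viewer_xy, heading_deg)
            for name, xy in allo_map.items()}


# Made-up layout: the viewer stands at the origin facing +y.
layout = {"Window": (0, -3), "Wall": (4, 0)}
ego = ego_map(layout, (0, 0), 90)
```

With this layout the window falls behind the viewer and the wall to the right. Turning left by 90 degrees corresponds to adding 90 to `heading_deg`, after which the same objects land in different bins.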

Putting it together with GRPO

Answer-only rewards are too sparse, so we add the two process rewards and optimize the whole trajectory with GRPO (no separate value model):

$R = R_{\text{ans}} + R_{\text{format}} + \lambda_g R_{\text{global}} + \lambda_l R_{\text{local}}$

Since $R_{\text{global}}$ can use the VFM pseudo-target instead of an annotated map, the method also runs annotation-free and still beats strong baselines.
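
To make the combined objective concrete, the sketch below computes the total reward for each sampled trajectory and then the group-relative advantage that GRPO uses in place of a learned value model: each reward is normalized by the mean and standard deviation of its sampling group. The weights (`lam_g`, `lam_l`), the example reward values, the use of population standard deviation, and the `eps` term are assumptions for illustration. The page does not report these settings.

```python
from statistics import mean, pstdev


def total_reward(r_ans, r_format, r_global, r_local, lam_g=1.0, lam_l=1.0):
    """R = R_ans + R_format + lam_g * R_global + lam_l * R_local.
    The default weights are placeholders, not values from the paper."""
    return r_ans + r_format + lam_g * r_global + lam_l * r_local


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - mean) / (std + eps) within one group
    of trajectories sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two made-up trajectories for one question: a wrong and a right answer.
group = [
    total_reward(r_ans=0.0, r_format=1.0, r_global=0.4, r_local=0.0),
    total_reward(r_ans=1.0, r_format=1.0, r_global=0.9, r_local=1.0),
]
advantages = group_advantages(group)
```

With dense terms in the reward, two trajectories that both answer incorrectly can still receive different advantages if one has a better map or view order, which is the extra signal the process rewards are meant to provide.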

Results

Consistent Gains Over the Baseline, Across Benchmarks

MindCube-Tiny

Method                          Params  Rotation  Among  Around  Overall
Proprietary models
GPT-5                           -       94.5      38.2   68.4    56.3
Gemini-2.5-Pro                  -       85.0      49.2   58.3    59.3
Open-weight multi-image models
Qwen2.5-VL-Instruct (baseline)  3B      34.0      36.0   45.2    37.8
Spatial-MLLM                    4B      32.5      47.5   35.0    42.8
MindCube-CGMap-SFT              3B      34.5      54.2   70.8    54.4
Think3D (Qwen3-VL-RL)           4B      41.7      35.0   39.2    41.7
DR-MV3D (SFT)                   3B      42.0      66.3   69.2    62.4
DR-MV3D (SFT + GRPO)            3B      43.0      71.3   73.6    66.5

Accuracy (%). Our 3B model posts the best overall score in this comparison, ahead of both the open-weight baselines and the proprietary GPT-5 and Gemini-2.5-Pro, with a 28.7%p gain over the Qwen2.5-VL-3B baseline.

VSI-Bench

Method                             Route Plan  Rel. Dir.  Rel. Dist.  App. Order  Avg
LLaVA-OneVision-7B                 29.4        35.2       42.5        24.4        32.9
InternVL2-8B                       28.9        33.4       38.0        46.4        36.7
Qwen2.5-VL-3B-Instruct (baseline)  27.3        43.8       25.9        24.4        30.4
DR-MV3D (SFT)                      30.9        43.9       35.1        26.9        34.2
DR-MV3D (SFT + GRPO)               32.4        46.6       37.8        31.4        37.1

Accuracy (%). Dense-reward training carries over to sequential, navigation-style evaluation, raising the average by 6.7 over the 3B baseline.

BLINK (Multi-View)

Method                      BLINK (MV)
Proprietary models
Gemini-2.5-Pro              44.9
Doubao-1.5                  50.9
Open-weight models
Qwen2.5-VL-3B (baseline)    42.1
Spatial-MLLM                56.0
Think3D (Qwen3-VL-4B-T3RL)  53.4
DR-MV3D (SFT)               54.9
DR-MV3D (SFT + GRPO)        56.4

Accuracy (%). Best result among open-weight models, ahead of the 3B baseline by 14.3.

Ablation: SFT components (Overall)

Config (components: Allo., Traj., Ego.)  Overall
Vanilla Qwen2.5-VL-3B                    37.8
+ Allocentric map                        52.8
+ Trajectory                             58.2
Full SFT                                 62.4

Ablation: GRPO rewards (Overall)

Rewards (components: Global, Local)  Overall
Answer & format only                 63.8
+ Global                             64.9
+ Global + Local                     66.5

Main message: each piece helps. The global reward adds geometry, and the local reward adds the right viewing order. Together they take GRPO from 63.8 to 66.5.
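
The two ablation tables can be read as one chain of per-step gains. The sketch below does that arithmetic. It assumes the rows are cumulative and that the GRPO runs start from the Full SFT model, which the tables suggest but the page does not state outright. The row labels are lightly paraphrased from the tables.

```python
# Overall MindCube-Tiny accuracy (%) from the two ablation tables.
steps = [
    ("Vanilla Qwen2.5-VL-3B", 37.8),
    ("SFT: + Allocentric map", 52.8),
    ("SFT: + Trajectory", 58.2),
    ("Full SFT", 62.4),
    ("GRPO: answer & format only", 63.8),
    ("GRPO: + Global", 64.9),
    ("GRPO: + Global + Local", 66.5),
]


def step_gains(rows):
    """Gain of each row over the previous one, rounded to one decimal."""
    return [(name, round(score - prev, 1))
            for (_, prev), (name, score) in zip(rows, rows[1:])]


gains = step_gains(steps)
total_gain = round(steps[-1][1] - steps[0][1], 1)

for name, gain in gains:
    print(f"{name}: +{gain}")
print(f"Total: +{total_gain}")
```

Under these assumptions the step gains are 15.0, 5.4, 4.2, 1.4, 1.1, and 1.6 points, which sum to the 28.7%p improvement reported for the full model over the baseline.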

Qualitative

Reading the Right View at the Right Step

Each example shows the real multi-view input, the VFM scene, and the cognitive map. The baseline reasons itself into the wrong object; DR-MV3D selects the right view and grounds the answer in its egocentric map.

among_group646_q0_2_1 · MindCube · Among
(a) Multi-view images: view 1, view 2, view 3, view 4
(b) Scene from VFM (VFM scene reconstruction)
(c) Cognitive map with ego-directions
Question: If I am standing at the same spot and facing the same direction as shown in Image 1, what is behind me?
Choices: A. Window (GT)   B. Lots of toys   C. Wall   D. Printed glass door
MindCube-SFT: <think> … The opposite view of the first view is the third view. As “printed glass door” is observed behind “toy helicopter” in the third view, it means that in the first view, “printed glass door” is positioned behind me. … So the answer is “D. Printed glass door”. </think>
DR-MV3D (Ours): <think> (1) Selecting a view: The question uses Image 1, so we should consider the viewpoint at Image 1. (2) Egocentric layout: The asked direction is “behind”. In the EgoMap, “Window” has ego_direction=“back”, so it matches “behind”. (3) Confirming in Image 3: “Window” can be checked in Image 3. The direction decision comes from the egomap ego_direction for the asked viewpoint. (4) Therefore, the answer is “A. Window”. </think>
rotation_group005_q2_3 · MindCube · Rotation
(a) Multi-view images: view 1, view 2, view 3
(b) Scene from VFM (VFM scene reconstruction)
(c) Cognitive map with ego-directions
Question: If I am standing at the same spot and facing the same direction as shown in image 2, then I turn 90 degrees to the left, what is to my right?
Choices: A. Study table and black chair   B. Bed   C. Window (GT)
MindCube-SFT: <think> … From the perspective of image 2: ‘bed’ is in front, ‘window’ is to the right, ‘study table and black chair’ is behind. After turning 90 degrees to the left: ‘window’ is now to my right, ‘study table and black chair’ is now behind me. … So the answer is “B. Bed”. </think>
DR-MV3D (Ours): <think> (1) Selecting a view: After turning “left” from Image 2, it corresponds to the facing direction of Image 1, so we should consider the viewpoint at Image 1. (2) Egocentric layout: The asked direction is “right”. In the EgoMap, “Window” has ego_direction=“right”, so it matches “right”. (3) Confirming in Image 2: “Window” can be checked in Image 2. (4) Therefore, the answer is “C. Window”. </think>

BibTeX

@inproceedings{choi2026drmv3d,
  title     = {Dense Reward for Multi-View 3D Reasoning with Global Maps and Local Views},
  author    = {Choi, Jiho and Lee, Seonho and Park, Seojeong and Shim, Hyunjung},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026}
}