DART: Disambiguation-Aware Reasoning for Video-guided Machine Translation

Published in ACL 2026 (CCF-A)

Abstract

Video-guided Machine Translation (VMT) seeks to enhance translation quality by incorporating contextual information derived from paired short video clips. However, many VMT samples are text-sufficient, and even when visual information is needed, only minimal cues are necessary. To address these issues, we propose DART (Disambiguation-Aware Reasoning for Video-guided Machine Translation), a novel framework that uses reinforcement learning to incorporate the multimodal reasoning of multimodal large language models (MLLMs) into VMT. The model dynamically switches between text-only processing and multimodal integration, contingent on whether visual disambiguation is needed. Furthermore, we present TVRF (Translation-oriented Video Relevance Filtering), a systematic pipeline that constructs training data based on the relevance of multimodal information to translation. TVRF retains samples whose video information is translation-relevant, mitigating the training collapse caused by video-irrelevant data in conventional VMT. Experimental results show that our approach improves the utilization of multimodal information in VMT, yielding gains in both translation quality and computational efficiency.

DART


Comparison of existing LMRMs and DART
Comparison of existing LMRMs (top) and DART (bottom) for VMT. LMRMs apply verbose and inefficient reasoning to all samples, whereas DART adapts to input ambiguity, translating directly when unambiguous and selectively using multimodal cues for disambiguation. Green and red indicate correct and incorrect translations, respectively.


Overview of the TVRF framework
Overview of the TVRF framework. For each VMT data instance, TVRF determines whether video-based multimodal cues aid translation. An MLLM extracts and verbalizes multimodal cues, which are supplied to an LLM alongside the source sentence for translation, while the baseline uses the source sentence alone. Quantitative comparisons between these settings evaluate the impact of multimodal cues on translation quality.
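The filtering logic described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `verbalize_cues`, `translate`, and `quality_score` are hypothetical stand-ins for the MLLM cue-extraction call, the LLM translation call, and a translation metric (e.g., BLEU or COMET in practice).

```python
from typing import Optional

def verbalize_cues(video: str) -> str:
    # Stand-in for the MLLM call that turns a video clip into text cues.
    return f"cues({video})"

def translate(src: str, context: Optional[str] = None) -> str:
    # Stand-in for the LLM translation call, with or without cues.
    return src + (" [+cues]" if context else "")

def quality_score(hyp: str, ref: str) -> float:
    # Stand-in for a translation quality metric.
    return 1.0 if hyp == ref else 0.0

def tvrf_filter(samples, margin=0.0):
    """Keep samples where verbalized video cues improve translation quality."""
    kept = []
    for s in samples:
        cues = verbalize_cues(s["video"])
        gain = (quality_score(translate(s["src"], cues), s["ref"])
                - quality_score(translate(s["src"]), s["ref"]))
        if gain > margin:  # video information is translation-relevant
            kept.append({**s, "cues": cues, "gain": gain})
    return kept
```

The key design point is the paired comparison: each sample is translated twice (with and without cues), and only instances where cues yield a measurable quality gain survive filtering.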


Schematic of the DART training workflow
Schematic of the DART training workflow. The pipeline begins with an SFT cold start on TVRF-curated data to initialize the VMT reasoning format in the MLLM. This is followed by GRPO-based optimization to enhance reasoning depth. Crucially, we implement dual-path reward functions tailored to the presence or absence of multimodal cues, allowing the model to adaptively refine its reasoning logic.
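The dual-path reward idea can be sketched as below. This is a hypothetical illustration of the branching structure only: the function name, the grounding bonus/penalty of 0.5, and the `used_video` flag are assumptions, not the paper's actual reward design.

```python
def dual_path_reward(has_cues: bool, trans_quality: float, used_video: bool) -> float:
    """Sketch of a dual-path reward with one branch per cue condition.

    trans_quality: translation quality score in [0, 1].
    used_video: whether the response grounded its reasoning in the video.
    The 0.5 weight is purely illustrative.
    """
    if has_cues:
        # Ambiguous input: reward quality plus a bonus for visual grounding.
        return trans_quality + (0.5 if used_video else 0.0)
    # Unambiguous input: reward quality, penalize needless video reasoning.
    return trans_quality - (0.5 if used_video else 0.0)
```

Because the reward sign on video use flips with the cue condition, GRPO rollouts that reason over the video are reinforced only when disambiguation is actually needed, which is what drives the adaptive switching behavior.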


Main results
Main results on TriFine (Chinese–English general and ambiguity subsets) and VATEX. All results are averaged over three random seeds; statistical significance tests (p < 0.01) confirm the robustness of the improvements. SPS (Samples Per Second) denotes the average end-to-end inference speed. For each dataset and metric, the best score is highlighted in bold. Rows marked with * are reported from Guan et al. (2025a).


Recommended citation: DART: Disambiguation-Aware Reasoning for Video-guided Machine Translation (Guan et al., ACL 2026)
Download Paper | Code