DART: Disambiguation-Aware Reasoning for Video-guided Machine Translation

Published in ACL 2026 (CCF-A)

Abstract

Video-guided Machine Translation (VMT) seeks to enhance translation quality by incorporating contextual information derived from paired short video clips. However, many VMT samples are text-sufficient, and even when visual information is needed, only minimal cues are necessary. To address these issues, we propose DART (Disambiguation-Aware Reasoning for Video-guided Machine Translation), a novel framework that uses reinforcement learning to incorporate the multimodal reasoning of multimodal large language models (MLLMs) into VMT. The model dynamically switches between text-only processing and multimodal integration, contingent on whether visual disambiguation is needed. Furthermore, we present TVRF (Translation-oriented Video Relevance Filtering), a systematic pipeline that constructs training data based on the relevance of multimodal information to translation. TVRF retains samples whose video information is translation-relevant, mitigating the training collapse caused by video-irrelevant data in conventional VMT. Experimental results show that our approach improves the utilization of multimodal information in VMT, yielding gains in both translation quality and computational efficiency.

DART


Comparison of existing LMRMs and DART
Comparison of existing LMRMs (top) and DART (bottom) for VMT. LMRMs apply verbose and inefficient reasoning to all samples, whereas DART adapts to input ambiguity, translating directly when unambiguous and selectively using multimodal cues for disambiguation. Green and red indicate correct and incorrect translations, respectively.


Overview of the TVRF framework
Overview of the TVRF framework. For each VMT data instance, TVRF determines whether video-based multimodal cues aid translation. An MLLM extracts and verbalizes multimodal cues, which are supplied to an LLM alongside the source sentence for translation, while the baseline uses the source sentence alone. Quantitative comparisons between these settings evaluate the impact of multimodal cues on translation quality.
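The filtering logic described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `verbalize_cues`, `translate`, and `quality_score` are hypothetical stand-ins for the MLLM cue-extraction call, the LLM translation call, and a translation metric (e.g., BLEU or COMET in practice).

```python
from typing import Optional

def verbalize_cues(video: str) -> str:
    # Stand-in for the MLLM call that turns a video clip into text cues.
    return f"cues({video})"

def translate(src: str, context: Optional[str] = None) -> str:
    # Stand-in for the LLM translation call, with or without cues.
    return src + (" [+cues]" if context else "")

def quality_score(hyp: str, ref: str) -> float:
    # Stand-in for a translation quality metric.
    return 1.0 if hyp == ref else 0.0

def tvrf_filter(samples, margin=0.0):
    """Keep samples where verbalized video cues improve translation quality."""
    kept = []
    for s in samples:
        cues = verbalize_cues(s["video"])
        gain = (quality_score(translate(s["src"], cues), s["ref"])
                - quality_score(translate(s["src"]), s["ref"]))
        if gain > margin:  # video information is translation-relevant
            kept.append({**s, "cues": cues, "gain": gain})
    return kept
```

The key design point is the paired comparison: each sample is translated twice (with and without cues), and only instances where cues yield a measurable quality gain survive filtering.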


Schematic of the DART training workflow
Schematic of the DART training workflow. The pipeline begins with an SFT cold start on TVRF-curated data to initialize the VMT reasoning format in the MLLM. This is followed by GRPO-based optimization to enhance reasoning depth. Crucially, we implement dual-path reward functions tailored to the presence or absence of multimodal cues, allowing the model to adaptively refine its reasoning logic.
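The dual-path reward idea can be sketched as below. This is a hypothetical illustration of the branching structure only: the function name, the grounding bonus/penalty of 0.5, and the `used_video` flag are assumptions, not the paper's actual reward design.

```python
def dual_path_reward(has_cues: bool, trans_quality: float, used_video: bool) -> float:
    """Sketch of a dual-path reward with one branch per cue condition.

    trans_quality: translation quality score in [0, 1].
    used_video: whether the response grounded its reasoning in the video.
    The 0.5 weight is purely illustrative.
    """
    if has_cues:
        # Ambiguous input: reward quality plus a bonus for visual grounding.
        return trans_quality + (0.5 if used_video else 0.0)
    # Unambiguous input: reward quality, penalize needless video reasoning.
    return trans_quality - (0.5 if used_video else 0.0)
```

Because the reward sign on video use flips with the cue condition, GRPO rollouts that reason over the video are reinforced only when disambiguation is actually needed, which is what drives the adaptive switching behavior.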


Main results
Main results on TriFine (Chinese–English general and ambiguity subsets) and VATEX. All results are averaged over three random seeds; statistical significance tests (p < 0.01) confirm the robustness of the improvements. SPS (Samples Per Second) denotes the average end-to-end inference speed. For each dataset and metric, the best score is highlighted in bold. Rows marked with * are reported from Guan et al. (2025a).


Recommended citation: DART: Disambiguation-Aware Reasoning for Video-guided Machine Translation (Guan et al., ACL 2026)
Download Paper | Code