ROMA: Real-time Omni-Multimodal Assistant
with Interactive Streaming Understanding

1CAS Key Laboratory of AI Safety, Institute of Computing Technology, CAS
2University of Chinese Academy of Sciences 3Tsinghua University

Figure 1: ROMA's streaming understanding capabilities. It supports proactive tasks, including event alerts and narration, alongside reactive question answering.

Abstract

Recent omni-multimodal large language models show promise in unified audio, vision, and text modeling. However, streaming audio-video understanding remains challenging, as existing approaches suffer from disjointed capabilities: they typically support only an incomplete set of modalities or lack autonomous proactive monitoring.

To address this, we present ROMA, a real-time omni-multimodal assistant for unified reactive and proactive interaction. ROMA processes continuous inputs as synchronized multimodal units, aligning dense audio with discrete video frames to handle granularity mismatches. For online decision-making, we introduce a lightweight speak head that decouples response initiation from generation to ensure precise triggering without task conflict.

We train ROMA with a curated streaming dataset and a two-stage curriculum that progressively optimizes for streaming format adaptation and proactive responsiveness. To standardize the fragmented evaluation landscape, we reorganize diverse benchmarks into a unified suite covering both proactive (alert, narration) and reactive (QA) settings. Extensive experiments across 12 benchmarks demonstrate that ROMA achieves state-of-the-art performance on proactive tasks while remaining competitive in reactive settings, validating its robustness in unified real-time omni-multimodal understanding.

Demo Video

Methodology

ROMA unifies reactive answering and proactive timing over continuous inputs. The framework processes streaming signals as synchronized multimodal units, utilizing Chunked Time-aligned Multimodal RoPE (TMRoPE) for precise cross-modal alignment. By integrating a dedicated Speak Head, the model achieves robust temporal grounding and autonomous interaction control.
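
The decision loop described above can be sketched as follows. This is a minimal, hypothetical illustration, not the released implementation: the function names (encode_unit, backbone, speak_head, generate) and the 0.5 trigger threshold are assumptions for clarity. Each one-second unit is appended to the stream prefix, the speak head scores the prefix, and the LM head generates only once that score crosses the threshold.

import torch

SPEAK_THRESHOLD = 0.5  # assumed trigger probability; the paper's actual value may differ

def streaming_loop(units, encode_unit, backbone, speak_head, generate,
                   threshold=SPEAK_THRESHOLD):
    """Process synchronized one-second audio-video units and respond when triggered."""
    prefix = []                                       # running multimodal stream prefix
    for unit in units:
        prefix.append(encode_unit(unit))              # align audio + video for this second
        hidden = backbone(torch.cat(prefix, dim=0))   # hidden states over the full prefix
        p_speak = torch.sigmoid(speak_head(hidden[-1]))   # "respond now?" probability
        if p_speak.item() >= threshold:
            yield generate(hidden)                    # alert, narration, or answer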


Figure 2: Model Architecture. ROMA processes streaming inputs as aligned multimodal units. The speak head determines response timing, activating the LM head upon crossing a probability threshold.

Key Mechanisms:

  • Aligned Multimodal Units & TMRoPE: We segment continuous streams into fixed one-second intervals. Within each unit, dense audio signals and discrete video frames are synchronized via Chunked TMRoPE, which assigns cumulative positional IDs to maintain strict cross-modal correspondence along the global timeline.
  • Decoupled Speak Head: To enable proactive monitoring, we introduce a lightweight MLP module parallel to the LM head. This head explicitly predicts binary response probabilities based on the stream prefix, decoupling the decision of "when to speak" from "what to generate" to mitigate interference from generative biases. A simplified sketch of both mechanisms follows this list.
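
The sketch below illustrates both mechanisms under simplifying assumptions: 25 audio frames and 2 video frames per one-second unit, and a 256-dimensional hidden layer in the speak head. These numbers are illustrative and not taken from the paper.

import torch
import torch.nn as nn

AUDIO_FPS, VIDEO_FPS = 25, 2   # assumed frames per one-second unit (illustrative)

def chunked_temporal_ids(num_units, audio_fps=AUDIO_FPS, video_fps=VIDEO_FPS):
    """Cumulative temporal position IDs: audio and video frames falling in the same
    one-second unit share the same span on the global timeline."""
    audio_ids, video_ids = [], []
    for unit in range(num_units):
        base = unit * audio_fps                        # positions accumulate across units
        audio_ids.extend(range(base, base + audio_fps))               # dense audio frames
        step = audio_fps // video_fps
        video_ids.extend(base + i * step for i in range(video_fps))   # sparse video frames
    return torch.tensor(audio_ids), torch.tensor(video_ids)

class SpeakHead(nn.Module):
    """Lightweight MLP parallel to the LM head: maps the last hidden state of the
    stream prefix to a binary respond / stay-silent probability."""
    def __init__(self, hidden_size, inner_size=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(hidden_size, inner_size),
                                 nn.GELU(),
                                 nn.Linear(inner_size, 1))

    def forward(self, last_hidden):                    # last_hidden: (batch, hidden_size)
        return torch.sigmoid(self.mlp(last_hidden)).squeeze(-1)       # (batch,) probability

At inference, the head's output gates the LM head as in the loop sketched earlier; during training it can be supervised with a per-unit binary speak/silence label, though the paper's exact training objective is not reproduced here.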

ROMA Streaming Dataset

We constructed a comprehensive streaming dataset covering Proactive (Alert, Narration) and Reactive (QA) tasks, totaling over 676K samples.


Figure 4: Dataset Overview.

Qualitative Analysis

Proactive Interaction: Event Alerts

ROMA accurately detects short-duration and recurring events in real-time streams.


Figure: Single Alert Case. ROMA triggers precise alerts for one-time events.


Figure: Recurring Alert Case. ROMA handles recurring events effectively.

Proactive Interaction: Real-Time Narration

Compared with baseline VideoLLMs, ROMA provides more succinct and time-aligned summaries of events.


Figure: Narration Case Study. Comparison on YouCook2.

Reactive Interaction: Question Answering

In reactive QA settings, ROMA demonstrates robust omni-multimodal capability: it processes audio queries directly and maintains strong contextual understanding without relying on text transcription.


Figure: Reactive QA Case.

Quantitative Results

Proactive Tasks: Event-Driven Alert


Table: Performance on QVHighlights & Charades-STA.


Table: Single & Recurring Alert Performance.

Proactive Tasks: Real-Time Narration


Table: Streaming Narration on YouCook2 & OVO-Bench (SSR).

Reactive Tasks: QA & Streaming Understanding


Table: Reactive QA on OVO-Bench (Real-time Visual Perception & Backward Tracing).


Table: Performance on StreamingBench.


Table: Full-Modality QA on Video-MME & EgoSchema.

BibTeX

@article{roma2025,
  title={ROMA: Real-time Omni-Multimodal Assistant with Interactive Streaming Understanding},
  author={Anonymous Authors},
  journal={ACL Submission},
  year={2025}
}