TriMotion: Modality-Agnostic Camera Control
for Video Generation

Seunghyun Shin1,2* Jifei Song2 Wooseok Jeon3 Hae-Gon Jeon3† Jiankang Deng4†
1GIST    2Huawei Noah's Ark Lab    3Yonsei University    4Imperial College London
*Work done during an internship at Huawei Noah's Ark Lab    †Corresponding authors

Abstract

Camera motion control is essential for directing viewpoint changes in generative systems. However, existing methods typically condition the generation process on a single specific modality, limiting their ability to support heterogeneous user inputs. We present TriMotion, a modality-agnostic framework for camera-controlled video generation that maps video, pose, and text inputs describing the same camera trajectory into a shared motion embedding space. We build the Motion Triplet Dataset by extending a Multi-Cam Video Dataset with geometry-grounded motion descriptions derived from camera extrinsics. We further introduce a latent motion consistency objective that encourages the generated video to follow the target camera trajectory directly in latent space. Beyond standard generation, the shared space also enables sequential motion composition and cross-modal motion interpolation.

[Teaser videos, two examples. Each shows the three reference modalities (Text, Video, Pose) and the corresponding outputs: TriMotion-Text, TriMotion-Video, TriMotion-Pose.]

Example 1, text reference: "The camera starts with a steady dolly-in motion, moving forward at a consistent pace. As the sequence progresses, a smooth truck-left motion begins. Throughout the transition, the dolly-in continues at a reduced rate, blending both movements to the end of the shot."

Example 2, text reference: "The camera starts with a steady pan-right motion, rotating horizontally at a consistent pace. As the sequence progresses, this smooth, continuous movement is maintained. Throughout the duration, this singular motion is kept steadily to the end of the shot."

Motion Triplet Dataset

Training a unified cross-modal space requires synchronized multi-modal supervision. We build the Motion Triplet Dataset by extending the Multi-Cam Video Dataset (136K Unreal Engine 5 videos, 122K camera trajectories) with geometry-grounded text descriptions generated by Qwen3-4B-Instruct. Each entry provides aligned video, pose, and text for the same camera trajectory.
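To make "geometry-grounded" concrete, the sketch below derives coarse translation labels from two camera extrinsics. It is a hypothetical illustration, not the dataset pipeline: the source does not specify how extrinsics are turned into descriptions before captioning. The camera-to-world layout, the OpenCV-style axis convention, the threshold `eps`, and the function name `translation_labels` are all assumptions.

```python
import numpy as np

# Assumed convention (not stated in the source): each extrinsic is a 3x4
# camera-to-world matrix [R | t], with camera axes +x right, +y down,
# +z forward (OpenCV style).
AXIS_LABELS = [
    ("truck right", "truck left"),     # +x / -x
    ("pedestal down", "pedestal up"),  # +y / -y
    ("dolly in", "dolly out"),         # +z / -z
]

def translation_labels(c2w_first, c2w_last, eps=1e-3):
    """Name the translation components between two poses, strongest first."""
    c2w_first = np.asarray(c2w_first, dtype=float)
    c2w_last = np.asarray(c2w_last, dtype=float)
    R0, t0 = c2w_first[:, :3], c2w_first[:, 3]
    t1 = c2w_last[:, 3]
    # Express the world-space displacement in the first camera's local axes.
    d = R0.T @ (t1 - t0)
    labels = []
    for axis in np.argsort(-np.abs(d)):
        if abs(d[axis]) > eps:
            positive, negative = AXIS_LABELS[axis]
            labels.append(positive if d[axis] > 0 else negative)
    return labels or ["static"]
```

Labels of this kind could serve as structured hints for a captioning model. Rotational terms (pan, tilt, roll) would need an analogous decomposition of the relative rotation, which is omitted here.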

136K training videos (Motion Triplet Dataset)
13.6K dynamic scenes across 40 environments
122K unique camera trajectories
3 supported input modalities

Short Caption: "The camera starts with a steady Pedestal Up movement while gently tilting down, maintaining consistent motion throughout the sequence."

Long Caption: "Initially, the camera maintains a steady pedestal position above the starting point while tilting slightly downward, establishing a stable baseline. Gradually, the tilt angle decreases in a consistent manner, reducing gradually late in the sequence, indicating a slow, controlled downward motion. Throughout this progression, there is no significant change in horizontal position or dolly movement, with lateral and depth displacements remaining near zero. The movement evolves with a steady pace, maintaining constant vertical offset and minimal rotational variation, resulting in a stable final composition."

Short Caption: "The camera starts with a steady left truck movement and dolly-in motion, while gradually panning right and tilting up. As the shot progresses, the tilt and roll accelerate significantly, culminating in a sharp upward tilt and clockwise roll that dominate the final frame."

Long Caption: "Initially, the camera moves slightly left on the truck while dollying in at a steady pace, with a gentle pan to the right. As the sequence progresses, both the inward dolly motion and the rightward pan gradually grow stronger, while the sideways shift to the left remains modest and controlled. At the same time, the tilt slowly lifts upward and the roll turns clockwise, so the camera’s attitude becomes progressively more elevated and rotated. Towards the end, the camera continues to move inward at a steady rate, holding this more pronounced upward and rolled viewpoint."

Short Caption: "The camera remains stationary throughout the entire sequence."

Long Caption: "The camera remains stationary throughout the entire sequence, with no movement along the any axes. There is no change in pan, tilt, or roll during any phase of the shot. The motion profile is consistently static, maintaining a steady, unchanging position from frame zero to the final frame. No acceleration, deceleration, or transition in camera movement occurs. The tracking data confirms a completely fixed camera setup with zero displacement in all directions."

Method

TriMotion maps diverse control signals (video, pose, and text) into a single continuous motion embedding space and injects this shared representation into a latent video diffusion transformer to steer camera dynamics during generation. Three modality-specific encoders feed a unified temporal Transformer that produces the shared embedding e_m, which is injected into each DiT block via residual addition.
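The residual injection can be sketched as follows. This is a minimal illustration under stated assumptions: the per-block linear projection (`W`, `b`), the token layout `(frames, patches, dim)`, and broadcasting the per-frame embedding over all patches of that frame are not specified in the source, and `inject_motion` is a hypothetical name.

```python
import numpy as np

def inject_motion(hidden, e_m, W, b):
    """Add a projected per-frame motion embedding to DiT hidden tokens.

    hidden: (T, P, D) tokens for T frames with P patches each (assumed layout)
    e_m:    (T, C) shared motion embedding, one vector per frame
    W, b:   (C, D) and (D,) projection parameters (assumed, one set per block)
    """
    delta = e_m @ W + b                 # (T, D)
    # Residual addition, broadcast over the P patch tokens of each frame.
    return hidden + delta[:, None, :]
```

With a zero-initialized projection the block output is unchanged, a common way to add a conditioning branch without disturbing a pretrained backbone at the start of training. Whether TriMotion initializes the projection this way is not stated.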

Text Motion Encoder

Frozen T5 encoder produces contextualized tokens. N learnable motion queries cross-attend to text tokens, lifting static language into a temporal motion sequence without requiring metric values.
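A minimal single-head sketch of the query mechanism: N learnable queries attend over L text tokens and always return N outputs, so a caption of any length becomes a fixed-length motion sequence. The projection matrices, the single head, and the name `query_cross_attention` are assumptions for illustration. The T5 tokens are stood in for by an arbitrary array.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def query_cross_attention(queries, text_tokens, Wq, Wk, Wv):
    """Cross-attend N motion queries to L text tokens.

    queries:     (N, D) learnable motion queries
    text_tokens: (L, D) contextualized tokens from a frozen text encoder
    Wq, Wk, Wv:  (D, d) projection matrices (assumed shapes)
    Returns the (N, d) attended sequence and the (N, L) attention weights.
    """
    q = queries @ Wq
    k = text_tokens @ Wk
    v = text_tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return attn @ v, attn
```

Because only the queries carry a temporal index, the ordering of the output sequence is learned rather than read off the text, which is consistent with the encoder not requiring metric values.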

Video Motion Encoder

Feature aggregation from VGGT. Alternating-Attention blocks interleave frame-wise and global self-attention to aggregate multi-view 3D geometric information into camera tokens.
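The alternation between frame-wise and global self-attention can be sketched with a parameter-free attention operator. This is a simplified illustration, not VGGT's implementation: learned projections, multiple heads, normalization, and feed-forward layers are omitted, and the `(frames, tokens, dim)` layout is an assumption.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Parameter-free self-attention over the second-to-last axis of x."""
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    return softmax(scores, axis=-1) @ x

def alternating_attention(tokens, num_blocks=2):
    """tokens: (T, P, D), i.e. T frames with P tokens each."""
    T, P, D = tokens.shape
    x = tokens
    for _ in range(num_blocks):
        # Frame-wise: each frame attends only over its own P tokens.
        x = x + self_attention(x)
        # Global: all T * P tokens attend to each other across frames.
        flat = x.reshape(1, T * P, D)
        x = (flat + self_attention(flat)).reshape(T, P, D)
    return x
```

The frame-wise step keeps per-frame structure intact, while the global step is the only place where information moves between frames. Camera motion is a cross-frame quantity, so that step is the one that exposes it.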

Pose Motion Encoder

Frame-wise MLP projects flattened 3×4 camera extrinsic matrices into the shared embedding space, preserving the original trajectory structure with explicit geometric grounding.
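A frame-wise MLP over flattened extrinsics could look like the sketch below. The two-layer depth, the ReLU activation, and the hidden width are assumptions. The source specifies only a frame-wise MLP applied to flattened 3×4 matrices.

```python
import numpy as np

def pose_encoder(extrinsics, W1, b1, W2, b2):
    """Embed each camera pose independently of the other frames.

    extrinsics: (T, 3, 4) camera extrinsic matrices, one per frame
    W1, b1:     (12, H) and (H,) first-layer parameters (assumed)
    W2, b2:     (H, D) and (D,) second-layer parameters (assumed)
    Returns a (T, D) sequence of per-frame pose tokens.
    """
    T = extrinsics.shape[0]
    x = extrinsics.reshape(T, 12)        # flatten each 3x4 matrix
    h = np.maximum(x @ W1 + b1, 0.0)     # ReLU (assumed activation)
    return h @ W2 + b2
```

Since every frame is processed independently, the output keeps one token per input pose in the original order. This is what the blurb means by preserving the trajectory structure.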

The three encoders are aligned via an InfoNCE contrastive loss on global tokens, a cosine temporal synchronization loss on per-frame tokens, and an L1 pose regression loss for geometric grounding. A Latent Motion Consistency objective, computed with a Motion Embedding Predictor, regularizes the diffusion backbone entirely in latent space, without pixel-space decoding.
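The three alignment terms can be written compactly for one pair of modalities. This sketch uses common textbook forms. The symmetric InfoNCE, the temperature of 0.07, the "one minus cosine" form of the synchronization loss, and any loss weighting are assumptions, not values taken from the source.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE on global tokens a, b of shape (B, D).

    Row i of a and row i of b describe the same trajectory (positives).
    All other pairings in the batch act as negatives.
    """
    logits = l2_normalize(a) @ l2_normalize(b).T / temperature  # (B, B)

    def diagonal_cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))

def temporal_sync_loss(a, b):
    """One minus mean cosine similarity of per-frame tokens, shape (B, T, D)."""
    cos = np.sum(l2_normalize(a) * l2_normalize(b), axis=-1)
    return float(np.mean(1.0 - cos))

def pose_l1(pred, target):
    """Mean absolute error between regressed and ground-truth poses."""
    return float(np.mean(np.abs(pred - target)))
```

The global term pulls whole trajectories together across modalities and the per-frame term aligns them step by step. The pose regression ties the space to metric geometry.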

Image-to-Video Results

We compare TriMotion against CamI2V[1], MotionClone[2], and CamCloneMaster[3] in the I2V setting.

[Video comparisons, Examples 1-4. Panels per example: Source Image, CamI2V, MotionClone, CamCloneMaster, Reference Video, TriMotion-Text, TriMotion-Video, TriMotion-Pose.]

Video-to-Video Results

We compare TriMotion against TrajectoryCrafter[5], ReCamMaster[6], and CamCloneMaster[3] in the V2V setting.

[Video comparisons, Examples 1-4. Panels per example: Source Video, TrajectoryCrafter, ReCamMaster, CamCloneMaster, Reference Video, TriMotion-Text, TriMotion-Video, TriMotion-Pose.]


Motion Composition

The shared motion embedding space enables two novel composition modes. Sequential composition executes two motions one after the other. Interpolation blends two motions simultaneously throughout the clip.
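Both modes can be pictured as simple operations on per-frame embeddings of shape (T, D). The sketch below is an assumption about how such composition might be done in a shared space. The source does not give the exact procedure, and the hard switch point and linear blend weight `alpha` are illustrative choices.

```python
import numpy as np

def sequential_compose(e_a, e_b, switch=None):
    """Follow motion A up to frame `switch`, then motion B.

    e_a, e_b: (T, D) motion embeddings in the shared space
    switch:   frame index of the hand-over (defaults to the midpoint)
    """
    T = e_a.shape[0]
    switch = T // 2 if switch is None else switch
    return np.concatenate([e_a[:switch], e_b[switch:]], axis=0)

def interpolate(e_a, e_b, alpha=0.5):
    """Blend two motions at every frame. alpha=0 gives A, alpha=1 gives B."""
    return (1.0 - alpha) * e_a + alpha * e_b
```

Because all three encoders target the same space, `e_a` and `e_b` need not come from the same modality. For example, a text-derived embedding could be blended with a pose-derived one.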

(a) Sequential Motion Composition

[Video examples (10 sets). Panels per set: Motion A, A → B, Motion B.]


(b) Cross-Modal Motion Interpolation

[Video examples (10 sets). Panels per set: Motion A, A + B, Motion B.]

BibTeX

@inproceedings{shin2025trimotion,
  title     = {TriMotion: Modality-Agnostic Camera Control for Video Generation},
  author    = {Shin, Seunghyun and Song, Jifei and Jeon, Wooseok and Jeon, Hae-Gon and Deng, Jiankang},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026},
}

References

[1] He, Dejia et al., CamI2V: Camera-Controlled Image-to-Video Diffusion Model, 2024.

[2] Ling, Pengcheng et al., MotionClone: Training-Free Motion Cloning for Controllable Video Generation, 2024.

[3] Bai, Yanbo et al., CamCloneMaster: Camera Motion Clone for Video Generation, 2025.

[4] Hou, Yifeng et al., DaS: Depth-aware Synchronization for Video Generation, 2024.

[5] Yu, Mark et al., TrajectoryCrafter: Redirecting Camera Trajectory for Monocular Videos via Diffusion Models, ICCV 2025.

[6] Bai, Jianhong et al., ReCamMaster: Camera-Controlled Generative Rendering from A Single Video, ICCV 2025.