Research AIGC

A video reasoning model can look accurate on benchmarks yet still rely on static shortcuts. The real test is spatiotemporal sensitivity: if motion direction, temporal order, or event dynamics change, the answer should change for the right reason.

Our method follows a counterfactual RL view: we train on paired original / transformed videos and enforce cross-branch relational consistency with CRR. This makes shortcut policies much harder to optimize and pushes the model toward motion-grounded reasoning.

Why it matters:

Stronger motion grounding under reversal and direction-sensitive questions.
Instead of being rewarded for static shortcuts, the model is explicitly trained to react when temporal evidence changes. This pushes prediction behavior closer to real video understanding rather than answer-pattern matching.
Reward shaping without expensive trace annotation.
CRPO-style constraints supervise relational correctness across paired clips, so we get temporal sensitivity signals without hand-labeling long reasoning traces. That makes post-training substantially more scalable for large video corpora.
More trustworthy temporal reasoning for high-stakes use cases.
In robotics, surveillance, and decision support, “what changed” is often the core question. Improving sensitivity to order, direction, and event transition reduces brittle failure modes that standard benchmark accuracy can hide.

More details can be found at: https://ddz16.github.io/crpo.github.io/

When a clip is temporally reversed or motion direction changes, shortcut-heavy models tend to keep the same answer. This exposes weak temporal grounding and motivates explicit counterfactual consistency training.

CRPO runs dual-branch RL over original and counterfactual videos. Its CRR constraints require answers to change for dynamic questions and to stay the same for static ones, which improves true spatiotemporal sensitivity.
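As an illustration only (not the released implementation), the "change for dynamic, stay for static" rule can be sketched as a reward over one paired sample. The `PairedSample` fields, the `relational_reward` name, and the 0.5 weighting between accuracy and relational terms are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class PairedSample:
    """One question asked about an original clip and its counterfactual clip."""
    is_dynamic: bool             # True if the answer depends on motion or temporal order
    answer_original: str         # model answer on the original clip
    answer_counterfactual: str   # model answer on the transformed clip
    gold_original: str           # reference answer for the original clip
    gold_counterfactual: str     # reference answer for the transformed clip


def relational_reward(s: PairedSample, w_rel: float = 0.5) -> float:
    """Hypothetical reward: per-branch accuracy plus a cross-branch relation term.

    The relation term is 1 when the two answers differ on a dynamic question
    or agree on a static one, and 0 otherwise.
    """
    accuracy = (float(s.answer_original == s.gold_original)
                + float(s.answer_counterfactual == s.gold_counterfactual)) / 2
    changed = s.answer_original != s.answer_counterfactual
    relation_ok = changed if s.is_dynamic else not changed
    return (1 - w_rel) * accuracy + w_rel * float(relation_ok)


# A shortcut policy repeats "left" after the clip is reversed.
shortcut = PairedSample(True, "left", "left", "left", "right")
# A motion-grounded policy flips its answer with the clip.
grounded = PairedSample(True, "left", "right", "left", "right")

print(relational_reward(shortcut))  # 0.25
print(relational_reward(grounded))  # 1.0
```

Under this toy reward, a policy that ignores the transformation loses the whole relational term on dynamic questions, even when its answer on the original clip is correct.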

In production 3D workflows, quality is not only about geometric fidelity, but also about editability, deformation stability, and artist-friendly edge flow. That is why direct, native quad generation matters.

Our method predicts mixed triangle/quad face sequences directly with an autoregressive model, then refines topology quality via tDPO preference optimization rather than fragile post-hoc conversion.

What this unlocks:

Cleaner topology and smoother deformation for animation-ready assets.
Native quad-dominant generation preserves coherent edge flow, which directly improves rigging, subdivision, and downstream editing quality compared with tri-to-quad conversion, which often introduces artifacts.
Higher artist usability than triangle-first conversion pipelines.
Artists get meshes that are closer to production-ready structure from the start, reducing manual cleanup passes and making iteration faster in practical DCC workflows.
A direct path from point clouds to structured quad meshes.
By predicting mixed triangle/quad face sequences autoregressively and refining with tDPO, the method bridges geometric fidelity and topological regularity in one trainable pipeline.
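To make "mixed triangle/quad face sequence" concrete, here is a minimal sketch of one way such a sequence could be serialized for an autoregressive model. It assumes a fixed four-slot block per face with a padding token for triangles. The `PAD` token, the function names, and the use of vertex indices (rather than the quantized coordinates a real mesh model would likely emit) are illustrative assumptions, not the QuadGPT tokenizer.

```python
PAD = -1  # hypothetical token filling the unused fourth slot of a triangle


def encode_faces(faces):
    """Flatten triangles and quads into one token list, four slots per face."""
    tokens = []
    for face in faces:
        if len(face) not in (3, 4):
            raise ValueError(f"expected a triangle or quad, got {len(face)} vertices")
        tokens.extend(face)
        if len(face) == 3:
            tokens.append(PAD)  # pad triangles so every face has equal length
    return tokens


def decode_faces(tokens):
    """Invert encode_faces: read four slots at a time and drop the padding."""
    if len(tokens) % 4 != 0:
        raise ValueError("token count must be a multiple of 4")
    faces = []
    for i in range(0, len(tokens), 4):
        block = tokens[i:i + 4]
        faces.append(tuple(v for v in block if v != PAD))
    return faces


mesh = [(0, 1, 2, 3), (2, 3, 4)]   # one quad followed by one triangle
tokens = encode_faces(mesh)
print(tokens)                      # [0, 1, 2, 3, 2, 3, 4, -1]
print(decode_faces(tokens))        # [(0, 1, 2, 3), (2, 3, 4)]
```

A fixed block width keeps face boundaries implicit in the token position, so the model never has to predict a separate "end of face" symbol.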

More details can be found at: https://hitcslj.github.io/QuadGPT/


High-quality visual creation is not one-shot generation. It needs a system that can understand intent, reason over constraints, plan multi-step actions, and execute reliably.

Our method treats this as a native UTPC loop (Understanding–Thinking–Planning–Creation), then scales capability with PST and VRL in simulated environments for long-horizon tasks.

Why our method matters:

Less brittle than prompt-only orchestration for complex creation workflows.
Instead of stitching together many fragile prompts and tool calls, the UTPC-native loop internalizes understanding, planning, and execution in one model behavior, which improves stability on long multi-step tasks.
Better step-level coherence across image/video pipelines.
By explicitly modeling intermediate planning states, the method reduces drift between early intent and late-stage outputs, keeping style, structure, and constraints aligned throughout generation.
A stronger foundation for autonomous creative copilots.
With PST and VRL, the model can scale from simple tasks to long-horizon creation, making it more practical for real production scenarios where iterative correction and consistent intent tracking are essential.

More details can be found at: https://layjins.github.io/visioncreator/

Camera navigation alone is not enough for a truly interactive world model. Real interaction is object-centric: click an object, sketch a path, and generate coherent future frames under moving viewpoints.

Our method decomposes this into three key pieces: camera-invariant trajectory representation, non-destructive control injection, and persistent state memory for long autoregressive rollouts.

Why it matters:

Composable camera + object control in one interactive loop.
This closes a core gap in prior camera-only world models: users can now navigate viewpoints and still issue object-centric actions, which is much closer to real interactive environments.
Trajectory-faithful object manipulation under viewpoint changes.
Camera motion and object motion are disentangled, so user-drawn paths remain semantically consistent even as perspective changes, improving controllability for embodied simulation and planning.
Persistent object state after off-camera excursions for long sessions.
The model preserves world-state continuity when manipulated objects leave and re-enter view, reducing “state reset” artifacts that otherwise break long autoregressive interaction.
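One common way to make a user-drawn path independent of the camera is to lift it from pixels into world coordinates and reproject it for each new viewpoint. The sketch below assumes a pinhole camera and a known depth per path point. This is an assumption for illustration; the function names, the intrinsics tuple, and the pose convention are hypothetical and do not describe WorldCraft's actual trajectory representation.

```python
def unproject(u, v, depth, K, R, t):
    """Lift a pixel with known depth to a world-space point.

    K = (fx, fy, cx, cy) pinhole intrinsics; R is the 3x3 camera-to-world
    rotation (list of rows); t is the camera position in world coordinates.
    """
    fx, fy, cx, cy = K
    pc = ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)  # camera frame
    return tuple(sum(R[i][j] * pc[j] for j in range(3)) + t[i] for i in range(3))


def project(pw, K, R, t):
    """Project a world-space point into the image of a camera with pose (R, t)."""
    fx, fy, cx, cy = K
    d = [pw[i] - t[i] for i in range(3)]
    # The inverse of a rotation matrix is its transpose.
    pc = [sum(R[j][i] * d[j] for j in range(3)) for i in range(3)]
    return (fx * pc[0] / pc[2] + cx, fy * pc[1] / pc[2] + cy)


K = (500.0, 500.0, 320.0, 240.0)
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
cam_a = (IDENTITY, (0.0, 0.0, 0.0))
cam_b = (IDENTITY, (1.0, 0.0, 0.0))   # same orientation, moved 1 unit along x

# A path drawn in camera A's image, with an assumed depth of 2 units per point.
path_pixels = [(320.0, 240.0), (370.0, 240.0)]
path_world = [unproject(u, v, 2.0, K, *cam_a) for u, v in path_pixels]

# The same world-space path, seen from camera B.
path_in_b = [project(p, K, *cam_b) for p in path_world]
print(path_world)
print(path_in_b)
```

The world-space path stays fixed while its pixel coordinates shift with the camera, which is the property a camera-invariant representation needs.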

You can find WorldCraft at: https://nevsnev.github.io/WorldCraft/

High-resolution regional weather forecasting is hard to scale if we ignore Earth-wide dependencies. Neighbor-only boundary assumptions often miss long-range interactions, while direct high-resolution global modeling is computationally prohibitive.

Our method, STCast, addresses this with two coordinated components: SAA (Spatial-Aligned Attention) for adaptive global-regional boundary coupling, and TMoE (Temporal Mixture-of-Experts) for month-aware temporal specialization. Together, they form a unified framework evaluated on global forecasting, regional forecasting, extreme event prediction, and ensemble forecasting.

Why it matters:

Adaptive boundaries replace static regional cropping.
SAA initializes global-regional coupling with a physically informed distance prior (Great Circle distance + exponential decay), then refines it during training. This allows the model to learn which distant regions matter for a target forecast, instead of assuming only local neighbors contribute.
Temporal specialization improves seasonal generalization.
TMoE routes monthly atmospheric inputs to specialized experts using a discrete Gaussian prior over months, which better captures inter-month variability and intra-month consistency than a single shared temporal pathway.
One framework supports four operational forecasting tasks.
STCast is validated not only on global and regional deterministic forecasting, but also on extreme event prediction and ensemble forecasting, showing consistent improvements across RMSE/ACC-style quality indicators and practical weather scenarios.
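The distance prior described for SAA (Great Circle distance followed by exponential decay) can be sketched as below. This is a minimal sketch under stated assumptions: the haversine formula is one standard way to compute Great Circle distance, while the 1000 km `length_scale_km`, the row normalization, and the function names are hypothetical and not taken from the STCast code. In the method as described, a prior like this is only the initialization and is refined during training.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def great_circle_km(lat1, lon1, lat2, lon2):
    """Great Circle distance between two lat/lon points (degrees), via haversine."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def distance_prior(regional_pts, global_pts, length_scale_km=1000.0):
    """Row-normalized weights exp(-d / length_scale) from regional to global points."""
    prior = []
    for rlat, rlon in regional_pts:
        row = [math.exp(-great_circle_km(rlat, rlon, glat, glon) / length_scale_km)
               for glat, glon in global_pts]
        total = sum(row)
        prior.append([w / total for w in row])
    return prior


regional = [(30.0, 120.0)]                              # one regional grid point
global_grid = [(30.0, 121.0), (0.0, 120.0), (-30.0, -60.0)]
weights = distance_prior(regional, global_grid)[0]
print([round(w, 4) for w in weights])
```

Nearby global points receive most of the initial weight, but distant points keep a small nonzero weight, so training can raise their influence if they turn out to matter.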

You can find STCast at: https://github.com/chenhao-zju/STCast

Overview of STCast across four weather tasks. The figure contrasts prior regional strategies (neighbor cropping or direct regional training) with STCast’s Earth-aware coupling pipeline, where SAA dynamically aligns global and regional distributions, and TMoE allocates month-conditioned inputs to specialized experts. It also illustrates downstream extensions to typhoon-track-related prediction and probabilistic long-range ensemble rollout.
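The month-conditioned routing mentioned above relies on a discrete Gaussian prior over months. A minimal sketch of such a prior follows, assuming one expert per calendar month, a circular distance so that December and January are neighbors, and a hypothetical width `sigma`. None of these choices are confirmed by the source; they are stated here only to make the idea concrete.

```python
import math


def month_prior(month, num_experts=12, sigma=1.0):
    """Discrete Gaussian prior over experts, centered on the expert for `month`.

    `month` is 1..12. Distances wrap around the year (an assumption), so the
    prior for January puts equal weight on the February and December experts.
    """
    if not 1 <= month <= 12:
        raise ValueError("month must be in 1..12")
    center = (month - 1) * num_experts / 12
    weights = []
    for expert in range(num_experts):
        d = abs(expert - center)
        d = min(d, num_experts - d)          # circular distance between months
        weights.append(math.exp(-d * d / (2 * sigma * sigma)))
    total = sum(weights)
    return [w / total for w in weights]


january = month_prior(1)
print([round(w, 3) for w in january])
```

A narrow `sigma` approaches hard routing to a single month's expert, while a wider one shares inputs across adjacent months.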

🎬 CRPO: Counterfactual Motion Sense for Video Reasoners

🤖 QuadGPT: Native Quad Meshes, Autoregressively Built

🧠 VisionCreator: Understand, Plan, and Create

🕹️ WorldCraft: Object-Level Interaction Beyond Camera Control

🌦️ STCast: Adaptive Global-Regional Forecasting