Research System

Mixture-of-Experts (MoE) models keep compute cheap by firing only a few experts per token, but every expert still has to be in memory, and Mixtral 8×7B wants tens of GB. Uniform quantization is too blunt: some experts barely tolerate INT4, some tokens deserve INT8, and on edge devices the real wall is streaming experts from storage.

D²MoE is an algorithm–system co-design that mixes precision per-token, per-layer, on the fly:

Matryoshka Weight Quantization. Each expert is stored once but unfolds into nested precisions — INT2 inside INT4 inside INT8. Pull out the precision you need.
Dual routing. The usual MoE gate picks which experts handle a token; a second gate picks how precise each should be, learned jointly with the model.
Hottest-Expert-Bit-First scheduling. Reorder the I/O queue so the most-needed bits of the most-important experts arrive first, hiding I/O latency behind compute.
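
To make the nesting and the I/O ordering concrete, here is a minimal sketch, not D²MoE's actual code. It assumes unsigned 8-bit codes whose top 4 and top 2 bits double as the INT4 and INT2 codes, and a hypothetical priority rule (expert hotness times a per-bit-plane weight). The names slice_precision, dequantize, hebf_order, and the plane weights are illustrative assumptions only.

```python
import heapq


def slice_precision(code8: int, bits: int) -> int:
    """Keep the top `bits` bits of an unsigned 8-bit code (bits in {2, 4, 8})."""
    if bits not in (2, 4, 8):
        raise ValueError("bits must be 2, 4, or 8")
    return code8 >> (8 - bits)


def dequantize(code: int, bits: int, scale: float) -> float:
    """Map a `bits`-wide code back to a real value on the shared 8-bit grid."""
    shift = 8 - bits
    # Re-align to the 8-bit grid and add half a step to centre the bucket.
    centred = (code << shift) + ((1 << shift) >> 1)
    return (centred - 128) * scale


def hebf_order(hotness: dict) -> list:
    """Order (expert, bit-plane) chunks so hot experts' coarse bits load first.

    Assumed priority: hotness * plane weight. The weights are made up for
    illustration; the real scheduler may rank chunks differently.
    """
    planes = [("int2_base", 1.0), ("int4_residual", 0.5), ("int8_residual", 0.25)]
    queue = [(-(h * w), expert, plane)
             for expert, h in hotness.items()
             for plane, w in planes]
    heapq.heapify(queue)
    order = []
    while queue:
        _, expert, plane = heapq.heappop(queue)
        order.append((expert, plane))
    return order


if __name__ == "__main__":
    code = 0b10110110
    for bits in (8, 4, 2):
        sliced = slice_precision(code, bits)
        print(bits, bin(sliced), dequantize(sliced, bits, scale=0.01))
    print(hebf_order({"e0": 0.7, "e1": 0.2, "e2": 0.1}))
```

The point of the nesting is that a lower precision is a prefix of the higher one, so fetching more bits refines a weight instead of replacing it.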

Up to 3.37× higher throughput on Jetson Orin under tight memory budgets, with INT8-level accuracy and no retraining required.

A 100K-token context on Llama-3-8B with batch 8 needs ~100 GB just for the KV cache, far more than any commodity GPU holds. Offloading to host DRAM helps, but sparsity has a ceiling, and once contexts grow, the transfers themselves become the bottleneck. Half of decoding time is spent waiting for KV entries.
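
The ~100 GB figure can be checked with a few lines of arithmetic. The sketch below assumes the published Llama-3-8B shape (32 layers, 8 KV heads under grouped-query attention, head dimension 128) and 16-bit KV entries. The function name kv_cache_bytes is ours, and the precision is an assumption rather than something stated above.

```python
def kv_cache_bytes(tokens: int, batch: int, layers: int = 32, kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """Bytes needed to hold keys and values for every token in the batch."""
    # Factor 2 covers the separate key and value tensors.
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token * tokens * batch


if __name__ == "__main__":
    size = kv_cache_bytes(tokens=100_000, batch=8)
    print(f"{size / 1e9:.1f} GB")  # prints "104.9 GB"
```

Each token costs 128 KiB of KV state under these assumptions, so the total scales linearly with both context length and batch size.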

KVDrive rethinks the problem across GPU HBM, host DRAM, and SSD as one coordinated hierarchy:

Attention-based cache management. A sliding window of recently-critical KV entries stays resident on GPU; a lookahead policy evicts based on current attention scores; a per-layer-per-head allocation (solved as a knapsack) gives more cache where it actually pays off.
Elastic pipeline scheduling. Selection, fetching, and computation are disaggregated and overlapped across micro-batches — the CPU evaluates cache hits for one while the GPU runs selection for another and I/O fetches a third.
Coordinated multi-tier storage. SSD becomes a real third tier: warmed up at prefill via attention scores, laid out by semantic contiguity and layer–head for sequential I/O, and prefetched into pinned memory to hide SSD latency.
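
One way to picture the per-layer-per-head allocation is as a multiple-choice knapsack: each head picks one cache size, and the total must fit the GPU budget. The sketch below is a textbook dynamic program under that reading, not KVDrive's solver. The option lists, the "gain" numbers, and the name allocate_cache are hypothetical.

```python
def allocate_cache(options, budget):
    """Pick one (slots, gain) option per head, maximising gain within `budget`.

    options[h] is a list of (slots, gain) choices for head h.
    Returns (best_gain, slots_per_head).
    """
    neg = float("-inf")
    # best[b]: max gain over the heads processed so far using at most b slots.
    best = [0.0] * (budget + 1)
    choice = []  # choice[h][b]: option index chosen for head h at budget b
    for head_opts in options:
        new_best = [neg] * (budget + 1)
        picks = [-1] * (budget + 1)
        for b in range(budget + 1):
            for i, (slots, gain) in enumerate(head_opts):
                if slots <= b and best[b - slots] != neg:
                    cand = best[b - slots] + gain
                    if cand > new_best[b]:
                        new_best[b] = cand
                        picks[b] = i
        best = new_best
        choice.append(picks)
    if best[budget] == neg:
        raise ValueError("budget too small for any feasible allocation")
    # Walk backwards through the recorded choices to recover the allocation.
    alloc = []
    b = budget
    for h in range(len(options) - 1, -1, -1):
        slots, _ = options[h][choice[h][b]]
        alloc.append(slots)
        b -= slots
    alloc.reverse()
    return best[budget], alloc


if __name__ == "__main__":
    # Hypothetical gains: how much attention mass each cache size retains.
    heads = [
        [(1, 0.2), (2, 0.5), (4, 0.6)],
        [(1, 0.1), (2, 0.15), (4, 0.9)],
        [(1, 0.3), (2, 0.35), (4, 0.4)],
    ]
    print(allocate_cache(heads, budget=7))
```

In this toy input the second head benefits far more from a large cache than the others, so the optimum gives it 4 slots and trims the rest, which is the "more cache where it pays off" behaviour described above.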

Up to 1.74× throughput over the best KV-offloading baselines, with accuracy preserved. A 24 GB RTX 4090 running KVDrive can outperform a 96 GB H20 on standard serving by up to 3×: long context moves from datacenter to workstation.


Every device is becoming an AI device, and people are starting to do with on-device LLMs what they once did with files: customize them. One agent gets sharp at law, another at medicine, another at code, and weaker at everything else. The BitTorrent question, twenty years later: when your local agent isn't the right one, why not borrow a peer's?

The catch is that agents aren't files. The pool churns; capabilities overlap; and the best agent for a query is also the most asked-for, so naive routing creates hotspots. PPAI handles both:

Prototype-anchored matching. Queries and agents project into a shared prototype space; a cosine similarity gives the match. When an agent joins, updates, or leaves, only its capability vector gossips through the network — no retraining.
Bayesian-game scheduling. Each user keeps a belief distribution over peers' load, updated via Bayes' rule from feedback. Routing maximizes semantic fit minus an estimated Cost of Delegation — provably a potential game with bounded Price of Anarchy.
Strategic suboptimality. The 2nd- or 3rd-best agent often loses <5% accuracy, but relieves congestion enormously. A tunable β slides between "always best" and "spread the load."
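
A toy version of the routing decision, as a sketch under stated assumptions rather than PPAI's implementation: fit is cosine similarity between query and agent vectors, the Cost of Delegation is approximated by a believed probability that the peer is busy, and β is assumed to weight that cost. The two-state belief, the likelihood values, and the names cosine, update_belief, and route are all illustrative.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def update_belief(prior_busy, observed_slow,
                  p_slow_if_busy=0.8, p_slow_if_idle=0.2):
    """Bayes' rule: posterior probability that a peer is busy after one reply."""
    like_busy = p_slow_if_busy if observed_slow else 1 - p_slow_if_busy
    like_idle = p_slow_if_idle if observed_slow else 1 - p_slow_if_idle
    evidence = like_busy * prior_busy + like_idle * (1 - prior_busy)
    return like_busy * prior_busy / evidence


def route(query, agents, busy_belief, beta):
    """Pick the agent maximising semantic fit minus beta * believed load."""
    def utility(name):
        return cosine(query, agents[name]) - beta * busy_belief[name]
    return max(agents, key=utility)


if __name__ == "__main__":
    agents = {"law": [1.0, 0.0], "general": [0.8, 0.6]}
    belief = {"law": 0.9, "general": 0.1}
    query = [1.0, 0.0]
    print(route(query, agents, belief, beta=0.0))  # prints "law"
    print(route(query, agents, belief, beta=0.5))  # prints "general"
```

With β = 0 the best-matching agent always wins. Raising β makes the router accept a slightly worse match on a peer it believes is less loaded.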

+7.96% accuracy over centralized routing baselines, −16.34% latency under heavy demand. And it scales the right way: going from 50 to 1000 agents lowers latency rather than raising it.

When MoE models get big enough to need a whole cluster (Mixtral, DeepSeek, Qwen3), the experts get spread across GPUs, and every token has to find its way to the right ones via all-to-all communication. Which experts sit on which GPU turns out to matter a lot: co-locate the wrong pair and you pay communication tax forever; let one GPU host all the hot experts and it becomes a straggler for everyone else.

Existing placement strategies are either offline (compute a fixed layout from historical traces: fast, but stale when workloads shift) or online reactive (rebalance based on the last batch, always a step behind the actual demand). Both treat placement as something to catch up to. DIRECTOR flips it: predict what the next batch will route to, and place experts before that batch arrives.

Adaptive routing predictor. Two options at different fidelity points: a cascade predictor (single-layer transformers that condition each layer's prediction on the previous layer's output — <0.1% the size of the served model, 77–91% top-k accuracy) or a low-bit quantized replica (4-bit AWQ, 91–96% accuracy). One favors speed, the other fidelity.
Relaxation-based placement optimizer. The placement problem is NP-hard (reducible to Generalized Assignment) with a search space of roughly (C₆₄¹⁶)²⁷ for DeepSeekMoE-16B. DIRECTOR solves an LP relaxation under a latency budget, then uses iterative rounding to recover an integer placement — polynomial time, with a provable (1+ε) approximation guarantee.
Computation-overlapped live migration. Naively migrating experts stalls everything; naive pipelining clashes with all-to-all traffic. DIRECTOR schedules migrations strictly during compute-bound phases (attention, FFN) when the network is idle, hiding migration entirely off the critical path.
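
To get a feel for the problem size and for what "placing from predicted loads" means, here is a small sketch. It reads (C₆₄¹⁶)²⁷ as the binomial coefficient C(64, 16) raised to the 27th power, and it uses a plain greedy heuristic (heaviest expert first, onto the least-loaded GPU with a free slot) as a stand-in. This is explicitly not DIRECTOR's LP relaxation with iterative rounding, and the function names and example loads are hypothetical.

```python
import math


def search_space_digits(experts: int = 64, per_gpu: int = 16, layers: int = 27) -> int:
    """Number of decimal digits in C(experts, per_gpu) ** layers."""
    return len(str(math.comb(experts, per_gpu) ** layers))


def greedy_placement(predicted_load, num_gpus, slots_per_gpu):
    """Greedy baseline: heaviest expert first, onto the least-loaded open GPU.

    predicted_load maps expert name -> predicted token count for the next batch.
    Returns (experts_per_gpu, load_per_gpu).
    """
    if len(predicted_load) > num_gpus * slots_per_gpu:
        raise ValueError("not enough expert slots across GPUs")
    gpu_load = [0.0] * num_gpus
    gpu_experts = [[] for _ in range(num_gpus)]
    for expert in sorted(predicted_load, key=predicted_load.get, reverse=True):
        # Only GPUs with a free expert slot are candidates.
        open_gpus = [g for g in range(num_gpus)
                     if len(gpu_experts[g]) < slots_per_gpu]
        target = min(open_gpus, key=lambda g: gpu_load[g])
        gpu_experts[target].append(expert)
        gpu_load[target] += predicted_load[expert]
    return gpu_experts, gpu_load


if __name__ == "__main__":
    print(search_space_digits(), "decimal digits in the search space")
    loads = {"e0": 8, "e1": 7, "e2": 6, "e3": 5}
    print(greedy_placement(loads, num_gpus=2, slots_per_gpu=2))
```

A greedy pass like this balances compute but ignores which experts are routed to together, so it says nothing about all-to-all traffic. The placement optimizer described above is aimed at that harder, joint problem.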

11–55% lower end-to-end latency across Mixtral, DeepSeekMoE, DeepSeek-V2-Lite, and Qwen3, on both RTX 4090 and H200 clusters. The gain grows with cluster size, from 11% at 4 GPUs to 48% at 32 GPUs, because larger systems are more sensitive to a single mis-placed expert. And it beats Simulated Annealing by 14–26 percentage points, showing the LP-relaxation path finds genuinely better placements, not just faster ones.

🪆 D²MoE: Right Bits at the Right Time

📚 KVDrive: Long Context Without the Memory Wall

🤝 PPAI: Borrowing Each Other's Agents

🎬 DIRECTOR: Placing Experts Before They're Needed