Xiaomi Slashes MiMo API Prices by Up to 99%, Citing Structural Cost Advantages

Xiaomi's MiMo team announced yesterday a permanent reduction in API pricing for its MiMo-V2.5 series, with cuts of up to 99% from the original rates. The new pricing structure also drops the separate price tiers for different context window lengths.

Luo Fuli, the head of Xiaomi MiMo, took to social platform X to explain the technical rationale behind the cuts. She clarified that the headline 99% reduction applies specifically to input tokens served from cache (cache hits).
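To see how a cut that applies only to cache hits feeds through to a developer's bill, the short Python sketch below blends hit and miss prices by cache hit rate. The prices are hypothetical placeholders, not Xiaomi's published rates. Only the 99% figure comes from the announcement.

```python
def blended_input_price(miss_price: float, hit_price: float, hit_rate: float) -> float:
    """Average price per million input tokens at a given cache hit rate."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * hit_price + (1.0 - hit_rate) * miss_price


# Hypothetical placeholder prices per million tokens (NOT Xiaomi's actual rates).
MISS_PRICE = 4.00
OLD_HIT_PRICE = 1.00
NEW_HIT_PRICE = OLD_HIT_PRICE * (1 - 0.99)  # the 99% cache-hit cut from the announcement

for rate in (0.0, 0.5, 0.9):
    before = blended_input_price(MISS_PRICE, OLD_HIT_PRICE, rate)
    after = blended_input_price(MISS_PRICE, NEW_HIT_PRICE, rate)
    print(f"hit rate {rate:.0%}: {before:.3f} -> {after:.3f}")
```

Under these assumed numbers, the saving grows with the hit rate, so workloads that resend long shared prefixes would see most of the benefit.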

Technological Foundations

According to Luo, the core driver of the reduction is the inference framework's new support for a layered KV cache optimization tailored to Sliding Window Attention (SWA). Tests on the production inference engine show that this optimization increases cached-token capacity fivefold, which lowers the cost per cached token by about 80%.
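The link between the two figures is simple arithmetic: if the same memory holds five times as many cached tokens, the cost per cached token falls to one fifth, a drop of 80%. The Python sketch below shows that calculation, plus one plausible way such a gain could arise. It assumes that SWA layers keep keys and values only for the most recent window of tokens, while full-attention layers keep them for every token. The layer counts, window size, and sequence length are hypothetical, and the announcement does not describe the actual mechanism.

```python
def cost_reduction_from_capacity(multiplier: float) -> float:
    """Fractional drop in cost per cached token when the same memory
    holds `multiplier` times as many tokens."""
    if multiplier <= 0:
        raise ValueError("multiplier must be positive")
    return 1.0 - 1.0 / multiplier


def kv_slots(seq_len: int, full_layers: int, swa_layers: int, window: int) -> int:
    """KV positions stored for one sequence when SWA layers retain only
    the last `window` tokens and full-attention layers retain all tokens."""
    return full_layers * seq_len + swa_layers * min(seq_len, window)


print(f"{cost_reduction_from_capacity(5):.0%}")  # fivefold capacity, as reported

# Hypothetical configuration, for illustration only.
SEQ_LEN, FULL, SWA, WINDOW = 32_768, 1, 7, 128
uniform = kv_slots(SEQ_LEN, FULL + SWA, 0, WINDOW)  # every layer caches every token
layered = kv_slots(SEQ_LEN, FULL, SWA, WINDOW)      # SWA layers cache the window only
print(f"capacity multiplier under these assumptions: {uniform / layered:.2f}x")
```

The multiplier printed by the second part depends entirely on the assumed values and is not meant to reproduce Xiaomi's measured fivefold figure.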

In addition, overlapping cache reads across the multiple full-attention modules in the hybrid model architecture has cut costs further.

Prices for input (cache miss) and output have also been lowered by approximately 60% to 80%. Luo attributes this to the model's aggressive 1:7 ratio of full-attention layers to SWA layers. She noted that the prefill compute of the 70-layer MiMo-V2.5-Pro is roughly equivalent to that of a 10-layer Grouped Query Attention (GQA) model.
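A rough way to see why a mostly-SWA stack behaves like a much shallower full-attention model during prefill is to count query-key pairs under a causal mask. A full-attention layer scores every earlier token, so its cost grows quadratically with prompt length, while an SWA layer scores at most a fixed window per token. The Python sketch below makes that comparison. It counts attention pairs only and ignores feed-forward and projection costs, and the layer split, window size, and prompt length are hypothetical, since the announcement gives only the 1:7 ratio and the 70-layer total.

```python
def causal_pairs_full(n: int) -> int:
    """Query-key pairs in one causal full-attention layer over n tokens."""
    return n * (n + 1) // 2


def causal_pairs_swa(n: int, window: int) -> int:
    """Query-key pairs in one causal sliding-window layer: token i
    attends to at most `window` tokens, itself included."""
    w = min(n, window)
    return w * (w + 1) // 2 + (n - w) * w


def equivalent_full_layers(n: int, full_layers: int, swa_layers: int, window: int) -> float:
    """Number of full-attention layers with the same attention-pair count
    as the hybrid stack at prompt length n."""
    total = (full_layers * causal_pairs_full(n)
             + swa_layers * causal_pairs_swa(n, window))
    return total / causal_pairs_full(n)


# Hypothetical split of 70 layers at roughly 1:7, with an assumed window and prompt length.
print(equivalent_full_layers(n=131_072, full_layers=9, swa_layers=61, window=128))
```

At long prompt lengths the SWA layers contribute very little to the total, so the equivalent depth sits close to the number of full-attention layers. At short prompts, where the window covers most of the sequence, the advantage shrinks.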

Economic Viability

These architectural efficiencies mean Xiaomi's inference costs were already far below the industry average. Luo revealed that, before the adjustment, the pricing allowed for a profit margin of two to three times. The price cuts represent a strategic decision to pass these structural cost advantages directly to developers.

Even with the significantly lower API prices, Luo stated that Xiaomi's production inference engine is running near full load while the division can still essentially break even.

Industry Implications

Luo cautioned that Large Language Model (LLM) companies should not blindly lower prices, as few possess the necessary model architecture and inference optimization capabilities to avoid losses under such pressure. She expressed hope that future architectures saving on computation and KV cache, combined with superior infrastructure, would create a virtuous cycle within the industry.

She emphasized that reasonably priced, high-performance model APIs drive real, sustained, and large-scale inference demand. This demand pulls along the entire AI infrastructure chain, including chips, servers, optical modules, and data centers, and acts as a strategic pivot for the systemic revaluation of AI hardware. Ultimately, this makes computing power for training and inference pipelines cheaper and more accessible, accelerating the global evolution of Artificial General Intelligence (AGI).

Xiaomi is expected to release a detailed blog post outlining further technical specifics in the near future.
