iPhone 17 Pro Runs 400-Billion-Parameter AI Model, But at a Crawl
A developer has successfully run a 400-billion-parameter large language model on an iPhone 17 Pro using a technique called Flash-MoE, which streams data from the device's SSD to the GPU. However, the achievement comes with a major caveat: an excruciatingly slow token generation speed of just 0.6 tokens per second.
March 23, 2026 – Running a 400-billion-parameter large language model typically requires hardware with massive memory capacities, often exceeding 200GB even for compressed versions. The iPhone 17 Pro, with its 12GB of LPDDR5X RAM, seems an unlikely candidate for such a task. Yet, a developer has proven it's possible, albeit with a significant performance trade-off.
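The memory gap can be checked with simple arithmetic. The sketch below is illustrative only: it counts the weights alone, uses decimal gigabytes, and the bit widths are assumed examples, since the article does not state which compression the demonstration used.

```python
def weight_footprint_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate storage for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


PARAMS = 400e9   # 400 billion parameters, as reported
RAM_GB = 12      # iPhone 17 Pro RAM, as reported

# Bit widths are assumed examples, not details from the demonstration.
for bits in (16, 8, 4):
    size = weight_footprint_gb(PARAMS, bits)
    print(f"{bits}-bit weights: {size:.0f} GB, about {size / RAM_GB:.0f}x the phone's RAM")
```

Even at 4 bits per parameter the weights come to about 200GB, which is consistent with the figure quoted above and far beyond 12GB of RAM.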
A demonstration shared by user @anemll shows the iPhone 17 Pro running the model via an open-source project named Flash-MoE. The key to this feat lies in the project's clever workaround: it bypasses the phone's limited RAM by utilizing the device's Solid State Drive (SSD), streaming data directly to the GPU for processing.
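The general idea of reading weights from storage on demand, rather than holding them all in RAM, can be sketched in a few lines. This is not Flash-MoE's code: the file layout, chunk size, and function names are hypothetical, and a plain file read stands in for the storage-to-GPU path described above.

```python
import os
import tempfile
from array import array

CHUNK_FLOATS = 1024   # hypothetical number of float32 values per weight chunk
NUM_CHUNKS = 16       # hypothetical number of chunks in the weight file


def write_dummy_weights(path: str) -> None:
    """Write NUM_CHUNKS chunks; chunk c is filled with the value float(c)."""
    with open(path, "wb") as f:
        for c in range(NUM_CHUNKS):
            array("f", [float(c)] * CHUNK_FLOATS).tofile(f)


def read_chunk(path: str, index: int) -> array:
    """Seek to one chunk and read only that chunk into memory."""
    chunk = array("f")
    with open(path, "rb") as f:
        f.seek(index * CHUNK_FLOATS * chunk.itemsize)
        chunk.fromfile(f, CHUNK_FLOATS)
    return chunk


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "weights.bin")
    write_dummy_weights(path)

    needed = [3, 11]  # chunks required for the current step (illustrative)
    loaded = {i: read_chunk(path, i) for i in needed}

    resident_bytes = sum(len(c) * c.itemsize for c in loaded.values())
    total_bytes = os.path.getsize(path)
    print(f"resident: {resident_bytes} of {total_bytes} bytes")
```

Only the requested chunks occupy memory at any moment; the cost is that every step waits on storage reads, which is one plausible reason such a setup would be slow.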
The "MoE" in Flash-MoE stands for Mixture of Experts. This architecture is crucial: the model activates only a small subset of its 400 billion parameters to generate each token, which drastically reduces both the computation and the amount of weight data needed at any given moment.
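A toy example may help show what activating "a small subset" means in a Mixture of Experts layer. The sketch below uses a common top-k routing scheme; the number of experts, the value of k, the router scores, and the stand-in experts are all hypothetical, and the article does not describe how the model in the demonstration routes tokens.

```python
import math


def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected scores only, so the k weights sum to 1.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, experts, router_logits, k=2):
    """Combine the outputs of the k selected experts; the rest never run."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))


# Eight stand-in "experts": expert s simply multiplies its input by s.
experts = [lambda x, s=s: s * x for s in range(8)]
# Hypothetical router scores for a single token.
logits = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.3, 1.0]

routes = top_k_route(logits, k=2)
output = moe_forward(3.0, experts, logits, k=2)
print(routes, output)
```

Here only experts 1 and 4 run for this token, so the weights of the other six are never touched. In a real model the same principle means most of the parameters sit idle for any given token.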
The Catch: Speed
While the technical achievement is notable, the current user experience is far from practical. The model generates approximately 0.6 tokens per second, which works out to about 1.7 seconds per token. Because a word can span more than one token, the wait per word is often longer still.
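The waiting time follows directly from the reported rate. In the sketch below, the 0.6 tokens-per-second figure comes from the demonstration, while the tokens-per-word ratio and the reply length are assumptions chosen only for illustration.

```python
def seconds_per_token(tokens_per_second: float) -> float:
    """Time to produce one token at a steady generation rate."""
    return 1.0 / tokens_per_second


def response_time_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to produce a whole reply of num_tokens tokens."""
    return num_tokens / tokens_per_second


RATE = 0.6             # tokens per second, as reported
TOKENS_PER_WORD = 1.3  # assumption: rough rule of thumb for English text
REPLY_TOKENS = 100     # assumption: a short, paragraph-length reply

per_token = seconds_per_token(RATE)
per_word = per_token * TOKENS_PER_WORD
reply = response_time_seconds(REPLY_TOKENS, RATE)

print(f"{per_token:.2f} s per token, {per_word:.2f} s per word")
print(f"{REPLY_TOKENS}-token reply: {reply:.0f} s ({reply / 60:.1f} min)")
```

Under these assumptions, a 100-token reply would take close to three minutes to finish.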
Implications and Trade-offs
- Proof of Concept: The demonstration suggests that, with further optimization, running massive AI models locally on smartphones could become practical.
- Privacy Advantage: Running such a model locally keeps prompts and responses on the device and allows it to work without an internet connection.
- Hardware Strain: The process is extremely resource-intensive and would likely lead to severe battery drain on the iPhone 17 Pro.
The Bottom Line
As the demonstration highlights, there's a vast difference between a model "running" on a device and being "usable" in a practical, fluid sense. The Flash-MoE project on the iPhone 17 Pro is firmly in the former category: a remarkable technical showcase that underscores both the potential and the current limitations of on-device AI.