How to run an LLM on your phone
Real on-device AI on Android — no rooting, no terminal, no waiting for a cloud reply.
May 19, 2026 · 4 min read
What you'll actually get
A chat assistant that lives in your app drawer, answers in plain English, reads images, accepts voice — and never sends any of it to a server. Inference happens on the same chip that runs your camera filters.
Expect roughly conversational speed on a mid-range phone (10-25 tokens per second on Gemma 3n E4B at INT4). Faster on flagships with an NPU. Slower on the very oldest devices.
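To put those token rates in concrete terms, here is a rough sketch that converts a generation rate into waiting time. The 150-token reply length is an assumption chosen for illustration, not a measured figure, and the helper name `reply_seconds` is hypothetical.

```python
def reply_seconds(reply_tokens: int, tokens_per_second: float) -> float:
    """Time to generate a reply, ignoring any warm-up before the first token."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return reply_tokens / tokens_per_second


# Assumed reply length: 150 tokens, roughly a short paragraph or two.
REPLY_TOKENS = 150

for rate in (10, 25):  # the mid-range span quoted above
    print(f"{rate} tok/s -> {reply_seconds(REPLY_TOKENS, rate):.0f} s")
```

Under that assumption, a full reply takes about 15 seconds at the low end and about 6 seconds at the high end. Text streams in as it is generated, so reading can start well before the reply finishes.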
Hardware that works
Four things matter. The first three are listed in order of importance: RAM, chip generation, storage. The OS version is a simple pass-or-fail check.
- RAM. 6 GB is the floor. 8 GB and above feels comfortable.
- Chip. Anything from the last four years — Snapdragon 7-series and above, Tensor G2+, Dimensity 8000+, recent Exynos. Older chips work but slowly.
- Storage. 4 GB free for the model and runtime.
- OS. Android 9 or newer.
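The checklist above can be written down as a small script. This is only a sketch: the `Phone` record and `check` function are hypothetical names, the values are typed in by hand from the phone's settings screen, and the chip is left out because older chips still work, just slowly.

```python
from dataclasses import dataclass

# Thresholds taken from the list above.
MIN_RAM_GB = 6
COMFORTABLE_RAM_GB = 8
MIN_FREE_STORAGE_GB = 4
MIN_ANDROID_VERSION = 9


@dataclass
class Phone:
    """Hypothetical record, filled in by hand from the phone's settings."""
    ram_gb: float
    free_storage_gb: float
    android_version: int


def check(phone: Phone) -> list[str]:
    """Return a list of problems; an empty list means the phone meets the floor."""
    problems = []
    if phone.ram_gb < MIN_RAM_GB:
        problems.append(f"RAM {phone.ram_gb} GB is below the {MIN_RAM_GB} GB floor")
    if phone.free_storage_gb < MIN_FREE_STORAGE_GB:
        problems.append(f"only {phone.free_storage_gb} GB free, need {MIN_FREE_STORAGE_GB} GB")
    if phone.android_version < MIN_ANDROID_VERSION:
        problems.append(f"Android {phone.android_version} is older than {MIN_ANDROID_VERSION}")
    return problems


def is_comfortable(phone: Phone) -> bool:
    """True when the phone passes every check and has RAM to spare."""
    return not check(phone) and phone.ram_gb >= COMFORTABLE_RAM_GB


print(check(Phone(ram_gb=8, free_storage_gb=32, android_version=13)))
print(check(Phone(ram_gb=4, free_storage_gb=2, android_version=8)))
```

The first phone returns an empty list; the second returns three problems, one per failed requirement.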
iPhone users: on-device LLMs are coming, but App Store policy and the closed ML stack make the same install flow harder today.
The step-by-step
Using Localyze.ai as the example, but the shape is the same for any well-built on-device app:
1. Install the app. Grab the APK from the download page or the Play Store listing. The app itself is small, about 30 MB.
2. First launch downloads the model. It is roughly 4 GB, so use Wi-Fi. The download happens once, and after that you can stay offline for good.
3. Grant the permissions you actually want. Microphone for voice, photos for image input. Skip them and the rest still works.
4. Start typing. No account, no email, no onboarding wall.
That's the whole flow. There is no "connect your API key" step because there is no API.
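The one slow part of that flow is the model download. As a rough sketch, the wait can be estimated from the model size and the Wi-Fi speed. The two speeds below are assumptions for illustration, the function name `download_minutes` is hypothetical, and real downloads vary with the network and the server.

```python
def download_minutes(size_gb: float, link_mbit_per_s: float) -> float:
    """Estimate download time in minutes for a file of size_gb (decimal GB)."""
    if link_mbit_per_s <= 0:
        raise ValueError("link_mbit_per_s must be positive")
    size_megabits = size_gb * 1000 * 8  # GB -> MB -> megabits
    return size_megabits / link_mbit_per_s / 60


MODEL_GB = 4  # approximate model size from step 2

for speed in (20, 100):  # assumed Wi-Fi speeds in Mbit/s
    print(f"{speed} Mbit/s -> about {download_minutes(MODEL_GB, speed):.0f} min")
```

At these assumed speeds the download takes somewhere between about five minutes and a little under half an hour, which is why the first launch is best done on Wi-Fi.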
What to try first
Good checks to confirm everything is local:
- Turn on airplane mode and chat anyway. It still works.
- Send a screenshot and ask what it says. Image input is handled on-device.
- Hold the mic button, speak, release. Transcription is local too.
- Open battery use in Android settings after a long chat. The work shows up under the app, not the network stack.
The honest expectations
An on-device model is not a frontier cloud model. It will be slightly less clever at unusual reasoning tasks. It will not browse the web. The first token will sometimes take a second to appear while the model warms up.
In exchange, you get privacy, offline use, a one-time payment, no rate limits, no account, and no telemetry. For 90% of how people actually use AI (write this, explain that, summarise this email, what's in this photo), the trade is worth it almost every time.