The Ollama for Mobile
Altio runs an on-device AI inference service on Android. It exposes private REST and Server-Sent Events (SSE) endpoints over localhost so multiple apps can share text generation and audio transcription without sending inference data off-device.
Preview
A short Android demo showing the local service flow, client interaction, and inference running without a cloud API.
The problem
Each app that bundles its own model runtime also inherits distribution, storage, memory, and scheduling concerns.
Every app has to solve model download, versioning, and update strategy for artifacts that can reach multiple gigabytes.
The same LLM or transcription model can be stored repeatedly across apps on the same device.
Multiple embedded runtimes compete for memory and accelerator access, increasing the chance of process death or degraded UX.
Each app inherits scheduling, session isolation, crash recovery, native runtime updates, and device-specific edge cases.
The solution
One service owns model downloads, loading, scheduling, and inference. Client apps integrate through localhost HTTP and SSE.
Download and update models once, then expose them to authorized local clients through the shared service.
Keep model residency and accelerator use centralized instead of letting apps load competing copies.
Move lifecycle, scheduling, and recovery logic into one process with a stable local API boundary.
Expose the same capabilities, streaming semantics, and operational constraints to every client app.
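The idea of one process owning model residency and scheduling can be illustrated with a minimal sketch. This is not Altio's scheduler: `LoadedModel` and `SingleSlotScheduler` are hypothetical names, the "model" only echoes its prompt, and the one-model-at-a-time policy is an assumption chosen to mirror the single active session noted under Project status.

```kotlin
import java.util.concurrent.Callable
import java.util.concurrent.Executors
import java.util.concurrent.Future

// Hypothetical stand-in for a loaded model; not a real runtime type.
class LoadedModel(val name: String) {
    var closed = false
        private set

    fun run(prompt: String): String = "[$name] $prompt"

    fun close() {
        closed = true
    }
}

// Keeps at most one model resident and runs jobs strictly one at a time.
class SingleSlotScheduler(private val loader: (String) -> LoadedModel) {
    // A single worker thread serializes every load and every inference call.
    private val worker = Executors.newSingleThreadExecutor()
    private var resident: LoadedModel? = null

    // Number of times a model had to be loaded (only mutated on the worker).
    var loadCount = 0
        private set

    fun submit(modelName: String, prompt: String): Future<String> =
        worker.submit(Callable<String> {
            val current = resident
            val model = if (current != null && current.name == modelName) {
                current // reuse the resident copy instead of loading again
            } else {
                current?.close() // evict the previous model before loading
                loadCount++
                loader(modelName).also { resident = it }
            }
            model.run(prompt)
        })

    fun shutdown() {
        worker.shutdown()
    }
}

fun main() {
    val scheduler = SingleSlotScheduler { name -> LoadedModel(name) }
    println(scheduler.submit("gemma-2b-it", "first").get())
    println(scheduler.submit("gemma-2b-it", "second").get())
    println("loads: ${scheduler.loadCount}")
    scheduler.shutdown()
}
```

Two requests for the same model share one load, whereas two apps embedding their own runtime would each pay for it. A real service would also need queue limits, cancellation, and recovery, which this sketch leaves out.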
Key features
Altio binds to 127.0.0.1 and serves private REST and SSE endpoints without routing prompts, audio, or outputs through a cloud API.
The current runtime supports local text generation and MP3 transcription paths, with LiteRT powering the first backend.
Clients authenticate with a token controlled by the device owner; a local process is not trusted merely because it runs on the same device.
A background service owns model management, active port state, request handling, and operational logs.
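Two of the properties above, loopback-only binding and token-based client authentication, can be sketched with plain JVM APIs. This is an illustration under assumptions rather than Altio's implementation: the function names are hypothetical, and the `Authorization: Bearer <token>` header format is assumed from the `bearerToken` parameter in the client example.

```kotlin
import java.net.InetAddress
import java.net.ServerSocket
import java.security.MessageDigest

// Opens a listening socket bound to 127.0.0.1 only, so other hosts on the
// network cannot connect. Port 0 asks the OS for any free port; the chosen
// port is what a discovery mechanism would then publish to clients.
fun openLoopbackSocket(): ServerSocket =
    ServerSocket(0, 50, InetAddress.getByName("127.0.0.1"))

// Checks an "Authorization" header value against the expected token.
// Assumption: clients send "Bearer <token>".
fun isAuthorized(authorizationHeader: String?, expectedToken: String): Boolean {
    val prefix = "Bearer "
    if (authorizationHeader == null || !authorizationHeader.startsWith(prefix)) {
        return false
    }
    val presented = authorizationHeader.removePrefix(prefix).toByteArray(Charsets.UTF_8)
    val expected = expectedToken.toByteArray(Charsets.UTF_8)
    // MessageDigest.isEqual is used instead of == to reduce timing side channels.
    return MessageDigest.isEqual(presented, expected)
}

fun main() {
    openLoopbackSocket().use { socket ->
        println("listening on ${socket.inetAddress.hostAddress}:${socket.localPort}")
    }
    println(isAuthorized("Bearer secret", "secret"))
    println(isAuthorized("Bearer wrong", "secret"))
}
```

Binding to loopback keeps remote hosts out, but any app on the device can still reach the port, which is why a token check like this is needed on top.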
Client integration
Native SDKs are planned, but the protocol is intentionally simple: discover the loopback port, create a session, submit jobs, and stream responses.
    import android.content.Context
    import android.net.Uri

    // Port discovery: query the service's content provider for the active loopback port.
    // Returns null when the provider is unavailable or reports no row.
    fun resolvePort(context: Context): Int? {
        val uri = Uri.parse(
            "content://app.altio.service.port/port"
        )
        return context.contentResolver
            .query(uri, null, null, null, null)
            ?.use { cursor ->
                if (cursor.moveToFirst()) {
                    cursor.getInt(
                        cursor.getColumnIndexOrThrow("port")
                    )
                } else {
                    null
                }
            }
    }

    // Text generation: create a session, submit a job, and stream the tokens.
    val client = AiServiceClient(
        port = discoveredPort,
        bearerToken = token
    )
    val sessionId = client.createSession("gemma-2b-it")
    val jobId = client.generate(
        sessionId,
        "Explain gravity simply."
    )
    client.streamTokens(jobId).collect { token ->
        print(token)
    }

    // Transcription: send MP3 bytes through the same session.
    val transcript = client.transcribe(
        sessionId = sessionId,
        audio = audioMp3Bytes
    )
    println(transcript)

The :demo module exercises streaming chat, transcription flows, job monitoring, health checks, and HTTP logging against the local service.
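Because responses stream over SSE, a client without an SDK mostly needs to parse the standard `text/event-stream` format. The following is a minimal sketch of such a parser. It handles only the `event` and `data` fields; the `SseEvent` type, the `readSse` function, and the `token` event name in the sample input are hypothetical and do not come from Altio's protocol.

```kotlin
import java.io.BufferedReader
import java.io.StringReader

data class SseEvent(val event: String, val data: String)

// Reads an event-stream body line by line and calls onEvent for every
// complete event. A blank line dispatches the event collected so far.
fun readSse(reader: BufferedReader, onEvent: (SseEvent) -> Unit) {
    var eventName = "message" // default event type in the SSE format
    val data = StringBuilder()
    var hasData = false
    while (true) {
        val line = reader.readLine() ?: break // an unfinished trailing event is dropped
        when {
            line.isEmpty() -> {
                if (hasData) onEvent(SseEvent(eventName, data.toString()))
                eventName = "message"
                data.setLength(0)
                hasData = false
            }
            line.startsWith(":") -> Unit // comment line, often used as keep-alive
            else -> {
                val colon = line.indexOf(':')
                val field = if (colon < 0) line else line.substring(0, colon)
                var value = if (colon < 0) "" else line.substring(colon + 1)
                if (value.startsWith(" ")) value = value.substring(1)
                when (field) {
                    "event" -> eventName = value
                    "data" -> {
                        // Multiple data lines in one event are joined with newlines.
                        if (hasData) data.append('\n')
                        data.append(value)
                        hasData = true
                    }
                }
            }
        }
    }
}

fun main() {
    // Sample stream; the "token" event name is made up for illustration.
    val body = "event: token\ndata: Hello\n\n: keep-alive\n\ndata: world\n\n"
    readSse(BufferedReader(StringReader(body))) { event ->
        println("${event.event}: ${event.data}")
    }
}
```

In a real client the reader would wrap the HTTP response body of the streaming endpoint, and each dispatched event would be forwarded to the UI as it arrives.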
Project status
Altio is an active Android prototype with a LiteRT-LM backend, localhost HTTP/SSE surface, port discovery, demo client, and ongoing work on scheduling and runtime backends.
The LiteRT path currently assumes one active inference session loaded in memory at a time while mobile GPU/runtime support matures.
Altio is AGPL-3.0-or-later, with repository documentation covering copyleft obligations and commercial licensing options.
Contributions go through the public repository process and require the project Contributor License Agreement.