Rapid-MLX - 2-4x faster local LLM inference on Apple Silicon

MLX-native inference engine with OpenAI-compatible API. The novel piece: DeltaNet state snapshots bring prompt caching to non-trimmable architectures (Qwen3.5 hybrids), restoring RNN state in ~0.1ms. 2-5x faster TTFT, native Metal kernels, continuous batching.


Rapid-MLX is a local inference engine for Apple Silicon that exposes an OpenAI-compatible API and runs LLMs 2-4x faster than Ollama or llama.cpp on the same hardware. The headline trick - and the part worth understanding before the install steps - is DeltaNet state snapshots: a prompt-caching technique designed for hybrid RNN-attention architectures (Qwen3.5 hybrids and the like) that previously couldn't be cached at all.

Traditional prompt caching works by keeping a per-token KV cache and trimming it back to the longest shared prefix. A recurrent layer compresses the entire history into a fixed-size state, so there's no contiguous prefix to lop off - hence "non-trimmable". Rapid-MLX sidesteps this by snapshotting the RNN state at prompt boundaries and restoring it in ~0.1ms. The README calls it "the first technique to bring prompt cache to non-trimmable architectures on MLX" - not a claim Ollama or llama.cpp can match today.
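The mechanics are easy to sketch. A minimal illustration in plain Python - StateCache, model.prefill, and model.decode are hypothetical names for this sketch, not Rapid-MLX's actual API: snapshot the recurrent state after prefill, key it by the token prefix, and on a repeat request restore the snapshot instead of re-running prefill.

import hashlib

class StateCache:
    """Illustrative prompt-boundary snapshot cache (hypothetical API).

    In a hybrid RNN-attention model the recurrent state is a fixed-size
    tensor per layer, so saving it is one cheap copy and restoring it is
    little more than a pointer swap - which is where the ~0.1ms figure
    comes from.
    """

    def __init__(self):
        self._snapshots = {}

    @staticmethod
    def _key(tokens):
        return hashlib.sha256(str(tuple(tokens)).encode()).hexdigest()

    def save(self, tokens, state):
        # Snapshot taken at a prompt boundary, right after prefill.
        self._snapshots[self._key(tokens)] = state

    def restore(self, tokens):
        # Hit only if this exact token prefix was prefilled before.
        return self._snapshots.get(self._key(tokens))

def generate(model, tokens, cache):
    state = cache.restore(tokens)
    if state is None:
        state = model.prefill(tokens)  # full prompt pass, the slow path
        cache.save(tokens, state)
    return model.decode(state)  # cached path skips straight to decoding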

The numbers worth quoting

  • 2-4x speedup vs Ollama and llama.cpp on Apple Silicon
  • 2-5x faster TTFT (time-to-first-token) across architectures via state snapshots
  • ~0.1ms state restore on hybrid models
  • 607 GitHub stars at time of writing - the local-LLM-on-Mac space is crowded, and this one stands out

Why it's actually faster

Three things stack:

  • DeltaNet snapshots for hybrid RNN-attention models - the novel piece
  • Native Metal compute kernels via Apple's MLX framework, built specifically for unified memory (no Metal-shader-meets-CUDA-shaped-API impedance mismatch)
  • Continuous batching + optimized prefill chunking - standard inference-stack tricks, but tuned (sketched after this list)
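The third item is the easiest to picture in code. A hedged sketch in plain Python - CHUNK, model.prefill_chunk, and scheduler.step_decodes are assumptions for illustration, not Rapid-MLX's API: feed the prompt through in fixed-size chunks so the scheduler can interleave decode steps for other in-flight requests between chunks, which is the essence of continuous batching.

CHUNK = 512  # tokens per prefill step; the real tuning value is assumed

def chunked_prefill(model, tokens, scheduler):
    """Process a long prompt in chunks instead of one monolithic pass.

    Between chunks the scheduler runs decode steps for other in-flight
    requests, so one long prompt doesn't stall everyone else's tokens.
    """
    state = None
    for i in range(0, len(tokens), CHUNK):
        state = model.prefill_chunk(tokens[i:i + CHUNK], state)
        scheduler.step_decodes()  # interleave other requests' decoding
    return state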

If you've been running llama.cpp on a Mac because it's the default, Rapid-MLX is the first project that reads as a clear capability upgrade rather than just another alternative.

Install

Three paths, pick one:

# Homebrew
brew install raullenchai/rapid-mlx/rapid-mlx

# pip
pip install rapid-mlx

# automated installer
curl -sSfL https://raw.githubusercontent.com/raullenchai/Rapid-MLX/main/install.sh | bash

Requires Python 3.10+. Once running, point any OpenAI-compatible client at the local endpoint - Cursor, Claude Code, Aider, PydanticAI, LangChain are all called out as tested integrations.
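For a quick smoke test, the stock openai Python client works directly; a minimal sketch, with the port and model id as placeholders - check the server's startup output for the real values:

from openai import OpenAI

# Placeholder endpoint and model id; substitute what your Rapid-MLX
# server actually reports. The API key is unused for a local server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="rapid-mlx")

resp = client.chat.completions.create(
    model="qwen3.5-hybrid",  # placeholder model id
    messages=[{"role": "user", "content": "Say hello in five words."}],
)
print(resp.choices[0].message.content)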

When to reach for it

  • You're on Apple Silicon and your inference workload is large enough that 2-4x matters - long-context coding agents, repeated document processing, anything that re-prompts the same prefix.
  • You're running a Qwen3.5 hybrid or any RNN-attention model and have given up on prompt caching. Rapid-MLX is the path that gets it back.
  • You want a drop-in OpenAI endpoint for local dev without changing client code. The compatibility layer is the boring-but-correct part.

When not to

  • You're on a non-Apple machine. MLX is unified-memory-first; on a discrete-GPU box, llama.cpp or vLLM are the right calls.
  • Your bottleneck is model quality, not throughput. A faster runtime doesn't change what the model knows.
  • You need batched serving for many users. The continuous batching is solid, but production multi-tenant inference is a different problem class - look at vLLM or TGI.

Trade-offs

The DeltaNet snapshot technique is specific to hybrid architectures. For pure-attention models (most of the Qwen3 lineup, Llama, Mistral) the gains come from Metal kernels and prefill tuning - still real, but not the dramatic 5x TTFT figure.

The MLX dependency is a feature on Mac and a wall everywhere else. If your team mixes Mac and Linux dev environments, you can't standardise on Rapid-MLX without parallel infrastructure.

The OpenAI-compatible layer covers chat completions and basic streaming. If your client uses non-standard fields (function calling shape varies across providers), check that the round-trip behaves the way you expect before betting a workflow on it.
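That check is cheap to run with the same client. A sketch - the tool schema and model id are placeholders: send one tool-calling request and confirm the response carries tool_calls in the shape your client expects.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="rapid-mlx")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # throwaway tool for the round-trip test
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="qwen3.5-hybrid",  # placeholder model id
    messages=[{"role": "user", "content": "What's the weather in Oslo?"}],
    tools=tools,
)

# A populated list means the round trip works; None or a malformed shape
# means don't bet a workflow on this path yet.
print(resp.choices[0].message.tool_calls)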
