Turn spare GPU capacity into a shared inference mesh. Dense models split by layers, MoE models split by experts — automatically detected. Serve many models across machines with an OpenAI-compatible API on every node.
No coordinator, no cloud, no API keys. Machines pool their VRAM over QUIC.
Pick a model. mesh-llm downloads it, detects the best distribution strategy, starts serving, prints an invite token.
Paste the token or use --auto to discover via Nostr. The mesh auto-assigns models — pipeline split for dense, expert shard for MoE.
Every node gets localhost:9337/v1 — standard OpenAI API. Works with any tool.
Model fits on one machine? Solo mode, full speed. Too big? Dense models pipeline-split by layers across nodes. MoE models (Qwen3, GLM, Mixtral, DeepSeek) split by experts — auto-detected from GGUF metadata, zero config.
Each node gets the full trunk plus an overlapping expert shard: critical experts are replicated on every node, and the rest are distributed uniquely. Each node runs its own llama-server, so there is zero cross-node traffic during inference.
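A minimal sketch of that overlap scheme. All names are illustrative, not mesh-llm's actual code; the real assignment is derived from GGUF metadata at runtime.

```python
# Sketch: replicate critical experts on every node, round-robin the rest.
def assign_experts(experts, critical, nodes):
    shards = {node: set(critical) for node in nodes}   # every node keeps the critical set
    rest = [e for e in experts if e not in critical]
    for i, expert in enumerate(rest):
        shards[nodes[i % len(nodes)]].add(expert)      # each remaining expert lives on one node
    return shards

shards = assign_experts(
    experts=list(range(8)),   # toy MoE layer with 8 experts
    critical={0, 1},          # "hot" experts replicated everywhere
    nodes=["a", "b", "c"],
)
```

Every shard contains the critical experts, so each node can handle the common routing cases locally; only the tail experts are unique to a node.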
Different nodes serve different models. API proxy routes by model field. Nodes auto-assigned based on what's needed and what's on disk.
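In miniature, the routing step looks like this. The table contents and fallback behavior are placeholders; the real proxy also auto-assigns models on a miss.

```python
# Toy model-field routing: the request's "model" picks the upstream node.
import json

ROUTES = {  # model name -> node serving it (illustrative addresses)
    "qwen3-30b": "http://10.0.0.2:9337",
    "glm-4.5-air": "http://10.0.0.3:9337",
}

def route(raw_request: bytes) -> str:
    model = json.loads(raw_request)["model"]
    return ROUTES[model]  # real proxy would trigger auto-assignment on a miss

upstream = route(b'{"model": "qwen3-30b", "messages": []}')
```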
Unified demand map propagates across the mesh via gossip. Standby nodes promote to serve unserved or hot models. Dead hosts replaced within 60 seconds.
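The promotion rule reduces to: serve a model if there is demand and no live host is serving it. A hedged sketch, where field names and the liveness check are assumptions (the 60-second dead-host window comes from the text):

```python
# Sketch: standby node decides whether to promote itself for a model.
import time

def should_promote(model, demand, serving, last_seen, now=None, dead_after=60):
    now = now if now is not None else time.time()
    alive = {h for h in serving.get(model, {})
             if now - last_seen.get(h, 0) < dead_after}  # hosts heard from recently
    return demand.get(model, 0) > 0 and not alive

# A model with demand whose only host went silent 90s ago gets picked up.
promote = should_promote(
    "qwen3-30b",
    demand={"qwen3-30b": 3},
    serving={"qwen3-30b": {"hostA": True}},
    last_seen={"hostA": time.time() - 90},
)
```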
Publish your mesh to Nostr relays. Others find it with --auto. Smart scoring: region match, VRAM, health probe before joining.
Weights read from local GGUF files, not sent over the network. Model load: 111s → 5s. Per-token RPC round-trips: 558 → 8.
GPU nodes gossip. Clients use lightweight routing tables — zero per-client server state. Event-driven: cost proportional to topology changes, not node count.
A draft model runs locally and proposes tokens that the main model verifies in one batched pass. +38% throughput on code. Auto-detected from the catalog.
Live topology, VRAM bars, model picker, built-in chat. API-driven — everything the console shows comes from JSON endpoints.
OpenAI-compatible API on localhost:9337. Use with goose, pi, opencode, or any tool that supports custom OpenAI endpoints.
macOS Apple Silicon. One command to install, one to run.
Standard OpenAI API on localhost:9337. Works with anything.
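For example, with plain curl (the model name is whatever your mesh is serving; `qwen3-30b` here is a placeholder):

```shell
curl http://localhost:9337/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-30b",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```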
Add to ~/.pi/agent/models.json:
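The exact schema depends on your pi version, so treat the field names below as placeholders and check pi's docs; the only grounded parts are the base URL and that no API key is required:

```json
{
  "providers": {
    "mesh-llm": {
      "baseUrl": "http://localhost:9337/v1",
      "apiKey": "none",
      "models": [{ "id": "qwen3-30b" }]
    }
  }
}
```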
Claude Code uses Anthropic's API format. Use claude-code-proxy or litellm to translate.
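With litellm, that translation is a small proxy config along these lines; the model names are placeholders, and you should confirm the schema against litellm's proxy docs:

```yaml
model_list:
  - model_name: claude-3-5-sonnet-20241022   # name Claude Code requests
    litellm_params:
      model: openai/qwen3-30b                # OpenAI-format upstream on the mesh
      api_base: http://localhost:9337/v1
      api_key: "none"
```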
One binary. macOS Apple Silicon and Linux. MIT licensed.