Open source · Decentralized

Pool GPUs to run larger models

Turn spare GPU capacity into a shared inference mesh. Dense models split by layers, MoE models split by experts — automatically detected. Serve many models across machines with an OpenAI-compatible API on every node.

Try it now View on GitHub
[Diagram: example mesh topology in the mesh-llm console — a GLM-4.7-Flash host (103 GB, llama-server on :9337, 18GB/103GB used) and a GLM-4.7-Flash worker (52 GB, rpc-server, 18GB/52GB) linked over QUIC RPC, plus a solo Qwen2.5-3B node (13 GB, llama-server on :9337, 2GB/13GB); nodes exchange state via gossip.]
OpenAI-compatible API Pipeline + expert parallelism Multi-model routing Demand-aware rebalancing Nostr discovery macOS + Linux
How it works

Three commands

No coordinator, no cloud, no API keys. Machines pool their VRAM over QUIC.

1

Start a mesh

Pick a model. mesh-llm downloads it, detects the best distribution strategy, starts serving, prints an invite token.

2

Others join

Paste the token or use --auto to discover via Nostr. The mesh auto-assigns models — pipeline split for dense, expert shard for MoE.

3

Use it

Every node gets localhost:9337/v1 — standard OpenAI API. Works with any tool.

# Start a mesh with two models
mesh-llm --model Qwen2.5-32B --model GLM-4.7-Flash

# Another machine joins — auto-assigned to whichever model needs it
mesh-llm --join <token>

# Or discover public meshes and join automatically
mesh-llm --auto

# Create a shared mesh — everyone runs the same command
mesh-llm --auto --model GLM-4.7-Flash --mesh-name "poker-night"

# Route requests by model name
curl localhost:9337/v1/chat/completions -d '{"model":"GLM-4.7-Flash-Q4_K_M", ...}'
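The same request works from any OpenAI client. A minimal stdlib-only Python sketch (the model name is an example; use whatever your mesh is serving):

```python
import json
import urllib.request

# Every node exposes the same endpoint; the proxy routes by model name.
MESH_URL = "http://localhost:9337/v1/chat/completions"

def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a standard OpenAI-style chat request for the mesh proxy."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        MESH_URL, data=body, headers={"Content-Type": "application/json"}
    )

def send(req: urllib.request.Request) -> str:
    """POST the request and return the assistant's reply text."""
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```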
Features

Distributed inference that actually works

Automatic distribution

Model fits on one machine? Solo mode, full speed. Too big? Dense models pipeline-split by layers across nodes. MoE models (Qwen3, GLM, Mixtral, DeepSeek) split by experts — auto-detected from GGUF metadata, zero config.
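The decision above can be sketched as a small function. This is illustrative only, not mesh-llm's internals; in practice the MoE flag comes from GGUF metadata:

```python
from dataclasses import dataclass

@dataclass
class ModelInfo:
    size_gb: float
    is_moe: bool  # detected from GGUF metadata (expert count > 0)

def pick_strategy(model: ModelInfo, local_vram_gb: float) -> str:
    """Choose a distribution strategy for a model on this node."""
    if model.size_gb <= local_vram_gb:
        return "solo"            # fits on one machine: full speed
    # Too big for one node: split by experts if MoE, else by layers.
    return "expert-shard" if model.is_moe else "pipeline-split"
```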

🧩

MoE expert sharding

Each node gets the full trunk plus an overlapping expert shard: critical experts replicated everywhere, the rest distributed uniquely. Each node runs its own llama-server — zero cross-node traffic during inference.
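A toy sketch of the shard assignment just described (round-robin for the non-critical experts is an assumption; the real placement may differ):

```python
def assign_experts(expert_ids, critical, nodes):
    """Every node replicates the `critical` experts; the remaining
    experts are dealt round-robin so each lives on exactly one node."""
    shards = {n: set(critical) for n in nodes}
    rest = [e for e in expert_ids if e not in critical]
    for i, e in enumerate(rest):
        shards[nodes[i % len(nodes)]].add(e)
    return shards
```

Because every shard contains the full trunk and all critical experts, each node can run a complete forward pass locally for its slice of requests.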

🔀

Multi-model routing

Different nodes serve different models. API proxy routes by model field. Nodes auto-assigned based on what's needed and what's on disk.

📊

Demand-aware rebalancing

Unified demand map propagates across the mesh via gossip. Standby nodes promote to serve unserved or hot models. Dead hosts replaced within 60 seconds.
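A toy version of the promotion rule, under stated assumptions (the demand map is a model-to-score dict; liveness comes from gossip timestamps; the real protocol is richer):

```python
def rebalance(demand, served, standby, last_seen, now, dead_after=60.0):
    """Pick a standby node to promote: serve the hottest model that is
    either unserved or whose host has been silent past `dead_after`."""
    needs_host = [
        m for m in sorted(demand, key=demand.get, reverse=True)
        if m not in served or now - last_seen.get(served[m], 0) > dead_after
    ]
    if needs_host and standby:
        return standby[0], needs_host[0]  # (node to promote, model to serve)
    return None
```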

📡

Nostr discovery

Publish your mesh to Nostr relays. Others find it with --auto. Smart scoring: region match, VRAM, health probe before joining.
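The scoring could look something like this hypothetical sketch (the weights and fields are invented for illustration, not mesh-llm's actual formula):

```python
def score_mesh(mesh, my_region, healthy):
    """Rank a discovered mesh: skip unhealthy ones, prefer same-region
    meshes, break ties by free VRAM."""
    if not healthy:
        return float("-inf")  # failed the health probe: never join
    score = mesh["free_vram_gb"]
    if mesh["region"] == my_region:
        score += 100          # region match outweighs raw VRAM
    return score
```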

🚀

Zero-transfer loading

Weights read from local GGUF files, not sent over the network. Model load: 111s → 5s. Per-token RPC round-trips: 558 → 8.

📈

Scales passively

GPU nodes gossip. Clients use lightweight routing tables — zero per-client server state. Event-driven: cost proportional to topology changes, not node count.

🎯

Speculative decoding

Draft model runs locally, proposes tokens verified in one batched pass. +38% throughput on code. Auto-detected from catalog.
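The accept/reject loop at the heart of speculative decoding, as a toy sketch (`verify` stands in for the target model's batched check; real implementations also take the target's token at the first mismatch):

```python
def speculative_step(draft_tokens, verify):
    """One round: the draft model proposed `draft_tokens`; keep the
    longest prefix the target model agrees with."""
    accepted = []
    for tok in draft_tokens:
        if verify(accepted, tok):  # target agrees with the draft here
            accepted.append(tok)
        else:
            break                  # first disagreement ends the run
    return accepted
```

When the draft is usually right (as on repetitive code), most rounds accept several tokens for the cost of one target-model pass, which is where the throughput gain comes from.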

💻

Web console

Live topology, VRAM bars, model picker, built-in chat. API-driven — everything the console shows comes from JSON endpoints.

🤖

Works with agents

OpenAI-compatible API on localhost:9337. Use with goose, pi, opencode, or any tool that supports custom OpenAI endpoints.

Quick start

Install & run

macOS Apple Silicon. One command to install, one to run.

# Install (downloads ~18MB bundle)
curl -fsSL https://github.com/michaelneale/decentralized-inference/releases/latest/download/mesh-llm-aarch64-apple-darwin.tar.gz | tar xz && sudo mv mesh-bundle/* /usr/local/bin/

# Join the public mesh — instant chat, zero config
mesh-llm --auto

# Or start your own mesh with a model
mesh-llm --model GLM-4.7-Flash-Q4_K_M
Integrations

Use with coding agents

Standard OpenAI API on localhost:9337. Works with anything.

▸ goose
GOOSE_PROVIDER=openai OPENAI_API_KEY=dummy OPENAI_HOST=http://localhost:9337 \
  GOOSE_MODEL=GLM-4.7-Flash-Q4_K_M goose session
▸ pi

Add to ~/.pi/agent/models.json:

{
  "providers": {
    "mesh": {
      "api": "openai-completions",
      "apiKey": "dummy",
      "baseUrl": "http://localhost:9337/v1",
      "models": [{
        "id": "GLM-4.7-Flash-Q4_K_M",
        "name": "GLM 4.7 Flash (mesh)",
        "contextWindow": 32768, "maxTokens": 8192,
        "reasoning": false, "input": ["text"],
        "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
        "compat": { "maxTokensField": "max_tokens", "supportsDeveloperRole": false, "supportsUsageInStreaming": false }
      }]
    }
  }
}
pi --provider mesh --model GLM-4.7-Flash-Q4_K_M
▸ opencode
OPENAI_API_KEY=dummy OPENAI_BASE_URL=http://localhost:9337/v1 \
  opencode -m openai/GLM-4.7-Flash-Q4_K_M
▸ claude code (via proxy)

Claude Code uses Anthropic's API format. Use claude-code-proxy or litellm to translate.

# Start the proxy, then:
ANTHROPIC_BASE_URL=http://localhost:8082 claude
▸ curl / any OpenAI client
curl http://localhost:9337/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"GLM-4.7-Flash-Q4_K_M","messages":[{"role":"user","content":"hello"}]}'

Try it

One binary. macOS Apple Silicon and Linux. MIT licensed.

Try it now Install GitHub → Roadmap