Open source · Decentralized

Pool GPUs to run larger models

Turn spare GPU capacity into a shared inference mesh. Serve many models across machines, run models larger than any single device, and scale capacity to meet demand. OpenAI-compatible API on every node.

Try it now · View on GitHub
[mesh-llm console diagram: GLM-4.7-Flash host (103 GB, llama-server · :9337, 18GB/103GB) linked over QUIC · RPC to GLM-4.7-Flash worker (52 GB, rpc-server, 18GB/52GB); Qwen2.5-3B solo (13 GB, llama-server · :9337, 2GB/13GB) connected via gossip]
OpenAI-compatible API · Layer split across GPUs · Multi-model routing · Demand-aware rebalancing · Nostr discovery · macOS + Linux
How it works

Three commands

No coordinator, no cloud, no API keys. Machines pool their VRAM over QUIC.

1. Start a mesh

Pick a model. mesh-llm downloads it, starts serving, prints an invite token.

2. Others join

Paste the token or use --auto to discover via Nostr. The mesh auto-assigns models and splits layers by VRAM.

3. Use it

Every node gets localhost:9337/v1 — standard OpenAI API. Works with any tool.

# Start a mesh with two models
mesh-llm --model Qwen2.5-32B --model GLM-4.7-Flash

# Another machine joins — auto-assigned to whichever model needs it
mesh-llm --join <token>

# Or discover public meshes and join automatically
mesh-llm --auto

# Create a shared mesh — everyone runs the same command
mesh-llm --auto --model GLM-4.7-Flash --mesh-name "poker-night"

# Route requests by model name
curl localhost:9337/v1/chat/completions -d '{"model":"GLM-4.7-Flash-Q4_K_M", ...}'
Features

Distributed inference that actually works

Smart layer splitting

Model doesn't fit? Layers split across nodes by VRAM. Peers selected by lowest RTT first — 80ms hard cap keeps splits fast. Solo mode when a model fits on one machine.
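The selection and split steps can be sketched roughly like this (a minimal sketch with hypothetical names and data shapes; the real logic lives in the mesh-llm daemon):

```python
RTT_CAP_MS = 80  # peers slower than this are excluded from splits

def pick_peers(peers, needed_gb):
    """Greedily take the lowest-RTT peers until their VRAM covers the model."""
    usable = [p for p in peers if p["rtt_ms"] <= RTT_CAP_MS]
    usable.sort(key=lambda p: p["rtt_ms"])
    chosen, total = [], 0
    for p in usable:
        if total >= needed_gb:
            break
        chosen.append(p)
        total += p["vram_gb"]
    return chosen if total >= needed_gb else None  # None -> can't split, stay solo/standby

def split_layers(nodes, n_layers):
    """Assign contiguous layer ranges proportional to each node's VRAM."""
    total_vram = sum(n["vram_gb"] for n in nodes)
    plan, start = [], 0
    for i, n in enumerate(nodes):
        if i == len(nodes) - 1:
            count = n_layers - start  # last node absorbs rounding error
        else:
            count = round(n_layers * n["vram_gb"] / total_vram)
        plan.append((n["name"], start, start + count))
        start += count
    return plan
```

With a 103 GB host and a 52 GB worker, a 62-layer model splits roughly 41/21; a peer at 95 ms RTT never enters the plan.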

🔀

Multi-model routing

Different nodes serve different models. API proxy routes by model field. Nodes auto-assigned based on what's needed and what's on disk.

📊

Demand-aware rebalancing

Request rates tracked per model, shared via gossip. Standby nodes promote to serve hot models. Dead hosts replaced within 60 seconds.
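A minimal sketch of the promotion decision, assuming gossip delivers per-model request rates as a simple map (field names are illustrative, not mesh-llm's actual wire format):

```python
def pick_promotion(rates, serving, on_disk):
    """Decide which model a standby node should promote itself to serve.

    rates:   model -> req/s, aggregated from gossip
    serving: models currently hosted somewhere healthy in the mesh
    on_disk: models this standby node already has locally
    """
    # A model needs a host if it has demand but no live server
    # (covers both hot models and dead hosts awaiting replacement).
    unserved = {m: r for m, r in rates.items() if m not in serving and m in on_disk}
    if not unserved:
        return None
    return max(unserved, key=unserved.get)  # serve the hottest model first
```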

📡

Nostr discovery

Publish your mesh to Nostr relays. Others find it with --auto. Smart scoring: region match, VRAM, health probe before joining.
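The scoring might look like this sketch; the weights and field names are assumptions, not mesh-llm's actual values:

```python
def score(mesh, my_region):
    """Rank a discovered mesh: region match dominates, spare VRAM breaks ties."""
    s = 0
    if mesh["region"] == my_region:
        s += 100
    s += min(mesh["free_vram_gb"], 50)  # cap so VRAM never outweighs locality
    return s

def pick_mesh(candidates, my_region, probe):
    """Try candidates best-first; only join one that answers a health probe."""
    for mesh in sorted(candidates, key=lambda m: score(m, my_region), reverse=True):
        if probe(mesh):
            return mesh
    return None
```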

🚀

Zero-transfer loading

Weights read from local GGUF files, not sent over the network. Model load: 111s → 5s. Per-token RPC round-trips: 558 → 8.

📈

Scales passively

GPU nodes gossip. Clients use lightweight routing tables — zero per-client server state. Event-driven: cost proportional to topology changes, not node count.
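The event-driven routing table can be illustrated with a sketch like this (event shapes are hypothetical): the client touches its table only when a serve/drop event arrives, so cost tracks topology churn rather than mesh size.

```python
def apply_event(table, event):
    """Update a client's model -> host routing table from one topology event."""
    if event["type"] == "serve":
        table[event["model"]] = event["host"]
    elif event["type"] == "drop":
        # Only forget the mapping if it still points at the departing host.
        if table.get(event["model"]) == event["host"]:
            del table[event["model"]]
    return table
```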

🎯

Speculative decoding

Draft model runs locally, proposes tokens verified in one batched pass. +38% throughput on code. Auto-detected from catalog.
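A toy illustration of the draft/verify loop, with stand-in functions in place of real models: the draft proposes a batch of tokens, the target checks them, and the longest agreeing prefix is accepted plus the target's first correction.

```python
def speculative_step(prefix, draft, target, k=4):
    """One speculative decoding step over toy token-producing functions."""
    proposed = []
    ctx = list(prefix)
    for _ in range(k):                 # cheap sequential drafting
        t = draft(ctx)
        proposed.append(t)
        ctx.append(t)
    accepted = []
    for t in proposed:                 # verify; in practice one batched target pass
        want = target(prefix + accepted)
        if t == want:
            accepted.append(t)
        else:
            accepted.append(want)      # first mismatch: keep the target's token
            break
    return accepted
```

When draft and target agree, k tokens land for a single verify pass, which is where the throughput win comes from.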

💻

Web console

Live topology, VRAM bars, model picker, built-in chat. API-driven — everything the console shows comes from JSON endpoints.

🤖

Works with agents

OpenAI-compatible API on localhost:9337. Use with goose, pi, opencode, or any tool that supports custom OpenAI endpoints.

Quick start

Install & run

macOS Apple Silicon. One command to install, one to run.

# Install (downloads ~18MB bundle)
curl -fsSL https://raw.githubusercontent.com/michaelneale/decentralized-inference/main/install.sh | bash

# Join the public mesh — instant chat, zero config
mesh-llm --auto

# Or start your own mesh with a model
mesh-llm --model GLM-4.7-Flash-Q4_K_M
Integrations

Use with coding agents

Standard OpenAI API on localhost:9337. Works with anything.

▸ goose
GOOSE_PROVIDER=openai OPENAI_API_KEY=dummy OPENAI_HOST=http://localhost:9337 \
  GOOSE_MODEL=GLM-4.7-Flash-Q4_K_M goose session
▸ pi

Add to ~/.pi/agent/models.json:

{
  "providers": {
    "mesh": {
      "api": "openai-completions",
      "apiKey": "dummy",
      "baseUrl": "http://localhost:9337/v1",
      "models": [{
        "id": "GLM-4.7-Flash-Q4_K_M",
        "name": "GLM 4.7 Flash (mesh)",
        "contextWindow": 32768, "maxTokens": 8192,
        "reasoning": false, "input": ["text"],
        "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
        "compat": { "maxTokensField": "max_tokens", "supportsDeveloperRole": false, "supportsUsageInStreaming": false }
      }]
    }
  }
}
pi --provider mesh --model GLM-4.7-Flash-Q4_K_M
▸ opencode
OPENAI_API_KEY=dummy OPENAI_BASE_URL=http://localhost:9337/v1 \
  opencode -m openai/GLM-4.7-Flash-Q4_K_M
▸ claude code (via proxy)

Claude Code uses Anthropic's API format. Use claude-code-proxy or litellm to translate.

# Start the proxy, then:
ANTHROPIC_BASE_URL=http://localhost:8082 claude
▸ curl / any OpenAI client
curl http://localhost:9337/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"GLM-4.7-Flash-Q4_K_M","messages":[{"role":"user","content":"hello"}]}'

Try it

One binary. macOS Apple Silicon and Linux. MIT licensed.

Try it now · Install · GitHub · Roadmap