Open source · Decentralized

Pool GPUs to run larger models

Turn spare GPU capacity into a shared inference mesh. Serve many models across machines, run models larger than any single device, and scale capacity to meet demand. OpenAI-compatible API on every node.

Try it now · View on GitHub
[mesh-llm console diagram: GLM-4.7-Flash host (103 GB, llama-server · :9337, 18GB/103GB) linked over QUIC · RPC to GLM-4.7-Flash worker (52 GB, rpc-server, 18GB/52GB); Qwen2.5-3B solo (13 GB, llama-server · :9337, 2GB/13GB) connected via gossip]
OpenAI-compatible API · Layer split across GPUs · Multi-model routing · Demand-aware rebalancing · Nostr discovery · macOS + Linux
How it works

Three commands

No coordinator, no cloud, no API keys. Machines pool their VRAM over QUIC.

1. Start a mesh

Pick a model. mesh-llm downloads it, starts serving, prints an invite token.

2. Others join

Paste the token or use --auto to discover via Nostr. The mesh auto-assigns models and splits layers by VRAM.

3. Use it

Every node gets localhost:9337/v1 — standard OpenAI API. Works with any tool.

# Start a mesh with two models
mesh-llm --model Qwen2.5-32B --model GLM-4.7-Flash

# Another machine joins — auto-assigned to whichever model needs it
mesh-llm --join <token>

# Or discover public meshes and join automatically
mesh-llm --auto

# Create a shared mesh — everyone runs the same command
mesh-llm --auto --model GLM-4.7-Flash --mesh-name "poker-night"

# Route requests by model name
curl localhost:9337/v1/chat/completions -d '{"model":"GLM-4.7-Flash-Q4_K_M", ...}'
Features

Distributed inference that actually works

Smart layer splitting

Model doesn't fit? Layers split across nodes by VRAM. Peers selected by lowest RTT first — 80ms hard cap keeps splits fast. Solo mode when a model fits on one machine.
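The selection and split steps can be sketched roughly like this (a minimal sketch with hypothetical names and data shapes; the real logic lives in the mesh-llm daemon):

```python
RTT_CAP_MS = 80  # peers slower than this are excluded from splits

def pick_peers(peers, needed_gb):
    """Greedily take the lowest-RTT peers until their VRAM covers the model."""
    usable = [p for p in peers if p["rtt_ms"] <= RTT_CAP_MS]
    usable.sort(key=lambda p: p["rtt_ms"])
    chosen, total = [], 0
    for p in usable:
        if total >= needed_gb:
            break
        chosen.append(p)
        total += p["vram_gb"]
    return chosen if total >= needed_gb else None  # None -> can't split, stay solo/standby

def split_layers(nodes, n_layers):
    """Assign contiguous layer ranges proportional to each node's VRAM."""
    total_vram = sum(n["vram_gb"] for n in nodes)
    plan, start = [], 0
    for i, n in enumerate(nodes):
        if i == len(nodes) - 1:
            count = n_layers - start  # last node absorbs rounding error
        else:
            count = round(n_layers * n["vram_gb"] / total_vram)
        plan.append((n["name"], start, start + count))
        start += count
    return plan
```

With a 103 GB host and a 52 GB worker, a 62-layer model splits roughly 41/21; a peer at 95 ms RTT never enters the plan.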

🔀

Multi-model routing

Different nodes serve different models. API proxy routes by model field. Nodes auto-assigned based on what's needed and what's on disk.

📊

Demand-aware rebalancing

Request rates tracked per model, shared via gossip. Standby nodes promote to serve hot models. Dead hosts replaced within 60 seconds.
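A minimal sketch of the promotion decision, assuming gossip delivers per-model request rates as a simple map (field names are illustrative, not mesh-llm's actual wire format):

```python
def pick_promotion(rates, serving, on_disk):
    """Decide which model a standby node should promote itself to serve.

    rates:   model -> req/s, aggregated from gossip
    serving: models currently hosted somewhere healthy in the mesh
    on_disk: models this standby node already has locally
    """
    # A model needs a host if it has demand but no live server
    # (covers both hot models and dead hosts awaiting replacement).
    unserved = {m: r for m, r in rates.items() if m not in serving and m in on_disk}
    if not unserved:
        return None
    return max(unserved, key=unserved.get)  # serve the hottest model first
```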

📡

Nostr discovery

Publish your mesh to Nostr relays. Others find it with --auto. Smart scoring: region match, VRAM, health probe before joining.
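The scoring might look like this sketch; the weights and field names are assumptions, not mesh-llm's actual values:

```python
def score(mesh, my_region):
    """Rank a discovered mesh: region match dominates, spare VRAM breaks ties."""
    s = 0
    if mesh["region"] == my_region:
        s += 100
    s += min(mesh["free_vram_gb"], 50)  # cap so VRAM never outweighs locality
    return s

def pick_mesh(candidates, my_region, probe):
    """Try candidates best-first; only join one that answers a health probe."""
    for mesh in sorted(candidates, key=lambda m: score(m, my_region), reverse=True):
        if probe(mesh):
            return mesh
    return None
```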

🚀

Zero-transfer loading

Weights read from local GGUF files, not sent over the network. Model load: 111s → 5s. Per-token RPC round-trips: 558 → 8.

📈

Scales passively

GPU nodes gossip. Clients use lightweight routing tables — zero per-client server state. Event-driven: cost proportional to topology changes, not node count.
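The event-driven routing table can be illustrated with a sketch like this (event shapes are hypothetical): the client touches its table only when a serve/drop event arrives, so cost tracks topology churn rather than mesh size.

```python
def apply_event(table, event):
    """Update a client's model -> host routing table from one topology event."""
    if event["type"] == "serve":
        table[event["model"]] = event["host"]
    elif event["type"] == "drop":
        # Only forget the mapping if it still points at the departing host.
        if table.get(event["model"]) == event["host"]:
            del table[event["model"]]
    return table
```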

🎯

Speculative decoding

Draft model runs locally, proposes tokens verified in one batched pass. +38% throughput on code. Auto-detected from catalog.
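A toy illustration of the draft/verify loop, with stand-in functions in place of real models: the draft proposes a batch of tokens, the target checks them, and the longest agreeing prefix is accepted plus the target's first correction.

```python
def speculative_step(prefix, draft, target, k=4):
    """One speculative decoding step over toy token-producing functions."""
    proposed = []
    ctx = list(prefix)
    for _ in range(k):                 # cheap sequential drafting
        t = draft(ctx)
        proposed.append(t)
        ctx.append(t)
    accepted = []
    for t in proposed:                 # verify; in practice one batched target pass
        want = target(prefix + accepted)
        if t == want:
            accepted.append(t)
        else:
            accepted.append(want)      # first mismatch: keep the target's token
            break
    return accepted
```

When draft and target agree, k tokens land for a single verify pass, which is where the throughput win comes from.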

💻

Web console

Live topology, VRAM bars, model picker, built-in chat. API-driven — everything the console shows comes from JSON endpoints.

🤖

Works with agents

OpenAI-compatible API on localhost:9337. Use with goose, pi, opencode, or any tool that supports custom OpenAI endpoints.

Quick start

Install & run

macOS Apple Silicon. One command to install, one to run.

# Install (downloads ~18MB bundle)
curl -fsSL https://raw.githubusercontent.com/michaelneale/decentralized-inference/main/install.sh | bash

# Join the public mesh — instant chat, zero config
mesh-llm --auto

# Or start your own mesh with a model
mesh-llm --model GLM-4.7-Flash-Q4_K_M
Integrations

Use with coding agents

Standard OpenAI API on localhost:9337. Works with anything.

▸ goose
GOOSE_PROVIDER=openai OPENAI_API_KEY=dummy OPENAI_HOST=http://localhost:9337 \
  GOOSE_MODEL=GLM-4.7-Flash-Q4_K_M goose session
▸ pi

Add to ~/.pi/agent/models.json:

{
  "providers": {
    "mesh": {
      "api": "openai-completions",
      "apiKey": "dummy",
      "baseUrl": "http://localhost:9337/v1",
      "models": [{
        "id": "GLM-4.7-Flash-Q4_K_M",
        "name": "GLM 4.7 Flash (mesh)",
        "contextWindow": 32768, "maxTokens": 8192,
        "reasoning": false, "input": ["text"],
        "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
        "compat": { "maxTokensField": "max_tokens", "supportsDeveloperRole": false, "supportsUsageInStreaming": false }
      }]
    }
  }
}
pi --provider mesh --model GLM-4.7-Flash-Q4_K_M
▸ opencode
OPENAI_API_KEY=dummy OPENAI_BASE_URL=http://localhost:9337/v1 \
  opencode -m openai/GLM-4.7-Flash-Q4_K_M
▸ claude code (via proxy)

Claude Code uses Anthropic's API format. Use claude-code-proxy or litellm to translate.

# Start the proxy, then:
ANTHROPIC_BASE_URL=http://localhost:8082 claude
▸ curl / any OpenAI client
curl http://localhost:9337/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"GLM-4.7-Flash-Q4_K_M","messages":[{"role":"user","content":"hello"}]}'

Try it

One binary. macOS Apple Silicon and Linux. MIT licensed.

Try it now · Install · GitHub · Roadmap