A blazingly fast AI proxy gateway

Butter sits between your application and AI providers, offering a unified OpenAI-compatible API with multi-provider routing, automatic failover, and a target of under 50μs of per-request overhead. Written in Go.


Features

Everything you need to route AI traffic with confidence.

OpenAI-Compatible API

Drop-in replacement for any OpenAI SDK client. Just change the base URL.

🎯

Multi-Provider Routing

Route models to 10 providers — OpenAI, Anthropic, AWS Bedrock, Gemini, OpenRouter, Groq, Mistral, Together.ai, Fireworks, Perplexity — with priority or round-robin strategies.
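
To make the two strategies concrete, here is a minimal sketch of how a router might order providers for each request. The `Router` class and its strategy strings are hypothetical illustrations, not Butter's actual Go types or config values.

```python
from itertools import count


class Router:
    """Decide the order in which providers are tried (illustrative only)."""

    def __init__(self, providers, strategy="priority"):
        self.providers = list(providers)
        self.strategy = strategy
        self._counter = count()

    def order(self):
        if self.strategy == "priority":
            # Always try providers in their configured order.
            return list(self.providers)
        if self.strategy == "round-robin":
            # Rotate the starting provider on every call.
            start = next(self._counter) % len(self.providers)
            return self.providers[start:] + self.providers[:start]
        raise ValueError(f"unknown strategy: {self.strategy}")
```

With priority, the first provider takes all traffic until it fails; with round-robin, load is spread evenly and the remaining providers act as fallbacks.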

🔃

Streaming & SSE

Full streaming support with immediate per-chunk flush via SSE relay.

🛡️

Automatic Failover

Retries on a configurable set of status codes, with exponential backoff across providers.
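
A rough sketch of the idea, assuming a hypothetical `send(provider)` callable that returns `(status_code, body)`. The retryable status set and the delay values are placeholders chosen for illustration, not Butter's defaults.

```python
import time

# Assumed retryable statuses; in Butter this set is configurable.
RETRY_ON = {429, 500, 502, 503, 504}


def call_with_failover(providers, send, retry_on=RETRY_ON,
                       base_delay=0.1, max_delay=2.0, sleep=time.sleep):
    """Try each provider in order, backing off exponentially between attempts.

    Returns (provider, status, body) for the first non-retryable response,
    or the last retryable response if every provider failed.
    """
    last = None
    for attempt, provider in enumerate(providers):
        status, body = send(provider)
        if status not in retry_on:
            return provider, status, body
        last = (provider, status, body)
        if attempt < len(providers) - 1:
            # Delay doubles on each attempt, capped at max_delay.
            sleep(min(max_delay, base_delay * 2 ** attempt))
    return last
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting.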

🔑

Key Rotation

Weighted random key selection with per-key model allowlists.
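
As an illustration, weighted selection with an allowlist filter can be sketched as below. The key dictionaries (`key`, `weight`, `models`) are a made-up shape for this example and do not reflect Butter's config schema; here a key without a `models` list is assumed to allow every model.

```python
import random


def pick_key(keys, model, rng=random):
    """Pick one API key for `model`, weighted by each key's `weight`."""
    # Keep only keys whose allowlist is empty/absent or contains the model.
    eligible = [k for k in keys if not k.get("models") or model in k["models"]]
    if not eligible:
        raise LookupError(f"no key allowed for model {model!r}")
    weights = [k["weight"] for k in eligible]
    return rng.choices(eligible, weights=weights, k=1)[0]["key"]
```

A key with weight 3 is chosen roughly three times as often as a key with weight 1, provided both allow the requested model.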

🔌

Plugin System

Ordered hook chains (pre/post HTTP, pre/post LLM) with fail-open design. Built-in plugins for rate limiting, logging, metrics, and tracing. External plugins via WASM sandbox.
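
The fail-open behaviour can be pictured with a small sketch. The hook signature (a callable that takes and returns a request dict) and the `short_circuit` flag are assumptions made for this example; Butter's real plugin interface is defined in Go and may look different.

```python
import logging

log = logging.getLogger("plugin-chain")


def run_chain(hooks, request):
    """Run hooks in order; a failing hook is logged and skipped (fail-open)."""
    for hook in hooks:
        try:
            result = hook(request)
        except Exception:
            # Fail-open: a broken plugin must not take the request down.
            log.exception("hook %r failed; continuing", hook)
            continue
        if result is None:
            continue  # hook made no changes
        if result.get("short_circuit"):
            return result  # e.g. a rate limiter rejecting the request
        request = result
    return request
```

The trade-off of fail-open is availability over enforcement: a crashing plugin never blocks traffic, but it also stops applying its checks until it is fixed.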

🚷

Rate Limiting

Built-in token bucket rate limiter with global or per-IP modes. Plugins can short-circuit requests before they reach providers.
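
For readers unfamiliar with the technique, here is a compact token bucket with a global and a per-IP mode. Class names, the `mode` strings, and the parameters are illustrative assumptions, not Butter's configuration.

```python
import time


class TokenBucket:
    def __init__(self, rate, burst, now=time.monotonic):
        self.rate = rate            # tokens added per second
        self.burst = burst          # bucket capacity
        self.now = now
        self.tokens = float(burst)  # start full
        self.updated = now()

    def allow(self):
        t = self.now()
        # Refill for the elapsed time, never above capacity.
        self.tokens = min(self.burst, self.tokens + (t - self.updated) * self.rate)
        self.updated = t
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class RateLimiter:
    """mode="global" shares one bucket; mode="per-ip" keeps one per client IP."""

    def __init__(self, rate, burst, mode="global", now=time.monotonic):
        self.rate, self.burst, self.mode, self.now = rate, burst, mode, now
        self.buckets = {}

    def allow(self, client_ip):
        key = client_ip if self.mode == "per-ip" else "*"
        bucket = self.buckets.get(key)
        if bucket is None:
            bucket = self.buckets[key] = TokenBucket(self.rate, self.burst, self.now)
        return bucket.allow()
```

When `allow` returns `False`, a plugin would short-circuit the request (typically with HTTP 429) instead of forwarding it to a provider.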

📊

Prometheus Metrics

Built-in metrics plugin powered by OTel SDK — request totals, latency histograms, and error counters exposed at /metrics.

Response Caching

In-memory LRU or Redis cache with TTL for deterministic requests (temperature=0, non-streaming). Reduces costs and latency for repeated queries.
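
The sketch below shows one way the in-memory variant could work: a cacheability check, a stable key derived from the request body, and an LRU with per-entry expiry. It assumes that only requests with an explicit `temperature` of 0 are cached; the function and class names are hypothetical.

```python
import hashlib
import json
import time
from collections import OrderedDict


def cache_key(request):
    """Return a stable key for deterministic requests, or None if not cacheable."""
    if request.get("stream") or request.get("temperature") != 0:
        return None
    payload = json.dumps(request, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode()).hexdigest()


class LRUCache:
    def __init__(self, max_entries, ttl, now=time.monotonic):
        self.max_entries, self.ttl, self.now = max_entries, ttl, now
        self.items = OrderedDict()  # key -> (expires_at, value)

    def get(self, key):
        entry = self.items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.now() >= expires_at:
            del self.items[key]      # expired: drop and report a miss
            return None
        self.items.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key, value):
        self.items[key] = (self.now() + self.ttl, value)
        self.items.move_to_end(key)
        if len(self.items) > self.max_entries:
            self.items.popitem(last=False)  # evict least recently used
```

Sorting the JSON keys means two requests that differ only in field order map to the same cache entry.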

🧰

WASM Plugin Sandbox

Load external .wasm plugins via Extism/wazero — pure Go, no CGo, full sandbox isolation. Write plugins in Go, Rust, TypeScript, Python, and more.

🔍

Distributed Tracing

Built-in OpenTelemetry tracing plugin with OTLP HTTP export. Zero overhead when unconfigured.

🔄

Config Hot-Reload

Routing and key configuration reloads automatically on file change — no restart, no dropped connections.

🔑

Application Keys

Vend btr_ tokens for usage tracking and attribution. Per-key request and token counters with optional require_key enforcement.

🛡️

Prompt Injection Guard

WASM plugin scanning for ~60 injection patterns across 7 categories with Unicode bypass detection. Block, log, or tag modes.
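
To give a feel for pattern scanning with Unicode bypass detection, here is a toy version. The two patterns and their category names are invented for illustration and are not taken from Butter's pattern list, and the real guard runs as a WASM plugin rather than Python.

```python
import re
import unicodedata

# Two illustrative patterns; the real plugin ships roughly 60.
PATTERNS = {
    "instruction_override": re.compile(r"ignore (all )?(previous|prior) instructions", re.I),
    "prompt_leak": re.compile(r"(reveal|print) (your )?system prompt", re.I),
}

# Zero-width characters often used to split trigger words; mapped to None (deleted).
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))


def normalize(text):
    # NFKC folds fullwidth and other compatibility forms to plain characters.
    return unicodedata.normalize("NFKC", text).translate(ZERO_WIDTH)


def scan(text):
    clean = normalize(text)
    return [name for name, pattern in PATTERNS.items() if pattern.search(clean)]


def guard(text, mode="block"):
    hits = scan(text)
    if hits and mode == "block":
        return {"action": "reject", "hits": hits}
    # "log" and "tag" let the request through; the caller records or annotates hits.
    return {"action": "allow", "hits": hits}
```

Normalizing before matching is what catches inputs that look identical to a human but would slip past a plain regex.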

📑

Embeddings & Models

Full /v1/embeddings and /v1/models endpoints for SDK compatibility. Embeddings route through the same failover logic as chat.

🔐

Credential Passthrough

Per-provider credential_mode: stored injects managed keys; passthrough forwards the client's own auth headers unchanged.
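
The difference between the two modes can be sketched as a header-building step. The `outbound_headers` helper is hypothetical, and it assumes a Bearer-style `Authorization` header; some providers use other header names.

```python
def outbound_headers(mode, client_headers, managed_key):
    """Build the auth headers sent upstream for one provider (illustrative)."""
    if mode == "stored":
        # Drop whatever the client sent and inject the gateway-managed key.
        headers = {k: v for k, v in client_headers.items()
                   if k.lower() != "authorization"}
        headers["Authorization"] = f"Bearer {managed_key}"
        return headers
    if mode == "passthrough":
        # Forward the client's own credentials untouched.
        return dict(client_headers)
    raise ValueError(f"unknown credential mode: {mode}")
```

In stored mode the client's `api_key` value is irrelevant, which is why the SDK examples can pass a dummy value; in passthrough mode the client must supply a real provider key.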

Coming Soon

More Providers

Azure OpenAI, Vertex AI, and more to match full Bifrost coverage.

Quick Start

Up and running in under a minute.

Install

Download the latest binary from GitHub Releases, or build from source:

git clone https://github.com/temikus/butter.git
cd butter
go build -o pkg/bin/butter ./cmd/butter/

Configure

cp config.example.yaml config.yaml
export OPENAI_API_KEY="sk-..."
export OPENROUTER_API_KEY="sk-or-v1-..."

Run

./pkg/bin/butter -config config.yaml
# {"level":"INFO","msg":"butter listening","address":":8080"}

Send a request

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [{"role": "user", "content": "Say hello!"}]
  }'

Drop-in SDK Replacement

Works with any OpenAI-compatible client. Just change the base URL.

Python

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="unused",  # Butter uses its own configured keys
)

response = client.chat.completions.create(
    model="openai/gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)

Node.js

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:8080/v1",
  apiKey: "unused",
});

const completion = await client.chat.completions.create({
  model: "openai/gpt-4o-mini",
  messages: [{ role: "user", content: "Hello!" }],
});
console.log(completion.choices[0].message.content);

Architecture

Minimal layers, maximum throughput.

Your App ──▶ Butter ──▶ OpenAI / Anthropic / Gemini / Groq / Mistral / ...
               │
               ├── Unified OpenAI-compatible API
               ├── Automatic failover & retries
               ├── Weighted key rotation
               ├── Plugin hooks (pre/post HTTP, LLM, streaming)
               └── WASM plugin sandbox (Extism/wazero)

Request Flow:

Client → transport.Server (HTTP)
       → Plugin Chain (built-in + WASM pre-hooks)
       → proxy.Engine (routing/dispatch)
       → Response Cache (LRU, TTL)
       → provider.Registry → Provider impl
       → Plugin Chain (built-in + WASM post-hooks)
       → Response

Performance Targets

Engineered for negligible overhead.

Metric                                     Target
Per-request overhead (no plugins)          <50μs
Per-request overhead (built-in plugins)    <100μs
Per-request overhead (1 WASM plugin)       <150μs
Streaming TTFB overhead                    <1ms
Memory at idle                             <30MB