# llama-mcp

MCP server for using local LLMs with MCP-client coding agents such as Claude Code.

Bridges MCP clients (Claude Code, Cursor, etc.) with any local model served by llama-server or Ollama. Offload bulk generation tasks to your local GPU and save on API costs.

Works with any model supported by llama.cpp: Gemma, LLaMA, Mistral, Qwen, Phi, and more.
## Prerequisites

- Node.js 18+
- A local LLM backend, either:
  - llama-server (llama.cpp)
  - Ollama
```sh
# llama-server examples
llama-server -m gemma-4-12b-it-Q6_K.gguf --port 8080
llama-server -m llama-3.1-8b-instruct.gguf --port 8080
llama-server -m mistral-7b-instruct-v0.3.Q5_K_M.gguf --port 8080

# Ollama
ollama serve
ollama pull llama3.1
```
## Setup

```sh
npm install
```

## Usage

```sh
npm start
# or
node server.js
```
## Claude Code

Register via the CLI:

```sh
claude mcp add -s user llama-mcp node /path/to/llama-mcp/server.js
```

Or add it manually to your MCP config (`~/.claude/claude_desktop_config.json` or a project-level `.mcp.json`).
With llama-server (default):

```json
{
  "mcpServers": {
    "llama-mcp": {
      "command": "node",
      "args": ["/path/to/llama-mcp/server.js"],
      "env": {
        "MODEL_NAME": "gemma4"
      }
    }
  }
}
```
With Ollama:

```json
{
  "mcpServers": {
    "llama-mcp": {
      "command": "node",
      "args": ["/path/to/llama-mcp/server.js"],
      "env": {
        "LLAMA_SERVER_URL": "http://localhost:11434",
        "MODEL_NAME": "llama3.1"
      }
    }
  }
}
```
## Tools

| Tool | Purpose | Temperature | Max Tokens |
|------|---------|-------------|------------|
| `llm_generate` | General text/code generation | 0.7 | 2048 |
| `llm_code` | Production-ready code generation | 0.2 | 4096 |
| `llm_review` | Code review (bugs, security, improvements) | 0.1 | 2048 |
| `llm_refactor` | Code refactoring | 0.3 | 4096 |
| `llm_test` | Unit test generation | 0.2 | 4096 |
| `llm_debug` | Error analysis and fix suggestions | 0.1 | 2048 |
| `llm_chat` | Multi-turn conversation with session history | 0.7 | 2048 |
| `llm_metrics` | Show usage stats and estimated cost savings | - | - |
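Under the hood, each tool is invoked through the standard MCP `tools/call` method. As a rough illustration (the exact argument names, such as `prompt`, are assumptions not documented here), a raw JSON-RPC request might look like:

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "llm_generate",
    "arguments": {
      "prompt": "Write a haiku about GPUs"
    }
  }
}
```

In practice your MCP client (Claude Code, Cursor, etc.) constructs these requests for you.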
## Streaming

`llm_generate` and `llm_code` support an experimental `stream: true` option. When enabled, tokens are sent progressively via MCP logging notifications; the full response is still returned as the final tool result.

> Note: Streaming uses `sendLoggingMessage` as a workaround for stdio transport limitations. Some MCP clients may not display these notifications.
## Conversation History

`llm_chat` maintains per-session message history. Pass a `session_id` to continue a conversation, or omit it to start a new one. The server keeps up to 10 concurrent sessions, evicting the oldest on overflow.
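The session store described above can be sketched roughly as follows (a minimal illustration of the eviction behavior, not the server's actual code):

```javascript
// Per-session message history with oldest-session eviction.
// A Map iterates keys in insertion order, so the first key is the oldest session.
const MAX_SESSIONS = 10;
const sessions = new Map();

function appendMessage(sessionId, message) {
  if (!sessions.has(sessionId)) {
    if (sessions.size >= MAX_SESSIONS) {
      // Evict the oldest session to make room.
      const oldest = sessions.keys().next().value;
      sessions.delete(oldest);
    }
    sessions.set(sessionId, []);
  }
  sessions.get(sessionId).push(message);
  return sessions.get(sessionId);
}
```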
## Caching

Responses to deterministic requests (`temperature <= 0.3`) are cached for 5 minutes. Identical requests within that window return instantly from the cache.
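A minimal sketch of this TTL cache, assuming the key is derived from the request parameters (the real implementation's key scheme is not documented here):

```javascript
// 5-minute response cache, applied only to low-temperature requests.
const CACHE_TTL_MS = 5 * 60 * 1000;
const cache = new Map();

// Hypothetical key: model + prompt + temperature serialized together.
function cacheKey(params) {
  return JSON.stringify([params.model, params.prompt, params.temperature]);
}

function getCached(params) {
  if (params.temperature > 0.3) return null; // only deterministic requests are cached
  const entry = cache.get(cacheKey(params));
  if (!entry || Date.now() - entry.at > CACHE_TTL_MS) return null; // miss or expired
  return entry.value;
}

function putCached(params, value) {
  if (params.temperature <= 0.3) {
    cache.set(cacheKey(params), { at: Date.now(), value });
  }
}
```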
## Configuration

| Environment Variable | Default | Description |
|---------------------|---------|-------------|
| `LLAMA_SERVER_URL` | `http://localhost:8080` | Base URL of the llama-server (or Ollama) instance |
| `MODEL_NAME` | `default` | Model name sent in API requests |
| `LOG_FILE` | `./mcp.log` | Path to the log file |
## Backend Compatibility

The server auto-detects the backend by probing health endpoints:

| Backend | Health endpoint | API endpoint | Default port |
|---------|----------------|--------------|-------------|
| llama-server | `/health` | `/v1/chat/completions` | 8080 |
| Ollama | `/` | `/v1/chat/completions` | 11434 |

Any backend exposing an OpenAI-compatible `/v1/chat/completions` endpoint should work; just set `LLAMA_SERVER_URL` accordingly.
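The probe sequence in the table can be sketched like this (assumed logic based on the endpoints above; the server's actual detection code may differ):

```javascript
// Detect the backend by probing its health endpoint:
// llama-server answers on /health, Ollama answers on /.
// Uses the global fetch available in Node.js 18+.
async function detectBackend(baseUrl) {
  try {
    const res = await fetch(`${baseUrl}/health`);
    if (res.ok) return 'llama-server';
  } catch (_) { /* fall through to the next probe */ }
  try {
    const res = await fetch(`${baseUrl}/`);
    if (res.ok) return 'ollama';
  } catch (_) { /* backend unreachable */ }
  return 'unknown';
}
```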
## Logging

Logs are written both to stderr (with a `[LlamaMCP]` prefix) and to the log file. Stdout is reserved for the MCP protocol.
To monitor logs in real time:

```sh
tail -f mcp.log
```