# llama-mcp

MCP server for using local LLMs with MCP-client coding agents such as Claude Code.

Bridges MCP clients (Claude Code, Cursor, etc.) with any local model served by llama-server or Ollama. Offload bulk generation tasks to your local GPU and save on API costs.

Works with any model supported by llama.cpp: Gemma, LLaMA, Mistral, Qwen, Phi, and more.
## Prerequisites

- Node.js 18+
- A local LLM backend, either:
  - llama-server (llama.cpp)
  - Ollama
```sh
# llama-server examples
llama-server -m gemma-4-12b-it-Q6_K.gguf --port 8080
llama-server -m llama-3.1-8b-instruct.gguf --port 8080
llama-server -m mistral-7b-instruct-v0.3.Q5_K_M.gguf --port 8080

# Ollama
ollama serve
ollama pull llama3.1
```
## Setup

```sh
npm install
```

## Usage

```sh
npm start
# or
node server.js
```
## Claude Code

Register via the CLI:

```sh
claude mcp add -s user llama-mcp node /path/to/llama-mcp/server.js
```

Or add it manually to your MCP config (`~/.claude/claude_desktop_config.json` or a project-level `.mcp.json`).
With llama-server (default):

```json
{
  "mcpServers": {
    "llama-mcp": {
      "command": "node",
      "args": ["/path/to/llama-mcp/server.js"],
      "env": {
        "MODEL_NAME": "gemma4"
      }
    }
  }
}
```
With Ollama:

```json
{
  "mcpServers": {
    "llama-mcp": {
      "command": "node",
      "args": ["/path/to/llama-mcp/server.js"],
      "env": {
        "LLAMA_SERVER_URL": "http://localhost:11434",
        "MODEL_NAME": "llama3.1"
      }
    }
  }
}
```
## Tools

| Tool | Purpose | Temperature | Max Tokens |
|------|---------|-------------|------------|
| `llm_generate` | General text/code generation | 0.7 | 2048 |
| `llm_code` | Production-ready code generation | 0.2 | 4096 |
| `llm_review` | Code review (bugs, security, improvements) | 0.1 | 2048 |
| `llm_refactor` | Code refactoring | 0.3 | 4096 |
| `llm_test` | Unit test generation | 0.2 | 4096 |
| `llm_debug` | Error analysis and fix suggestions | 0.1 | 2048 |
| `llm_chat` | Multi-turn conversation with session history | 0.7 | 2048 |
| `llm_metrics` | Show usage stats and estimated cost savings | - | - |
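Under the hood, each tool is invoked through the standard MCP `tools/call` method. As a rough illustration (the exact argument names, such as `prompt`, are assumptions not documented here), a raw JSON-RPC request might look like:

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "llm_generate",
    "arguments": {
      "prompt": "Write a haiku about GPUs"
    }
  }
}
```

In practice your MCP client (Claude Code, Cursor, etc.) constructs these requests for you.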
## Streaming

`llm_generate` and `llm_code` support an experimental `stream: true` option. When enabled, tokens are sent progressively via MCP logging notifications; the full response is still returned as the final tool result.

> Note: Streaming uses `sendLoggingMessage` as a workaround for stdio transport limitations. Some MCP clients may not display these notifications.
## Conversation History

`llm_chat` maintains per-session message history. Pass a `session_id` to continue a conversation, or omit it to start a new one. The server keeps up to 10 concurrent sessions, evicting the oldest on overflow.
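The session store described above can be sketched roughly as follows (a minimal illustration of the eviction behavior, not the server's actual code):

```javascript
// Per-session message history with oldest-session eviction.
// A Map iterates keys in insertion order, so the first key is the oldest session.
const MAX_SESSIONS = 10;
const sessions = new Map();

function appendMessage(sessionId, message) {
  if (!sessions.has(sessionId)) {
    if (sessions.size >= MAX_SESSIONS) {
      // Evict the oldest session to make room.
      const oldest = sessions.keys().next().value;
      sessions.delete(oldest);
    }
    sessions.set(sessionId, []);
  }
  sessions.get(sessionId).push(message);
  return sessions.get(sessionId);
}
```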
## Caching

Responses to deterministic requests (`temperature <= 0.3`) are cached for 5 minutes. Identical requests within that window return instantly from the cache.
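A minimal sketch of this TTL cache, assuming the key is derived from the request parameters (the real implementation's key scheme is not documented here):

```javascript
// 5-minute response cache, applied only to low-temperature requests.
const CACHE_TTL_MS = 5 * 60 * 1000;
const cache = new Map();

// Hypothetical key: model + prompt + temperature serialized together.
function cacheKey(params) {
  return JSON.stringify([params.model, params.prompt, params.temperature]);
}

function getCached(params) {
  if (params.temperature > 0.3) return null; // only deterministic requests are cached
  const entry = cache.get(cacheKey(params));
  if (!entry || Date.now() - entry.at > CACHE_TTL_MS) return null; // miss or expired
  return entry.value;
}

function putCached(params, value) {
  if (params.temperature <= 0.3) {
    cache.set(cacheKey(params), { at: Date.now(), value });
  }
}
```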
## Configuration

| Environment Variable | Default | Description |
|---------------------|---------|-------------|
| `LLAMA_SERVER_URL` | `http://localhost:8080` | Base URL of the llama-server (or Ollama) instance |
| `MODEL_NAME` | `default` | Model name sent in API requests |
| `LOG_FILE` | `./mcp.log` | Path to the log file |
## Backend Compatibility

The server auto-detects the backend by probing health endpoints:

| Backend | Health endpoint | API endpoint | Default port |
|---------|----------------|--------------|-------------|
| llama-server | `/health` | `/v1/chat/completions` | 8080 |
| Ollama | `/` | `/v1/chat/completions` | 11434 |

Any backend exposing an OpenAI-compatible `/v1/chat/completions` endpoint should work; just set `LLAMA_SERVER_URL` accordingly.
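The probe sequence in the table can be sketched like this (assumed logic based on the endpoints above; the server's actual detection code may differ):

```javascript
// Detect the backend by probing its health endpoint:
// llama-server answers on /health, Ollama answers on /.
// Uses the global fetch available in Node.js 18+.
async function detectBackend(baseUrl) {
  try {
    const res = await fetch(`${baseUrl}/health`);
    if (res.ok) return 'llama-server';
  } catch (_) { /* fall through to the next probe */ }
  try {
    const res = await fetch(`${baseUrl}/`);
    if (res.ok) return 'ollama';
  } catch (_) { /* backend unreachable */ }
  return 'unknown';
}
```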
## Logging

Logs are written both to stderr (with a `[LlamaMCP]` prefix) and to the log file. Stdout is reserved for the MCP protocol.
To monitor logs in real time:

```sh
tail -f mcp.log
```