# Forge MCP Server

Swarm agents that turn slow PyTorch into fast CUDA/Triton kernels, from any AI coding agent.
Installation · Tools · Resources · Prompts · Security · Development
## Overview
Forge transforms PyTorch models into production-grade CUDA/Triton kernels through automated multi-agent optimization. Using 32 parallel AI agents with inference-time scaling, it achieves up to 14x faster inference than `torch.compile(mode='max-autotune-no-cudagraphs')` while maintaining 100% numerical correctness.
This MCP server connects any MCP-compatible AI coding agent to Forge. Your agent submits PyTorch code, Forge optimizes it with swarm agents on real datacenter GPUs, and returns the fastest kernel as a drop-in replacement.
## What it does

- **Optimize existing kernels** - Submit PyTorch code, get back an optimized Triton/CUDA kernel benchmarked against `torch.compile(max-autotune)`
- **Generate new kernels** - Describe an operation (e.g. "fused LayerNorm + GELU + Dropout"), get a production-ready optimized kernel
- **32 parallel swarm agents** - Coder+Judge agent pairs compete to discover optimal kernels, exploring tensor core utilization, memory coalescing, shared memory tiling, and kernel fusion simultaneously
- **Real datacenter GPU benchmarking** - Every kernel is compiled, tested for correctness, and profiled on actual datacenter hardware
- **250k tokens/sec inference** - Results in minutes, not hours
- **Smart detection** - The agent automatically recognizes when your code would benefit from GPU optimization
- **One-click auth** - Browser-based OAuth sign-in. No API keys to manage.
## Supported GPUs

All optimization and benchmarking runs on datacenter-grade hardware:

| GPU  | Architecture |
|------|--------------|
| B200 | Blackwell    |
| H200 | Hopper       |
| H100 | Hopper      |
| L40S | Ada Lovelace |
| A100 | Ampere       |
| L4   | Ada Lovelace |
| A10  | Ampere       |
| T4   | Turing       |
## Supported clients

| Client            | Status                    |
|-------------------|---------------------------|
| Claude Code       | Fully supported           |
| Claude Desktop    | Fully supported           |
| OpenCode          | Fully supported           |
| Cursor            | Fully supported           |
| Windsurf          | Fully supported           |
| VS Code + Copilot | Fully supported           |
| Any MCP client    | Fully supported via stdio |
## Installation

### Claude Code

**macOS / Linux:**

```bash
claude mcp add forge-mcp -- npx -y @rightnow/forge-mcp-server
```

**Windows:**

```bash
claude mcp add forge-mcp -- cmd /c npx -y @rightnow/forge-mcp-server
```
### Claude Desktop

Add to your `claude_desktop_config.json`:

**macOS**: `~/Library/Application Support/Claude/claude_desktop_config.json`

```json
{
  "mcpServers": {
    "forge": {
      "command": "npx",
      "args": ["-y", "@rightnow/forge-mcp-server"]
    }
  }
}
```

**Windows**: `%APPDATA%\Claude\claude_desktop_config.json`

```json
{
  "mcpServers": {
    "forge": {
      "command": "cmd",
      "args": ["/c", "npx", "-y", "@rightnow/forge-mcp-server"]
    }
  }
}
```
### VS Code / Copilot

Add to your `.vscode/mcp.json` (workspace) or user settings:

```json
{
  "servers": {
    "forge": {
      "command": "npx",
      "args": ["-y", "@rightnow/forge-mcp-server"]
    }
  }
}
```

**Windows:** Use `"command": "cmd"` with `"args": ["/c", "npx", "-y", "@rightnow/forge-mcp-server"]`
### Cursor

Add to your Cursor MCP settings (`~/.cursor/mcp.json`):

```json
{
  "mcpServers": {
    "forge": {
      "command": "npx",
      "args": ["-y", "@rightnow/forge-mcp-server"]
    }
  }
}
```

**Windows:** Use `"command": "cmd"` with `"args": ["/c", "npx", "-y", "@rightnow/forge-mcp-server"]`
### Windsurf

Add to your Windsurf MCP configuration:

```json
{
  "mcpServers": {
    "forge": {
      "command": "npx",
      "args": ["-y", "@rightnow/forge-mcp-server"]
    }
  }
}
```

**Windows:** Use `"command": "cmd"` with `"args": ["/c", "npx", "-y", "@rightnow/forge-mcp-server"]`
### OpenCode

Add to your `opencode.json`:

```json
{
  "mcp": {
    "forge": {
      "command": "npx",
      "args": ["-y", "@rightnow/forge-mcp-server"]
    }
  }
}
```
## Tools

### forge_auth

Authenticate with the Forge service. Opens your browser to sign in via the RightNow dashboard. Required before using any other tool.

- **Inputs:**
  - `force` (boolean, optional): Force re-authentication even if valid tokens exist
- **Returns:** Authentication status, email, plan type, and credit balance
### forge_optimize

Submit PyTorch code for GPU kernel optimization. 32 swarm agents generate optimized Triton or CUDA kernels, evaluate them on real datacenter GPUs, and return the best result with speedup metrics.

The agent will automatically use this tool when it detects:

- PyTorch custom operations (`torch.autograd.Function`, custom `forward`/`backward`)
- Manual CUDA kernels that could be faster
- Performance-critical tensor operations (attention, convolution, normalization, softmax)
- Code with comments like `"slow"`, `"bottleneck"`, `"optimize"`
- `torch.compile()` targets or `triton.jit` kernels
- Any `nn.Module` with significant compute in `forward()`
- Matrix multiplication, reduction, or scan operations
- Custom loss functions with reduction operations
- Fused operation opportunities (e.g., LayerNorm + activation)

- **Inputs:**
  - `pytorch_code` (string, required): Complete PyTorch code to optimize. Max 500 KB.
  - `kernel_name` (string, required): Short name for the kernel (e.g., `"flash_attention"`)
  - `output_format` (enum, optional): `"triton"` (default) or `"native_cuda"`
  - `target_speedup` (number, optional): Target speedup multiplier. Default `2.0`
  - `max_iterations` (number, optional): Max optimization iterations (1-100). Default `10`
  - `gpu` (enum, optional): Target GPU. Default `"H100"`. Options: `B200`, `H200`, `H100`, `L40S`, `A100`, `L4`, `A10`, `T4`
  - `user_prompt` (string, optional): Guidance for the optimizer (e.g., `"focus on memory bandwidth"`)
- **Returns:** Optimized kernel code, speedup metrics, latency comparison, iteration history
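For illustration, here is the kind of self-contained module an agent might pass as `pytorch_code` (the module and its names are hypothetical, not from the Forge docs); it matches several of the detection patterns above: an `nn.Module` with a normalization + activation fusion opportunity.

```python
import torch
import torch.nn as nn

# Hypothetical submission: a LayerNorm + GELU block, a natural fusion target.
class NormGELU(nn.Module):
    def __init__(self, hidden_size: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Two eager ops mean two kernel launches and two full passes over
        # global memory; a fused kernel can do both in one pass.
        return torch.nn.functional.gelu(self.norm(x))
```

The agent would then call `forge_optimize` with, say, `kernel_name: "norm_gelu"` and the defaults for everything else.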
### forge_generate

Generate an optimized GPU kernel from scratch based on a natural-language specification. Forge creates a PyTorch baseline, then optimizes it into Triton or CUDA.

- **Inputs:**
  - `operation` (string, required): Operation name (e.g., `"fused_attention"`, `"softmax"`)
  - `description` (string, required): Detailed description of what the kernel should do
  - `input_shapes` (number[][], required): Input tensor shapes (e.g., `[[8, 512, 768]]`)
  - `output_shape` (number[], optional): Expected output shape
  - `dtype` (string, optional): Data type. Default `"float16"`
  - `output_format` (enum, optional): `"triton"` (default) or `"native_cuda"`
  - `target_speedup` (number, optional): Target speedup. Default `2.0`
  - `max_iterations` (number, optional): Max iterations (1-100). Default `10`
  - `gpu` (enum, optional): Target GPU. Default `"H100"`
  - `user_prompt` (string, optional): Additional guidance
- **Returns:** Generated kernel code, speedup metrics, iteration history
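To make the flow concrete, the PyTorch baseline Forge creates for the earlier "fused LayerNorm + GELU + Dropout" example might look like the sketch below (illustrative only; the actual generated baseline may differ):

```python
import torch
import torch.nn.functional as F

# Sketch of a baseline for operation="fused_layernorm_gelu_dropout",
# input_shapes=[[8, 512, 768]], dtype="float16". Illustrative only: it
# shows the reference semantics the optimized kernel must reproduce.
def baseline(x, weight, bias, p=0.1):
    y = F.layer_norm(x, x.shape[-1:], weight, bias)
    y = F.gelu(y)
    return F.dropout(y, p=p, training=True)

x = torch.randn(8, 512, 768, device="cuda", dtype=torch.float16)
w = torch.ones(768, device="cuda", dtype=torch.float16)
b = torch.zeros(768, device="cuda", dtype=torch.float16)
out = baseline(x, w, b)
```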
### forge_credits

Check your current Forge credit balance.

- **Inputs:** None
- **Returns:** Credit balance, total purchased, total used, plan type
### forge_status

Check the status of a running or completed optimization job.

- **Inputs:**
  - `session_id` (string, required): Session ID from `forge_optimize` or `forge_generate`
- **Returns:** Job status, current iteration, best speedup
### forge_cancel

Cancel a running optimization job.

- **Inputs:**
  - `session_id` (string, required): Session ID of the job to cancel
- **Returns:** Cancellation confirmation
### forge_sessions

List past optimization sessions with results.

- **Inputs:**
  - `limit` (number, optional): Number of sessions to return (1-100). Default `10`
  - `status` (enum, optional): Filter by status: `"all"`, `"completed"`, `"failed"`, `"running"`. Default `"all"`
- **Returns:** Table of sessions with task name, GPU, speedup, status, and date
### Tool Annotations
| Tool | Read-only | Idempotent | Destructive |
|------|-----------|------------|-------------|
| forge_auth | No | Yes | No |
| forge_optimize | No | No | No |
| forge_generate | No | No | No |
| forge_credits | Yes | Yes | No |
| forge_status | Yes | Yes | No |
| forge_cancel | No | No | Yes |
| forge_sessions | Yes | Yes | No |
## Resources
| URI | Description |
|-----|-------------|
| forge://auth/status | Current authentication state (authenticated, token expiry, has refresh token) |
| forge://credits | Credit balance, usage, and plan information |
## Prompts

### forge-optimize

Guided workflow for optimizing a GPU kernel. Instructs the agent to:

- Check credit balance
- Analyze the code for optimization targets
- Call `forge_optimize` with appropriate parameters
- Explain the results and suggest integration
### forge-analyze
Teaches the agent to scan a codebase for GPU optimization opportunities, ranked by expected impact:
| Priority | Pattern |
|----------|---------|
| HIGH | Custom autograd functions, attention mechanisms, fused operations |
| MEDIUM | Standard nn.Module compositions, normalization + activation fusion |
| LOW | Element-wise operations, simple reductions |
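As an illustration of a HIGH-priority finding, a custom autograd function like the hypothetical one below runs several eager ops in both passes that a single fused kernel could replace:

```python
import torch

# Hypothetical HIGH-priority pattern: a custom autograd Function whose
# forward and backward each launch several eager kernels.
class SoftmaxCrossEntropy(torch.autograd.Function):
    @staticmethod
    def forward(ctx, logits, targets):
        probs = torch.softmax(logits, dim=-1)
        ctx.save_for_backward(probs, targets)
        return -torch.log(probs.gather(-1, targets.unsqueeze(-1))).mean()

    @staticmethod
    def backward(ctx, grad_out):
        probs, targets = ctx.saved_tensors
        grad = probs.clone()  # d(loss)/d(logits) = (probs - onehot) / N
        grad.scatter_add_(
            -1, targets.unsqueeze(-1),
            -torch.ones_like(targets, dtype=probs.dtype).unsqueeze(-1))
        return grad_out * grad / targets.numel(), None
```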
## How It Works

```
┌──────────────┐     stdio      ┌──────────────────┐     HTTPS      ┌──────────────────┐
│   AI Agent   │ ──────────────>│    Forge MCP     │ ──────────────>│    Forge API     │
│  (Claude,    │                │      Server      │                │  (RightNow AI)   │
│   Cursor,    │<───────────────│                  │<───────────────│                  │
│   etc.)      │   MCP result   │ - OAuth + PKCE   │   SSE stream   │ - 32 swarm       │
└──────────────┘                │ - SSE streaming  │                │   agents         │
                                │ - Token mgmt     │                │ - Real GPU       │
                                └──────────────────┘                │   benchmarking   │
                                                                    └──────────────────┘
```
- **Authenticate:** The agent calls `forge_auth`, which opens your browser. Sign in once; tokens are stored locally at `~/.forge/tokens.json` and auto-refresh.
- **Optimize:** The agent sends your PyTorch code via `forge_optimize`. The MCP server POSTs to the Forge API and streams SSE events in real time.
- **Benchmark:** 32 parallel Coder+Judge agents generate kernels, compile them, test correctness against the PyTorch reference, and profile performance on real datacenter GPUs.
- **Return:** The MCP server collects all results and returns the optimized code, speedup metrics, and iteration history. The output is a drop-in replacement for your original code.
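As a sketch of what "drop-in replacement" means in practice (the generated module path and function name below are illustrative, not real Forge artifacts):

```python
import torch

# Hypothetical integration: Forge returns a kernel wrapped in a function
# matching your original forward pass. `forge_kernels.norm_gelu` is an
# illustrative name, not a real artifact.
from forge_kernels.norm_gelu import fused_norm_gelu

class NormGELU(torch.nn.Module):
    def __init__(self, hidden_size: int):
        super().__init__()
        self.norm = torch.nn.LayerNorm(hidden_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Before: return torch.nn.functional.gelu(self.norm(x))
        return fused_norm_gelu(x, self.norm.weight, self.norm.bias)
```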
Each optimization costs 1 credit. Credits are only charged for successful runs (speedup >= 1.1x). Failed runs and cancelled jobs are not charged.
## Configuration

### Authentication

No API keys needed. The server uses OAuth 2.0 with PKCE for secure browser-based authentication:

- Agent calls `forge_auth`
- Your default browser opens to `dashboard.rightnowai.co`
- Sign in or create an account
- Authorization completes automatically
- Tokens are stored locally at `~/.forge/tokens.json` (mode `0600`)
- Access tokens auto-refresh; you only sign in once
### Credits

Forge uses a pay-as-you-go credit system. Each optimization or generation run costs 1 credit.

| Credits    | Price                 | Per Credit |
|------------|-----------------------|------------|
| 1-9        | $15.00 each           | $15.00     |
| 10+        | 25% off               | $11.25     |
| 50         | $562.50               | $11.25     |
| Enterprise | Custom volume pricing | Contact us |

**Free trial:** optimize 1 kernel, no credit card required.

**100% refund guarantee:** if Forge doesn't beat `torch.compile`, you get your credit back.
Purchase credits at dashboard.rightnowai.co.
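For clarity, the tier math from the table above works out like this (a minimal sketch of the listed pricing, not an official calculator):

```python
def credit_cost(n: int) -> float:
    # From the pricing table: $15.00 each for 1-9 credits,
    # 25% off ($11.25 each) at 10 or more.
    return n * (15.00 if n < 10 else 11.25)

assert credit_cost(50) == 562.50  # matches the 50-credit row
```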
## Benchmarks

End-to-end latency on NVIDIA B200. Forge vs `torch.compile(mode='max-autotune-no-cudagraphs')`:

| Model | torch.compile | Forge | Speedup |
|-------|---------------|-------|---------|
| Llama-3.1-8B | 42.3ms | 8.2ms | 5.16x |
| Qwen2.5-7B | 38.5ms | 9.1ms | 4.23x |
| Mistral-7B | 35.2ms | 10.4ms | 3.38x |
| Phi-3-mini | 18.7ms | 6.8ms | 2.75x |
| SDXL UNet | 89.4ms | 31.2ms | 2.87x |
| Whisper-large | 52.1ms | 19.8ms | 2.63x |
| BERT-large | 12.4ms | 5.1ms | 2.43x |
See the full benchmarks at rightnowai.co/forge.
## Security

### Token Protection

- **No tokens in errors:** All error messages are sanitized through regex filters that strip JWTs, Bearer tokens, hex tokens, and credential parameters before reaching the agent
- **Local storage only:** Tokens are stored at `~/.forge/tokens.json` with file mode `0600` (owner read/write only)
- **Auto-refresh:** Access tokens expire in 1 hour and auto-refresh using the stored refresh token
- **PKCE flow:** OAuth uses Proof Key for Code Exchange (SHA-256), preventing authorization code interception
- **No secrets in config:** The MCP server requires zero environment variables or API keys
### Input Validation

- PyTorch code input is capped at 500 KB to prevent memory exhaustion
- User prompts are capped at 10 KB
- All string inputs have maximum length validation via Zod schemas
- Numeric inputs have min/max bounds (e.g., `max_iterations`: 1-100)
### Network Security
- All API communication uses HTTPS
- Non-SSE requests have a 30-second timeout to prevent hanging
- SSE streams have a 10-minute timeout with automatic cleanup
- Token refresh uses a mutex to prevent race conditions from concurrent requests (see the sketch below)
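The refresh-mutex point deserves a sketch. The server itself is TypeScript; the Python version below only illustrates the pattern, and every name in it is hypothetical:

```python
import asyncio
import time

# Without the lock, two concurrent requests that both observe an expired
# token would each call the refresh endpoint, and the slower response
# could overwrite the newer token.
class TokenManager:
    def __init__(self) -> None:
        self._lock = asyncio.Lock()
        self._access_token: str | None = None
        self._expires_at: float = 0.0

    async def get_token(self) -> str:
        async with self._lock:
            # Re-check expiry inside the lock: a concurrent caller may
            # have refreshed while we were waiting.
            if self._access_token is None or time.monotonic() >= self._expires_at:
                await self._refresh()
            return self._access_token

    async def _refresh(self) -> None:
        # Stand-in for the real HTTPS call to the token endpoint.
        self._access_token = "fresh-access-token"
        self._expires_at = time.monotonic() + 3600.0
```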
### What the server can access

- **Network:** Only `dashboard.rightnowai.co` and `forge-api.rightnowai.co`
- **Filesystem:** Only reads/writes `~/.forge/tokens.json`
- **No codebase access:** The MCP server never reads your files. The agent passes code to it explicitly through tool parameters.
## Development

### Build from source

```bash
git clone https://github.com/RightNow-AI/forge-mcp-server.git
cd forge-mcp-server
npm install
npm run build
```
### Run locally

```bash
npm run dev
```

### Type check

```bash
npm run typecheck
```

### Debug with MCP Inspector

```bash
npx @modelcontextprotocol/inspector node dist/index.js
```
This opens a web UI where you can invoke each tool, inspect inputs/outputs, and debug the server interactively.
### Project structure

```
forge-mcp-server/
├── src/
│   ├── index.ts            # Entry point (McpServer + StdioServerTransport)
│   ├── server.ts           # Registers all tools, resources, prompts
│   ├── constants.ts        # URLs, client IDs, timeouts, limits
│   ├── types.ts            # TypeScript interfaces + type guards + sanitization
│   ├── auth/
│   │   ├── oauth-client.ts # PKCE flow, token refresh, access token management
│   │   └── token-store.ts  # ~/.forge/tokens.json read/write/clear
│   ├── api/
│   │   ├── forge-client.ts # HTTP client for all Forge API endpoints
│   │   └── sse-consumer.ts # SSE stream parser via native fetch + ReadableStream
│   ├── tools/              # 7 MCP tools
│   ├── resources/          # 2 MCP resources
│   └── prompts/            # 2 MCP prompts
├── .github/workflows/
│   ├── ci.yml              # Typecheck + build on push/PR
│   └── release.yml         # npm publish on version tags
├── package.json
├── tsconfig.json
└── tsup.config.ts
```
## Contributing
Contributions are welcome. Please open an issue first to discuss what you'd like to change.
- Fork the repo
- Create a branch (`git checkout -b feature/my-feature`)
- Make your changes
- Run `npm run typecheck` and `npm run build`
- Commit and push
- Open a pull request
## License
Part of the RightNow AI ecosystem. Member of the NVIDIA Inception Program.