Sentinel-MCP: The Autonomous Infrastructure Repair Agent
Bridging the gap between monitoring alerts and autonomous infrastructure remediation using IBM Bob and watsonx.ai
🎯 Problem Statement
Infrastructure teams are overwhelmed by alert fatigue. When a server or application fails, DevOps engineers must manually:
- Correlate logs from multiple sources
- Identify the root cause
- Apply a fix
- Document the remediation
This manual intervention leads to:
- Higher Mean Time to Recovery (MTTR)
- Operational burnout
- Inconsistent remediation practices
- Poor documentation
💡 Solution
Sentinel-MCP is an autonomous remediation agent that uses the Model Context Protocol (MCP) to bridge IBM Bob with live system environments (Kubernetes, Linux, Cloud APIs).
Key Features
- 🤖 Autonomous Reasoning: Uses IBM Bob's agentic capabilities to analyze and fix infrastructure issues
- 🧠 AI-Powered Analysis: Leverages IBM Granite models via watsonx.ai for intelligent log analysis
- 🔒 Security-First: Built-in security constraints and approval workflows
- 📝 Auto-Documentation: Generates comprehensive remediation reports automatically
- 🔄 Rollback Support: Safe execution with automatic rollback capabilities
- 🎯 Multi-Environment: Supports both Kubernetes and Linux bare-metal systems
🏗️ Architecture
graph TB
A[Prometheus AlertManager] -->|Webhook| B[Alert Receiver]
B --> C[Sentinel-MCP Core]
C --> D[MCP Server]
D --> E[System Tools]
E --> F[Linux Operations]
E --> G[Kubernetes Operations]
C --> H[watsonx.ai Integration]
H --> I[IBM Granite Models]
C --> J[Reasoning Engine]
J --> K[Security Validator]
K --> L[Remediation Executor]
L --> M[Documentation Generator]
🚀 Quick Start
Prerequisites
- Rust 1.75+ and Cargo
- Docker and Docker Compose
- Kubernetes cluster (for K8s features)
- IBM Cloud account with watsonx.ai access
- Prometheus and AlertManager (optional, for full demo)
Installation
- Clone the repository
git clone https://github.com/paulmmoore3416/Sentinel-MCP.git
cd Sentinel-MCP
- Set up environment variables
cp .env.example .env
# Edit .env with your credentials
Required environment variables:
# IBM watsonx.ai Configuration
WATSONX_API_KEY=your_api_key_here
WATSONX_PROJECT_ID=your_project_id_here
WATSONX_URL=https://us-south.ml.cloud.ibm.com
# MCP Server Configuration
MCP_SERVER_PORT=3000
MCP_AUTH_TOKEN=your_secure_token_here
# Security Settings
APPROVAL_REQUIRED=true
DRY_RUN_MODE=false
- Build the project
cargo build --release
- Run the server
cargo run --release
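The environment variables listed in the installation steps could be loaded and validated at startup along the following lines. This is a minimal sketch using only the standard library; the `Settings` struct, `load_settings`, and the strict `true`/`false` parsing are illustrative assumptions, not the project's actual configuration code.

```rust
/// Hypothetical runtime settings mirroring a subset of `.env.example`.
#[derive(Debug, PartialEq)]
pub struct Settings {
    pub watsonx_url: String,
    pub mcp_server_port: u16,
    pub approval_required: bool,
    pub dry_run_mode: bool,
}

/// Accepts "true" or "false" (case-insensitive); anything else is an error.
fn parse_bool(name: &str, raw: &str) -> Result<bool, String> {
    match raw.trim().to_ascii_lowercase().as_str() {
        "true" => Ok(true),
        "false" => Ok(false),
        other => Err(format!("{name} must be true or false, got {other:?}")),
    }
}

/// Builds `Settings` from any key/value lookup, so the same code can be
/// driven by `std::env::var` in production and by a map in tests.
pub fn load_settings<F>(lookup: F) -> Result<Settings, String>
where
    F: Fn(&str) -> Option<String>,
{
    let required = |name: &str| lookup(name).ok_or_else(|| format!("{name} is not set"));

    let port_raw = required("MCP_SERVER_PORT")?;
    let mcp_server_port = port_raw
        .trim()
        .parse::<u16>()
        .map_err(|e| format!("MCP_SERVER_PORT is invalid: {e}"))?;

    Ok(Settings {
        watsonx_url: required("WATSONX_URL")?,
        mcp_server_port,
        approval_required: parse_bool("APPROVAL_REQUIRED", &required("APPROVAL_REQUIRED")?)?,
        dry_run_mode: parse_bool("DRY_RUN_MODE", &required("DRY_RUN_MODE")?)?,
    })
}

/// Convenience wrapper that reads the real process environment.
pub fn load_settings_from_env() -> Result<Settings, String> {
    load_settings(|name| std::env::var(name).ok())
}
```

Failing fast on a missing or malformed variable keeps a misconfigured agent from starting with approval checks silently disabled.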
📖 Usage Guide
Basic Usage
1. Starting Sentinel-MCP
# Start in interactive mode (requires approval for all actions)
./target/release/sentinel-mcp --mode interactive
# Start in autonomous mode (auto-approves low-risk actions)
./target/release/sentinel-mcp --mode autonomous
# Start in dry-run mode (simulates all actions)
./target/release/sentinel-mcp --mode dry-run
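The three modes above can be read as a small decision table over mode and risk. The sketch below is one possible encoding; the type names and the choice to send high-risk actions to an operator even in autonomous mode are assumptions drawn from the mode descriptions, not the project's confirmed behavior.

```rust
use std::str::FromStr;

/// The values accepted by `--mode` in the commands above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Mode {
    Interactive,
    Autonomous,
    DryRun,
}

impl FromStr for Mode {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "interactive" => Ok(Mode::Interactive),
            "autonomous" => Ok(Mode::Autonomous),
            "dry-run" => Ok(Mode::DryRun),
            other => Err(format!("unknown mode: {other}")),
        }
    }
}

/// Hypothetical two-level risk rating for a proposed action.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Risk {
    Low,
    High,
}

/// What the agent does with a proposed action.
#[derive(Debug, PartialEq, Eq)]
pub enum Decision {
    Execute,
    AskOperator,
    SimulateOnly,
}

/// Dry-run never executes, interactive always asks, and autonomous
/// auto-approves only low-risk actions (assumed policy).
pub fn decide(mode: Mode, risk: Risk) -> Decision {
    match (mode, risk) {
        (Mode::DryRun, _) => Decision::SimulateOnly,
        (Mode::Interactive, _) => Decision::AskOperator,
        (Mode::Autonomous, Risk::Low) => Decision::Execute,
        (Mode::Autonomous, Risk::High) => Decision::AskOperator,
    }
}
```

Because the `match` is exhaustive, adding a new mode or risk level forces the policy to be revisited at compile time.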
2. Triggering an Alert
Manual trigger via CLI:
# Simulate a disk space alert
curl -X POST http://localhost:3000/api/v1/alerts \
-H "Content-Type: application/json" \
-H "Authorization: Bearer ${MCP_AUTH_TOKEN}" \
-d @examples/alerts/disk-space-low.json
Example alert payload (examples/alerts/disk-space-low.json):
{
  "alerts": [{
    "status": "firing",
    "labels": {
      "alertname": "DiskSpaceLow",
      "severity": "warning",
      "instance": "server-01",
      "filesystem": "/var"
    },
    "annotations": {
      "summary": "Disk space is critically low",
      "description": "Filesystem /var is at 92% capacity on server-01"
    },
    "startsAt": "2026-05-02T18:00:00Z"
  }]
}
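Once deserialized, an entry of the `alerts` array above might be held in a struct like the one below. This is a sketch only: the `Alert` type and its helper methods are hypothetical, and JSON parsing (for example with a serialization crate) is deliberately left out so the block stays standard-library only.

```rust
use std::collections::BTreeMap;

/// Hypothetical in-memory form of one entry in the `alerts` array.
#[derive(Debug, Clone)]
pub struct Alert {
    pub status: String,
    pub labels: BTreeMap<String, String>,
    pub annotations: BTreeMap<String, String>,
    pub starts_at: String,
}

impl Alert {
    /// Looks up a label such as "alertname" or "instance".
    pub fn label(&self, key: &str) -> Option<&str> {
        self.labels.get(key).map(String::as_str)
    }

    /// Only firing alerts call for remediation; a "resolved"
    /// notification reports that the condition has already cleared.
    pub fn needs_remediation(&self) -> bool {
        self.status == "firing"
    }

    /// Key for collapsing repeats of the same alert on the same instance.
    /// Returns `None` when either label is missing.
    pub fn dedup_key(&self) -> Option<String> {
        Some(format!(
            "{}/{}",
            self.label("alertname")?,
            self.label("instance")?
        ))
    }
}
```

For the payload above, `dedup_key` would combine `DiskSpaceLow` and `server-01`, which gives the agent a simple way to avoid starting two remediations for the same problem.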
3. Monitoring the Remediation Process
The system will:
- ✅ Receive and parse the alert
- 🔍 Gather system context (logs, disk usage, processes)
- 🧠 Analyze with watsonx.ai
- 💡 Propose remediation steps
- ⏸️ Request approval (if in interactive mode)
- ⚡ Execute remediation
- ✔️ Verify success
- 📝 Generate documentation
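The eight steps above form a linear pipeline in which only the approval step is conditional. One way to model that, as a sketch with hypothetical names (`Stage`, `next`, `plan`), is an enum with an explicit transition function:

```rust
/// One variant per step in the list above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Stage {
    Receive,
    GatherContext,
    Analyze,
    Propose,
    AwaitApproval,
    Execute,
    Verify,
    Document,
}

impl Stage {
    /// Returns the stage that follows `self`, or `None` after the last one.
    /// The approval stage is only entered when `interactive` is true.
    pub fn next(self, interactive: bool) -> Option<Stage> {
        use Stage::*;
        match self {
            Receive => Some(GatherContext),
            GatherContext => Some(Analyze),
            Analyze => Some(Propose),
            Propose if interactive => Some(AwaitApproval),
            Propose => Some(Execute),
            AwaitApproval => Some(Execute),
            Execute => Some(Verify),
            Verify => Some(Document),
            Document => None,
        }
    }
}

/// Lists every stage a remediation passes through, in order.
pub fn plan(interactive: bool) -> Vec<Stage> {
    let mut stages = vec![Stage::Receive];
    while let Some(next) = stages.last().and_then(|s| s.next(interactive)) {
        stages.push(next);
    }
    stages
}
```

Keeping the transitions in one function makes it easy to log the current stage and to see exactly where the approval gate sits.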
Watch the logs:
tail -f logs/sentinel-mcp.log
View the remediation report:
cat logs/remediations/REMEDIATION_LOG_$(date +%Y%m%d).md
Advanced Usage
Using with Prometheus AlertManager
- Configure AlertManager webhook (alertmanager.yml):
route:
  receiver: 'sentinel-mcp'
  group_by: ['alertname', 'instance']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
receivers:
  - name: 'sentinel-mcp'
    webhook_configs:
      - url: 'http://sentinel-mcp:3000/api/v1/alerts'
        send_resolved: true
        http_config:
          bearer_token: 'your_mcp_auth_token'
- Restart AlertManager:
kubectl rollout restart deployment/alertmanager -n monitoring
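On the receiving side, the webhook's bearer token has to be checked against `MCP_AUTH_TOKEN` before an alert is accepted. The following is a sketch of such a check under stated assumptions: the function names are hypothetical, the scheme match is case-sensitive for simplicity, and the hand-rolled comparison only illustrates the idea of avoiding an early exit. A production service would more likely rely on a vetted constant-time comparison library.

```rust
/// Extracts the token from an `Authorization` header value of the form
/// `Bearer <token>`. Returns `None` for other schemes or an empty token.
pub fn bearer_token(header_value: &str) -> Option<&str> {
    let token = header_value.strip_prefix("Bearer ")?.trim();
    if token.is_empty() {
        None
    } else {
        Some(token)
    }
}

/// Compares two byte strings without stopping at the first mismatch.
/// The length check does return early, so the token length is not hidden.
pub fn constant_time_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    // OR together the XOR of every byte pair; zero means all bytes matched.
    a.iter().zip(b).fold(0u8, |acc, (x, y)| acc | (x ^ y)) == 0
}

/// Returns true only when a bearer token is present and matches.
pub fn is_authorized(header_value: Option<&str>, expected_token: &str) -> bool {
    match header_value.and_then(bearer_token) {
        Some(presented) => constant_time_eq(presented.as_bytes(), expected_token.as_bytes()),
        None => false,
    }
}
```

A request with no header, a different scheme, or a wrong token is rejected the same way, so the caller can answer with a single 401 response.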
Kubernetes Deployment
- Create namespace:
kubectl create namespace sentinel-system
- Create secrets:
kubectl create secret generic watsonx-credentials \
--from-literal=api-key="${WATSONX_API_KEY}" \
--from-literal=project-id="${WATSONX_PROJECT_ID}" \
-n sentinel-system
- Deploy Sentinel-MCP:
kubectl apply -f k8s/
- Verify deployment:
kubectl get pods -n sentinel-system
kubectl logs -f deployment/sentinel-mcp -n sentinel-system
🧪 Testing & Demo
Running the Test Suite
# Run all tests
cargo test
# Run integration tests only
cargo test --test integration
# Run with verbose output
cargo test -- --nocapture
Demo Scenarios
We've included several pre-built failure scenarios for demonstration:
Scenario 1: Disk Space Cleanup
# Inject failure
./scripts/test-failure.sh disk-full
# Watch Sentinel-MCP detect and fix
tail -f logs/sentinel-mcp.log
# Verify remediation
df -h /var
cat logs/remediations/REMEDIATION_LOG_*.md
Expected outcome:
- Sentinel detects disk at 95% capacity
- Analyzes logs to find old/rotatable files
- Proposes cleanup of /var/log/old-logs
- Executes cleanup after approval
- Verifies disk usage reduced to ~45%
- Documents the entire process
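The verification step in this scenario amounts to reading the usage percentage back from `df` and comparing it with a threshold. A rough sketch follows; the function names and the idea of passing in a single `df` data line are assumptions for illustration, and the threshold is whatever value the operator chooses.

```rust
/// Extracts the usage percentage from one data line of `df` output,
/// for example "/dev/sda1  50G  22G  28G  45% /var".
/// Returns `None` for the header line or any line without a numeric "NN%" field.
pub fn parse_use_percent(df_line: &str) -> Option<u8> {
    df_line
        .split_whitespace()
        .find_map(|field| field.strip_suffix('%'))
        .and_then(|digits| digits.parse::<u8>().ok())
}

/// Outcome of the post-cleanup check.
#[derive(Debug, PartialEq, Eq)]
pub enum Verification {
    Recovered,
    StillHigh,
    Unknown,
}

/// Compares the usage after cleanup against a caller-supplied threshold.
pub fn verify_cleanup(df_line_after: &str, threshold_percent: u8) -> Verification {
    match parse_use_percent(df_line_after) {
        Some(used) if used < threshold_percent => Verification::Recovered,
        Some(_) => Verification::StillHigh,
        None => Verification::Unknown,
    }
}
```

Returning `Unknown` when the line cannot be parsed keeps the agent from reporting success on output it did not understand.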
Scenario 2: Service Crash Recovery
# Inject failure
./scripts/test-failure.sh service-crash nginx
# Watch auto-recovery
journalctl -u nginx -f
Expected outcome:
- Sentinel detects nginx service stopped
- Analyzes crash logs
- Identifies configuration error or resource issue
- Restarts service with corrected configuration
- Verifies service is running and healthy
Scenario 3: Kubernetes Pod CrashLoop
# Inject failure
kubectl apply -f examples/scenarios/crashloop-pod.yaml
# Watch Sentinel-MCP diagnose and fix
kubectl logs -f deployment/sentinel-mcp -n sentinel-system
Expected outcome:
- Sentinel detects pod in CrashLoopBackOff
- Analyzes pod logs and events
- Identifies missing ConfigMap or resource limits
- Proposes fix (create ConfigMap or adjust limits)
- Applies fix after approval
- Verifies pod is running
Creating Custom Scenarios
Create a new scenario file in examples/scenarios/:
# examples/scenarios/custom-failure.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: failure-scenario
  namespace: default
data:
  type: "memory-leak"
  severity: "critical"
  description: "Simulate memory leak in application"
  trigger_command: "stress --vm 1 --vm-bytes 2G --timeout 300s"
  expected_remediation: "Restart pod with memory limits"
🔧 Configuration
Security Configuration
Edit config/security-rules.yaml:
security_rules:
  # Commands that require approval
  high_risk_commands:
    - "rm -rf"
    - "DROP DATABASE"
    - "kubectl delete namespace"
  # Commands that can auto-execute
  low_risk_commands:
    - "systemctl restart"
    - "kubectl rollout restart"
    - "docker restart"
  # Kubernetes namespaces allowed for operations
  allowed_namespaces:
    - "default"
    - "production"
    - "staging"
  # Maximum disk space to clean (in GB)
  max_disk_cleanup: 50
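To make the intent of these rules concrete, here is a sketch of how a validator might apply them. Everything in it is an assumption for illustration: the `SecurityRules` struct, the substring and prefix matching, and the decision to check high-risk patterns first. Simple string matching like this is easy to evade (for example `rm -fr`), so it should be read as a starting point rather than a complete safeguard.

```rust
/// Result of checking a command against the rule lists.
#[derive(Debug, PartialEq, Eq)]
pub enum RiskLevel {
    High,
    Low,
    /// Not on either list; a cautious caller would require approval.
    Unlisted,
}

/// Hypothetical in-memory form of config/security-rules.yaml.
pub struct SecurityRules {
    pub high_risk_commands: Vec<String>,
    pub low_risk_commands: Vec<String>,
    pub allowed_namespaces: Vec<String>,
    pub max_disk_cleanup_gb: u64,
}

/// Builds the rule set shown in the YAML above.
pub fn rules_from_readme() -> SecurityRules {
    let owned = |items: &[&str]| items.iter().map(|s| s.to_string()).collect::<Vec<_>>();
    SecurityRules {
        high_risk_commands: owned(&["rm -rf", "DROP DATABASE", "kubectl delete namespace"]),
        low_risk_commands: owned(&["systemctl restart", "kubectl rollout restart", "docker restart"]),
        allowed_namespaces: owned(&["default", "production", "staging"]),
        max_disk_cleanup_gb: 50,
    }
}

impl SecurityRules {
    /// High-risk patterns are matched anywhere in the command and checked
    /// first, so a low-risk prefix cannot hide a chained destructive command.
    pub fn classify(&self, command: &str) -> RiskLevel {
        // Collapse runs of whitespace so "rm   -rf" still matches "rm -rf".
        let normalized = command.split_whitespace().collect::<Vec<_>>().join(" ");
        if self.high_risk_commands.iter().any(|p| normalized.contains(p.as_str())) {
            RiskLevel::High
        } else if self.low_risk_commands.iter().any(|p| normalized.starts_with(p.as_str())) {
            RiskLevel::Low
        } else {
            RiskLevel::Unlisted
        }
    }

    pub fn namespace_allowed(&self, namespace: &str) -> bool {
        self.allowed_namespaces.iter().any(|n| n == namespace)
    }

    pub fn cleanup_within_limit(&self, requested_gb: u64) -> bool {
        requested_gb <= self.max_disk_cleanup_gb
    }
}
```

Treating unlisted commands as their own category, instead of defaulting them to low risk, keeps the allow-list meaningful as new remediation commands appear.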
watsonx.ai Configuration
Edit config/watsonx.yaml:
watsonx:
  model: "ibm/granite-13b-instruct-v2"
  parameters:
    max_new_tokens: 1024
    temperature: 0.7
    top_p: 0.9
  prompts:
    log_analysis: |
      You are an expert SRE analyzing infrastructure logs.
      Analyze the following logs and identify the root cause.
      Provide a concise analysis and suggest remediation steps.
      Logs:
      {log_content}
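The `{log_content}` placeholder in the prompt has to be filled in before the request is sent to the model. A minimal sketch of that step is shown below; `render_prompt` and its `max_log_chars` parameter are hypothetical, and keeping only the tail of long logs is an assumed policy for staying within the model's input limit, not something the configuration above specifies.

```rust
/// Fills the `{log_content}` placeholder in a prompt template.
/// When the log is longer than `max_log_chars`, only the last
/// `max_log_chars` characters are kept, on the assumption that the
/// most recent lines are the most relevant to the failure.
pub fn render_prompt(template: &str, log_content: &str, max_log_chars: usize) -> String {
    let total = log_content.chars().count();
    let tail: String = if total > max_log_chars {
        // Count characters, not bytes, so multi-byte text is never split.
        log_content.chars().skip(total - max_log_chars).collect()
    } else {
        log_content.to_string()
    };
    template.replace("{log_content}", &tail)
}
```

Logs are untrusted input, so a real integration would also want to consider what happens when log text itself contains instructions aimed at the model.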
📊 Monitoring & Observability
Metrics
Sentinel-MCP exposes Prometheus metrics at /metrics:
curl http://localhost:3000/metrics
Key metrics:
- sentinel_alerts_received_total: Total alerts received
- sentinel_remediations_executed_total: Total remediations executed
- sentinel_remediations_success_rate: Success rate of remediations
- sentinel_mttr_seconds: Mean time to recovery
- sentinel_watsonx_api_calls_total: Total watsonx.ai API calls
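The derived metrics in this list (success rate and MTTR) can be computed from a handful of raw counters and rendered in the Prometheus text exposition format. The sketch below assumes a hypothetical `Counters` struct, a success rate expressed as a ratio between 0 and 1, and MTTR averaged over successful remediations only; the project may define these differently.

```rust
/// Hypothetical raw counters from which the metrics are derived.
pub struct Counters {
    pub alerts_received: u64,
    pub remediations_executed: u64,
    pub remediations_succeeded: u64,
    pub total_recovery_seconds: f64,
    pub watsonx_api_calls: u64,
}

/// Renders one metric as a `# TYPE` line followed by its sample line.
fn metric_lines(name: &str, kind: &str, value: &str) -> String {
    format!("# TYPE {name} {kind}\n{name} {value}\n")
}

impl Counters {
    /// Fraction of executed remediations that succeeded (0.0 when none ran).
    pub fn success_rate(&self) -> f64 {
        if self.remediations_executed == 0 {
            0.0
        } else {
            self.remediations_succeeded as f64 / self.remediations_executed as f64
        }
    }

    /// Mean recovery time over successful remediations (0.0 when there are none).
    pub fn mttr_seconds(&self) -> f64 {
        if self.remediations_succeeded == 0 {
            0.0
        } else {
            self.total_recovery_seconds / self.remediations_succeeded as f64
        }
    }

    /// Produces the body that a `/metrics` endpoint could return.
    pub fn render(&self) -> String {
        let mut out = String::new();
        out += &metric_lines("sentinel_alerts_received_total", "counter", &self.alerts_received.to_string());
        out += &metric_lines("sentinel_remediations_executed_total", "counter", &self.remediations_executed.to_string());
        out += &metric_lines("sentinel_remediations_success_rate", "gauge", &self.success_rate().to_string());
        out += &metric_lines("sentinel_mttr_seconds", "gauge", &self.mttr_seconds().to_string());
        out += &metric_lines("sentinel_watsonx_api_calls_total", "counter", &self.watsonx_api_calls.to_string());
        out
    }
}
```

Exposing the two derived values as gauges reflects that they can go down as well as up, unlike the `_total` counters.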
Grafana Dashboard
Import the pre-built dashboard:
kubectl apply -f k8s/grafana-dashboard.yaml
Access Grafana and import dashboard ID: sentinel-mcp-overview
🎥 Video Demo Script
Setup (30 seconds)
- Show terminal with Sentinel-MCP running
- Show Prometheus dashboard with healthy metrics
- Explain the scenario: "We'll simulate a disk space crisis"
Action (90 seconds)
- Inject failure:
./scripts/test-failure.sh disk-full
- Split screen:
  - Left: Sentinel-MCP logs showing detection and analysis
  - Right: System terminal showing disk usage
- Show AI reasoning:
  - Display watsonx.ai analysis of logs
  - Show proposed remediation steps
  - Highlight security validation
- Execute remediation:
  - Show approval prompt
  - Execute cleanup
  - Verify disk space recovered
Value Proposition (60 seconds)
- Show auto-generated documentation:
cat logs/remediations/REMEDIATION_LOG_20260502.md
- Highlight key benefits:
  - MTTR reduced from 30 minutes to 2 minutes
  - Zero manual intervention required
  - Complete audit trail automatically generated
  - AI-powered root cause analysis
- Show IBM Bob integration:
  - Display exported Bob conversation
  - Show how Bob orchestrated the solution
  - Emphasize AI-native development process
🤝 IBM Bob Integration
Bob Prompts Used
All prompts used with IBM Bob are documented in the /prompts directory:
- Scaffolding (prompts/01-scaffold.md):
Bob, help me scaffold a new project for an MCP server using Rust. This server needs to expose tools for reading system logs and executing remediation scripts. Follow enterprise security standards.
- MCP Tools (prompts/02-mcp-tools.md):
In Plan Mode, design the MCP tools for Sentinel-MCP:
1. read_system_logs - Read logs from various sources
2. execute_remediation_script - Execute approved commands
3. check_kubernetes_status - Query K8s cluster state
Include security validation for each tool.
- watsonx Integration (prompts/03-watsonx.md):
Bob, implement the watsonx.ai integration module. Use IBM Granite models for log analysis. Include error handling and retry logic.
- Testing (prompts/04-testing.md):
Bob, generate a comprehensive test suite with:
1. Unit tests for each component
2. Integration tests for the full workflow
3. Simulated failure scenarios
Exported Bob Report
The complete IBM Bob conversation and development process is documented in:
- docs/bob-export.md - Full conversation history
- docs/bob-analysis.md - Bob's architectural decisions
📚 Documentation
🏆 Hackathon Submission
This project was built for the IBM watsonx Challenge, demonstrating:
- AI-Native Development: Entire project orchestrated using IBM Bob
- watsonx.ai Integration: Real-world use of IBM Granite models
- Practical Value: Solves real infrastructure pain points
- Innovation: Novel use of MCP for infrastructure automation
Submission Checklist
- ✅ Problem and solution statement
- ✅ IBM Bob and watsonx.ai usage documented
- ✅ Implementation plan with Bob-ready prompts
- ✅ Video demo (3 minutes)
- ✅ Code repository with clear structure
- ✅ Exported Bob report
- ✅ README with usage examples
🤝 Contributing
We welcome contributions! Please see CONTRIBUTING.md for details.
📄 License
This project is licensed under the MIT License - see LICENSE file for details.
🙏 Acknowledgments
- IBM watsonx.ai team for the powerful Granite models
- IBM Bob team for the amazing AI development assistant
- The MCP community for the protocol specification
- All contributors and testers
📞 Contact
- Author: Paul Moore
- GitHub: @paulmmoore3416
- Repository: Sentinel-MCP
Built with ❤️ using IBM Bob and watsonx.ai