Sentinel-MCP: The Autonomous Infrastructure Repair Agent

Bridging the gap between monitoring alerts and autonomous infrastructure remediation using IBM Bob and watsonx.ai

🎯 Problem Statement

Infrastructure teams struggle with alert fatigue. When a server or application fails, DevOps engineers must manually:

Correlate logs from multiple sources
Identify the root cause
Apply a fix
Document the remediation

This manual intervention leads to:

Higher Mean Time to Recovery (MTTR)
Operational burnout
Inconsistent remediation practices
Poor documentation

💡 Solution

Sentinel-MCP is an autonomous remediation agent that uses the Model Context Protocol (MCP) to bridge IBM Bob with live system environments (Kubernetes, Linux, Cloud APIs).

Key Features

🤖 Autonomous Reasoning: Uses IBM Bob's agentic capabilities to analyze and fix infrastructure issues
🧠 AI-Powered Analysis: Leverages IBM Granite models via watsonx.ai for intelligent log analysis
🔒 Security-First: Built-in security constraints and approval workflows
📝 Auto-Documentation: Generates comprehensive remediation reports automatically
🔄 Rollback Support: Safe execution with automatic rollback capabilities
🎯 Multi-Environment: Supports both Kubernetes and Linux bare-metal systems

🏗️ Architecture

graph TB
    A[Prometheus AlertManager] -->|Webhook| B[Alert Receiver]
    B --> C[Sentinel-MCP Core]
    C --> D[MCP Server]
    D --> E[System Tools]
    E --> F[Linux Operations]
    E --> G[Kubernetes Operations]
    C --> H[watsonx.ai Integration]
    H --> I[IBM Granite Models]
    C --> J[Reasoning Engine]
    J --> K[Security Validator]
    K --> L[Remediation Executor]
    L --> M[Documentation Generator]

🚀 Quick Start

Prerequisites

Rust 1.75+ and Cargo
Docker and Docker Compose
Kubernetes cluster (for K8s features)
IBM Cloud account with watsonx.ai access
Prometheus and AlertManager (optional, for full demo)

Installation

Clone the repository

git clone https://github.com/paulmmoore3416/Sentinel-MCP.git
cd Sentinel-MCP

Set up environment variables

cp .env.example .env
# Edit .env with your credentials

Required environment variables:

# IBM watsonx.ai Configuration
WATSONX_API_KEY=your_api_key_here
WATSONX_PROJECT_ID=your_project_id_here
WATSONX_URL=https://us-south.ml.cloud.ibm.com

# MCP Server Configuration
MCP_SERVER_PORT=3000
MCP_AUTH_TOKEN=your_secure_token_here

# Security Settings
APPROVAL_REQUIRED=true
DRY_RUN_MODE=false
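As a rough illustration of how these variables could be validated at startup, the sketch below reads them through a lookup function and fails early on missing or malformed values. The `Settings` struct, `load_settings`, and the rule that every variable is required are assumptions made for this example, not the project's actual code.

```rust
/// Runtime settings mirroring the variables in `.env`.
/// Struct and field names are hypothetical.
#[derive(Debug, Clone, PartialEq)]
pub struct Settings {
    pub watsonx_api_key: String,
    pub watsonx_project_id: String,
    pub watsonx_url: String,
    pub mcp_server_port: u16,
    pub mcp_auth_token: String,
    pub approval_required: bool,
    pub dry_run_mode: bool,
}

/// Accept only "true" or "false" (case-insensitive).
fn parse_bool(name: &str, raw: &str) -> Result<bool, String> {
    match raw.trim().to_ascii_lowercase().as_str() {
        "true" => Ok(true),
        "false" => Ok(false),
        other => Err(format!("{name} must be true or false, got {other:?}")),
    }
}

/// Build `Settings` from any key/value lookup, so the same code works
/// with the process environment or with a map in tests.
pub fn load_settings<F>(lookup: F) -> Result<Settings, String>
where
    F: Fn(&str) -> Option<String>,
{
    // Assumption: every variable is required and empty values count as missing.
    let required = |name: &str| -> Result<String, String> {
        lookup(name)
            .map(|v| v.trim().to_string())
            .filter(|v| !v.is_empty())
            .ok_or_else(|| format!("missing required variable {name}"))
    };

    let port = required("MCP_SERVER_PORT")?
        .parse::<u16>()
        .map_err(|e| format!("MCP_SERVER_PORT is not a valid port: {e}"))?;

    Ok(Settings {
        watsonx_api_key: required("WATSONX_API_KEY")?,
        watsonx_project_id: required("WATSONX_PROJECT_ID")?,
        watsonx_url: required("WATSONX_URL")?,
        mcp_server_port: port,
        mcp_auth_token: required("MCP_AUTH_TOKEN")?,
        approval_required: parse_bool("APPROVAL_REQUIRED", &required("APPROVAL_REQUIRED")?)?,
        dry_run_mode: parse_bool("DRY_RUN_MODE", &required("DRY_RUN_MODE")?)?,
    })
}
```

With the real environment, this would be called as `load_settings(|name| std::env::var(name).ok())`. Taking a lookup closure instead of reading `std::env` directly keeps the validation testable without touching process state.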

Build the project

cargo build --release

Run the server

cargo run --release

📖 Usage Guide

Basic Usage

1. Starting Sentinel-MCP

# Start in interactive mode (requires approval for all actions)
./target/release/sentinel-mcp --mode interactive

# Start in autonomous mode (auto-approves low-risk actions)
./target/release/sentinel-mcp --mode autonomous

# Start in dry-run mode (simulates all actions)
./target/release/sentinel-mcp --mode dry-run
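The three modes differ only in what happens to a proposed action. A minimal sketch of that decision, assuming the `--mode` values shown above, follows. The `Mode` and `Decision` types are hypothetical, and the choice to ask an operator for anything that is not low-risk in autonomous mode is an assumption, since the README only states that low-risk actions are auto-approved.

```rust
use std::str::FromStr;

/// The values accepted by `--mode` in the commands above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Mode {
    Interactive,
    Autonomous,
    DryRun,
}

impl FromStr for Mode {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "interactive" => Ok(Mode::Interactive),
            "autonomous" => Ok(Mode::Autonomous),
            "dry-run" => Ok(Mode::DryRun),
            other => Err(format!("unknown mode: {other}")),
        }
    }
}

/// What to do with a proposed remediation action (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Decision {
    AskOperator,
    Execute,
    Simulate,
}

/// Map a mode and a risk flag to a decision.
pub fn decide(mode: Mode, low_risk: bool) -> Decision {
    match (mode, low_risk) {
        // Dry-run never touches the system.
        (Mode::DryRun, _) => Decision::Simulate,
        // Interactive mode asks for every action.
        (Mode::Interactive, _) => Decision::AskOperator,
        // Autonomous mode auto-approves low-risk actions only.
        (Mode::Autonomous, true) => Decision::Execute,
        // Assumption: everything else still goes to an operator.
        (Mode::Autonomous, false) => Decision::AskOperator,
    }
}
```

For example, `decide("autonomous".parse::<Mode>().unwrap(), true)` yields `Decision::Execute`, while the same call with `false` falls back to `Decision::AskOperator`.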

2. Triggering an Alert

Manual trigger via CLI:

# Simulate a disk space alert (assumes MCP_AUTH_TOKEN is exported in the current shell)
curl -X POST http://localhost:3000/api/v1/alerts \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${MCP_AUTH_TOKEN}" \
  -d @examples/alerts/disk-space-low.json

Example alert payload (examples/alerts/disk-space-low.json):

{
  "alerts": [{
    "status": "firing",
    "labels": {
      "alertname": "DiskSpaceLow",
      "severity": "warning",
      "instance": "server-01",
      "filesystem": "/var"
    },
    "annotations": {
      "summary": "Disk space is running low",
      "description": "Filesystem /var is at 92% capacity on server-01"
    },
    "startsAt": "2026-05-02T18:00:00Z"
  }]
}
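To make the payload shape concrete, here is a typed view of a single alert using only the standard library. It builds the example by hand rather than parsing JSON, and all type and method names are hypothetical. Alertmanager webhooks can carry resolved alerts as well as firing ones, so the sketch also shows a check that only firing alerts are treated as remediation candidates.

```rust
use std::collections::BTreeMap;

/// Alert status as sent in the webhook payload.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Firing,
    Resolved,
}

/// One entry of the "alerts" array (hypothetical struct).
#[derive(Debug, Clone)]
pub struct Alert {
    pub status: Status,
    pub labels: BTreeMap<String, String>,
    pub annotations: BTreeMap<String, String>,
    pub starts_at: String,
}

impl Alert {
    /// Look up a label by name.
    pub fn label(&self, key: &str) -> Option<&str> {
        self.labels.get(key).map(String::as_str)
    }

    /// "alertname/instance" key for de-duplication; None if either label is missing.
    pub fn dedup_key(&self) -> Option<String> {
        Some(format!("{}/{}", self.label("alertname")?, self.label("instance")?))
    }

    /// Only firing alerts are candidates for remediation.
    pub fn needs_remediation(&self) -> bool {
        self.status == Status::Firing
    }
}

/// Helper to build a string map from literal pairs.
fn pairs(items: &[(&str, &str)]) -> BTreeMap<String, String> {
    items.iter().map(|(k, v)| (k.to_string(), v.to_string())).collect()
}

/// The disk-space-low.json example expressed as a value.
pub fn disk_space_low_example() -> Alert {
    Alert {
        status: Status::Firing,
        labels: pairs(&[
            ("alertname", "DiskSpaceLow"),
            ("severity", "warning"),
            ("instance", "server-01"),
            ("filesystem", "/var"),
        ]),
        annotations: pairs(&[(
            "description",
            "Filesystem /var is at 92% capacity on server-01",
        )]),
        starts_at: "2026-05-02T18:00:00Z".to_string(),
    }
}
```

In a real receiver the struct would more likely be deserialized from the request body with a JSON library. That part is left out here to keep the example dependency-free.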

3. Monitoring the Remediation Process

The system will:

✅ Receive and parse the alert
🔍 Gather system context (logs, disk usage, processes)
🧠 Analyze with watsonx.ai
💡 Propose remediation steps
⏸️ Request approval (always in interactive mode; autonomous mode auto-approves low-risk actions)
⚡ Execute remediation
✔️ Verify success
📝 Generate documentation
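The steps above can be read as a linear state machine in which the approval step is optional. The sketch below is one hypothetical way to model that. The `Step` enum and `plan` function are invented for illustration, and rejection or failure paths are deliberately left out.

```rust
/// The eight steps listed above, in order.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Step {
    Receive,
    GatherContext,
    Analyze,
    Propose,
    RequestApproval,
    Execute,
    Verify,
    Document,
}

impl Step {
    /// The step that follows `self`, or None after the last one.
    /// `approval_needed` decides whether the approval step is visited.
    pub fn next(self, approval_needed: bool) -> Option<Step> {
        match self {
            Step::Receive => Some(Step::GatherContext),
            Step::GatherContext => Some(Step::Analyze),
            Step::Analyze => Some(Step::Propose),
            Step::Propose if approval_needed => Some(Step::RequestApproval),
            Step::Propose => Some(Step::Execute),
            Step::RequestApproval => Some(Step::Execute),
            Step::Execute => Some(Step::Verify),
            Step::Verify => Some(Step::Document),
            Step::Document => None,
        }
    }
}

/// Full happy-path sequence for one alert.
pub fn plan(approval_needed: bool) -> Vec<Step> {
    let mut steps = vec![Step::Receive];
    while let Some(next) = steps.last().and_then(|s| s.next(approval_needed)) {
        steps.push(next);
    }
    steps
}
```

`plan(true)` visits all eight steps, while `plan(false)` skips `RequestApproval`. A fuller model would add transitions for a rejected approval or a failed verification, for example into a rollback step.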

Watch the logs:

tail -f logs/sentinel-mcp.log

View the remediation report:

cat logs/remediations/REMEDIATION_LOG_$(date +%Y%m%d).md

Advanced Usage

Using with Prometheus AlertManager

Configure AlertManager webhook (alertmanager.yml):

route:
  receiver: 'sentinel-mcp'
  group_by: ['alertname', 'instance']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h

receivers:
  - name: 'sentinel-mcp'
    webhook_configs:
      - url: 'http://sentinel-mcp:3000/api/v1/alerts'
        send_resolved: true
        http_config:
          bearer_token: 'your_mcp_auth_token'

Restart AlertManager:

kubectl rollout restart deployment/alertmanager -n monitoring

Kubernetes Deployment

Create namespace:

kubectl create namespace sentinel-system

Create secrets:

kubectl create secret generic watsonx-credentials \
  --from-literal=api-key="${WATSONX_API_KEY}" \
  --from-literal=project-id="${WATSONX_PROJECT_ID}" \
  -n sentinel-system

Deploy Sentinel-MCP:

kubectl apply -f k8s/

Verify deployment:

kubectl get pods -n sentinel-system
kubectl logs -f deployment/sentinel-mcp -n sentinel-system

🧪 Testing & Demo

Running the Test Suite

# Run all tests
cargo test

# Run integration tests only
cargo test --test integration

# Run with test output shown (stdout/stderr is not captured)
cargo test -- --nocapture

Demo Scenarios

We've included several pre-built failure scenarios for demonstration:

Scenario 1: Disk Space Cleanup

# Inject failure
./scripts/test-failure.sh disk-full

# Watch Sentinel-MCP detect and fix
tail -f logs/sentinel-mcp.log

# Verify remediation
df -h /var
cat logs/remediations/REMEDIATION_LOG_*.md

Expected outcome:

Sentinel detects disk at 95% capacity
Analyzes logs to find old/rotatable files
Proposes cleanup of /var/log/old-logs
Executes cleanup after approval
Verifies disk usage reduced to ~45%
Documents the entire process
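The verification step in this scenario amounts to comparing disk usage before and after the cleanup. A small sketch of how that check could be done from `df` output follows. The function names and the threshold parameter are assumptions for illustration, and the parsing simply takes the first `NN%` token after the header line.

```rust
/// Extract the usage percentage from `df` output for a single filesystem.
/// Returns None if no `NN%` token is found after the header line.
pub fn parse_use_percent(df_output: &str) -> Option<u8> {
    df_output
        .lines()
        .skip(1) // skip the column header
        .flat_map(|line| line.split_whitespace())
        .find_map(|token| token.strip_suffix('%')?.parse::<u8>().ok())
}

/// A cleanup counts as verified if usage went down and is now below
/// the alert threshold (hypothetical rule).
pub fn cleanup_verified(before: u8, after: u8, threshold: u8) -> bool {
    after < before && after < threshold
}
```

With the figures from this scenario, `cleanup_verified(95, 45, 90)` is true, where the 90% threshold is an assumed alerting limit rather than a value taken from the project.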

Scenario 2: Service Crash Recovery

# Inject failure
./scripts/test-failure.sh service-crash nginx

# Watch auto-recovery
journalctl -u nginx -f

Expected outcome:

Sentinel detects nginx service stopped
Analyzes crash logs
Identifies configuration error or resource issue
Restarts service with corrected configuration
Verifies service is running and healthy

Scenario 3: Kubernetes Pod CrashLoop

# Inject failure
kubectl apply -f examples/scenarios/crashloop-pod.yaml

# Watch Sentinel-MCP diagnose and fix
kubectl logs -f deployment/sentinel-mcp -n sentinel-system

Expected outcome:

Sentinel detects pod in CrashLoopBackOff
Analyzes pod logs and events
Identifies missing ConfigMap or resource limits
Proposes fix (create ConfigMap or adjust limits)
Applies fix after approval
Verifies pod is running

Creating Custom Scenarios

Create a new scenario file in examples/scenarios/:

# examples/scenarios/custom-failure.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: failure-scenario
  namespace: default
data:
  type: "memory-leak"
  severity: "critical"
  description: "Simulate memory leak in application"
  trigger_command: "stress --vm 1 --vm-bytes 2G --timeout 300s"
  expected_remediation: "Restart pod with memory limits"

🔧 Configuration

Security Configuration

Edit config/security-rules.yaml:

security_rules:
  # Commands that require approval
  high_risk_commands:
    - "rm -rf"
    - "DROP DATABASE"
    - "kubectl delete namespace"
  
  # Commands that can auto-execute
  low_risk_commands:
    - "systemctl restart"
    - "kubectl rollout restart"
    - "docker restart"
  
  # Kubernetes namespaces allowed for operations
  allowed_namespaces:
    - "default"
    - "production"
    - "staging"
  
  # Maximum disk space to clean (in GB)
  max_disk_cleanup: 50
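To show how rules like these might be applied, the sketch below classifies a command string against the two lists and checks the namespace allow-list. It hardcodes the values from the YAML instead of parsing the file. The types, the substring and prefix matching, and the treatment of unlisted commands as `Unknown` are assumptions for this example, not the project's validator.

```rust
/// Risk level of a command (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Risk {
    High,
    Low,
    Unknown,
}

/// In-memory form of config/security-rules.yaml (hypothetical struct).
pub struct SecurityRules {
    pub high_risk_commands: Vec<String>,
    pub low_risk_commands: Vec<String>,
    pub allowed_namespaces: Vec<String>,
    pub max_disk_cleanup_gb: u64,
}

fn owned(items: &[&str]) -> Vec<String> {
    items.iter().map(|s| s.to_string()).collect()
}

impl SecurityRules {
    /// The values from the YAML above, hardcoded for the example.
    pub fn example() -> Self {
        SecurityRules {
            high_risk_commands: owned(&["rm -rf", "DROP DATABASE", "kubectl delete namespace"]),
            low_risk_commands: owned(&["systemctl restart", "kubectl rollout restart", "docker restart"]),
            allowed_namespaces: owned(&["default", "production", "staging"]),
            max_disk_cleanup_gb: 50,
        }
    }

    /// High-risk patterns are checked first and anywhere in the string,
    /// so a low-risk prefix cannot hide a chained destructive command.
    pub fn classify(&self, command: &str) -> Risk {
        let cmd = command.trim();
        if self.high_risk_commands.iter().any(|p| cmd.contains(p.as_str())) {
            Risk::High
        } else if self.low_risk_commands.iter().any(|p| cmd.starts_with(p.as_str())) {
            Risk::Low
        } else {
            Risk::Unknown
        }
    }

    /// Auto-execution requires a low-risk command and, when a namespace
    /// is given, one that is on the allow-list.
    pub fn may_auto_execute(&self, command: &str, namespace: Option<&str>) -> bool {
        let namespace_ok = namespace
            .map_or(true, |ns| self.allowed_namespaces.iter().any(|a| a.as_str() == ns));
        namespace_ok && self.classify(command) == Risk::Low
    }
}
```

Plain string matching like this is easy to bypass (extra whitespace, different casing, shell aliases), so it should be read as an illustration of the rule ordering rather than a complete safeguard.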

watsonx.ai Configuration

Edit config/watsonx.yaml:

watsonx:
  model: "ibm/granite-13b-instruct-v2"
  parameters:
    max_new_tokens: 1024
    temperature: 0.7
    top_p: 0.9
  
  prompts:
    log_analysis: |
      You are an expert SRE analyzing infrastructure logs.
      Analyze the following logs and identify the root cause.
      Provide a concise analysis and suggest remediation steps.
      
      Logs:
      {log_content}
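The `{log_content}` placeholder implies a substitution step before the prompt is sent to the model. A minimal sketch of that step follows. The function name and the decision to keep only the tail of long logs are assumptions, made because recent log lines are usually the most relevant and model input is bounded.

```rust
/// The log_analysis prompt from the configuration above.
pub const LOG_ANALYSIS_TEMPLATE: &str = "You are an expert SRE analyzing infrastructure logs.\n\
Analyze the following logs and identify the root cause.\n\
Provide a concise analysis and suggest remediation steps.\n\
\n\
Logs:\n\
{log_content}\n";

/// Fill the template with log text, keeping at most `max_log_chars`
/// characters from the end of the logs (hypothetical helper).
pub fn render_log_prompt(template: &str, logs: &str, max_log_chars: usize) -> String {
    let total = logs.chars().count();
    let tail: String = if total > max_log_chars {
        // Count in characters, not bytes, to avoid splitting UTF-8 sequences.
        logs.chars().skip(total - max_log_chars).collect()
    } else {
        logs.to_string()
    };
    template.replace("{log_content}", &tail)
}
```

Note that `max_new_tokens` in the configuration limits the length of the model's answer, not the prompt, so any input-size limit such as `max_log_chars` has to be enforced separately.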

📊 Monitoring & Observability

Metrics

Sentinel-MCP exposes Prometheus metrics at /metrics:

curl http://localhost:3000/metrics

Key metrics:

sentinel_alerts_received_total: Total alerts received
sentinel_remediations_executed_total: Total remediations executed
sentinel_remediations_success_rate: Success rate of remediations
sentinel_mttr_seconds: Mean time to recovery
sentinel_watsonx_api_calls_total: Total watsonx.ai API calls
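The two derived metrics can be computed from plain counters. The sketch below shows one possible definition and renders all five metrics as `name value` lines in the Prometheus text format. The `Counters` struct and the choice to average recovery time over successful remediations are assumptions, since the README does not define how these values are calculated.

```rust
/// Raw counters from which the exposed metrics are derived (hypothetical struct).
#[derive(Debug, Default, Clone)]
pub struct Counters {
    pub alerts_received: u64,
    pub remediations_executed: u64,
    pub remediations_succeeded: u64,
    pub total_recovery_seconds: f64,
    pub watsonx_api_calls: u64,
}

impl Counters {
    /// Fraction of executed remediations that succeeded (0.0 when none ran).
    pub fn success_rate(&self) -> f64 {
        if self.remediations_executed == 0 {
            0.0
        } else {
            self.remediations_succeeded as f64 / self.remediations_executed as f64
        }
    }

    /// Mean recovery time over successful remediations (0.0 when there are none).
    pub fn mttr_seconds(&self) -> f64 {
        if self.remediations_succeeded == 0 {
            0.0
        } else {
            self.total_recovery_seconds / self.remediations_succeeded as f64
        }
    }

    /// Render the metrics as Prometheus text-format sample lines.
    pub fn render(&self) -> String {
        format!(
            "sentinel_alerts_received_total {}\n\
             sentinel_remediations_executed_total {}\n\
             sentinel_remediations_success_rate {}\n\
             sentinel_mttr_seconds {}\n\
             sentinel_watsonx_api_calls_total {}\n",
            self.alerts_received,
            self.remediations_executed,
            self.success_rate(),
            self.mttr_seconds(),
            self.watsonx_api_calls,
        )
    }
}
```

A production exporter would normally use a metrics library and add `# HELP` and `# TYPE` lines. The hand-rolled rendering here only illustrates how the derived values relate to the counters.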

Grafana Dashboard

Import the pre-built dashboard:

kubectl apply -f k8s/grafana-dashboard.yaml

Then open Grafana and import the dashboard with ID sentinel-mcp-overview.

🎥 Video Demo Script

Setup (30 seconds)

Show terminal with Sentinel-MCP running
Show Prometheus dashboard with healthy metrics
Explain the scenario: "We'll simulate a disk space crisis"

Action (90 seconds)

Inject failure:
```
./scripts/test-failure.sh disk-full
```
Split screen:
- Left: Sentinel-MCP logs showing detection and analysis
- Right: System terminal showing disk usage
Show AI reasoning:
- Display watsonx.ai analysis of logs
- Show proposed remediation steps
- Highlight security validation
Execute remediation:
- Show approval prompt
- Execute cleanup
- Verify disk space recovered

Value Proposition (60 seconds)

Show auto-generated documentation:

cat logs/remediations/REMEDIATION_LOG_20260502.md

Highlight key benefits:
- MTTR reduced from 30 minutes to 2 minutes
- No manual intervention beyond the approval prompt
- Complete audit trail automatically generated
- AI-powered root cause analysis
Show IBM Bob integration:
- Display exported Bob conversation
- Show how Bob orchestrated the solution
- Emphasize AI-native development process

🤝 IBM Bob Integration

Bob Prompts Used

All prompts used with IBM Bob are documented in the /prompts directory:

Scaffolding (prompts/01-scaffold.md):

Bob, help me scaffold a new project for an MCP server using Rust.
This server needs to expose tools for reading system logs and
executing remediation scripts. Follow enterprise security standards.

MCP Tools (prompts/02-mcp-tools.md):

In Plan Mode, design the MCP tools for Sentinel-MCP:
1. read_system_logs - Read logs from various sources
2. execute_remediation_script - Execute approved commands
3. check_kubernetes_status - Query K8s cluster state
Include security validation for each tool.

watsonx Integration (prompts/03-watsonx.md):

Bob, implement the watsonx.ai integration module.
Use IBM Granite models for log analysis.
Include error handling and retry logic.

Testing (prompts/04-testing.md):

Bob, generate a comprehensive test suite with:
1. Unit tests for each component
2. Integration tests for the full workflow
3. Simulated failure scenarios

Exported Bob Report

The complete IBM Bob conversation and development process is documented in:

docs/bob-export.md - Full conversation history
docs/bob-analysis.md - Bob's architectural decisions

📚 Documentation

🏆 Hackathon Submission

This project was built for the IBM watsonx Challenge, demonstrating:

AI-Native Development: Entire project orchestrated using IBM Bob
watsonx.ai Integration: Real-world use of IBM Granite models
Practical Value: Solves real infrastructure pain points
Innovation: Novel use of MCP for infrastructure automation

Submission Checklist

✅ Problem and solution statement
✅ IBM Bob and watsonx.ai usage documented
✅ Implementation plan with Bob-ready prompts
✅ Video demo (3 minutes)
✅ Code repository with clear structure
✅ Exported Bob report
✅ README with usage examples

🤝 Contributing

We welcome contributions! Please see CONTRIBUTING.md for details.

📄 License

This project is licensed under the MIT License. See the LICENSE file for details.

🙏 Acknowledgments

IBM watsonx.ai team for the powerful Granite models
IBM Bob team for the amazing AI development assistant
The MCP community for the protocol specification
All contributors and testers

📞 Contact

Author: Paul Moore
GitHub: @paulmmoore3416
Repository: Sentinel-MCP

Built with ❤️ using IBM Bob and watsonx.ai
