# 👁️ Visual Cortex MCP Server

An MCP server that gives eyes to blind LLM agents.

A specialized MCP server that provides "vision" to blind AI agents. It detects UI elements and returns pixel-perfect coordinates for automation.

The server listens on port 43210 by default.
## 🚀 Supported Backends

### 1. Google Gemini (Default & Best)
- Model: `models/gemini-flash-latest`
- Performance: Fastest, with native 2D grounding support.
- Cost: Extremely low.

### 2. OpenAI / Local Compatible
- Model: Configurable (user-defined).
- Support: Works with `gpt-4o`, Qwen3 8 VL, vision models hosted on vLLM, or LM Studio.
- Features: Enhanced prompt engineering for better compatibility with various vision models.
## 🛠️ Installation

1. Clone & Install:

   ```bash
   pip install -r requirements.txt
   ```

2. Configure Environment: Copy `.env.example` to `.env` and set your provider.

   For Google:

   ```bash
   export GROUNDING_PROVIDER=google
   export API_KEY=your_gemini_key
   ```

   For a local LLM:

   ```bash
   export GROUNDING_PROVIDER=openai
   export BASE_URL=http://localhost:1234/v1
   export MODEL_NAME=llama-3.2-vision
   export API_KEY=local
   ```

3. Run:

   ```bash
   python grounding_server.py
   ```

   The server listens on port 43210.
## 🤖 Usage

- Input: Base64-encoded image.
- Output: JSON with `box_px` (absolute pixel coordinates).

### Agent Instruction

> "Use `analyze_screenshot` to see. The tool returns a list of components. Use the `center_x` and `center_y` inside `box_px` to click elements."
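For illustration, here is a sketch of how an agent might consume a response. Only `box_px`, `center_x`, and `center_y` are documented above; the surrounding field names (`components`, `label`) and the adb tap are assumptions:

```python
import json
import subprocess

# Hypothetical response shape -- only box_px, center_x, and center_y are
# guaranteed by the docs above; "components" and "label" are assumptions.
response = json.loads("""
{
  "components": [
    {"label": "Login button",
     "box_px": {"x_min": 412, "y_min": 1280, "x_max": 668, "y_max": 1376,
                "center_x": 540, "center_y": 1328}}
  ]
}
""")

# Click the first detected element on an Android device via adb.
box = response["components"][0]["box_px"]
subprocess.run(["adb", "shell", "input", "tap",
                str(box["center_x"]), str(box["center_y"])], check=True)
```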
## ✨ Recent Improvements

### Enhanced Compatibility & Robustness
- ✅ Full OpenAI API compatibility - Works seamlessly with Qwen3 8 VL and other vision models
- ✅ Improved error handling - Comprehensive validation and specific error messages
- ✅ Better response parsing - Handles various model output formats gracefully
- ✅ Enhanced prompts - Optimized for both Gemini and OpenAI-compatible models
- ✅ Increased token limit - Now supports up to 4096 tokens for complex UIs (was 2048)
- ✅ Robust image handling - Better EXIF orientation support and format detection
- ✅ Input validation - Validates base64 format, image dimensions, and file sizes
- ✅ Box coordinate validation - Auto-corrects inverted coordinates and validates ranges (see the sketch after this list)
- ✅ Type hints - Full type annotations for better IDE support and code quality
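To make the coordinate validation concrete, here is a minimal sketch of the kind of normalization involved; the real logic in `grounding_server.py` may differ:

```python
def validate_box(x_min: int, y_min: int, x_max: int, y_max: int,
                 width: int, height: int) -> tuple[int, int, int, int]:
    """Illustrative only: fix inverted corners and clamp to the image size."""
    # Swap corners if the model returned them inverted.
    if x_min > x_max:
        x_min, x_max = x_max, x_min
    if y_min > y_max:
        y_min, y_max = y_max, y_min
    # Clamp every coordinate into the valid pixel range.
    x_min, x_max = max(0, x_min), min(width, x_max)
    y_min, y_max = max(0, y_min), min(height, y_max)
    return x_min, y_min, x_max, y_max
```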
## 🎯 NEW: Optimal UI Capture for Maximum Accuracy
- ✅ Root & Unrooted Support - Works with both rooted and standard Android devices
- ✅ Automated Device Configuration - ADB tools to configure optimal visual settings
- ✅ Hybrid Capture Mode - Combines screenshot + UI hierarchy for ground truth accuracy
- ✅ Layout Bounds Support - Visual tokenization of UI components
- ✅ Setup Guides - Comprehensive instructions for both modes
## 🐳 Docker

```bash
docker build -t visual-cortex .
docker run -p 43210:43210 --env-file .env visual-cortex
```
## 📱 Android Device Setup (Optional but Recommended)
For maximum accuracy in UI component identification, configure your Android device with optimal visual settings. This works for both rooted and unrooted devices.
### Quick Setup via ADB

The server provides tools to automatically configure your device:

```python
# Get setup instructions for your device type
get_optimal_setup_guide(rooted=False)  # or True for rooted devices

# Automatically configure the connected device
configure_device_for_capture()

# Capture a screenshot + UI hierarchy for enhanced accuracy
capture_with_hierarchy()

# Restore the device to normal settings when done
restore_device_settings()
```
### What Gets Configured?
- Layout Bounds - Visual rectangles around every UI element (the MVP setting!)
- Animation Disable - Zero-latency captures (no mid-transition blurriness)
- Demo Mode - Clean status bar (fixed time, battery, no notifications)
- Hierarchy Dump - Ground truth coordinates and metadata
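For reference, these settings roughly correspond to the following adb commands. This is a sketch of one plausible implementation, not necessarily what `configure_device_for_capture` actually runs:

```python
import subprocess

def adb(*args: str) -> None:
    """Run a command on the connected device via `adb shell`."""
    subprocess.run(["adb", "shell", *args], check=True)

# Layout bounds: draw rectangles around every view.
adb("setprop", "debug.layout", "true")

# Disable animations for crisp, deterministic captures.
for key in ("window_animation_scale", "transition_animation_scale",
            "animator_duration_scale"):
    adb("settings", "put", "global", key, "0")

# Demo mode: clean status bar with a fixed clock.
adb("settings", "put", "global", "sysui_demo_allowed", "1")
adb("am", "broadcast", "-a", "com.android.systemui.demo", "-e", "command", "enter")
adb("am", "broadcast", "-a", "com.android.systemui.demo",
    "-e", "command", "clock", "-e", "hhmm", "1200")

# Ground-truth hierarchy dump.
adb("uiautomator", "dump", "/sdcard/window_dump.xml")
```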
### Manual Setup

If you prefer manual configuration or ADB is unavailable, use the `get_optimal_setup_guide` tool to get detailed step-by-step instructions.
## 🎯 Enhanced Analysis with UI Hierarchy

Use the hybrid approach for best results:

```python
# Standard analysis (screenshot only)
analyze_screenshot(image_base64)

# Enhanced analysis (screenshot + UI hierarchy)
analyze_screenshot_with_hierarchy(image_base64, ui_hierarchy_xml)
```

The hybrid approach gives the VL model:
- Visual Context: Screenshot with layout bounds
- Ground Truth: Exact coordinates, resource IDs, and text from XML
- Result: Highest possible accuracy and reduced hallucination
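To show what the ground-truth XML contributes, here is a toy excerpt in the `uiautomator dump` format (the node attributes are illustrative) and how absolute centers can be read straight from its `bounds`:

```python
import re
import xml.etree.ElementTree as ET

# Toy excerpt of a uiautomator dump; bounds are absolute pixels.
hierarchy_xml = """
<hierarchy rotation="0">
  <node text="Login" resource-id="com.example:id/login"
        class="android.widget.Button" bounds="[412,1280][668,1376]"/>
</hierarchy>
"""

for node in ET.fromstring(hierarchy_xml).iter("node"):
    x1, y1, x2, y2 = map(int, re.findall(r"\d+", node.get("bounds")))
    print(node.get("resource-id"), "center:", (x1 + x2) // 2, (y1 + y2) // 2)
```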
## 🔧 Available MCP Tools
| Tool | Description |
|------|-------------|
| `analyze_screenshot` | Standard UI component detection from a screenshot |
| `analyze_screenshot_with_hierarchy` | Enhanced detection using screenshot + UI hierarchy |
| `get_optimal_setup_guide` | Get setup instructions for rooted/unrooted devices |
| `configure_device_for_capture` | Auto-configure device via ADB |
| `capture_with_hierarchy` | Capture screenshot + UI hierarchy from device |
| `restore_device_settings` | Restore device to default settings |
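As a sketch of how an MCP client could call these tools programmatically, here is an example using the `mcp` Python SDK. It assumes the server exposes an SSE transport at `/sse` on its default port, which may not match the actual transport:

```python
import asyncio
from mcp import ClientSession
from mcp.client.sse import sse_client

async def main() -> None:
    # The /sse path and transport are assumptions -- adjust to match
    # however grounding_server.py actually serves MCP.
    async with sse_client("http://localhost:43210/sse") as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print("Available tools:", [tool.name for tool in tools.tools])

            result = await session.call_tool(
                "analyze_screenshot", {"image_base64": "<your base64 PNG>"}
            )
            print(result.content)

asyncio.run(main())
```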
## 🧪 Testing

Run the test suite to verify functionality:

```bash
pytest test_grounding.py -v
```
Tests cover:
- Box coordinate scaling and validation
- Backend initialization (Google & OpenAI)
- Image format handling
- Error handling and edge cases
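As an illustration of the first item, a scaling test might look like the sketch below; `scale_box` is a stand-in for whatever helper `grounding_server.py` uses, assuming the Gemini convention of boxes normalized to a 0-1000 range:

```python
def scale_box(box_1000, width, height):
    """Hypothetical helper: map a (ymin, xmin, ymax, xmax) box in Gemini's
    0-1000 normalized space to absolute pixel coordinates."""
    ymin, xmin, ymax, xmax = box_1000
    return (int(xmin * width / 1000), int(ymin * height / 1000),
            int(xmax * width / 1000), int(ymax * height / 1000))

def test_scale_box_to_pixels():
    # A box on a 1080x2400 screen should land at the expected pixels.
    assert scale_box((100, 250, 200, 750), 1080, 2400) == (270, 240, 810, 480)
```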
## 📚 Why This Hybrid Approach?
The "best mode" for VL models isn't a single switch, but a configuration that creates a "clean but semantically dense" visual feed:
- Layout Bounds: Explicit red/blue rectangles "tokenize" the screen
- No Animations: Ensures crisp, deterministic captures
- Demo Mode: Removes visual noise from status bar
- UI Hierarchy: Provides ground truth that eliminates guesswork
This combination gives the model "X-ray vision" - both pixel-level visuals AND structural semantics.