System Perception MCP

MCP server by haji-mi

Created 4/1/2026

System Perception MCP Server


A high-performance Model Context Protocol (MCP) server designed for AI Agents to perceive and control the Windows operating system with ultra-low latency and zero physical mouse/keyboard interference.

🌟 Core Features

  • Ultra-Low Latency Screen Perception: Bypasses traditional slow screenshot methods. Utilizes dxcam for direct DXGI VRAM capture and OpenCV for in-memory compression, delivering screen frames to the agent in roughly 120 ms.
  • Silent Background Control: Eliminates the fragile and disruptive nature of physical mouse/keyboard simulation. Uses win32api and uiautomation to send underlying system messages (PostMessage) and invoke UI elements silently.
  • UI Tree Parsing (get_ui_tree): Instantly reads the accessibility tree of standard Windows applications, bypassing the slow Vision-Language Model (VLM) coordinate calculation bottleneck.
  • Instant Execution (invoke_ui_element): Directly triggers standard OS elements (like desktop icons, buttons, and text fields) in less than a second based on UI definitions rather than screen coordinates.
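The UI tree returned by get_ui_tree can be thought of as nested nodes, which is what lets invoke_ui_element work by name instead of by coordinates. The sketch below models a tree as plain dicts and searches it depth-first; the field names and shape are assumptions for illustration, not the server's actual schema:

```python
# Hypothetical UI-tree shape; the real get_ui_tree() output may differ.
ui_tree = {
    "name": "Desktop",
    "control_type": "Pane",
    "children": [
        {"name": "Recycle Bin", "control_type": "ListItem", "children": []},
        {
            "name": "Taskbar",
            "control_type": "Pane",
            "children": [
                {"name": "Start", "control_type": "Button", "children": []},
            ],
        },
    ],
}

def find_element(node, name):
    """Depth-first search for the first node whose name matches."""
    if node["name"] == name:
        return node
    for child in node.get("children", []):
        found = find_element(child, name)
        if found is not None:
            return found
    return None

start = find_element(ui_tree, "Start")
```

Because the lookup is a pure tree walk over an already-parsed accessibility tree, it completes in microseconds, which is the bottleneck this design removes compared to VLM coordinate estimation.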

🛠️ Requirements

  • OS: Windows 10 / 11 (Requires DXGI and Windows UIAutomation APIs)
  • Python: 3.8+
  • Agent Harness: Any MCP-compatible client (e.g., Claude Desktop, DeerFlow)

📦 Installation

  1. Clone this repository:

    git clone <YOUR_GITHUB_REPO_URL>
    cd system-perception-mcp
    
  2. Install the required dependencies:

    pip install -r requirements.txt
    

🚀 Exposed Tools

Once connected to an MCP client, the following tools become available to the LLM/Agent:

  • get_gpu_frame(): Instantly captures the current screen from the GPU frame buffer.
  • get_ui_tree(): Scans and returns the current window's hierarchical UI structure.
  • invoke_ui_element(element_name/id): Directly interacts with a specific UI node without moving the physical cursor.
  • silent_mouse_click(x, y, hwnd): Sends a background click event to specific coordinates within a target window.
  • silent_keyboard_type(text, hwnd): Injects keystrokes directly into a background application's message queue.
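On the wire, an MCP client invokes each of these tools with a standard `tools/call` JSON-RPC 2.0 request. This sketch builds such a message by hand; the tool and argument names follow the list above, but the framing is generic MCP, and the coordinates and window handle are made-up example values:

```python
import json

def build_tool_call(request_id, tool_name, arguments):
    """Build an MCP tools/call JSON-RPC 2.0 request payload."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }

# Example: ask the server for a background click at (640, 360)
# inside a target window (hwnd value is illustrative).
msg = build_tool_call(1, "silent_mouse_click",
                      {"x": 640, "y": 360, "hwnd": 0x00051C3A})
print(json.dumps(msg, indent=2))
```

In practice an MCP client library handles this framing for you; the point is that each bullet above maps one-to-one onto a `tools/call` request.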

💡 Why This Approach?

Traditional visual AI agents rely on taking screenshots, sending them to a VLM, waiting 2-4 seconds for coordinate calculation, and then physically moving the user's cursor. This is slow, fragile, and prevents the user from using their computer while the agent is working.

System Perception MCP solves this by fusing computer vision with native OS UI Automation, allowing the agent to "see" instantly and "act" invisibly.
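One way to picture that fusion is a simple fallback policy: try the UI tree first, and only pay the cost of a GPU frame plus VLM inference when the target is not a standard control. The sketch below is an illustrative agent-side policy with stubbed-in tool calls, not the server's actual logic:

```python
def act_on(target, get_ui_tree, invoke_ui_element, get_gpu_frame, ask_vlm):
    """Prefer instant UIAutomation; fall back to vision only when needed.

    In this sketch get_ui_tree() returns a set of element names and
    ask_vlm() returns (x, y) coordinates for a screen frame.
    """
    names = get_ui_tree()
    if target in names:                 # fast path: standard OS control
        invoke_ui_element(target)
        return "invoked"
    frame = get_gpu_frame()             # slow path: pixels + VLM
    x, y = ask_vlm(frame, target)
    return ("clicked", x, y)
```

The fast path never touches the physical cursor, so the user keeps control of the machine while the agent works.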

📝 License

MIT License

Quick Setup
Installation guide for this server

Install Package (if required)

uvx -system-perception-mcp-

Cursor configuration (mcp.json)

    {
      "mcpServers": {
        "haji-mi-system-perception-mcp": {
          "command": "uvx",
          "args": ["-system-perception-mcp-"]
        }
      }
    }