MCP server that gives vision to any text-only LLM — 100% local, no API bills. Gemma 4 E2B (LiteRT) for vision + SDXL-Turbo for image gen
blind-vision-mcp
Give vision to any text-only LLM — 100% local, no API costs, your privacy intact.
Why this exists
I use DeepSeek v4 Flash — an incredible text model. But it's blind. It can't see screenshots, images, or UI layouts.
I was tired of:
- Paying $20-200/month for vision-capable APIs (GPT-4 Vision, Claude)
- Sending sensitive screenshots to the cloud
- Context switching between coding and describing images manually
So I built blind-vision-mcp: an MCP server that sits between your text LLM and your desktop, letting it "see" through an on-device vision model — Google's Gemma 4 E2B running via LiteRT.
My specific use case
I control an Android emulator that takes screenshots of the device. DeepSeek v4 Flash reads those screenshots via blind-vision-mcp and tells the emulator what to do next. It works like this:
Emulator takes screenshot → blind-vision-mcp analyzes it with Gemma 4 →
DeepSeek reads the description → decides next action → ADB command
All of this happens locally, privately, and without paying per-token API fees.
What it does
| Capability | Status | Model | |-----------|--------|-------| | 👁️ Image analysis | ✅ Stable | Gemma 4 E2B via LiteRT (~2.6 GB VRAM) | | 🔄 Image comparison | ✅ Stable | Gemma 4 E2B via LiteRT | | 🎨 Image generation | ✅ Stable | SDXL-Turbo (fp16, ~7 GB VRAM, no HF token needed) | | ✏️ Image editing | 🧪 In development | Coming soon |
Key features
- No API keys needed for vision — runs 100% on your GPU
- ~2.6 GB VRAM for vision (not 10+ GB like other solutions)
- GPU-first — falls back to CPU if GPU fails
- Google LiteRT — same stack powering Gemini Nano on Android
- Macro-friendly — perfect for automating emulators, browsers, UIs
- Works with any MCP client — OpenCode, Claude Desktop, Cursor, Cline
Quick Start
# 1. Prerequisites
curl -LsSf https://astral.sh/uv/install.sh | sh
# 2. Clone and install
git clone https://github.com/alexjm19/blind-vision-mcp.git
cd blind-vision-mcp
uv sync
# 3. Import the vision model (one-time, downloads ~2.6 GB)
litert-lm import \
--from-huggingface-repo litert-community/gemma-4-E2B-it-litert-lm \
gemma-4-E2B-it.litertlm \
gemma4-vision
# 4. Start the server
uv run blind-vision-mcp
For image generation/editing: Create a
.envfile with your HF token:HF_TOKEN=hf_your_token_hereThen accept terms at https://huggingface.co/black-forest-labs/FLUX.1-schnell
Configuration for OpenCode
Add to your opencode.json:
{
"mcpServers": {
"blind-vision-mcp": {
"command": "uv",
"args": [
"run",
"--directory",
"/path/to/blind-vision-mcp",
"blind-vision-mcp"
]
}
}
}
Self-Configuration Prompt
Give this prompt to any LLM to have it configure itself to use blind-vision-mcp:
To give yourself vision capabilities, I have installed **blind-vision-mcp** on this machine.
It is an MCP server that allows text-only models like you to analyze images and generate images.
## What you need to do
1. Check if blind-vision-mcp is installed:
```bash
pip show blind-vision-mcp
-
If not installed, ask the user to run:
pip install blind-vision-mcp -
Import the vision model (one-time):
litert-lm import \ --from-huggingface-repo litert-community/gemma-4-E2B-it-litert-lm \ gemma-4-E2B-it.litertlm \ gemma4-vision -
When you need to see an image, call this tool:
vision_describe(image="/path/to/image.jpg")for detailed analysisvision_compare(image_a="...", image_b="...")to compare
-
When you need to generate an image:
image_generate(description="...")
The server starts automatically on first use. The vision model uses ~2.6 GB VRAM via LiteRT.
---
## Usage Examples
```bash
# Analyze a screenshot (perfect for emulator control)
vision_describe(image="/path/to/screenshot.png")
# Compare before/after
vision_compare(image_a="/path/to/before.png", image_b="/path/to/after.png")
# Generate an image (beta)
image_generate(description="a beautiful landscape")
# Check server status
get_status()
How vision works (the cool part)
┌─────────────────────────────────────────────────────────┐
│ DeepSeek v4 Flash (text-only) │
│ "What's on the screen? → vision_describe(screenshot)" │
└────────────────────────┬────────────────────────────────┘
│ MCP protocol (stdin/stdout)
┌────────────────────────▼────────────────────────────────┐
│ blind-vision-mcp server │
│ ┌────────────┐ ┌──────────────┐ ┌──────────────────┐ │
│ │ tools.py │→│ LiteRT server │→│ Gemma 4 E2B │ │
│ │ (MCP tools) │ │ (port 9380) │ │ (2.6 GB VRAM) │ │
│ └────────────┘ └──────────────┘ └──────────────────┘ │
└─────────────────────────────────────────────────────────┘
The vision model (Gemma 4 E2B) runs entirely on your GPU via Google's LiteRT runtime. No data ever leaves your machine. The model is pre-quantized (mixed 2/4/8-bit) and loads directly at ~2.6 GB — no "load BF16 first then quantize" memory spike.
Why not just use a vision LLM?
| Solution | Cost | Privacy | VRAM | Quality | |----------|------|---------|------|---------| | GPT-4 Vision | $10-20/mo | ❌ Cloud | N/A | Excellent | | Claude Vision | $20/mo | ❌ Cloud | N/A | Excellent | | Qwen2-VL-7B (local) | Free | ✅ Local | ~10 GB VRAM | Good | | blind-vision-mcp | Free | ✅ Local | ~2.6 GB VRAM | Great |
Requirements
| Component | Minimum | |-----------|---------| | GPU | NVIDIA ≥8 GB VRAM | | RAM | 16 GB | | Storage | 5 GB free for vision model + 7 GB for gen model | | CUDA | 12.x |
Project Status
- Vision: ✅ Stable and tested
- Image generation: ✅ Stable (SDXL-Turbo, pure GPU, no offload)
- Image editing: 🧪 In development
- Version: 0.2.0 — API may change
License
MIT — see LICENSE.
Support
Star History
If this saves you from another API bill, ⭐ star the repo. It helps others find local-first AI tools.
Built with ❤️ by alexjm19
