# MCPC Data Pipeline Guide
This repository hosts the tooling we use to crawl Model Context Protocol (MCP) sources, normalize the collected repositories, and build curated issue datasets for annotation. The project is intentionally modular: every stage writes JSON artifacts that the next script consumes, so you can pause/resume or swap inputs at any time.
## Requirements
- Python 3.9 or newer (3.11+ recommended for better `asyncio` performance)
- Google Chrome + a matching ChromeDriver when running crawlers that rely on Selenium (Smithery, MCP Market, Cursor, etc.)
- Dependencies from `requirements.txt`:

```bash
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -r requirements.txt
```
## Environment variables
Most scripts talk to the GitHub API. Set a personal access token once before running long jobs:
```bash
export GITHUB_TOKEN=ghp_your_token_here   # Windows PowerShell: $env:GITHUB_TOKEN="..."
```
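Inside the scripts, picking up the token usually amounts to a small helper like the sketch below. The function name `github_headers` is hypothetical, not the actual API of these scripts; only the environment variable and header names come from the GitHub REST conventions.

```python
import os

def github_headers() -> dict:
    """Build GitHub API request headers, attaching a token when one is set.

    Unauthenticated requests are capped at 60/hour; a personal access
    token raises that to 5,000/hour, which long crawl jobs effectively need.
    """
    headers = {"Accept": "application/vnd.github+json"}
    token = os.environ.get("GITHUB_TOKEN")
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return headers
```

Because the token is read from the environment at call time, you can export it once per shell session and every stage of the pipeline picks it up.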
If you need site-specific API keys (for example, Smithery), set them in the same way before invoking the related crawler.
## Repository layout
```
engine/       # Crawler implementations (servers + clients)
scripts/      # Data-processing utilities
mcp_servers/  # Generated server metadata
分组数据/      # Issue sampling & grouping outputs (Chinese for "grouped data")
config/       # YAML configs for crawl targets
```
## End-to-end workflow
### 1. Crawl raw sources
Use the unified crawler engine to hit one or more data sources defined in your config file.
```bash
python -m engine.core.crawler_engine \
    --config config/sites_config.yaml \
    --sources smithery pulse awesome glama \
    --type servers
```
Common options:
- `--sources`: space-delimited subset of sources. Omit to crawl every known source.
- `--type`: `servers`, `clients`, or `all`.
- `--download-sources`: immediately download each repository after its metadata is collected.
- `--download-only`: skip crawling and only download sources for the JSONs already on disk.
- `--categories-only` / `--categories-source`: rebuild category tags without touching metadata.
Each source writes `mcp_servers/<source>/<source>.json` (or `mcp_clients/...` for client crawlers).
### 2. Aggregate into a master list
Combine all source directories into a deduplicated list before filtering:
```bash
python scripts/aggregate_servers.py
```
By default it scans every subdirectory under `mcp_servers/`, normalizes GitHub URLs, removes duplicates, and writes `mcp_servers/servers_aggregate.json` (plus per-source counts).
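The normalize-then-dedupe step can be approximated as below. This is a sketch, not the script's actual implementation: the function names and the assumption that each entry carries a `url` field are illustrative.

```python
from urllib.parse import urlparse

def normalize_github_url(url: str) -> str:
    """Reduce a GitHub URL to a canonical lowercase "owner/repo" key."""
    path = urlparse(url.strip()).path.strip("/")
    parts = path.split("/")[:2]              # drop /tree/..., /issues/..., etc.
    key = "/".join(parts).lower()
    return key[:-4] if key.endswith(".git") else key

def dedupe(servers: list) -> list:
    """Keep the first entry seen for each canonical repository key."""
    seen, unique = set(), []
    for server in servers:
        key = normalize_github_url(server["url"])
        if key not in seen:
            seen.add(key)
            unique.append(server)
    return unique
```

Canonicalizing before comparing is what lets `https://github.com/Foo/Bar.git` and `https://github.com/foo/bar/tree/main` collapse into one entry.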
### 3. Filter low-signal repositories
`scripts/filter_servers.py` calls the GitHub REST API for each repository and enforces simple health rules (minimum activity, non-template, has a README, etc.).
```bash
python scripts/filter_servers.py \
    --input mcp_servers/servers_aggregate.json \
    --output mcp_servers/servers_filtered.json \
    --resume
```
Key flags:

- `--limit N`: stop after `N` retained repositories (helpful for smoke tests).
- `--resume`: continue where a previous run left off using `<output>.progress`.
Reasons for exclusion (e.g., `abandoned_project`, `placeholder_repo`) are summarized inside the output JSON.
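A health-rule check of this kind typically boils down to a predicate over the repo metadata returned by `GET /repos/{owner}/{repo}`. The sketch below is an assumption about the shape of such a check: the `template_repo` reason string, the `has_readme` field, and the 365-day threshold are all hypothetical, while `abandoned_project` and `placeholder_repo` mirror the reasons named above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

MIN_ACTIVITY_DAYS = 365  # hypothetical threshold, not the script's actual value

def exclusion_reason(repo: dict) -> Optional[str]:
    """Return a reason string when a repo fails a health rule, else None.

    `repo` mirrors a GitHub REST /repos/{owner}/{repo} response.
    """
    if repo.get("is_template"):
        return "template_repo"
    if repo.get("archived"):
        return "abandoned_project"
    pushed = datetime.fromisoformat(repo["pushed_at"].replace("Z", "+00:00"))
    if datetime.now(timezone.utc) - pushed > timedelta(days=MIN_ACTIVITY_DAYS):
        return "abandoned_project"
    if not repo.get("has_readme", True):   # README presence checked separately
        return "placeholder_repo"
    return None
```

Returning a reason string instead of a bare boolean is what makes the per-reason summary in the output JSON cheap to build.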
### 4. Collect meaningful issues
`scripts/collect_meaningful_issues.py` walks the filtered list, paginates through every GitHub issue, and stores those that meet both rules:
- At least two comments, and
- Multiple distinct participants (issue author plus commenters).
```bash
python scripts/collect_meaningful_issues.py \
    --resume \
    --input mcp_servers/servers_filtered.json \
    --output mcp_servers/servers_issues.json \
    --issues-per-repo 100 \
    --comments-per-issue 40
```
Tips:

- Use a high `--issues-per-repo` (up to 100) to reduce pagination overhead. The script automatically keeps requesting additional pages until a page returns fewer results.
- `--comments-per-issue` controls the top-N comments fetched per issue. Increase it if you need larger participant sets.
- The script writes both the full dataset and a `.progress` checkpoint so you can restart safely after rate-limit or network failures.
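The "request pages until one comes back short" loop described above is a standard pattern against GitHub's paged endpoints. A generic sketch (the `fetch_page` callable stands in for one API call such as `GET /repos/{owner}/{repo}/issues?page=N&per_page=M`; the names are assumptions):

```python
from typing import Callable, Iterator

def paginate(fetch_page: Callable[[int, int], list],
             per_page: int = 100) -> Iterator[dict]:
    """Yield items page by page until a short (or empty) page signals the end."""
    page = 1
    while True:
        batch = fetch_page(page, per_page)
        yield from batch
        if len(batch) < per_page:   # a short page means there is no next page
            return
        page += 1
```

Stopping on a short page rather than an empty one saves one request per repository, which adds up when the filtered list contains thousands of repos.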
### 5. Sample and group issues
After `servers_issues.json` is ready, run `scripts/group_issues.py` to (a) carve out a sampled subset and (b) optionally partition it into labeled groups for annotation.
```bash
# Example: sample 663 issues, then create six 100-item groups + one 63-item group
python scripts/group_issues.py \
    --input mcp_servers/servers_issues.json \
    --output-dir 分组数据 \
    --sample-count 663 \
    --group-sizes 100,100,100,100,100,100,63 \
    --seed 42 \
    --sample-output 分组数据/sampled_servers_issues.json \
    --annotated-output 分组数据/annotated_servers_issues.json
```
What you get:
- `sampled_servers_issues.json`: standalone dataset containing only the sampled servers/issues (with `sampled: true`).
- `annotated_servers_issues.json`: the original input enriched with `sampled`/`group_assignment` flags so you can keep track of what has been used.
- `group_<n>.json`: one file per group containing flattened issue metadata for quick handoff.
Supplying `--seed` guarantees reproducible sampling/grouping. If you only need the sample without groups, omit `--group-sizes` and `--groups`/`--size`.
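Seeded reproducibility typically comes from a local `random.Random(seed)` instance rather than the global RNG. A sketch of the sample-then-slice logic under that assumption (function name and signature are illustrative, not the script's API):

```python
import random

def sample_and_group(issues: list, sample_count: int,
                     group_sizes: list, seed: int):
    """Draw a reproducible sample, then slice it into fixed-size groups."""
    assert sum(group_sizes) == sample_count, "group sizes must cover the sample"
    rng = random.Random(seed)               # local RNG: same seed, same sample
    sampled = rng.sample(issues, sample_count)
    groups, start = [], 0
    for size in group_sizes:
        groups.append(sampled[start:start + size])
        start += size
    return sampled, groups
```

With the example invocation above, `--sample-count 663` and `--group-sizes 100,...,63` satisfy exactly this sum check: 6 × 100 + 63 = 663.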
## Resuming long jobs safely
- `filter_servers.py` and `collect_meaningful_issues.py` both emit `<output>.progress`. Deleting the progress file forces a clean restart.
- `collect_meaningful_issues.py --resume` also reloads the main output JSON so partial results are preserved across crashes.
- `group_issues.py` does not mutate the source file except to add annotations when you point `--annotated-output` at the same path. Use a different path if you want to keep the original pristine.
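The `.progress` checkpoint pattern can be sketched as a pair of helpers like these. This is an assumption about the mechanism, not the scripts' actual format: here the checkpoint is simply a JSON list of already-processed repository keys.

```python
import json
from pathlib import Path

def load_progress(output_path: str) -> set:
    """Read the set of already-processed repo keys, if a checkpoint exists."""
    progress_file = Path(f"{output_path}.progress")
    if progress_file.exists():
        return set(json.loads(progress_file.read_text()))
    return set()

def save_progress(output_path: str, done: set) -> None:
    """Persist the processed set; called after each repository completes."""
    Path(f"{output_path}.progress").write_text(json.dumps(sorted(done)))
```

On `--resume`, a run loads this set and skips any repository whose key is already in it, which is why deleting the file forces a clean restart.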
## Troubleshooting & best practices
| Symptom | Likely cause | Fix |
| --- | --- | --- |
| 403 with `X-RateLimit-Remaining: 0` | GitHub rate limit | Wait for the script to sleep automatically, or use a higher-tier PAT. |
| Selenium timeouts | Headless Chrome missing or mismatched driver | Install Chrome + ChromeDriver and ensure both are on your `PATH`. |
| Some repos missing issues | They may be archived/private or unreachable via the API | Run the collector with `--resume` after fixing credentials; repos that previously failed will be retried. |
| Need to test without overwriting production files | N/A | Use `--output` (and related flags) to point at a temporary location. |
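The "sleep automatically" behavior in the first row usually keys off GitHub's rate-limit response headers. A sketch of the computation, assuming the standard `X-RateLimit-Remaining` / `X-RateLimit-Reset` headers (the 5-second pad is my safety margin, not part of the API contract):

```python
def seconds_until_reset(headers: dict, now: float) -> float:
    """How long to sleep when GitHub answers 403 with the limit exhausted.

    `headers` is the response-header mapping; `now` is a Unix timestamp,
    comparable to the epoch-seconds value in X-RateLimit-Reset.
    """
    if headers.get("X-RateLimit-Remaining") != "0":
        return 0.0                     # not rate-limited; no need to wait
    reset = float(headers.get("X-RateLimit-Reset", now))
    return max(0.0, reset - now) + 5.0
```

A collector would call this after any 403 and `time.sleep()` for the returned duration before retrying the same request.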
## License & contributions
Internal project – update this section if you plan to publish the toolkit. Pull requests that add new MCP sources or improve filtering heuristics are welcome.