Validate and clean data.csv URLs via GitHub API, flag invalid/non-GitHub/mirror entries
Validate all URLs in data.csv via GitHub API, collect repo activity metrics, and flag problematic entries for human review.
CSV file at data.csv with columns: 页签 (tab), 序号 (index), 项目名称 (project name), 分类 (category), 上游地址 (upstream URL)
python3 scripts/clean.py data.csv -o output/cleaned.csv --summary
This script:
- For repo URLs (github.com/{owner}/{repo}): calls the GitHub API to validate existence and fetches open_issues_count, total PR count, and fork/archived/mirror status
- For org URLs (github.com/{owner}): validates that the org exists

Results are cached in output/.cache/github_clean_cache.json for resumability.
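The repo/org split above starts with URL classification. A minimal sketch of that step, assuming the script's actual parsing logic (the function name `classify_url` and the returned dict shape are hypothetical, not taken from scripts/clean.py):

```python
import re

# Matches github.com/{owner} or github.com/{owner}/{repo} (hypothetical helper,
# illustrating the classification step, not the script's actual code).
GITHUB_RE = re.compile(r"^https?://github\.com/([^/\s]+)(?:/([^/\s#?]+))?/?$")

def classify_url(url):
    """Classify one upstream URL before deciding which GitHub API call to make."""
    if not url or not url.strip():
        return {"url_type": "", "status": "no_url"}
    m = GITHUB_RE.match(url.strip())
    if not m:
        return {"url_type": "non_github", "status": "non_github"}
    owner, repo = m.group(1), m.group(2)
    if repo:
        # github.com/{owner}/{repo} -> validate via GET /repos/{owner}/{repo}
        return {"url_type": "repo", "owner": owner, "repo": repo}
    # github.com/{owner} -> validate via GET /orgs/{owner} (or /users/{owner})
    return {"url_type": "org", "owner": owner}
```

Validated results would then be written to the JSON cache keyed by URL, so a rerun skips already-checked entries.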
Check output/cleaned.csv columns:
- status: valid, valid_user, not_found, non_github, no_url
- url_type: repo, org, user, non_github, or empty
- actual_url: populated only if the repo was redirected (renamed/transferred)
- open_issues_count: open issues + open PRs (GitHub combines them)
- total_pull_requests: total PRs (all states); 0 PRs suggests a mirror repo
- fork: whether GitHub marks it as a fork
- archived: whether the repo is archived
- mirror_url: populated if GitHub knows it's a mirror
- has_issues: whether issues are enabled; disabled suggests a mirror
- note: additional flags (multi-URL, non-GitHub, etc.)

Summarize findings and highlight entries that need human attention:
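The mirror/archive heuristics in the column list can be sketched as a review filter over cleaned.csv rows. This is illustrative, not the script's actual code: the function name `review_flags` and the flag strings are assumptions, and CSV values are compared as strings (the form csv.DictReader would yield):

```python
# Hypothetical helper: collect the reasons a cleaned.csv row needs human review.
def review_flags(row):
    flags = []
    status = row.get("status", "")
    if status in ("not_found", "non_github", "no_url"):
        flags.append(status)                        # the URL itself is the problem
    if row.get("archived") == "true":
        flags.append("archived")
    if row.get("mirror_url"):
        flags.append("declared_mirror")             # GitHub records an upstream
    if status == "valid" and row.get("total_pull_requests") == "0":
        flags.append("no_prs_possible_mirror")      # 0 PRs suggests a mirror
    if status == "valid" and row.get("has_issues") == "false":
        flags.append("issues_disabled_possible_mirror")
    if row.get("actual_url"):
        flags.append("redirected")                  # renamed/transferred repo
    return flags
```

Rows returning a non-empty flag list are the ones to surface in the summary.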
→ After human review: run /fix-urls (to resolve problematic URLs), then /classify