Use when the user wants to research official data sources and third-party data sources from a country, region, sector, regulator, target population, or other user-provided brief, and needs evidence-backed acquisition validation with executable artifacts rather than a narrative-only memo.
Turn a user-provided research brief into a data-source acquisition feasibility study.
The goal is not to produce a site list or a narrative report. The goal is to identify the most relevant official and third-party sources for the brief, verify what they expose, measure whether they can be traversed or downloaded, save small real artifacts, and recommend executable acquisition strategies.
When time or budget is constrained, keep the prescribed priority order. Do not skip local or domain-native third-party research just because a global provider is easier to name. If local or domain-native third-party sources are missing, weak, or non-executable, say so with evidence.
Required:
- research_brief: a natural-language description of the data-source question, including the target geography, target object, filters, and desired data where available

Optional:

- brief_slug: lowercase letters, numbers, hyphens only; if missing, derive from the main research target
- country_or_region: reader-facing geography name when the brief is geography-bound
- target_language: default 中文
- generated_at: default current date in YYYY-MM-DD
- sample_size: default 1000; treat this as the target upper bound for one module sample, not a hard quota when the source is paid, capped, or otherwise constrained
- template_path: default overseas-registry-source-research/assets/templates/registry-research-template.md
- time_limit_or_budget: use to limit paid-source exploration or test depth
- validation_depth: default deep; use minimum-runnable only when the user explicitly wants the smallest viable proof
- scoring_weights: optional override for strategy synthesis

Write results to this exact directory:
results/<YYYYMMDD>/overseas-registry-source-research/

Do not reverse the nesting into results/overseas-registry-source-research/<YYYYMMDD>/.
Produce exactly one main report:
<brief-slug>-source-research.md

For each selected module, produce:

- <brief-slug>-<source-id>-<module-id>-download-sample.py
- <brief-slug>-<source-id>-<module-id>-test-dataset-<actual-sample-size>.<ext>, or a same-name directory when the source returns many raw files

For each selected module that uses search, pagination, or browser-gated access, also produce:

- <brief-slug>-<source-id>-<module-id>-boundary-probe.json

Optional helper artifact:

- <brief-slug>-download-sample.py

Hard rules:
- Always start from assets/templates/registry-research-template.md.
- Do not split the deliverable into sources.md, feasibility.md, or plan.md.
- Use temp/ only as a working directory, never as the final delivery directory.
- If you cannot reach sample_size, save the largest real raw sample you could obtain within constraints and state the actual size plus blocker evidence.
- brief_slug, source_id, module_id: lowercase letters, numbers, hyphens only
- Every source_id and every module_id must be unique.

Key terms:

- 研究简述: the user-provided brief that defines the target geography, object, filters, and desired data
- 数据源: source at institution or platform level, such as a regulator, authority, public portal, exchange, association, open-data platform, local reseller, specialist vendor, or global provider
- 模块: concrete executable entry under one source, such as an API, bulk package, search page, detail page, filing page, document service, export endpoint, or downloadable dataset
- 官方数据源: a source controlled by a government, regulator, court, exchange, official registry, public body, or other authoritative institution directly tied to the brief
- 第三方数据源: a non-official source that republishes, aggregates, indexes, enriches, or sells relevant data
- 当地第三方: provider focused on the same country, region, or domain context as the brief
- 广域第三方: broader multi-country or cross-market provider not primarily local to the target context

Do not force a fixed coverage model up front. Derive the relevant coverage dimensions from the brief after parsing it. Typical dimensions can include identity, qualification or license status, disclosure documents, market activity, enforcement history, ownership, product inventory, transaction data, or other brief-specific categories.
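The identifier contract above (lowercase letters, numbers, hyphens only; composed into fixed filename patterns) is mechanical enough to validate in code. A minimal sketch, assuming hypothetical helper names:

```python
import re

# Naming contract: lowercase letters, numbers, and hyphens only,
# applied to brief_slug, source_id, and module_id alike.
_SLUG_RE = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")

def is_valid_slug(value: str) -> bool:
    """True when the identifier satisfies the naming contract."""
    return bool(_SLUG_RE.match(value))

def sample_filename(brief_slug: str, source_id: str,
                    module_id: str, n: int, ext: str) -> str:
    """Compose the per-module test-dataset filename from validated parts."""
    for part in (brief_slug, source_id, module_id):
        if not is_valid_slug(part):
            raise ValueError(f"invalid identifier: {part!r}")
    return f"{brief_slug}-{source_id}-{module_id}-test-dataset-{n}.{ext}"
```
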
Also derive the record model from the brief before designing the sweep. The target may be entity-level, event-level, document-level, transaction-level, case-level, or another shape. Do not default to an entity-registry mental model when the brief is really about penalties, procurement events, court cases, filings, notices, or other non-entity records.
Follow this order. Do not skip a gate.
Before searching sources, extract and state all of these from the brief:
Also resolve whether the brief depends on status edges such as licensed vs applying, active vs inactive, current vs historical, federal vs state vs municipal, or domestic vs cross-border. If such distinctions matter, make them explicit before source selection.
If the brief is ambiguous, choose the narrowest reasonable interpretation that still satisfies the user request, and state that interpretation in the report.
Collect only what is necessary to interpret acquisition feasibility:
Do not turn this section into broad country, sector, or industry research. Keep it lean.
Identify the official source categories implied by the brief. Typical categories may include:
Do not mechanically reuse a fixed category list when the brief implies different official channels.
When the target jurisdiction has parallel authorities at multiple levels, such as federal and state or central and municipal, do not declare the official sweep complete until you have checked whether those parallel levels materially affect coverage.
For each official source, record:
Do this in two passes:
For every third-party source, record:
- classification as 当地第三方 or 广域第三方

If a broader provider is selected ahead of a local or domain-native one, explain why with evidence. Acceptable reasons include:
If you conclude that no credible local or domain-native provider exists, record the search path that led to that conclusion, including provider categories checked, representative search queries, and the stop condition. Do not make a shallow "none found" claim without search-path evidence.
For each source under consideration, treat module discovery as a census task, not a shortlist task. List all modules relevant to the brief before choosing any of them.
Do not:
For every source card, report all of these:
- census status: complete or partial

If the census is not complete, label it partial, explain the stop condition, and avoid language that implies full coverage.
Search, lookup, and filter modules have special handling:
If no executable search module is found, record 未发现可执行检索模块 and attach search-path evidence.

For every discovered module row, always include:
- usability verdict: 可用, 部分可用, 不可用, or 未验证
- executability verdict: 可执行, 受限, 不可执行, or 未验证

Do not list a module URL as if it were valid without checking whether it resolves to meaningful content. If the URL is dead, redirected to a generic page, login wall, or marketing shell, record that explicitly.
Preferred reason labels when applicable:
- 无公开入口
- 仅产品页/仅营销页
- 需登录
- 需付费
- 需KYC
- 验证码/反爬
- 403/封禁
- TLS/连接失败
- 无结构化返回
- 链路不稳定
- 与简述不匹配
- 未完成验证

Each selected module must complete this loop:
Do not stop at qualitative conclusions.
Whenever applicable, measure and report actual values for:
Measure actual values where safe, legal, and technically practical. Use documented vendor limits to complement live tests, and do not brute-force past published limits, login walls, paywalls, captchas, or protective controls merely to force a number.
If one metric does not apply, mark it N/A and explain why. If it applies but you could not safely or legally verify it, mark it 未验证, record the blocker evidence, and report the highest safely verified value if you have one. Do not leave it blank.
For hard numeric boundary metrics, such as maximum page size, maximum offset, single-query cap, rate threshold, captcha trigger point, or token expiry, do not use 推断 as a substitute for a measured number. Use 已验证 with the measured value, or 未验证 with blocker evidence. 推断 is allowed only for contextual judgments that are not hard numeric limits.
For every search-capable module, state and test:
Traversal can include:
If traversal is not feasible, show the boundary that breaks it, such as result caps, captchas, hard rate limits, weak identifiers, or legal restrictions. If the limit is only documented or indirectly observed, label that clearly rather than presenting it as directly verified.
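One way to locate such a boundary, such as a hard offset cap, without hammering the source is a binary search between a known-good and a known-bad depth. The sketch below is illustrative, not one of the bundled scripts; `page_ok` is a caller-supplied check (for example, an HTTP GET that returns True when the page at that offset still contains real results, with polite throttling inside it).

```python
def probe_max_offset(page_ok, hi_limit: int = 1_000_000) -> int:
    """Binary-search the deepest offset the source still serves.

    Offsets at or below the returned value were observed reachable;
    record the result in the boundary-probe artifact as a measured
    value, not an inference.
    """
    if page_ok(hi_limit):
        return hi_limit  # no boundary observed within the tested range
    lo, hi = 0, hi_limit  # invariant: lo is known-good, hi is known-bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if page_ok(mid):
            lo = mid
        else:
            hi = mid
    return lo
```

Each `page_ok` call is one request, so the probe costs about log2(hi_limit) requests, roughly 20 for a one-million ceiling, which keeps the measurement within polite limits.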
Even when a module is not selected, still give it a clear usability verdict, direct URL, and unusable or constrained reason in the source-level module census.
Before writing a downloader from scratch, inspect scripts/ and adapt the closest template.
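For orientation, a minimal sketch of the shape such a paginated sampler takes; the bundled templates remain the preferred starting point, and the page-fetching callback here is a hypothetical stand-in for the module's real HTTP access:

```python
def download_sample(fetch_page, sample_size: int = 1000,
                    page_size: int = 100) -> list:
    """Collect up to sample_size raw records via offset pagination.

    fetch_page(offset, limit) -> list is caller-supplied; a real
    implementation wraps an HTTP GET with throttling and retries.
    If the source runs dry early, return what was actually obtained
    and report the actual size plus blocker evidence.
    """
    records: list = []
    offset = 0
    while len(records) < sample_size:
        batch = fetch_page(offset, page_size)
        if not batch:
            break  # source exhausted before sample_size was reached
        records.extend(batch)
        offset += page_size
    return records[:sample_size]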
Bundled templates:
- scripts/http_download_template.py: single-endpoint raw download skeleton
- scripts/paginated_download_template.py: page or offset-based raw download skeleton
- scripts/search_boundary_probe.py: search and pagination boundary probe
- scripts/playwright_probe_template.py: browser-gated source probe skeleton
- scripts/validate_registry_artifacts.py: output-structure validator

For each selected module, save:
The final report must include all of these sections:
- 官方主方案: the best official-only executable path
- 第三方组合方案: the best executable third-party-led combination, respecting local-before-broad priority, or an explicit 无可执行第三方主组合 verdict with evidence
- 备选方案: at most one fallback combination
- 最小可运行路径: the smallest real path that already proves download feasibility

Use default scoring weights unless the user specifies otherwise:
- 0.4
- 0.3
- 0.3

If two third-party choices score similarly, prefer the local or domain-native provider.
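The weighted comparison plus the local-provider tie-break can be sketched as follows. The dimension names attached to the 0.4/0.3/0.3 defaults are not spelled out above, so the names used here (executability, coverage, cost) are illustrative assumptions; the tie-break rule itself comes from the text.

```python
# Dimension names are assumed for illustration; only the weights are given.
DEFAULT_WEIGHTS = {"executability": 0.4, "coverage": 0.3, "cost": 0.3}

def score(candidate: dict, weights: dict = DEFAULT_WEIGHTS) -> float:
    """Weighted sum over per-dimension scores in [0, 1]."""
    return sum(weights[dim] * candidate["scores"][dim] for dim in weights)

def pick(candidates: list, tie_margin: float = 0.05) -> dict:
    """Highest score wins; within tie_margin, prefer the
    local or domain-native provider (tie_margin is an assumption)."""
    best = max(candidates, key=score)
    for c in candidates:
        if (c is not best and c.get("local") and not best.get("local")
                and score(best) - score(c) <= tie_margin):
            return c
    return best
```
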
Every key claim must include:
Explicit labels:
- 已验证: tested directly
- 未验证: not confirmed
- 推断: inference based on evidence, not directly confirmed

Do not use 推断 to present an unmeasured hard boundary value as if it were an operational limit.
Additional rules:
- 合作洽谈 must be only Y or N, backed by pricing, terms, access-restriction, or sales-contact evidence.

Always start from assets/templates/registry-research-template.md.
The template's section order, tables, and checklist are part of the contract. Keep the final report concise, but do not delete required sections. If the bottleneck is executable validation, spend the effort there rather than padding descriptive prose.
- Are results written to results/<YYYYMMDD>/overseas-registry-source-research/?
- Does the report include both 官方主方案 and 第三方组合方案, even when the third-party outcome is 无可执行第三方主组合?