Trigger this skill in 'Data Lake' or 'Enterprise Multi-DB' scenarios where the agent must select the single most relevant database from a repository containing dozens or hundreds of candidates. It is especially critical for 'Domain Collision' requests, where multiple databases share similar terminology (e.g., three different 'Product' or 'Customer' databases). Trigger words/phrases: "find which database is the right one", "search the enterprise data lake", "there are many overlapping databases, pick the correct one", "the question might apply to several sources, verify which one can actually answer it", or "make sure the selected source has a valid path between all the fields mentioned."
Dingxingdi0 · starred April 10, 2026
Skill Content
1. Capability Definition & Real Case
Professional Definition: The ability to perform high-precision database selection (routing) within a multi-database repository by evaluating schema coverage, structural connectivity, and fine-grained semantic alignment. This involves mapping query spans to specific schema entities, verifying that all target entities form a connected subgraph (joinability) within the selected database, and using embedding-based tie-breaking to resolve ambiguities between domain-overlapping sources.
Dimension Hierarchy: Environment Grounding -> Retrieval and Alignment -> Source Retrieval
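The embedding-based tie-breaking named in the definition can be sketched as follows. This is a minimal illustration, not the skill's actual implementation: the `embed` function here is a toy bag-of-words stand-in for a real sentence encoder, and the candidate schema summaries are hypothetical.

```python
from collections import Counter
import math

def embed(text):
    # Toy bag-of-words "embedding"; a real system would use a sentence encoder.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def break_tie(query, candidates):
    """Among coverage-tied databases, pick the one whose schema summary
    is most semantically similar to the query."""
    q = embed(query)
    return max(candidates, key=lambda db: cosine(q, embed(candidates[db])))

# Hypothetical summaries for two domain-overlapping databases.
candidates = {
    "product_catalog": "attribute name data type metadata column definitions",
    "products_gen_characteristics": "product color size weight specification values",
}
print(break_tie("find the attribute data type for attribute Green", candidates))
# -> product_catalog
```

Tie-breaking only runs after coverage and connectivity checks have narrowed the field; similarity alone should not outvote structural evidence.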
Real Case
[Case 1]
Initial Environment: A massive enterprise environment containing 23 separate databases (ranging from Formula 1 stats to stack-exchange logs) with 319 tables. The agent has no prior knowledge of which database holds which topic and must use exploration tools to find the target.
Real Question: Which driver has the most wins in Formula 1?
Real Trajectory: 1. The agent calls a summary command to see all 23 databases and identifies 'F1_Stats' as the likely source. 2. It then requests a list of tables for 'F1_Stats' and finds 'drivers' and 'race_results'. 3. It requests column metadata for 'drivers' (driver_id, name) and 'race_results' (driver_id, position, wins). 4. It composes a join query to count and rank wins by driver name.
Real Answer: SELECT T1.name, COUNT(T2.wins) FROM drivers AS T1 JOIN race_results AS T2 ON T1.driver_id = T2.driver_id WHERE T2.position = 1 GROUP BY T1.name ORDER BY COUNT(T2.wins) DESC LIMIT 1;
Why this demonstrates the capability: This demonstrates hierarchical source retrieval in a high-scale environment. The agent successfully narrowed down 23 domains to one, then isolated two relevant tables from many, correctly identifying the bridge (driver_id) across levels of abstraction without being overwhelmed by the other 317 irrelevant tables.
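The coarse-to-fine narrowing in Case 1's trajectory can be sketched as a two-level keyword filter. This is a hedged illustration: the `REPO` snapshot and `route` function are hypothetical, and a real agent would call exploration tools rather than hold the whole repository in memory.

```python
# Hypothetical repository snapshot: database -> table -> columns.
REPO = {
    "F1_Stats": {
        "drivers": ["driver_id", "name"],
        "race_results": ["driver_id", "position", "wins"],
    },
    "stack_exchange": {
        "posts": ["post_id", "title", "score"],
        "users": ["user_id", "display_name"],
    },
}

def score(keywords, text_items):
    # Count how many query keywords appear as substrings of any item.
    return sum(any(k in item.lower() for item in text_items) for k in keywords)

def route(keywords):
    # Level 1: rank databases by keyword hits in their names and table names.
    db = max(REPO, key=lambda d: score(keywords, [d] + list(REPO[d])))
    # Level 2: within the chosen database, keep only tables whose name
    # or columns mention a keyword.
    tables = [t for t, cols in REPO[db].items()
              if score(keywords, [t] + cols) > 0]
    return db, tables

print(route(["driver", "win", "f1"]))
# -> ('F1_Stats', ['drivers', 'race_results'])
```

The two-level structure is the point: the agent never scores all 319 tables at once, only the handful inside the database that won level 1.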
[Case 2]
Initial Environment: A complex repository with multiple databases covering overlapping domains, including a 'product_catalog' database and a 'products_gen_characteristics' database. One contains attribute metadata as schema, while the other contains product specs as values.
Real Question: Find the attribute data type for the attribute named 'Green'.
Real Trajectory: 1. Analyze the query to identify that 'Green' is the name of an attribute (column/schema item) rather than a value like a product color. 2. Map the term 'attribute' and 'data type' to the metadata columns in the 'product_catalog' schema. 3. Verify that in 'products_gen_characteristics', 'Green' only exists as a string value in a 'Color' column, which does not satisfy the 'attribute data type' structural requirement. 4. Select 'product_catalog' because it allows a connected path from the attribute name to its data type property.
Real Answer: SELECT data_type FROM product_catalog WHERE attribute_name = 'Green';
Why this demonstrates the capability: This demonstrates the ability to differentiate between domain-overlapping databases using structural intent. The agent correctly prioritizes a database where the query terms map to schema entities (columns) over one where they map to cell values, resolving a common 'lexical similarity' trap.
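The schema-versus-value distinction in Case 2 can be sketched as a structural-intent check. The `OCCURRENCES` index below is hypothetical: it records, for each database, whether the literal 'Green' appears as a schema-level attribute definition (with queryable properties) or merely as a cell value.

```python
# Hypothetical index: where does the literal 'Green' appear in each DB,
# and in what structural role?
OCCURRENCES = {
    "product_catalog": {
        "role": "attribute_definition",
        "properties": ["attribute_name", "data_type"],
    },
    "products_gen_characteristics": {
        "role": "cell_value",
        "column": "Color",
    },
}

def select_source(required_property):
    """Prefer a database where the term is a schema-level entity that
    actually exposes the property the query asks for."""
    for db, occ in OCCURRENCES.items():
        if occ["role"] == "attribute_definition" and required_property in occ["properties"]:
            return db
    return None  # no database satisfies the structural requirement

print(select_source("data_type"))
# -> product_catalog
```

A pure lexical matcher would score both databases equally on 'Green'; requiring the `data_type` property to be reachable is what breaks the trap.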
[Case 3]
Initial Environment: A multi-database system including a 'Student_Activity' database. The database contains tables for 'Activity', 'Participates_in', 'Faculty_Participates_in', 'Student', and 'Faculty'. Relations are defined via IDs.
Real Question: What does John do?
Real Trajectory: 1. Extract query phrases: 'John' (subject) and 'do' (action/activity). 2. Map 'John' to potential columns: Student.student_name and Faculty.faculty_name. 3. Map 'do' to Activity.activity_name. 4. Perform a connectivity check using the database adjacency list to verify if a join path exists between 'Student' and 'Activity' (via Participates_in) and between 'Faculty' and 'Activity' (via Faculty_Participates_in). 5. Since at least one valid connected subgraph exists in the 'Student_Activity' DB, confirm this as a valid source.
Real Answer: SELECT T3.activity_name FROM Student AS T1 JOIN Participates_in AS T2 ON T1.student_id = T2.student_id JOIN Activity AS T3 ON T2.activity_id = T3.activity_id WHERE T1.student_name = 'John';
Why this demonstrates the capability: This illustrates connectivity-based source validation. The agent doesn't just look for the words 'John' and 'do'; it ensures the database actually supports the relational link required to answer the question, proving the source is structurally appropriate.
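The connectivity check from Case 3's trajectory is essentially reachability over the database's join graph. A minimal sketch, assuming a hand-written adjacency list (`JOIN_GRAPH` below is hypothetical; a real agent would derive it from foreign-key metadata):

```python
from collections import deque

# Hypothetical join graph for the 'Student_Activity' database:
# an edge means the two tables share a joinable key.
JOIN_GRAPH = {
    "Student": ["Participates_in"],
    "Participates_in": ["Student", "Activity"],
    "Activity": ["Participates_in", "Faculty_Participates_in"],
    "Faculty_Participates_in": ["Activity", "Faculty"],
    "Faculty": ["Faculty_Participates_in"],
}

def connected(graph, start, goal):
    """BFS over the join graph: does a join path exist from start to goal?"""
    seen, frontier = {start}, deque([start])
    while frontier:
        node = frontier.popleft()
        if node == goal:
            return True
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return False

# The DB is a valid source only if every mapped entity pair is joinable.
print(connected(JOIN_GRAPH, "Student", "Activity"))   # True
print(connected(JOIN_GRAPH, "Faculty", "Activity"))   # True
```

If either check returned False, the database would be rejected even though it lexically contains 'John' and activity names, which is exactly the validation step the trajectory describes.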
Pipeline Execution Instructions
To synthesize data for this capability, you must strictly follow a 3-phase pipeline. Do not hallucinate steps. Read the corresponding reference file for each phase sequentially:
Phase 1: Environment Exploration
Read the exploration guidelines to discover raw knowledge seeds:
references/EXPLORATION.md
Phase 2: Trajectory Selection
Once Phase 1 is complete, read the selection criteria to evaluate the trajectory:
references/SELECTION.md
Phase 3: Data Synthesis
Once a trajectory passes Phase 2, read the synthesis instructions to generate the final data:
references/SYNTHESIS.md