Unified heterogeneous knowledge QA system. Automatically routes natural-language queries to SQL databases, knowledge graphs, or table files using 4-layer detection (rule-based, LLM semantic, schema matching, entity verification). Supports multiple LLM providers and bilingual queries. Trigger on data queries, "how many", "show", aggregations, filters, joins, or structured-information requests.
Unified heterogeneous knowledge QA system with automatic source detection and multi-stage reasoning.
Natural language queries are automatically routed to the appropriate knowledge source (SQL, Knowledge Graph, or Table files) without requiring users to specify the data source. A 4-layer detection architecture ensures accurate source identification, followed by multi-stage query generation with self-revision and voting.
User Query → Source Detection (4 layers) → Query Generation → Self-Revision → Voting → Execution → Answer
| Trigger | Action |
|---|---|
| "How many employees in X?" | NL2SQL engine |
| "Who is the founder of X?" | NL2SPARQL engine (KG) |
| "Which quarter had highest sales?" | TableQA engine |
| "Show average salary by department" | Auto-detect SQL |
| Queries with aggregations, filters, joins | Route to SQL |
| Entity relationship queries | Route to KG |
| Questions about CSV/Excel files | Route to TableQA |
| Multi-hop queries across sources | Decompose + fuse |
Layer 1 (15%): Rule-Based
- 20+ keywords per source type
- 7 regex patterns (aggregation, comparison, relation)
- Fast pre-filtering
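The rule-based layer can be sketched as a keyword-and-regex pre-filter. The keyword lists and pattern below are illustrative placeholders, not the system's actual lists (which use 20+ keywords per source type and 7 patterns):

```python
import re

# Illustrative keyword lists per source type (the real system uses 20+ per type).
SOURCE_KEYWORDS = {
    "sql": ["how many", "average", "count", "total", "group by"],
    "kg": ["founder of", "who is", "related to", "born in"],
    "table": ["csv", "spreadsheet", "quarter", "column"],
}

# One example of an aggregation pattern (the real system uses 7 patterns).
AGGREGATION_PATTERN = re.compile(
    r"\b(sum|avg|average|count|max|min|highest|lowest)\b", re.I
)

def rule_based_scores(query: str) -> dict:
    """Score each source by keyword hits; an aggregation match boosts SQL."""
    q = query.lower()
    scores = {src: sum(kw in q for kw in kws) for src, kws in SOURCE_KEYWORDS.items()}
    if AGGREGATION_PATTERN.search(q):
        scores["sql"] += 1
    return scores
```

The per-source scores feed into the weighted fusion in Layer 4 rather than deciding the route on their own.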
Layer 2 (35%): LLM Semantic
- Intent classification
- Entity/predicate detection
- Multi-hop identification
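The semantic layer's job can be captured in a single classification prompt; the wording below is an assumption for illustration, not the system's actual prompt:

```python
# Illustrative intent-classification prompt (not the system's actual wording).
INTENT_PROMPT = """Classify the query's target source as one of: sql, kg, table.
Also extract entity mentions and note whether the query is multi-hop.
Query: {query}
Answer as JSON: {{"source": ..., "entities": [...], "multi_hop": true/false}}"""

def build_intent_prompt(query: str) -> str:
    """Fill the classification prompt with the user query."""
    return INTENT_PROMPT.format(query=query)
```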
Layer 3a (25%): SQL Schema Match
- Inverted index on tables/columns
- Automatic JOIN inference
- Confidence scoring
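The inverted index maps table and column tokens back to candidate tables, so schema matching is a set lookup per query word. A minimal sketch, with an assumed two-table schema:

```python
# Assumed schema shape: table name -> column names.
SCHEMA = {
    "employees": ["id", "name", "department", "salary"],
    "departments": ["id", "name", "manager_id"],
}

def build_inverted_index(schema: dict) -> dict:
    """Map each table/column token to the set of tables it belongs to."""
    index = {}
    for table, columns in schema.items():
        for token in [table, *columns]:
            index.setdefault(token.lower(), set()).add(table)
    return index

def match_tables(query: str, index: dict) -> set:
    """Return tables whose name or columns appear as words in the query."""
    words = set(query.lower().replace("?", "").split())
    return set().union(*(index.get(w, set()) for w in words)) if words else set()
```

Tables that share a matched token (e.g. a foreign-key column) are natural candidates for JOIN inference.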
Layer 3b (25%): KG Entity Link
- Entity mention extraction
- SPARQL endpoint lookup
- Predicate pattern matching
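Predicate pattern matching can be sketched as a table of surface patterns mapped to KG predicates; the patterns and predicate names below are illustrative (DBpedia-style), not the system's actual mapping:

```python
import re

# Illustrative surface-pattern -> predicate mapping (DBpedia-style names).
PREDICATE_PATTERNS = [
    (re.compile(r"\bfounder of\b|\bfounded\b", re.I), "dbo:founder"),
    (re.compile(r"\bcapital of\b", re.I), "dbo:capital"),
    (re.compile(r"\bborn in\b", re.I), "dbo:birthPlace"),
]

def match_predicates(query: str) -> list:
    """Return KG predicates whose surface pattern appears in the query."""
    return [pred for pat, pred in PREDICATE_PATTERNS if pat.search(query)]
```

A non-empty match is evidence for routing to the KG engine; the extracted entity mentions are then looked up against the SPARQL endpoint.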
Layer 3c (25% weight + 30% boost): Entity Verification
- Cross-source entity existence check
- 30% score boost for verified entities
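The 30% boost is a multiplicative bump applied only to sources whose entities were verified, for example:

```python
def boost_verified(scores: dict, verified: set, boost: float = 0.30) -> dict:
    """Apply the 30% score boost to sources whose entities were verified."""
    return {src: s * (1 + boost) if src in verified else s
            for src, s in scores.items()}
```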
Layer 4: Multi-Source Fusion
- Weighted aggregation
- Execution plan generation
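The weighted aggregation combines per-layer, per-source scores using the layer weights above (rule 15%, LLM 35%, schema 25%, entity 25%); the verification boost is applied separately. A minimal sketch:

```python
# Layer weights from the detection architecture above.
LAYER_WEIGHTS = {"rule": 0.15, "llm": 0.35, "schema": 0.25, "entity": 0.25}

def fuse_scores(layer_scores: dict) -> dict:
    """Weighted sum of per-layer, per-source scores -> final routing score."""
    fused = {}
    for layer, weight in LAYER_WEIGHTS.items():
        for src, score in layer_scores.get(layer, {}).items():
            fused[src] = fused.get(src, 0.0) + weight * score
    return fused
```

The highest-scoring source wins the route; for multi-hop queries spanning sources, the fused scores instead drive decomposition into an execution plan.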
1. Schema/Entity Linking → Identify relevant tables/columns/entities
2. Parallel Generation → Generate 3 candidates concurrently
3. Multi-Round Revision → 2 rounds of self-review
4. Validation → Syntax and semantic checks
5. Voting → Select best candidate
6. Execution → Run query
7. Result Verification → Validate reasonableness
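The voting step (5) can be sketched as majority voting over normalized candidates; the whitespace/semicolon normalization here is an illustrative assumption:

```python
from collections import Counter

def vote(candidates: list) -> str:
    """Pick the most common candidate after light normalization
    (collapse whitespace, drop trailing semicolons); ties go to the
    first candidate seen."""
    normalized = [" ".join(c.split()).rstrip(";") for c in candidates]
    return Counter(normalized).most_common(1)[0][0]
```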
```python
from src.engines.nl2sql.multi_stage_engine import MultiStageNL2SQLEngine

engine = MultiStageNL2SQLEngine({
    "name": "sql_engine",
    "schema": schema,
    "llm_config": {
        "model": "deepseek-chat",
        "api_key": "sk-...",
    },
    "generation_config": {
        "num_candidates": 3,
        "max_revisions": 2,
        "parallel_generation": True,
    },
})

result = await engine.execute("How many employees in Engineering?", {})
```
```python
from src.engines.nl2sparql.multi_stage_engine import MultiStageNL2SPARQLEngine

engine = MultiStageNL2SPARQLEngine({
    "name": "sparql_engine",
    "endpoint_url": "https://dbpedia.org/sparql",
    "ontology": ontology,
    "llm_config": {"model": "gpt-4", "api_key": "sk-..."},
})

result = await engine.execute("Who founded Microsoft?", {})
```
```python
from src.engines.table_qa.multi_stage_engine import MultiStageTableQAEngine

engine = MultiStageTableQAEngine({
    "name": "table_engine",
    "table_path": "data/sales.csv",
    "llm_config": {"model": "deepseek-chat", "api_key": "sk-..."},
})

result = await engine.execute("Which quarter had highest sales?", {})
```
Override the model and API key at runtime on a per-call basis:
```python
# Initialize with a default provider
engine = MultiStageNL2SQLEngine({
    "llm_config": {"model": "deepseek-chat", "api_key": "sk-deepseek-key"},
})

# Override per-call
result = await engine.execute(
    query="Complex query",
    context={},
    model="gpt-4-turbo",      # Override model
    api_key="sk-openai-key",  # Override API key
)
```
| Provider | Models | Configuration |
|---|---|---|
| DeepSeek | deepseek-chat | base_url: https://api.deepseek.com/v1 |
| OpenAI | gpt-4, gpt-3.5-turbo | Default endpoint |
| Azure OpenAI | gpt-4 | base_url: https://{resource}.openai.azure.com |
| Local (Ollama) | llama2, mistral | base_url: http://localhost:11434/v1 |
```yaml
llm_config:
  model: deepseek-chat
  api_key: sk-...
  base_url: https://api.deepseek.com/v1  # Optional
  temperature: 0.1
  max_tokens: 500
  timeout: 30

generation_config:
  num_candidates: 3          # SQL/SPARQL candidates to generate
  max_revisions: 2           # Self-revision rounds
  parallel_generation: true  # Concurrent candidate generation
  voting_enabled: true       # Multi-candidate voting
```