Build a new OpenMetadata connector from scratch — scaffold JSON Schema, Python boilerplate, and AI context using schema-first architecture with code generation across Python, Java, TypeScript, and auto-rendered UI forms.
When a user asks to build, create, add, or scaffold a new connector, source, or integration for OpenMetadata.
One JSON Schema definition cascades through 6 layers: Python Pydantic models, Java models, UI forms (RJSF auto-render), API validation, test fixtures, and documentation. Define the schema once — everything else is generated or guided.
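As an illustration of the cascade, a connection schema with properties like `hostPort` and `username` is turned by `make generate` (datamodel-code-generator) into a Pydantic model roughly shaped like the sketch below. The model and field names here are hypothetical, not the real generated output:

```python
# Hypothetical sketch of a generated connection model: the same schema that
# produces this class also drives the Java model, the RJSF UI form, and API
# validation. Field names are illustrative.
from typing import Optional
from pydantic import BaseModel, Field

class MyDbConnection(BaseModel):
    hostPort: str = Field(..., description="Host and port of the MyDB service")
    username: Optional[str] = Field(None, description="Login username")
    database: Optional[str] = Field(None, description="Default database")

conn = MyDbConnection(hostPort="localhost:5432", username="admin")
```

Required fields in the schema become required constructor arguments here and required inputs in the auto-rendered UI form.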
Before any make or python commands, set up the environment from the repo root:
```shell
python3.11 -m venv env
source env/bin/activate
make install_dev generate
```
Always activate before running commands: `source env/bin/activate`
Run the scaffold CLI to collect inputs and generate files:
```shell
source env/bin/activate
metadata scaffold-connector
```
Interactive mode collects: connector name, service type, connection type, auth types, capabilities, docs URL, SDK package, API endpoints, implementation notes, Docker image, container port.
Non-interactive mode:
```shell
metadata scaffold-connector \
  --name my_db \
  --service-type database \
  --connection-type sqlalchemy \
  --scheme "mydb+pymydb" \
  --auth-types basic \
  --capabilities metadata lineage usage profiler \
  --docs-url "https://docs.example.com/api" \
  --sdk-package "mydb-sdk" \
  --docker-image "mydb/mydb:latest" \
  --docker-port 5432
```
Output: JSON Schema + test connection JSON + Python files + CONNECTOR_CONTEXT.md as an AI working document. SQLAlchemy database connectors get concrete code templates; all others get skeleton files with pointers to reference connectors.
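For a SQLAlchemy connector, the generated template centers on building the connection URL from the schema fields. A simplified sketch of that step, using the `--scheme` value from the example above (the helper name and field list are illustrative, not the real template):

```python
# Illustrative URL builder like the one the SQLAlchemy template scaffolds:
# assemble a SQLAlchemy connection URL from connection-schema fields,
# escaping credentials so special characters survive.
from urllib.parse import quote_plus

def build_connection_url(scheme, username, password, host_port, database=None):
    url = f"{scheme}://{quote_plus(username)}:{quote_plus(password)}@{host_port}"
    if database:
        url += f"/{database}"
    return url

url = build_connection_url("mydb+pymydb", "admin", "p@ss", "localhost:5432", "sales")
```

The `@` in the password is percent-encoded, so the resulting URL parses correctly.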
CONNECTOR_CONTEXT.md handling: The scaffold generates CONNECTOR_CONTEXT.md in the connector directory as a working document for any AI tool (Claude Code, Cursor, Codex, Copilot, Windsurf). It is gitignored — it stays local and is never committed to the repo. No cleanup needed.
The scaffold classifies along 3 dimensions. Verify the choices:
Dimension 1 — Service Type (determines directory + base class):
| Service Type | Base Class | Reference |
|---|---|---|
| database | CommonDbSourceService | mysql/ |
| dashboard | DashboardServiceSource | metabase/ |
| pipeline | PipelineServiceSource | airflow/ |
| messaging | MessagingServiceSource | kafka/ |
| mlmodel | MlModelServiceSource | mlflow/ |
| storage | StorageServiceSource | s3/ |
| search | SearchServiceSource | elasticsearch/ |
| api | ApiServiceSource | rest/ |
Dimension 2 — Connection Type (database only):
- `sqlalchemy` → `BaseConnection[Config, Engine]` + SQLAlchemy dialect
- `rest_api` → `get_connection()` + custom REST client (ref: salesforce/)
- `sdk_client` → `get_connection()` + vendor SDK wrapper

Dimension 3 — Capabilities (determines extra files):
metadata (always), lineage, usage, profiler, stored_procedures, data_diff
Read the source-type-specific standard at ${CLAUDE_SKILL_DIR}/standards/source_types/{service_type}.md for detailed patterns.
Read the CONNECTOR_CONTEXT.md generated by the scaffold. Then research the source's API/SDK.
If you can dispatch sub-agents (Claude Code): Launch a connector-researcher agent:
Agent: openmetadata-skills:connector-researcher
Prompt: "Research {source_name} for an OpenMetadata {service_type} connector.
Find: API docs, auth methods, key endpoints, pagination, rate limits, SDK packages."
If you cannot dispatch sub-agents: Perform the research yourself using WebSearch and WebFetch.
The scaffold generates files with # TODO markers. Read the relevant standards before implementing:
- `${CLAUDE_SKILL_DIR}/standards/connection.md` — Connection patterns
- `${CLAUDE_SKILL_DIR}/standards/patterns.md` — Error handling, pagination, auth
- `${CLAUDE_SKILL_DIR}/standards/performance.md` — Pagination, lookup optimization, anti-patterns
- `${CLAUDE_SKILL_DIR}/standards/memory.md` — Memory management, streaming, OOM prevention
- `${CLAUDE_SKILL_DIR}/standards/source_types/{service_type}.md` — Service-specific patterns

SQLAlchemy database: Templates are mostly complete. Customize `_get_client()` if needed.
Non-SQLAlchemy: Study the reference connector, then implement each skeleton file.
Critical for JSON Schema:
- Mark auth fields (`username`, `password`, `token`) required when the service needs authentication by default. If omitting a field means an opaque 401 at runtime, make it required so the UI validates upfront.
- Include SSL/TLS settings (`verifySSL` + `sslConfig` `$ref`) for any connector that communicates over HTTPS — enterprise deployments use internal CAs.
- Wire SSL verification through `connection.py` (resolve with `get_verify_ssl_fn`) → `client.py` (`session.verify = verify_ssl`). Missing wiring triggers a SonarQube Security Review failure.
- Read `${CLAUDE_SKILL_DIR}/standards/schema.md` for the `$ref` patterns and required-fields guidance.

Critical for Pydantic API models (models.py):
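A minimal demonstration of the `populate_by_name` requirement (the model and field names are illustrative, not from a real connector):

```python
# Without populate_by_name=True, a model with aliases can only be built
# from the API payload keys; constructing it with Python attribute names
# (as tests usually do) raises ValidationError.
from pydantic import BaseModel, ConfigDict, Field

class Dashboard(BaseModel):
    model_config = ConfigDict(populate_by_name=True)
    dashboard_id: str = Field(alias="id")
    display_name: str = Field(alias="name")

# Parse an API payload via aliases, or construct via attribute names:
from_api = Dashboard.model_validate({"id": "1", "name": "Sales"})
from_code = Dashboard(dashboard_id="1", display_name="Sales")
```

Both construction paths yield the same model instance, which is what keeps unit tests and ingestion code interchangeable.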
- Set `model_config = ConfigDict(populate_by_name=True)` when using `Field(alias=...)` — without this, constructing instances with Python attribute names raises `ValidationError`.
- Read `${CLAUDE_SKILL_DIR}/standards/code_style.md` for the full pattern.

Critical for non-database connectors (client.py):
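For example, a list-endpoint wrapper can expose pagination as a generator so the source never materializes the full entity list. A sketch with a stubbed fetch function (names are hypothetical, not the real client API):

```python
# Generic offset/limit paginator: yields items page by page and stops when
# the server returns a short or empty page.
from typing import Callable, Dict, Iterator, List

def paginate(
    fetch_page: Callable[[int, int], List[Dict]], page_size: int = 100
) -> Iterator[Dict]:
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        yield from page
        if len(page) < page_size:
            break
        offset += page_size

# Usage with a stubbed endpoint standing in for the real API call:
data = [{"id": i} for i in range(250)]
fake_fetch = lambda offset, limit: data[offset : offset + limit]
```

Wrapping pagination once in the client keeps every topology method streaming instead of loading all entities up front.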
- Read `${CLAUDE_SKILL_DIR}/standards/performance.md` for correct patterns and anti-patterns.

Critical for storage connectors and any connector that reads files:
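A sketch of the size-guarded, streaming read this implies — the threshold and chunk size below are arbitrary illustrations, not values from the standard:

```python
# Guarded streaming read: skip oversized files and yield fixed-size chunks
# instead of calling .read() on the whole file at once.
import os

MAX_BYTES = 100 * 1024 * 1024   # illustrative cap: skip files above ~100 MB
CHUNK_BYTES = 1024 * 1024       # stream 1 MB at a time

def iter_file_chunks(path: str):
    if os.path.getsize(path) > MAX_BYTES:
        return  # too large: skip instead of risking OOM
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_BYTES):
            yield chunk
```

Callers consume the generator and release each chunk as they go, keeping peak memory bounded by the chunk size rather than the file size.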
- Never `.read()` entire files without a size check — it causes OOM on production instances.
- Use the dataframe readers (`metadata/readers/dataframe/`) for data files.
- `del` large objects after processing and call `gc.collect()`.
- Read `${CLAUDE_SKILL_DIR}/standards/memory.md` for correct patterns.

Critical for lineage:
- Never use `table_name="*"` in search queries — this links every table in a database to each entity, producing incorrect lineage.
- Read `${CLAUDE_SKILL_DIR}/standards/lineage.md` for correct patterns.

Read `${CLAUDE_SKILL_DIR}/standards/registration.md` for detailed instructions. Summary:
| Step | File | Change |
|---|---|---|
| 1 | openmetadata-spec/.../entity/services/{serviceType}Service.json | Add to type enum + connection oneOf |
| 2 | openmetadata-ui/.../utils/{ServiceType}ServiceUtils.tsx | Import schema + add switch case |
| 3 | openmetadata-ui/.../locale/languages/ | Add i18n display name keys |
This step is mandatory — always run it before committing. Ensure the Python environment is set up:
```shell
# Ensure environment is active and tools are installed
source env/bin/activate
pip install -e ".[dev]" 2>/dev/null || make install_dev

# Generate models from schemas
make generate                            # Python Pydantic models
mvn clean install -pl openmetadata-spec  # Java models
cd openmetadata-ui/src/main/resources/ui && yarn parse-schema  # UI schemas

# Format ALL code (mandatory before commit)
cd /path/to/repo/root
make py_format      # black + isort + pycln
mvn spotless:apply  # Format Java
```
If `make py_format` fails: the most common cause is missing dev dependencies. Run `make install_dev` first, then retry.
Never skip formatting — unformatted code will fail CI.
Run the static analyzer as a self-check before submitting:
```shell
python skills/connector-review/scripts/analyze_connector.py {service_type} {name}
```
Fix any issues it reports. Then verify the full checklist:
[ ] JSON Schema: validates, $ref resolves, supports* flags correct
[ ] JSON Schema: auth fields required when service mandates authentication
[ ] JSON Schema: SSL/TLS config included for HTTPS connectors
[ ] Code gen: make generate + mvn install + yarn parse-schema succeed
[ ] Connection: creates client, test_connection passes all steps
[ ] Source: create() validates config type, ServiceSpec is discoverable
[ ] Pydantic models: populate_by_name=True on all aliased models
[ ] Client: all list endpoints paginate (check API docs for pagination support)
[ ] Client: dict lookups in prepare(), not list iteration per entity
[ ] Lineage: no wildcard table_name="*" — skip if no table-level info available
[ ] Tests: unit + connection integration + metadata integration pass (no empty stubs)
[ ] Formatting: make py_format + mvn spotless:apply pass with no changes
[ ] Cleanup: CONNECTOR_CONTEXT.md is gitignored (verify it's not staged)
[ ] Cleanup: no leftover TODO scaffolding comments
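The `prepare()`/dict-lookup checklist item can be illustrated with a minimal sketch (the class, client, and field names are hypothetical):

```python
# Build lookup dicts once in prepare() instead of scanning a list for every
# entity: one O(n) pass up front replaces O(n) work per entity later.
class MySource:
    def __init__(self, client):
        self.client = client
        self._owner_by_id = {}

    def prepare(self):
        # Single pass over the users endpoint, keyed for O(1) lookups
        self._owner_by_id = {u["id"]: u for u in self.client.list_users()}

    def owner_for(self, dashboard):
        return self._owner_by_id.get(dashboard["owner_id"])

class FakeClient:
    def list_users(self):
        return [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]

src = MySource(FakeClient())
src.prepare()
```

The same pattern applies to any cross-entity reference (owners, tags, parent containers) resolved during ingestion.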
Build everything and bring up a full local OpenMetadata stack with Docker:
Full build (first time or after Java/UI changes):
```shell
./docker/run_local_docker.sh -m ui -d mysql -s false -i true -r true
```
Fast rebuild (ingestion-only changes, ~2-3 minutes):
```shell
./docker/run_local_docker.sh -m ui -d mysql -s true -i true -r false
```
Once services are up (~3-5 minutes):
Other service URLs:
Tear down: `cd docker/development && docker compose down -v`
Troubleshooting:
- Java/UI changes not showing: you likely ran with `-s true` (skip build) — rerun the full build with `-s false`.
- Test connection failing: verify the `test_fn` keys match the test connection JSON step names.
- Ingestion problems: check logs with `docker compose -f docker/development/docker-compose.yml logs ingestion`

When creating a PR for the connector, include the review summary in the PR description so reviewers see the quality assessment upfront:
```shell
# Run the static analyzer
analysis=$(python skills/connector-review/scripts/analyze_connector.py {service_type} {name} --json)

# Create PR with quality summary in description
gh pr create --title "feat(ingestion): Add {Name} {service_type} connector" --body "$(cat <<'EOF'
## Summary
- New {service_type} connector for {Name}
- Capabilities: {list capabilities}

## Test plan
- [ ] Unit tests pass (`pytest ingestion/tests/unit/topology/{service_type}/test_{name}.py`)
- [ ] Integration tests pass
- [ ] Local Docker test: connector appears in UI, test connection passes

## Connector Quality Review
**Verdict**: {VERDICT} | **Score**: {SCORE}/10

| Category | Score |
|----------|-------|
| Schema & Registration | X/10 |
| Connection & Auth | X/10 |
| Source, Topology & Performance | X/10 |
| Test Quality | X/10 |
| Code Quality & Style | X/10 |

**Blockers**: 0 | **Warnings**: {count} | **Suggestions**: {count}

<details>
<summary>Static analysis output</summary>

{paste analyze_connector.py output here}
</details>

🤖 Generated with [Claude Code](https://claude.com/claude-code)
EOF
)"
```
The quality summary gives maintainers confidence about the connector's state without needing to review every file manually.
All standards are in ${CLAUDE_SKILL_DIR}/standards/:
| Standard | Content |
|---|---|
| main.md | Architecture overview, connector anatomy, service types |
| patterns.md | Error handling, logging, pagination, auth, filters |
| testing.md | Unit test patterns, integration tests, pytest style |
| code_style.md | Python style, JSON Schema conventions, naming |
| schema.md | Connection schema patterns, $ref usage, test connection JSON |
| connection.md | BaseConnection vs function patterns, SSL, client wrapper |
| service_spec.md | DefaultDatabaseSpec vs BaseSpec |
| registration.md | Service enum, UI utils, i18n |
| performance.md | Pagination, batching, rate limiting |
| memory.md | Memory management, streaming, OOM prevention |
| lineage.md | Lineage extraction methods, dialect mapping, query logs |
| sql.md | SQLAlchemy patterns, URL building, auth, multi-DB |
| source_types/*.md | Service-type-specific patterns |
Architecture guides in ${CLAUDE_SKILL_DIR}/references/:
| Reference | Content |
|---|---|
| architecture-decision-tree.md | Service type, connection type, base class selection |
| connection-type-guide.md | SQLAlchemy vs REST API vs SDK client |
| capability-mapping.md | Capabilities by service type, schema flags, generated files |