Connect to and use the shared MedGemma HF Inference Endpoint (google/medgemma-1-5-4b-it-hae, multimodal medical AI). Use when any project needs to: (1) call the MedGemma endpoint for medical text/image extraction or clinical reasoning, (2) integrate MedGemma into a new service or pipeline, (3) debug MedGemma connection issues (503 cold-start, 404 chat/completions). Triggers: "medgemma", "medical extraction", "HF endpoint", "patient profile extraction", "clinical criterion evaluation".
- **Endpoint URL:** https://pcmy7bkqtqesrrzd.us-east-1.aws.endpoints.huggingface.cloud
- **Model:** google/medgemma-1-5-4b-it-hae (4B params, multimodal)
- **Auth:** `HF_TOKEN` env var required
- **Known issue:** `/v1/chat/completions` returns 404; use `text_generation()` with a manual Gemma chat template.

Install the dependency:
```shell
uv add huggingface_hub
```
Copy the client into your project:
```shell
cp ~/.claude/skills/medgemma-endpoint/scripts/medgemma_client.py your_project/services/
```
Usage:

```python
import os

from your_project.services.medgemma_client import MedGemmaClient

client = MedGemmaClient(
    endpoint_url="https://pcmy7bkqtqesrrzd.us-east-1.aws.endpoints.huggingface.cloud",
    hf_token=os.environ["HF_TOKEN"],
)

# Health check
status = await client.health_check()

# Generate
result = await client.generate(
    messages=[
        {"role": "system", "content": "You are a medical assistant."},
        {"role": "user", "content": "Summarize this pathology report: ..."},
    ],
    max_tokens=2048,
)

# Parse JSON output (handles markdown-wrapped responses)
data = MedGemmaClient.parse_json(result)
```
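Models often wrap JSON output in a markdown code fence, which is why `parse_json` exists. A minimal sketch of what such a parser might look like (the real implementation ships in `medgemma_client.py`; the fence-stripping regex below is an assumption, not the client's actual code):

```python
import json
import re


def parse_json_sketch(raw: str) -> dict:
    """Extract JSON from a model response that may be wrapped in ```json fences."""
    # Strip an optional markdown code fence around the payload (assumed format).
    match = re.search(r"```(?:json)?\s*(.*?)\s*```", raw, re.DOTALL)
    payload = match.group(1) if match else raw.strip()
    return json.loads(payload)
```

Both fenced and bare responses deserialize the same way, so callers never need to inspect the raw string.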
The Gemma chat template is applied automatically by `format_gemma_prompt()`. Manual format:

```
<start_of_turn>user
[system prompt here]
[user message here]<end_of_turn>
<start_of_turn>model
```
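A sketch of how `format_gemma_prompt()` might assemble that template from OpenAI-style messages. Gemma has no system role, so folding the system prompt into the first user turn is an assumption about the client's behavior, not a documented guarantee:

```python
def format_gemma_prompt(messages: list[dict]) -> str:
    """Render OpenAI-style messages into the Gemma chat template (sketch)."""
    system = "\n".join(m["content"] for m in messages if m["role"] == "system")
    parts = []
    for m in messages:
        if m["role"] == "user":
            content = f"{system}\n{m['content']}" if system else m["content"]
            parts.append(f"<start_of_turn>user\n{content}<end_of_turn>")
            system = ""  # fold the system prompt into the first user turn only
        elif m["role"] == "assistant":
            parts.append(f"<start_of_turn>model\n{m['content']}<end_of_turn>")
    # Trailing cue so the model generates the next turn.
    parts.append("<start_of_turn>model")
    return "\n".join(parts)
```

The resulting string is what gets passed to `text_generation()`, since the endpoint's chat-completions route returns 404.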
Override defaults via the constructor:

```python
client = MedGemmaClient(
    endpoint_url="...",
    hf_token="...",
    max_retries=6,
    retry_backoff=2.0,
    max_wait=60.0,
    cold_start_timeout=60.0,
)
```
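The retry parameters drive exponential backoff against 503 cold-start responses. A sketch of the wait schedule those defaults imply (helper names are hypothetical; the shipped client's loop may differ):

```python
import asyncio


def backoff_delays(max_retries: int, retry_backoff: float, max_wait: float) -> list[float]:
    """Exponential backoff schedule: retry_backoff ** attempt, capped at max_wait."""
    return [min(retry_backoff ** attempt, max_wait) for attempt in range(max_retries)]


async def call_with_retry(fn, max_retries=6, retry_backoff=2.0, max_wait=60.0):
    """Retry an async call, sleeping per the schedule above between attempts."""
    delays = backoff_delays(max_retries, retry_backoff, max_wait)
    for attempt, delay in enumerate(delays):
        try:
            return await fn()
        except Exception:  # in practice, catch only 503 / cold-start errors
            if attempt == len(delays) - 1:
                raise
            await asyncio.sleep(delay)
```

With the defaults, the waits are 1, 2, 4, 8, 16, and 32 seconds, comfortably inside a typical endpoint cold start.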
- `scripts/medgemma_client.py`: drop-in async client with retry logic and Gemma template formatting. Copy into any project.
- `references/endpoint-details.md`: full endpoint specs, error codes, env var reference, and common errors table.