Run and evaluate LLM science benchmarks across free Argonne endpoints. Covers GPQA Diamond (198 graduate QA), LitQA2 (199 literature QA), BixBench (296 bioinformatics MCQ), BioPro PQA (1200 biology protocol QA), AAAR (1049 equation inference), DiscoveryBench (239 agentic discovery), GeoBenchX (202 geospatial tool-calling), LLM-SRBench (239 equation discovery), and CORE-Bench (270 computational reproducibility). Use when evaluating models on science tasks, comparing model performance, or adding new models to benchmark tables.
Evaluate LLMs on scientific reasoning across 8+ benchmarks using free Argonne compute endpoints (Argo proxy, ALCF Sophia, CELS vLLM). Produces LaTeX tables and comparison dashboards.
| Endpoint | URL | Key | Models |
|---|---|---|---|
| Argo Proxy | http://127.0.0.1:44497/v1 (tunneled) | stevens | 55 models (GPT-5.x, Claude 4.x, Gemini 2.5, O3/O4) |
| ALCF Sophia | http://127.0.0.1:8890/v1 (tunneled) | stevens | 12 models (Llama, Qwen, Gemma, Mixtral) |
| CELS vLLM | Various public IPs :80/v1 | CELS | OSS-120B, Llama70, Trinity, Gemma4, K2.5 |
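All three endpoints speak the OpenAI-compatible API, so a quick stdlib-only connectivity check can list the models an endpoint serves. This is a sketch (helper names are hypothetical; URLs and keys as in the table above, SSH tunnels assumed up):

```python
import json
import urllib.request


def models_url(base_url: str) -> str:
    """Normalize a base URL like http://127.0.0.1:44497/v1 into its /models route."""
    return base_url.rstrip("/") + "/models"


def list_models(base_url: str, api_key: str) -> list[str]:
    """Query an OpenAI-compatible endpoint's /models route and return model ids."""
    req = urllib.request.Request(
        models_url(base_url),
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        payload = json.load(resp)
    return [m["id"] for m in payload.get("data", [])]
```

Useful as a smoke test before launching a multi-hour benchmark run: if `list_models("http://127.0.0.1:44497/v1", "stevens")` fails, the tunnel is down.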
**GPQA Diamond.** Graduate-level science QA. Uses `gpqa_eval_v2.py`.
```bash
python3 gpqa_eval_v2.py --model gpt-5.4 --base-url http://127.0.0.1:44497/v1 \
  --api-key stevens --output GPQA-gpt54-cot.json --mode cot --max-tokens 100000 --timeout 300
```
Key lessons:
- Set `max_tokens >= 100000` — reasoning tokens count against the limit
- Strip `<think>` tags before answer extraction for thinking models
- Use `temperature=0`

**LitQA2.** Literature-based biology QA. Uses `bench_runner.py`.
**BixBench.** Bioinformatics multiple choice. Some models (Maverick, Mixtral) score 0% due to answer-format mismatch.
**BioPro PQA.** Biology protocol QA. Use `score_bioprob_robust.py` for extraction — the official scorer only accepts the `[ANSWER_START]...[ANSWER_END]` format.
**AAAR.** Equation inference. Use `score_aaar_robust.py` — the original harness stored the full reasoning text as the `"prediction"` instead of extracting the A/B/C/D letter.
**DiscoveryBench.** Agentic discovery tasks, scored with GPT-4.1 as judge. Use `score_discoverybench.py` via Argo.
**GeoBenchX.** Multi-step agentic benchmark with 23 geospatial tools. Requires env vars:
```bash
export STATDATAPATH=data/Data/StatData
export GEODATAPATH=data/Data/GeoData
export SCRATCHPATH=scratch
export OPENAI_API_BASE=http://127.0.0.1:44497/v1
export OPENAI_API_KEY=stevens
```
Evaluation uses LLM-as-judge (GPT-4.1) with 3-point scoring (0=no match, 1=partial, 2=full).
Key findings: Frontier models (GPT-5.4: 64%, Claude Opus: 61%) dominate. Open models score ~2%.
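The 3-point judge scheme maps naturally onto a headline percentage (sum of scores over the maximum attainable). A sketch, with hypothetical helper names and assumed prompt wording (the benchmark's own judge prompt and aggregation may differ):

```python
def build_judge_prompt(question: str, reference: str, candidate: str) -> str:
    """Assemble an LLM-as-judge prompt (wording is an assumption, not the benchmark's)."""
    return (
        "Grade the candidate answer against the reference.\n"
        f"Question: {question}\n"
        f"Reference: {reference}\n"
        f"Candidate: {candidate}\n"
        "Reply with a single digit: 0 = no match, 1 = partial, 2 = full."
    )


def aggregate(scores: list[int]) -> float:
    """Percentage of the maximum attainable judge score (each item is worth 2)."""
    assert all(s in (0, 1, 2) for s in scores)
    return 100.0 * sum(scores) / (2 * len(scores)) if scores else 0.0
```

Under this aggregation, a run of all-partial matches scores 50%, which is worth keeping in mind when comparing against binary-accuracy benchmarks.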
**LLM-SRBench.** Iterative LLM-guided symbolic regression. Very slow (~1 problem/hr).
- Gated dataset (`nnheui/llm-srbench`) — need to accept access terms
- Add `multiprocessing.set_start_method("fork", force=True)` at the top of `eval.py`
- Model config needs `api_type: "openai"`, `api_url`, `api_model`

**CORE-Bench.** Computational reproducibility. Requires Docker; run on machines with Docker (e.g., DGX Spark).
- Install `openai` in the Dockerfile
- Use `network_mode="host"` for Tailscale endpoint access from inside Docker

| Script | Benchmark | Notes |
|---|---|---|
| `gpqa_eval_v2.py` | GPQA | CoT + direct modes, random guess fallback |
| `score_aaar_robust.py` | AAAR | Letter extraction from verbose reasoning |
| `score_bioprob_robust.py` | BioPro | Official + fallback extraction |
| `score_discoverybench.py` | DiscBench | LLM-as-judge via Argo |
| `rescore_labbench_v2.py` | LitQA2 | Accuracy computation |
Results table: `benchmark_table.tex` (landscape, booktabs, sorted by GPQA).
Compile: `/Library/TeX/texbin/pdflatex benchmark_table.tex`
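For reference, the general shape of such a table (a hand-written sketch with scores omitted; the real `benchmark_table.tex` is generated by the eval scripts and its exact columns may differ):

```latex
\documentclass{article}
\usepackage{booktabs}
\usepackage{pdflscape}
\begin{document}
\begin{landscape}
\begin{tabular}{lrrrr}
\toprule
Model & GPQA & LitQA2 & BixBench & GeoBenchX \\
\midrule
gpt-5.4     & -- & -- & -- & -- \\
claude-opus & -- & -- & -- & -- \\
\bottomrule
\end{tabular}
\end{landscape}
\end{document}
```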
ALCF token refresh: `refresh_alcf_token.sh`