Profile and analyze circe-sanely-auto macro expansion performance. Use this skill whenever investigating compile time, macro performance, slow derivation, or optimizing the macro engine. Also use when the user says "profile", "why is compilation slow", "macro timing", "what's taking so long", "optimize macros", or wants to understand where time is spent during codec derivation. This skill covers the full workflow: running profiled compilation, analyzing results with the bundled Python script, and interpreting the output to plan optimizations.
circe-sanely-auto has built-in compile-time profiling gated by the
`SANELY_PROFILE=true` environment variable. When enabled, each macro
expansion prints timing data to stderr. Profiling has zero cost when disabled.
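The zero-cost gating works by checking the environment variable once and routing timing calls through a no-op path when it is off. Here is a minimal sketch of that pattern, in Python rather than the actual Scala `MacroTimer`; all names besides `SANELY_PROFILE` are illustrative:

```python
import os
import time

# Checked once at startup; mirrors the SANELY_PROFILE gate described above.
ENABLED = os.environ.get("SANELY_PROFILE") == "true"

class Timer:
    """Illustrative per-category accumulator, not the real MacroTimer."""
    def __init__(self):
        self.totals = {}  # category -> accumulated nanoseconds
        self.counts = {}  # category -> number of timed calls

    def time(self, category, thunk):
        if not ENABLED:
            return thunk()  # disabled path: no clock reads, no bookkeeping
        start = time.perf_counter_ns()
        try:
            return thunk()
        finally:
            elapsed = time.perf_counter_ns() - start
            self.totals[category] = self.totals.get(category, 0) + elapsed
            self.counts[category] = self.counts.get(category, 0) + 1

timer = Timer()
result = timer.time("summonIgnoring", lambda: sum(range(1000)))
```

The point of the pattern is that when the flag is unset, the only cost per timed call is one boolean check.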
```bash
# Auto derivation benchmark (~300 types)
mkdir -p results/macro-profile-auto
rm -rf out/benchmark/sanely
SANELY_PROFILE=true ./mill --no-server benchmark.sanely.compile 2>&1 | tee results/macro-profile-auto/raw.txt

# Configured derivation benchmark (~230 types)
mkdir -p results/macro-profile-configured
rm -rf out/benchmark-configured/sanely
SANELY_PROFILE=true ./mill --no-server benchmark-configured.sanely.compile 2>&1 | tee results/macro-profile-configured/raw.txt
```
Use `--no-server` so Mill doesn't reuse a cached compilation.
```bash
# Full report
python .claude/skills/macro-profile/scripts/analyze_profile.py results/macro-profile-auto/raw.txt

# Top 20 slowest, sorted by summonIgnoring time
python .claude/skills/macro-profile/scripts/analyze_profile.py results/macro-profile-auto/raw.txt --top 20 --sort summonIgnoring

# Only Decoder expansions
python .claude/skills/macro-profile/scripts/analyze_profile.py results/macro-profile-auto/raw.txt --kind Decoder

# JSON output for programmatic use
python .claude/skills/macro-profile/scripts/analyze_profile.py results/macro-profile-auto/raw.txt --json
```
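The `--json` output is meant for programmatic use, so it can be post-processed directly. A hedged sketch of such post-processing (the field names `expansions`, `type`, `kind`, and `totalMs` are assumptions for illustration; check the script's actual JSON schema before relying on them):

```python
import json

# Hypothetical sample mimicking the --json report's structure;
# field names are illustrative, not confirmed from the script.
sample = json.loads("""
{"expansions": [
  {"type": "Order",   "kind": "Decoder", "totalMs": 41.2},
  {"type": "Invoice", "kind": "Encoder", "totalMs": 12.7},
  {"type": "Order",   "kind": "Encoder", "totalMs": 18.9}
]}
""")

# Slowest expansions first, mirroring what `--top N --sort` reports
slowest = sorted(sample["expansions"], key=lambda e: e["totalMs"], reverse=True)
for e in slowest[:2]:
    print(f'{e["kind"]}[{e["type"]}]: {e["totalMs"]}ms')
# prints:
#   Decoder[Order]: 41.2ms
#   Encoder[Order]: 18.9ms
```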
The report shows:
| Section | What it tells you |
|---|---|
| By Kind | Encoder vs Decoder vs CfgEncoder vs CfgDecoder breakdown |
| Category Breakdown | Where time is spent across all expansions |
| Top N Slowest | Which types are the compilation bottlenecks |
| Optimization Insights | Actionable recommendations based on the data |
Each macro expansion (one per `Encoder[T]` or `Decoder[T]` derivation) tracks:

| Category | What it measures |
|---|---|
| `summonIgnoring` | `Expr.summonIgnoring[Encoder/Decoder[T]]` - compiler implicit search |
| `summonMirror` | `Expr.summon[Mirror.Of[T]]` - fetching the type's Mirror |
| `derive` | Recursive `deriveProduct`/`deriveSum` calls - AST construction |
| `subTraitDetect` | Checking whether a sum type variant is itself a sub-trait |
| `cacheHit` | Times the intra-expansion cache avoided re-derivation |
Profiling code lives in these files:

- `sanely/src/sanely/MacroTimer.scala` - Timer utility (zero-cost when disabled)
- `sanely/src/sanely/SanelyEncoder.scala` - EncoderDerivation class
- `sanely/src/sanely/SanelyDecoder.scala` - DecoderDerivation class
- `sanely/src/sanely/SanelyConfiguredEncoder.scala` - ConfiguredEncoderDerivation class
- `sanely/src/sanely/SanelyConfiguredDecoder.scala` - ConfiguredDecoderDerivation class

Reference numbers from an M3 Max MacBook Pro:
```
308 macro expansions, ~2450ms total macro time

Category Breakdown:
  summonIgnoring  1256ms (51.3%)  1366 calls  avg 0.92ms
  derive           716ms (29.2%)   586 calls
  summonMirror      80ms  (3.3%)   586 calls  avg 0.14ms
  subTraitDetect    43ms  (1.8%)   336 calls
  cacheHit        1714 hits
  overhead         355ms (14.5%)
```
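These figures are internally consistent; a quick arithmetic check of the averages and percentages, treating the ~2450ms total as exact:

```python
# Recompute the derived figures in the reference report above.
total_ms = 2450
categories = {"summonIgnoring": 1256, "derive": 716,
              "summonMirror": 80, "subTraitDetect": 43}

# Overhead is whatever the categories don't account for.
overhead = total_ms - sum(categories.values())
print(overhead)  # 355, matching the "overhead 355ms" line

print(round(1256 / 1366, 2))            # 0.92  (avg summonIgnoring call, ms)
print(round(100 * 1256 / total_ms, 1))  # 51.3  (summonIgnoring share, %)
print(round(100 * overhead / total_ms, 1))  # 14.5 (overhead share, %)
```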
Key insight: `summonIgnoring` (compiler implicit search) dominates at 51%.
The main optimization lever is reducing the number of `summonIgnoring` calls,
either through cross-expansion caching or `lazy val` emission patterns.
After analyzing profile data, use these heuristics to prioritize:

- If `summonIgnoring` > 40%: focus on reducing implicit search calls. Cross-expansion caching (`lazy val` emission) would help most; currently each macro expansion starts with a fresh cache.
- If `derive` > 30%: focus on reducing generated AST size. Extract more logic from inline macro output into SanelyRuntime helper methods.
- If specific types take > 50ms: those types have deep nesting. Consider whether they could benefit from user-provided instances or structural changes.
- If Decoder time far exceeds Encoder time: Decoder's field-by-field decode chain is inherently more expensive; the recursive `Expr` construction in `buildDecodeChain` is the cause.
- If the cache hit ratio is low: the intra-expansion cache isn't catching enough repeated types. Check whether the `cacheKey` computation is too specific.
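These heuristics can be sketched as a small triage helper. The percentage and per-type thresholds come straight from this section; the function name, its parameters, and the `2x` / `0.5` cutoffs used to interpret "Decoder >> Encoder" and "low cache hit ratio" are assumptions for illustration:

```python
def triage(pct, slowest_type_ms, decoder_ms, encoder_ms, cache_hit_ratio):
    """Return prioritized advice. pct maps category name -> % of macro time."""
    advice = []
    if pct.get("summonIgnoring", 0) > 40:
        advice.append("Reduce implicit search calls (cross-expansion caching / lazy vals)")
    if pct.get("derive", 0) > 30:
        advice.append("Shrink generated AST; move logic into SanelyRuntime helpers")
    if slowest_type_ms > 50:
        advice.append("Deeply nested types: consider user-provided instances")
    if decoder_ms > 2 * encoder_ms:  # "Decoder >> Encoder"; 2x cutoff assumed
        advice.append("Decoder decode chain dominates (buildDecodeChain cost)")
    if cache_hit_ratio < 0.5:        # "low" ratio cutoff assumed
        advice.append("Check cacheKey specificity; intra-expansion cache underused")
    return advice

# With the reference numbers from this skill, only the first rule fires:
report = triage({"summonIgnoring": 51.3, "derive": 29.2},
                slowest_type_ms=12, decoder_ms=1500, encoder_ms=950,
                cache_hit_ratio=0.74)
print(report)
```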