Extract structured per-film JSON from LIFF catalogue page PDFs when pages are a mix of film entries and non-film content (intro, index, schedules, adverts, juries). Use for single-page or batch workflows that require multi-pass extraction: text extraction, film-page detection, film block segmentation, JSON normalization, and sanity checks. Default dataset: print-media/2022-cat-pages.
Extract zero or more film records from one catalog page PDF and write one JSON file per film. Skip non-film pages explicitly and log the reason.
Process pages independently so work can be parallelized with low shared context.
Use references/2022-patterns.md for 2022-specific section ranges, skip hints, and examples.
Use these paths by default:
- print-media/2022-cat-pages/page-XX.pdf
- intermediate-extract/2022/page-XX-<slug>.json
- intermediate-extract/2022/raw-text/page-XX.txt
- intermediate-extract/2022/extraction_log.md
- intermediate-extract/2022/review_queue.md
Write one JSON file per film. If a page has no films, write no JSON files and add a skip log line.
Default extraction should work with CLI tools already used in this repo:
- pdftotext (required)
- jq (recommended for JSON validation)
If additional Python packages are needed for parsing or validation, install them only in the repo-local tooling/ environment using uv. Do not install globally.
Example:
cd tooling
uv add <package-name>
Run:
pdftotext -layout "print-media/2022-cat-pages/page-XX.pdf" -
Preserve layout mode because column alignment is useful for parsing metadata labels and film blocks.
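The extraction step can be sketched in Python as below. This is a minimal sketch, not the repo's own tooling: the function names are hypothetical, and the output location assumes the raw-text path listed in the defaults. It relies only on pdftotext being on PATH.

```python
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path: Path) -> list[str]:
    """Build the pdftotext invocation; the trailing '-' sends text to stdout."""
    return ["pdftotext", "-layout", str(pdf_path), "-"]


def extract_raw_text(pdf_path: Path, raw_dir: Path) -> Path:
    """Run pdftotext in layout mode and save the text as <raw_dir>/page-XX.txt."""
    result = subprocess.run(
        pdftotext_command(pdf_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if pdftotext exits non-zero
    )
    raw_dir.mkdir(parents=True, exist_ok=True)
    out_path = raw_dir / f"{pdf_path.stem}.txt"
    out_path.write_text(result.stdout, encoding="utf-8")
    return out_path
```

Keeping the raw text on disk means later passes (classification, segmentation) can be re-run without touching the PDF again.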
Normalize minimally:
Classify each page (film_page or skip).
Mark as film_page when at least two film metadata anchors exist:
- Running Time
- Country or Countries
- Year
- Director
- Print Source
Mark as skip when anchors are absent and text matches non-film patterns:
Log every page decision.
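The classification rule can be sketched as follows. The anchor labels come from the text above; the non-film hints are hypothetical placeholders, and returning skip with an UNCERTAIN reason for pages that fall between the two rules is an assumption of this sketch, not something the workflow specifies.

```python
import re

# Anchor labels named in the classification rule; matching is case-sensitive
# so that prose such as "this year" does not count as the Year label.
ANCHOR_PATTERNS = {
    "Running Time": re.compile(r"\bRunning Time\b"),
    "Country/Countries": re.compile(r"\bCountr(?:y|ies)\b"),
    "Year": re.compile(r"\bYear\b"),
    "Director": re.compile(r"\bDirector\b"),
    "Print Source": re.compile(r"\bPrint Source\b"),
}

# Hypothetical non-film hints; replace with the real skip patterns.
NON_FILM_HINTS = {
    "jury page": re.compile(r"\bJury\b", re.IGNORECASE),
    "index page": re.compile(r"\bIndex\b", re.IGNORECASE),
    "schedule page": re.compile(r"\bSchedule\b", re.IGNORECASE),
}


def classify_page(text: str) -> tuple[str, str]:
    """Return (decision, reason) where decision is 'film_page' or 'skip'."""
    found = [name for name, pat in ANCHOR_PATTERNS.items() if pat.search(text)]
    if len(found) >= 2:
        return "film_page", "anchors: " + ", ".join(found)
    if not found:
        for reason, pat in NON_FILM_HINTS.items():
            if pat.search(text):
                return "skip", reason
    # Assumption: anything in between is skipped but flagged for review.
    return "skip", "UNCERTAIN: fewer than two anchors, no known non-film pattern"
```

The returned reason string is meant to go straight into the page's log line, so every decision stays traceable.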
Use one parser per page.
Layout A (feature_page):
- Labelled metadata fields (Countries, Year, Running Time, etc.).
Layout B (shorts_grid_page):
- Multiple Print Source lines and repeated inline metadata strings containing Running Time.
For feature_page:
For shorts_grid_page:
- Segment film blocks using the Print Source pattern.
Emit this schema (omit absent fields):
{
  "title": "Aftersun",
  "page": "page-10.pdf",
  "section": "Official Selection",
  "program": "International Short Film Competition",
  "countries": ["UK", "USA"],
  "year": 2022,
  "years": [2021, 2022],
  "runtime_minutes": 96,
  "languages": ["English", "German"],
  "directors": ["Charlotte Wells"],
  "screenwriters": ["July Jung"],
  "producers": ["Dong-ha Kim", "Ji-yeon Kim"],
  "cinematographers": ["Gregory Oke"],
  "editors": ["Blair McClendon"],
  "cast": ["Paul Mescal", "Frankie Corio"],
  "premiere_status": "UK",
  "original_title": "Die unsichtbare Grenze",
  "print_source": "MUBI",
  "description": "Main blurb text with paragraph breaks preserved.",
  "quote": {
    "text": "Quoted text from the page.",
    "credit": "Director Li Ruijun, from an interview with Screen Daily"
  },
  "notes": "Optional presenter note, ambiguity note, or unstructured remainder."
}
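One way to honour "omit absent fields" is to prune empty values just before serialising. The sketch below is illustrative only: the helper names are hypothetical, and treating empty strings, empty lists, and empty objects as absent is an assumption.

```python
import json


def prune_absent(record: dict) -> dict:
    """Drop keys whose value is None, an empty string, or an empty list/dict."""
    pruned = {}
    for key, value in record.items():
        if isinstance(value, dict):
            value = prune_absent(value)  # clean nested objects such as quote
        if value is None or value == "" or value == [] or value == {}:
            continue
        pruned[key] = value
    return pruned


def to_json(record: dict) -> str:
    """Serialise a film record with absent fields removed."""
    # ensure_ascii=False keeps non-ASCII titles and names readable in the file.
    return json.dumps(prune_absent(record), ensure_ascii=False, indent=2)
```

Pruning at write time lets the parsers fill every key unconditionally and leaves the omission logic in one place.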
Rules:
- Normalize Country/Countries into countries array.
- Use year when exactly one 4-digit value exists.
- Use years when multiple years exist.
- Convert running time to minutes (for example 2hr 14min => 134, 95min => 95).
- Store languages in languages.
- Ignore subtitle notes (for example with subtitles); do not emit a subtitles field.
- Parse Director, Screenwriter, Producer, Cinematographer, Editor, and Leading Cast/Key Cast into arrays.
- Set section from page heading; infer from page range only when heading is missing.
- Use program for competition names (for example International Short Film Competition).
- Keep description as plain text only; preserve paragraph breaks where possible.
- Keep pull quotes out of description.
- Store pull quotes in quote.
- Use quote.text for the quote body text.
- Use quote.credit for the plain-text attribution exactly as shown (for example Director Li Ruijun, from an interview with Screen Daily).
- Keep presenter or editorial notes out of description to avoid mixing editorial voice and film blurb.
- Use notes to capture uncertainty or ambiguity (for example unclear names, uncertain segmentation).
- Prefix uncertain notes with UNCERTAIN: so they can be found in a post-extraction review pass.
Use:
- page-XX-<slug>.json for first film.
- page-XX-<slug>-2.json, -3, etc. for same-page duplicates.
Slug rules:
-.-.
Run syntax checks:
jq empty intermediate-extract/2022/*.json
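The runtime, year, and file-naming rules described earlier can be sketched in Python as below. The runtime and year behaviour follows the stated rules; the slug behaviour (lowercase, accents stripped, non-alphanumeric runs replaced by -, leading and trailing - trimmed) is an assumption of this sketch, and all function names are hypothetical.

```python
import re
import unicodedata
from typing import Optional


def runtime_to_minutes(raw: str) -> Optional[int]:
    """Convert strings such as '2hr 14min' or '95min' to whole minutes."""
    hours = re.search(r"(\d+)\s*hr", raw)
    minutes = re.search(r"(\d+)\s*min", raw)
    if not hours and not minutes:
        return None
    total = 0
    if hours:
        total += int(hours.group(1)) * 60
    if minutes:
        total += int(minutes.group(1))
    return total


def year_fields(raw: str) -> dict:
    """Emit 'year' for exactly one 4-digit value, 'years' for several."""
    years = [int(y) for y in re.findall(r"\b\d{4}\b", raw)]
    if len(years) == 1:
        return {"year": years[0]}
    if len(years) > 1:
        return {"years": years}
    return {}


def slugify(title: str) -> str:
    """Assumed slug rules: ASCII, lowercase, '-' between alphanumeric runs."""
    ascii_title = (
        unicodedata.normalize("NFKD", title).encode("ascii", "ignore").decode("ascii")
    )
    return re.sub(r"[^a-z0-9]+", "-", ascii_title.lower()).strip("-")


def film_filename(page: str, title: str, used: set) -> str:
    """Return page-XX-<slug>.json, adding -2, -3, ... for same-page duplicates."""
    base = f"{page}-{slugify(title)}"
    name, n = f"{base}.json", 1
    while name in used:
        n += 1
        name = f"{base}-{n}.json"
    used.add(name)
    return name
```

For example, calling `film_filename("page-120", "Tremor", used)` twice with the same `used` set yields `page-120-tremor.json` and then `page-120-tremor-2.json`.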
Run content checks and add failures to review_queue.md:
- Missing title, section, or description.
- runtime_minutes outside 1-400.
- Metadata labels left in extracted values: Running Time, Director, Print Source.
- Very short description (<40 chars) unless clearly valid.
- If quote exists, ensure both quote.text and quote.credit are present.
- If notes contains UNCERTAIN:, add a matching item to review_queue.md.
Append one line per page to extraction_log.md:
- Page 010 - extracted Aftersun
- Page 120 - extracted The Water Murmurs, Tremor, Tsutsue
- Page 107 - skipped, jury page
Assign one worker per page PDF. Require each worker to output:
- Page status (extracted or skipped).
Avoid cross-page assumptions except for section-range fallback from references/2022-patterns.md.
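A per-worker sketch of the content checks and the log-line format might look like the following. The function names are hypothetical, and checking for stray metadata labels only in the title is an assumption about where such leaks matter most.

```python
LABELS = ("Running Time", "Director", "Print Source")


def content_failures(record: dict) -> list[str]:
    """Return human-readable failures to copy into review_queue.md."""
    failures = []
    for field in ("title", "section", "description"):
        if not record.get(field):
            failures.append(f"missing {field}")
    runtime = record.get("runtime_minutes")
    if runtime is not None and not 1 <= runtime <= 400:
        failures.append(f"runtime_minutes out of range: {runtime}")
    # Assumption: a metadata label inside the title signals a bad block split.
    title = record.get("title") or ""
    if any(label in title for label in LABELS):
        failures.append("metadata label in title")
    description = record.get("description") or ""
    if description and len(description) < 40:
        failures.append("short description (<40 chars)")
    quote = record.get("quote")
    if quote is not None and not (quote.get("text") and quote.get("credit")):
        failures.append("quote missing text or credit")
    if "UNCERTAIN:" in (record.get("notes") or ""):
        failures.append("notes flagged UNCERTAIN")
    return failures


def log_line(page_number: int, titles: list[str], skip_reason: str = "") -> str:
    """Format one extraction_log.md line for a page."""
    if titles:
        return f"- Page {page_number:03d} - extracted {', '.join(titles)}"
    return f"- Page {page_number:03d} - skipped, {skip_reason}"
```

Because both functions take only the page's own data, each worker can run them without shared state.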
After all pages are processed:
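- Confirm every print-media/2022-cat-pages/page-*.pdf has one log line.
- Confirm every JSON file passes the syntax check (jq empty).
- Group records by (title, year, section) to find duplicates, and verify manually before deleting anything.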
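The final pass can be sketched as two small helpers. This is a sketch under stated assumptions: the names are hypothetical, and it assumes the number in page-XX.pdf corresponds numerically to the zero-padded number in the log line. It only reports candidates and deletes nothing.

```python
import re
from collections import Counter


def pages_missing_log_line(pdf_names: list[str], log_text: str) -> list[str]:
    """Return PDFs whose page number does not appear exactly once in the log."""
    logged = Counter(
        int(n) for n in re.findall(r"^- Page (\d+) - ", log_text, re.MULTILINE)
    )
    problems = []
    for name in pdf_names:
        match = re.search(r"page-(\d+)\.pdf$", name)
        if match is None or logged[int(match.group(1))] != 1:
            problems.append(name)
    return problems


def duplicate_keys(records: list[dict]) -> list[tuple]:
    """Return (title, year, section) keys shared by more than one record."""
    counts = Counter(
        (r.get("title"), r.get("year"), r.get("section")) for r in records
    )
    return [key for key, n in counts.items() if n > 1]
```

Both helpers return lists for a person to review, in keeping with the rule to verify manually before deleting anything.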