Heuristics for extracting book metadata from aithena library paths
Use this when deriving title, author, year, and category from book paths before indexing into Solr.
Honor explicit filename structure first
Author - Title (Year).pdf → author/title/year from the filename
Category/Author - Title (Year).pdf → category from folder, author/title/year from filename

Use folder depth to separate category vs author
Category/Author/Title.pdf → first folder is category, second folder is author
Author/Title.pdf → parent folder is author when the filename does not look like a series/journal issue

Handle real aithena library cases
amades/Auca ... amades.pdf → treat amades as author and strip the repeated author suffix from the title
balearics/ESTUDIS_BALEARICS_01.pdf → treat balearics as category, keep the filename as title text, and use author="Unknown"
bsal/Bolletí ... 1885 - 1886.pdf → treat bsal as category; year ranges are metadata, not Author - Title separators

Always provide fallbacks
Default title to the filename stem with underscores normalized to spaces
Default author to Unknown
Keep file_path, folder_path, and file_size alongside parsed metadata
Do not split on " - " blindly: periodicals with year ranges will be misparsed as Author - Title
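
The rules above can be sketched as a single parsing function. This is a minimal illustration, not the project's actual implementation: the function name parse_book_path, the helper strip_author_suffix, the three regular expressions, and the returned dictionary shape are all assumptions made for this example. In particular, "looks like a series/journal issue" is approximated here as "contains a year range or ends in a short issue number", which may be too narrow or too broad for a real library. How a year range should be stored is left open, so the sketch leaves year empty for periodicals.

```python
import re
from pathlib import PurePosixPath
from typing import Optional

# "Author - Title (Year)" with one four-digit year in trailing parentheses.
EXPLICIT = re.compile(r"^(?P<author>.+?) - (?P<title>.+) \((?P<year>\d{4})\)$")
# A year range such as "1885 - 1886": periodical metadata, not a separator.
YEAR_RANGE = re.compile(r"\b\d{4}\s*-\s*\d{4}\b")
# Assumed approximation of an issue number: 1-3 trailing digits after "_" or " ".
ISSUE_LIKE = re.compile(r"[_ ]\d{1,3}$")


def strip_author_suffix(title: str, author: str) -> str:
    """Drop a trailing repeat of the author name from the title, if present."""
    suffix = " " + author
    if len(title) > len(suffix) and title[-len(suffix):].casefold() == suffix.casefold():
        return title[: -len(suffix)].rstrip()
    return title


def parse_book_path(rel_path: str, file_size: Optional[int] = None) -> dict:
    """Derive metadata from a library-relative path (hypothetical helper)."""
    path = PurePosixPath(rel_path)
    folders = path.parts[:-1]
    stem = path.stem

    # Fallbacks first: every field has a value even if no rule matches.
    meta = {
        "title": stem.replace("_", " ").strip(),
        "author": "Unknown",
        "year": None,
        "category": None,
        "file_path": str(path),
        "folder_path": str(path.parent),
        "file_size": file_size,
    }

    # 1. Explicit filename structure wins, unless a year range is present.
    match = EXPLICIT.match(stem)
    if match and not YEAR_RANGE.search(stem):
        meta["author"] = match.group("author").strip()
        meta["title"] = match.group("title").strip()
        meta["year"] = int(match.group("year"))
        if folders:
            meta["category"] = folders[0]
        return meta

    # 2. Otherwise use folder depth.
    periodical = bool(YEAR_RANGE.search(stem) or ISSUE_LIKE.search(stem))
    if len(folders) >= 2:
        meta["category"], meta["author"] = folders[0], folders[1]
    elif len(folders) == 1:
        if periodical:
            meta["category"] = folders[0]  # author stays "Unknown"
        else:
            meta["author"] = folders[0]

    # 3. Remove a repeated author name at the end of the title.
    if meta["author"] != "Unknown":
        meta["title"] = strip_author_suffix(meta["title"], meta["author"])
    return meta
```

With these assumptions, parse_book_path("balearics/ESTUDIS_BALEARICS_01.pdf") yields category "balearics", author "Unknown", and title "ESTUDIS BALEARICS 01". The file size is passed in rather than read from disk so the function stays pure and easy to test; a caller would typically supply it from a directory scan before sending the document to Solr.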