Ensures agents extract data from context files with validation and fallback strategies before resorting to assumptions or external searches.
Prevent data hallucination and inefficiency by mandating that agents inspect, validate, and fully extract data from provided reference files before attempting web searches or generating synthetic data. When extraction is incomplete, use fallback strategies before making assumptions.
If a reference file is provided in the task context, it is the source of truth. Do not fabricate data or search the web for information that may exist within the provided attachments. If extraction appears incomplete, attempt alternative methods before proceeding.
- At the start of every task, explicitly list all files provided in the context window or attachment panel.
- Look for spreadsheets (.xlsx, .csv), documents (.pdf, .docx, .pptx), or data dumps (.json, .txt, .xml, .yaml).
- Determine whether any provided file contains the data required to complete the task.
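The inventory step can be sketched as a small helper that groups the provided filenames by extension. This is a minimal illustration, not part of any agent API: the function name `inventory` and the category labels are assumptions, while the extension groups come from the list above.

```python
from pathlib import Path

# Extension groups from the inventory step; the category names are illustrative.
CATEGORIES = {
    "spreadsheet": {".xlsx", ".csv"},
    "document": {".pdf", ".docx", ".pptx"},
    "data_dump": {".json", ".txt", ".xml", ".yaml"},
}


def inventory(filenames):
    """Group filenames by category; unknown extensions go under 'other'."""
    result = {}
    for name in filenames:
        ext = Path(name).suffix.lower()
        category = next(
            (cat for cat, exts in CATEGORIES.items() if ext in exts), "other"
        )
        result.setdefault(category, []).append(name)
    return result


if __name__ == "__main__":
    print(inventory(["Massabama_active_listings.xlsx", "Pricing_email.docx"]))
```

A file landing under "other" is not necessarily irrelevant; it only means the extension gives no hint, so the file still needs to be inspected.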
If relevant files are found:
- Extract their contents with the appropriate tool (e.g. read_file, pandas, pdf_reader).

When initial extraction is incomplete or critical data is missing, attempt these strategies in order before making assumptions:

- Try a format-specific library:
  - .docx: Try python-docx directly via execute_code_sandbox if read_file was truncated.
  - .xlsx/.csv: Try pandas with explicit sheet selection or openpyxl directly.
  - .pdf: Try pdfplumber, PyPDF2, or shell-based pdftotext if available.
- Inspect the raw file contents with shell tools (cat, xxd, strings).
- Use shell_agent or run_shell for format-specific tools:
  - unzip -p file.docx word/document.xml | xmllint --format - (extract raw docx XML)
  - in2csv file.xlsx (convert Excel to CSV via shell)
  - pdftotext file.pdf - (extract PDF text via command line)
- Document every extraction method attempted, including read_file, and what each one returned. For example: "Attempted read_file on Pricing_email.docx (truncated at 980 chars). Attempted python-docx extraction via sandbox; pricing table in section 2 not present in file. Pricing numbers unavailable from provided context."

When presenting data in the final output:

- Name the source file: "[...] Massabama_active_listings.xlsx..."
- Name the extraction method where a fallback was used: "[...] pdftotext from report.pdf..."

If all extraction strategies fail to provide the specific data needed:
Task: Create a report on active property listings.
Context: Massabama_active_listings.xlsx is attached.
Incorrect Approach:
Correct Approach:
- Identify Massabama_active_listings.xlsx in context.
- Read the file using read_file or pandas.
- Cite the file in the output: "[...] Massabama_active_listings.xlsx, the total value is..."

Task: Extract pricing tiers from Pricing_email.docx.
Context: Pricing_email.docx attached, but read_file output cuts off mid-sentence.
Incorrect Approach:
Correct Approach:
- Notice that the read_file output ends at 980 chars, mid-sentence.
- Use execute_code_sandbox with python-docx to read the full document.
- Report which methods were tried and what each one returned: "[...] read_file (truncated) and python-docx (table structure not preserved)."

| Tool | Known Limitations | Fallback Strategy |
|---|---|---|
| read_file | May truncate at ~1000-5000 chars depending on format | Use execute_code_sandbox with format-specific library |
| execute_code_sandbox | May have missing dependencies or sandbox errors | Use shell_agent or run_shell for CLI tools |
| shell_agent | Slower, but more flexible with system tools | Use for pdftotext, unzip XML extraction, in2csv, etc. |
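
The escalation in the table can be sketched as a loop that tries each extractor in turn and records what happened, so the final report can state which methods were attempted. This is a minimal sketch under assumptions: `extract_with_fallbacks` and the stand-in extractor functions are hypothetical names, not real tool APIs, and truncation detection is left out (an extractor that returns partial text counts as a success here).

```python
def extract_with_fallbacks(path, strategies):
    """Try each (name, extractor) pair in order.

    Returns (text, attempts): text is the first non-empty result or None,
    and attempts is a list of (name, outcome) tuples for the final report.
    """
    attempts = []
    for name, extractor in strategies:
        try:
            text = extractor(path)
        except Exception as exc:  # a failing tool should not abort the chain
            attempts.append((name, f"error: {exc}"))
            continue
        if text and text.strip():
            attempts.append((name, "ok"))
            return text, attempts
        attempts.append((name, "empty"))
    return None, attempts


if __name__ == "__main__":
    # Hypothetical stand-ins for read_file, a sandbox library call, and a shell tool.
    def fake_read_file(path):
        return ""

    def fake_sandbox(path):
        raise RuntimeError("missing dependency")

    def fake_shell(path):
        return "Tier 1: 10 USD"

    chain = [
        ("read_file", fake_read_file),
        ("execute_code_sandbox", fake_sandbox),
        ("shell_agent", fake_shell),
    ]
    print(extract_with_fallbacks("Pricing_email.docx", chain))
```

When `text` comes back as `None`, the `attempts` list holds the material for an honest "data unavailable" statement instead of a guess.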
If all extraction attempts fail:
examples/extraction_fallback.sh:

    #!/bin/bash

    extract_docx_raw() {
        local file="$1"
        # Extract raw XML from docx (docx is a zip archive)
        unzip -p "$file" word/document.xml 2>/dev/null | xmllint --format - 2>/dev/null
    }

    extract_xlsx_to_csv() {
        local file="$1"
        # Convert Excel to CSV using in2csv (from csvkit)
        if command -v in2csv &>/dev/null; then
            in2csv "$file" 2>/dev/null
        else
            # Report on stderr so the message never mixes with extracted data
            echo "in2csv not available; try python pandas approach" >&2
            return 1
        fi
    }

    extract_pdf_text() {
        local file="$1"
        # Extract text from PDF using pdftotext
        if command -v pdftotext &>/dev/null; then
            pdftotext "$file" - 2>/dev/null
        else
            echo "pdftotext not available; try PyPDF2 via Python sandbox" >&2
            return 1
        fi
    }

    extract_raw_strings() {
        local file="$1"
        # Extract printable strings from any binary file (first 500 lines)
        if command -v strings &>/dev/null; then
            strings "$file" 2>/dev/null | head -500
        else
            echo "strings not available" >&2
            return 1
        fi
    }

    # Dispatch on the first argument; the second argument is the file path
    case "$1" in
        docx_raw) extract_docx_raw "$2" ;;
        xlsx_csv) extract_xlsx_to_csv "$2" ;;
        pdf_text) extract_pdf_text "$2" ;;
        raw_strings) extract_raw_strings "$2" ;;
        *)
            echo "Usage: $0 <docx_raw|xlsx_csv|pdf_text|raw_strings> <file>" >&2
            exit 1
            ;;
    esac
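
The docx_raw path in the script relies on unzip and xmllint being installed. Where neither is available, the same idea (a .docx file is a zip archive whose body lives in word/document.xml) can be sketched in Python with only the standard library. The function name `docx_paragraphs` is an assumption for illustration. The sketch reads body text only, so headers, footers, and footnotes stored in other archive members are not covered.

```python
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML namespace used by the elements in word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(path):
    """Return the text of each non-empty paragraph in word/document.xml."""
    with zipfile.ZipFile(path) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across runs; join the w:t nodes.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text:
            paragraphs.append(text)
    return paragraphs


if __name__ == "__main__":
    import sys

    for line in docx_paragraphs(sys.argv[1]):
        print(line)
```

Paragraphs inside table cells are also w:p elements, so table text appears in the output as flat lines. The row and column structure is lost, which is worth stating when reporting data recovered this way.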