Ensures agents read and use provided reference files before searching or fabricating data
When reference files are provided in the task context, you MUST read and extract data from them FIRST, before attempting web searches or generating synthetic data. Ignoring available structured data leads to fabricated outputs and incorrect results.
Reference files in context > Web search > Data fabrication (never)
Before taking any action, identify all files provided in the task context:
.xlsx, .csv, .json, .pdf, .docx, .txt
Use the appropriate tool to read each reference file:
# For Excel files
read_file(filetype="xlsx", file_path="path/to/file.xlsx")
# For CSV files
read_file(filetype="csv", file_path="path/to/file.csv")
# For PDF files
read_file(filetype="pdf", file_path="path/to/file.pdf")
# For JSON files
read_file(filetype="json", file_path="path/to/file.json")
# For text files
read_file(filetype="txt", file_path="path/to/file.txt")
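In the calls above, the filetype argument mirrors the file extension. A minimal sketch of that mapping follows. The helper names `filetype_for` and `find_reference_files` are hypothetical and not part of any tool described here, and the `docx` filetype value is an assumption, since only xlsx, csv, pdf, json, and txt calls are shown.

```python
from pathlib import Path
from typing import Dict, List, Optional

# Extensions named in this skill. The "docx" value is an assumption: the
# calls above only demonstrate xlsx, csv, pdf, json, and txt.
SUPPORTED = {"xlsx", "csv", "json", "pdf", "docx", "txt"}


def filetype_for(file_path: str) -> Optional[str]:
    """Return the filetype string for a reference file, or None if unsupported."""
    ext = Path(file_path).suffix.lower().lstrip(".")
    return ext if ext in SUPPORTED else None


def find_reference_files(paths: List[str]) -> Dict[str, str]:
    """Map each supported path to the filetype to pass to the reader tool."""
    result = {}
    for path in paths:
        filetype = filetype_for(path)
        if filetype is not None:
            result[path] = filetype
    return result
```

Each entry in the returned mapping corresponds to one read_file call; paths with other extensions are left out rather than guessed at.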
If read_file fails on .docx files (returns error, empty content, or 'unknown error'):
Fallback Approach 1: Direct zipfile/XML extraction via run_shell
# .docx files are ZIP archives containing XML; extract document.xml directly
unzip -p path/to/file.docx word/document.xml | grep -oP '<w:t(\s[^>]*)?>\K[^<]+' | tr '\n' ' '
Or for more complete extraction:
mkdir -p /tmp/docx_extract && unzip -o path/to/file.docx -d /tmp/docx_extract && cat /tmp/docx_extract/word/document.xml
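Where Python is available, the same extraction can be sketched with the standard library's `zipfile` and `xml.etree.ElementTree` modules instead of shell pipelines. The function name `extract_docx_text` is hypothetical. This sketch only reads the main document body and assumes headers, footers, and footnotes are not needed.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_docx_text(docx_path) -> str:
    """Return paragraph text from a .docx file, one paragraph per line."""
    with zipfile.ZipFile(docx_path) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(f"{W_NS}p"):
        # Join every text run (<w:t>) inside this paragraph
        text = "".join(node.text or "" for node in para.iter(f"{W_NS}t"))
        if text:
            paragraphs.append(text)
    return "\n".join(paragraphs)
```

Parsing the XML instead of pattern-matching it keeps paragraph boundaries and handles `<w:t>` elements that carry attributes such as `xml:space="preserve"`.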
Fallback Approach 2: Use shell_agent for complex extraction
If direct extraction fails, delegate to shell_agent:
shell_agent(task="Extract text content from path/to/file.docx using zipfile and XML parsing")
The agent will attempt multiple extraction methods and report results.
Fallback Approach 3: Verify extraction success before proceeding
After any extraction method, confirm content was retrieved:
Important: Never proceed to data fabrication if reference files exist but read_file fails. Always attempt at least one fallback extraction method first.
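The rule above (try the primary reader, then each fallback, and verify every result) can be sketched as a small chain. The names `looks_extracted` and `first_successful` are hypothetical, and the failure check is an assumption based on the failure modes listed earlier (an error, empty content, or 'unknown error'); a real tool may report failures differently.

```python
from typing import Callable, Iterable, Optional

# Outputs treated as a failed read. "unknown error" comes from the failure
# modes described above; matching it exactly is an assumption.
ERROR_MARKERS = ("unknown error",)


def looks_extracted(text: Optional[str], min_chars: int = 1) -> bool:
    """Treat None, blank output, and known error strings as failed extraction."""
    if text is None:
        return False
    stripped = text.strip()
    if len(stripped) < min_chars:
        return False
    return stripped.lower() not in ERROR_MARKERS


def first_successful(extractors: Iterable[Callable[[], str]]) -> Optional[str]:
    """Run extractors in order and return the first output that passes the check."""
    for extract in extractors:
        try:
            text = extract()
        except Exception:
            continue  # one failing method should not stop the remaining fallbacks
        if looks_extracted(text):
            return text
    return None  # every method failed: report that instead of inventing data
```

Each extractor is a zero-argument callable wrapping one method (the primary read, the unzip pipeline, a delegated extraction). A `None` result means the failure should be reported, not papered over with fabricated content.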
After reading:
In your outputs, acknowledge the source:
❌ Ignoring reference files and searching the web instead
❌ Fabricating data when structured data is available
❌ Assuming file contents without reading them
❌ Using outdated web data when current reference files exist
❌ Giving up after read_file fails without trying fallback extraction methods
Task Context: "Create a property listings report. See Massabama_active_listings.xlsx for current data."
Correct Approach:
1. Read Massabama_active_listings.xlsx first
2. Extract property addresses, prices, specifications
3. Generate report using actual listing data
4. Note: "Data sourced from Massabama_active_listings.xlsx"
Incorrect Approach:
1. Search web for "Massabama property listings"
2. Fabricate property data from search results
3. Create report with unverified/generated data
Before completing any task with reference files:
Only search the web when: