Name: Import Docs
Author: stmailabs


find "$docs_path" -type f \( -name "*.pdf" -o -name "*.docx" -o -name "*.doc" -o -name "*.xlsx" -o -name "*.csv" -o -name "*.pptx" -o -name "*.png" -o -name "*.jpg" -o -name "*.jpeg" -o -name "*.bib" -o -name "*.ris" -o -name "*.md" -o -name "*.txt" -o -name "*.html" \) | sort

Found N documents in "$docs_path":
  1. cv_popescu_2024.pdf
  2. paper_gnn_binding.pdf
  3. budget_template.xlsx
  4. letter_collaborator_munich.pdf
  ...
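The scan step can also be sketched in Python. This is a minimal illustration, not the skill's own implementation: `scan_documents` and `format_listing` are hypothetical helper names, and the extension set is copied from the `find` command. One deliberate difference, labeled here as an assumption: the sketch lowercases each suffix before matching, whereas the `-name` tests in the `find` command are case-sensitive.

```python
from pathlib import Path

# Extensions copied from the find command; the helpers below are hypothetical.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".doc", ".xlsx", ".csv", ".pptx", ".png",
    ".jpg", ".jpeg", ".bib", ".ris", ".md", ".txt", ".html",
}


def scan_documents(docs_path):
    """Return all supported files under docs_path, sorted by path."""
    root = Path(docs_path)
    return sorted(
        p for p in root.rglob("*")
        # Assumption: match suffixes case-insensitively (".PDF" counts as ".pdf").
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )


def format_listing(docs_path, files):
    """Render the numbered listing in the same shape as the example output."""
    lines = [f'Found {len(files)} documents in "{docs_path}":']
    lines += [f"  {i}. {p.name}" for i, p in enumerate(files, start=1)]
    return "\n".join(lines)
```

Calling `format_listing(docs_path, scan_documents(docs_path))` yields a listing in the same layout as the example, with `N` replaced by the actual count.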

/markitdown <file_path>

| Classification | Signals |
|----------------|---------|
| CV / Resume | Contains "Curriculum Vitae", "CV", "Education", "Employment", "Publications", "Grants" sections |
| Research paper | Has Abstract, Introduction, Methods, Results, References sections. Has DOI or journal name. |
| Previous proposal | Has "Specific Aims", "Research Strategy", "Objectives", "Work Packages", "Budget Justification" |
| FOA / Call | Has "Funding Opportunity", "Call for Proposals", "Deadline", "Eligibility", "Submission" |
| Budget spreadsheet | Has columns for personnel, costs, years. Contains salary figures, overhead rates. |
| Letter of support | Addressed "Dear...", mentions "support", "collaboration", "commit", "pleased to contribute" |
| Review feedback | Has "Reviewer", "Score", "Strengths", "Weaknesses", "Summary Statement" |
| Facilities | Describes labs, equipment, computing, institutional resources |
| DMP | Mentions "data management", "data sharing", "repositories", "FAIR principles" |
| Figures | Image files, or PDFs with mostly graphics |
| Presentation | PPTX files, or PDFs with slide-like formatting |
| Bibliography | .bib or .ris files, or documents that are mostly reference lists |
| Notes / Outline | Short text without formal structure, brainstorming, bullet points |
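One way to turn these signals into a first-pass classifier is simple keyword counting. The sketch below is illustrative only: the keyword lists are a subset of the table, while the `classify` function, the `min_hits` threshold, the tie-breaking by dictionary order, and the `"Unclassified"` fallback label are all assumptions, not part of the skill. Substring matching is crude (a short keyword such as "doi" can match inside other words), so the result should be treated as a suggestion for a human to confirm.

```python
from pathlib import Path

# Keyword signals taken from a subset of the classification table (lowercased).
SIGNALS = {
    "CV / Resume": ["curriculum vitae", "education", "employment", "publications", "grants"],
    "Research paper": ["abstract", "introduction", "methods", "results", "references", "doi"],
    "Previous proposal": ["specific aims", "research strategy", "objectives",
                          "work packages", "budget justification"],
    "FOA / Call": ["funding opportunity", "call for proposals", "deadline",
                   "eligibility", "submission"],
    "Letter of support": ["dear", "support", "collaboration", "commit", "pleased to contribute"],
    "Review feedback": ["reviewer", "score", "strengths", "weaknesses", "summary statement"],
    "DMP": ["data management", "data sharing", "repositories", "fair principles"],
}

# File types the table classifies by extension alone.
EXTENSION_RULES = {
    ".bib": "Bibliography", ".ris": "Bibliography", ".pptx": "Presentation",
    ".png": "Figures", ".jpg": "Figures", ".jpeg": "Figures",
}


def classify(filename, text, min_hits=2):
    """Return a best-guess label; min_hits is an assumed confidence threshold."""
    suffix = Path(filename).suffix.lower()
    if suffix in EXTENSION_RULES:
        return EXTENSION_RULES[suffix]
    lowered = text.lower()
    # Count how many keywords of each class occur anywhere in the text.
    scores = {label: sum(kw in lowered for kw in kws) for label, kws in SIGNALS.items()}
    best = max(scores, key=scores.get)  # ties resolve to the first label in SIGNALS
    return best if scores[best] >= min_hits else "Unclassified"
```

Classes such as Budget spreadsheet or Facilities depend on layout or descriptive content rather than keywords, so they are left out of this sketch.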

Document classification:
  cv_popescu_2024.pdf            → CV / Resume
  paper_gnn_binding.pdf          → Research paper
  budget_template.xlsx           → Budget spreadsheet
  letter_collaborator_munich.pdf → Letter of support
  fig_preliminary_results.png    → Figures (preliminary data)
  previous_pce_proposal.pdf      → Previous proposal
  foa_horizon_2025.pdf           → FOA / Call

{
  "type": "cv",
  "source_file": "cv_popescu_2024.pdf",
  "extracted": {
    "name": "Dr. Alexandru Popescu",
    "title": "Associate Professor",
    "institution": "Politehnica University of Bucharest",
    "department": "Computer Science",
    "email": "...",
    "orcid": "0000-0002-...",
    "positions": [
      {"title": "Associate Professor", "institution": "...", "from": 2020, "to": "present"},
      {"title": "Postdoc", "institution": "ETH Zurich", "from": 2017, "to": 2020}
    ],
    "education": [...],
    "selected_publications": [
      {"title": "...", "journal": "...", "year": 2023, "doi": "..."},
      ...
    ],
    "grants": [
      {"title": "...", "agency": "UEFISCDI", "role": "PI", "amount": "500,000 RON", "years": "2021-2024"}
    ],
    "h_index": 18,
    "expertise": ["graph neural networks", "drug discovery", "protein modeling"]
  }
}

{
  "type": "paper",
  "source_file": "paper_gnn_binding.pdf",
  "extracted": {
    "title": "...",
    "authors": ["..."],
    "journal": "...",
    "year": 2023,
    "doi": "...",
    "abstract": "...",
    "key_results": ["achieved 89% accuracy on PDBbind", "outperformed baseline by 15%"],
    "methods_used": ["equivariant GNN", "message passing", "PDBbind v2020"],
    "figures": ["fig1_architecture.png", "fig2_results.png"],
    "is_preliminary_data": true
  }
}

{
  "type": "previous_proposal",
  "source_file": "previous_pce_proposal.pdf",
  "extracted": {
    "title": "...",
    "agency": "UEFISCDI",
    "mechanism": "PCE",
    "aims": ["Aim 1: ...", "Aim 2: ..."],
    "methodology_summary": "...",
    "budget_structure": {"personnel": "...", "equipment": "..."},
    "team": [{"name": "...", "role": "PI"}, ...],
    "references_cited": [...],
    "outcome": "funded/not funded (if known)"
  }
}

{
  "type": "foa",
  "source_file": "foa_horizon_2025.pdf",
  "extracted": {
    "agency": "horizon",
    "mechanism": "RIA",
    "call_id": "HORIZON-HLTH-2025-...",
    "title": "...",
    "deadline": "2025-09-15",
    "budget_range": "EUR 3-5M per project",
    "page_limits": {"part_b1": 45},
    "required_sections": [...],
    "review_criteria": [...],
    "special_requirements": ["gender dimension", "open access", "ethics"]
  }
}
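Once the call has been extracted into this shape, the `deadline` and `page_limits` fields lend themselves to simple automated checks. The following is a rough sketch under stated assumptions: the helper names `days_until_deadline` and `over_page_limit` are hypothetical, the deadline is assumed to be an ISO `YYYY-MM-DD` string as in the example, and the `page_counts` input (pages per part in the current draft) is an assumed structure that does not appear in the source.

```python
from datetime import date


def days_until_deadline(foa, today):
    """Days remaining before the call deadline (negative once it has passed)."""
    deadline = date.fromisoformat(foa["extracted"]["deadline"])
    return (deadline - today).days


def over_page_limit(foa, page_counts):
    """Map each part exceeding its limit to a (actual, allowed) pair."""
    limits = foa["extracted"]["page_limits"]
    return {
        part: (count, limits[part])
        for part, count in page_counts.items()
        if part in limits and count > limits[part]
    }


# Values taken from the FOA example; the draft page count is made up for illustration.
foa = {"extracted": {"deadline": "2025-09-15", "page_limits": {"part_b1": 45}}}
remaining = days_until_deadline(foa, date(2025, 9, 1))
violations = over_page_limit(foa, {"part_b1": 47})
```

These checks only confirm internal consistency; the extracted deadline and limits still need to be verified against the call document itself.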

{
  "type": "budget",
  "source_file": "budget_template.xlsx",
  "extracted": {
    "personnel": [
      {"name": "PI", "monthly_salary": 15000, "effort_pct": 50, "months": 36},
      {"name": "Postdoc", "monthly_salary": 10000, "effort_pct": 100, "months": 36}
    ],
    "indirect_rate": 0.25,
    "equipment": [{"item": "GPU Server", "cost": 50000}],
    "travel": 25000,
    "consumables": 30000,
    "currency": "RON"
  }
}
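To show how these fields might combine into totals, here is a worked sketch using the example figures. The formulas are assumptions, not the skill's documented calculation: personnel cost is taken as `monthly_salary × effort_pct / 100 × months`, and `indirect_rate` is applied to all direct costs. Many funders exclude equipment or other categories from the indirect base, so the real rule must come from the call or the institution.

```python
# Figures copied from the budget example; formulas below are assumptions.
budget = {
    "personnel": [
        {"name": "PI", "monthly_salary": 15000, "effort_pct": 50, "months": 36},
        {"name": "Postdoc", "monthly_salary": 10000, "effort_pct": 100, "months": 36},
    ],
    "indirect_rate": 0.25,
    "equipment": [{"item": "GPU Server", "cost": 50000}],
    "travel": 25000,
    "consumables": 30000,
    "currency": "RON",
}


def personnel_cost(person):
    """Assumed formula: salary scaled by effort percentage over the project months."""
    return person["monthly_salary"] * person["effort_pct"] / 100 * person["months"]


def summarize(budget):
    personnel = sum(personnel_cost(p) for p in budget["personnel"])
    equipment = sum(e["cost"] for e in budget["equipment"])
    direct = personnel + equipment + budget["travel"] + budget["consumables"]
    # Assumption: indirect costs are charged on the full direct total.
    indirect = direct * budget["indirect_rate"]
    return {"personnel": personnel, "direct": direct,
            "indirect": indirect, "total": direct + indirect}


summary = summarize(budget)
# personnel: 270,000 (PI) + 360,000 (Postdoc) = 630,000 RON
# direct:    630,000 + 50,000 + 25,000 + 30,000 = 735,000 RON
# indirect:  0.25 × 735,000 = 183,750 RON; total = 918,750 RON
```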

{
  "type": "letter",
  "source_file": "letter_collaborator_munich.pdf",
  "extracted": {
    "from": "Prof. Hans Mueller",
    "institution": "Technical University of Munich",
    "commitment": "Will contribute 6 person-months of computational resources and co-supervise 1 PhD student",
    "areas": ["molecular dynamics", "force field development"]
  }
}

{
  "type": "review_feedback",
  "source_file": "summary_statement_2024.pdf",
  "extracted": {
    "scores": {"excellence": 3, "impact": 4, "implementation": 2},
    "overall": "Below threshold",
    "criticisms": [
      {"reviewer": 1, "category": "methodology", "text": "Statistical plan is insufficient..."},
      ...
    ]
  }
}

# Document Import Report

**Source folder**: <docs_path>
**Documents processed**: N
**Date**: <timestamp>

## Imported Successfully

| Source Document | Type | Populated |
|----------------|------|-----------|
| cv_popescu_2024.pdf | CV | supporting/cv_pi.md |
| paper_gnn_binding.pdf | Paper | sections/preliminary_data.md, sections/figures/fig1.png |
| budget_template.xlsx | Budget | budget/budget_input.yaml |
| letter_collaborator_munich.pdf | Letter | supporting/letters/letter_mueller_tum.md |
| foa_horizon_2025.pdf | FOA | foa_requirements.json |

## Pre-populated Sections

| Section | Source | Status |
|---------|--------|--------|
| supporting/cv_pi.md | cv_popescu_2024.pdf | Complete draft — PI should verify |
| budget/budget_input.yaml | budget_template.xlsx | Ready for calculation |
| sections/preliminary_data.md | paper_gnn_binding.pdf | Draft with 2 figures — PI should review |
| sections/bibliography.md | paper_gnn_binding.pdf + refs.bib | 15 references imported |
| landscape/prior_support.md | previous_pce_proposal.pdf | Previous UEFISCDI PCE documented |
| foa_requirements.json | foa_horizon_2025.pdf | Extracted - verify deadline and page limits |

## Still Needs Manual Input

| What | Why |
|------|-----|
| Institutional indirect cost rate | Not found in any document — ask PI |
| Facilities description | No facilities document provided |
| Data management plan | No previous DMP found |
| Ethics self-assessment | Requires PI input on human subjects, animals, data |
| Collaboration details | Letters mention collaborations but scope needs PI confirmation |

## Warnings

- **previous_pce_proposal.pdf**: Extracted aims and methodology as starting points. DO NOT copy verbatim — adapt for the new proposal to avoid overlap issues.
- **budget_template.xlsx**: Salary figures from 2023 — PI should confirm current rates.
- **fig_preliminary_results.png**: Resolution is 150 DPI, which may be too low for submission.

uv run gw-state update <proposal_dir> --phase import_docs --status complete

Import Docs

Import Source Documents

Arguments

Supported Document Types

Procedure

1. Scan Document Folder

2. Convert Documents to Text

3. Classify Each Document

4. Extract Structured Information

From CV / Resume:

From Research Papers:

From Previous Grant Proposals:

From FOA / Call Documents:

From Budget Spreadsheets:

From Letters of Support:

From Review Feedback (for resubmission):

From Figures:

From Bibliography Files:

5. Generate Import Report

6. Human Checkpoint

7. Update State

Error Handling
