Generate realistic synthetic PDF documents using LLM for RAG (Retrieval-Augmented Generation) and unstructured data use cases.

Overview

This skill uses the generate_pdf_documents MCP tool to create professional PDF documents with:

LLM-generated content based on your description
Accompanying JSON files with questions and evaluation guidelines (for RAG testing)
Automatic upload to Unity Catalog Volumes

Quick Start

Use the generate_pdf_documents MCP tool:

catalog: "my_catalog"
schema: "my_schema"
description: "Technical documentation for a cloud infrastructure platform including setup guides, troubleshooting procedures, and API references."
count: 10

This generates 10 PDF documents and saves them to /Volumes/my_catalog/my_schema/raw_data/pdf_documents/ (using default volume and folder).

With Custom Location

Generate realistic synthetic PDF documents using LLM for RAG (Retrieval-Augmented Generation) and unstructured data use cases.

Overview

This skill uses the generate_pdf_documents MCP tool to create professional PDF documents with:

LLM-generated content based on your description
Accompanying JSON files with questions and evaluation guidelines (for RAG testing)
Automatic upload to Unity Catalog Volumes

Quick Start

Use the generate_pdf_documents MCP tool:

catalog: "my_catalog"
schema: "my_schema"
description: "Technical documentation for a cloud infrastructure platform including setup guides, troubleshooting procedures, and API references."
count: 10

This generates 10 PDF documents and saves them to /Volumes/my_catalog/my_schema/raw_data/pdf_documents/ (using default volume and folder).

Parameter	Type	Required	Default	Description
`catalog`	string	Yes	-	Unity Catalog name
`schema`	string	Yes	-	Schema name
`description`	string	Yes	-	Detailed description of what PDFs should contain
`count`	int	Yes	-	Number of PDFs to generate
`volume`	string	No	`raw_data`	Volume name (created if not exists)
`folder`	string	No	`pdf_documents`	Folder within volume for output files
`doc_size`	string	No	`MEDIUM`	Document size: `SMALL` (~1 page), `MEDIUM` (~5 pages), `LARGE` (~10+ pages)
`overwrite_folder`	bool	No	`false`	If true, deletes existing folder contents first

Issue	Solution
"No LLM endpoint configured"	Set `DATABRICKS_MODEL` or `AZURE_OPENAI_DEPLOYMENT` environment variable
"Volume does not exist"	The tool creates volumes automatically; ensure you have CREATE VOLUME permission
"PDF generation timeout"	Reduce `count` or check LLM endpoint availability
Low quality content	Provide more detailed `description` with specific topics and document types

Unstructured Pdf Generation

Overview

Quick Start

With Custom Location

Unstructured Pdf Generation

Overview

Quick Start

With Custom Location

Parameters

Document Size Guide

Output Files

JSON Structure

Common Patterns

Pattern 1: HR Policy Documents

Pattern 2: Technical Documentation

Pattern 3: Financial Reports

Pattern 4: Training Materials

Workflow

Best Practices

Integration with RAG Pipelines

Environment Configuration

Common Issues

Feishu Doc

Summarize

Nano Pdf

Diffs

Customs Trade Compliance

Nutrient Document Processing

Unstructured Pdf Generation

Overview

Quick Start

With Custom Location

Unstructured Pdf Generation

Overview

Quick Start

With Custom Location

Parameters

Document Size Guide

Output Files

JSON Structure

Common Patterns

Pattern 1: HR Policy Documents

Pattern 2: Technical Documentation

Pattern 3: Financial Reports

Pattern 4: Training Materials

Workflow

Best Practices

Integration with RAG Pipelines

Environment Configuration

Common Issues

Related Skills

Feishu Doc

Summarize

Nano Pdf

Diffs

Customs Trade Compliance

Nutrient Document Processing