How to support ingestion of new file types using the loader registry pattern.
The ingestion pipeline uses a loader registry: a dict mapping file extensions to loader classes. Do NOT add elif chains; extend the registry.
Add the dependency if needed:
uv add <package> # e.g., python-docx, beautifulsoup4, openpyxl
Import the loader at the top of backend/app/rag.py:
from langchain_community.document_loaders import Docx2txtLoader
Register the extension in the loader registry dict in RAGService:
LOADER_REGISTRY = {
    ".pdf": PyPDFLoader,
    ".docx": Docx2txtLoader,
    ".csv": CSVLoader,
    ".html": BSHTMLLoader,
    ".py": TextLoader,  # use language-aware splitter for code
    # ... add new entry here
}
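As a rough illustration of why the registry replaces elif chains, the lookup can be sketched as a single dict access keyed on the lowercased extension. The stub loader classes, DEMO_REGISTRY, and get_loader below are hypothetical names used only for this sketch; the real RAGService may resolve loaders differently.

```python
from pathlib import Path


class PdfLoaderStub:
    """Stand-in for a real loader such as PyPDFLoader (hypothetical)."""

    def __init__(self, file_path: str) -> None:
        self.file_path = file_path


class TextLoaderStub:
    """Stand-in for a real loader such as TextLoader (hypothetical)."""

    def __init__(self, file_path: str) -> None:
        self.file_path = file_path


# Same shape as LOADER_REGISTRY: lowercase extension -> loader class.
DEMO_REGISTRY = {
    ".pdf": PdfLoaderStub,
    ".py": TextLoaderStub,
}


def get_loader(file_path: str, registry: dict = DEMO_REGISTRY):
    """Return a loader instance for file_path, or raise ValueError."""
    # Normalize case so "REPORT.PDF" resolves like "report.pdf".
    ext = Path(file_path).suffix.lower()
    loader_cls = registry.get(ext)
    if loader_cls is None:
        supported = ", ".join(sorted(registry))
        raise ValueError(f"Unsupported file type {ext!r}; supported: {supported}")
    return loader_cls(file_path)
```

With this shape, supporting a new file type is a one-line change to the dict, and unsupported extensions fail in one place with a message listing what is registered.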
Use language-aware splitting for code files (.py, .js, .ts, etc.):
# Newer LangChain releases expose these names in the langchain_text_splitters package.
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter
code_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=1000, chunk_overlap=200
)
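One possible way to pick the splitter language per file is a second small lookup keyed on extension. This is a sketch under assumptions: CODE_LANGUAGE_NAMES and pick_language_name are hypothetical, and the values are meant to be names of members of LangChain's Language enum (PYTHON, JS, TS), which the caller would resolve with Language[name].

```python
from pathlib import Path
from typing import Optional

# Hypothetical mapping: extension -> name of a Language enum member.
CODE_LANGUAGE_NAMES = {
    ".py": "PYTHON",
    ".js": "JS",
    ".ts": "TS",
}


def pick_language_name(file_path: str) -> Optional[str]:
    """Return the Language member name for a code file, or None for non-code files."""
    ext = Path(file_path).suffix.lower()
    return CODE_LANGUAGE_NAMES.get(ext)
```

A None result would mean the file falls back to the default, non-language-aware splitter.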
Update file type validation in the upload endpoint to allow the new extension.
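A minimal sketch of that validation, assuming the allowed set mirrors the keys of the loader registry so the two cannot drift apart. REGISTERED_EXTENSIONS and is_allowed_upload are hypothetical names; the actual upload endpoint may keep its own list.

```python
from pathlib import Path

# Hypothetical stand-in for set(LOADER_REGISTRY) in the real service.
REGISTERED_EXTENSIONS = {".pdf", ".docx", ".csv", ".html", ".py"}


def is_allowed_upload(filename: str, allowed=REGISTERED_EXTENSIONS) -> bool:
    """True if the filename's extension (case-insensitive) is registered."""
    return Path(filename).suffix.lower() in allowed
```

Deriving the allowed set from the registry means a new entry in the dict is enough to make the extension uploadable.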
Set file_id metadata on every chunk so vector cleanup works on file delete:
for doc in docs:
    doc.metadata["file_id"] = str(file_id)
    doc.metadata["source"] = filename
Example for .docx support:
uv add docx2txt
from langchain_community.document_loaders import Docx2txtLoader
".docx": Docx2txtLoader
On file delete, remove all chunks where metadata["file_id"] == str(file_id) from the workspace's ChromaDB collection.
On workspace delete, delete the entire ChromaDB collection for that workspace.
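The two cleanup rules can be sketched as follows. This assumes `collection` is a chromadb Collection (whose delete method accepts a `where` metadata filter) and `client` is a chromadb client (which provides delete_collection); the function names and the way collections are named per workspace are hypothetical.

```python
def delete_file_chunks(collection, file_id) -> None:
    """Remove every chunk tagged with this file_id from one workspace collection."""
    # The filter value is a string because ingestion stores str(file_id).
    collection.delete(where={"file_id": str(file_id)})


def delete_workspace_collection(client, collection_name: str) -> None:
    """Drop the whole collection that belongs to a deleted workspace."""
    # collection_name is whatever name the workspace's collection was created with.
    client.delete_collection(name=collection_name)
```

Filtering on metadata only works if every chunk was tagged with file_id at ingestion time, which is why that step is required for all loaders.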