# Documentation Pipeline
The Documentation Pipeline is an automated five-stage orchestration system that transforms a raw codebase into fully-documented markdown files using LLM-powered analysis and generation. Each stage builds upon the previous one, persisting results to SQLite and creating a searchable Meilisearch index, enabling incremental re-processing of only changed files on subsequent runs.
## Pipeline Overview

The documentation pipeline is designed to automatically generate comprehensive documentation from source code with minimal manual intervention. It employs a sequential five-stage architecture in which each stage produces concrete artifacts (SQLite tables, search indexes, or markdown files) that downstream stages consume. This design allows the pipeline to be re-executed incrementally: only changed files are reprocessed, dramatically reducing computation time on subsequent runs.
The orchestrator is stateless between stages; all state lives in persistent storage. The pipeline is optimized for a 20B-parameter LLM running on commodity hardware, with built-in parallelism within stages to maximize throughput while respecting API rate limits.
## Stage 1: File Tree Building

The first stage performs a deterministic walk of the repository to create a comprehensive file inventory. It recursively scans the directory, respects ignore patterns (.gitignore, .autodocignore), and filters out binary files using content detection.
```ts
export async function buildFileTree(repoRoot: string, db: Database): Promise<void> {
  const files = await scan(repoRoot);
  console.log(` found ${files.length} text files`);

  if (files.length > FILE_COUNT_THRESHOLD) {
    // Prompt user confirmation for large repositories
    const response = await getUserConfirmation();
    if (response !== "y") {
      process.exit(0);
    }
  }

  insertFiles(files, db);
  await detectLanguages(db);
}
```

Each file is stored in SQLite with metadata including relative path, extension, size, and SHA-256 checksum. Programming language detection happens via simple extension mapping (TypeScript, Python, Go, Rust, etc.), with a fallback to "unknown". The checksum enables incremental re-runs by detecting which files have changed since the last execution.
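The detection and checksum steps described above can be sketched as follows. This is illustrative only: the extension map is abbreviated, and `detectLanguage`/`checksum` are hypothetical helper names, not the pipeline's actual API.

```ts
import { createHash } from "node:crypto";

// Illustrative extension-to-language map; the real table would cover
// more languages, falling back to "unknown" for unrecognised extensions.
const EXTENSION_MAP: Record<string, string> = {
  ".ts": "typescript",
  ".py": "python",
  ".go": "go",
  ".rs": "rust",
};

export function detectLanguage(relativePath: string): string {
  const dot = relativePath.lastIndexOf(".");
  const ext = dot === -1 ? "" : relativePath.slice(dot);
  return EXTENSION_MAP[ext] ?? "unknown";
}

// SHA-256 checksum stored per file; on the next run, a differing
// checksum marks the file as changed so it re-enters at Stage 2.
export function checksum(content: string): string {
  return createHash("sha256").update(content).digest("hex");
}
```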
## Stage 2: Structure Extraction

Stage 2 analyzes the code structure of each file in parallel, using LLM calls to extract symbols, imports, exports, and their relationships. It consists of three sequential operations:
Per-file extraction invokes Claude on each file with a prompt requesting structured JSON output:
```ts
const result = await generateText({
  model,
  output: Output.object({ schema: SummarySchema }),
  prompt: `Summarise this ${file.language} file in 2-4 sentences...
Path: ${file.relative_path}
Symbols: ${symbols}
Exports: ${exports}
\`\`\`
${content}
\`\`\``,
});
```

This extracts exports (functions, classes, types), imports (with resolved and external flags), and top-level symbols with their signatures and line numbers.
Import path resolution is deterministic: it matches relative import paths against the file tree, handling common patterns like index.ts and __init__.py barrel files. Unresolved imports are flagged for review.
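A minimal sketch of this resolution step, assuming a suffix-candidate strategy; the candidate list and function names are illustrative, not the pipeline's actual code:

```ts
import * as path from "node:path";

// Candidate suffixes tried in order, covering barrel files
// (index.ts, __init__.py); this list is an assumption.
const CANDIDATE_SUFFIXES = ["", ".ts", ".tsx", "/index.ts", ".py", "/__init__.py"];

export function resolveImport(
  importerPath: string,  // e.g. "src/app/main.ts"
  specifier: string,     // e.g. "../lib/utils"
  fileTree: Set<string>, // relative paths recorded in Stage 1
): string | null {
  // Non-relative specifiers are external packages: flagged, not resolved
  if (!specifier.startsWith(".")) return null;
  const base = path.posix.join(path.posix.dirname(importerPath), specifier);
  for (const suffix of CANDIDATE_SUFFIXES) {
    const candidate = base + suffix;
    if (fileTree.has(candidate)) return candidate;
  }
  return null; // unresolved: flagged for review
}
```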
Graph metrics computation calculates in-degree, out-degree, and PageRank on the import dependency graph. High in-degree files are core modules (document first), while high out-degree files are orchestrators or integration points.
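For illustration, a plain power-iteration PageRank over the import graph might look like this. The damping factor and iteration count are assumptions, and dangling files (no outgoing imports) simply leak rank in this simplified version:

```ts
// Simplified power-iteration PageRank: each file spreads a damped
// share of its rank to the files it imports.
export function pageRank(
  edges: Map<string, string[]>, // importer -> files it imports
  iterations = 50,
  damping = 0.85,
): Map<string, number> {
  const nodes = new Set<string>();
  for (const [from, tos] of edges) {
    nodes.add(from);
    for (const to of tos) nodes.add(to);
  }
  const n = nodes.size;
  let rank = new Map([...nodes].map((v) => [v, 1 / n] as [string, number]));
  for (let i = 0; i < iterations; i++) {
    const next = new Map(
      [...nodes].map((v) => [v, (1 - damping) / n] as [string, number]),
    );
    for (const [from, tos] of edges) {
      if (tos.length === 0) continue;
      const share = (damping * rank.get(from)!) / tos.length;
      for (const to of tos) next.set(to, next.get(to)! + share);
    }
    rank = next;
  }
  return rank;
}
```

A file imported by many others (high in-degree) accumulates rank, which is why it sorts to the top of the "document first" list.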
## Stage 3: Summarization

Stage 3 generates concise summaries in a bottom-up fashion:
File summaries use the LLM to produce 2-4 sentence summaries for files under 50KB, incorporating their extracted symbols and exports. Up to 10 files are processed concurrently:
```ts
const MAX_FILE_SIZE = 50_000;
const CONCURRENCY = 10;

export async function summariseFiles(db: Database, usage: UsageTracker): Promise<void> {
  // Process files with structured output, track concurrency
  for (const file of files) {
    while (active >= CONCURRENCY) {
      await Promise.race(results);
    }
    // ... process file concurrently
  }
}
```

Folder summaries traverse the file tree in post-order, aggregating child file and folder summaries into concise 2-4 sentence descriptions. The root folder summary becomes a project overview.
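The post-order folder aggregation can be sketched like this, with `summarise` standing in for the LLM call; the `Folder` shape is an assumption for illustration:

```ts
// Post-order traversal: a folder's summary is produced only after all
// of its children (files and subfolders) have been summarised.
interface Folder {
  path: string;
  subfolders: Folder[];
  fileSummaries: string[]; // Stage 3 summaries of direct child files
}

export function summariseFolders(
  root: Folder,
  summarise: (path: string, childSummaries: string[]) => string,
  out: Map<string, string> = new Map(),
): Map<string, string> {
  const childSummaries: string[] = [];
  for (const sub of root.subfolders) {
    summariseFolders(sub, summarise, out); // children first
    childSummaries.push(out.get(sub.path)!);
  }
  childSummaries.push(...root.fileSummaries);
  out.set(root.path, summarise(root.path, childSummaries));
  return out;
}
```

Because the root is visited last, its summary aggregates everything beneath it, which is what lets it serve as the project overview.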
Search indexing pushes all summaries plus symbol names into Meilisearch with filterable language tags and sortable PageRank scores, enabling fast retrieval in later stages.
## Stage 4: Topic Planning

Stage 4 uses the LLM to identify what documentation topics should exist, operating on the project overview and highest-importance files:
The planning phase queries the top 20 files by PageRank, assembles their summaries, optionally includes recent git activity, and prompts the LLM to produce a Diátaxis-structured plan. The result is an array of topics with:
- Title: Human-readable topic name (e.g., “Authentication System”)
- Scope: What this topic covers
- Priority: Order in which users should learn the content
- Sections: Sub-sections within the topic
- Relevant files: File paths referenced by the LLM
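The plan items above might map to a shape like the following, alongside a helper that selects the planner's input files; field and function names here are assumptions:

```ts
// Assumed shape of one planned topic (mirrors the fields listed above).
interface PlannedTopic {
  title: string;           // e.g. "Authentication System"
  scope: string;           // what this topic covers
  priority: number;        // order in which users should learn it
  sections: string[];      // sub-sections within the topic
  relevantFiles: string[]; // file paths referenced by the LLM
}

// Pick the top-N files by PageRank as the planner's evidence base.
export function topFilesByPageRank(
  files: { path: string; pagerank: number }[],
  limit = 20,
): string[] {
  return [...files]
    .sort((a, b) => b.pagerank - a.pagerank)
    .slice(0, limit)
    .map((f) => f.path);
}
```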
Reference resolution converts LLM-generated file paths to SQLite IDs using deterministic matching (exact match first, suffix match fallback).
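A sketch of the exact-then-suffix matching; the function name is hypothetical and the real matcher works against SQLite rather than an in-memory map:

```ts
// Resolve an LLM-written path to a file id: exact match first, then a
// unique path-suffix match; ambiguous or missing paths return null.
export function resolveTopicFile(
  llmPath: string,
  knownPaths: Map<string, number>, // relative path -> SQLite file id
): number | null {
  const exact = knownPaths.get(llmPath);
  if (exact !== undefined) return exact;
  const matches = [...knownPaths.entries()].filter(
    ([p]) => p.endsWith("/" + llmPath),
  );
  return matches.length === 1 ? matches[0][1] : null;
}
```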
Coverage validation ensures high-PageRank files appear in at least one topic. If core files are missed, the pipeline can optionally re-prompt the planning stage with the missing files explicitly listed.
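This coverage check reduces to a set difference; a minimal sketch, with names assumed:

```ts
// Return high-importance files that no planned topic references; a
// non-empty result can trigger a re-prompt listing them explicitly.
export function findUncoveredFiles(
  coreFiles: string[], // high-PageRank paths
  topics: { relevantFiles: string[] }[],
): string[] {
  const covered = new Set(topics.flatMap((t) => t.relevantFiles));
  return coreFiles.filter((f) => !covered.has(f));
}
```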
## Stage 5: Documentation Writing

Stage 5 generates the final markdown documentation files, one per topic, using Claude Sonnet with access to search and file-reading tools:
Context gathering pre-loads relevant file summaries, symbols, and cross-reference data from SQLite before invoking the LLM, reducing tool calls and context window usage.
Draft generation invokes the writer with a structured prompt requesting 200-500 word markdown with code examples, cross-references, and proper markdown formatting. The writer has access to two tools:
- `search_codebase`: Query Meilisearch for files/symbols related to a concept
- `read_file`: Read full source of a specific file by path
The LLM is limited to 2-3 tool calls per topic to control costs.
Cross-referencing post-processes markdown files to detect topic title mentions and inject internal links, then generates a README.md table of contents.
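A deliberately simplified version of the link-injection pass; it links only the first mention of each title and does not guard against matches inside headings or existing links, which a real pass would need to handle:

```ts
// Replace the first plain-text mention of each topic's title with a
// relative markdown link to that topic's generated file.
export function injectCrossReferences(
  markdown: string,
  topics: { title: string; slug: string }[],
): string {
  let out = markdown;
  for (const { title, slug } of topics) {
    // String.replace with a string pattern replaces only the first hit
    out = out.replace(title, `[${title}](./${slug}.md)`);
  }
  return out;
}
```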
Output validation checks for word count (minimum 100 words), language tags on code blocks, and broken internal links, emitting warnings for quality issues.
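The three checks above can be sketched as one validator; the function name, link format, and warning wording are assumptions:

```ts
// Quality gate for a generated doc: word count, untagged code fences,
// and internal links pointing at unknown topic files.
export function validateDoc(
  markdown: string,
  knownTopicSlugs: Set<string>,
): string[] {
  const warnings: string[] = [];

  const words = markdown.split(/\s+/).filter(Boolean).length;
  if (words < 100) warnings.push(`only ${words} words (minimum 100)`);

  let inFence = false;
  for (const line of markdown.split("\n")) {
    const fence = line.match(/^```(\w*)/);
    if (!fence) continue;
    if (!inFence && fence[1] === "") warnings.push("code block missing language tag");
    inFence = !inFence; // alternate between opening and closing fences
  }

  // Assumed internal-link convention: [Title](./slug.md)
  for (const link of markdown.matchAll(/\]\(\.\/([\w-]+)\.md\)/g)) {
    if (!knownTopicSlugs.has(link[1])) warnings.push(`broken internal link: ${link[1]}`);
  }
  return warnings;
}
```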
## Incremental Processing

The pipeline supports efficient re-runs on modified codebases:
- Changed files: Files with altered checksums re-enter at Stage 2 (structure extraction)
- Folder updates: If any file in a folder changed, its summary is regenerated
- Topic updates: Only topics referencing changed files are rewritten
- Full replan: Stage 4 is re-run only if >20% of files changed, or on explicit request
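These rules reduce, roughly, to a checksum diff plus a replan threshold. A sketch under assumed names; the exact threshold handling and data shapes are not specified in the docs:

```ts
// Decide what to re-run by diffing stored vs current checksums.
interface RerunPlan {
  filesToReextract: string[]; // changed files re-enter at Stage 2
  replanTopics: boolean;      // re-run Stage 4 from scratch
}

export function planRerun(
  stored: Map<string, string>,  // path -> checksum from the last run
  current: Map<string, string>, // path -> checksum now
): RerunPlan {
  const changed: string[] = [];
  for (const [p, sum] of current) {
    if (stored.get(p) !== sum) changed.push(p); // new or modified file
  }
  return {
    filesToReextract: changed,
    replanTopics: changed.length > 0.2 * current.size, // >20% changed
  };
}
```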
This strategy dramatically reduces computation time for large repositories with small incremental changes, enabling continuous documentation updates during active development.