# Documentation Pipeline
The Documentation Pipeline is an automated five-stage orchestration system that transforms a raw codebase into fully-documented markdown files using LLM-powered analysis and generation. Each stage builds upon the previous one, persisting results to SQLite and creating a searchable Meilisearch index, enabling incremental re-processing of only changed files on subsequent runs.
## Pipeline Overview

The documentation pipeline is designed to automatically generate comprehensive documentation from source code with minimal manual intervention. It employs a sequential five-stage architecture in which each stage produces concrete artifacts (SQLite tables, search indexes, or markdown files) that downstream stages consume. This design allows the pipeline to be re-executed incrementally: only changed files are reprocessed, dramatically reducing computation time on subsequent runs.
The orchestrator is stateless between stages; all state lives in persistent storage. The pipeline is optimized for a 20B-parameter LLM running on commodity hardware, with built-in parallelism within stages to maximize throughput while respecting API rate limits.
## Stage 1: File Tree Building

The first stage performs a deterministic walk of the repository to create a comprehensive file inventory. It recursively scans the directory, respects ignore patterns (.gitignore, .autodocignore), and filters out binary files using content detection.
```ts
export async function buildFileTree(repoRoot: string, db: Database): Promise<void> {
  const files = await scan(repoRoot);
  console.log(` found ${files.length} text files`);

  if (files.length > FILE_COUNT_THRESHOLD) {
    // Prompt user confirmation for large repositories
    const response = await getUserConfirmation();
    if (response !== "y") {
      process.exit(0);
    }
  }

  insertFiles(files, db);
  await detectLanguages(db);
}
```

Each file is stored in SQLite with metadata including relative path, extension, size, and SHA-256 checksum. Programming language detection happens via simple extension mapping (TypeScript, Python, Go, Rust, etc.), with a fallback to "unknown". The checksum enables incremental re-runs by detecting which files have changed since the last execution.
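The detection and checksum steps described above can be sketched as follows. This is illustrative only: the extension map is abbreviated, and `detectLanguage`/`checksum` are hypothetical helper names, not the pipeline's actual API.

```ts
import { createHash } from "node:crypto";

// Illustrative extension-to-language map; the real table would cover
// more languages, falling back to "unknown" for unrecognised extensions.
const EXTENSION_MAP: Record<string, string> = {
  ".ts": "typescript",
  ".py": "python",
  ".go": "go",
  ".rs": "rust",
};

export function detectLanguage(relativePath: string): string {
  const dot = relativePath.lastIndexOf(".");
  const ext = dot === -1 ? "" : relativePath.slice(dot);
  return EXTENSION_MAP[ext] ?? "unknown";
}

// SHA-256 checksum stored per file; on the next run, a differing
// checksum marks the file as changed so it re-enters at Stage 2.
export function checksum(content: string): string {
  return createHash("sha256").update(content).digest("hex");
}
```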
## Stage 2: Structure Extraction

Stage 2 analyzes the code structure of each file in parallel, using LLM calls to extract symbols, imports, exports, and their relationships. It consists of three sequential operations:
Per-file extraction invokes Claude on each file with a prompt requesting structured JSON output:
```ts
const result = await generateText({
  model,
  output: Output.object({ schema: SummarySchema }),
  prompt: `Summarise this ${file.language} file in 2-4 sentences...
Path: ${file.relative_path}
Symbols: ${symbols}
Exports: ${exports}
\`\`\`
${content}
\`\`\``,
});
```

This extracts exports (functions, classes, types), imports (with resolved and external flags), and top-level symbols with their signatures and line numbers.
Import path resolution is deterministic: it matches relative import paths against the file tree, handling common patterns like index.ts and __init__.py barrel files. Unresolved imports are flagged for review.
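A minimal sketch of this resolution step, assuming a suffix-candidate strategy; the candidate list and function names are illustrative, not the pipeline's actual code:

```ts
import * as path from "node:path";

// Candidate suffixes tried in order, covering barrel files
// (index.ts, __init__.py); this list is an assumption.
const CANDIDATE_SUFFIXES = ["", ".ts", ".tsx", "/index.ts", ".py", "/__init__.py"];

export function resolveImport(
  importerPath: string,  // e.g. "src/app/main.ts"
  specifier: string,     // e.g. "../lib/utils"
  fileTree: Set<string>, // relative paths recorded in Stage 1
): string | null {
  // Non-relative specifiers are external packages: flagged, not resolved
  if (!specifier.startsWith(".")) return null;
  const base = path.posix.join(path.posix.dirname(importerPath), specifier);
  for (const suffix of CANDIDATE_SUFFIXES) {
    const candidate = base + suffix;
    if (fileTree.has(candidate)) return candidate;
  }
  return null; // unresolved: flagged for review
}
```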
Graph metrics computation calculates in-degree, out-degree, and PageRank on the import dependency graph. High in-degree files are core modules (document first), while high out-degree files are orchestrators or integration points.
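For illustration, a plain power-iteration PageRank over the import graph might look like this. The damping factor and iteration count are assumptions, and dangling files (no outgoing imports) simply leak rank in this simplified version:

```ts
// Simplified power-iteration PageRank: each file spreads a damped
// share of its rank to the files it imports.
export function pageRank(
  edges: Map<string, string[]>, // importer -> files it imports
  iterations = 50,
  damping = 0.85,
): Map<string, number> {
  const nodes = new Set<string>();
  for (const [from, tos] of edges) {
    nodes.add(from);
    for (const to of tos) nodes.add(to);
  }
  const n = nodes.size;
  let rank = new Map([...nodes].map((v) => [v, 1 / n] as [string, number]));
  for (let i = 0; i < iterations; i++) {
    const next = new Map(
      [...nodes].map((v) => [v, (1 - damping) / n] as [string, number]),
    );
    for (const [from, tos] of edges) {
      if (tos.length === 0) continue;
      const share = (damping * rank.get(from)!) / tos.length;
      for (const to of tos) next.set(to, next.get(to)! + share);
    }
    rank = next;
  }
  return rank;
}
```

A file imported by many others (high in-degree) accumulates rank, which is why it sorts to the top of the "document first" list.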
## Stage 3: Summarization

Stage 3 generates concise summaries in a bottom-up fashion:
File summaries use the LLM to produce 2-4 sentence summaries for files under 50KB, incorporating their extracted symbols and exports. Up to 10 files are processed concurrently:
```ts
const MAX_FILE_SIZE = 50_000;
const CONCURRENCY = 10;

export async function summariseFiles(db: Database, usage: UsageTracker): Promise<void> {
  // Process files with structured output, track concurrency
  for (const file of files) {
    while (active >= CONCURRENCY) {
      await Promise.race(results);
    }
    // ... process file concurrently
  }
}
```

Folder summaries traverse the file tree in post-order, aggregating child file and folder summaries into concise 2-4 sentence descriptions. The root folder summary becomes a project overview.
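The post-order folder aggregation can be sketched like this, with `summarise` standing in for the LLM call; the `Folder` shape is an assumption for illustration:

```ts
// Post-order traversal: a folder's summary is produced only after all
// of its children (files and subfolders) have been summarised.
interface Folder {
  path: string;
  subfolders: Folder[];
  fileSummaries: string[]; // Stage 3 summaries of direct child files
}

export function summariseFolders(
  root: Folder,
  summarise: (path: string, childSummaries: string[]) => string,
  out: Map<string, string> = new Map(),
): Map<string, string> {
  const childSummaries: string[] = [];
  for (const sub of root.subfolders) {
    summariseFolders(sub, summarise, out); // children first
    childSummaries.push(out.get(sub.path)!);
  }
  childSummaries.push(...root.fileSummaries);
  out.set(root.path, summarise(root.path, childSummaries));
  return out;
}
```

Because the root is visited last, its summary aggregates everything beneath it, which is what lets it serve as the project overview.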
Search indexing pushes all summaries plus symbol names into Meilisearch with filterable language tags and sortable PageRank scores, enabling fast retrieval in later stages.
## Stage 4: Topic Planning

Stage 4 uses the LLM to identify what documentation topics should exist, operating on the project overview and highest-importance files:
The planning phase queries the top 20 files by PageRank, assembles their summaries, optionally includes recent git activity, and prompts the LLM to produce a Diátaxis-structured plan. The result is an array of topics with:
- Title: Human-readable topic name (e.g., “Authentication System”)
- Scope: What this topic covers
- Priority: Order in which users should learn the content
- Sections: Sub-sections within the topic
- Relevant files: File paths referenced by the LLM
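The plan items above might map to a shape like the following, alongside a helper that selects the planner's input files; field and function names here are assumptions:

```ts
// Assumed shape of one planned topic (mirrors the fields listed above).
interface PlannedTopic {
  title: string;           // e.g. "Authentication System"
  scope: string;           // what this topic covers
  priority: number;        // order in which users should learn it
  sections: string[];      // sub-sections within the topic
  relevantFiles: string[]; // file paths referenced by the LLM
}

// Pick the top-N files by PageRank as the planner's evidence base.
export function topFilesByPageRank(
  files: { path: string; pagerank: number }[],
  limit = 20,
): string[] {
  return [...files]
    .sort((a, b) => b.pagerank - a.pagerank)
    .slice(0, limit)
    .map((f) => f.path);
}
```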
Reference resolution converts LLM-generated file paths to SQLite IDs using deterministic matching (exact match first, suffix match fallback).
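A sketch of the exact-then-suffix matching; the function name is hypothetical and the real matcher works against SQLite rather than an in-memory map:

```ts
// Resolve an LLM-written path to a file id: exact match first, then a
// unique path-suffix match; ambiguous or missing paths return null.
export function resolveTopicFile(
  llmPath: string,
  knownPaths: Map<string, number>, // relative path -> SQLite file id
): number | null {
  const exact = knownPaths.get(llmPath);
  if (exact !== undefined) return exact;
  const matches = [...knownPaths.entries()].filter(
    ([p]) => p.endsWith("/" + llmPath),
  );
  return matches.length === 1 ? matches[0][1] : null;
}
```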
Coverage validation ensures high-PageRank files appear in at least one topic. If core files are missed, the pipeline can optionally re-prompt the planning stage with the missing files explicitly listed.
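This coverage check reduces to a set difference; a minimal sketch, with names assumed:

```ts
// Return high-importance files that no planned topic references; a
// non-empty result can trigger a re-prompt listing them explicitly.
export function findUncoveredFiles(
  coreFiles: string[], // high-PageRank paths
  topics: { relevantFiles: string[] }[],
): string[] {
  const covered = new Set(topics.flatMap((t) => t.relevantFiles));
  return coreFiles.filter((f) => !covered.has(f));
}
```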
## Stage 5: Documentation Writing

Stage 5 generates the final markdown documentation files, one per topic, using Claude Sonnet with access to search and file-reading tools:
Context gathering pre-loads relevant file summaries, symbols, and cross-reference data from SQLite before invoking the LLM, reducing tool calls and context window usage.
Draft generation invokes the writer with a structured prompt requesting 200-500 word markdown with code examples, cross-references, and proper markdown formatting. The writer has access to two tools:
- `search_codebase`: Query Meilisearch for files/symbols related to a concept
- `read_file`: Read full source of a specific file by path
The LLM is limited to 2-3 tool calls per topic to control costs.
Cross-referencing post-processes markdown files to detect topic title mentions and inject internal links, then generates a README.md table of contents.
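A deliberately simplified version of the link-injection pass; it links only the first mention of each title and does not guard against matches inside headings or existing links, which a real pass would need to handle:

```ts
// Replace the first plain-text mention of each topic's title with a
// relative markdown link to that topic's generated file.
export function injectCrossReferences(
  markdown: string,
  topics: { title: string; slug: string }[],
): string {
  let out = markdown;
  for (const { title, slug } of topics) {
    // String.replace with a string pattern replaces only the first hit
    out = out.replace(title, `[${title}](./${slug}.md)`);
  }
  return out;
}
```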
Output validation checks for word count (minimum 100 words), language tags on code blocks, and broken internal links, emitting warnings for quality issues.
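The three checks above can be sketched as one validator; the function name, link format, and warning wording are assumptions:

```ts
// Quality gate for a generated doc: word count, untagged code fences,
// and internal links pointing at unknown topic files.
export function validateDoc(
  markdown: string,
  knownTopicSlugs: Set<string>,
): string[] {
  const warnings: string[] = [];

  const words = markdown.split(/\s+/).filter(Boolean).length;
  if (words < 100) warnings.push(`only ${words} words (minimum 100)`);

  let inFence = false;
  for (const line of markdown.split("\n")) {
    const fence = line.match(/^```(\w*)/);
    if (!fence) continue;
    if (!inFence && fence[1] === "") warnings.push("code block missing language tag");
    inFence = !inFence; // alternate between opening and closing fences
  }

  // Assumed internal-link convention: [Title](./slug.md)
  for (const link of markdown.matchAll(/\]\(\.\/([\w-]+)\.md\)/g)) {
    if (!knownTopicSlugs.has(link[1])) warnings.push(`broken internal link: ${link[1]}`);
  }
  return warnings;
}
```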
## Incremental Processing

The pipeline supports efficient re-runs on modified codebases:
- Changed files: Files with altered checksums re-enter at Stage 2 (structure extraction)
- Folder updates: If any file in a folder changed, its summary is regenerated
- Topic updates: Only topics referencing changed files are rewritten
- Full replan: Stage 4 is re-run only if >20% of files changed, or on explicit request
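These rules reduce, roughly, to a checksum diff plus a replan threshold. A sketch under assumed names; the exact threshold handling and data shapes are not specified in the docs:

```ts
// Decide what to re-run by diffing stored vs current checksums.
interface RerunPlan {
  filesToReextract: string[]; // changed files re-enter at Stage 2
  replanTopics: boolean;      // re-run Stage 4 from scratch
}

export function planRerun(
  stored: Map<string, string>,  // path -> checksum from the last run
  current: Map<string, string>, // path -> checksum now
): RerunPlan {
  const changed: string[] = [];
  for (const [p, sum] of current) {
    if (stored.get(p) !== sum) changed.push(p); // new or modified file
  }
  return {
    filesToReextract: changed,
    replanTopics: changed.length > 0.2 * current.size, // >20% changed
  };
}
```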
This strategy dramatically reduces computation time for large repositories with small incremental changes, enabling continuous documentation updates during active development.