Code Analysis

Code analysis is the process of extracting structural information from source files and computing dependency metrics to understand code organization and importance. This three-stage pipeline—per-file extraction, import resolution, and graph metrics computation—transforms raw source code into a queryable database of symbols, dependencies, and file rankings that power documentation generation.

The per-file extraction stage uses an LLM to analyze individual source files and extract their exports, imports, and top-level symbols. Files up to 50KB are processed in parallel (with a concurrency limit of 10), each analyzed in a single pass to produce structured metadata.

export async function extractPerFile(db: Database, usage: UsageTracker): Promise<void> {
  const files = db
    .prepare(
      "SELECT id, path, relative_path, language, size_bytes FROM files WHERE size_bytes <= ?",
    )
    .all(MAX_FILE_SIZE) as Array<{
    id: number;
    path: string;
    relative_path: string;
    language: string;
    size_bytes: number;
  }>;

  // Process files with a concurrency limit
  const processFile = async (file: (typeof files)[number]) => {
    const content = await Bun.file(file.path).text();
    const result = await generateText({
      model,
      output: Output.object({ schema: ExtractionSchema }),
      prompt: `Extract the exports, imports, and top-level symbols from this ${file.language} file...`,
    });
    // Store results in the database
  };
}
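
The concurrency limit elided in the excerpt can be implemented with a small worker-pool helper. A minimal sketch, assuming nothing beyond standard Promises (the name `mapWithConcurrency` is illustrative; the actual pipeline may use a library such as p-limit):

```typescript
// Concurrency-limited map: runs at most `limit` tasks at a time.
// Illustrative helper, not the source's actual implementation.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0; // shared cursor; safe because index claiming is synchronous
  const workers = Array.from({ length: Math.min(limit, items.length) }, async () => {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  });
  await Promise.all(workers);
  return results;
}
```

Each worker claims the next index before awaiting, so results land in input order regardless of completion order.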

The extraction schema captures three types of information:

  • Exports: Named identifiers exported from the file (functions, classes, types, etc.)
  • Imports: External dependencies and their module sources
  • Symbols: All top-level definitions with their line ranges

Results are validated with Zod schemas and persisted in a single database transaction per file, ensuring consistency.

After extraction, relative import paths are resolved against the actual file tree to establish concrete dependency edges. The resolver queries all imports from the database and matches each relative import path to a target file using multiple strategies.

function tryResolve(base: string, pathToId: Map<string, number>): number | null {
  // Exact match
  const exact = pathToId.get(base);
  if (exact != null) return exact;

  // Try with extensions (.ts, .tsx, .js, .jsx, .mjs)
  for (const ext of EXTENSIONS) {
    const id = pathToId.get(base + ext);
    if (id != null) return id;
  }

  // Try as directory index (index.ts, __init__.py, etc.)
  for (const index of INDEX_FILES) {
    const id = pathToId.get(base + "/" + index);
    if (id != null) return id;
  }

  return null;
}

The resolution process:

  1. Computes absolute paths for each relative import using the importer’s directory
  2. Attempts exact match, then tries common file extensions
  3. Checks for directory index files (both TypeScript/JavaScript and Python conventions)
  4. Creates import edges in the database with automatic deduplication
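
Step 1 can be sketched with Node's posix path utilities. The function name `resolveImportBase` is illustrative, not from the source:

```typescript
import * as path from "node:path";

// Turn a relative import into a repository-relative base path.
// `importerRelPath` is the importing file's path relative to the repo root.
function resolveImportBase(importerRelPath: string, importPath: string): string {
  const dir = path.posix.dirname(importerRelPath);
  // join + normalize collapse "./" and "../" segments
  return path.posix.normalize(path.posix.join(dir, importPath));
}
```

For example, `"../utils/helpers"` imported from `src/app/main.ts` resolves to the base `src/utils/helpers`, which `tryResolve` then matches against the file table.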

The dependency graph is built from resolved import edges, creating a directed graph where nodes are files and edges represent import relationships. This graph forms the foundation for subsequent metrics computation.

The graph stores:

  • Nodes: All files in the repository
  • Edges: Resolved import relationships from source to target files
  • Edge metadata: Import kind (currently all marked as ‘import’)

Once constructed, the graph is immutable during the metrics phase and enables analysis of code structure and dependency patterns.
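
A minimal sketch of building the adjacency structure the metrics phase consumes, assuming edge rows shaped like the import edges described above (names are illustrative):

```typescript
// Edge shape mirrors the resolved import rows: source imports target.
interface Edge {
  source_file_id: number;
  target_file_id: number;
}

// Build out-link adjacency lists keyed by file id.
// Illustrative sketch, not the source's actual graph code.
function buildOutLinks(fileIds: number[], edges: Edge[]): Map<number, number[]> {
  const outLinks = new Map<number, number[]>();
  for (const id of fileIds) outLinks.set(id, []);
  for (const e of edges) {
    // Ignore edges whose endpoints are not in the node set
    if (outLinks.has(e.source_file_id) && outLinks.has(e.target_file_id)) {
      outLinks.get(e.source_file_id)!.push(e.target_file_id);
    }
  }
  return outLinks;
}
```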

File importance is computed using the PageRank algorithm with a damping factor of 0.85 over 20 iterations. PageRank models the probability of reaching a file by following random import paths, treating files with more incoming dependencies as more important.

const DAMPING = 0.85;
const ITERATIONS = 20;

// Iterative PageRank computation
for (let i = 0; i < ITERATIONS; i++) {
  const next = new Map<number, number>();
  for (const id of fileIds) next.set(id, base);
  for (const id of fileIds) {
    const links = outLinks.get(id);
    if (!links || links.length === 0) continue;
    const share = (DAMPING * rank.get(id)!) / links.length;
    for (const target of links) {
      next.set(target, next.get(target)! + share);
    }
  }
  for (const id of fileIds) rank.set(id, next.get(id)!);
}

The algorithm distributes each file’s rank equally among its import targets, adding a uniform base contribution (the teleportation term derived from the damping factor). Files that are imported by high-ranking files receive higher scores.
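
The excerpt omits initialization, so a self-contained toy run helps make the behavior concrete. Here files 2 and 3 both import file 1, and `base` is assumed to be the standard teleportation term (1 − DAMPING) / N, consistent with the uniform base contribution described above:

```typescript
// Toy PageRank over three files: files 2 and 3 each import file 1.
const DAMPING = 0.85;
const ITERATIONS = 20;
const fileIds = [1, 2, 3];
const outLinks = new Map<number, number[]>([[1, []], [2, [1]], [3, [1]]]);
// Assumed initialization: uniform starting rank, standard teleportation base.
const base = (1 - DAMPING) / fileIds.length;
const rank = new Map<number, number>(fileIds.map((id) => [id, 1 / fileIds.length]));

for (let i = 0; i < ITERATIONS; i++) {
  const next = new Map<number, number>();
  for (const id of fileIds) next.set(id, base);
  for (const id of fileIds) {
    const links = outLinks.get(id);
    // Note: a file with no out-links contributes nothing, so its rank
    // is not redistributed — a simplification in the excerpted loop.
    if (!links || links.length === 0) continue;
    const share = (DAMPING * rank.get(id)!) / links.length;
    for (const target of links) next.set(target, next.get(target)! + share);
  }
  for (const id of fileIds) rank.set(id, next.get(id)!);
}
// File 1, imported by both others, ends with the highest rank.
```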

The final metrics stage computes three aggregate measures for each file and stores them in the file_metrics table: in-degree (count of incoming imports), out-degree (count of outgoing imports), and PageRank score.

export async function computeGraphMetrics(db: Database): Promise<void> {
  const inDeg = new Map<number, number>();
  const outDeg = new Map<number, number>();
  for (const e of edges) {
    inDeg.set(e.target_file_id, (inDeg.get(e.target_file_id) ?? 0) + 1);
    outDeg.set(e.source_file_id, (outDeg.get(e.source_file_id) ?? 0) + 1);
  }

  // ... PageRank computation ...

  // Store all metrics in one transaction
  db.transaction(() => {
    for (const id of fileIds) {
      insert.run(id, inDeg.get(id) ?? 0, outDeg.get(id) ?? 0, rank.get(id) ?? 0);
    }
  })();
}

These metrics enable ranking files by importance, identifying hub files with many dependencies, and understanding the overall connectivity of the codebase. The full pipeline runs sequentially—extraction → resolution → metrics—ensuring each stage completes before the next begins.
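
The sequential orchestration can be sketched as a generic stage runner. This is an illustrative stand-in, not the source's actual driver; the stage list mirrors the order described above:

```typescript
// Each stage must finish before the next begins, matching the
// extraction → resolution → metrics ordering described above.
type Stage = { name: string; run: () => Promise<void> };

async function runStages(stages: Stage[]): Promise<string[]> {
  const finished: string[] = [];
  for (const s of stages) {
    await s.run(); // awaiting here enforces strictly sequential execution
    finished.push(s.name);
  }
  return finished;
}
```

In the real pipeline the three stages would wrap `extractPerFile`, the import resolver, and `computeGraphMetrics`.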