Code Analysis

Code analysis is the process of extracting structural information from source files and computing dependency metrics to understand code organization and importance. This three-stage pipeline—per-file extraction, import resolution, and graph metrics computation—transforms raw source code into a queryable database of symbols, dependencies, and file rankings that power documentation generation.

The per-file extraction stage uses an LLM to analyze individual source files and extract their exports, imports, and top-level symbols. Files up to 50KB are processed in parallel (with a concurrency limit of 10), each analyzed in a single pass to produce structured metadata.

export async function extractPerFile(db: Database, usage: UsageTracker): Promise<void> {
  const files = db
    .prepare(
      "SELECT id, path, relative_path, language, size_bytes FROM files WHERE size_bytes <= ?",
    )
    .all(MAX_FILE_SIZE) as Array<{
    id: number;
    path: string;
    relative_path: string;
    language: string;
    size_bytes: number;
  }>;

  // Process files with a concurrency limit
  const processFile = async (file: (typeof files)[number]) => {
    const content = await Bun.file(file.path).text();
    const result = await generateText({
      model,
      output: Output.object({ schema: ExtractionSchema }),
      prompt: `Extract the exports, imports, and top-level symbols from this ${file.language} file...`,
    });
    // Store results in the database
  };
}
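
The concurrency limit elided in the excerpt can be implemented with a small worker-pool helper. A minimal sketch, assuming nothing beyond standard Promises (the name `mapWithConcurrency` is illustrative; the actual pipeline may use a library such as p-limit):

```typescript
// Concurrency-limited map: runs at most `limit` tasks at a time.
// Illustrative helper, not the source's actual implementation.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0; // shared cursor; safe because index claiming is synchronous
  const workers = Array.from({ length: Math.min(limit, items.length) }, async () => {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  });
  await Promise.all(workers);
  return results;
}
```

Each worker claims the next index before awaiting, so results land in input order regardless of completion order.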

The extraction schema captures three types of information:

  • Exports: Named identifiers exported from the file (functions, classes, types, etc.)
  • Imports: External dependencies and their module sources
  • Symbols: All top-level definitions with their line ranges

Results are validated with Zod schemas and persisted in a single database transaction per file, ensuring consistency.

After extraction, relative import paths are resolved against the actual file tree to establish concrete dependency edges. The resolver queries all imports from the database and matches each relative import path to a target file using multiple strategies.

function tryResolve(base: string, pathToId: Map<string, number>): number | null {
  // Exact match
  const exact = pathToId.get(base);
  if (exact != null) return exact;

  // Try with extensions (.ts, .tsx, .js, .jsx, .mjs)
  for (const ext of EXTENSIONS) {
    const id = pathToId.get(base + ext);
    if (id != null) return id;
  }

  // Try as directory index (index.ts, __init__.py, etc.)
  for (const index of INDEX_FILES) {
    const id = pathToId.get(base + "/" + index);
    if (id != null) return id;
  }

  return null;
}

The resolution process:

  1. Computes absolute paths for each relative import using the importer’s directory
  2. Attempts exact match, then tries common file extensions
  3. Checks for directory index files (both TypeScript/JavaScript and Python conventions)
  4. Creates import edges in the database with automatic deduplication
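
Step 1 can be sketched with Node's posix path utilities. The function name `resolveImportBase` is illustrative, not from the source:

```typescript
import * as path from "node:path";

// Turn a relative import into a repository-relative base path.
// `importerRelPath` is the importing file's path relative to the repo root.
function resolveImportBase(importerRelPath: string, importPath: string): string {
  const dir = path.posix.dirname(importerRelPath);
  // join + normalize collapse "./" and "../" segments
  return path.posix.normalize(path.posix.join(dir, importPath));
}
```

For example, `"../utils/helpers"` imported from `src/app/main.ts` resolves to the base `src/utils/helpers`, which `tryResolve` then matches against the file table.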

The dependency graph is built from resolved import edges, creating a directed graph where nodes are files and edges represent import relationships. This graph forms the foundation for subsequent metrics computation.

The graph stores:

  • Nodes: All files in the repository
  • Edges: Resolved import relationships from source to target files
  • Edge metadata: Import kind (currently all marked as ‘import’)

Once constructed, the graph is immutable during the metrics phase and enables analysis of code structure and dependency patterns.
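
A minimal sketch of building the adjacency structure the metrics phase consumes, assuming edge rows shaped like the import edges described above (names are illustrative):

```typescript
// Edge shape mirrors the resolved import rows: source imports target.
interface Edge {
  source_file_id: number;
  target_file_id: number;
}

// Build out-link adjacency lists keyed by file id.
// Illustrative sketch, not the source's actual graph code.
function buildOutLinks(fileIds: number[], edges: Edge[]): Map<number, number[]> {
  const outLinks = new Map<number, number[]>();
  for (const id of fileIds) outLinks.set(id, []);
  for (const e of edges) {
    // Ignore edges whose endpoints are not in the node set
    if (outLinks.has(e.source_file_id) && outLinks.has(e.target_file_id)) {
      outLinks.get(e.source_file_id)!.push(e.target_file_id);
    }
  }
  return outLinks;
}
```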

File importance is computed using the PageRank algorithm with a damping factor of 0.85 over 20 iterations. PageRank models the probability of reaching a file by following random import paths, treating files with more incoming dependencies as more important.

const DAMPING = 0.85;
const ITERATIONS = 20;

// Iterative PageRank computation
for (let i = 0; i < ITERATIONS; i++) {
  const next = new Map<number, number>();
  for (const id of fileIds) next.set(id, base);
  for (const id of fileIds) {
    const links = outLinks.get(id);
    if (!links || links.length === 0) continue;
    const share = (DAMPING * rank.get(id)!) / links.length;
    for (const target of links) {
      next.set(target, next.get(target)! + share);
    }
  }
  for (const id of fileIds) rank.set(id, next.get(id)!);
}

The algorithm distributes each file’s rank equally among its import targets, adding a uniform base contribution (the teleportation term derived from the damping factor). Files that are imported by high-ranking files receive higher scores.
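
The excerpt omits initialization, so a self-contained toy run helps make the behavior concrete. Here files 2 and 3 both import file 1, and `base` is assumed to be the standard teleportation term (1 − DAMPING) / N, consistent with the uniform base contribution described above:

```typescript
// Toy PageRank over three files: files 2 and 3 each import file 1.
const DAMPING = 0.85;
const ITERATIONS = 20;
const fileIds = [1, 2, 3];
const outLinks = new Map<number, number[]>([[1, []], [2, [1]], [3, [1]]]);
// Assumed initialization: uniform starting rank, standard teleportation base.
const base = (1 - DAMPING) / fileIds.length;
const rank = new Map<number, number>(fileIds.map((id) => [id, 1 / fileIds.length]));

for (let i = 0; i < ITERATIONS; i++) {
  const next = new Map<number, number>();
  for (const id of fileIds) next.set(id, base);
  for (const id of fileIds) {
    const links = outLinks.get(id);
    // Note: a file with no out-links contributes nothing, so its rank
    // is not redistributed — a simplification in the excerpted loop.
    if (!links || links.length === 0) continue;
    const share = (DAMPING * rank.get(id)!) / links.length;
    for (const target of links) next.set(target, next.get(target)! + share);
  }
  for (const id of fileIds) rank.set(id, next.get(id)!);
}
// File 1, imported by both others, ends with the highest rank.
```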

The final metrics stage computes three aggregate measures for each file and stores them in the file_metrics table: in-degree (count of incoming imports), out-degree (count of outgoing imports), and PageRank score.

export async function computeGraphMetrics(db: Database): Promise<void> {
  const inDeg = new Map<number, number>();
  const outDeg = new Map<number, number>();
  for (const e of edges) {
    inDeg.set(e.target_file_id, (inDeg.get(e.target_file_id) ?? 0) + 1);
    outDeg.set(e.source_file_id, (outDeg.get(e.source_file_id) ?? 0) + 1);
  }

  // ... PageRank computation ...

  // Store all metrics in one transaction
  db.transaction(() => {
    for (const id of fileIds) {
      insert.run(id, inDeg.get(id) ?? 0, outDeg.get(id) ?? 0, rank.get(id) ?? 0);
    }
  })();
}

These metrics enable ranking files by importance, identifying hub files with many dependencies, and understanding the overall connectivity of the codebase. The full pipeline runs sequentially—extraction → resolution → metrics—ensuring each stage completes before the next begins.
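
The sequential orchestration can be sketched as a generic stage runner. This is an illustrative stand-in, not the source's actual driver; the stage list mirrors the order described above:

```typescript
// Each stage must finish before the next begins, matching the
// extraction → resolution → metrics ordering described above.
type Stage = { name: string; run: () => Promise<void> };

async function runStages(stages: Stage[]): Promise<string[]> {
  const finished: string[] = [];
  for (const s of stages) {
    await s.run(); // awaiting here enforces strictly sequential execution
    finished.push(s.name);
  }
  return finished;
}
```

In the real pipeline the three stages would wrap `extractPerFile`, the import resolver, and `computeGraphMetrics`.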