Skip to content

File Discovery

The discovery process begins by leveraging git ls-files --cached to enumerate all tracked files in the repository. This approach automatically respects .gitignore rules and other Git configurations, ensuring only files intended for version control are processed.

const proc = Bun.spawn(["git", "ls-files", "--cached"], {
cwd: repoRoot,
stdout: "pipe",
stderr: "pipe",
});

The scanner then walks through the file list, computing metadata for each file including its absolute path, relative path, extension, size, and SHA256 checksum using Bun’s file APIs.

Beyond Git’s built-in .gitignore support, the scanner maintains a hardcoded set of commonly-ignored directories to skip, including build artifacts, package managers, and version control directories:

const IGNORED_DIRS = new Set([
"node_modules",
".next",
".nuxt",
".output",
"dist",
"build",
"out",
".turbo",
".vercel",
".netlify",
".cache",
".parcel-cache",
"vendor",
".vendor",
"__pycache__",
".venv",
"venv",
".git",
".svn",
".hg",
"coverage",
".nyc_output",
".terraform",
".pulumi",
]);

Files are filtered based on size and content. Files exceeding 1MB are excluded as they are unlikely to be source code. Binary files are detected by scanning the first 8KB for null bytes, which reliably distinguishes text from binary content without loading entire files into memory.

if (size > 1_000_000) continue;
const bytes = new Uint8Array(content);
let isBinary = false;
const checkLen = Math.min(bytes.length, 8192);
for (let i = 0; i < checkLen; i++) {
if (bytes[i] === 0) {
isBinary = true;
break;
}
}

The scanner computes file size during enumeration and performs lazy binary detection only on files that pass the size threshold. This optimizes performance by avoiding unnecessary I/O for large files or binary assets. File checksums are computed using SHA256 for integrity tracking.

After files are persisted to the database, a language detection pass maps file extensions to programming languages using a comprehensive mapping table covering 50+ language and format types:

const EXTENSION_MAP: Record<string, string> = {
ts: "typescript",
tsx: "typescript",
js: "javascript",
py: "python",
rs: "rust",
// ... additional mappings
};

The detectLanguages function updates all file records with their detected language in a single transaction.

File entries are persisted to the files table with the following columns:

  • path — absolute file path
  • relative_path — repository-relative path
  • extension — file extension without leading dot
  • size_bytes — file size in bytes
  • checksum_sha256 — SHA256 hash for integrity
  • language — detected programming language

The insertion process uses a transaction that clears dependent tables (file_symbols, file_imports, file_exports, edges, file_metrics, file_summaries, folder_summaries, topics) to ensure consistency on re-runs, allowing safe incremental updates to the file inventory.