Two-Phase Repository Walk with a Byte Budget

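Tools that analyze large codebases break down if they read every file into memory at once. Collect only paths and stat results first and finish structural analysis on that metadata, then split just the files you need into chunks under a byte budget and read, parse, and free them chunk by chunk. Memory usage stays predictable that way. The code below splits the metadata work into a scan step and a structural-analysis step, so its comments count three phases.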

Code

import { promises as fs } from 'node:fs'
import { join, extname } from 'node:path'

interface FileEntry {
  path: string
  size: number
}

interface ParsedFile {
  path: string
  content: string
  ast?: unknown // AST or other analysis result
}

interface WalkOptions {
  /** Extensions to parse (e.g. ['.ts', '.js']) */
  parseableExtensions: string[]
  /** Patterns to ignore, tested against the full path (e.g. node_modules, .git) */
  ignorePatterns: RegExp[]
  /** Maximum total file size per chunk, in bytes */
  chunkBudget: number
  /** Maximum size of a single file in bytes (larger files are skipped) */
  maxFileSize: number
}

/**
 * Phase 1: scan paths and sizes only
 */
async function scanPaths(
  rootPath: string,
  options: WalkOptions
): Promise<FileEntry[]> {
  const entries: FileEntry[] = []

  async function walk(dir: string): Promise<void> {
    const items = await fs.readdir(dir, { withFileTypes: true })

    for (const item of items) {
      const fullPath = join(dir, item.name)

      // Check ignore patterns first, so ignored directories are never descended into or stat'ed
      if (options.ignorePatterns.some((pattern) => pattern.test(fullPath))) {
        continue
      }

      if (item.isDirectory()) {
        await walk(fullPath)
      } else if (item.isFile()) {
        const stat = await fs.stat(fullPath)

        // Size check (exclude huge files early)
        if (stat.size > options.maxFileSize) {
          continue
        }

        entries.push({
          path: fullPath,
          size: stat.size
        })
      }
    }
  }

  await walk(rootPath)
  return entries
}

/**
 * Phase 2: structural analysis and filtering
 */
function analyzeStructure(
  entries: FileEntry[],
  options: WalkOptions
): FileEntry[] {
  // Filter by extension
  const parseableFiles = entries.filter((entry) => {
    const ext = extname(entry.path)
    return options.parseableExtensions.includes(ext)
  })

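  // Sort by size (processing small files first tends to yield early results).
  // filter() returned a fresh array, so the in-place sort() leaves `entries`
  // untouched; toSorted() would require Node.js 20 or later.
  return parseableFiles.sort((a, b) => a.size - b.size)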
}

/**
 * Phase 3: chunk by byte budget, then read and parse chunk by chunk
 */
async function parseWithBudget(
  entries: FileEntry[],
  options: WalkOptions,
  parseFile: (content: string, path: string) => unknown
): Promise<ParsedFile[]> {
  const results: ParsedFile[] = []
  let currentBudget = 0
  let chunk: FileEntry[] = []

  for (const entry of entries) {
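    // Flush the current chunk once adding this file would exceed the budget.
    // The length check avoids processing an empty chunk when a single file
    // is larger than the budget; such a file ends up in a chunk of its own.
    if (chunk.length > 0 && currentBudget + entry.size > options.chunkBudget) {
      const parsed = await processChunk(chunk, parseFile)
      results.push(...parsed)

      // Reset the chunk. Note that `results` still holds every parsed file.
      chunk = []
      currentBudget = 0
    }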

    chunk.push(entry)
    currentBudget += entry.size
  }

  // Process the final chunk
  if (chunk.length > 0) {
    const parsed = await processChunk(chunk, parseFile)
    results.push(...parsed)
  }

  return results
}

/**
 * Read and parse every file in one chunk
 */
async function processChunk(
  chunk: FileEntry[],
  parseFile: (content: string, path: string) => unknown
): Promise<ParsedFile[]> {
  const results: ParsedFile[] = []

  // Parallel reads (files within a chunk are read concurrently for I/O efficiency)
  const readPromises = chunk.map(async (entry) => {
    try {
      const content = await fs.readFile(entry.path, 'utf-8')
      const ast = parseFile(content, entry.path)
      return { path: entry.path, content, ast }
    } catch (error) {
      console.warn(`Failed to parse ${entry.path}:`, error)
      return null
    }
  })

  const parsed = await Promise.all(readPromises)

  for (const result of parsed) {
    if (result) {
      results.push(result)
    }
  }

  return results
}

/**
 * Combined entry point
 */
export async function walkRepository(
  rootPath: string,
  options: WalkOptions,
  parseFile: (content: string, path: string) => unknown
): Promise<ParsedFile[]> {
  // Phase 1: scan paths and sizes
  const entries = await scanPaths(rootPath, options)

  // Phase 2: structural analysis
  const parseableFiles = analyzeStructure(entries, options)

  // Phase 3: parse chunk by chunk under the byte budget
  const results = await parseWithBudget(parseableFiles, options, parseFile)

  return results
}

Usage example

import { parse } from '@typescript-eslint/typescript-estree'

// Analyze a TypeScript repository
const results = await walkRepository(
  './my-project',
  {
    parseableExtensions: ['.ts', '.tsx', '.js', '.jsx'],
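    // Patterns are tested against the full path, so anchor them on path
    // separators; a bare /dist/ would also match src/distance.ts, and a
    // bare /\.git/ would also match .github.
    ignorePatterns: [
      /(^|[\\/])node_modules([\\/]|$)/,
      /(^|[\\/])\.git([\\/]|$)/,
      /(^|[\\/])dist([\\/]|$)/,
      /(^|[\\/])build([\\/]|$)/
    ],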
    chunkBudget: 50 * 1024 * 1024, // 50MB
    maxFileSize: 10 * 1024 * 1024    // 10MB
  },
  (content, path) => {
    // Apply the AST parser
    return parse(content, {
      filePath: path,
      jsx: true
    })
  }
)

console.log(`Parsed ${results.length} files`)
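
Because each ignore pattern is tested against the full path, an unanchored expression such as /dist/ also matches src/distance.ts. The following is a minimal sketch of one way to build patterns that match whole path segments only. The names ignoreDir and escapeRegExp are hypothetical helpers, not part of the code above.

```typescript
/** Escape regex metacharacters so a directory name is matched literally. */
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')
}

/**
 * Build a pattern that matches `name` only as a complete path segment,
 * bounded by the start or end of the path or by a '/' or '\' separator.
 */
function ignoreDir(name: string): RegExp {
  return new RegExp(`(^|[\\\\/])${escapeRegExp(name)}([\\\\/]|$)`)
}

// An unanchored pattern matches substrings of unrelated names.
console.log(/dist/.test('src/distance.ts')) // true (false positive)

// The segment-anchored pattern does not.
console.log(ignoreDir('dist').test('src/distance.ts')) // false
console.log(ignoreDir('dist').test('pkg/dist/index.js')) // true
console.log(ignoreDir('.git').test('.github/workflows/ci.yml')) // false
```

With such a helper, ignorePatterns could be written as ['node_modules', '.git', 'dist', 'build'].map(ignoreDir).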

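How it works

Phase 1 (scan): walks the directory tree recursively with fs.readdir and fs.stat, collecting only path and size. File contents are never read.
Phase 2 (structural analysis): selects the parseable files from the collected metadata and optimizes the processing order (by size, by priority, and so on).
Phase 3 (chunking + parse): splits the files into chunks that stay within the byte budget, then runs read → parse → free on each chunk. A single file larger than the budget still gets a chunk of its own.
Parallel reads: files within a chunk are read concurrently via Promise.all, cutting I/O wait time.
Memory release: resetting the chunk variables after each chunk lets the GC reclaim that chunk's data. In the code above, results still keeps the content and ast of every file, so drop content from ParsedFile or consume results chunk by chunk when the final result set itself must stay small.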
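
The chunking rule in phase 3 can be isolated as a pure function, which makes the budget behavior easy to check without touching the disk. This is a sketch under assumptions: planChunks is a hypothetical helper, not part of the code above, and it reuses the same FileEntry shape.

```typescript
interface FileEntry {
  path: string
  size: number
}

/** Greedily pack entries, in order, into chunks whose total size fits the budget. */
function planChunks(entries: FileEntry[], budget: number): FileEntry[][] {
  const chunks: FileEntry[][] = []
  let current: FileEntry[] = []
  let used = 0

  for (const entry of entries) {
    // Close the current chunk when this entry would push it over budget.
    // The length check keeps an oversize file from producing an empty chunk.
    if (current.length > 0 && used + entry.size > budget) {
      chunks.push(current)
      current = []
      used = 0
    }
    current.push(entry)
    used += entry.size
  }

  if (current.length > 0) chunks.push(current)
  return chunks
}

const plan = planChunks(
  [
    { path: 'a.ts', size: 40 },
    { path: 'b.ts', size: 50 },
    { path: 'c.ts', size: 30 },
    { path: 'big.ts', size: 150 }, // larger than the budget: gets its own chunk
    { path: 'd.ts', size: 10 }
  ],
  100
)

console.log(plan.map((chunk) => chunk.map((entry) => entry.path)))
// [ [ 'a.ts', 'b.ts' ], [ 'c.ts' ], [ 'big.ts' ], [ 'd.ts' ] ]
```

Separating planning from reading also means the chunk list can be inspected or logged before any file content is loaded.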

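Benefits

Memory control: files are never all read at once, and the byte budget caps how much is read concurrently
Early filtering: unneeded files are excluded from stat data alone, reducing I/O
Progress tracking: progress can be reported per chunk, which makes UI updates and cancellation easier to support
I/O efficiency: parallel reads within a chunk reduce time spent waiting on disk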
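
The progress-tracking point could be sketched with an async generator that yields once per chunk. The caller then consumes each chunk's results before the next chunk is read, and can cancel between chunks. This is an illustrative sketch, not part of the code above: streamChunks, ChunkProgress, and countLines are hypothetical names, and the chunks are assumed to be already grouped under the byte budget.

```typescript
import { promises as fs } from 'node:fs'

interface FileEntry {
  path: string
  size: number
}

interface ChunkProgress<T> {
  filesDone: number
  filesTotal: number
  items: T[]
}

/** Process chunks one at a time, yielding results and progress after each. */
async function* streamChunks<T>(
  chunks: FileEntry[][],
  handle: (entry: FileEntry) => Promise<T>,
  signal?: AbortSignal
): AsyncGenerator<ChunkProgress<T>> {
  const filesTotal = chunks.reduce((count, chunk) => count + chunk.length, 0)
  let filesDone = 0

  for (const chunk of chunks) {
    // Cancellation is checked between chunks, never in the middle of one.
    if (signal?.aborted) throw new Error('walk cancelled')

    const items = await Promise.all(chunk.map((entry) => handle(entry)))
    filesDone += chunk.length
    yield { filesDone, filesTotal, items }
  }
}

/** Example handler: returns a small summary so the file content can be freed. */
async function countLines(entry: FileEntry): Promise<{ path: string; lines: number }> {
  const content = await fs.readFile(entry.path, 'utf-8')
  return { path: entry.path, lines: content.split('\n').length }
}

// Usage sketch (inside an async function):
//   const controller = new AbortController()
//   for await (const progress of streamChunks(chunks, countLines, controller.signal)) {
//     console.log(`${progress.filesDone}/${progress.filesTotal} files`)
//   }
```

Because each yielded batch holds only summaries, nothing from earlier chunks needs to stay reachable once the caller has handled it.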

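Caveats

Deciding which files to prioritize during the phase 2 structural analysis takes some domain knowledge. For example, reading package.json and configuration files first lets you decide how to handle the remaining files. chunkBudget and maxFileSize also need tuning to the memory limit of the environment. chunkBudget counts bytes on disk, and decoded strings and ASTs can occupy more memory than that.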
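
The two caveats above can be illustrated with a rough sketch. Everything here is an assumption rather than part of the code above: prioritize, suggestChunkBudget, and PRIORITY_FILES are hypothetical names, the list of priority files is only an example, and the 10% fraction is an arbitrary starting point that should be measured against real workloads. For package.json to reach this step at all, '.json' would also have to be listed in parseableExtensions.

```typescript
import { basename } from 'node:path'
import { getHeapStatistics } from 'node:v8'

interface FileEntry {
  path: string
  size: number
}

// Hypothetical set of files worth reading before everything else.
const PRIORITY_FILES = new Set(['package.json', 'tsconfig.json'])

/** Order priority files first, then the rest by ascending size. */
function prioritize(entries: FileEntry[]): FileEntry[] {
  const rank = (entry: FileEntry): number =>
    PRIORITY_FILES.has(basename(entry.path)) ? 0 : 1
  // Copy before sorting so the caller's array is left untouched.
  return [...entries].sort((a, b) => rank(a) - rank(b) || a.size - b.size)
}

/**
 * Suggest a chunk budget as a fraction of the V8 heap headroom.
 * The fraction is deliberately small because parsed ASTs can take
 * more memory than the source bytes they were built from.
 */
function suggestChunkBudget(fraction = 0.1, floor = 1024 * 1024): number {
  const { heap_size_limit, used_heap_size } = getHeapStatistics()
  const headroom = Math.max(0, heap_size_limit - used_heap_size)
  return Math.max(floor, Math.floor(headroom * fraction))
}

const ordered = prioritize([
  { path: 'src/big.ts', size: 500 },
  { path: 'package.json', size: 900 },
  { path: 'src/a.ts', size: 10 }
])
console.log(ordered.map((entry) => entry.path))
// [ 'package.json', 'src/a.ts', 'src/big.ts' ]
```

The result of suggestChunkBudget() could be passed as chunkBudget, with maxFileSize kept at or below it so that no single file exceeds a whole chunk.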

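Applications

code indexer: extracts function and type definitions across a whole repository and indexes them
static analyzer: applies lint rules to a large codebase in one pass
AST-based migration tool: applies automated refactorings across an entire codebase
bulk linter: checks repositories with tens of thousands of files without exhausting memory
large-repo ingestion: analyzes changes to large repositories in CI/CD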