Two-Phase Repository Walk with a Byte Budget

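A tool that analyzes a large codebase is likely to run out of memory if it reads every file at once. Instead, collect only paths and stat metadata first and finish the structural analysis on that metadata. Then group the files you actually need into chunks under a byte budget, and read, parse, and free them one chunk at a time. This keeps memory usage predictable.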

Code

import { promises as fs } from 'node:fs'
import { join, extname } from 'node:path'

interface FileEntry {
  path: string
  size: number
}

interface ParsedFile {
  path: string
  content: string
  ast?: unknown // AST or other analysis result
}

interface WalkOptions {
  /** Extensions to parse (e.g. ['.ts', '.js']) */
  parseableExtensions: string[]
  /** Patterns to ignore (e.g. node_modules, .git) */
  ignorePatterns: RegExp[]
  /** Maximum bytes per chunk, measured by on-disk file size */
  chunkBudget: number
  /** Maximum size of a single file (larger files are skipped) */
  maxFileSize: number
}

/**
 * Step 1: scan paths and sizes only
 */
async function scanPaths(
  rootPath: string,
  options: WalkOptions
): Promise<FileEntry[]> {
  const entries: FileEntry[] = []

  async function walk(dir: string): Promise<void> {
    const items = await fs.readdir(dir, { withFileTypes: true })

    for (const item of items) {
      const fullPath = join(dir, item.name)

      // Check ignore patterns first so ignored directories are never descended into
      if (options.ignorePatterns.some((pattern) => pattern.test(fullPath))) {
        continue
      }

      if (item.isDirectory()) {
        await walk(fullPath)
      } else if (item.isFile()) {
        const stat = await fs.stat(fullPath)

        // Check the size (exclude very large files early)
        if (stat.size > options.maxFileSize) {
          continue
        }

        entries.push({
          path: fullPath,
          size: stat.size
        })
      }
    }
  }

  await walk(rootPath)
  return entries
}

/**
 * Step 2: structural analysis and filtering
 */
function analyzeStructure(
  entries: FileEntry[],
  options: WalkOptions
): FileEntry[] {
  // Filter by extension (extname is case-sensitive, so '.TS' does not match '.ts')
  const parseableFiles = entries.filter((entry) => {
    const ext = extname(entry.path)
    return options.parseableExtensions.includes(ext)
  })

  // Sort by size; small files first makes early results more likely.
  // Array.prototype.toSorted requires Node.js 20 or later.
  return parseableFiles.toSorted((a, b) => a.size - b.size)
}

/**
 * Step 3: chunk by byte budget, then read and parse chunk by chunk
 */
async function parseWithBudget(
  entries: FileEntry[],
  options: WalkOptions,
  parseFile: (content: string, path: string) => unknown
): Promise<ParsedFile[]> {
  const results: ParsedFile[] = []
  let currentBudget = 0
  let chunk: FileEntry[] = []

  for (const entry of entries) {
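    // Flush the current chunk when adding this file would exceed the budget.
    // The length check avoids processing an empty chunk when a single file
    // is larger than chunkBudget; such a file ends up in a chunk of its own.
    if (chunk.length > 0 && currentBudget + entry.size > options.chunkBudget) {
      const parsed = await processChunk(chunk, parseFile)
      results.push(...parsed)

      // Reset the chunk. Note that `results` still references each file's
      // content and ast, so only the chunk list itself is released here.
      chunk = []
      currentBudget = 0
    }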

    chunk.push(entry)
    currentBudget += entry.size
  }

  // Process the final chunk
  if (chunk.length > 0) {
    const parsed = await processChunk(chunk, parseFile)
    results.push(...parsed)
  }

  return results
}

/**
 * Read and parse all files in one chunk
 */
async function processChunk(
  chunk: FileEntry[],
  parseFile: (content: string, path: string) => unknown
): Promise<ParsedFile[]> {
  const results: ParsedFile[] = []

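  // Read concurrently within the chunk to overlap I/O. A chunk of many small
  // files starts that many reads at once, which may hit the open-file limit.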
  const readPromises = chunk.map(async (entry) => {
    try {
      const content = await fs.readFile(entry.path, 'utf-8')
      const ast = parseFile(content, entry.path)
      return { path: entry.path, content, ast }
    } catch (error) {
      console.warn(`Failed to parse ${entry.path}:`, error)
      return null
    }
  })

  const parsed = await Promise.all(readPromises)

  for (const result of parsed) {
    if (result) {
      results.push(result)
    }
  }

  return results
}

/**
 * Combined entry point
 */
export async function walkRepository(
  rootPath: string,
  options: WalkOptions,
  parseFile: (content: string, path: string) => unknown
): Promise<ParsedFile[]> {
  // Step 1: scan paths and sizes
  const entries = await scanPaths(rootPath, options)

  // Step 2: structural analysis
  const parseableFiles = analyzeStructure(entries, options)

  // Step 3: parse sequentially under the byte budget
  const results = await parseWithBudget(parseableFiles, options, parseFile)

  return results
}

Usage example

import { parse } from '@typescript-eslint/typescript-estree'

// Analyze a TypeScript repository
const results = await walkRepository(
  './my-project',
  {
    parseableExtensions: ['.ts', '.tsx', '.js', '.jsx'],
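    // Match whole path segments only, so that e.g. /dist/ does not also
    // exclude "src/distribution.ts" and /\.git/ does not exclude ".github".
    ignorePatterns: [
      /(^|[\\/])node_modules([\\/]|$)/,
      /(^|[\\/])\.git([\\/]|$)/,
      /(^|[\\/])dist([\\/]|$)/,
      /(^|[\\/])build([\\/]|$)/
    ],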
    chunkBudget: 50 * 1024 * 1024, // 50MB
    maxFileSize: 10 * 1024 * 1024    // 10MB
  },
  (content, path) => {
    // Apply the AST parser
    return parse(content, {
      filePath: path,
      jsx: true
    })
  }
)

console.log(`Parsed ${results.length} files`)

How it works

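Step 1 (scan): walk directories recursively with fs.readdir and fs.stat, collecting only path and size. File contents are not read.
Step 2 (structural analysis): from the collected metadata, select the parseable files and choose a processing order (by size, by priority, and so on).
Step 3 (chunking + parsing): split the files into chunks that stay within the byte budget, then read → parse → free each chunk. A single file larger than the budget forms a chunk of its own.
Concurrent reads: files within a chunk are read together via Promise.all, which reduces I/O wait time.
Memory release: the chunk list is reset after each chunk so the GC can reclaim it. The sample still accumulates every ParsedFile, including content and ast, in results, so file contents stay reachable until the caller drops them. To keep the peak bounded, consume or persist each chunk's results and discard content before reading the next chunk.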
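
The grouping logic in step 3 can be isolated as a pure function, which makes it easier to test than the version that is interleaved with file reads. The following is a minimal sketch; planChunks and SizedEntry are hypothetical names that do not appear in the listing above, and the budget is counted in on-disk bytes.

```typescript
// Hypothetical type for this sketch; mirrors FileEntry from the listing.
interface SizedEntry {
  path: string
  size: number
}

/**
 * Group entries into chunks whose total size stays within `budget` bytes.
 * An entry larger than the budget gets a chunk of its own rather than
 * being dropped or producing an empty chunk.
 */
function planChunks(entries: SizedEntry[], budget: number): SizedEntry[][] {
  const chunks: SizedEntry[][] = []
  let current: SizedEntry[] = []
  let used = 0

  for (const entry of entries) {
    // Close the current chunk only if it already holds something.
    if (current.length > 0 && used + entry.size > budget) {
      chunks.push(current)
      current = []
      used = 0
    }
    current.push(entry)
    used += entry.size
  }

  if (current.length > 0) chunks.push(current)
  return chunks
}

const plan = planChunks(
  [
    { path: 'a.ts', size: 40 },
    { path: 'b.ts', size: 50 },
    { path: 'c.ts', size: 30 },
    { path: 'huge.ts', size: 500 },
    { path: 'd.ts', size: 10 },
  ],
  100
)

// Groups: [a.ts, b.ts], [c.ts], [huge.ts], [d.ts]
console.log(plan.map((chunk) => chunk.map((e) => e.path)))
```

Planning the chunks up front also gives the total chunk count before any file is read, which is convenient for progress reporting.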

Benefits

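Memory control: files are never all read at once, and the byte budget caps how much source text is read per chunk
Early filtering: unneeded files are excluded using stat alone, which reduces I/O
Progress tracking: progress can be reported per chunk, which suits UI updates and cancellation
I/O efficiency: concurrent reads within a chunk reduce time spent waiting on disk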
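
The memory-control and progress-tracking benefits are easiest to realize when results are handed to the caller one chunk at a time. The sketch below shows one possible shape using an async generator. It is an illustration under assumptions, not part of the listing above: parseInChunks, ChunkResult, the injected read function, and the shouldStop callback are all hypothetical names, and read is injected so the example runs without touching disk.

```typescript
// Hypothetical types for this sketch.
interface Entry {
  path: string
  size: number
}

interface ChunkResult<T> {
  /** Parsed values for this chunk only; the source text is not retained. */
  parsed: { path: string; value: T }[]
  /** Bytes processed so far, based on stat sizes. */
  bytesDone: number
  bytesTotal: number
}

/**
 * Yield results one chunk at a time so the caller can report progress,
 * persist the results, and drop them before the next chunk is read.
 */
async function* parseInChunks<T>(
  chunks: Entry[][],
  read: (path: string) => Promise<string>,
  parse: (content: string, path: string) => T,
  shouldStop: () => boolean = () => false
): AsyncGenerator<ChunkResult<T>> {
  const bytesTotal = chunks.flat().reduce((sum, e) => sum + e.size, 0)
  let bytesDone = 0

  for (const chunk of chunks) {
    // Cancellation is checked at each chunk boundary.
    if (shouldStop()) return

    const parsed = await Promise.all(
      chunk.map(async (entry) => {
        const content = await read(entry.path)
        // Only the parse result leaves this scope, so `content`
        // becomes collectable once the chunk has been handled.
        return { path: entry.path, value: parse(content, entry.path) }
      })
    )

    bytesDone += chunk.reduce((sum, e) => sum + e.size, 0)
    yield { parsed, bytesDone, bytesTotal }
  }
}

// Usage with an in-memory map standing in for fs.readFile.
const fakeFiles = new Map([
  ['a.ts', 'aa'],
  ['b.ts', 'bbb'],
  ['c.ts', 'c'],
])

async function demo(): Promise<string[]> {
  const log: string[] = []
  const chunks: Entry[][] = [
    [{ path: 'a.ts', size: 2 }, { path: 'b.ts', size: 3 }],
    [{ path: 'c.ts', size: 1 }],
  ]
  for await (const { parsed, bytesDone, bytesTotal } of parseInChunks(
    chunks,
    async (path) => fakeFiles.get(path) ?? '',
    (content) => content.length
  )) {
    log.push(`${parsed.length} files, ${bytesDone}/${bytesTotal} bytes`)
  }
  return log
}

demo().then((log) => console.log(log.join('\n')))
```

Because the generator does not keep earlier chunks, peak memory depends on what the consumer retains. A consumer that writes each chunk to an index and discards it keeps the peak close to one chunk's worth of parsed data.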

Caveats

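Deciding which files to prioritize during the structural analysis in step 2 requires some domain knowledge. For example, reading package.json or other configuration files first can determine how the remaining files should be handled. chunkBudget and maxFileSize also need to be tuned to the memory limits of the environment. Both are measured in on-disk bytes, whereas decoded strings can take up to about twice the file size in memory and an AST is usually larger than its source text, so the actual peak will typically exceed the budget.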
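
One way to express such a priority is a comparator that ranks known configuration files ahead of everything else and falls back to size order. This is a sketch under assumptions: PRIORITY_NAMES, priorityRank, and orderForParsing are hypothetical names, the choice of file names is only an example, and in the listing above '.json' would also have to be included in parseableExtensions for these files to survive the extension filter.

```typescript
// Hypothetical type for this sketch; mirrors FileEntry from the listing.
interface Entry {
  path: string
  size: number
}

// Example file names to read first; adjust to the project being analyzed.
const PRIORITY_NAMES = ['package.json', 'tsconfig.json']

/** Lower rank means earlier; files not in the list share the last rank. */
function priorityRank(path: string): number {
  const name = path.split(/[\\/]/).pop() ?? path
  const index = PRIORITY_NAMES.indexOf(name)
  return index === -1 ? PRIORITY_NAMES.length : index
}

/** Order by priority rank first, then by ascending size. */
function orderForParsing(entries: Entry[]): Entry[] {
  // Copy before sorting so the caller's array is left untouched.
  return [...entries].sort(
    (a, b) => priorityRank(a.path) - priorityRank(b.path) || a.size - b.size
  )
}

const ordered = orderForParsing([
  { path: 'src/big.ts', size: 900 },
  { path: 'tsconfig.json', size: 200 },
  { path: 'src/small.ts', size: 100 },
  { path: 'package.json', size: 1500 },
])

// Order: package.json, tsconfig.json, src/small.ts, src/big.ts
console.log(ordered.map((e) => e.path))
```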

Applications

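code indexer: extract and index function and type definitions across a whole repository
static analyzer: apply lint rules to a large codebase in bulk
AST-based migration tool: apply automated refactorings across a codebase
bulk linter: check repositories with tens of thousands of files without exhausting memory
large-repo ingestion: analyze changes to large repositories in CI/CD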