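Fixing OOM Errors When Bulk-Indexing Large Data Volumes in Elasticsearch

The Problem

When remote work began because of COVID-19, user activity logs grew to more than twice the volume we had planned for. The Elasticsearch indexing job, which had previously run without issues, started failing with an OOM error every night in the early morning.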

[2020-05-15T03:24:11] java.lang.OutOfMemoryError: Java heap space

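Root Cause Analysis

Elasticsearch monitoring showed:

Each bulk request was indexing 5,000 documents
refresh_interval was left at its default of 1s
Segment merges were occurring excessively during indexing

The Fix

1. Adjusting the bulk size

We switched the batching criterion from document count to byte size, so that a run of unusually large documents can no longer inflate a single bulk request.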

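const BULK_SIZE = 5 * 1024 * 1024; // 5 MB per bulk request
let currentBatch = [];
let currentSize = 0;

for (const doc of documents) {
  // Approximate the payload size by the UTF-8 length of the serialized document
  const docSize = Buffer.byteLength(JSON.stringify(doc));

  // Flush before this document would push the batch over the limit
  if (currentBatch.length > 0 && currentSize + docSize > BULK_SIZE) {
    await bulkIndex(currentBatch);
    currentBatch = [];
    currentSize = 0;
  }

  currentBatch.push(doc);
  currentSize += docSize;
}

// Flush the final partial batch, which the loop itself never sends
if (currentBatch.length > 0) {
  await bulkIndex(currentBatch);
}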
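
The post does not show the bulkIndex function that the loop calls, so here is a minimal sketch of what such a helper could look like. The names buildBulkBody and bulkIndexBatch, the explicit client and indexName parameters, and the error handling are all assumptions made for illustration. The sketch also assumes an @elastic/elasticsearch 7.x client, which wraps the response payload in `body`.

```javascript
// Hypothetical helper: expands a batch into the alternating action/source
// entries that the _bulk endpoint expects.
function buildBulkBody(indexName, docs) {
  return docs.flatMap((doc) => [{ index: { _index: indexName } }, doc]);
}

// Hypothetical helper: sends one batch and fails loudly on partial errors.
// `client` is assumed to be an @elastic/elasticsearch client instance.
async function bulkIndexBatch(client, indexName, docs) {
  if (docs.length === 0) return 0;

  const response = await client.bulk({ body: buildBulkBody(indexName, docs) });
  // 7.x clients return { body, statusCode, ... }; fall back to the raw payload otherwise
  const result = response.body ?? response;

  // A bulk request can succeed as a whole while individual items fail
  if (result.errors) {
    const failed = result.items.filter((item) => item.index && item.index.error);
    throw new Error(`${failed.length} of ${docs.length} documents failed to index`);
  }
  return docs.length;
}
```

Checking the `errors` flag matters because the bulk API reports per-item failures inside an otherwise successful response, so a batch rejected under memory pressure could go unnoticed without it.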

2. Disabling refresh_interval during bulk indexing

await esClient.indices.putSettings({
  index: 'logs-*',
  body: {
    refresh_interval: '-1'
  }
});

try {
  // Run the indexing job
  await indexLogs();
} finally {
  // Restore periodic refresh once indexing ends, even if it failed
  await esClient.indices.putSettings({
    index: 'logs-*',
    body: {
      refresh_interval: '30s'
    }
  });
}

3. Index settings for better indexing throughput

{
  "number_of_replicas": 0,
  "translog.durability": "async",
  "translog.flush_threshold_size": "1gb"
}
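
As a rough sketch, the settings above could be applied through the same client used in step 2. The function names applyBulkLoadSettings and resetTranslogSettings and the nested `index` form of the settings body are assumptions for illustration, not code from the original job. Note that with translog durability set to async, writes acknowledged shortly before a node crash can be lost, so the sketch also shows one way to reset the translog settings once the load finishes.

```javascript
// The three settings from the post, nested under `index`
// (equivalent to index.number_of_replicas, index.translog.durability, ...).
const BULK_LOAD_SETTINGS = {
  index: {
    number_of_replicas: 0,
    translog: { durability: 'async', flush_threshold_size: '1gb' },
  },
};

// Hypothetical helper: switch an index into bulk-load mode.
async function applyBulkLoadSettings(client, indexName) {
  await client.indices.putSettings({ index: indexName, body: BULK_LOAD_SETTINGS });
}

// Hypothetical helper: setting a value to null reverts it to the Elasticsearch default.
async function resetTranslogSettings(client, indexName) {
  await client.indices.putSettings({
    index: indexName,
    body: { index: { translog: { durability: null, flush_threshold_size: null } } },
  });
}
```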

Results

Indexing throughput: 2,000 docs/s → 8,000 docs/s
Memory usage: 85% → 60%
Total indexing time: 3 hours → 45 minutes

Once indexing finished, we set the replica count back to 1 and ran a force merge.

POST /logs-2020-05/_forcemerge?max_num_segments=1
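
The same post-indexing cleanup could be scripted with the JavaScript client rather than run by hand. This is only a sketch: the finalizeIndex name and its parameters are hypothetical, and it simply mirrors the order described above (replicas first, then force merge). Force merging down to one segment is generally only advisable for an index that is no longer being written to.

```javascript
// Hypothetical helper: restore replicas, then merge the index to a single segment.
// `client` is assumed to be an @elastic/elasticsearch client instance.
async function finalizeIndex(client, indexName) {
  await client.indices.putSettings({
    index: indexName,
    body: { index: { number_of_replicas: 1 } },
  });

  // Same operation as POST /<index>/_forcemerge?max_num_segments=1
  await client.indices.forcemerge({ index: indexName, max_num_segments: 1 });
}
```

For example, `await finalizeIndex(esClient, 'logs-2020-05')` would cover the monthly index named in the request above.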

Once again, this confirmed that for bulk indexing, throughput matters more than real-time visibility.