Handling OpenAI API Rate Limits and Implementing Exponential Backoff

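The Problem

I shipped a feature that processes user requests with GPT-4, but as traffic grew, 429 (Too Many Requests) errors became frequent. User experience degraded badly, especially during hours with many concurrent requests.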

The Fix

I applied the exponential backoff pattern recommended in the official OpenAI documentation. The basic implementation looks like this.

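import OpenAI from 'openai';

// Assumption: the client reads OPENAI_API_KEY from the environment (the SDK default).
const openai = new OpenAI();

// Calls the chat completions API, retrying on HTTP 429 with exponential backoff.
async function callOpenAIWithRetry(
  prompt: string,
  maxRetries: number = 3
): Promise<string> {
  for (let i = 0; i < maxRetries; i++) {
    try {
      const response = await openai.chat.completions.create({
        model: 'gpt-4',
        messages: [{ role: 'user', content: prompt }],
      });
      // content is typed as string | null, so fall back to an empty string.
      return response.choices[0].message.content ?? '';
    } catch (error) {
      // Retry only on 429, and only while attempts remain.
      if (error instanceof OpenAI.APIError && error.status === 429 && i < maxRetries - 1) {
        // Wait 1s, 2s, 4s, ... capped at 10s.
        const delay = Math.min(1000 * Math.pow(2, i), 10000);
        await new Promise(resolve => setTimeout(resolve, delay));
        continue;
      }
      throw error;
    }
  }
  // Reached only when maxRetries <= 0; otherwise the loop returns or throws.
  throw new Error('maxRetries must be at least 1');
}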
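
To make the delay formula in callOpenAIWithRetry concrete, here is a minimal sketch that computes the wait time for each attempt. The helper name backoffDelayMs and its baseMs/capMs parameters are hypothetical and only mirror the constants (1000 and 10000) used above.

```typescript
// Hypothetical helper: same formula as the retry loop, min(base * 2^attempt, cap).
function backoffDelayMs(
  attempt: number,
  baseMs: number = 1000,
  capMs: number = 10000
): number {
  return Math.min(baseMs * Math.pow(2, attempt), capMs);
}

// Delays for attempts 0 through 4.
const schedule: number[] = [0, 1, 2, 3, 4].map((attempt) => backoffDelayMs(attempt));
console.log(schedule); // [1000, 2000, 4000, 8000, 10000]
```

With the default maxRetries of 3, the loop only sleeps after the first and second failures (1 s, then 2 s), so the 10 s cap only comes into play when maxRetries is raised to 6 or more.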

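Further Improvements

Retries alone were not enough, so I added a request queueing system. Requests go into a BullMQ queue, and a rate limiter (Bottleneck in the snippet below) caps the number of requests per second.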

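import Bottleneck from 'bottleneck';

// At most 5 requests in flight, and at least 200 ms between request starts (about 5 per second).
const rateLimiter = new Bottleneck({
  maxConcurrent: 5,
  minTime: 200
});

// Every call made through wrappedCall is scheduled by the limiter.
const wrappedCall = rateLimiter.wrap(callOpenAIWithRetry);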

After this change, 429 errors almost never occurred. Even during bursts, requests were processed in order from the queue, which greatly improved stability.
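
To illustrate what the minTime: 200 setting does to a burst, the following sketch plans start times so that consecutive requests begin at least minTimeMs apart. It is a simplified model and not Bottleneck's implementation: planStartTimes is a hypothetical function, it ignores maxConcurrent, and it assumes arrival times are sorted in ascending order.

```typescript
// Hypothetical model of minimum spacing between request starts.
// arrivalsMs: when each request arrives (ms, ascending).
// Returns when each request would be allowed to start.
function planStartTimes(arrivalsMs: number[], minTimeMs: number): number[] {
  const starts: number[] = [];
  let earliestNextStart = -Infinity;
  for (const arrival of arrivalsMs) {
    // A request starts when it arrives, or later if the previous start was too recent.
    const start = Math.max(arrival, earliestNextStart);
    starts.push(start);
    earliestNextStart = start + minTimeMs;
  }
  return starts;
}

// Six requests arriving at the same instant are spread out 200 ms apart.
const burstStarts = planStartTimes([0, 0, 0, 0, 0, 0], 200);
console.log(burstStarts); // [0, 200, 400, 600, 800, 1000]
```

Under this model, five requests start within the first second and the sixth waits until the 1000 ms mark, which is where the "about 5 per second" ceiling comes from.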

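Notes

Planned improvement: when a Retry-After header is present, wait for the time it specifies
The OpenAI usage API is also used for cost monitoring
Rate limits differ by tier, so be sure to check the official documentation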
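
For the planned Retry-After handling, one possible shape is a small helper that converts the header value into a wait time and falls back to the computed backoff delay when the header is missing or unparseable. This is only a sketch: retryAfterToMs is a hypothetical name, and how the header string is obtained from the error object is left out because it depends on the SDK version in use. HTTP defines Retry-After as either a number of seconds or an HTTP date, so both forms are handled.

```typescript
// Hypothetical helper: turn a Retry-After header value into milliseconds to wait.
// headerValue: raw header string, or null/undefined when the header is absent.
// fallbackMs: delay to use when the header is missing or cannot be parsed.
// nowMs: current time, injectable so the date branch is easy to test.
function retryAfterToMs(
  headerValue: string | null | undefined,
  fallbackMs: number,
  nowMs: number = Date.now()
): number {
  if (!headerValue) return fallbackMs;
  const trimmed = headerValue.trim();

  // Form 1: a delay in seconds, e.g. "2".
  if (/^\d+(\.\d+)?$/.test(trimmed)) {
    return Number(trimmed) * 1000;
  }

  // Form 2: an HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
  const dateMs = Date.parse(trimmed);
  if (!Number.isNaN(dateMs)) {
    // Never return a negative wait if the date is already in the past.
    return Math.max(0, dateMs - nowMs);
  }

  return fallbackMs;
}

console.log(retryAfterToMs("2", 1000));  // 2000
console.log(retryAfterToMs(null, 1000)); // 1000
```

Inside the retry loop, the result could replace the delay value, with the exponential backoff delay passed in as fallbackMs.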