On Wednesday, December 11, OpenAI faced a three-hour outage affecting all its services, including ChatGPT, API, and Sora. The disruption occurred between 3:16 p.m. and 7:38 p.m. Pacific Time, as detailed in an incident report released by the company.
The outage stemmed from the deployment of a new telemetry service, which overwhelmed the Kubernetes control plane, causing cascading failures across critical systems. OpenAI clarified that the incident was not a security breach or related to any recent product launches but rather an unintended consequence of internal changes.
The new telemetry service, designed to enhance reliability by collecting detailed metrics on Kubernetes operations, triggered resource-intensive API calls shortly after deployment. This overwhelmed Kubernetes API servers and led to widespread system failures across OpenAI’s larger clusters.
OpenAI quickly identified the issue and began mitigation efforts within minutes of the outage. The company is now implementing measures to prevent similar incidents, including phased rollouts with enhanced monitoring for infrastructure changes.
“We apologize for the impact this incident caused to all our customers — from ChatGPT users to developers and businesses that rely on OpenAI products,” the report stated. “We’ve fallen short of our own expectations.”
This marks another notable service disruption for OpenAI in 2023. The company experienced a three-hour ChatGPT outage in June and a significant service disruption two days after the launch of its new app store in November.
Despite these setbacks, OpenAI has continued to grow its user base. As of December 4, ChatGPT boasts 300 million weekly active users, with 1 billion messages sent daily and 1.3 million developers utilizing OpenAI’s platform in the United States. The company aims to reach 1 billion global users within the next year.