[Resolved] Service Outage in Zone C of the China (Hong Kong) Region

We want to provide you with information about the major service disruption that affected services in Zone C of the China (Hong Kong) region on December 18, 2022 (UTC+8).

Event Summary
At 08:56 on December 18, 2022, we were alerted to rising corridor temperatures in data center rooms in Zone C of the China (Hong Kong) region. Our engineers immediately started to inspect the situation and notified the data center infrastructure provider. At 09:01, alerts on rising temperatures were generated for multiple server rooms, and the on-site engineers identified the cause as a malfunction of the data center's chillers. At 09:09, the data center infrastructure provider's engineers switched the four malfunctioning chillers over to the standby chillers and restarted them based on the contingency plan. However, the operations failed and the chillers could not resume normal operation. At 09:17, in accordance with the issue handling process, on-site engineers implemented the contingency plan for cooling failure and took auxiliary heat dissipation and ventilation measures. The data center infrastructure provider's engineers then attempted to isolate and manually restore the chillers one by one, but the issue remained unresolved, so the provider notified the chiller manufacturer. By this point, the rising temperatures had already begun to degrade the performance of some servers.

At 10:30, we began to reduce loads on the computing, storage, network, database, and big data clusters of the entire data center to prevent the temperature from rising too fast and causing a fire. During this period, the on-site engineers made repeated attempts to restore the chillers, but none kept them running stably.

At 12:30, the chiller manufacturer's engineers arrived on site. After diagnosing the issue, on-site engineers decided to manually replenish water in and bleed air from the cooling tower, condenser water pipes, and cooling system condensers, but the chillers still did not resume stable operation. Our engineers shut down the servers in rooms with high temperatures. At 14:47, the fire sprinkler system of one room was triggered automatically by the high temperature, which further complicated troubleshooting. At 15:20, the chiller manufacturer's engineers manually adjusted system configurations to unbind the chillers from the cluster and restart each chiller individually. The first chiller was restored and the temperature began to drop. The engineers then restored the other chillers in the same way. At 18:55, all four chillers were restored. At 19:02, our engineers restarted the servers in batches and continued to monitor the data center temperature. At 19:47, the temperature of the server rooms returned to normal and held stable. The engineers began to restore services and conduct the necessary data integrity checks.

At 21:36, the temperatures of most server rooms were holding stable, and servers in these rooms were restarted. Necessary checks were also completed. Servers in the room where the fire sprinkler system was triggered were not powered on. To ensure data integrity, our engineers took the necessary steps to conduct a careful data security check on servers in this room, which required an extended period of time. At 22:50, the data security check and risk assessment were completed. Then, the power supply was restored to the last room and all servers were successfully restarted.

Service Impact
Compute services: At 09:23 on December 18, 2022, Alibaba Cloud Elastic Compute Service (ECS) instances deployed in Zone C of the China (Hong Kong) region began to go offline, triggering migration of the affected instances to healthy hosts within the same zone. As the temperature continued to rise, more servers went offline and customer workloads began to be affected. The outage also extended to services such as Elastic Block Storage (EBS), Object Storage Service (OSS), and ApsaraDB RDS in Zone C.

While the event did not directly affect the services in other zones in the China (Hong Kong) region, it affected the control plane of the ECS instances deployed in this region. A large number of customers purchased new ECS instances in other zones in the China (Hong Kong) region after the failure occurred, which triggered a throttling policy at 14:49 on December 18; API availability fell to as low as 20% at one point. Customers who purchased ECS instances with custom images by using the RunInstances or CreateInstance operation could complete the purchase, but some instances failed to start. This was because the custom images were stored in locally redundant storage (LRS) of OSS in Zone C of the China (Hong Kong) region, so the failure could not be resolved by retrying the operation. In addition, some operations in the DataWorks and Container Service for Kubernetes (ACK) consoles were also affected. The RunInstances and CreateInstance operations were restored at 23:11 on the same day.
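The throttling behavior described above — new-instance API calls being rejected once request volume overwhelms the remaining control-plane capacity — is commonly implemented with a token bucket. The sketch below is a generic illustration of that pattern, not Alibaba Cloud's actual policy; the rate and burst values are assumptions chosen for the demo.

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter (illustrative, not Alibaba Cloud's policy)."""

    def __init__(self, rate, burst):
        self.rate = rate            # tokens refilled per second
        self.capacity = burst       # maximum bucket size
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                # request is throttled

# Simulate 100 RunInstances-style calls arriving in the same instant
# against a limiter allowing 10 requests/s with a burst of 20: only
# the first 20 succeed, mirroring the sharp availability drop a
# sudden purchase surge causes.
bucket = TokenBucket(rate=10, burst=20)
t0 = time.monotonic()
results = [bucket.allow(now=t0) for _ in range(100)]
print(sum(results))  # 20 accepted, 80 throttled
```

Because the bucket refills over time, availability recovers automatically once the surge subsides — which matches the eventual restoration of the APIs later that day.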

Storage services: At 10:37 on December 18, the event began to affect OSS services deployed in Zone C of the China (Hong Kong) region. At the time, the effect was imperceptible to customers, but to prevent data loss that could have arisen from bad sectors caused by the heat, we shut down the servers. This resulted in a service downtime from 11:07 to 18:26. Zone C of the China (Hong Kong) region provides OSS in two redundancy types: locally redundant storage (LRS) and zone-redundant storage (ZRS). In LRS mode, the service is deployed only in Zone C, while in ZRS mode, the service is deployed across three zones (for example, Zones B, C, and D). During this event, services deployed in ZRS mode were not affected. However, services deployed in LRS mode experienced prolonged disruptions until the devices in Zone C were restored. At 18:26, most servers were gradually rebooted. For the remaining servers on which LRS services were hosted, we took the time to properly dry out the servers and double-check data integrity before putting them back online. This process was completed at 00:30 on December 19.
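The difference between the two redundancy types can be stated as a simple availability rule: LRS keeps all replicas in one zone, while ZRS spreads them across three zones and stays readable as long as a quorum of zones survives. The model below is a deliberately minimal sketch; the zone names and the 2-of-3 quorum are illustrative assumptions, not OSS internals.

```python
def lrs_available(home_zone, failed_zones):
    """Locally redundant storage: all replicas live in a single zone."""
    return home_zone not in failed_zones

def zrs_available(zones, failed_zones, quorum=2):
    """Zone-redundant storage: readable while a quorum of zones survives."""
    return sum(z not in failed_zones for z in zones) >= quorum

failed = {"hk-zone-c"}  # the zone lost in this event

print(lrs_available("hk-zone-c", failed))
# False: LRS data in Zone C is offline until the zone recovers

print(zrs_available(("hk-zone-b", "hk-zone-c", "hk-zone-d"), failed))
# True: two of three zones remain, so ZRS keeps serving
```

This is exactly the asymmetry the event exposed: ZRS buckets rode out the zone loss, while LRS buckets had to wait for the Zone C hardware to be restored.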

Network services: A small number of services that support only single-zone deployment were affected by this event, such as VPN Gateway, PrivateLink, and some Global Accelerator (GA) instances. At 11:21 on December 18, our engineers performed cross-zone disaster recovery on the network services and restored most network services, such as SLB, by 12:45. By 13:47, cross-zone disaster recovery was completed for NAT Gateway. Apart from the single-zone services mentioned above and NAT Gateway, which experienced a few minutes of service degradation, other network services maintained business continuity throughout the event.

Database services: At 10:17 on December 18, alerts indicated that some ApsaraDB RDS instances in Zone C of the China (Hong Kong) region were going offline. As more servers were affected by this event, more instances ran into issues, which prompted our engineers to implement our contingency plan. By 12:30, we completed the failover for most of the instances that support zone redundancy, including services such as ApsaraDB RDS for MySQL, ApsaraDB for Redis, ApsaraDB for MongoDB, and Data Transmission Service (DTS). As for the instances that support only single-zone deployment, only a few were successfully migrated, as the process required access to the backup data stored in Zone C.

In the process, cross-zone failover failed for a small number of ApsaraDB RDS instances. These instances relied on proxies deployed in Zone C; with the proxies unavailable, customers could not reach the instances through the proxy endpoints. After identifying the issue, we assisted the customers in accessing the instances directly through the endpoints of the primary instances. By 21:30, most database instances had been restored along with the cooling system. For single-zone instances and high-availability instances whose primary and secondary instances were all deployed in Zone C of the China (Hong Kong) region, we offered contingency measures such as instance cloning and instance migration, but due to constraints on the underlying resources, the migration and recovery of some instances required an extended period of time.
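The workaround applied here — bypassing a failed proxy and connecting to the primary instance's own endpoint — can be sketched as a simple connection fallback. The endpoint names and the `connect` stub below are hypothetical placeholders, not real ApsaraDB APIs:

```python
class EndpointUnavailable(Exception):
    """Raised when an endpoint cannot be reached."""

def connect(endpoint, down_endpoints):
    """Hypothetical stand-in for a database driver's connect call."""
    if endpoint in down_endpoints:
        raise EndpointUnavailable(endpoint)
    return f"connected:{endpoint}"

def connect_with_fallback(proxy_endpoint, primary_endpoint, down_endpoints):
    # Prefer the proxy endpoint; fall back to the primary instance's
    # direct endpoint when the proxy (e.g. one hosted in a failed
    # zone) is unreachable.
    try:
        return connect(proxy_endpoint, down_endpoints)
    except EndpointUnavailable:
        return connect(primary_endpoint, down_endpoints)

# The proxy was deployed in the failed zone; the primary was not.
down = {"proxy.zone-c.example.internal"}
result = connect_with_fallback("proxy.zone-c.example.internal",
                               "primary.zone-b.example.internal", down)
print(result)  # connected:primary.zone-b.example.internal
```

In practice an application would keep both endpoints in its configuration so this switch needs no code change during an incident; the event showed what happens when only the proxy endpoint is configured.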

We noticed that customers whose business was deployed across multiple zones were able to keep their business running during the event. We therefore recommend that customers with stringent high-availability requirements adopt a multi-zone architecture across their entire stack to guard against the impact of unexpected events.
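The recommendation above can be made concrete with a small placement sketch: deploy replicas of each component across at least two zones, and route traffic only to replicas in healthy zones. The zone names and routing helper are illustrative assumptions:

```python
def healthy_replicas(replicas, failed_zones):
    """Return the IDs of replicas whose zone is still healthy.

    replicas maps replica ID -> zone name.
    """
    return [rid for rid, zone in replicas.items() if zone not in failed_zones]

# A service deployed across two zones keeps serving when one zone fails...
multi_zone = {"web-1": "hk-zone-b", "web-2": "hk-zone-c"}
print(healthy_replicas(multi_zone, failed_zones={"hk-zone-c"}))   # ['web-1']

# ...while a single-zone deployment loses all capacity at once.
single_zone = {"web-1": "hk-zone-c", "web-2": "hk-zone-c"}
print(healthy_replicas(single_zone, failed_zones={"hk-zone-c"}))  # []
```

The same rule applies at every layer — load balancers, application servers, databases, and object storage — which is why the advice is to apply the multi-zone design "throughout" the business rather than at the front end only.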

Issue Analysis and Corrective Action

1. Prolonged recovery of the cooling system
Issue analysis: Air entered the cooling system of the data center due to a water shortage, forming air locks that disrupted water circulation in the four active chillers. Because the four standby chillers shared the same water circulation system, the failover to them also failed. Even after replenishing the water and bleeding the air, the chillers could not be restarted individually, because all chillers in the data center were bound into a single control cluster. Engineers had to manually modify the chiller configurations to allow them to run independently and then restart the chillers one after another, which slowed the restoration of the entire cooling system. In total, it took 3 hours and 34 minutes to locate the cause, 2 hours and 57 minutes to replenish the water and bleed the air from the equipment, and 3 hours and 32 minutes to unbind the four chillers from the cluster and restart them.
Corrective action: We will perform comprehensive checks on the infrastructure control systems of data centers. We will expand the scope of metrics we monitor to obtain more fine-grained data and to ensure more efficient troubleshooting. We will also ensure that automatic failover and manual failover are both effective to avoid disaster recovery failures due to deadlocks.

2. Fire sprinkler system triggered due to slow on-site incident response
Issue analysis: With the failure of the cooling system, temperatures in the server rooms rose uncontrollably, eventually triggering the fire sprinkler system in one room. Water got into multiple power supply cabinets and server racks, damaging hardware and complicating the equipment restoration process.
Corrective action: We will strengthen the management of data center infrastructure providers, improve our contingency plan for data center cooling issues, and regularly conduct emergency response drills. The contingency plan will include standardized emergency response processes and clarify when servers must be shut down and server rooms must be powered off.

3. Failure to support customer operations such as purchasing new ECS instances
Issue analysis: The control plane of ECS in the China (Hong Kong) region is deployed in dual-active mode across Zones B and C. When Zone C failed, Zone B took over and served the entire region on its own. However, the control-plane resources in Zone B were quickly depleted by two factors: a large influx of new instance purchases in the other zones of the China (Hong Kong) region, and the recovery traffic generated as ECS instances in Zone C were brought back up. We attempted to scale out the control plane, but the newly added control-plane nodes depended on middleware deployed in the Zone C data center, so the scale-out could not be completed for an extended period. Moreover, the custom images of ECS instances were stored in LRS of OSS in Zone C, so customers who purchased new ECS instances in the other zones could not get their new instances started.
Corrective action: We will perform a full review of our services and improve the high-availability architecture of our multi-zone products, eliminating risks such as dependency on services in a single zone. We will also strengthen the disaster recovery drills on control planes of Alibaba Cloud services to become better prepared against such events.
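One way to act on this corrective action is to scan a service dependency graph for services that transitively depend on a component confined to a single zone. The sketch below is a hypothetical audit, with service names and zone layout loosely mirroring the event (the control plane depending on Zone C middleware and an LRS image store):

```python
def single_zone_risks(deps, zone_of):
    """Flag services whose transitive dependencies include a component
    confined to a single zone.

    deps maps service -> list of direct dependencies;
    zone_of maps each single-zone component to its zone.
    """
    risks = {}
    for service in deps:
        stack, seen = [service], set()
        while stack:                 # depth-first walk of dependencies
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            if node in zone_of:      # found a single-zone component
                risks.setdefault(service, []).append(f"{node} ({zone_of[node]})")
            stack.extend(deps.get(node, []))
    return risks

# Hypothetical layout mirroring the event.
deps = {
    "ecs-control-plane": ["middleware", "image-service"],
    "image-service": ["oss-lrs-bucket"],
    "middleware": [],
}
zone_of = {"middleware": "hk-zone-c", "oss-lrs-bucket": "hk-zone-c"}
print(single_zone_risks(deps, zone_of))
```

Running such a check regularly — and in disaster recovery drills — surfaces hidden single-zone dependencies before a zone failure does.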

4. Lack of timely and clear information updates
Issue analysis: After the event occurred, we communicated with customers through DingTalk groups and official announcement channels. However, partly because progress on restoring the chillers was slow, these channels did not provide enough useful information. In addition, delayed updates to the health status page caused confusion among our customers.
Corrective action: We will improve the speed and accuracy with which we assess the impact of failures and communicate with customers during such events. We will also release a new version of the health status page in the near future to keep our customers better informed about how failures affect their services.

We apologize to all customers affected by this event, and we will process compensation as soon as we can. This event severely impacted the business of many of our customers and was the longest large-scale failure in Alibaba Cloud's history of more than a decade. Our customers expect highly reliable services, and we are constantly striving to deliver them. We will do our best to learn from this event and improve the availability of our services.

Alibaba Cloud
December 25, 2022








