[Resolved] Service Outage in Zone C of the China (Hong Kong) Region

We want to provide you with information about the major service disruption that affected services in Zone C of the China (Hong Kong) region on December 18, 2022 (UTC+8).

Event Summary
At 08:56 on December 18, 2022, we were alerted to rising corridor temperatures in data center rooms in Zone C of the China (Hong Kong) region. Our engineers immediately started to inspect the situation and notified the data center infrastructure provider. At 09:01, alerts were generated for rising temperatures in multiple server rooms, and the on-site engineers identified the issue as a malfunction of the data center's chillers. At 09:09, the data center infrastructure provider's engineers performed a 4+4 active-standby switchover of the malfunctioning chillers and restarted them in accordance with the contingency plan, but the chiller units did not recover. At 09:17, on-site engineers implemented the contingency plan for cooling failure in accordance with the issue handling process and took auxiliary heat dissipation and emergency ventilation measures. The data center infrastructure provider's engineers attempted to manually isolate and restore the chillers one by one, but the chillers still could not run stably, so the chiller manufacturer was asked to send engineers to the site. By this point, the rising temperature had begun to affect some servers.

At 10:30, we began to reduce loads on the computing, storage, network, database, and big data clusters across the entire data center to prevent the heat from creating a fire hazard. During this period, the on-site engineers made repeated attempts to restore the chillers, but the chillers still could not run stably.

At 12:30, the chiller manufacturer's engineers arrived on site. After a joint diagnosis, the engineers manually refilled water and bled air from the cooling towers, condenser water pipes, and chiller condensers, but the cooling system still could not run stably. Our engineers began shutting down servers in the rooms with high temperatures. At 14:47, while the chiller manufacturer's engineers were still struggling to troubleshoot the equipment, the high temperature triggered the fire sprinkler system in one of the rooms. At 15:20, the chiller manufacturer's engineers manually adjusted the configuration to unlock the chillers from group control and run them individually; the first chiller was restored and the temperature began to drop. The engineers then restored the other chillers in the same way. At 18:55, four chillers were back to normal cooling capacity. At 19:02, our engineers began restarting servers in batches while monitoring the temperature. At 19:47, the server room temperatures stabilized, and our engineers started to restore services and perform the necessary data integrity checks.

At 21:36, the temperatures of most server rooms were holding stable, and servers in these rooms were restarted. Necessary checks were also completed. Servers in the room where the fire sprinkler system was triggered were not powered on. To ensure data integrity, our engineers took the necessary steps to conduct a careful data security check on servers in this room, which required an extended period of time. At 22:50, the data security check and risk assessment were completed. Then, the power supply was restored to the last room and all servers were successfully restarted.
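While the exact data security checks used during the recovery are internal procedures, the general principle of verifying data against known-good checksums before powering servers back on can be sketched as follows. This is a generic illustration only, assuming a hypothetical manifest file and directory layout; it is not Alibaba Cloud tooling.

```python
# Minimal sketch of a pre-restart integrity check: compare files against a
# previously recorded SHA-256 manifest. Paths and the manifest format are
# hypothetical placeholders, not Alibaba Cloud internal tooling.
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_against_manifest(data_dir: Path, manifest_path: Path) -> list[str]:
    """Return the files whose current hash differs from the recorded manifest."""
    manifest = json.loads(manifest_path.read_text())  # {"relative/path": "hex digest"}
    mismatches = []
    for rel_path, expected in manifest.items():
        file_path = data_dir / rel_path
        if not file_path.exists() or sha256_of(file_path) != expected:
            mismatches.append(rel_path)
    return mismatches

if __name__ == "__main__":
    bad = verify_against_manifest(Path("/data/volume01"),
                                  Path("/data/volume01.manifest.json"))
    print("OK to power on" if not bad else f"Hold for repair: {bad}")
```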

Service Impact
Compute services: At 09:23 on December 18, 2022, some Elastic Compute Service (ECS) servers in Zone C of the China (Hong Kong) region began to shut down, which triggered failover migration within the zone. As the temperature continued to rise, more servers went offline and customers' business began to be affected. The impact extended to more services in Zone C, such as Elastic Block Storage (EBS), Object Storage Service (OSS), and ApsaraDB RDS.

While the event did not directly affect customer workloads running in other zones of the China (Hong Kong) region, it affected the ECS control plane for the entire region. Because a large number of Zone C customers purchased new ECS instances in other zones of the region, the ECS control plane triggered throttling starting at 14:49 on December 18, and its availability dropped to as low as 20%. Customers who purchased ECS instances with custom images by using the RunInstances or CreateInstance operation could complete the purchase, but some of those instances failed to start, because the custom image data service depends on the single-zone (LRS) edition of OSS in Zone C; this could not be resolved by retrying the operation. Some operations in the DataWorks and Container Service for Kubernetes (ACK) consoles were also affected. The control plane APIs were fully restored at 23:11 on the same day.
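For customers caught by the throttling and custom-image issues described above, one client-side mitigation is to retry instance creation across healthy zones with backoff and, when a custom image whose data lives in the affected zone cannot be used, to fall back to a public image. The sketch below illustrates that idea under stated assumptions: it presumes the aliyun-python-sdk-core and aliyun-python-sdk-ecs packages, and the image IDs, instance type, zone list, and throttling error-code check are placeholders to verify against the ECS API documentation.

```python
# Hedged sketch: create an ECS instance, retrying across zones with backoff
# when the control plane throttles, and falling back to a public image if a
# custom image cannot be used. Credentials, image IDs, zones, and error-code
# matching are placeholders; real calls usually also need a per-zone
# VSwitchId and a security group.
import json
import time

from aliyunsdkcore.client import AcsClient
from aliyunsdkcore.acs_exception.exceptions import ServerException
from aliyunsdkecs.request.v20140526.RunInstancesRequest import RunInstancesRequest

ZONES = ["cn-hongkong-b", "cn-hongkong-d"]          # avoid the impaired zone
CUSTOM_IMAGE = "m-hypothetical-custom-image"        # placeholder image IDs
FALLBACK_IMAGE = "public-image-id-placeholder"

client = AcsClient("<access-key-id>", "<access-key-secret>", "cn-hongkong")

def run_instance(zone_id: str, image_id: str) -> str:
    request = RunInstancesRequest()
    request.set_ZoneId(zone_id)
    request.set_ImageId(image_id)
    request.set_InstanceType("ecs.g6.large")
    request.set_Amount(1)
    response = json.loads(client.do_action_with_exception(request))
    return response["InstanceIdSets"]["InstanceIdSet"][0]

def create_with_fallback() -> str:
    for image_id in (CUSTOM_IMAGE, FALLBACK_IMAGE):
        for zone_id in ZONES:
            for attempt in range(3):
                try:
                    return run_instance(zone_id, image_id)
                except ServerException as err:
                    # Back off on throttling; on other errors try the next zone/image.
                    if "Throttling" in str(err.get_error_code()):
                        time.sleep(2 ** attempt)
                        continue
                    break
    raise RuntimeError("instance creation failed in all zones with all images")

if __name__ == "__main__":
    print(create_with_fallback())
```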

Storage services: At 10:37 on December 18, the event began to affect some OSS servers in Zone C of the China (Hong Kong) region. The effect was not yet perceptible to customers at that point, but sustained high temperatures can cause bad disk sectors and endanger data safety, so our engineers shut the servers down, which interrupted the service from 11:07 to 18:26. In the China (Hong Kong) region, OSS is offered in two redundancy types: locally redundant storage (LRS, often called single-zone redundancy), which is deployed only in Zone C, and zone-redundant storage (ZRS, often called 3-AZ redundancy), which is deployed across Zones B, C, and D. During this event, the ZRS service was largely unaffected. The LRS service in Zone C, however, experienced a prolonged interruption, because LRS does not support cross-zone failover and had to wait for the affected data center to recover. From 18:26 onward, the storage servers were rebooted in batches. Some servers hosting LRS data had to be isolated because of the fire sprinkler discharge, and we took the time needed to verify data integrity before putting them back online. The LRS-based OSS service was not restored until 00:30 on December 19.
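Whether a bucket uses LRS or ZRS is chosen by the customer when the bucket is created. As a hedged sketch using the oss2 Python SDK, the following creates a bucket with zone-redundant storage; the endpoint and bucket name are placeholders, and the exact way BucketCreateConfig accepts the data redundancy type should be verified against the installed SDK version.

```python
# Hedged sketch: create an OSS bucket with zone-redundant storage (ZRS) so
# that objects survive the loss of a single zone. Assumes the oss2 SDK;
# the endpoint, bucket name, and the data_redundancy_type argument of
# BucketCreateConfig should be verified against the SDK version in use.
import oss2

auth = oss2.Auth("<access-key-id>", "<access-key-secret>")
endpoint = "https://oss-cn-hongkong.aliyuncs.com"   # placeholder endpoint
bucket = oss2.Bucket(auth, endpoint, "example-zrs-bucket")

config = oss2.models.BucketCreateConfig(
    storage_class=oss2.BUCKET_STORAGE_CLASS_STANDARD,
    data_redundancy_type="ZRS",   # "LRS" (single zone) is the default
)
bucket.create_bucket(oss2.BUCKET_ACL_PRIVATE, config)
print("bucket created with ZRS redundancy")
```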

Network services: A small number of single-zone network products, such as VPN Gateway, PrivateLink, and some Global Accelerator (GA) instances, were affected by this event. At 11:21 on December 18, our engineers started cross-zone disaster recovery for network products; by 12:45 it was completed for most products, including SLB, and by 13:47 it was completed for NAT Gateway. Apart from the single-zone products mentioned above and a few minutes of service degradation on NAT Gateway, network products maintained business continuity throughout the event.

Database services: Starting at 10:17 on December 18, alerts were generated for some ApsaraDB RDS instances in Zone C of the China (Hong Kong) region that had become unavailable. As more hosts in the zone were affected, the number of abnormal instances grew, and our engineers started the emergency database failover procedure. By 12:30, cross-zone failover was completed for most instances that support it, including ApsaraDB RDS for MySQL, ApsaraDB for Redis, ApsaraDB for MongoDB, and Data Transmission Service (DTS). For single-zone instances and single-zone high-availability instances, only a few could be effectively migrated, because the process depended on backup data stored in that single zone.

In the process, cross-zone failover was not completed in time for a small number of ApsaraDB RDS instances that do support it. The investigation showed that these instances relied on a database proxy service deployed in Zone C; because the proxy service was unavailable, the instances could not be reached through their proxy endpoints. We assisted the affected customers in temporarily switching to the endpoints of the primary instances to restore access. As the cooling system recovered, most database instances were back to normal by around 21:30. For single-node instances and for high-availability instances whose primary and secondary nodes were both deployed in Zone C of the China (Hong Kong) region, we offered contingency measures such as instance cloning and instance migration, but due to limits on the underlying resources, the migration and recovery of some instances ran into issues and took an extended period of time to resolve.
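The temporary workaround described above, switching from the proxy endpoint to the primary instance's endpoint, can be applied on the client side. The sketch below uses PyMySQL; the hostnames, credentials, and database name are hypothetical placeholders that must be replaced with the values shown in the ApsaraDB RDS console.

```python
# Hedged sketch: try the RDS proxy endpoint first and fall back to the
# primary instance endpoint if the proxy is unreachable. Hostnames,
# credentials, and database name are hypothetical placeholders.
import pymysql

PROXY_HOST = "example-proxy.rds.aliyuncs.com"      # hypothetical proxy endpoint
PRIMARY_HOST = "example-primary.rds.aliyuncs.com"  # hypothetical primary endpoint

def connect_with_fallback() -> pymysql.connections.Connection:
    for host in (PROXY_HOST, PRIMARY_HOST):
        try:
            return pymysql.connect(
                host=host,
                user="app_user",
                password="<password>",
                database="app_db",
                connect_timeout=3,
            )
        except pymysql.err.OperationalError as err:
            print(f"connection to {host} failed: {err}")
    raise RuntimeError("both the proxy and the primary endpoint are unreachable")

if __name__ == "__main__":
    conn = connect_with_fallback()
    with conn.cursor() as cursor:
        cursor.execute("SELECT 1")
        print(cursor.fetchone())
```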

We noticed that customers who ran their business across multiple zones were able to keep it running during this event. For customers with stringent high-availability requirements, we continue to recommend a multi-zone architecture across the entire service chain to withstand unexpected events of all kinds.
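As a simple illustration of this multi-zone principle, the sketch below spreads a desired number of instances evenly across zones and checks that capacity still meets demand if any single zone is lost. The zone names and numbers are examples only, not a sizing recommendation.

```python
# Illustrative sketch: spread instances evenly across zones and verify that
# the deployment still meets the required capacity if any one zone is lost
# (an "N-1 zone" check). Zone names and figures are examples only.
from collections import Counter

def plan_placement(instance_count: int, zones: list[str]) -> Counter:
    plan = Counter()
    for i in range(instance_count):
        plan[zones[i % len(zones)]] += 1
    return plan

def survives_single_zone_loss(plan: Counter, required: int) -> bool:
    total = sum(plan.values())
    return all(total - count >= required for count in plan.values())

if __name__ == "__main__":
    zones = ["cn-hongkong-b", "cn-hongkong-c", "cn-hongkong-d"]
    required = 8
    # Over-provision so that losing any single zone still leaves enough capacity.
    plan = plan_placement(required + required // (len(zones) - 1), zones)
    print(dict(plan), "survives zone loss:", survives_single_zone_loss(plan, required))
```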

Issue Analysis and Corrective Action

1. Prolonged recovery of the cooling system
Issue analysis: A lack of water allowed air to enter the cooling system and form an air lock, which disrupted water circulation and caused the four active chillers to fail. Starting the four standby chillers also failed, because they share the same water circulation system and were affected by the same air lock. After the water pans were refilled, the chillers still could not be started individually, because the data center's chillers are managed by a group control system. Engineers had to manually modify the chiller configuration to switch the chillers from group control to standalone operation and then start them one after another, which prolonged the recovery of the cooling system. In total, it took 3 hours and 34 minutes to locate the cause, 2 hours and 57 minutes to refill water and bleed air from the equipment, and 3 hours and 32 minutes to unlock the group control logic and start the four chillers.
Corrective action: We will perform comprehensive checks on the infrastructure control systems of data centers. We will expand the scope of metrics we monitor to obtain more fine-grained data and to ensure more efficient troubleshooting. We will also ensure that automatic failover and manual failover are both effective to avoid disaster recovery failures due to deadlocks.

2. Fire sprinkler system triggered due to slow on-site incident response
Issue analysis: As the cooling system failed, temperatures in the server rooms kept rising, and the temperature in one room reached the critical threshold that triggers the fire sprinkler system. Water entered power supply cabinets and multiple rows of server racks, damaging some hardware and adding to the difficulty and duration of the subsequent recovery.
Corrective action: We will strengthen the management of data center infrastructure providers, improve our contingency plan for data center cooling issues, and regularly conduct emergency response drills. The contingency plan will include standardized emergency response processes and clarify when servers must be shut down and server rooms must be powered off.

3. Failure of control plane operations such as purchasing new ECS instances
Issue analysis: The ECS control plane in the China (Hong Kong) region is deployed across Zone B and Zone C in a dual-data-center disaster recovery setup. After Zone C failed, Zone B took over and served the entire region on its own. The control plane resources in Zone B were quickly depleted by two factors: a large influx of new instance purchases in the other zones of the region, and the additional traffic generated by the recovery of ECS instances in Zone C. We attempted to scale out the control plane, but the newly added control plane services could not start, because the middleware they depend on was deployed in the Zone C data center; as a result, capacity could not be added for an extended period. In addition, the custom image data service that the ECS control plane relies on depends on the single-zone (LRS) edition of OSS in Zone C, which is why some newly purchased instances failed to start.
Corrective action: We will perform a full review of our services and improve the high-availability architecture of our multi-zone products, eliminating risks such as dependency on services in a single zone. We will also strengthen the disaster recovery drills on control planes of Alibaba Cloud services to become better prepared against such events.
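The corrective action of eliminating single-zone dependencies can be approached as a graph problem: walk each product's dependency chain and flag any transitive dependency that is pinned to one zone. The sketch below is a generic illustration with made-up service names; it does not describe Alibaba Cloud's internal tooling.

```python
# Illustrative sketch: given a service dependency graph annotated with the
# zones each dependency runs in, flag every transitive dependency that is
# deployed in only a single zone. Service names and zones are made up.
from collections import deque

# service -> (zones it runs in, services it depends on)
GRAPH = {
    "ecs-control-plane": ({"B", "C"}, ["middleware-x", "image-service"]),
    "middleware-x":      ({"C"},      []),
    "image-service":     ({"B", "C"}, ["oss-lrs-zone-c"]),
    "oss-lrs-zone-c":    ({"C"},      []),
}

def single_zone_dependencies(root: str) -> dict[str, set[str]]:
    """Return every transitive dependency of `root` that lives in one zone."""
    risky, seen, queue = {}, {root}, deque([root])
    while queue:
        _, deps = GRAPH[queue.popleft()]
        for dep in deps:
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
            dep_zones = GRAPH[dep][0]
            if len(dep_zones) == 1:
                risky[dep] = dep_zones
    return risky

if __name__ == "__main__":
    print(single_zone_dependencies("ecs-control-plane"))
    # expected: {'middleware-x': {'C'}, 'oss-lrs-zone-c': {'C'}}
```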

4. Lack of timely and clear information updates
Issue analysis: After the event occurred, we notified customers through DingTalk groups and official announcements. However, because the on-site work on the chillers progressed slowly, these updates contained limited useful information, and the untimely updates on the service health status page caused confusion among our customers.
Corrective action: We will improve our ability to quickly assess and identify the impact of failures on customers when such events occur. We will also release a new version of the Alibaba Cloud service health status page as soon as possible to speed up information updates, so that customers can more easily see how an event affects each product and service.

Summary
We want to apologize to all customers affected by this event, and we will handle compensation as soon as we can. This event had a severe impact on the business of many customers and was the longest large-scale outage in Alibaba Cloud's more than ten years of operation. Stability is the lifeline of a cloud service and is critical to our customers. We will do everything we can to learn from this event, keep improving the stability of our services, and live up to the trust that our customers place in us.


Alibaba Cloud
December 25, 2022