Fighting Coronavirus: Freshippo Reveals 12 Key Technologies to Achieve 0 Faults over a Year (Part 2)

In part 2 of this blog post, we'll continue exploring how Alibaba's retail brand, Freshippo, managed to achieve 0 faults over a year through technological innovation.

By Zhang Peng (nicknamed Zhangpeng)

In the first part of this blog series, we briefly mentioned how Alibaba's self-operated grocery chain, Hema Fresh, also known as Freshippo, has become a lifeline for many local residents in China during the coronavirus outbreak, by committing to a policy of remaining open for business. Additionally, it has not raised its prices and has remained stocked, particularly in the 18 stores across Wuhan, while teaming up with local restaurant chains to provide free meals and necessities to the hospital staff in Wuhan and emergency response teams.

In part two of this blog series, we'll continue discussing the 12 technologies that let Freshippo achieve 0 faults throughout an entire year.

3.4 Tair Dependency

(1) Use cases of Tair products

MDB is a commonly used cache that supports high QPS, but it does not guarantee persistence. MDB is deployed either as a single cluster or as independent clusters. Two independent clusters store their data completely separately, so there is no guarantee that data can be read back after it is cached. MDB is not suitable for distributed locks.

LDB is suitable for persistent storage. It is not needed in other scenarios.

RDB has a synchronous eviction policy. This policy is triggered when the capacity becomes insufficient, which delays writes. Batch interfaces accept a maximum of 1,024 items per request. We recommend that you request a maximum of 100 items at a time.

(2) Cache capacity and QPS

When the cache capacity or QPS becomes insufficient, you need to increase the capacity. MDB supports 1 million QPS, and RDB supports 100 thousand QPS.

(3) Throughput

Tair is not suitable for large keys or large values in any scenario. The key and value should be less than 1,000 and 10,000, respectively. In extreme cases, exceeding these thresholds can cause data to be split, resulting in data loss. For batch writing, reduce the number of batches and ensure that the skey value is less than 5,120.

(4) Cache timeout

The cache requires a timeout period, which is specified in seconds. Some users mistakenly pass a value in milliseconds, which results in a much longer timeout period than intended and can cause the cache to overflow.
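One way to guard against this unit mix-up is to convert an explicit duration at the call site. The following is a minimal sketch, not part of Tair: the helper name `ttl_seconds` and the 7-day upper bound are assumptions chosen for illustration.

```python
import math
from datetime import timedelta

# Hypothetical upper bound; choose one that matches your own cache policy.
MAX_TTL = timedelta(days=7)

def ttl_seconds(ttl: timedelta) -> int:
    """Convert a duration to whole seconds for a cache API that expects seconds."""
    if ttl <= timedelta(0):
        raise ValueError("TTL must be positive")
    if ttl > MAX_TTL:
        # A value this large is often milliseconds passed where seconds were expected.
        raise ValueError(f"TTL {ttl} exceeds {MAX_TTL}; check the unit")
    # Round up so that sub-second durations do not become 0.
    return math.ceil(ttl.total_seconds())
```

Passing `timedelta(minutes=5)` makes the unit visible in the code, and an accidental millisecond-sized value fails fast instead of silently filling the cache.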

(5) Independent cluster and single cluster

Two independent MDB clusters are completely independent of each other. For example, Na61 is accessible only to data center 61 and not to data center 62. Because the clusters are independent, data cached in one cluster cannot be read from the other.

(6) Tair distributed lock

The Tair lock uses the put and get methods, and its version starts from 2. The lock must take timeouts into account, and a retry mechanism can be used to reduce their impact. After a lock operation times out, the returned ResultCode may be ResultCode.CONNERROR, ResultCode.TIMEOUT, or ResultCode.UNKNOW.
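The retry idea can be sketched as follows. This is an illustration under stated assumptions, not the Tair client API: `FakeStore` is a hypothetical in-memory stand-in whose `put_if_absent` only succeeds when the key is absent, it does not reproduce Tair's versioned put, and lock expiration is omitted. The point it shows is that after an ambiguous result code, the caller reads the key back to find out whether its own write landed.

```python
import uuid

SUCCESS, EXISTS = "SUCCESS", "EXISTS"
AMBIGUOUS = {"CONNERROR", "TIMEOUT", "UNKNOW"}  # result codes named in the text

class FakeStore:
    """Hypothetical in-memory stand-in for a cache client."""
    def __init__(self, timeouts=0):
        self.data = {}
        self.timeouts = timeouts  # number of writes that land but report TIMEOUT

    def put_if_absent(self, key, value):
        if key in self.data:
            return EXISTS
        self.data[key] = value
        if self.timeouts > 0:
            self.timeouts -= 1
            return "TIMEOUT"  # the write succeeded, but the caller cannot tell
        return SUCCESS

    def get(self, key):
        return self.data.get(key)

def try_lock(store, key, max_attempts=3):
    """Return the owner token if the lock was acquired, else None."""
    token = uuid.uuid4().hex  # identifies this caller as the owner
    for _ in range(max_attempts):
        code = store.put_if_absent(key, token)
        if code == SUCCESS:
            return token
        # The write may have landed even though the call failed, so read back.
        if store.get(key) == token:
            return token
        if code not in AMBIGUOUS:
            return None  # the lock is held by someone else
    return None
```

Using a unique token as the lock value is what makes the read-back check safe: a caller only treats the lock as its own when it finds the exact value it tried to write.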

(7) Consistency of cached data

For simple consistency issues, such as keeping a database and the cache consistent, you can add a channel through which the database sends a message after each successful update, and make the cache update idempotent. If the system crashes along the way, data can be reconciled based on the database change messages from Data Replication System or on binlog messages.
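An idempotent cache update can be sketched as follows. This is a minimal illustration that assumes each change message carries a per-row `version` that only increases; the message fields `key`, `version`, and `value` are hypothetical names, not a Data Replication System or binlog format.

```python
def apply_change(cache: dict, msg: dict) -> bool:
    """Apply a database change message to the cache idempotently.

    Returns True if the cache was updated, and False if the message was a
    duplicate or older than what the cache already holds.
    """
    key, version = msg["key"], msg["version"]
    current = cache.get(key)
    if current is not None and current["version"] >= version:
        return False  # replayed or out-of-order message: safe to ignore
    cache[key] = {"version": version, "value": msg["value"]}
    return True
```

Because stale and duplicate messages are ignored, the same stream of change messages can be replayed during reconciliation without corrupting the cache.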

Some consistency problems are less obvious. For example, a scoring service used caching to reduce Cartesian product calculation. At first, the cache key was set to a ticket and the value was set to a similarity score object. However, when the algorithm uses the ticket object, the similarity score class also contains the on-site deliverymen for assigned tickets. Assume that the deliveryman of a ticket is deliveryman 1. In the second calculation, the same deliveryman is obtained from the cache even though this deliveryman is already off-site, causing a null pointer (NP) error. To correct this problem, the cache key must include both the ticket and the deliveryman so that the deliveryman is scored correctly.

To summarize, objects that are modified by context are not suitable for caching. Cached objects must be fine-grained and must not be modified by context.
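The fix for the scoring example can be sketched as a cache whose key contains every input the score depends on. The class and parameter names below are hypothetical, and the scoring function is supplied by the caller.

```python
class ScoreCache:
    """Caches one score per (ticket, deliveryman) pair."""
    def __init__(self, compute):
        self._compute = compute  # function (ticket_id, deliveryman_id) -> score
        self._cache = {}

    def get(self, ticket_id, deliveryman_id):
        # The key includes both IDs, so a score computed for one deliveryman
        # is never returned for another.
        key = (ticket_id, deliveryman_id)
        if key not in self._cache:
            self._cache[key] = self._compute(ticket_id, deliveryman_id)
        return self._cache[key]
```

The cached value is a plain score rather than an object that holds deliveryman state, which keeps it independent of context.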

(8) Cache breakdown

It is highly risky to fall through to the database when the cache fails. This is especially true under high QPS, which can easily crash the database, and faults follow. To reduce the risk, you can extend the expiration period of hotspot data, use an in-memory cache to store static data, and use a lock policy to protect and throttle the database.
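A lock policy that protects the database can be sketched as follows: on a cache miss, only one thread per key is allowed to reload from the database, and the others wait for its result. This is a simplified in-process illustration with hypothetical names. Expiration and distributed locking are omitted.

```python
import threading

class GuardedCache:
    """Cache that lets only one thread per key reload from the database."""
    def __init__(self, load_from_db):
        self._load = load_from_db
        self._data = {}
        self._locks = {}
        self._locks_guard = threading.Lock()

    def _lock_for(self, key):
        # Create at most one lock object per key.
        with self._locks_guard:
            return self._locks.setdefault(key, threading.Lock())

    def get(self, key):
        if key in self._data:
            return self._data[key]
        with self._lock_for(key):
            # Re-check: another thread may have loaded the value while we waited.
            if key not in self._data:
                self._data[key] = self._load(key)
            return self._data[key]
```

With this pattern, a burst of requests for the same missing key results in a single database query instead of one query per request.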

3.5 MetaQ Dependency

(1) MetaQ consumer is not registered online

This issue can cause a backlog of messages to be sent all at once after registration. In one case, after the SMS service was connected to the messaging platform, the online subscription was forgotten. As a result, the accumulated SMS messages were sent in a batch when the subscription was finally made, which caused a lot of confusion. To avoid this, you can clear the topic messages during off-peak hours and then subscribe.

(2) MetaQ size restriction

The size of a single MetaQ message is limited to 128 KB. If this limit is exceeded, the message cannot be sent.
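A size check before sending can catch oversized messages early. The sketch below assumes that the limit applies to the UTF-8 encoded body and that 1 KB means 1,024 bytes. Both are assumptions, and the function name is hypothetical.

```python
MAX_MESSAGE_BYTES = 128 * 1024  # 128 KB limit from the text, assuming 1 KB = 1,024 bytes

def fits_in_one_message(body: str, limit: int = MAX_MESSAGE_BYTES) -> bool:
    """Check the encoded size, not the character count: non-ASCII text takes more bytes."""
    return len(body.encode("utf-8")) <= limit
```

Measuring bytes rather than characters matters for Chinese text, where each character typically takes three bytes in UTF-8.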

(3) MetaQ sending failure

Although MetaQ is relatively stable, there are occasional failures when sending MetaQ messages, which may be due to MetaQ broker cluster jitters, network problems, and other issues. The failure of sending an important MetaQ message must be monitored. To correct this, you can manually resend the message or use message checking to automatically retry the sending.

(4) MetaQ consumption failure

A message is retried for consumption a maximum of 16 times, with a maximum retry interval of 2 hours. You can modify the retry interval to reduce the number of retries. You can also set up a MetaQ retry loop by sending another MetaQ message after 15 retries. We recommend that you do not create an endless loop. You can bound the loop by consumption time.
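Bounding a retry loop by time can be sketched as a small decision function. The 24-hour limit and all names below are assumptions for illustration, and `created_at` is assumed to be the original creation time, carried along each time the message is sent again.

```python
MAX_RETRIES_BEFORE_REQUEUE = 15      # from the text: send another message after 15 retries
MAX_MESSAGE_AGE_SECONDS = 24 * 3600  # hypothetical bound that ends the loop

def next_action(retry_count: int, created_at: float, now: float) -> str:
    """Decide what a consumer does after a failed attempt.

    Returns "retry" (let the broker redeliver), "requeue" (send a fresh copy
    to start a new retry cycle), or "give_up" (stop looping and alert).
    """
    if now - created_at > MAX_MESSAGE_AGE_SECONDS:
        return "give_up"
    if retry_count >= MAX_RETRIES_BEFORE_REQUEUE:
        return "requeue"
    return "retry"
```

Checking the age first guarantees that the loop ends even if every cycle fails.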

(5) MetaQ QPS or TPS

The QPS or TPS of sending to or consuming from a single topic should be less than 3,000. In one case, the TPS of a single topic exceeded 3,000, and we received a DingTalk message from the internal MetaQ developers. Troubleshooting showed that some abandoned Blink tasks were still writing data. After these tasks were stopped, MetaQ was restarted to correct the issue.

(6) MetaQ accumulation monitoring

MetaQ accumulation is generally caused by a processing failure in the business system. Another potential cause is that some channels cannot be consumed due to MetaQ server problems. On the MetaQ platform, you can configure consumption accumulation monitoring, which has a default accumulation threshold of 10 thousand. For important services, you can set a smaller threshold. For example, you can set the maximum accumulation volume to 1,000 data entries.

3.6 Data Replication System Dependencies

(1) Data cannot be written to database JSON fields due to garbled data transmission by the Data Replication System

The database has a field in JSON format. When the field is transmitted to the reader through Data Replication System, the JSON is corrupted by a problem with parsing Chinese characters and therefore cannot be written to the database. For this reason, exercise caution when database fields are in JSON format.

(2) Data Replication System latency

Database changes, inconsistent database indexes, inconsistent fields, and other similar issues can lead to Data Replication System latency. Problems in Data Replication System itself can also result in latency. For important services, a latency alert must be configured. With the acknowledgment of both the DBA and Data Replication System personnel, you can increase the write concurrency or increase the TPS of data synchronization to close the gap.

(3) Data Replication System suspends tasks

You can set automatic restart detection for a suspended Data Replication System task, or subscribe to Data Replication System delay monitoring and manually restart the task.

3.7 DTS Dependency

(1) Discretization of grid tasks and parallel tasks in DTS

Parallel tasks must be discretized with the downstream bearing capacity in mind. Otherwise, they produce QPS hotspots.
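One common way to discretize tasks is to give each one a stable start offset within a time window, so that they do not all call downstream services at the same instant. The following is a generic sketch with a hypothetical function name. It is not a DTS feature.

```python
import hashlib

def start_offset_seconds(task_id: str, window_seconds: int) -> int:
    """Map a task ID to a stable offset in [0, window_seconds)."""
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % window_seconds
```

Each task waits for its own offset before it starts. Hashing the task ID keeps the offset the same across runs, which makes the resulting downstream load easier to reason about than random jitter. The window should be sized so that the per-second rate stays within what the downstream service can bear.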

(2) DTS degradation

A local distributed scheduling task can be implemented for degradation when DTS becomes unavailable.

(3) DTS monitoring

Configure DTS timeout monitoring to detect DTS monitoring failures and monitoring startup failures. In addition to middleware-level monitoring, set up application-level monitoring of the calls made by DTS. When the business volume is stable, the number of DTS scheduling runs per minute should stay within the threshold.

3.8 Switch

(1) Check and monitor switch push

After enabling or disabling the push function, you need to refresh the switch page to check that the change is successful. We recommend that you monitor the log for the effectiveness of the switch change. After enabling the push function, you need to check the log for the effectiveness of the change on all servers.

(2) Batch release switch for multiple Diamonds

During the batch release of multiple Diamond switches, if the release of a switch is suspended, the release of other switches cannot take effect. In this case, you must wait until the release of the other switches is completed.

(3) Switch initialization

Switch code depends on initialization. A null pointer (NP) risk exists if a dependent switch is not initialized during system startup.

(4) Multi-thread switch effect

Most of our switches are used in a multi-threaded environment. When releasing switches, pay attention to data visibility. For complex arrays or map structures, we recommend that you use a shadow instance to change the switch status.
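The shadow instance approach can be sketched as follows: the writer builds a complete new copy of the map off to the side and then publishes it with a single reference swap, so readers never see a half-updated structure. The class name is hypothetical. The sketch relies on attribute rebinding being atomic in CPython, and other runtimes need an equivalent safe-publication mechanism.

```python
import threading

class SwitchHolder:
    """Holds a map-valued switch that is replaced as a whole, never edited in place."""
    def __init__(self, initial: dict):
        self._current = dict(initial)
        self._write_lock = threading.Lock()

    def snapshot(self) -> dict:
        # Readers take the reference once and use that snapshot throughout a request.
        return self._current

    def update(self, changes: dict) -> None:
        with self._write_lock:
            shadow = dict(self._current)  # build the new state off to the side
            shadow.update(changes)
            self._current = shadow        # publish with a single reference swap
```

A reader that took a snapshot before the update keeps working with the old, internally consistent map, while later readers see the new one.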

(5) Use changefree for switch connection

We recommend that you connect switch changes with changefree and set multiple approvers. Otherwise, during an emergency switch push or rollback, no contact person may be available.

3.9 Monitoring

(1) Traffic monitoring

Traffic monitoring includes monitoring the QPS of provided services, the QPS of dependent services, and the number of regular tasks. Traffic monitoring reflects the approximate traffic during a stable business period. Monitoring configuration items include the week-on-week comparison, day-on-day comparison, and traffic threshold. Traffic monitoring requires the constant evaluation and exploration of the business within a reasonable range of changes. Open Data Processing Service (ODPS) can be used to collect data for the past month and set corresponding monitoring values.
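The three configuration items can be combined into a single check, as in the sketch below. The function name, the parameters, and the 50% change tolerance are illustrative assumptions. Real values should come from historical data for the service.

```python
def traffic_alerts(current, yesterday, last_week, floor, ceiling, max_change=0.5):
    """Return the names of the checks that the current QPS violates."""
    alerts = []
    if not floor <= current <= ceiling:
        alerts.append("threshold")
    if yesterday > 0 and abs(current - yesterday) / yesterday > max_change:
        alerts.append("day_on_day")
    if last_week > 0 and abs(current - last_week) / last_week > max_change:
        alerts.append("week_on_week")
    return alerts
```

For example, `traffic_alerts(400, 1000, 1000, 200, 3000)` flags both comparisons because traffic dropped by 60%, even though 400 QPS is still inside the absolute threshold range.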

(2) Number of errors and success rate monitoring

In some business scenarios, the traffic is low but the importance is high. In such cases, it is appropriate to monitor the number of errors and the success rate. Statistics can be averaged per minute so that occasional failures caused by network problems or system load jitters do not trigger false alerts.
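Per-minute aggregation of the success rate can be sketched as follows. The event format `(unix_timestamp, succeeded)` and the 95% alert threshold are assumptions for illustration.

```python
from collections import defaultdict

def success_rate_by_minute(events):
    """events: iterable of (unix_timestamp, succeeded). Returns {minute_start: rate}."""
    totals = defaultdict(lambda: [0, 0])  # minute_start -> [successes, total]
    for ts, ok in events:
        minute = int(ts) // 60 * 60
        totals[minute][0] += 1 if ok else 0
        totals[minute][1] += 1
    return {m: s / n for m, (s, n) in totals.items()}

def minutes_to_alert(rates, min_rate=0.95):
    """Return the minutes whose success rate is below the threshold, in order."""
    return sorted(m for m, r in rates.items() if r < min_rate)
```

For low-traffic services, a single failure can move the per-minute rate a lot, so it may be worth pairing the rate with a minimum error count before alerting.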

(3) Error cause monitoring

Error cause monitoring can quickly locate service errors. Error logs are printed in a fixed format, and the number of error causes is counted by error type.
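Counting error causes from fixed-format logs can be sketched as follows. The log format `ERROR|<error_type>|<detail>` is a hypothetical example, not the format used by Freshippo.

```python
import re
from collections import Counter

# Hypothetical fixed format: "ERROR|<error_type>|<detail>"
ERROR_LINE = re.compile(r"^ERROR\|(?P<type>[A-Z_]+)\|")

def count_error_causes(lines):
    """Count error lines by error type and ignore everything else."""
    counts = Counter()
    for line in lines:
        match = ERROR_LINE.match(line)
        if match:
            counts[match.group("type")] += 1
    return counts
```

Calling `most_common()` on the result lists the dominant error types first, which is usually the fastest way to see what went wrong.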

(4) RT monitoring

RT (response time) indicates the service duration, which reflects the current health of the service. It can be averaged per minute so that isolated slow calls caused by network problems or system load jitters do not trigger false alerts.

(5) Dashboard

All monitoring conditions for P-level faults need to be configured on the monitoring dashboard, which allows you to view all business monitoring metrics of the system and quickly locate specific problems.

(6) System exception monitoring

Monitoring for null pointer (NP) errors, out-of-bounds errors, and type conversion errors can be configured for all systems. Such runtime errors are always caused by system bugs and are most likely introduced by releases. By configuring this monitoring, you can detect release issues in a timely manner.

3.10 Grayscale

(1) The system has multiple state dependencies on the service that cannot be distributed for release

One link depends on the state value (database, Tair, memory, or another storage) in the system twice. When the system is being released, the two dependencies result in accessing two different state values. As a result, the inconsistent states block the system link. In many offline job scenarios, the grayscale granularity is business-based instead of traffic-based. Therefore, it is better to use the business unit grayscale in some scenarios.

(2) Evaluate traffic before switching

Before a large-scale traffic switch, it is necessary to evaluate the traffic to be borne by the system. If about 10% of the traffic has already been switched, the remaining 90% can be estimated according to business characteristics. After evaluating the traffic, you can perform a stress test to evaluate whether the servers can handle the subsequent traffic.

(3) Grayscale expectations and solution rollback

A grayscale release should have business expectations and system expectations. Business expectations include whether the product process is fully followed and whether grayscale traffic covers all TC cases. System expectations include whether the number of system errors is consistent year-on-year and period-over-period, whether new exceptions occur, and whether null pointer (NP) errors occur.

If the grayscale release does not meet your expectations, roll back immediately. Do not stop to consider causes or consequences. You can analyze them after rolling back. For example, during the pre-scheduled grayscale period, Freshippo was suspended and a site was launched after several layers of approval. The service performed normally for the first 10 minutes after the grayscale release, but then the system showed an error log that had not appeared before, and a pre-scheduled batch was not assigned. We rolled back the system immediately. After analyzing the problem for a whole day (it occurred only during non-peak hours), we found that it was caused by an NP error due to a changed cached object.

3.11 Testing

(1) Pre-release dry-run verification

Pre-create test site traffic to verify the feasibility of the function. The pre-release dry run should be completed in the early stage of testing and verification. Create multiple cases according to TC to verify whether it meets expectations.

(2) Verify pre-release traffic

Pre-release pulls traffic online to verify product logic and verify system functions with high traffic. For such a long link as intelligent scheduling, we can use traffic pull to verify some scenarios where data is difficult to create and other scenarios beyond TC.

(3) Comparison of online traffic pull

You can run the new scheduling and the old scheduling for an online site separately, compare the final assignment results of the business, analyze behavior differences from the logs, and determine whether a difference is a normal latency difference or a system bug. You can also use statistical methods to compare and analyze core system metrics.
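Comparing the assignment results of two scheduler runs can be sketched as a simple diff. The data shape, a mapping from ticket ID to deliveryman ID, is an assumption for illustration.

```python
def diff_assignments(old: dict, new: dict) -> dict:
    """Compare ticket -> deliveryman assignments from two scheduler runs."""
    shared = sorted(old.keys() & new.keys())
    return {
        "only_old": sorted(old.keys() - new.keys()),
        "only_new": sorted(new.keys() - old.keys()),
        "changed": {t: (old[t], new[t]) for t in shared if old[t] != new[t]},
    }
```

The `changed` entries are the ones to trace in the logs, because each of them is either an expected difference or a bug in the new scheduling.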

3.12 Emergency Response

There is no perfectly designed system, and bugs always exist. So how can we handle online problems in an emergency? To avoid scrambling when online problems occur, we need to "repair the roof when it is sunny outside". Repairs include not only the aforementioned architecture management and monitoring enhancement but also emergency plan development and fault drilling.

(1) Develop an emergency plan

The emergency plan includes system switches, throttling tools, and degradation plans, as well as emergency response measures for services. These enable the proper running of operations even if the system cannot be restored shortly. For example, for delivery ticket jobs, the receipt and delivery of items can be completed through tickets when item collection is unavailable in the delivery system.

The system emergency plan must be verified in advance. After the plan is launched, its effectiveness must be verified. In addition, ensure that the plan can be pushed with one click and can be accessed from the pre-planned platform with one click. Also, conduct training to teach the plan, and ensure that each trainee understands how and when to exercise it.

(2) Fault drills

To avoid confusion when faults occur, improve decision-making efficiency, and accelerate problem location, it is necessary to hold fault drills during normal times. Fault drills should be treated as online faults and initiated by test personnel in a safe production environment. Fault drills must be authentic and kept confidential. We recommend that test and development personnel hold disaster drills in "red versus blue" mode.

To improve development and emergency O&M capabilities, a set of fault response mechanisms is required. The Freshippo distribution team has developed a set of emergency response methods from the perspectives of organization, discovery, decision-making, and aftermath. When a fault arises, ask the team leaders and team members to stop their work at hand and troubleshoot the fault together. To do this, first establish a communication group, and then identify the problem from the dashboard. Next, implement the corresponding plan based on the decision for the problem. When the alert is cleared, evaluate the range of the impact to see if any problems are left over from the fault.

(3) Troubleshooting

After multiple fault drills, the team's emergency O&M capability can be effectively built. When a fault occurs, do not troubleshoot the fault alone, but take recovery actions as quickly as possible. More than 80% of faults are caused by changes, such as releases, switch push, configuration addition, and upstream or downstream releases. A small portion of the faults are caused by increased traffic and other contextual changes. If a fault occurs when a change is made, roll back the system immediately. For other reasons, you can determine whether to push the pre-plan, restart the system, or scale up the system based on the actual symptom. If such measures are unavailable, the division of labor and cooperation are required to organize personnel to communicate with each other, troubleshoot logs, check databases, inspect service traffic, and analyze link traces in order to locate the problem as quickly as possible.

4. Summary

In the continuous development of intelligent scheduling, we are also running ongoing projects, such as strategic operations, intelligent troubleshooting, and simulations. We will continue to address new stability challenges and explore the best practices for achieving stability in different environments.

While continuing to wage war against the worldwide outbreak, Alibaba Cloud will play its part and will do all it can to help others in their battles with the coronavirus. Learn how we can support your business continuity at https://www.alibabacloud.com/campaign/supports-your-business-anytime