Computing resources generally refer to the software and hardware resources used to execute computing tasks, including CPUs, GPUs, memory, operating systems, and the software and hardware environments required by specific workloads. Their main function is to execute computing tasks of all kinds: data processing, algorithmic calculations, and business logic. The performance and capacity of computing resources directly determine the system's computing capability and response speed, and therefore its quality of service. The following are detailed descriptions of five computing resource risks and the corresponding fault tolerance strategies.
Uneven Resource Allocation
Uneven resource allocation refers to the situation where some nodes carry a heavy load, due to defects in task allocation policies, long-lived connections, and similar issues, while other nodes are under little pressure. In addition, different nodes in a distributed system may compete for limited computing resources: if one node occupies an excessive share, other nodes may not obtain enough resources, resulting in reduced performance and task delays. Uneven resource allocation can lead to degraded system performance, increased task latency, and wasted resources. Common fault tolerance strategies include:
Load balancing: Use a reasonable load balancing algorithm to distribute requests or tasks evenly across nodes. This keeps computing resources fully utilized and improves the system's performance.
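One simple load balancing algorithm is least-connections: route each new request to the node with the fewest active connections. Below is a minimal sketch; the class, node names, and acquire/release protocol are all illustrative, and real deployments typically rely on an off-the-shelf load balancer rather than hand-rolled code.

```python
class LeastConnectionsBalancer:
    """Route each request to the node currently holding the fewest
    active connections, so load spreads across the pool."""

    def __init__(self, nodes):
        self.active = {node: 0 for node in nodes}

    def acquire(self):
        # Pick the least-loaded node for the next request.
        node = min(self.active, key=self.active.get)
        self.active[node] += 1
        return node

    def release(self, node):
        # Call when the request finishes so the counts stay accurate.
        self.active[node] -= 1

balancer = LeastConnectionsBalancer(["node-a", "node-b", "node-c"])
first = balancer.acquire()   # all counts equal, so any node may be chosen
second = balancer.acquire()  # a different node, since `first` is now busier
```

Round-robin is even simpler but ignores how long requests actually take; least-connections adapts when some requests are slow.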
Resource scheduling: Dynamically adjust resource allocation based on the system's load and resource utilization. When a node is overloaded, some tasks or data can be migrated to other nodes to balance resource utilization.
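The migration idea can be sketched as a rebalancing loop that repeatedly moves one unit of work from the hottest node to the coldest one until the imbalance falls below a threshold. The load model (a single integer of "work units" per node) and the `max_ratio` parameter are simplifications for illustration only.

```python
def rebalance(loads, max_ratio=1.5):
    """Move one work unit at a time from the most-loaded node to the
    least-loaded node until the hot/cold ratio drops to `max_ratio`.
    `loads` maps node name -> number of work units (a simplified model)."""
    loads = dict(loads)       # work on a copy
    migrations = []           # record each (source, destination) move
    while True:
        hot = max(loads, key=loads.get)
        cold = min(loads, key=loads.get)
        # Stop once the cluster is balanced enough, or cannot improve further.
        if loads[hot] <= max_ratio * max(loads[cold], 1) or loads[hot] - loads[cold] <= 1:
            break
        loads[hot] -= 1
        loads[cold] += 1
        migrations.append((hot, cold))
    return loads, migrations

balanced, moves = rebalance({"node-a": 10, "node-b": 2, "node-c": 3})
```

In practice the "work units" would be tasks or data shards, and each move has a real migration cost, which is why schedulers use a threshold rather than chasing perfect balance.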
Insufficient Resource Capacity
Insufficient CPU or memory resources may result in increased task delays, slow system response, or even failure to execute tasks. Common fault tolerance strategies include:
Elastic scaling: Use the elastic scaling capabilities provided by the cloud platform to increase computing capacity, either by adding new nodes (scaling out) or by upgrading existing node instances (scaling up).
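The core of most horizontal autoscalers is a proportional rule: choose the replica count that would bring average utilization back to a target. The sketch below shows that rule in isolation; the function name, defaults, and bounds are illustrative and not tied to any particular platform's API.

```python
import math

def desired_replicas(current, cpu_utilization, target=0.5, min_n=1, max_n=10):
    """Proportional scaling rule: pick the replica count that would pull
    average CPU utilization back toward `target`, clamped to [min_n, max_n].
    Parameter names and defaults are illustrative."""
    wanted = math.ceil(current * cpu_utilization / target)
    return max(min_n, min(max_n, wanted))

desired_replicas(4, 0.75)  # overloaded at 75% CPU -> scale out to 6
desired_replicas(4, 0.25)  # idling at 25% CPU -> scale in to 2
```

Real autoscalers add cooldown periods and tolerance bands around the target so that noisy metrics do not cause constant scaling churn.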
Resource scheduling: As described above.
Abnormal Task Interruption
Abnormal task interruption refers to situations where tasks cannot be completed or are interrupted during the execution of computing tasks due to various reasons, resulting in task failures. Common fault tolerance strategies include:
Checkpoint and recovery: Create checkpoints periodically during task execution to save the intermediate results and status of the task. When a task is interrupted, the task can be resumed by loading the checkpoint, avoiding the need to re-execute the entire task.
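A minimal file-based version of checkpoint and recovery looks like the sketch below: the task periodically writes its position and partial result to a JSON file, and on restart it resumes from that file instead of from the beginning. The file name, checkpoint interval, and state layout are all hypothetical.

```python
import json
import os

CHECKPOINT = "task.ckpt"  # hypothetical checkpoint file name

def run_task(items, every=100):
    """Sum `items`, checkpointing every `every` steps; after an
    interruption, a rerun resumes from the last saved position."""
    start, total = 0, 0
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            state = json.load(f)
        start, total = state["index"], state["total"]  # resume mid-task
    for i in range(start, len(items)):
        total += items[i]
        if (i + 1) % every == 0:
            with open(CHECKPOINT, "w") as f:
                json.dump({"index": i + 1, "total": total}, f)
    if os.path.exists(CHECKPOINT):
        os.remove(CHECKPOINT)  # task finished; discard the checkpoint
    return total
```

The checkpoint interval is a trade-off: frequent checkpoints bound the amount of recomputation after a failure but add I/O overhead during normal execution.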
Monitoring and automatic retry: Monitor the status and progress of tasks regularly. Once a task interruption or exception is detected, automatic failure retries can be performed. This can be implemented using monitoring tools and task management systems.
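Automatic retry is often packaged as a decorator around the task function, as in this sketch; the attempt count and delay are illustrative, and a real system would add exponential backoff and alerting when retries are exhausted.

```python
import functools
import time

def retry(times=3, delay=0.01):
    """Re-run the wrapped task on failure, up to `times` attempts,
    sleeping `delay` seconds between tries (values are illustrative)."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, times + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == times:
                        raise  # retries exhausted: surface the failure
                    time.sleep(delay)
        return wrapper
    return decorator

calls = {"n": 0}

@retry(times=3)
def flaky_task():
    # Simulated task that fails twice, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "done"
```

Note that retries only help with transient faults; combining retry with the idempotence techniques discussed later prevents a retried task from corrupting state.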
Task splitting and parallel computing: Split the computing task into smaller subtasks and distribute them to different computing nodes for execution. Even if one node fails or is interrupted, other nodes can continue to execute the remaining subtasks, improving the fault tolerance and reliability of the task.
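Task splitting maps naturally onto a worker pool: slice the input into chunks, submit each chunk as an independent subtask, and combine the partial results. The sketch below uses Python's standard `concurrent.futures`; the chunk size and worker count are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    # Stand-in for real work performed on one subtask.
    return sum(chunk)

def parallel_sum(data, workers=4, chunk_size=25):
    """Split `data` into chunks and process them in a worker pool; if one
    chunk fails, it can be retried alone without redoing the others."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(process_chunk, c) for c in chunks]
        return sum(f.result() for f in futures)
```

In a distributed setting the "pool" would be a cluster of nodes, but the failure-isolation property is the same: the unit of retry shrinks from the whole job to a single chunk.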
Task Repetition
Task repetition refers to the situation where a computing task is executed multiple times due to various reasons, such as duplicate operations, message duplication, or scheduling duplication. Common fault tolerance strategies include:
Deduplication: Use unique identifiers to identify tasks and check whether a task already exists in the system. If the task already exists, return the existing result instead of executing it again.
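The deduplication check can be sketched as a cache keyed by task ID: a repeated submission returns the stored result instead of executing again. The class and in-memory dictionary here are illustrative; a production system would back this with a persistent store so deduplication survives restarts.

```python
class DedupExecutor:
    """Execute each task ID at most once; repeated submissions
    receive the cached result instead of a second execution."""

    def __init__(self):
        self.results = {}

    def submit(self, task_id, fn, *args):
        if task_id in self.results:
            return self.results[task_id]  # duplicate: reuse prior result
        result = fn(*args)
        self.results[task_id] = result
        return result

executor = DedupExecutor()
runs = []
executor.submit("job-1", lambda: runs.append("ran") or 42)
executor.submit("job-1", lambda: runs.append("ran") or 42)  # not re-run
```

The scheme depends on producers assigning stable, unique task IDs; if the same logical task arrives under two different IDs, deduplication cannot catch it.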
Idempotence: In task design, strive to maintain idempotence. That is, whether the task is executed once or multiple times, it does not have any additional impact on the system's state and results. This way, even if the task is executed repeatedly, it will not cause errors in the system's state and results.
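The contrast between an idempotent and a non-idempotent operation is easiest to see side by side. In this illustrative example, setting a balance to an absolute value tolerates replays, while applying a relative increment does not.

```python
balances = {"acct-1": 100}  # hypothetical account state

def set_balance(account, value):
    """Idempotent: repeating the operation leaves the same final state."""
    balances[account] = value

def add_to_balance(account, delta):
    """Not idempotent: each replay changes the state again."""
    balances[account] += delta

set_balance("acct-1", 150)
set_balance("acct-1", 150)  # replayed message: state is still 150
```

A common design trick is to rewrite increments as absolute writes (or attach a unique operation ID and ignore repeats), which converts a non-idempotent operation into an idempotent one.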
Task Blocking and Accumulation
Task blocking and accumulation refer to situations where one or more tasks take a long time to execute or are blocked, causing other tasks to be unable to run in a timely manner, resulting in task accumulation in the system and affecting overall performance and response time. Common fault tolerance strategies include:
Timeout mechanism: Set a reasonable execution time limit for each task. Once the task exceeds the threshold, the system can abort or cancel the task and mark it as abnormal or failed. This mechanism can prevent one blocked task from delaying the entire system's execution.
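A timeout wrapper can be built on Python's standard `concurrent.futures`, as in this sketch: the task runs in a worker, and the caller gives up after the deadline and marks it failed. The status tuple is an illustrative convention; note that for threads, cancellation is best effort, since a running thread cannot be forcibly killed.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def run_with_timeout(fn, timeout, *args):
    """Run `fn`, but give up after `timeout` seconds and mark the task
    failed instead of letting it block everything queued behind it."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fn, *args)
        try:
            return ("ok", future.result(timeout=timeout))
        except TimeoutError:
            future.cancel()  # best effort; a running thread may still finish
            return ("timed_out", None)
```

In process- or node-based systems the abort can be a hard kill, which makes the timeout guarantee stronger than it can be for threads.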
Asynchronous execution: Process time-consuming tasks asynchronously to fully utilize computing resources, accelerate task processing, and avoid blocking the entire task execution flow.
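For I/O-bound tasks, asynchronous execution means total wall time tracks the slowest task rather than the sum of all of them. The sketch below uses Python's `asyncio`; the delays and task IDs are illustrative.

```python
import asyncio

async def slow_io(task_id, delay):
    # Simulated I/O-bound work (network call, disk read, etc.).
    await asyncio.sleep(delay)
    return task_id

async def main():
    # Run all tasks concurrently instead of one after another.
    return await asyncio.gather(*(slow_io(i, 0.05) for i in range(5)))

results = asyncio.run(main())  # finishes in ~0.05s, not ~0.25s
```

For CPU-bound work the same idea applies, but the tasks go to a process pool or a background worker queue rather than an event loop.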
Priority: Assign tasks different priorities based on their importance and urgency, and schedule high-priority tasks first so that they are not blocked or left to accumulate. This preserves the system's responsiveness and ensures critical tasks complete on time.
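Priority scheduling is commonly implemented with a heap-backed queue. The sketch below uses Python's standard `heapq`; the class name and the convention that lower numbers mean higher priority are illustrative choices.

```python
import heapq
import itertools

class PriorityTaskQueue:
    """Dispatch high-priority tasks first; a tie-breaking counter keeps
    same-priority tasks in submission (FIFO) order."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def push(self, priority, task):
        # Lower number = higher priority (heapq is a min-heap).
        heapq.heappush(self._heap, (priority, next(self._counter), task))

    def pop(self):
        _priority, _order, task = heapq.heappop(self._heap)
        return task

q = PriorityTaskQueue()
q.push(5, "batch report")
q.push(1, "health probe")
q.push(3, "user request")
first = q.pop()  # "health probe" runs first (priority 1)
```

To avoid starving low-priority tasks entirely, schedulers often pair this with aging, gradually raising the priority of tasks that have waited too long.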
Task splitting and parallel computing: As described above, split the computing task into smaller subtasks and distribute them to different computing nodes for execution.
In addition to the above five points, common risk points for computing resources include "mutual influence between resources," "resource node crashes," "dependent service exceptions," "unresponsive service processes," "data format exceptions," and "certificate expiry." To handle these risk points, fault tolerance strategies such as resource isolation, quota control, multi-replica redundancy, service degradation, circuit breaking, heartbeat reporting, active probing, data verification, and automatic replacement can be used.