Monitoring, diagnosis, and troubleshooting - Object Storage Service

Compared with traditional applications, cloud applications cost less in infrastructure investment and O&M. However, it is more difficult to monitor the running status and performance of cloud applications, locate faults, and troubleshoot faults. To solve this issue, Object Storage Service (OSS) provides the monitoring service and logging feature for you to monitor the performance of your application and locate faults.

This topic describes how to use the OSS monitoring service, the logging feature, and third-party tools to monitor, diagnose, and troubleshoot issues when you use OSS to store the data of your business. The OSS monitoring service serves the following purposes:

Monitor the running status and performance of OSS in real time and send alert notifications.
Provide effective methods and tools to locate issues.
Solve issues based on relevant guides.

This topic includes the following sections:

Monitoring service: describes how to use the OSS monitoring service to monitor the running status and performance of OSS.
Tracking and diagnosis: describes how to use the OSS monitoring service and logging feature to track and diagnose faults.
Troubleshooting: describes common issues and solutions to these issues.

Monitoring service

Monitor the overall status
- Availability/Valid Request Proportion by User (%)
  Availability/Valid Request Proportion by User is the most important metric used to indicate OSS stability and whether OSS is correctly used. A percentage less than 100% indicates that some requests fail.
  A percentage less than 100% may be caused by OSS optimization such as partition migration for load balancing. In this case, OSS SDKs provide relevant retry mechanisms to handle error requests due to this temporary optimization. This way, your business is not affected.
  To determine the types and causes of error requests and troubleshoot the errors, analyze the requests based on request status details and request status distribution in the Cloud Monitor console for OSS.
  In addition, it is expected that valid request proportion may be less than 100% in some business scenarios. For example, you need to send a request to check whether an object exists before you manage the object in some scenarios. If the object does not exist, a 404 error is returned for the request, which results in a valid request proportion lower than 100%. In this case, the error is returned as expected and does not indicate an actual issue.
  For businesses that require high availability of OSS, you can configure an alert rule that is triggered when the metric falls below an expected threshold value.
- Number of Total/Valid Requests by User
  The Number of Total/Valid Requests by User metric shows the running status of OSS in terms of request count. If the number of valid requests is smaller than the number of total requests, some requests have failed.
  To follow the fluctuations of the number of valid requests, especially spikes and dips, and analyze the causes, you can configure alert rules to receive timely notifications. For more information, see Alert service.
- Request Status Distribution by User
  When the availability/valid request proportion is less than 100%, or the number of valid requests is smaller than the number of total requests, you can view Request Status Distribution by User to determine the type of error requests. For more information about Request Status Distribution by User, see Metrics.
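As a rough illustration of how the valid request proportion relates to request status codes, the following sketch computes the proportion from a list of HTTP status codes and treats 2xx and 3xx responses as valid, as defined in the troubleshooting section of this topic. The function names and the 99.9% threshold are assumptions made for this example and are not part of the OSS monitoring service.

```python
from typing import Iterable


def valid_request_proportion(status_codes: Iterable[int]) -> float:
    """Return the percentage of requests whose status code is 2xx or 3xx."""
    codes = list(status_codes)
    if not codes:
        return 100.0  # no requests, so nothing failed
    valid = sum(1 for code in codes if 200 <= code < 400)
    return 100.0 * valid / len(codes)


def should_alert(status_codes: Iterable[int], threshold: float = 99.9) -> bool:
    """Hypothetical alert check: True when the proportion is below the threshold."""
    return valid_request_proportion(status_codes) < threshold


# An existence check that returns 404 lowers the proportion even though
# the application behaves as expected.
print(valid_request_proportion([200, 200, 404, 200]))  # 75.0
```

A real alert rule is configured in the monitoring service itself. The sketch only shows why an expected 404 still lowers the metric.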
Monitor request status
Request Status Details builds on the request status distribution and provides more detailed monitoring data for each type of request.
Monitor performance
The monitoring service provides the following metrics to help you monitor the performance of OSS:
- Average Latency, which includes average E2E latency and average server latency
  The latency metrics indicate the average and maximum time consumed to process each type of request sent by calling API operations. The E2E latency metric indicates the end-to-end time of a request between the client and the server. It includes the time used to read, process, and respond to the request, plus the network latency incurred in this process. The server latency is the time used to process the request on the server side and does not include the network latency between the server and the client. Therefore, when the E2E latency spikes while the server latency remains stable, it is reasonable to conclude that the increase is caused by poor network conditions rather than an OSS failure.
- Maximum Latency, which includes maximum E2E latency and maximum server latency
- Successful Request Category
- Traffic by User
  The Traffic by User metric indicates the traffic used by requests over the Internet, requests over the internal network, Content Delivery Network (CDN) back-to-origin, and cross-region replication (CRR). The monitoring service provides a metric of traffic by user and a metric of traffic by bucket.
Except for traffic, OSS monitors the preceding metrics by request type, for requests sent by calling the following API operations:
- GetObject
- HeadObject
- PutObject
- PostObject
- AppendObject
- UploadPart
- UploadPartCopy
In addition, successful request category also includes the requests sent by calling the following API operations:
- DeleteObject
- DeleteObjects
For metrics that indicate the performance of OSS, you must focus on their abnormal fluctuations such as a spike in average latency and extended high latency for requests. You can configure alert rules for performance metrics so that a notification is sent when an alert rule is triggered.
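The reasoning about E2E latency and server latency can be sketched as a small comparison against baselines. This is a minimal sketch: the function name, the baseline inputs, and the `factor` multiplier that defines a spike are assumptions made for this example, and the latency samples are assumed to be exported from the monitoring charts by some other means.

```python
from statistics import mean
from typing import Sequence


def likely_latency_source(
    e2e_ms: Sequence[float],
    server_ms: Sequence[float],
    baseline_e2e_ms: float,
    baseline_server_ms: float,
    factor: float = 2.0,
) -> str:
    """Compare average latencies against baselines.

    `factor` is an assumed multiplier: an average above
    factor * baseline is treated as a spike.
    """
    e2e_high = mean(e2e_ms) > factor * baseline_e2e_ms
    server_high = mean(server_ms) > factor * baseline_server_ms
    if server_high:
        return "server"  # processing on the server side is slow
    if e2e_high:
        return "network-or-client"  # server is stable, so look outside it
    return "normal"


# E2E latency tripled while server latency stayed near its baseline.
print(likely_latency_source([300, 320], [20, 22], 100, 20))  # network-or-client
```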
Monitor billable items
The OSS monitoring service allows you to monitor the following billable items: storage usage, the number of PUT requests and GET requests, and outbound traffic over the Internet, which does not include CRR outbound traffic or CDN outbound traffic. The OSS monitoring service does not support alert settings or OpenAPI reads for billable items.
The OSS monitoring service collects monitoring data by bucket on an hourly basis. You can view the chart of a specific bucket to obtain the continuous data trend of the bucket. You can predict the trend of storage usage of your business and your future storage costs based on the monitoring graph.
The OSS monitoring service provides monthly resource usage statistics by user and by bucket. The OSS resource usage of an Alibaba Cloud account or a specific bucket is calculated and updated every hour. This way, you can estimate your storage costs for the current month.
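One simple way to project storage usage from hourly samples is a least-squares straight line. The following sketch assumes that you have copied the hourly storage values of a bucket into a list. The function name is hypothetical, and a linear trend is only one possible model that may not fit bursty workloads.

```python
from typing import Sequence


def forecast_storage(hourly_bytes: Sequence[float], hours_ahead: int) -> float:
    """Fit y = intercept + slope * hour by least squares and extrapolate."""
    n = len(hourly_bytes)
    if n < 2:
        raise ValueError("need at least two hourly samples")
    x_mean = (n - 1) / 2
    y_mean = sum(hourly_bytes) / n
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(hourly_bytes))
    denominator = sum((x - x_mean) ** 2 for x in range(n))
    slope = numerator / denominator
    intercept = y_mean - slope * x_mean
    # The last sample is at hour n - 1, so project from there.
    return intercept + slope * (n - 1 + hours_ahead)


# Usage grows by 10 units per hour, so two hours after the last sample
# the projection is 150.
print(forecast_storage([100, 110, 120, 130], hours_ahead=2))
```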
For more information about the billable items and billing methods of OSS, see Overview.
Note
The statistics provided by the monitoring service may be different from actual usage in bills. You are charged based on the actual usage in the bills provided by Billing Management.

Tracking and diagnosis

Problem diagnosis
- Performance diagnosis
  First, determine the performance baseline of your application based on your business requirements. Note that a problem with a request sent from the client may be caused by one or more elements along the transmission link of the request, such as OSS overload, the TCP configuration of the client, or a traffic bottleneck in the network.
  You can locate the possible issue based on the performance metrics provided by the monitoring service, and then query the logs for detailed information to diagnose the fault further.
- Error diagnosis
  The client application receives an error message from the server when a request error occurs. The monitoring service records and displays the counts and proportions of different error requests. You can check logs of the server, the client, and the network conditions to obtain the details about a specific request. Generally, the HTTP status code, the error code, and the error message in the response indicate the possible cause of an error request.
  For more information about error codes, see Error responses.
- Use the logging feature to diagnose
  OSS provides the logging feature for requests from users, which can track the details about requests throughout the transmission between the client application and the server.
  For more information about how to enable logging and use this feature, see Configure logging.
  For more information about the naming conventions and formats of OSS logs, see Logging.
- Use the logs of network conditions to diagnose
  Generally, you can determine the cause of a fault based on logs of the server and the client application. However, in some cases, you also need network logs to obtain information about network conditions between the server and the client, such as the traffic and data transmitted in this process. For example, an error is reported for a user request, but no logs are generated for the request on the application server. In this case, check OSS logs to determine whether the cause is on the client side, or use network monitoring tools to check the network conditions.
  Wireshark is one of the most common tools for capturing and analyzing network traffic. This free tool works at the packet level. You can use it to inspect the details of packets transmitted over a wide range of protocols so that you can determine whether failures are caused by packet loss or network connection problems.
  For more information about how to use Wireshark, visit Wireshark User’s Guide.
E2E tracking and diagnosis
In a normal process, a request is sent from the client to OSS, and the OSS server processes the request and sends a response to the client. You can track the whole process between the client and the server. You can use logs of the client application, networks, and the server to locate a potential fault.
OSS assigns every received request a unique request ID, which can be used as the identifier to correlate the logs generated for a request. In addition, by using the timestamp of a log, you can obtain information about other events that occurred on the client, in networks, and on the application server while a request was processed. This helps you analyze and investigate the cause of a fault.
- RequestID
  In different logs, the request ID of a request is contained in different fields.
  - In OSS logs, the Request ID is in the Request ID field.
  - When you track network data, such as data flows captured by Wireshark, the Request ID is included in the response as the standard HTTP header x-oss-request-id.
  - In the client application, the Request ID appears in the client logs only if the application code writes it there. The latest version of OSS SDK for Java supports returning the Request ID of a request in the response. You can use the getRequestId method to obtain the Request ID from the response. Each version of OSS SDK for various programming languages supports returning the Request ID of a failed request. You can obtain that Request ID by calling the getRequestId method of the OSSException class.
- Use the timestamp of logs
  You can use the timestamp of logs to query logs. Note that the client and the application server may record the same event at different points in time. Therefore, the timestamps of OSS logs and client logs for an event may differ. When you query server logs based on a client-side timestamp, search a window from 15 minutes before to 15 minutes after that timestamp.
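The correlation described above can be sketched as follows: read the request ID from the x-oss-request-id response header, then narrow the server logs to a 15-minute window around the client timestamp. The `ServerLogEntry` record and both function names are hypothetical. The sketch assumes that the server logs have already been parsed into records and does not depend on the actual OSS log format.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Mapping, NamedTuple, Optional


class ServerLogEntry(NamedTuple):
    """Hypothetical, already-parsed server log record."""
    time: datetime
    request_id: str


def request_id_from_headers(headers: Mapping[str, str]) -> Optional[str]:
    """Return the x-oss-request-id value; header names are case-insensitive."""
    for name, value in headers.items():
        if name.lower() == "x-oss-request-id":
            return value
    return None


def find_server_entries(
    entries: Iterable[ServerLogEntry],
    client_time: datetime,
    request_id: Optional[str] = None,
    window: timedelta = timedelta(minutes=15),
) -> List[ServerLogEntry]:
    """Keep entries within +/- window of the client timestamp.

    If a request ID is given, keep only entries with that ID.
    """
    matches = []
    for entry in entries:
        if abs(entry.time - client_time) > window:
            continue
        if request_id is not None and entry.request_id != request_id:
            continue
        matches.append(entry)
    return matches
```

Filtering by time first keeps the search small when the request ID is not yet known, for example when the client log did not record it.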

Troubleshooting

Performance-related issues
- High average E2E latency and low average server latency
  The following possible causes are inferred from the preceding description of the average E2E latency and the average server latency:
  - Slow response of client applications
    - The available connections and threads are limited.
      For limited available connections, you can run relevant commands to check whether a large number of connections are in the TIME_WAIT state. If so, you can adjust kernel parameters to handle this issue.
      For limited threads, check whether bottlenecks exist in resources such as the client CPU, memory, and network. If not, you can increase the number of concurrent threads as appropriate.
      If the issue persists, you must optimize the code of the application. For example, you can change the application to access OSS asynchronously. You can also use performance profiling to identify hot spots in the client application and then optimize them.
    - Limited resources of CPU, memory, or bandwidth
      In this case, use the monitoring features of the relevant systems to identify the resource bottleneck. Then, optimize the code of the application to reduce its resource usage, or scale up the client resources, for example, by adding CPU cores or memory.
  - Poor network conditions
    Generally, high average E2E latency is caused by temporarily poor network conditions. You can use Wireshark to analyze intermittent and persistent network issues such as packet loss.
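To check for connections in the TIME_WAIT state, one option is to count TCP states in the output of a command such as `netstat -ant`. The following sketch parses text that you have already captured. It assumes the common Linux column layout in which the state is the sixth field, so the index may need to be adjusted on other systems. The function name and the sample addresses are made up for this example.

```python
from collections import Counter


def count_tcp_states(netstat_output: str) -> Counter:
    """Count connection states in `netstat -ant`-style output.

    Assumes lines of the form:
    Proto Recv-Q Send-Q Local-Address Foreign-Address State
    """
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers and anything that is not a TCP connection line.
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            states[fields[5]] += 1
    return states


SAMPLE = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 10.0.0.5:52344          203.0.113.10:443        TIME_WAIT
tcp        0      0 10.0.0.5:52346          203.0.113.10:443        TIME_WAIT
tcp        0      0 10.0.0.5:52350          203.0.113.10:443        ESTABLISHED
"""

print(count_tcp_states(SAMPLE)["TIME_WAIT"])  # 2
```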
- Low average E2E latency, low average server latency, and high client request latency
  In this case, the latency most likely occurs before requests arrive at the application server. Therefore, we recommend that you analyze why requests from the client are delayed before they reach the server.
  The following scenarios may cause high client request latency:
  - The available connections and threads are limited.
    - For limited available connections, you can run relevant commands to check whether a large number of connections are in the TIME_WAIT state. If so, you can adjust kernel parameters to handle this issue.
    - For limited threads, check whether bottlenecks exist in the client CPU, memory, and network resources. If not, you can increase the number of concurrent threads as appropriate.
    - If the issue persists, you must optimize the code of the application, for example, by accessing OSS asynchronously. You can also use performance profiling to identify hot spots in the client application and then optimize them.
  - The client retries requests multiple times. In this case, you must analyze the cause based on the retry information. You can determine whether retries occur in the client by using the following methods:
    - Check the client logs to view whether retries have occurred. Take OSS SDK for Java as an example. Search for warn-level or info-level log entries. If entries similar to the following are recorded, retries may have occurred on the client or on the server.
```
[Server]Unable to execute HTTP request:
[Client]Unable to execute HTTP request:
```
    - Take OSS SDK for Java as an example. You can query the following log if the level of the client log is debug. If similar logs are recorded, retries must have occurred.
```
Retrying on
```
  If the client has no faults, consider potential network issues such as packet loss. You can use tools such as Wireshark to analyze the cause of network issues.
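Scanning client logs for the retry messages shown above can be automated with a few lines of code. This sketch searches for the two marker strings quoted in this section. The function name, the category names, and the surrounding text in the sample log lines are assumptions, because the full log line layout depends on your logging configuration.

```python
from typing import Dict, Iterable

# Marker substrings taken from the log messages quoted in this section.
RETRY_MARKERS = {
    "possible_retry": "Unable to execute HTTP request:",  # warn or info level
    "confirmed_retry": "Retrying on",                     # debug level
}


def count_retry_markers(log_lines: Iterable[str]) -> Dict[str, int]:
    """Count log lines that contain each retry marker."""
    counts = {name: 0 for name in RETRY_MARKERS}
    for line in log_lines:
        for name, marker in RETRY_MARKERS.items():
            if marker in line:
                counts[name] += 1
    return counts


# The timestamps and message tails below are made up for illustration.
sample_log = [
    "12:00:01 WARN [Client]Unable to execute HTTP request: connection reset",
    "12:00:01 DEBUG Retrying on java.net.SocketException",
    "12:00:02 INFO upload finished",
]
print(count_retry_markers(sample_log))
```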
- High average server latency
  For high average server latency of downloads and uploads, consider the following possible causes:
  - A large number of clients frequently access an object.
    In this case, you can view OSS logs to determine whether an object or a group of objects is frequently accessed.
    For the scenario of downloads, we recommend that you activate Content Delivery Network (CDN) for the bucket to improve the performance and reduce the generated traffic. For the scenario of uploads, we recommend that you modify the access control list (ACL) of the bucket or the object so that users cannot write data to the bucket or the object if this does not affect your business requirements.
  - Internal issues of OSS
    Internal issues of OSS may not be solvable by optimizing the code of the application. In this case, contact technical support and provide the client logs or the Request ID of the failed request in OSS logs.
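To see whether a small group of objects receives most of the requests, you can count accesses per object key. The sketch below assumes that the object keys have already been extracted from OSS logs into a list. The function name and the 50% share threshold are arbitrary choices made for this example.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def hot_objects(object_keys: Iterable[str], min_share: float = 0.5) -> List[Tuple[str, float]]:
    """Return (key, share) pairs for objects with at least min_share of all requests."""
    counts = Counter(object_keys)
    total = sum(counts.values())
    if total == 0:
        return []
    return [
        (key, n / total)
        for key, n in counts.most_common()
        if n / total >= min_share
    ]


# Three of four requests target the same object.
print(hot_objects(["a.jpg", "a.jpg", "a.jpg", "b.jpg"]))  # [('a.jpg', 0.75)]
```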
Errors of the application server
If the errors of the server increase, consider the following possible causes:
- Temporary increase
  In this case, adjust the retry policy of the client application and use a proper backoff mechanism such as exponential backoff. This way, you can prevent service unavailability caused by optimization, upgrades, and data migration that OSS performs for load balancing. Backoff also reduces the pressure on your business at peak times.
- Permanent increase
  If server errors remain high, contact technical support and provide the client logs or the Request ID of the failed request in OSS logs.
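Exponential backoff for temporary server errors can be sketched as follows. This is a generic illustration rather than the retry implementation of any OSS SDK. `TransientError` is a hypothetical exception that stands in for whatever your client raises on a retryable failure such as a 5xx response, and the delay values are arbitrary. The random jitter spreads retries from many clients so that they do not all arrive at the same moment.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for retryable failures, such as 5xx responses."""


def call_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.1,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, retrying on TransientError with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error
            # The upper bound doubles on each attempt: 0.1, 0.2, 0.4, ...
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))  # random jitter within the bound
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter exists only so that the waiting behavior can be replaced in tests.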
Network errors
A network error occurs when the server fails to respond to a request because of disconnection from the client or networks while the server processes the request. In this case, the HTTP status code 499 is recorded for the request. The status code 499 has the following possible causes:
- Before the server receives requests to read and write data, the server checks whether the connection is available. If not, the HTTP status code 499 is recorded for the request.
- The HTTP status code 499 is recorded for a request when the client closes the connection while the server is processing the request.
A network error occurs when the client cancels the request or loses the connection during a request process. If the client cancels the request, check the code of the application to determine why and when the client disconnects from OSS. If the client loses the connection, you can use tools such as Wireshark to analyze possible causes.
Client errors
- Increase of client authorization error requests
  When the number of client authorization error requests increases, or the client application receives a large number of responses with the HTTP status code 403, consider the following possible causes:
  - Invalid bucket domain name
    - If users use the third-level domain or the second-level domain to access the bucket, the region contained in the domain name may be different from the region in which the bucket is located. For example, the accessed bucket is located in the China (Hangzhou) region, but the accessed domain name is Bucket.oss-cn-shanghai.aliyuncs.com. In this case, you must confirm the region in which the bucket is located and correct the accessed domain name.
    - If you have enabled CDN, the origin mapped to CDN may be the wrong domain name of the bucket. In this case, check whether the origin is the third-level domain of the bucket.
    - If users access the bucket from JavaScript code running in a web browser and a 403 error is returned, the cause may be the CORS settings of the bucket. In this case, check the CORS settings of the bucket and correct the settings so that users can use the web browser to access the bucket. For more information about how to configure cross-origin resource sharing (CORS), see Configure CORS.
  - Access control
    - If you use the AccessKey pair of an Alibaba Cloud account to access the bucket, check the validity of your AccessKey pair.
    - If you use the AccessKey pair of a RAM user to access the bucket, check the validity or the authorization of the AccessKey pair.
    - If you use a token generated by Security Token Service (STS), check whether the temporary token expires. If the token has expired, apply for a new token.
    - If the access control list (ACL) is configured for the accessed bucket or object, check whether users are allowed to perform specific operations based on the ACL settings.
  - Expired URL
    If a 403 error occurs when a third party accesses the bucket by using a signed URL, the most likely cause is that the signed URL has expired.
  - A 403 error may occur when RAM users log on to OSS tools such as ossftp, ossbrowser, and the OSS console. In this case, check whether you enter the correct AccessKey pair or whether you have the permission to call the GetService operation if your account is a RAM user.
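Two of the checks above lend themselves to quick scripting: comparing the region in the bucket domain name with the region in which the bucket is located, and checking whether a signed URL has expired. The following sketch makes two assumptions that you should verify for your setup: the domain name follows the bucket.region-endpoint.aliyuncs.com pattern shown in the example above, and the signed URL carries an `Expires` query parameter that holds a Unix timestamp. Both function names are hypothetical.

```python
import time
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def region_from_host(host: str) -> Optional[str]:
    """Extract 'oss-cn-shanghai' from 'bucket.oss-cn-shanghai.aliyuncs.com'."""
    labels = host.lower().split(".")
    if len(labels) >= 4 and labels[-2:] == ["aliyuncs", "com"]:
        return labels[-3]
    return None  # custom domain or unexpected pattern


def signed_url_expired(url: str, now: Optional[float] = None) -> Optional[bool]:
    """Return True or False if an Expires timestamp is present, else None."""
    values = parse_qs(urlsplit(url).query).get("Expires")
    if not values or not values[0].isdigit():
        return None  # no Expires parameter found, cannot decide
    current = time.time() if now is None else now
    return current > int(values[0])


# The bucket is expected in oss-cn-hangzhou, but the domain points elsewhere.
host = "examplebucket.oss-cn-shanghai.aliyuncs.com"
print(region_from_host(host) == "oss-cn-hangzhou")  # False
```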
- Increase of 404 errors for client requests
  A 404 error for a client request indicates that the data that users access does not exist. When the number of 404 errors for client requests increases, consider the following possible causes:
  - The business logic of the application. For example, an application calls the doesObjectExist method provided by OSS SDK for Java to check whether an object exists before further actions. If the object does not exist, false is returned to the client and a 404 error is recorded on the server. Therefore, in this business scenario, a 404 status code does not indicate an error.
  - The accessed object is deleted by the client application or other processes. In this case, query OSS logs for the accessed object.
  - Repeated delete operations caused by network failure. For example, the client initiates an operation to delete an object. The request arrives at the server, and the object is deleted. However, the response does not arrive at the client due to network failure. As a result, the client sends another request to delete the object, and a 404 error occurs. In this case, you can query and view the client logs and OSS logs to determine the cause of the 404 error.
    - Query the client logs and check whether a repeated request is sent from the client.
    - Query OSS logs. Then, check whether two delete operations are initiated on the object and the HTTP status code of the first operation is 2xx.
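The check for repeated delete operations can be sketched as a single pass over log records in chronological order. The `LogRecord` type and the function name are hypothetical, and the sketch assumes that the logs have already been parsed into object key, operation name, and HTTP status code.

```python
from typing import Iterable, List, NamedTuple


class LogRecord(NamedTuple):
    """Hypothetical, already-parsed log record."""
    object_key: str
    operation: str
    status: int


def repeated_delete_keys(records: Iterable[LogRecord]) -> List[str]:
    """Find keys with a successful delete followed by a delete that got 404.

    Records must be in chronological order.
    """
    deleted = set()
    repeated: List[str] = []
    for record in records:
        ok = 200 <= record.status < 300
        if record.operation == "PutObject" and ok:
            deleted.discard(record.object_key)  # object exists again
        elif record.operation == "DeleteObject":
            if ok:
                deleted.add(record.object_key)
            elif (
                record.status == 404
                and record.object_key in deleted
                and record.object_key not in repeated
            ):
                repeated.append(record.object_key)
    return repeated
```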
- Low valid request proportion and high client error requests
  The valid request proportion is the proportion of successful requests, whose responses carry the HTTP status code 2xx or 3xx, to total requests. Requests whose responses carry the HTTP status code 4xx or 5xx are counted as failed requests and lower the valid request proportion. Client other error requests by user refer to error requests other than server error requests (5xx), network error requests (499), client authorization error requests (403), resource not found error requests (404), and client timeout error requests (408, or 400 with the OSS error code RequestTimeout).
  You can query OSS logs to determine the error type. Then, refer to OSS error codes and solve the errors by modifying the code of the application. For more information, see Error responses.
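The error categories described above can be expressed as a small classification function, which is convenient when you tally error types from logs. The mapping follows the status codes listed in this section. The function name and the category labels are assumptions made for this example.

```python
def classify_request(status: int, error_code: str = "") -> str:
    """Map an HTTP status code (and optional OSS error code) to a category."""
    if 200 <= status < 400:
        return "valid"
    if status >= 500:
        return "server error"
    if status == 499:
        return "network error"
    if status == 403:
        return "client authorization error"
    if status == 404:
        return "resource not found error"
    if status == 408 or (status == 400 and error_code == "RequestTimeout"):
        return "client timeout error"
    return "client other error"


print(classify_request(400, "RequestTimeout"))  # client timeout error
print(classify_request(400))                    # client other error
```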
Exceptional increase in storage usage
When your storage usage increases dramatically, the cause may be a failed cleanup operation. Troubleshoot the issue from the following aspects:
- If the client application uses specific processes to perform regular cleaning operations to release storage, find out the cause in the following steps:
  1. Check whether the valid request proportion has decreased, because failed cleaning operations lower the valid request proportion.
  2. Locate and determine the specific cause of the decrease in valid request proportion and the type of error requests. Then, you can obtain the details about errors from the client logs.
- If the client application relies on lifecycle rules to clean up bucket storage, use the OSS console or call API operations to check whether the lifecycle rules are correctly configured. If not, you can modify the lifecycle rules in the OSS console. To check whether the lifecycle rules were modified, query OSS logs. If the lifecycle rules are correctly configured but do not take effect, contact OSS technical support for help.
Other storage service issues
If the troubleshooting section in this topic does not solve your storage service issues, use the following methods to diagnose and troubleshoot your issues:
1. View the OSS monitoring data in the Cloud Monitor console and check whether the metrics have deviated from their baselines. This helps you determine whether the issue is temporary or permanent and which storage operations are affected.
2. Based on the monitoring information, query OSS logs to obtain all errors that occurred in the same period. This may help you locate and solve the issue.
3. If the logs of the OSS server cannot provide sufficient information for troubleshooting, use the client logs to investigate the client application, or use network tools such as Wireshark to investigate network failures.