Object Storage Service (OSS) provides monitoring metrics and a logging feature that help you observe how your programs behave, detect potential issues early, and locate the root causes of failures. This makes troubleshooting considerably more efficient.
This topic describes how to use the OSS monitoring service, the logging feature, and third-party tools to monitor, diagnose, and troubleshoot issues when you use OSS to store the data of your business. The OSS monitoring service serves the following purposes:
Monitor the running status and performance of OSS in real time and send alert notifications.
Provide effective methods and tools to identify issues.
Solve issues based on relevant guides.
This topic includes the following sections:
Monitoring service: describes how to use the OSS monitoring service to monitor the running status and performance of OSS.
Tracking and diagnosis: describes how to use the OSS monitoring service and the logging feature to track and diagnose issues.
Troubleshooting: describes common issues and the solutions to these issues.
Monitoring service
Monitor the overall status
Availability/Valid Request Rate
Availability/Valid Request Rate is the most important metric used to indicate OSS stability and OSS usage. A percentage less than 100% indicates that specific requests fail.
A percentage less than 100% may be caused by OSS optimization such as partition migration for load balancing. OSS SDKs provide retry mechanisms that handle requests which fail because of such temporary optimization, so your business is not affected.
To determine the types and causes of error requests and troubleshoot the issues, analyze the requests based on the request status details and request status distribution in the Cloud Monitor console.
In addition, a valid request rate of less than 100% is expected in specific business scenarios. For example, your application may send a request to check whether an object exists before it manages the object. If the object does not exist, HTTP status code 404 is returned to the client, which results in a valid request rate lower than 100%.
For business that requires high availability of OSS, you can configure an alert rule that is triggered when the metric falls below the threshold value.
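As a rough illustration of how this metric behaves, the following minimal sketch computes a valid request rate from a count of responses per HTTP status code. It assumes that valid requests are those answered with 2xx or 3xx, as defined in the troubleshooting section of this topic. The function name, the sample counts, and the choice to report 100% when there are no requests are assumptions made for the example, not part of the OSS monitoring service.

```python
def valid_request_rate(status_counts):
    """Return the share of 2xx/3xx responses as a percentage of all requests.

    status_counts maps an HTTP status code to the number of responses
    observed with that code (hypothetical input format).
    """
    total = sum(status_counts.values())
    if total == 0:
        # Assumption: with no requests, nothing failed, so report 100%.
        return 100.0
    valid = sum(n for code, n in status_counts.items() if 200 <= code < 400)
    return 100.0 * valid / total


# Made-up sample: the 404 responses could come from object existence checks.
sample = {200: 950, 206: 20, 404: 25, 503: 5}
print(f"valid request rate: {valid_request_rate(sample):.2f}%")
```

In this sample, the 404 responses lower the rate even though they may be an expected part of the business logic.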
Total Number of Requests/Number of Valid Requests
The Total Number of Requests/Number of Valid Requests metric shows the running status of OSS in the aspect of the request number. If the number of valid requests is smaller than that of total requests, it indicates that specific requests fail.
To monitor the fluctuations of the number of valid requests, especially spikes and dips, and analyze the causes, you can configure alert rules to receive notifications at the earliest opportunity. For more information, see Use the alert service.
Request Status Distribution
When the availability/valid request rate is less than 100%, or the number of valid requests is smaller than the total number of requests, you can view Request Status Distribution to determine the type of the error requests. For more information about Request Status Distribution, see Metrics.
Monitor request status
Request Status Details monitors different types of requests and provides more details about the monitoring based on the request status distribution.
Monitor performance
The monitoring service provides the following metrics to help you monitor the performance of OSS:
Average Latency, which includes average E2E latency and average server latency
The latency metrics indicate the average and maximum amount of time required to process a specific type of request generated by calling API operations. The E2E latency is the total time that a request takes between the client and the server. It includes the time required to process, read, and respond to the request as well as the network latency in this process. The server latency is the time required to process the request on the server side and does not include the network latency between the server and the client. Therefore, if the E2E latency increases while the server latency remains stable, the increase is most likely caused by poor network conditions rather than by an OSS failure.
Maximum Latency, which includes maximum E2E latency and maximum server latency
Successful Request Category
Traffic
The Traffic metric indicates the traffic used by requests over the Internet, requests over the internal network, Content Delivery Network (CDN) back-to-origin requests, and cross-region replication (CRR) tasks. The monitoring service provides a metric for traffic by user and a metric for traffic by bucket.
OSS monitors the preceding metrics, except for traffic, by type of requests that are sent by calling the following API operations:
GetObject
HeadObject
PutObject
PostObject
AppendObject
UploadPart
UploadPartCopy
In addition, successful request category also includes the requests sent by calling the following API operations:
DeleteObject
DeleteObjects
For metrics that indicate the performance of OSS, you must focus on their abnormal fluctuations, such as a spike in average latency and extended high latency for requests. You can configure alert rules for performance metrics so that a notification is sent when an alert rule is triggered.
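The relationship between E2E latency and server latency described above can be turned into a simple check. The following sketch compares current values against a baseline and reports where the increase most likely originates. The function name, the millisecond inputs, and the 20% tolerance are assumptions chosen for illustration.

```python
def diagnose_latency(e2e_ms, server_ms, baseline_e2e_ms, baseline_server_ms,
                     tolerance=0.2):
    """Classify a latency reading relative to a baseline.

    A value counts as increased when it exceeds its baseline by more than
    `tolerance` (20% by default, an arbitrary example threshold).
    """
    e2e_up = e2e_ms > baseline_e2e_ms * (1 + tolerance)
    server_up = server_ms > baseline_server_ms * (1 + tolerance)
    if server_up:
        # Server-side processing time grew, so the server side is involved.
        return "server"
    if e2e_up:
        # E2E latency grew while server latency stayed stable:
        # the network or the client is the more likely cause.
        return "network-or-client"
    return "normal"


print(diagnose_latency(e2e_ms=300, server_ms=22,
                       baseline_e2e_ms=100, baseline_server_ms=20))
```

A result of "network-or-client" only narrows the search. It does not replace checking the logs and the network conditions.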
Monitor billable items
The OSS monitoring service allows you to monitor the following billable items: storage usage, the number of PUT requests and GET requests, and outbound traffic over the Internet. The outbound traffic over the Internet does not include CRR outbound traffic or CDN outbound traffic. The OSS monitoring service does not support alert settings or OpenAPI reading for billable items.
The OSS monitoring service collects monitoring data by bucket on an hourly basis. You can view the chart of a specific bucket to obtain the continuous data trend of the bucket. You can predict the trend of the storage usage of your business and your future storage costs based on the monitoring graph.
The OSS monitoring service provides statistics on resource usage by user and by bucket per month. The OSS resource usage of an Alibaba Cloud account or a specific bucket is calculated and updated every hour. This way, you can calculate your storage costs for the current month.
For more information about the billable items and billing methods of OSS, see Billing overview.
Note: The statistics provided by the monitoring service may be different from the actual usage in the bills. You are charged based on the actual usage in the bills displayed in the Expenses and Costs console.
Tracking and diagnosis
Problem diagnosis
Performance diagnosis
Determine a performance baseline for your application based on your business requirements. Performance issues may be caused by overload in OSS, the TCP configurations of the client, or traffic bottlenecks in the network.
With a baseline in place, you can use the performance metrics provided by the monitoring service to identify a possible issue. Then, query the logs for details and further diagnose the issue.
Error diagnosis
The client application receives an error message from the server when a request error occurs. The monitoring service records and displays different error requests. You can check the logs of the server, the client, and the network conditions to obtain details about a specific request. The HTTP status code and the error code indicate the possible cause of an error request.
For more information about error codes, see Error code.
Use the logging feature to diagnose issues
OSS provides the logging feature which can record details about requests.
For more information about how to enable and use logging, see Configure logging and Logging.
Use network log tools
In specific cases, you may need to use network log tools to capture traffic between the client and the server. This way, you can obtain transmitted data and information about network conditions. For example, an error is reported for a user request, but no logs are generated for the request on the application server. In this case, check OSS logs or use network monitoring tools to identify the issue. One of the commonly used tools is Wireshark, which is used to view the detailed packet information about various network protocols. For more information, see How to install and use Wireshark in Windows and Use WireShark.
E2E tracking and diagnosis
In most cases, a request is sent from the client to OSS, and the OSS server processes the request and sends a response to the client. You can use the logs of the client, network, and server to troubleshoot the root cause of the problem. OSS provides the request ID as the unique identifier of a log. In addition, by using the timestamp of a log, you can obtain information about other events while a request is being processed. This helps you analyze and investigate the cause of an issue.
RequestID
OSS assigns a unique request ID to each request. In different logs, the request ID of a request is contained in different fields.
In OSS logs, the request ID is in the Request ID field.
When you track network data, such as data flows captured by Wireshark, the request ID is included in the response as the standard HTTP header x-oss-request-id.
In client applications, the latest version of OSS SDK for Java includes the request ID in the response. You can call the getRequestId method to obtain the request ID from the response. For a request that fails with an exception, you can obtain the request ID by calling the getRequestId method of the OSSException class.
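When you work with captured network data instead of an SDK, the request ID has to be read from the response headers. The following minimal sketch extracts the x-oss-request-id header from the text of an HTTP response head. The function name and the sample response, including the request ID value, are made up for illustration.

```python
def extract_request_id(raw_response_head):
    """Return the x-oss-request-id header value, or None if it is absent.

    raw_response_head is the status line and headers of an HTTP response
    as text, for example copied from a packet capture.
    """
    lines = raw_response_head.splitlines()
    for line in lines[1:]:  # skip the status line
        if not line:
            break  # a blank line ends the header section
        name, sep, value = line.partition(":")
        # HTTP header names are case-insensitive.
        if sep and name.strip().lower() == "x-oss-request-id":
            return value.strip()
    return None


# Hypothetical captured response head.
captured = (
    "HTTP/1.1 404 Not Found\r\n"
    "Server: AliyunOSS\r\n"
    "x-oss-request-id: 5C3D9175B6FC201293AD4890\r\n"
    "Content-Length: 0\r\n"
    "\r\n"
)
print(extract_request_id(captured))
```

The extracted value can then be used to look up the matching entry in the OSS logs.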
Timestamp
You can use the timestamp of logs to query logs. Note that the same event may be recorded at different points in time on the client and on the server. Therefore, the timestamps of OSS logs and client logs for an event may be different. When you query the logs of the server based on a timestamp from the client logs, search a time range that starts 15 minutes before and ends 15 minutes after the client timestamp.
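The 15-minute margin can be applied mechanically when you build a log query. The following sketch computes the search window for server logs from a client-side timestamp. The function name and the sample timestamp are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def server_log_window(client_time, margin=timedelta(minutes=15)):
    """Return (start, end) of the server log range to search.

    The window spans `margin` before and after the client timestamp to
    allow for differences between client and server clocks.
    """
    return client_time - margin, client_time + margin


# Hypothetical timestamp taken from a client log entry (UTC).
client_ts = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
start, end = server_log_window(client_ts)
print(start.isoformat(), end.isoformat())
```

Using timezone-aware timestamps avoids a second source of mismatch when the client and the logs use different time zones.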
Troubleshooting
Performance-related issues
High average E2E latency and low average server latency
Possible causes:
Slow response of client applications
The available connections and threads are limited.
Run relevant commands to check the system connection status and change the number of CPU cores.
View the bottlenecks of the client resources, appropriately increase the number of concurrent threads, and optimize the client code.
Limited resources of CPU, memory, or bandwidth
Use monitoring tools to determine resource bottlenecks and optimize the code or scale up the resources.
Poor network conditions
Use Wireshark to analyze the cause of network issues.
Low average E2E latency, low average server latency, and high client request latency
For high client request latency, the high latency most likely occurs before requests arrive at the application server. Therefore, we recommend that you analyze why requests from the client do not arrive at the server.
The following scenarios may cause high client request latency:
The available connections and threads are limited.
Check the system connection status and change the number of CPU cores.
View the bottlenecks of the client resources, appropriately increase the number of concurrent threads, and optimize the client code.
Multiple retries occur for requests on the client
Check the logs in the client to determine the cause of the retries and use Wireshark to identify network issues.
Check the client logs to determine whether retries were performed. Take OSS SDK for Java as an example. Search the warn-level or info-level log entries for messages similar to the following. If such entries are recorded, retries may have occurred on the client or the server.
[Server]Unable to execute HTTP request: or [Client]Unable to execute HTTP request:
Take OSS SDK for Java as an example. You can query the following log if the level of the client log is debug. If similar logs are recorded, retries were performed.
Retrying on
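The log messages quoted above can be searched for automatically. The following sketch counts how often each marker appears in a list of client log lines. Only the marker strings come from this topic. The function name and the sample log lines are made up for illustration, and real log lines may be formatted differently.

```python
RETRY_MARKERS = (
    "[Server]Unable to execute HTTP request:",
    "[Client]Unable to execute HTTP request:",
    "Retrying on",
)


def count_retry_markers(log_lines):
    """Count the occurrences of each retry-related marker in the log lines."""
    counts = {marker: 0 for marker in RETRY_MARKERS}
    for line in log_lines:
        for marker in RETRY_MARKERS:
            if marker in line:
                counts[marker] += 1
    return counts


# Hypothetical client log excerpt.
sample_lines = [
    "WARN [Server]Unable to execute HTTP request: connection reset",
    "DEBUG Retrying on a failed request",
    "INFO upload finished",
]
for marker, count in count_retry_markers(sample_lines).items():
    print(f"{count} x {marker}")
```

A non-zero count suggests that retries took place and that the time spent on them contributes to the client request latency.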
High average server latency
For high average server latency in downloads and uploads, consider the following possible causes:
A large number of clients frequently access an object.
View OSS logs and activate CDN or modify the access control list (ACL) of the bucket or the object.
Internal OSS issues
Contact technical support and provide the client logs to resolve the issue.
Errors in the application server
Temporary increase
Change the retry policy of the client and use a proper backoff mechanism, such as increasing the wait time between retries.
Permanent increase
Contact technical support and provide the client logs to resolve the issue.
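For a temporary increase, one common way to adjust the client retry policy is exponential backoff with jitter. The following is a minimal sketch of that technique, not the retry implementation of any OSS SDK. The function name, the default values, and the injectable sleep and rng parameters are assumptions made for the example.

```python
import random
import time


def call_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=8.0,
                      sleep=time.sleep, rng=random.random):
    """Call operation() and retry on failure with exponential backoff.

    The wait before retry k is a random fraction of
    min(max_delay, base_delay * 2**k), which spreads retries out so that
    many clients do not hit the server at the same moment.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            # Assumption for brevity: every exception is treated as retryable.
            # Real code should retry only errors that are known to be transient.
            if attempt == max_attempts - 1:
                raise
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(delay * rng())


# Usage with a hypothetical operation that fails twice before succeeding.
state = {"calls": 0}


def flaky_upload():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


print(call_with_backoff(flaky_upload, sleep=lambda seconds: None))
```

The upper bound on the delay keeps a long outage from producing very long waits, and the limit on attempts ensures that a permanent failure is eventually reported to the caller.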
Network errors
A network error occurs when the server fails to respond to a request because the connection between the server and the client is lost while the server is processing the request. In this case, the HTTP status code 499 is recorded for the request. HTTP status code 499 is returned due to the following possible reasons:
Before the server receives requests to read and write data, the server checks whether the connection is established. If not, the HTTP status code 499 is recorded for the request.
The HTTP status code 499 is recorded for a request when the client closes the connection while the server is processing the request.
A network error occurs when the client cancels the request or loses the connection while the request is being processed. If the client cancels the request, check the code of the application to determine why and when the client disconnects from OSS. If the client loses the connection, you can use tools such as Wireshark to identify the possible causes.
Client errors
Increase of client authorization error requests
When the number of client authorization error requests increases, or the client application receives a large number of HTTP status code 403 error responses, consider the following possible causes:
Invalid bucket domain name
If users use the third-level domain or the second-level domain to access the bucket, the region included in the domain name may be different from the region in which the bucket is located. For example, the accessed bucket is located in the China (Hangzhou) region, but the accessed domain name is Bucket.oss-cn-shanghai.aliyuncs.com. In this case, you must verify the region in which the bucket is located and access the correct domain name.
If you use CDN acceleration, the domain name of the bucket that is mapped to the CDN may be incorrect. In this case, check whether the origin is the third-level domain of the bucket.
If users use the JavaScript client and HTTP status code 403 is returned, the web browser may be blocking the request because cross-origin resource sharing (CORS) is not correctly configured for the bucket. In this case, check and modify the CORS configurations of the bucket so that users can use the web browser to access the bucket. For more information about how to configure CORS rules, see CORS.
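A region mismatch in the bucket domain name can be detected before a request is sent. The following sketch extracts the region part from a bucket domain name of the form shown in the example above and compares it with the region in which the bucket is located. It assumes the bucket.region.aliyuncs.com naming pattern and treats an -internal suffix as part of the same region. The function names are hypothetical.

```python
import re

# Assumed pattern: <bucket>.<region>[-internal].aliyuncs.com
_ENDPOINT_RE = re.compile(
    r"^(?P<bucket>[^.]+)\.(?P<region>oss-[a-z0-9-]+?)(?:-internal)?\.aliyuncs\.com$"
)


def endpoint_region(hostname):
    """Return the region part of a bucket domain name, or None if it does not match."""
    match = _ENDPOINT_RE.match(hostname.lower())
    return match.group("region") if match else None


def region_matches(hostname, bucket_region):
    """Check whether the domain name points to the region of the bucket."""
    return endpoint_region(hostname) == bucket_region


# The bucket is located in China (Hangzhou), but the domain name points to Shanghai.
print(region_matches("Bucket.oss-cn-shanghai.aliyuncs.com", "oss-cn-hangzhou"))
```

Custom domain names and CDN-accelerated domain names do not follow this pattern, so the sketch returns None for them and they have to be checked by other means.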
Access control
If you use the AccessKey pair of an Alibaba Cloud account to access a bucket, check the validity of your AccessKey pair.
If you use the AccessKey pair of a RAM user to access the bucket, check the validity or the authorization of the AccessKey pair.
If you use a temporary token generated by Security Token Service (STS), check whether the temporary token has expired. If the temporary token has expired, apply for a new token.
If the ACL is configured for the accessed bucket or object, check whether users are allowed to perform specific operations based on the ACL settings.
Expired URL
If HTTP status code 403 is returned when a third-party user accesses the bucket by using a signed URL, the most likely cause is that the signed URL has expired.
HTTP status code 403 may be returned when RAM users log on to OSS tools such as ossftp, ossbrowser, and the OSS console. In this case, check whether the correct AccessKey pair was entered or whether the user has the permissions to call the GetService operation if the account is a RAM user.
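For the expired signed URL case described above, the expiration time can sometimes be read from the URL itself. The following sketch assumes a signed URL that carries an Expires query parameter containing a Unix timestamp. Signed URLs that encode the expiration differently are reported as unknown. The function name and the sample URL, including its parameter values, are made up for illustration.

```python
import time
from urllib.parse import parse_qs, urlparse


def signed_url_expired(url, now=None):
    """Return True if the URL has expired, False if not, None if unknown.

    Assumption: the expiration is given as a Unix timestamp in an
    `Expires` query parameter. `now` can be passed in for testing.
    """
    params = parse_qs(urlparse(url).query)
    values = params.get("Expires")
    if not values:
        return None  # no Expires parameter, cannot decide from the URL
    expires_at = int(values[0])
    current = time.time() if now is None else now
    return current > expires_at


# Hypothetical signed URL with placeholder credentials.
sample_url = (
    "https://examplebucket.oss-cn-hangzhou.aliyuncs.com/exampleobject.txt"
    "?OSSAccessKeyId=EXAMPLE&Expires=1700000000&Signature=EXAMPLE"
)
print(signed_url_expired(sample_url))
```

If the check reports an expired URL, the party that issued the URL has to generate a new one, because the expiration time is covered by the signature and cannot be changed by the recipient.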
An increase in the number of HTTP status code 404 error responses to client requests
An HTTP status code 404 error response returned to a client request indicates that the data that users access does not exist. If the number of HTTP status code 404 error responses to client requests increases, consider the following possible causes:
The business logic of the application. For example, an application calls the doesObjectExist method provided by OSS SDK for Java to check whether an object exists before performing further operations. If the object does not exist, a value of false is returned to the client and HTTP status code 404 is generated on the server. In this scenario, HTTP status code 404 does not indicate an error.
The accessed object is deleted by the client application or other processes. In this case, query OSS logs of the accessed object.
Repeat delete operations caused by network failures. For example, the client initiates an operation to delete an object. The request arrives at the server, and the object is deleted. However, the response does not arrive at the client due to network failures. As a result, the client sends another request to delete the object, and HTTP status code 404 is returned. In this case, you can query and view the client logs and OSS logs to determine the cause of HTTP status code 404.
Query the client logs and check whether a repeated request is sent from the client.
Query OSS logs. Then, check whether two delete operations are initiated on the object and the HTTP status code of the first operation is 2xx.
Low valid request proportion and high client error requests
Valid request proportion is the proportion of successful requests, whose responses are HTTP status code 2xx or 3xx, to total requests. Requests whose responses are HTTP status code 4xx or 5xx are counted as failed requests and lower the valid request proportion. Client other error requests by user refer to error requests other than server error requests (5xx), network error requests (499), client authorization error requests (403), resource not found error requests (404), and client timeout error requests (408, or 400 with the OSS error code RequestTimeout).
You can query OSS logs to determine the error type. Then, refer to OSS error codes and resolve the issue by modifying the code of the application. For more information, see Error code.
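The request categories described above can be expressed as a small classification function, which may help when you group log entries by error type. This is a sketch based only on the status codes named in this topic. The function name and the category labels are assumptions made for the example.

```python
def classify_request(status, error_code=None):
    """Map an HTTP status code (and optional OSS error code) to a category."""
    if 200 <= status < 400:
        return "valid"
    if 500 <= status < 600:
        return "server error"
    if status == 499:
        return "network error"
    if status == 403:
        return "client authorization error"
    if status == 404:
        return "resource not found error"
    if status == 408 or (status == 400 and error_code == "RequestTimeout"):
        return "client timeout error"
    if 400 <= status < 500:
        return "client other error"
    raise ValueError(f"unexpected HTTP status code: {status}")


# Hypothetical (status, error code) pairs taken from log entries.
for status, code in [(200, None), (400, "RequestTimeout"), (400, "InvalidArgument")]:
    print(status, code, "->", classify_request(status, code))
```

Counting the entries per category shows quickly whether a low valid request proportion is dominated by client other errors or by one of the more specific categories.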
Abnormal increase in storage usage
When your storage usage increases dramatically, failed cleanup operations may be the cause. Troubleshoot the issue based on how your application cleans up data:
The client application uses specific processes to perform regular cleaning operations to release storage. Perform the following steps:
Check whether the valid request rate decreases due to failed cleaning operations.
Determine what specifically causes the valid request rate to decrease and the type of the error requests. Then, obtain the details about the errors from the client logs.
The client application clears your bucket storage by configuring lifecycle rules. For cleaning operations triggered by lifecycle rules, you must use the OSS console or call API operations to check whether the lifecycle rules are correctly configured. If not, you can modify the lifecycle rules in the OSS console. To check whether the lifecycle rules were modified, query OSS logs. If the lifecycle rules are correctly configured but do not take effect, contact OSS technical support for assistance.
Other storage service issues
If the troubleshooting section in this topic cannot resolve your storage service issues, use the following methods to diagnose and troubleshoot your issues:
View the monitoring service of OSS in the CloudMonitor console and check whether the baseline of metrics has been changed. You may determine whether the issue is temporary or permanent and which storage operations are affected by this issue.
Based on the monitoring information, query OSS logs to obtain all errors that occurred during the same period of time. This may help you identify and resolve the issue.
If the OSS logs on the server cannot provide sufficient information for troubleshooting, use the client logs to investigate the client application or network tools such as Wireshark to investigate network failures.