
E-MapReduce:Best practices for optimizing the performance of OSS and OSS-HDFS

Last Updated:Aug 15, 2023

This topic describes how to accelerate access to Object Storage Service (OSS) and OSS-HDFS over HTTP so that your applications can read data from and write data to OSS or OSS-HDFS much faster.

Background information

As data lakes gain wide adoption, an increasing number of users store data in OSS. OSS provides 99.9999999999% (12 nines) data durability and 99.995% service availability, and offers multiple storage classes to help you manage and reduce storage costs.

OSS can handle thousands of read and write requests from your application per second. Within each partition, OSS sorts objects in alphabetical order of object names. Partitions are split based on the size of the stored objects and the QPS of the requests that access them. Each partition can handle at least 3,500 PUT, COPY, POST, or DELETE requests and 5,500 GET or HEAD requests per second. The number of object prefixes that you can configure is unlimited, so you can create multiple prefixes to improve read and write performance. For example, if you create 10 object prefixes that map to 10 different partitions and write to them in parallel, your application can send 35,000 PUT requests per second to OSS.
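The prefix idea above can be sketched in a few lines. The following is an illustrative example, not part of any OSS SDK: it derives a stable hash prefix for each object key so that writes are spread across a fixed number of prefixes (and therefore across partitions). The `prefix-N/` naming scheme is made up for this example.

```python
import hashlib

NUM_PREFIXES = 10

def prefixed_key(key: str, num_prefixes: int = NUM_PREFIXES) -> str:
    """Derive a stable hash prefix for a key, e.g. 'prefix-3/logs/a.log'."""
    index = int(hashlib.md5(key.encode()).hexdigest(), 16) % num_prefixes
    return f"prefix-{index}/{key}"

# Spread 1,000 log objects across the 10 prefixes.
keys = [f"logs/part-{i:05d}.log" for i in range(1000)]
spread = [prefixed_key(k) for k in keys]

# Each key keeps its original name under a deterministic prefix,
# and the load is distributed over all 10 prefixes.
used_prefixes = {k.split("/", 1)[0] for k in spread}
print(sorted(used_prefixes))
```

Because the prefix is derived deterministically from the key, readers can recompute the full object name without a lookup table.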

JindoFS is deployed as a service named OSS-HDFS in Alibaba Cloud. OSS-HDFS is deeply integrated with OSS. You can use OSS-HDFS without the need to deploy and maintain JindoFS in E-MapReduce (EMR) clusters. For more information about OSS-HDFS, see Overview.

Best practices for reducing the HTTP request latency

You can use the following methods to reduce the HTTP request latency when you access OSS or OSS-HDFS:

  • Make sure that the OSS bucket or OSS-HDFS service is deployed in the same region as your Elastic Compute Service (ECS) instance.

  • Horizontally distribute workloads and send concurrent requests to increase the throughput.

  • Configure latency-sensitive applications to automatically retry timeout requests.

  • Cache frequently accessed content.

  • Use the latest JindoSDK version.

We recommend that you choose the preceding methods based on scenarios. The following sections describe the best practice for each method.

Deploy the OSS bucket or OSS-HDFS service in the same region as your ECS instance

The name of a bucket is globally unique in OSS. You must specify a region for a bucket when you create the bucket. The name and region of a bucket cannot be changed after the bucket is created. To reduce the network latency and data transmission costs, we recommend that you access an OSS bucket by using ECS instances that are deployed in the same region as the bucket. For more information, see Access to OSS resources from ECS instances by using an internal endpoint of OSS.
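As a small illustration of the same-region recommendation, the following sketch builds the internal and public OSS endpoints from a region ID. The `oss-<region>-internal.aliyuncs.com` pattern follows the publicly documented endpoint naming; verify the region ID against the current OSS endpoint list before use.

```python
def oss_endpoint(region: str, internal: bool) -> str:
    """Return the OSS endpoint for a region.

    internal=True selects the internal endpoint, which is reachable only
    from ECS instances in the same region but avoids public-network
    latency and data transfer costs.
    """
    suffix = "-internal" if internal else ""
    return f"oss-{region}{suffix}.aliyuncs.com"

print(oss_endpoint("cn-hangzhou", internal=True))
# oss-cn-hangzhou-internal.aliyuncs.com
```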

Horizontally distribute workloads and send concurrent requests to increase the throughput

OSS is a large-scale distributed storage system. To fully utilize the throughput of OSS, we recommend that you send concurrent requests to OSS and distribute the requests across multiple OSS service nodes. This way, workloads can be distributed through multiple network paths.

OSS-HDFS builds on the preceding best practices: its metadata service distributes file blocks across multiple OSS servers to improve read and write performance.

If your application requires high-throughput data transmission, you can modify parameters based on your application and the specifications of your ECS instances to use multiple threads or instances to control the concurrency or parallelism of read and write requests.

  • View performance indicators

    Choose ECS instance types based on CPU and network throughput requirements. For more information about ECS instance types, see Overview. In addition, we recommend that you use HTTP analysis tools to analyze the DNS query time, latency, and data transmission speed.

    Before you adjust the number of concurrent requests, check the performance metrics. We recommend that you first measure the bandwidth and other resources consumed by a single request. This helps you identify the resource with the highest usage and determine, based on that resource's upper limit, the maximum number of concurrent requests that can be processed. For example, if a single request consumes 10% of CPU resources, you can send up to 10 concurrent requests.

  • Modify the concurrency and parallelism parameters

    You can modify the fs.oss.download.thread.concurrency and fs.oss.upload.thread.concurrency parameters to control the number of concurrent downloads and the number of concurrent uploads per process.

    If you run MapReduce or Spark jobs, you can also modify the following settings:

    • For MapReduce jobs, you can modify the mapreduce.job.maps and mapreduce.job.reduces Hadoop parameters to control the number of concurrent map tasks and the number of concurrent reduce tasks per job.

    • For Spark jobs, you can set the --num-executors option or configure the spark.executor.instances Spark parameter to control the number of concurrent executors.
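The fs.oss.* parameters above are typically set in the Hadoop configuration. The following core-site.xml fragment is illustrative only; the values are examples, not recommendations, and should be tuned to your instance type and workload.

```xml
<!-- Illustrative values; tune to your ECS specifications and workload. -->
<property>
  <name>fs.oss.download.thread.concurrency</name>
  <value>32</value>
</property>
<property>
  <name>fs.oss.upload.thread.concurrency</name>
  <value>32</value>
</property>
```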
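The sizing rule described above (per-request resource usage bounds useful concurrency) reduces to simple arithmetic. This is an illustrative helper, not part of any SDK; the percentages are examples.

```python
def max_concurrency(per_request_pct: float, budget_pct: float = 100.0) -> int:
    """Bound concurrency by the most constrained resource.

    If one request consumes per_request_pct of a resource and budget_pct
    of that resource is available, the useful number of concurrent
    requests is the floor of their ratio.
    """
    if per_request_pct <= 0:
        raise ValueError("per-request usage must be positive")
    return int(budget_pct // per_request_pct)

# 10% CPU per request with a 100% CPU budget -> 10 concurrent requests,
# matching the example in the text above.
print(max_concurrency(10))
```

In practice, run this calculation for each resource (CPU, bandwidth, memory) and take the smallest result.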

Configure latency-sensitive applications to automatically retry timeout requests

OSS throttles the queries per second (QPS) of management-related API operations, such as GetService (ListBuckets), PutBucket, and GetBucketLifecycle. If your application initiates a large number of requests at the same time, HTTP status code 503 may be returned to indicate that request throttling is triggered. In this case, we recommend that you retry the requests after a few seconds.

By default, OSS limits the QPS per account to 10,000 requests per second. If you want to increase this upper limit, contact OSS technical support.

Important
  • If the QPS of the account does not exceed the preceding upper limit but most requests are sent to a specific partition, the server triggers request throttling and returns the HTTP 503 error because the partition is overwhelmed.

  • If you use random prefixes, OSS automatically scales out the number of partitions when the QPS increases. You need to only wait and retry timeout requests. For more information about random prefixes, see OSS performance and scalability best practices.

You can modify the fs.oss.retry.count parameter to control the number of retries for requests sent to OSS or OSS-HDFS, and the fs.oss.retry.interval.millisecond parameter to control the retry interval. The interval doubles with each additional retry. You can also modify the fs.oss.timeout.millisecond parameter to adjust the timeout period based on your network conditions.
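The retry behavior described above can be sketched as follows. The parameter names mirror the fs.oss.retry.count and fs.oss.retry.interval.millisecond settings from this topic, but the retry loop itself is an illustrative assumption, not JindoSDK's implementation.

```python
import time

def retry_intervals(retry_count: int, base_interval_ms: int) -> list[int]:
    """Backoff intervals: the base interval doubles after each retry."""
    return [base_interval_ms * (2 ** i) for i in range(retry_count)]

def call_with_retry(request, retry_count: int = 5, base_interval_ms: int = 100):
    """Make up to retry_count attempts, sleeping a doubling interval
    between failed attempts. Only timeout-style failures are retried."""
    last_error = None
    for interval in retry_intervals(retry_count, base_interval_ms):
        try:
            return request()
        except TimeoutError as err:
            last_error = err
            time.sleep(interval / 1000.0)
    raise last_error

print(retry_intervals(5, 100))  # [100, 200, 400, 800, 1600]
```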

Cache frequently accessed content

If your application sends a large number of requests for a static file in the same region as the application, you can use JindoData to cache the file. JindoData stores files as blocks in a distributed cache service, which saves you from repeatedly reading the same data from OSS or OSS-HDFS. JindoData effectively reduces access latency and increases the utilization of compute resources. For more information, see Accelerate access to OSS or OSS-HDFS by using the transparent caching feature of JindoFSx.
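The block-caching idea can be illustrated with a toy in-memory cache. This is not JindoData's implementation: the block size, the backing store, and the read counter are all made up for the example. It only shows why caching hot blocks cuts repeated backend reads.

```python
BLOCK_SIZE = 4  # bytes; tiny for illustration

backend_reads = 0
_store = {"hot/object": b"abcdefghij"}  # stand-in for OSS

def read_block_from_oss(key: str, index: int) -> bytes:
    """Simulated backend read; counts how often OSS is actually hit."""
    global backend_reads
    backend_reads += 1
    data = _store[key]
    return data[index * BLOCK_SIZE:(index + 1) * BLOCK_SIZE]

_cache: dict[tuple[str, int], bytes] = {}

def cached_read(key: str, index: int) -> bytes:
    """Serve a block from the cache, fetching from the backend on a miss."""
    if (key, index) not in _cache:
        _cache[(key, index)] = read_block_from_oss(key, index)
    return _cache[(key, index)]

# Ten reads of the same two blocks hit the backend only twice.
for _ in range(5):
    cached_read("hot/object", 0)
    cached_read("hot/object", 1)
print(backend_reads)  # 2
```

A production cache such as JindoData additionally handles eviction, distribution across nodes, and consistency with the backing store.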

Use the latest JindoSDK version

The latest JindoSDK version provides optimized adaptive configurations and prefetch algorithms, and is updated periodically to incorporate the latest best practices. For example, it can automatically retry requests upon different network errors and control request concurrency.

Download links: Download JindoData