Learn what factors affect data transformation performance and how to optimize throughput by tuning shards and transformation logic.
The overall speed of a data transformation job depends on the number of shards in the source LogStore and the complexity of the transformation logic. As a guideline, plan for a processing throughput of 1 MB/s of uncompressed data per shard, which is approximately 85 GB per day per shard. For example, if data is written to the source LogStore at 1 TB per day, split the LogStore into at least 12 shards (1,024 GB / 85 GB per shard ≈ 12). For more information about splitting shards, see Split a shard.
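The sizing arithmetic above can be sketched in a few lines of Python. This is only an illustration of the guideline, not an official sizing tool: the function names are hypothetical, and the 1 MB/s and 85 GB figures are the planning values quoted above.

```python
import math

SECONDS_PER_DAY = 86_400

def per_shard_gb_per_day(mb_per_second: float = 1.0) -> float:
    """Convert a per-shard rate in MB/s to GB per day (1 GB = 1,024 MB)."""
    return mb_per_second * SECONDS_PER_DAY / 1024

def shard_ratio(daily_write_gb: float, gb_per_day_per_shard: float = 85.0) -> float:
    """Daily write volume divided by the per-shard guideline (hypothetical helper)."""
    return daily_write_gb / gb_per_day_per_shard

def conservative_shard_count(daily_write_gb: float, gb_per_day_per_shard: float = 85.0) -> int:
    """Round the ratio up so that the shards never fall short of the write volume."""
    return max(1, math.ceil(shard_ratio(daily_write_gb, gb_per_day_per_shard)))

if __name__ == "__main__":
    print(per_shard_gb_per_day())          # GB per day for one shard at 1 MB/s
    print(shard_ratio(1024))               # ratio for 1 TB per day
    print(conservative_shard_count(1024))  # ratio rounded up
```

For 1 TB per day the ratio is slightly above 12, which the guideline rounds to 12. Rounding up instead gives one extra shard of headroom.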
Data transformation performance
The speed of data transformation depends on the transformation logic. Key factors include the following:
- Output data
  - Performance decreases as the volume of output data increases. Generating more log entries (for example, by splitting one entry into several), adding more fields, or increasing their content size all consume more CPU and network resources, which reduces throughput.
  - Writing to multiple destinations, adding numerous tags to each log entry, or creating more log groups slows performance. Each action increases network interactions and overhead.
- Transformation logic
  Complex transformation logic requires more searches, computations, and external resource synchronizations, all of which consume additional computing and network resources and reduce throughput.
- External data sources
  If you use a third-party source to enrich your data, larger volumes of pulled data slow down transformation. Cross-region data pulls, such as OSS objects in another region, further reduce performance.
Scale source LogStore transformation
- Scaling real-time data transformation.
  Increase the number of shards to improve real-time data transformation performance. For more information about the billing methods of shards, see Pay-by-feature billing.
- Scaling historical data transformation.
  Shard splitting affects only newly written data. To process a large volume of historical data in a LogStore with a small number of shards, create multiple data transformation jobs. Configure each job to process a separate, non-overlapping time range. For example, to process historical logs from September 1 to September 10, you can create ten jobs to process the data in daily increments, such as [9/1, 9/2), [9/2, 9/3), ..., [9/10, 9/11).
  Note: The time range of a transformation job is based on the log reception time. For more information, see Create a data transformation job.
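The non-overlapping daily ranges described above can be generated programmatically before the jobs are created. The following is a minimal sketch: the helper name is hypothetical, the year 2024 is an arbitrary assumption (the example above does not specify one), and the step of actually creating a job for each range is left out.

```python
from datetime import date, timedelta

def daily_ranges(start: date, end_inclusive: date) -> list[tuple[date, date]]:
    """Return half-open [from, to) day ranges that cover start..end_inclusive."""
    ranges = []
    current = start
    while current <= end_inclusive:
        next_day = current + timedelta(days=1)
        # Each range ends exactly where the next one begins, so no log is
        # processed twice and none is skipped.
        ranges.append((current, next_day))
        current = next_day
    return ranges

if __name__ == "__main__":
    # The year is assumed purely for illustration.
    for from_day, to_day in daily_ranges(date(2024, 9, 1), date(2024, 9, 10)):
        print(f"[{from_day.isoformat()}, {to_day.isoformat()})")
```

Each printed pair corresponds to the time range of one transformation job.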
Scale destination LogStore transformation
The required shard count for a destination LogStore depends on two factors:
- Write throughput. A single LogStore shard supports a maximum write throughput of 5 MB/s. Estimate the required destination shard count based on the number of source shards and the processing concurrency.
  For example, if the source LogStore has 20 shards and each shard is processed at about 1 MB/s, the job writes roughly 20 MB/s, so the destination LogStore needs at least 4 shards (20 MB/s / 5 MB/s per shard). This assumes that the output volume is close to the input volume.
- Query and analysis requirements. If you create an index and run queries on the data in the destination LogStore, plan your shard count based on your query scope. As a rule of thumb, provision one shard for every 50 million log entries queried at once.
  For example, assume you write 10 GB of logs daily and an average log entry is 1 KB. This equals about 10 million log entries per day. If your queries must span a 30-day period, each query covers approximately 300 million log entries. In this scenario, configure the destination LogStore with 6 shards (300 million entries / 50 million entries per shard).
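Both destination estimates can be combined in a short calculation. The sketch below uses hypothetical function names and makes two assumptions that the text does not state explicitly: each source shard contributes about 1 MB/s of output, and the larger of the two estimates is the one to provision.

```python
import math

def shards_for_write_throughput(source_shards: int,
                                mb_s_per_source_shard: float = 1.0,
                                max_write_mb_s_per_shard: float = 5.0) -> int:
    """Destination shards needed to absorb the job's aggregate write rate."""
    total_mb_s = source_shards * mb_s_per_source_shard
    return max(1, math.ceil(total_mb_s / max_write_mb_s_per_shard))

def shards_for_query_scope(daily_entries: int, query_days: int,
                           entries_per_shard: int = 50_000_000) -> int:
    """Destination shards needed for the number of entries one query spans."""
    return max(1, math.ceil(daily_entries * query_days / entries_per_shard))

def destination_shards(source_shards: int, daily_entries: int, query_days: int) -> int:
    """Assumption: provision for whichever factor demands more shards."""
    return max(shards_for_write_throughput(source_shards),
               shards_for_query_scope(daily_entries, query_days))

if __name__ == "__main__":
    print(shards_for_write_throughput(20))             # write-throughput example
    print(shards_for_query_scope(10_000_000, 30))      # query-scope example
    print(destination_shards(20, 10_000_000, 30))      # larger of the two
```

With the figures from the two examples, the write-throughput estimate is 4 shards and the query-scope estimate is 6 shards, so the combined sketch suggests 6.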