
Simple Log Service: Limits

Last Updated: Feb 01, 2024

This topic describes the limits on data transformation in Simple Log Service.

  • Job configuration

    Number of jobs

    You can create up to 100 data transformation jobs in a project.

    Important

    A data transformation job that is stopped or complete still counts against the job quota. To prevent the quota from being used up by jobs that are stopped or complete, we recommend that you delete such jobs after you confirm that you no longer require them. For more information, see Manage a data transformation job. An example of deleting a job by using the SDK is shown at the end of this item.

    To increase the quota, submit a ticket.
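
    The following sketch shows how such cleanup might look with the aliyun-log-python-sdk package. It assumes that your SDK version provides the stop_etl and delete_etl methods, which wrap the StopETL and DeleteETL API operations; the endpoint, credentials, project name, and job name are placeholders.

        from aliyun.log import LogClient

        # Placeholder endpoint, credentials, project, and job name.
        client = LogClient("cn-hangzhou.log.aliyuncs.com", "<access_key_id>", "<access_key_secret>")

        # Stop the job first if it is still running, then delete it to release
        # the job quota of the project. Both calls take positional arguments:
        # project name, then job name.
        client.stop_etl("my-project", "etl-job-to-remove")
        client.delete_etl("my-project", "etl-job-to-remove")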

    Dependency on a consumer group in the source Logstore

    The running of a data transformation job depends on a consumer group in the source Logstore.

    When a data transformation job is running, do not delete the consumer group on which the job depends or reset its consumption checkpoint. If you perform either operation, the job consumes data again from the start time that you specified, and duplicate data appears in the transformation results.

    Important

    To optimize the efficiency of data transformation, the data consumption progress of a job in a shard is reported to the consumer group on which the job depends only at regular intervals. Therefore, the result of the GetCheckPoint operation on the consumer group does not indicate the latest data transformation progress (an example of such a query is shown at the end of this item). To obtain the accurate data transformation progress of a job, view the shard consumption delay chart of the dashboard that is created for the job. For more information about the dashboard, see Data transformation dashboard.

    For more information, see Data transformation basics, Terms, and API operations related to consumer groups.
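
    The following sketch, which assumes the aliyun-log-python-sdk package, queries the checkpoint that a consumer group reports. As described above, the value may lag behind the actual transformation progress; the endpoint, credentials, and names are placeholders.

        from aliyun.log import LogClient

        client = LogClient("cn-hangzhou.log.aliyuncs.com", "<access_key_id>", "<access_key_secret>")

        # Positional arguments: project, source Logstore, consumer group, shard ID.
        # Pass -1 as the shard ID to query the checkpoints of all shards.
        response = client.get_check_point("my-project", "source-logstore", "etl-consumer-group", 0)
        response.log_print()  # prints the checkpoint information that was last reported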

    Number of consumer groups in a source Logstore

    You can create up to 30 consumer groups in a Logstore. Therefore, you can create up to 30 data transformation jobs in a source Logstore. For more information, see Basic resources.

    If the number of consumer groups exceeds this limit, the affected data transformation jobs cannot run as expected after they are started. The run logs of the jobs record the error information. For more information, see View error logs.

    Important

    When a data transformation job is stopped or complete, Simple Log Service does not automatically delete the consumer group on which the job depends. To reduce invalid consumer groups, we recommend that you delete the data transformation jobs that are stopped or complete and that you no longer require. For more information, see Manage a data transformation job.
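
    The following sketch, which assumes the aliyun-log-python-sdk package, lists the consumer groups of a source Logstore so that you can check how many of the 30 allowed consumer groups are in use and identify leftover groups. The accessor name get_consumer_groups may vary with the SDK version; the endpoint, credentials, and names are placeholders.

        from aliyun.log import LogClient

        client = LogClient("cn-hangzhou.log.aliyuncs.com", "<access_key_id>", "<access_key_secret>")

        # Positional arguments: project name, Logstore name.
        response = client.list_consumer_group("my-project", "source-logstore")
        groups = response.get_consumer_groups()
        print("consumer groups in use: %d of 30" % len(groups))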

    Change in the time ranges of jobs

    If you change the time range of a running job, the job starts consumption from the start time that you specify and transforms all data that is generated in the new time range.

    1. If you want a job to consume data that is generated within a longer time range, we recommend that you create another job to cover the additional time range instead of extending the time range of the existing job.

    2. If you want a job to consume data that is generated within a shorter time range, we recommend that you first delete the data that has been written to the storage destinations and then shorten the time range of the existing job to prevent data duplication. The data that has been written to the storage destinations is not automatically deleted.

    Number of storage destinations

    You can configure up to 20 independent static storage destinations for a data transformation job.

    In addition, up to 200 projects and 200 Logstores can be dynamically specified in data transformation code. If one of the preceding limits is exceeded, the data that is written to storage destinations beyond the limits is discarded.
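
    The following sketch shows data transformation code that dynamically specifies destinations. The storage destination name target0 and the fields dst_project and dst_logstore are hypothetical; every distinct project and Logstore that the expressions resolve to counts against the preceding limits.

        e_output(
            name="target0",               # storage destination configured for the job
            project=v("dst_project"),     # destination project name taken from a log field
            logstore=v("dst_logstore"),   # destination Logstore name taken from a log field
        )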

  • Data transformation

    Quick preview

    The quick preview feature of data transformation is used to debug data transformation code. The feature has the following limits:

    • Connections to external resources such as ApsaraDB RDS, Object Storage Service (OSS), and Simple Log Service are not supported. You can specify custom test data for a dimension table.

    • A single request can obtain up to 1 MB of test data from a source table or a dimension table. If the size of the data exceeds 1 MB, an error is returned.

    • Up to the first 100 logs can be returned for a single request.

    The advanced preview feature does not have these limits.

    Runtime concurrency

    The number of readwrite shards in the source Logstore determines the maximum concurrency of a data transformation job, that is, the maximum number of concurrent units that the job can run in parallel. For more information, see Data transformation basics.

    For more information about the limits on the shards of a Logstore, see Basic resources. For more information about how to split a shard of a Logstore, see Manage shards.

    Important
    • Automatic sharding is not triggered for the source Logstore when the concurrency of a data transformation job does not meet your requirements. You must manually split a shard of the source Logstore to increase the concurrency (see the example after this note). For more information about automatic sharding, see Manage shards.

    • For data that is written after a shard is split, the maximum concurrency equals the number of readwrite shards that are available in the source Logstore after the split. For data that is written before the split, the maximum concurrency equals the number of readwrite shards that were available when the data was written.
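
    The following sketch, which assumes the aliyun-log-python-sdk package, splits a readwrite shard of the source Logstore to increase the maximum concurrency for data that is written after the split. The shard ID and the split hash key are placeholders; list the shards first to see their hash key ranges.

        from aliyun.log import LogClient

        client = LogClient("cn-hangzhou.log.aliyuncs.com", "<access_key_id>", "<access_key_secret>")

        # Inspect the current shards and their hash key ranges.
        client.list_shards("my-project", "source-logstore").log_print()

        # Split shard 0 at a hash key that falls inside its range. The split
        # produces two readwrite shards and sets the original shard to readonly.
        client.split_shard("my-project", "source-logstore", 0, "80000000000000000000000000000000")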

    Data load of a concurrent unit

    The data load of a concurrent unit in a data transformation job depends on the amount of data that the job consumes from the corresponding shard of the source Logstore. If data in the source Logstore is unevenly distributed among shards, the concurrent units that consume the busier shards carry a heavier data load and become hot concurrent units. In this case, the transformation of data in those shards is delayed.

    If data is written to the source Logstore in KeyHash mode, we recommend that you appropriately allocate hash keys and shards to minimize uneven data distribution. For more information about data writing, see PutLogs.
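
    The following sketch, which assumes the aliyun-log-python-sdk package, writes a log in KeyHash mode. The hash key is a placeholder; in practice, derive it from a business key, such as an MD5 hash of a user or device ID, so that writes spread across the hash key ranges of the shards. The hashKey parameter name may vary with the SDK version.

        import time

        from aliyun.log import LogClient, LogItem, PutLogsRequest

        client = LogClient("cn-hangzhou.log.aliyuncs.com", "<access_key_id>", "<access_key_secret>")

        item = LogItem()
        item.set_time(int(time.time()))
        item.push_back("level", "info")
        item.push_back("message", "hello")

        # Positional arguments: project, Logstore, topic, source, log items.
        request = PutLogsRequest("my-project", "source-logstore", "", "", [item],
                                 hashKey="7f000000000000000000000000000000")
        client.put_logs(request)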

    Memory usage

    The memory usage threshold of a concurrent unit in a data transformation job is 6 GB. If the threshold is exceeded, the performance of the job is throttled, and data transformation is delayed.

    The threshold is typically exceeded because a large number of log groups are pulled at the same time. You can reduce the value of the system.process.batch_size advanced parameter to lower the memory usage.

    Important

    The value of the system.process.batch_size advanced parameter must be a positive integer that is less than or equal to 1000. Both the default value and the maximum value are 1000.
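
    For reference, a reduced batch size might look as follows in the advanced parameter settings of a job. The snippet is written as a Python dictionary for illustration; the exact format depends on how you configure the job, and the value 500 is only an example.

        # Hypothetical advanced parameter settings for a data transformation job.
        # A smaller system.process.batch_size reduces the number of log groups that
        # are pulled per batch and therefore the memory usage of a concurrent unit.
        advanced_parameters = {
            "system.process.batch_size": "500",  # default and maximum values are 1000
        }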

    CPU utilization

    The CPU utilization threshold for a concurrent unit of a data transformation job is 100%. If you require higher overall CPU utilization, increase the concurrency of the job as described in the Runtime concurrency item.

    Data amount in a dimension table

    The maximum number of data entries allowed in a dimension table is 2 million, and the data in a dimension table can occupy up to 2 GB of memory. If either limit is exceeded, the dimension table is truncated, and only the data within the limits can be used. This limit applies to the res_rds_mysql, res_log_logstore_pull, and res_oss_file functions. For more information, see res_rds_mysql, res_log_logstore_pull, and res_oss_file. An example of loading a dimension table is shown at the end of this item.

    Important

    If a single data transformation job consumes data from multiple dimension tables, the tables must conform to the limits as a whole. We recommend that you minimize the amount of data in each dimension table.
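
    The following sketch shows data transformation code that loads a dimension table from ApsaraDB RDS for MySQL and uses it to enrich events. The connection values, table name, and field names are placeholders; keep the loaded data within the preceding limits.

        e_table_map(
            res_rds_mysql(
                address="rm-xxxxxxxx.mysql.rds.aliyuncs.com",  # placeholder RDS endpoint
                username="<username>",
                password="<password>",
                database="ops",
                table="host_info",
            ),
            "hostname",               # event field that is matched against the table
            ["owner", "datacenter"],  # columns copied from the matched row into the event
        )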

  • Result data writing

    Data writing to a destination Logstore

    When transformation results are written to a destination Logstore, the write limits of the Logstore cannot be exceeded. For more information, see Basic resources and Data read and write.

    If you configure the hash_key_field or hash_key parameter to write data to a destination Logstore in KeyHash mode when you call the e_output or e_coutput function, we recommend that you appropriately allocate hash keys and shards to minimize uneven data distribution, as shown in the example after this item.

    You can locate a write limit error based on the logs that record data transformation jobs. For more information, see View error logs.

    Important

    If a write limit error occurs when the results of a data transformation job are written to a destination Logstore, the job repeatedly retries the writes to ensure that the transformation results are complete. In this case, the progress of the data transformation job slows down, and the transformation of data in the source shards is delayed.
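
    The following sketch shows a transformation statement that writes results in KeyHash mode. The storage destination name target0 and the field uid are placeholders; choose a field whose values are widely distributed so that writes spread evenly across the shards of the destination Logstore.

        e_output(
            name="target0",          # storage destination configured for the job
            hash_key_field="uid",    # hash the write destination by the value of the uid field
        )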

    Cross-region data transmission

    When data is transferred across regions by using a public endpoint, network quality cannot be ensured. In this case, a network error may occur when the results of a data transformation job are written to a destination Logstore. This delays the progress of the entire data transformation job. For more information about Simple Log Service endpoints, see Endpoints.

    To improve the stability of network transmission, we recommend that you enable the global acceleration feature for your project and specify a global acceleration endpoint in your data transformation job. For more information, see Enable the global acceleration feature.
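
    For reference, the following sketch constructs a client against the global acceleration endpoint instead of a regional public endpoint. It assumes the aliyun-log-python-sdk package, that the global acceleration feature is enabled for the project, and that log-global.aliyuncs.com is the acceleration endpoint in your environment; the same endpoint value can be used wherever a data transformation job asks for the endpoint of a destination project.

        from aliyun.log import LogClient

        # Global acceleration endpoint instead of, for example, cn-hangzhou.log.aliyuncs.com.
        client = LogClient("log-global.aliyuncs.com", "<access_key_id>", "<access_key_secret>")
        client.list_shards("destination-project", "destination-logstore").log_print()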