Monitoring data transformation jobs - Simple Log Service

Metric data

Operational metrics for Data Transformation (New Version) jobs are available only after SLS Job Operational Logs is enabled. For more information, see Manage service logs.

Dashboard

Simple Log Service automatically creates a dashboard on each data transformation job's details page, displaying its operational metrics.

Procedure

Log on to the Simple Log Service console.
In the Projects section, click the project that you want to manage.
In the left-side navigation pane, choose Job Management > Data Transformation.
Click the target data transformation job and view the dashboard in the Execution Status section.

Overall metrics

The dashboard includes the following key metrics:

Processing Rate: The data processing rate, measured in events per second. By default, this metric is calculated over a 1-minute window within a 1-hour period.
- ingest: The number of events read from all shards in the source logstore.
- deliver: The number of events successfully written to the destination logstore.
- failed: The number of events that were read from the source logstore but failed during transformation.
Total Events Read: The total number of events read from all shards in the source logstore. The default statistical period is one day.
Total Events Delivered: The total number of events successfully written to all destination logstores. The default statistical period is one day.
Total Events Failed: The total number of events read from the source logstore that failed during transformation. The default statistical period is one day.
Event Delivery Ratio: The ratio of events successfully delivered to the destination logstore to the total events read from the source logstore. The default statistical period is one day.
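
The relationship between the read and delivered totals can be sketched in Python. This is an illustrative calculation only, not the dashboard's implementation; the function name `delivery_ratio` and the sample counts are assumptions made for the example.

```python
def delivery_ratio(ingested: int, delivered: int) -> float:
    """Return delivered / ingested, or 0.0 when nothing was read."""
    if ingested <= 0:
        return 0.0
    return delivered / ingested


# Hypothetical one-day totals for a single job.
total_events_read = 1000
total_events_delivered = 950

ratio = delivery_ratio(total_events_read, total_events_delivered)
print(f"Event Delivery Ratio: {ratio:.1%}")
```

The guard against a zero denominator matters for jobs that have not read any events in the selected period.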

Shard details

Shard-level metrics are collected at one-minute intervals as the job reads data from the source Logstore.

Shard Consumption Latency (s): For each shard, this is the time difference (in seconds) between the most recent event's ingestion time and the currently processed event's ingestion time. This indicates the processing delay.
Active Shard Statistics: The default statistical period is one hour.
- shard: The ID of the shard.
- ingest: The number of raw events read from the shard.
- failed: The number of raw events read from the shard that failed during transformation.
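
As a rough illustration of the latency definition above, the following Python sketch subtracts the ingestion time of the event currently being processed from the ingestion time of the newest event in the same shard. The function name, the use of Unix timestamps in seconds, and the sample values are assumptions for the example; the service computes this metric internally.

```python
def shard_consumption_latency(latest_ingest_ts: int, current_ingest_ts: int) -> int:
    """Seconds between the newest event in a shard and the event being processed."""
    # Clamp at zero so clock skew never yields a negative latency.
    return max(0, latest_ingest_ts - current_ingest_ts)


# Hypothetical shards: shard ID -> (latest ingestion time, currently processed ingestion time)
shards = {
    0: (1700000300, 1700000290),
    1: (1700000300, 1700000100),
}

latencies = {
    shard_id: shard_consumption_latency(latest, current)
    for shard_id, (latest, current) in shards.items()
}
print(latencies)  # {0: 10, 1: 200}
```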

Runtime exceptions

Error details are available in the message field.

For example, the log table on the error details page contains four columns: time, level, action, and message. When the level is ERROR and the action is deliver, the message field might display an error like {"Code":"InvalidArgs","Message":"failed to get sts token: ...The role not exists: acs:ram::*:role/test-role."}. This message indicates that the system failed to obtain an STS token because the specified RAM role does not exist.
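
When the message field holds a JSON object like the one in the example above, the error code and description can be pulled out programmatically. The following is a minimal sketch that assumes the message is either a JSON object with `Code` and `Message` keys or plain text; the helper name `parse_error` is hypothetical, and the elided part of the sample message is kept exactly as shown above.

```python
import json

# Sample message copied from the example above, including the elided "..." part.
message = (
    '{"Code":"InvalidArgs","Message":"failed to get sts token: '
    '...The role not exists: acs:ram::*:role/test-role."}'
)


def parse_error(raw: str):
    """Return (code, detail); code is None when the message is not a JSON object."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return None, raw
    if not isinstance(payload, dict):
        return None, raw
    return payload.get("Code"), payload.get("Message", raw)


code, detail = parse_error(message)
print(code)
print(detail)
```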

Alert rules

You can create alert rules based on the operational metrics in Metric data to monitor processing latency, exceptions, and traffic changes. For more information about alerting, see Alerts. To create an alert rule, see Create a log-based alert rule.

Important

When creating an alert rule for a data transformation job, ensure the query targets the same project and logstore where the job's operational logs are stored. For more information, see Manage service logs.

In the Query and Analyze dialog box, click the Advanced Settings tab. Set Type to Logstore and Authorization Method to Default. Select the target Region, and enter the names for the project and logstore. As needed, configure dedicated SQL (you can select Auto, Enable, or Disable) and the time range. Then, click Confirm.

Monitoring processing latency

Item	Description
Purpose	Monitors the shard consumption latency in a data transformation job. An alert is triggered if the processing latency exceeds the specified threshold.
Associated dashboard metric	See Shard Consumption Latency (s).
Sample analysis query	Replace `{job_name}` in the following query with the name of your data transformation job. `__topic__: etl_metrics and job_name: {job_name} and "_etl_:connector_meta.action": ingest | select split_part( "_etl_:connector_meta.task_name", '#', 2 ) as shard, max_by("_etl_:connector_metrics.lags", __time__) as lags group by shard having shard is not null limit all`
Alert rule settings	Set Trigger Condition to Data Matches Expression. Set the evaluation expression to `lags > 120`, which sets the latency threshold to 120 seconds. Set the time range to 5 minutes and Check Frequency to 5 minutes. Note: We recommend these settings to avoid false alarms caused by the periodic metric updates (every 1 minute) or by temporary latency during sudden data spikes.
How to resolve alerts	Work through the following checks in order: (1) If the job was recently created and is processing historical data, it needs time to clear the backlog. Monitor the latency for one hour. If it does not fall below the alert threshold, go to the next check. (2) If the data volume in the source logstore has increased significantly, compare the two dashboard metrics. If Processing Rate (events/s) is rising while Shard Consumption Latency (s) is falling, the job is automatically scaling its resources to handle the additional data. Monitor the latency for 5 minutes to see whether it returns to the normal range. If Processing Rate (events/s) is not rising, or Shard Consumption Latency (s) is still trending upward, the source logstore may have too few shards, which limits how far the job's resources can scale. Manually split the shards of the source logstore (for specific steps, see Manage Shards), then monitor for 5 minutes to check whether the latency falls below the alert threshold. If it does not, go to the next check. (3) If an alert for processing exceptions is active, resolve that alert first, then monitor for 5 minutes to check whether the latency falls below the alert threshold.
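
The effect of the `lags > 120` expression can be sketched in Python against the rows that the sample query returns (one row per shard with `shard` and `lags` columns). The row layout as a list of dictionaries with string values, and the function name, are assumptions for illustration; the alerting system evaluates the expression itself.

```python
LATENCY_THRESHOLD_S = 120  # same threshold as the `lags > 120` expression


def shards_over_threshold(rows, threshold=LATENCY_THRESHOLD_S):
    """Return the IDs of shards whose latest latency exceeds the threshold."""
    breaching = []
    for row in rows:
        if float(row["lags"]) > threshold:
            breaching.append(row["shard"])
    return breaching


# Hypothetical query result.
rows = [
    {"shard": "0", "lags": "35"},
    {"shard": "1", "lags": "120"},
    {"shard": "2", "lags": "486"},
]
print(shards_over_threshold(rows))  # ['2']
```

Because the comparison is strict, a shard whose latency is exactly 120 seconds does not trigger the alert.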

Monitoring processing exceptions

Item	Description
Purpose	Triggers an alert when an exception occurs during a data transformation job.
Associated dashboard metric	See Runtime exceptions.
Sample analysis query	Replace `{job_name}` in the following query with the name of your data transformation job. `__topic__: etl_metrics and job_name: {job_name} and "_etl_:connector_metrics.error": * | select distinct "_etl_:connector_metrics.error" as errors`
Alert rule settings	Set Trigger Condition to Data Is Returned. Set the time range to 10 minutes and Check Frequency to 10 minutes.
How to resolve alerts	Troubleshoot based on the error message: (1) If the message contains `Invalid SPL query`, the job's SPL query has a syntax error. Correct the query based on the details in the message. For more information, see SPL syntax. (2) If the message contains `Unauthorized`, `InvalidAccessKeyId`, or `SignatureNotMatch`, the job lacks the permissions required to read data from the source logstore or write data to the destination logstore. For more information, see Authorization. (3) If the message contains `ProjectNotExist` or `LogStoreNotExist`, the specified project or logstore does not exist. Log on to the Simple Log Service console to check and resolve the issue.
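
The triage described in this table can be sketched as a simple substring lookup in Python. Only the marker strings come from the table above; the function name, the advice wording, and the sample messages are assumptions for the example.

```python
# (markers, advice) pairs, checked in order.
REMEDIATIONS = [
    (("Invalid SPL query",),
     "Fix the syntax error in the job's SPL query."),
    (("Unauthorized", "InvalidAccessKeyId", "SignatureNotMatch"),
     "Check the job's permissions on the source and destination logstores."),
    (("ProjectNotExist", "LogStoreNotExist"),
     "Verify that the configured project and logstore exist."),
]


def suggest_fix(error_message: str) -> str:
    """Map an error message to the matching troubleshooting hint."""
    for markers, advice in REMEDIATIONS:
        if any(marker in error_message for marker in markers):
            return advice
    return "No known pattern matched; inspect the full error message."


# Hypothetical error messages.
print(suggest_fix("LogStoreNotExist: logstore target-store does not exist"))
print(suggest_fix("connection reset by peer"))
```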

Monitoring written data volume ratio (period-over-period)

Item	Description
Purpose	Triggers an alert based on period-over-period changes in the data delivery ratio (written volume vs. read volume). The rule compares the current ratio to the same period from the previous day and week, triggering an alert if the change exceeds your configured growth or decline thresholds.
Associated dashboard metric	Event Delivery Ratio: The ratio of events successfully delivered to the destination logstore to the total events read from the source logstore. The default statistical period is one day.
Sample analysis query	Enter the following query in the Query and Analyze dialog box when you create the alert rule. Replace `{job_name}` in the following query with the name of your data transformation job. `__topic__: etl_metrics and job_name: {job_name} | select round(diff [1], 1) as percent, round(coalesce(diff [4], 0), 1) as ratio_1d, round(coalesce(diff [5], 0), 1) as ratio_1w from( select compare(percent, 86400, 604800) as diff FROM ( select deliver /(ingest + 0.0001) as percent from( select sum( if( "_etl_:connector_meta.action" = 'ingest', "_etl_:connector_metrics.native_bytes", 0 ) ) as ingest, sum( if( "_etl_:connector_meta.action" = 'deliver', "_etl_:connector_metrics.native_bytes", 0 ) ) as deliver FROM log ) ) )`
Alert rule settings	Set Trigger Condition to Data Matches Expression. Set the evaluation expression to `(ratio_1d > 1.2 || ratio_1d < 0.8) && (ratio_1w > 1.2 || ratio_1w < 0.8)`. This sets the daily/weekly growth and decline threshold to 20%, and the alert fires only when both the day-over-day ratio and the week-over-week ratio are outside that range. Set the time range to 1 hour and Check Frequency to 1 hour. Note: To avoid false alarms from periodic fluctuations in raw data traffic, we recommend setting the daily/weekly growth and decline thresholds to at least 20%, or adjusting the comparison period to match the cycle of your raw data traffic.
How to resolve alerts	To resolve these alerts: If the data volume in the source logstore has changed, check for new data patterns being ingested or for interruptions in existing data streams. If this is the case and the resulting data change aligns with the metric, the alert is caused by the change in the source data pattern. Otherwise, proceed to the next step. If you have active alerts for processing latency or exceptions, resolve those first.
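
To show how `ratio_1d` and `ratio_1w` are derived, the following Python sketch mimics the query's arithmetic. It assumes that `compare(x, 86400, 604800)` yields an array of the form [current, value one day ago, value one week ago, current / one day ago, current / one week ago], which is how the query indexes the result (`diff [1]`, `diff [4]`, `diff [5]`); check the compare function reference before relying on this. All byte counts are made up.

```python
def delivery_percent(deliver_bytes: float, ingest_bytes: float) -> float:
    """Mirror `deliver / (ingest + 0.0001)`; the small constant avoids division by zero."""
    return deliver_bytes / (ingest_bytes + 0.0001)


def compare(current, day_ago, week_ago):
    """Assumed shape of the compare() result, as a 0-indexed Python list."""
    def ratio(now, before):
        return now / before if before else None
    return [current, day_ago, week_ago, ratio(current, day_ago), ratio(current, week_ago)]


# Hypothetical (deliver, ingest) byte totals for the same one-hour window.
now = delivery_percent(600, 1000)
one_day_ago = delivery_percent(900, 1000)
one_week_ago = delivery_percent(950, 1000)

diff = compare(now, one_day_ago, one_week_ago)

# SQL arrays are 1-indexed, so diff[4] and diff[5] in the query are diff[3] and diff[4] here.
# `or 0` plays the role of coalesce(..., 0) when no earlier value exists.
percent = round(diff[0], 1)
ratio_1d = round(diff[3] or 0, 1)
ratio_1w = round(diff[4] or 0, 1)
print(percent, ratio_1d, ratio_1w)
```

In this made-up case both ratios fall below 0.8, so the alert expression would evaluate to true.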

Monitoring source event count (period-over-period)

Item	Description
Purpose	Triggers an alert if the number of events read by the data transformation job changes significantly compared to the previous day and week. The rule fires if the event count exceeds a growth threshold or falls below a decline threshold.
Associated dashboard metric	Total Events Read: The total number of events read from all shards in the source logstore. The default statistical period is one day.
Sample analysis query	Enter the following query in the Query and Analyze dialog box when you create the alert rule. Replace `{job_name}` in the following query with the name of your data transformation job. `__topic__: etl_metrics and job_name: {job_name} and "_etl_:connector_meta.action": ingest | select diff [1] as events, round(coalesce(diff [4], 0), 1) as ratio_1d, round(coalesce(diff [5], 0), 1) as ratio_1w from( select compare(events, 86400, 604800) as diff FROM ( select sum("_etl_:connector_metrics.events") as events FROM log ) )`
Alert rule settings	Set Trigger Condition to Data Matches Expression. Set the evaluation expression to `(ratio_1d > 1.2 || ratio_1d < 0.8) && (ratio_1w > 1.2 || ratio_1w < 0.8)`. This sets the daily/weekly growth and decline threshold to 20%, and the alert fires only when both the day-over-day ratio and the week-over-week ratio are outside that range. Set the time range to 1 hour and Check Frequency to 1 hour. Note: To avoid false alarms from periodic fluctuations in raw data traffic, we recommend setting the daily/weekly growth and decline thresholds to at least 20%, or adjusting the comparison period to match the cycle of your raw data traffic.
How to resolve alerts	To resolve these alerts: If the trend of this metric matches the growth or decline in the event count of the source logstore, the change is caused by the source data volume. Otherwise, proceed to the next step. If you have active alerts for processing latency or exceptions, resolve those first.
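
The evaluation expression used in the period-over-period rules can be mirrored in Python to test thresholds before configuring them. This is only a sketch of the boolean logic; the function names and the sample ratios are assumptions.

```python
def outside_band(ratio: float, lower: float = 0.8, upper: float = 1.2) -> bool:
    """True when the ratio grew above `upper` or declined below `lower`."""
    return ratio > upper or ratio < lower


def should_alert(ratio_1d: float, ratio_1w: float) -> bool:
    """Mirror (ratio_1d > 1.2 || ratio_1d < 0.8) && (ratio_1w > 1.2 || ratio_1w < 0.8)."""
    return outside_band(ratio_1d) and outside_band(ratio_1w)


# Hypothetical ratios returned by the sample query.
print(should_alert(1.5, 1.3))  # True: grew against both yesterday and last week
print(should_alert(1.5, 1.0))  # False: in line with last week, so likely a weekly cycle
print(should_alert(0.5, 0.7))  # True: declined against both reference periods
```

Requiring both comparisons to be out of range is what suppresses alerts for traffic that merely follows a daily or weekly pattern.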

Monitoring delivered event count (period-over-period)

Item	Description
Purpose	Triggers an alert if the number of events written by the data transformation job changes significantly compared to the previous day and week. The rule fires if the event count exceeds a growth threshold or falls below a decline threshold.
Associated dashboard metric	Total Events Delivered
Sample analysis query	Enter the following query in the Query and Analyze dialog box when you create the alert rule. Replace `{job_name}` in the following query with the name of your data transformation job. `__topic__: etl_metrics and job_name: {job_name} and "_etl_:connector_meta.action": deliver | select diff [1] as events, round(coalesce(diff [4], 0), 1) as ratio_1d, round(coalesce(diff [5], 0), 1) as ratio_1w from( select compare(events, 86400, 604800) as diff FROM ( select sum("_etl_:connector_metrics.events") as events FROM log ) )`
Alert rule settings	Set Trigger Condition to Data Matches Expression. Set the evaluation expression to `(ratio_1d > 1.2 || ratio_1d < 0.8) && (ratio_1w > 1.2 || ratio_1w < 0.8)`. This sets the daily/weekly growth and decline threshold to 20%, and the alert fires only when both the day-over-day ratio and the week-over-week ratio are outside that range. Set the time range to 1 hour and Check Frequency to 1 hour. Note: To avoid false alarms from periodic fluctuations in raw data traffic, we recommend setting the daily/weekly growth and decline thresholds to at least 20%, or adjusting the comparison period to match the cycle of your raw data traffic.
How to resolve alerts	To resolve these alerts: If the trend of this metric matches the growth or decline in data volume of the source logstore, the change is caused by the source data volume. Otherwise, proceed to the next step. If you have active alerts for processing latency or exceptions, resolve those first.
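
The source and delivered event-count queries differ only in the action they filter on, so both can be generated from one template. The following Python sketch fills in `{job_name}` and the action; the helper name `build_event_count_query` and the job name `my-etl-job` are hypothetical, and the query text is the sample query from the tables above.

```python
EVENT_COUNT_QUERY = (
    '__topic__: etl_metrics and job_name: {job_name} and '
    '"_etl_:connector_meta.action": {action} | '
    'select diff [1] as events, '
    'round(coalesce(diff [4], 0), 1) as ratio_1d, '
    'round(coalesce(diff [5], 0), 1) as ratio_1w '
    'from( select compare(events, 86400, 604800) as diff FROM ( '
    'select sum("_etl_:connector_metrics.events") as events FROM log ) )'
)


def build_event_count_query(job_name: str, action: str) -> str:
    """Return the period-over-period event-count query for 'ingest' or 'deliver'."""
    if action not in ("ingest", "deliver"):
        raise ValueError(f"unsupported action: {action!r}")
    if not job_name.strip():
        raise ValueError("job_name must not be empty")
    return EVENT_COUNT_QUERY.format(job_name=job_name, action=action)


query = build_event_count_query("my-etl-job", "deliver")
print(query)
```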