
Realtime Compute for Apache Flink:FAQ about the monitoring and alerting feature and logs

Last Updated: Mar 17, 2025

This topic provides answers to some frequently asked questions about the monitoring and alerting feature and logs.

How do I check the type of the monitoring and alerting service of a workspace?

The type of the monitoring and alerting service of a workspace is selected when you create the workspace. After the workspace is created, you can check the type by performing the following steps: In the left-side navigation pane of the development console of Realtime Compute for Apache Flink, choose O&M > Deployments. On the Deployments page, find the desired deployment and click its name. If the Alarm tab appears, the workspace uses the pay-as-you-go Prometheus monitoring service, Managed Service for Prometheus of Application Real-Time Monitoring Service (ARMS). If the Alarm tab does not appear, the workspace uses the free monitoring service CloudMonitor. For more information about how to configure the different types of monitoring and alerting services, see Deployment monitoring and alerting.


What are the limits of the alerting feature of CloudMonitor compared with ARMS?

  • Query and analysis statements are not supported.

  • Only the current metric charts of a deployment are displayed. Historical metric charts are unavailable. This makes it inefficient to compare records per second (RPS) across multiple rounds of optimization.

  • Metric charts of individual subtasks are unavailable. In scenarios that involve multiple sources and subtasks, latency issues that occur after clustering cannot be identified in an intuitive and efficient manner.

  • You cannot view metrics that are reported by internal code instrumentation. This may make troubleshooting less convenient.

How do I configure or add an alert contact?

When you configure alert rules in the CloudMonitor console or the ARMS console, you must configure or add an alert contact in the related console. For more information about how to configure alert rules, see Configure monitoring alerts.

If your workspace uses Managed Service for Prometheus of ARMS and you configure alert rules for the metrics of a single deployment or for deployment failure events in the development console of Realtime Compute for Apache Flink, perform the following operations to configure or add alert contacts:

  1. Go to the Alarm tab.

    1. Log on to the management console of Realtime Compute for Apache Flink. Find the workspace that you want to manage and click Console in the Actions column.

    2. In the left-side navigation pane of the development console of Realtime Compute for Apache Flink, choose O&M > Deployments. On the Deployments page, find the desired deployment and click the name of the deployment.

    3. Click the Alarm tab.

  2. On the Alarm tab, click the Alarm Rules tab. In the upper-right corner of the Alarm Rules tab, choose Add Rule > Custom Rule to go to the Create Rule panel.

  3. Configure or add an alert contact.

    • Add an alert contact.

      Click Notification object management next to the Notification object parameter to add a contact or a DingTalk chatbot. For more information about how to add a DingTalk or Lark chatbot and a webhook, see FAQ.

      After you add an alert contact by phone, make sure that the phone number of the recipient passes verification. Otherwise, the configuration does not take effect. If Unverified appears in the Phone column of a contact on the Contact tab, click Unverified to complete verification.


    • Configure an alert contact.

      Select the alert contact that you want to notify from the Notification object drop-down list. If no alert contacts exist, add an alert contact by following the preceding steps.

How do I deactivate Managed Service for Prometheus that is automatically activated for Realtime Compute for Apache Flink?

If you select Pay-as-you-go Prometheus monitoring service when you create the workspace, ARMS is automatically activated. If you no longer need to monitor Realtime Compute for Apache Flink, perform the following steps to deactivate Managed Service for Prometheus:

Important

You can uninstall a Prometheus instance to deactivate Managed Service for Prometheus for a workspace. After you deactivate Managed Service for Prometheus for a workspace, all metrics in the workspace are discarded. If an exception occurs when you run a deployment, the time at which the exception first occurs cannot be determined and alerts cannot be reported. Proceed with caution.

  1. Log on to the Managed Service for Prometheus console.

  2. In the left-side navigation pane, click Instances to go to the Instances page.

  3. Select the ID or name of the workspace that you want to manage from the Filter by Tag drop-down list.

  4. Find the instance whose value in the Instance Type column is Prometheus for Flink Serverless and click Uninstall in the Actions column.

  5. In the message that appears, click OK.

How do I find the deployment that triggers an alert?

An alert event contains the job ID and the deployment ID. However, the job ID changes after a deployment failover occurs. In this case, you must use the deployment ID to find the deployment for which an error is returned. You can use one of the following methods to view the deployment ID:

  • In the left-side navigation pane of the development console of Realtime Compute for Apache Flink, click Deployments. On the Deployments page, find the desired deployment and click the name of the deployment. On the Configuration tab, view the deployment ID in the Basic section.


  • View the deployment ID in the URL of the deployment.


How do I configure alert rules for a deployment to deal with issues such as deployment restarts?

If you configure alert rules for a deployment in the development console of Realtime Compute for Apache Flink, the rules are based on the metrics of Realtime Compute for Apache Flink. In this case, metric curves cannot be displayed and alerts cannot be triggered when a deployment failover occurs. To deal with events that have a significant impact on your business, such as deployment restarts, you can configure custom rules in the ARMS console based on the flink_jobmanager_job_numRestarts metric to trigger alerts for TaskManager failovers. The flink_jobmanager_job_numRestarts metric records the number of times a job has restarted, and the irate() function in PromQL is used to compute its instantaneous rate of change. To configure alert rules, perform the following steps:

  1. Log on to the management console of Realtime Compute for Apache Flink.

  2. Find the workspace that you want to manage and choose More > Monitoring Indicator Configuration to go to the ARMS console.

  3. In the left-side navigation pane of the ARMS console, click Alert rules. On the Prometheus Alert Rules page, click Create Prometheus Alert Rule.

  4. On the Create Prometheus Alert Rule page, select Custom PromQL for the Check Type parameter and select an instance from the Prometheus Instance drop-down list.

  5. Write code for the custom PromQL statements.

    For example, if you configure irate(flink_jobmanager_job_numRestarts{jobId=~"$jobId",deploymentId=~"$deploymentId"}[1m])>0, the flink_jobmanager_job_numRestarts metric is queried over the last minute, and an alert is triggered when the instantaneous rate of change is greater than 0. A sample expression is provided after these steps.

  6. Click Complete.
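
The following sketch shows what such a rule might look like after the template variables are replaced with concrete values. The deployment ID below is a hypothetical placeholder; replace it with the ID of your own deployment. The second expression is an optional variant that covers every deployment that reports to the same Prometheus instance.

# Hypothetical deployment ID; replace it with the ID of your deployment.
irate(flink_jobmanager_job_numRestarts{deploymentId=~"a1b2c3d4-0000-0000-0000-000000000000"}[1m]) > 0

# Variant: alert on restarts of any deployment, grouped by deployment ID.
sum by (deploymentId) (irate(flink_jobmanager_job_numRestarts[1m])) > 0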

How do I configure the log level for a single class?

Configure class-level log settings in the Log Levels field in the Logging section of the Configuration tab. You cannot configure them in the Other Configuration field of the Parameters section. For example, when you use the Kafka connector, specify log4j.logger.org.apache.kafka.clients.consumer=trace for a Kafka source table and log4j.logger.org.apache.kafka.clients.producer=trace for a Kafka sink table in the Log Levels field.
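
As a minimal sketch, the Log Levels field for the preceding Kafka example would contain the following two entries. The log4j.logger.<logger name>=<level> format is the one shown above; adjust the logger names to the classes or packages whose log level you want to change.

log4j.logger.org.apache.kafka.clients.consumer=trace
log4j.logger.org.apache.kafka.clients.producer=trace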

How do I enable GC logging?

In the left-side navigation pane of the development console of Realtime Compute for Apache Flink, choose O&M > Deployments. On the Deployments page, find the desired deployment and click the name of the deployment. On the Configuration tab of the deployment details page, click Edit in the upper-right corner of the Parameters section, add the following code to the Other Configuration field, and then click Save to make the configuration take effect:

env.java.opts: >-
-XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/flink/log/gc.log
-XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=2 -XX:GCLogFileSize=50M
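
The flags above target JDK 8. If your deployment runs on JDK 11 or later, these legacy GC logging flags may not be recognized; a unified-logging equivalent might look like the following sketch. This is an assumption about the JVM version, so verify which JDK your deployment actually uses before you apply it.

env.java.opts: >-
  -Xlog:gc*:file=/flink/log/gc.log:time,uptime:filecount=2,filesize=50M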


What do I do if a deployment startup error is reported after I configure parameters to export the logs of the deployment to Simple Log Service?

  • Problem description

    After the parameters are configured to export the logs of the deployment to Simple Log Service, the "Failed to start the deployment. Try again." error message appears during the startup of the deployment and the following error message is also reported:

    Unknown ApiException {exceptionType=com.ververica.platform.appmanager.controller.domain.TemplatesRenderException, exceptionMessage=Failed to render {userConfiguredLoggers={}, jobId=3fd090ea-81fc-4983-ace1-0e0e7b******, rootLoggerLogLevel=INFO, clusterName=f7dba7ec27****, deploymentId=41529785-ab12-405b-82a8-1b1d73******, namespace=flinktest-default, priorityClassName=flink-p5, deploymentName=test}}
    029999 202312121531-8SHEUBJUJU
  • Cause

    The values of the variables in the Twig templates, such as namespace and deploymentId, are changed when you configure the parameters to export the logs of the deployment to Simple Log Service. An illustrative snippet is provided at the end of this section.


  • Solution

    Reconfigure the parameters based on your business requirements. For more information, see Configure parameters to export logs of a deployment.
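
To illustrate the cause described above: the logging configuration is rendered from a template in which placeholders such as {{ namespace }} and {{ deploymentId }} are filled in by the platform. The fragment below is hypothetical and only shows the placeholder syntax; the point is that these placeholders must be left intact rather than replaced with literal values.

<!-- Hypothetical fragment of a rendered logging template.
     {{ namespace }} and {{ deploymentId }} are placeholders that the platform
     fills in when it renders the template; do not edit or remove them. -->
<Property name="namespace">{{ namespace }}</Property>
<Property name="deploymentId">{{ deploymentId }}</Property>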

How do I view, search for, and analyze the historical operational logs of a Realtime Compute for Apache Flink deployment?

You can view and analyze the historical operational logs of a deployment in the development console of Realtime Compute for Apache Flink or in an external storage.

  • View and analyze the historical operational logs of a deployment on the Logs tab in the development console of Realtime Compute for Apache Flink

    In the Logging section of the Configuration tab, turn on Allow Log Archives to enable the log archiving feature and configure the Log Archives Expires parameter. By default, Allow Log Archives is turned on and the Log Archives Expires parameter is set to 7 days. The system retains the latest 5 MB of operational logs.


  • View and analyze the historical operational logs of a deployment in an external storage

    You can also configure parameters to export deployment logs to an external storage, such as Object Storage Service (OSS), Simple Log Service, or Kafka, and specify the level of the logs that you want to export. For more information, see Configure parameters to export logs of a deployment.

How do I resolve the issue that the logs generated by using a non-static method cannot be exported to Simple Log Service?

  • Problem description

    The export of logs to Simple Log Service relies on the Logger and Appender logic of the Log4j appender. As a result, logs that are generated by using a non-static method cannot be exported to Simple Log Service.

  • Solution

    Declare the logger as a static final field, for example, private static final Logger LOG = LoggerFactory.getLogger(xxx.class);. A sketch of this pattern is shown below.
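
    A minimal sketch of this pattern follows. The class name is hypothetical; keep the argument of LoggerFactory.getLogger(...) in sync with the enclosing class.

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    // Hypothetical class name; the point is the static final logger field.
    public class MyRecordProcessor {

        // Declared as a static final field so that the Log4j-based appender
        // used for Simple Log Service can capture its output.
        private static final Logger LOG = LoggerFactory.getLogger(MyRecordProcessor.class);

        public void process(String record) {
            LOG.info("Processing record: {}", record);
        }
    }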

What do I do if Kafka can receive data that is written from Realtime Compute for Apache Flink but the value in the Records Received column on the Status tab of the related deployment is 0?

  • Problem description

    All operators of the deployment are chained into a single operator. A source operator has no input and only output, and a sink operator has only input and no output. As a result, I cannot view the amount of data that is read and written in the deployment topology.

  • Solution

    Split the operators so that you can view the amount of data in the deployment topology. Disable operator chaining so that the source and sink operators become independent operators that are connected to the other operators in the topology. You can then view the data flow and traffic between the operators in the new topology.

    In the left-side navigation pane of the development console of Realtime Compute for Apache Flink, choose O&M > Deployments. On the Deployments page, find the desired deployment and click the name of the deployment. On the Configuration tab of the deployment details page, click Edit in the upper-right corner of the Parameters section, add the pipeline.operator-chaining: 'false' configuration to the Other Configuration field to split the operators, and then click Save to make the configuration take effect. A DataStream API alternative is sketched below.
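
    For a DataStream deployment, the same effect can also be achieved in code. The following is a sketch that uses the open source Apache Flink API StreamExecutionEnvironment.disableOperatorChaining(); the class name and the trivial pipeline are placeholders.

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    // Sketch: disable operator chaining programmatically in a DataStream job.
    public class DisableChainingExample {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            // Split every operator into its own vertex so that per-operator
            // Records Received and Records Sent values become visible.
            env.disableOperatorChaining();
            env.fromElements("a", "b", "c").print(); // replace with your actual pipeline
            env.execute("disable-chaining-example");
        }
    }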

What do I do if a DataStream deployment is not delayed but the values of delay-related metrics for output data indicate a delay in the deployment?

  • Problem description

    Data is continuously read by using a source table of Realtime Compute for Apache Flink, and the Kafka connector continuously writes the data to each partition of an ApsaraMQ for Kafka physical table. However, the values of the CurrentEmitEventTimeLag and CurrentFetchEventTimeLag metrics for the DataStream deployment indicate that the deployment is delayed for 52 years.

  • Cause

    The Kafka connector in the DataStream deployment is provided by the Apache Flink community and is not a built-in connector that is supported by Realtime Compute for Apache Flink. Connectors that are provided by the Apache Flink community do not support metric-based monitoring in Realtime Compute for Apache Flink. As a result, the values of the metrics are abnormal.

  • Solution

    Use the Kafka connector that is built into Realtime Compute for Apache Flink instead. For more information about the dependencies of the built-in connectors, see Maven repository. A sample dependency declaration is sketched below.
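
    A dependency declaration for the built-in Kafka connector might look like the following sketch. The group ID, artifact ID, and version placeholder here are assumptions; verify the exact coordinates and version against the Maven repository referenced above.

    <!-- Sketch only: confirm groupId, artifactId, and version in the referenced Maven repository. -->
    <dependency>
        <groupId>com.alibaba.ververica</groupId>
        <artifactId>ververica-connector-kafka</artifactId>
        <version>${vvr.version}</version>
    </dependency>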

What do I do if the TaskManager logs of a DataStream deployment contain a NullPointerException error but do not provide the details of the error stack?

The JVM's fast-throw optimization (controlled by the OmitStackTraceInFastThrow option) omits the stack trace of an exception, such as a NullPointerException, that is repeatedly thrown from the same code location. To disable this optimization, in the left-side navigation pane of the development console of Realtime Compute for Apache Flink, choose O&M > Deployments. On the Deployments page, find the desired deployment and click the name of the deployment. On the Configuration tab of the deployment details page, click Edit in the upper-right corner of the Parameters section, add the following code to the Other Configuration field, and then click Save to make the configuration take effect:

env.java.opts: "-XX:-OmitStackTraceInFastThrow"