Realtime Compute for Apache Flink: Perform intelligent deployment diagnostics

Last Updated: Nov 16, 2023

The intelligent deployment diagnostics feature provided by Flink Advisor is used to monitor the health status of a deployment, analyze and diagnose error logs, exceptions, and risks of a deployment, and provide optimization suggestions based on the diagnostic results. This feature helps ensure the stability and reliability of your business. This topic describes how to use the intelligent deployment diagnostics feature.

Background information

Flink Advisor monitors the health of a deployment and reports the health score of any deployment in the RUNNING state in real time. The feature covers the whole workflow of Realtime Compute for Apache Flink, from draft development to deployment O&M. Throughout the lifecycle of a fully managed Flink deployment, the system analyzes all logs, events, metrics, and configurations of the deployment in real time. You can use this feature to diagnose error logs of a draft, check the health score of a deployment that is in the RUNNING state, and identify the root cause when a deployment is in an abnormal state. Root cause detection draws on the O&M experience of Alibaba Cloud technical experts with issues that frequently occur in fully managed Flink deployments. After a diagnosis is complete, fully managed Flink provides optimization suggestions based on the diagnostic results. This reduces the time required for data analysis, shortens the mean time to repair (MTTR), and helps keep your deployments stable and healthy. The following figure shows the capabilities of the intelligent deployment diagnostics feature.

[Figure: Capabilities of the intelligent deployment diagnostics feature]

Limits

Only streaming deployments support the intelligent deployment diagnostics feature. Batch deployments do not support this feature.

Diagnose error logs

Diagnose error logs of a draft

  1. Go to the SQL Editor page.

    1. Log on to the Realtime Compute for Apache Flink console.

    2. On the Fully Managed Flink tab, find the workspace that you want to manage and click Console in the Actions column.

    3. In the left-side navigation pane, click SQL Editor.

  2. Write SQL statements to develop a draft and click Validate.

    The system checks the SQL semantics of the draft, the network connectivity, and the metadata of the tables that the draft uses. You can also click SQL Advice in the calculated results to view SQL risks and related optimization suggestions.

  3. In the lower part of the SQL Editor page, view the error details, possible causes, and optimization suggestions.

    Note

    If you cannot identify the cause of the error or obtain optimization suggestions from the validation result, select the related logs and click Search in Documentation to find relevant information in the help documentation.

Diagnose error logs of a deployment

  1. Go to the Deployments page.

    1. Log on to the Realtime Compute for Apache Flink console.

    2. On the Fully Managed Flink tab, find the workspace that you want to manage and click Console in the Actions column.

    3. In the left-side navigation pane, click Deployments. On the Deployments page, click the name of the desired deployment.

  2. Click the Diagnostics tab.

  3. On the Diagnostics tab, click Running Logs, Startup Logs, or JM Exceptions to view the corresponding logs of the deployment.

    For more information, see View startup logs and operational logs of a job, View the exception logs of a deployment, and View the logs of a historical job.

Perform intelligent deployment diagnostics on abnormal deployments

  1. Go to the Diagnosis tab.

    1. Log on to the Realtime Compute for Apache Flink console.

    2. On the Fully Managed Flink tab, find the workspace that you want to manage and click Console in the Actions column.

    3. In the left-side navigation pane, click Deployments.

    4. Go to the Diagnosis tab.

      You can go to the Diagnosis tab by using one of the following methods:

      • In the deployment list, find the desired deployment and click the score of the deployment in the Health column.

      • On the Deployments page, click the name of the desired deployment and click the Diagnosis tab.

  2. Click Diagnose.

    Flink Advisor provides a variety of log repositories for Flink error logs. For more information about the diagnostic types, diagnostic phases, diagnostic items, and handling methods, see Flink Advisor diagnostic capabilities.

  3. View the diagnostic result and optimization suggestions.

    If you want to apply an optimization suggestion, you can click Apply on the right side of the optimization suggestion.

Flink Advisor diagnostic items

| Type | Phase | Item | Description |
|---|---|---|---|
| Exception (The execution of the deployment is affected.) | Startup | Startup file analysis | If the required JAR package does not exist in the Object Storage Service (OSS) directory, the deployment cannot be started. To resolve this issue, upload the JAR package again before you start the deployment. |
| | | Resource analysis | If the remaining available resources are insufficient, the deployment cannot be started. To resolve this issue, reduce the values of the resource configurations of the deployment or scale out the cluster to which the deployment belongs. |
| | | | If the Container Network Interface (CNI) fails to be bound to the deployment, the deployment cannot be started. To resolve this issue, check whether the number of IP addresses of the related vSwitch reaches the upper limit. |
| | | | If the number of IP addresses of Elastic network interfaces (ENIs) exceeds the upper limit, the deployment cannot be started. We recommend that you increase the number of ENIs and try again. |
| | | Topology network analysis | If no network connection is established between the TaskManager and JobManager, the deployment is abnormal. |
| | | | If the operation of binding ENIs to Elastic Compute Service (ECS) instances times out within the previous 10 minutes, the deployment starts at a low speed. We recommend that you wait for a period of time. |
| | | Network analysis of upstream and downstream services | If the Transmission Control Protocol (TCP) port detection is normal but the upstream or downstream connector is not connected, the deployment cannot be started. We recommend that you check the network configurations of the upstream and downstream services. |
| | | Permission detection of upstream and downstream services | If the upstream data source is not connected, the deployment cannot be started. We recommend that you check the permission configuration of the upstream service. |
| | | | If the downstream data source is not connected, the deployment cannot be started. We recommend that you check the permission configuration of the downstream service. |
| | | Startup speed analysis | If the JAR package of the deployment is excessively large, the deployment starts at a low speed. We recommend that you compress the JAR package and upload the package again or wait patiently. |
| | | JobGraph check | The configuration file of fully managed Flink of an earlier version may be missing. If this issue occurs, the deployment may not recover after the deployment fails. To resolve this issue, manually cancel and then start the deployment. |
| | | Session cluster check | A session cluster of fully managed Flink of an earlier version may be abnormal. If this issue occurs, the deployment is abnormal. |
| | Run | High availability (HA) status check | If HA is not enabled for the deployment, the deployment cannot recover after the deployment fails. To resolve this issue, publish the draft for the deployment again and manually cancel and then start the deployment. |
| | | Checkpoint check | The checkpoint feature of fully managed Flink of an earlier version may be abnormal. If this issue occurs, checkpointing may fail. |
| | | Permission detection of upstream and downstream services | If the TCP port detection is normal but the upstream or downstream connector is not connected, the deployment cannot be started. We recommend that you check the permission configurations of the upstream and downstream services. |
| | | Running status check | If an out-of-memory (OOM) error occurs in a TaskManager of a deployment, the deployment performs a failover. We recommend that you check the deployment configuration and increase the memory of the TaskManager. |
| | Cancellation | Cancellation speed analysis | In fully managed Flink of an earlier version, the process of canceling a deployment is slow. If the deployment is canceled at a low speed, manually cancel and then start the deployment. |
| Risk (The execution of the deployment is not affected.) | Additional Configuration | JobGraph check | The current status of the deployment is normal. However, the system detects that the configuration file of fully managed Flink of an earlier version may be missing. As a result, the deployment cannot recover after the deployment fails. To resolve this issue, manually cancel and then start the deployment. |
| | | HA status check | The current status of the deployment is normal. However, the system detects that HA is not enabled for the deployment. As a result, the deployment cannot recover after the deployment fails. To resolve this issue, publish the draft for the deployment again and manually cancel and then start the deployment. |
| | | Version check | The current status of the deployment is normal. However, the system detects a major defect in fully managed Flink of the current version. |
| | Run | Checkpoint check | The current status of the deployment is normal. However, the system detects a potential stability issue that is caused by a checkpoint exception in fully managed Flink of an earlier version. |
| | | | The current status of the deployment is normal. However, the system detects that no checkpoint is created for a long period of time. |
| | | Cancellation speed analysis | The current status of the deployment is normal. However, the system detects a risk that may cause the deployment to be canceled at a low speed in fully managed Flink of an earlier version. To resolve this issue, manually cancel and then start the deployment. |
| | | Runtime environment analysis | The deployment performs a failover because an exception occurs on the machine where the deployment runs. In this case, the deployment can be automatically restored after the failover. You do not need to manually handle the issue. |
| | | | A deployment may perform a failover within a few minutes during the upgrade of the machine where the deployment runs. After the failover is successful, the deployment can be automatically restored. To prevent this issue, manually cancel and then start the deployment before you upgrade the machine. |
| | | | A hardware failure occurs on the machine where a deployment runs and the machine recovers after a period of time. If this occurs, the deployment performs a failover. To prevent this issue, manually cancel and then start the deployment. |
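
If you track diagnostic findings outside the console, the table above can be kept as a small lookup structure. The following Python sketch is illustrative only: `DiagnosticItem`, `ITEMS`, and `triage` are hypothetical names, the entries are a hand-copied subset of the table, and nothing here calls a Flink Advisor or Alibaba Cloud API. It orders findings so that items of the Exception type, which affect execution, come before items of the Risk type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiagnosticItem:
    kind: str    # "Exception" or "Risk" (the Type column)
    phase: str   # "Startup", "Run", or "Cancellation" (the Phase column)
    item: str    # the Item column
    action: str  # handling method, paraphrased from the Description column


# Hand-copied subset of the table above (assumption: not an official data source).
ITEMS = [
    DiagnosticItem("Exception", "Startup", "Startup file analysis",
                   "Upload the JAR package again before you start the deployment."),
    DiagnosticItem("Exception", "Startup", "Resource analysis",
                   "Reduce the resource configurations or scale out the cluster."),
    DiagnosticItem("Exception", "Run", "Running status check",
                   "Increase the memory of the TaskManager."),
    DiagnosticItem("Exception", "Cancellation", "Cancellation speed analysis",
                   "Manually cancel and then start the deployment."),
    DiagnosticItem("Risk", "Run", "Cancellation speed analysis",
                   "Manually cancel and then start the deployment."),
]

# Exceptions affect execution, so they sort ahead of risks.
KIND_ORDER = {"Exception": 0, "Risk": 1}
PHASE_ORDER = {"Startup": 0, "Run": 1, "Cancellation": 2}


def triage(findings):
    """Return findings ordered by type first, then by lifecycle phase."""
    return sorted(findings, key=lambda d: (KIND_ORDER[d.kind], PHASE_ORDER[d.phase]))


if __name__ == "__main__":
    for d in triage(ITEMS):
        print(f"[{d.kind}/{d.phase}] {d.item}: {d.action}")
```

The sort key is a design choice for this sketch, not something the console prescribes; adjust the ordering to match how your team prioritizes findings.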