If a job is at risk, you can use the job diagnostics feature to automatically locate the possible causes of the risk. You can then quickly restore the job to normal based on the handling suggestions that fully managed Flink provides. This topic describes how to use the job diagnostics feature.

Background information

If a running job is at risk, fully managed Flink can locate the possible causes of the risk. It also assigns the job a health score based on the diagnostic result, which you can use to determine the health status of the job. You can then fix the issue based on the handling suggestions that fully managed Flink provides and restore the job to normal. The following table describes the diagnostic items that fully managed Flink supports and the suggestion for each item.

| Job status | Category | Diagnostic item | Risk level | Suggestion |
| --- | --- | --- | --- | --- |
| Started | Stability | Check whether the session cluster is abnormal. | Medium | Submit a ticket for technical support. |
| Started | Stability | Check whether the remaining resources in the session cluster are sufficient. | Medium | Reduce the values of the resource configurations of the job or scale out the session cluster. |
| Running | Stability | Check whether the hardware of the machine is abnormal. | Low | When the system handles the exception, the job is automatically restored without additional operation. |
| Running | Stability | Check whether the memory size of TaskManagers for the job does not meet your business requirements. | Medium | Adjust the memory size of the nodes to which TaskManagers for the job correspond. |
| Running | Stability | Check whether high availability (HA) is enabled for the job. | Medium | Publish the job again and manually restart the job by suspending and starting the job. |
| Running | Stability | Check whether the checkpoint feature for the job fails to run as expected. | Medium | Submit a ticket for technical support. |
| Running | Stability | Check whether the server on which the job runs is overloaded. | Medium | Manually restart the job by suspending and starting the job. The system schedules the job to the server on which the load is low. |
| Canceled | Performance | Check whether the job is abnormally canceled. | High | Submit a ticket for technical support. |
| Canceled | Performance | Check whether the job version is an earlier version. | Medium | Suspend the job and change the job version. Then, start the job. |
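
If you want to keep the table above in a runbook or an alerting script, it can be treated as a simple lookup structure. The following Python sketch is only an illustration of that idea and is not part of any fully managed Flink API. The names `DiagnosticItem`, `ITEMS`, `RISK_ORDER`, and `triage` are hypothetical, the diagnostic item texts are shortened paraphrases of the table rows, and the High-before-Medium-before-Low ordering is an assumption about how you might prioritize findings.

```python
from dataclasses import dataclass

# Assumed priority for sorting findings: a lower number is handled first.
RISK_ORDER = {"High": 0, "Medium": 1, "Low": 2}


@dataclass(frozen=True)
class DiagnosticItem:
    """One row of the diagnostic table (hypothetical local representation)."""
    job_status: str
    category: str
    check: str
    risk_level: str
    suggestion: str


# Rows transcribed from the table; the check texts are shortened paraphrases.
ITEMS = [
    DiagnosticItem("Started", "Stability", "Session cluster is abnormal",
                   "Medium", "Submit a ticket for technical support."),
    DiagnosticItem("Started", "Stability", "Session cluster resources are insufficient",
                   "Medium", "Reduce the job resource configurations or scale out the session cluster."),
    DiagnosticItem("Running", "Stability", "Machine hardware is abnormal",
                   "Low", "No action required; the job is automatically restored."),
    DiagnosticItem("Running", "Stability", "TaskManager memory does not meet requirements",
                   "Medium", "Adjust the memory size of the TaskManager nodes."),
    DiagnosticItem("Running", "Stability", "High availability (HA) check",
                   "Medium", "Publish the job again, then suspend and start the job."),
    DiagnosticItem("Running", "Stability", "Checkpoint feature fails to run as expected",
                   "Medium", "Submit a ticket for technical support."),
    DiagnosticItem("Running", "Stability", "Server is overloaded",
                   "Medium", "Suspend and start the job so that it is rescheduled."),
    DiagnosticItem("Canceled", "Performance", "Job is abnormally canceled",
                   "High", "Submit a ticket for technical support."),
    DiagnosticItem("Canceled", "Performance", "Job version is an earlier version",
                   "Medium", "Suspend the job, change the job version, and start the job."),
]


def triage(job_status, items=ITEMS):
    """Return the items that apply to job_status, highest assumed risk first."""
    matching = [item for item in items if item.job_status == job_status]
    # sorted() is stable, so rows with equal risk keep their table order.
    return sorted(matching, key=lambda item: RISK_ORDER[item.risk_level])


if __name__ == "__main__":
    for item in triage("Running"):
        print(f"[{item.risk_level}] {item.check}: {item.suggestion}")
```

Calling `triage("Canceled")`, for example, lists the abnormal-cancellation item first because it is the only row with a High risk level. The health score that fully managed Flink reports is not modeled here, because this topic does not describe how the score is calculated.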

Procedure

  1. Log on to the Realtime Compute for Apache Flink console.
  2. On the Fully Managed Flink tab, find the workspace that you want to manage and click Console in the Actions column.
  3. In the left-side navigation pane, choose Applications > Deployments.
  4. Click the name of the desired job.
  5. In the upper-right corner of the job details page, click Diagnosis.
  6. On the left side of the page that appears, view the diagnostic result and optimization suggestions.
    Click Expand to view the optimization suggestions.