Use intelligent deployment diagnostics to analyze exceptions and optimize deployments - Realtime Compute for Apache Flink

Limitations

Only streaming deployments support intelligent deployment diagnostics. Batch deployments are not supported.

How it works

Flink Advisor evaluates your deployment's health status continuously. The health score starts at 100 and is deducted based on the number and severity of risks detected in the last 30 minutes.

The score is displayed as a color-coded badge in the deployment list:

Color	Score range	Meaning
Green	Above 80	No potential risks. Configuration suggestions may be provided.
Yellow	60–80	Issues or potential risks detected. Review the deployment.
Red	Below 60	Serious issues detected. Act immediately to prevent deployment cancellation.

Diagnose exception logs of a draft

Use this workflow to diagnose SQL errors before publishing a deployment.

Log on to the Realtime Compute for Apache Flink console. Find the workspace you want to manage and click Console in the Actions column.
In the left-side navigation pane, choose Development > ETL. Create a draft, write your SQL statements, and click Validate.
In the lower part of the SQL Editor page, review the error details, possible causes, and optimization suggestions.

If you cannot identify the cause from the syntax check results, select the related logs and click Search in Documentation to find relevant information.

Diagnose exception logs of a deployment

Use this workflow to investigate log-level exceptions in a running or failed deployment.

Log on to the Realtime Compute for Apache Flink console. Find the workspace you want to manage and click Console in the Actions column.
In the left-side navigation pane, choose O&M > Deployments. Click the name of the deployment you want to investigate.
Click the Logs tab. In the left-side pane, click Logs, Startup Logs, or JM Exceptions to view the relevant logs.

For more information about log types, see View the boot logs and operational logs of a deployment. To investigate exception logs, see View the exception logs of a deployment. To access historical records, see View the logs of a historical deployment.

Run intelligent diagnostics on an abnormal deployment

Use this workflow when a deployment is in an abnormal state and you need to identify the root cause.

Navigate to the Diagnosis tab using one of these methods:
- In the deployment list, find the deployment and click its score in the Health column.
- Click the deployment name, then click the Diagnosis tab.
Click Diagnose. Flink Advisor provides a variety of log repositories for Flink exception logs. For the full list of diagnostic items and remediation steps, see Flink Advisor diagnostic items.
Review the diagnostic results and optimization suggestions. To apply a suggestion, click Apply next to it.

Flink Advisor diagnostic items

Diagnostic items fall into two categories:

Exception: The deployment execution is affected. Immediate action is required.
Risk: The deployment is currently running but a potential issue is detected. Address it proactively to prevent future failures.

Type	Phase	Diagnostic item	Description
Exception (deployment execution is affected)	Startup	Startup file analysis	The required JAR package is missing from the Object Storage Service (OSS) directory. Re-upload the JAR package before starting the deployment.
	Startup	Resource analysis	Remaining available resources are insufficient. Reduce the resource configuration values for the deployment, or scale out the cluster.
	Startup	Resource analysis	The Container Network Interface (CNI) failed to bind to the deployment. Check whether the number of IP addresses of the related vSwitch has reached its upper limit.
	Startup	Resource analysis	The number of Elastic Network Interface (ENI) IP addresses exceeds the upper limit. Increase the number of ENIs and try again.
	Startup	Topology network analysis	No network connection is established between the TaskManager and JobManager. The deployment is abnormal.
	Startup	Topology network analysis	ENI binding to Elastic Compute Service (ECS) instances timed out within the previous 10 minutes. The deployment is starting at a low speed. Wait for the operation to complete.
	Startup	Network analysis of upstream and downstream services	TCP port detection is normal, but the upstream or downstream connector is not connected. Check the network configurations of the upstream and downstream services.
	Startup	Permission detection of upstream and downstream services	The upstream data source is not connected. Check the permission configuration of the upstream service.
	Startup	Permission detection of upstream and downstream services	The downstream data source is not connected. Check the permission configuration of the downstream service.
	Startup	Startup speed analysis	The JAR package is excessively large, causing a slow startup. Compress the JAR package and re-upload it, or wait for the startup to complete.
	Startup	JobGraph check	The configuration file from an earlier version of Realtime Compute for Apache Flink may be missing. The deployment may not recover after a failover. Manually cancel and restart the deployment.
	Startup	Session cluster check	A session cluster from an earlier version of Realtime Compute for Apache Flink may be abnormal, causing the deployment to be abnormal.
	Run icon	High availability (HA) status check	HA is not enabled. The deployment cannot recover after a failure. Publish the draft again, then manually cancel and restart the deployment.
	Run icon	Checkpoint check	The checkpoint feature from an earlier version of Realtime Compute for Apache Flink may be abnormal, which can cause checkpointing to fail.
	Run icon	Permission detection of upstream and downstream services	TCP port detection is normal, but the upstream or downstream connector is not connected. Check the permission configurations of the upstream and downstream services.
	Run icon	Running status check	An out-of-memory (OOM) error occurred in a TaskManager. Check the deployment configuration and increase the TaskManager memory.
	Cancellation	Cancellation speed analysis	Cancellation is slow in an earlier version of Realtime Compute for Apache Flink. Manually cancel and restart the deployment.
Risk (deployment execution is not affected)	Configurations	JobGraph check	The deployment is running normally. However, the configuration file from an earlier version may be missing. The deployment may not recover after a failure. Manually cancel and restart the deployment.
	Configurations	HA status check	The deployment is running normally. However, HA is not enabled. The deployment may not recover after a failure. Publish the draft again, then manually cancel and restart the deployment.
	Configurations	Version check	The deployment is running normally. However, a major defect is detected in the current version of Realtime Compute for Apache Flink.
	Run icon	Checkpoint check	The deployment is running normally. However, a checkpoint exception in an earlier version of Realtime Compute for Apache Flink may cause a potential stability issue.
	Run icon	Checkpoint check	The deployment is running normally. However, no checkpoint has been created for an extended period of time.
	Run icon	Cancellation speed analysis	The deployment is running normally. However, a risk is detected that may cause slow cancellation in an earlier version of Realtime Compute for Apache Flink. Manually cancel and restart the deployment.
	Run icon	Runtime environment analysis	The deployment failed over due to an exception on the host machine. The deployment can be automatically restored after the failover. No action is required.
	Run icon	Runtime environment analysis	A machine upgrade may cause the deployment to fail over within a few minutes. The deployment can be automatically restored after the failover. To prevent this, manually cancel and restart the deployment before the machine upgrade.
	Run icon	Runtime environment analysis	A hardware failure occurred on the host machine. The deployment failed over. Manually cancel and restart the deployment.
	Run icon	Version check	The version is at End of Service (EOS). Stability issues may occur, or product support may no longer be available. For more information, see Console operations.

What's next

Monitor JobManager and TaskManager performance metrics for running deployments. For more information, see Monitor deployment performance.
Configure automatic tuning to let the system reconfigure resources automatically or on a schedule. For more information, see Enable automatic performance tuning.
Optimize Flink SQL deployments by tuning deployment configurations and SQL logic. For more information, see Optimize Flink SQL.