To improve user experience, Realtime Compute allows you to use automatic configuration to optimize job performance.

Note Automatic configuration applies to Blink 1.0 and Blink 2.0.

Background and scope

If all the operators and both the upstream and downstream storage systems of your Realtime Compute job meet the performance requirements and remain stable, automatic configuration can help you properly adjust job configurations, such as operator resources and parallelism. It also helps optimize your job throughout the entire process to resolve performance issues such as low throughput or upstream and downstream backpressure.

In the following scenarios, automatic configuration can improve job performance but cannot eliminate the underlying performance bottlenecks. To eliminate these bottlenecks, manually configure the resources or contact the Realtime Compute support team.

  • Performance issues exist in the upstream or downstream storage systems of a Realtime Compute job.
    • Performance issues in the data source, such as insufficient DataHub partitions or insufficient Message Queue (MQ) throughput. In this case, you must increase the number of partitions of the relevant source table.
    • Performance issues in the data sink, such as a deadlock in ApsaraDB for RDS.
  • Performance issues of user-defined extensions (UDXs) such as the UDFs, UDAFs, and UDTFs in your Realtime Compute job.

Operations

  • New jobs
    1. Publish a job.
      1. After you complete SQL development and syntax check on the Development page, click Publish. The Publish New Version dialog box appears.
      2. Specify Resource Configuration Method.
        • Automatic CU Configuration (2.25 CUs Available): If you select this option, you can specify the number of compute units (CUs). The automatic configuration algorithm generates an optimized resource configuration and, based on the default configuration, assigns the number of CUs. The first time you use automatic CU configuration, the default number of CUs is used and the algorithm generates an initial configuration based on empirical data. We recommend that you select Automatic CU Configuration (2.25 CUs Available) only after your job has been running for 5 to 10 minutes and its metrics, such as source RPS, have remained stable for 2 to 3 minutes. You can typically obtain the optimal configuration after you repeat the optimization process three to five times.
        • Use Latest Manually Configured Resources: The most recently saved resource configuration is used, regardless of whether it was generated by automatic CU configuration or configured manually.
    2. Use the default configuration to start the job.
      1. Use the default configuration to start the job, as shown in the following figure.
      2. On the Administration page, find the job and click Start in the Actions column to start the job.
      Assume that the default number of CUs generated the first time is 71.
      Note Make sure that your job runs longer than 10 minutes and its metrics such as source RPS remain stable for 2 to 3 minutes before you select Automatic CU Configuration (2.25 CUs Available) for Resource Configuration Method.
    3. Use the automatic CU configuration to start a job.
      1. Resource performance optimization

        If you select Automatic CU Configuration (2.25 CUs Available) for Resource Configuration Method and specify 40 CUs to start your job, you can change the number of CUs based on your job to optimize resource performance.

        • Determine the minimum number of CUs.

          We recommend that you set the number of CUs to a value that is greater than or equal to 50% of the default value. The number of CUs cannot be less than 1. For example, if the default number of CUs for automatic CU configuration is 71, the recommended minimum is 36 CUs: 71 CUs × 50% = 35.5 CUs, rounded up to 36 CUs.

        • Increase the number of CUs.

          If the throughput of your Realtime Compute job does not meet your requirements, increase the number of CUs. We recommend that you increase the number of CUs by at least 30% of the current value. For example, if you specified 10 CUs last time, increase the number to 13 or more.

        • Repeat the optimization process.

          If the first optimization attempt does not meet your requirements, repeat the process until you obtain the desired results. You can change the number of CUs based on your job status after each optimization attempt.

      2. View the result of optimization. The following figure shows an example.
        Note Do not select Use Latest Manually Configured Resources for a new job. Otherwise, an error is returned.
  • Existing jobs
    • The following figure shows the optimization process of automatic configuration.
      Note
      • Before you use automatic configuration for a job that is in the running state, check whether stateful operations are involved. This is because the saved state data of a job may be cleared during the optimization process of automatic configuration.
      • If you make changes to a job, for example, by modifying SQL statements or changing the Realtime Compute version, automatic configuration may fail. Such changes may alter the job topology. As a result, curve charts may not display the latest data, or the state data may not be usable for fault tolerance. In this case, resource configurations cannot be optimized based on the job running history, and an error is returned when you perform automatic configuration. To resolve this issue, treat the changed job as a new job and repeat the previous operations.
    • Procedure
      1. Suspend the job.
      2. Repeat the steps performed for new jobs and resume the job with the latest configuration.
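
The CU sizing guidance in the procedure above (a minimum of about 50% of the default value, never less than 1, and increases of roughly 30% per attempt) can be sketched as simple arithmetic. This is only an illustrative sketch: the function names are hypothetical and not part of Realtime Compute, and rounding up to a whole CU is an assumption based on the 35.5 → 36 example.

```python
import math


def recommended_min_cus(default_cus: float, min_percent: int = 50) -> int:
    """Smallest CU count suggested by the 50% guideline, never below 1.

    Rounding up is an assumption taken from the 71 -> 36 example.
    """
    return max(1, math.ceil(default_cus * min_percent / 100))


def next_cu_target(current_cus: float, increase_percent: int = 30) -> int:
    """CU count after raising the current value by the given percentage, rounded up."""
    return math.ceil(current_cus * (100 + increase_percent) / 100)


if __name__ == "__main__":
    print(recommended_min_cus(71))  # default of 71 CUs from the example
    print(next_cu_target(10))       # previous setting of 10 CUs from the example
```

With the values used in the procedure, the sketch yields 36 CUs as the minimum for a default of 71 CUs, and 13 CUs as the next target after 10 CUs.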

FAQ

The optimization result of automatic configuration may not be accurate in the following scenarios:

  • If the job runs only for a short period of time, the data collected during data sampling is insufficient. We recommend that you increase the running duration of the job and make sure that the curves of job metrics such as source RPS remain stable for at least 2 to 3 minutes.
  • The job fails. We recommend that you identify and fix the cause of the failure.
  • Only a small amount of data is available for a job. We recommend that you retrieve more historical data.
  • The effect of automatic configuration is affected by multiple factors. Therefore, the latest configuration obtained by using automatic configuration may not be optimal. If the effect of automatic configuration does not meet your requirements, you can manually configure the resources. For more information, see Optimize performance by manual configuration.

Recommendations

  • To help automatic configuration accurately collect the runtime metric information of a job, make sure that the job runs stably for more than 10 minutes before you apply automatic configuration to the job.
  • Job performance typically improves after you use automatic configuration three to five times.
  • When you use automatic configuration, you can specify an earlier start offset to read historical data, or even let a large amount of data accumulate for the job. This creates backpressure and speeds up the optimization.
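
As a rough illustration of the first recommendation, the following sketch checks whether a job has run for more than 10 minutes and whether its source RPS has been stable over the last 3 minutes. Everything here is an assumption for illustration: the function name, the (timestamp, RPS) sample format, and the 10% tolerance used to define "stable" are hypothetical and are not defined by Realtime Compute.

```python
from statistics import mean


def ready_for_autoconfig(samples, min_runtime_s=600, stable_window_s=180, tolerance=0.10):
    """Return True if the job ran long enough and recent source RPS looks stable.

    samples: list of (timestamp_in_seconds, source_rps) tuples in time order.
    tolerance: assumed maximum relative spread (max - min) / mean in the window.
    """
    if not samples:
        return False
    start_ts, end_ts = samples[0][0], samples[-1][0]
    # The job must have been running for more than the minimum runtime.
    if end_ts - start_ts <= min_runtime_s:
        return False
    # Keep only the samples from the most recent stability window.
    window = [rps for ts, rps in samples if ts >= end_ts - stable_window_s]
    avg = mean(window)
    if avg <= 0:
        return False
    return (max(window) - min(window)) / avg <= tolerance


if __name__ == "__main__":
    steady = [(t, 1000.0) for t in range(0, 661, 30)]
    print(ready_for_autoconfig(steady))
```

The tolerance is the main design choice. A stricter value delays optimization, while a looser value risks sampling a job whose load is still changing.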

Method used to determine the effectiveness of automatic configuration

Automatic configuration of Realtime Compute is enabled based on a JSON configuration file. After you use automatic configuration to optimize a job, you can view the JSON configuration file to check whether the feature is running as expected.

  • You can view the JSON configuration file by using one of the following methods:
    1. View the file on the job edit page, as shown in the following figure.
    2. View the file on the Job Administration page, as shown in the following figure.
  • JSON configuration description
    "autoconfig" : {
        "goal": {  // The goal of automatic configuration.
            "maxResourceUnits": 10000.0,  // The maximum number of CUs for a Blink job. This value cannot be changed. Therefore, you can ignore this item when you check whether the feature is running as expected.
            "targetResoureUnits": 20.0  // The number of CUs that you specified. The specified number of CUs is 20.
        },
        "result" : {  // The result of automatic configuration. We recommend that you pay attention to this item.
          "scalingAction" : "ScaleToTargetResource",  // The action of automatic configuration. *
          "allocatedResourceUnits" : 18.5, // The total resources allocated by automatic configuration.
          "allocatedCpuCores" : 18.5,      // The total CPU cores allocated by automatic configuration.
          "allocatedMemoryInMB" : 40960,   // The total memory size, in MB, allocated by automatic configuration.
          "messages" : "xxxx"  // We recommend that you pay attention to these messages. *
        }
    }
    • scalingAction: If the value of this parameter is InitialScale, this is the first time that you use automatic configuration. If the value of this parameter is ScaleToTargetResource, this is not the first time that you use automatic configuration.
    • messages: If no message is displayed, automatic configuration runs properly. If messages are displayed, you must analyze them. Messages are categorized into the following two types:
      • Warning: This type of message indicates that automatic configuration runs properly but you must pay attention to potential issues, such as insufficient partitions in a source table.
      • Error or exception: This type of message indicates that automatic configuration failed. The following error message is usually displayed: Previous job statistics and configuration will be used. The automatic configuration for a job fails in the following two scenarios:
        • The job or Blink version is modified before you use automatic configuration. In this case, the previous running information cannot be used for automatic configuration.
        • An error message that contains "exception" is reported when you use automatic configuration. In this case, you must analyze the error based on the job running information and logs. If you do not have enough information, submit a ticket.
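
The checks described above can be sketched as a small helper that reads the autoconfig section and reports whether this is the first run and whether the messages indicate a warning or a failure. This is a minimal sketch under stated assumptions: the function name is hypothetical, the input is assumed to be the autoconfig object already parsed into a dictionary with the explanatory comments removed, and the key names (including targetResoureUnits) are copied as they appear in the sample. Classifying a message as a failure by matching "exception" or the fallback text is a heuristic derived from the description above, not a documented rule.

```python
FALLBACK_TEXT = "Previous job statistics and configuration will be used"


def summarize_autoconfig(autoconfig: dict) -> dict:
    """Summarize the goal and result of an autoconfig section (hypothetical helper)."""
    goal = autoconfig.get("goal", {})
    result = autoconfig.get("result", {})
    messages = (result.get("messages") or "").strip()

    if not messages:
        status = "ok"        # no message: automatic configuration ran properly
    elif "exception" in messages.lower() or FALLBACK_TEXT.lower() in messages.lower():
        status = "failed"    # error or exception message
    else:
        status = "warning"   # ran properly, but potential issues were reported

    return {
        "first_run": result.get("scalingAction") == "InitialScale",
        "target_cus": goal.get("targetResoureUnits"),
        "allocated_cus": result.get("allocatedResourceUnits"),
        "status": status,
    }


if __name__ == "__main__":
    sample = {
        "goal": {"maxResourceUnits": 10000.0, "targetResoureUnits": 20.0},
        "result": {
            "scalingAction": "ScaleToTargetResource",
            "allocatedResourceUnits": 18.5,
            "allocatedCpuCores": 18.5,
            "allocatedMemoryInMB": 40960,
            "messages": "",
        },
    }
    print(summarize_autoconfig(sample))
```

A "failed" status from this sketch only signals that the messages deserve a closer look. The job running information and logs remain the basis for the actual analysis.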

Error messages

IllegalStateException

If the following error messages are displayed, the state data cannot be used for fault tolerance. To resolve this issue, terminate the job, clear its state, and then specify the start offset to re-read the data.

If you cannot migrate the target job to a backup node, follow these steps to mitigate the negative impact of service interruption: Roll back the target job to an earlier version and specify the start offset to re-read the data during off-peak hours. To roll back the target job, click Versions on the right side of the Development page. On the page that appears, move the pointer over More in the Actions column, click Compare, and then click Roll Back to Version.

java.lang.IllegalStateException: Could not initialize keyed state backend.
    at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initKeyedState(AbstractStreamOperator.java:687)
    at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:275)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeOperators(StreamTask.java:870)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:856)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:292)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:762)
    at java.lang.Thread.run(Thread.java:834)
Caused by: org.apache.flink.api.common.typeutils.SerializationException: Cannot serialize/deserialize the object.
    at com.alibaba.blink.contrib.streaming.state.AbstractRocksDBRawSecondaryState.deserializeStateEntry(AbstractRocksDBRawSecondaryState.java:167)
    at com.alibaba.blink.contrib.streaming.state.RocksDBIncrementalRestoreOperation.restoreRawStateData(RocksDBIncrementalRestoreOperation.java:425)
    at com.alibaba.blink.contrib.streaming.state.RocksDBIncrementalRestoreOperation.restore(RocksDBIncrementalRestoreOperation.java:119)
    at com.alibaba.blink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:216)
    at org.apache.flink.streaming.api.operators.AbstractStreamOperator.createKeyedStateBackend(AbstractStreamOperator.java:986)
    at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initKeyedState(AbstractStreamOperator.java:675)
    ... 6 more
Caused by: java.io.EOFException
    at java.io.DataInputStream.readUnsignedByte(DataInputStream.java:290)
    at org.apache.flink.types.StringValue.readString(StringValue.java:770)
    at org.apache.flink.api.common.typeutils.base.StringSerializer.deserialize(StringSerializer.java:69)
    at org.apache.flink.api.common.typeutils.base.StringSerializer.deserialize(StringSerializer.java:28)
    at org.apache.flink.api.java.typeutils.runtime.RowSerializer.deserialize(RowSerializer.java:169)
    at org.apache.flink.api.java.typeutils.runtime.RowSerializer.deserialize(RowSerializer.java:38)
    at com.alibaba.blink.contrib.streaming.state.AbstractRocksDBRawSecondaryState.deserializeStateEntry(AbstractRocksDBRawSecondaryState.java:162)
    ... 11 more