This topic provides answers to commonly asked questions about Operation Center.

How can I deal with isolated nodes?

  • Problem description

    Isolated nodes are displayed on the Dashboard page of Operation Center, or isolated nodes are found among the ancestor nodes when you run diagnostics on an instance.

    Isolated nodes are nodes that have lost all upstream dependencies in the production environment. An isolated node cannot generate instances, and its descendant nodes cannot be scheduled. This affects data production.

  • Possible causes
    Nodes are run in sequence based on the node dependencies that are defined in DataWorks. The root cause of isolated nodes is that a node dependency becomes invalid. The node dependency in the following figure is used as an example to describe the causes of isolated nodes.
    • The output name of the ancestor node is modified or deleted.
    • The ancestor node does not exist.
    • The ancestor node expires.
    • Scheduling is not enabled for the workspace where the ancestor node resides.
  • Troubleshooting
    1. Check whether the output name of the ancestor node is modified or deleted.
      • Scenarios

        In the preceding figure, Node A is the ancestor node of Node B. The output name of Node A is Workspace name.A, and both nodes have been deployed to the production environment. On the Properties tab of Node A, you modified or deleted Workspace name.A and then deployed Node A to the production environment. This operation may invalidate Workspace name.A, which makes Node B isolated.

      • Troubleshooting

        On the configuration tab of Node A, click Properties in the right-side navigation pane. In the Dependency section, check whether the output name of Node A was modified before Node A was deployed to the production environment.

      • Solution

        Change the output name of Node A back to Workspace name.A. Then, deploy Node A to the production environment.

    2. Check whether the ancestor node exists.
      • Scenarios

        You deployed only Node B but not Node A. This makes Node B isolated. Alternatively, you deployed both Node A and Node B to the production environment. Then, you called DataWorks API operations to forcibly undeploy Node A. This makes Node B isolated.

      • Troubleshooting

        On the Create Package page, check whether Node A is in the deployed state.

      • Solution

        Deploy Node A to the production environment.

    3. Check whether Node A has expired.
      • Troubleshooting

        On the configuration tab of Node A, click Properties in the right-side navigation pane. In the Schedule section, check the effective period of Node A.

        If Node A has expired, it no longer generates instances. As a result, Node B does not generate instances either.

      • Solution

        Modify the effective period of Node A.

    4. Check whether scheduling is enabled for the workspace where the ancestor node resides.
      • Troubleshooting

        If Node A and Node B reside in different workspaces, click Properties in the right-side navigation pane on the configuration tab of Node A. In the Schedule section, check whether periodic scheduling is enabled for the workspace where Node A resides.

        If periodic scheduling is not enabled, Node B in another workspace is isolated.

      • Solution

        Enable periodic scheduling for the workspace where Node A resides.

Note Descendant nodes of isolated nodes cannot run. In an emergency, if you confirm that the dependency on an isolated node does not affect the data output of the current node, you can cancel the dependency and run the current node.

What can I do when the error message "Communications link failure" is returned?

  • Read error
    • Problem description
      The following error message is returned during data reading:
      Communications link failure The last packet successfully received from the server was 7,200,100 milliseconds ago.
      The last packet sent successfully to the server was 7,200,100 milliseconds ago.
      - com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure
    • Possible causes

      The ApsaraDB RDS for MySQL database has a read timeout due to slow SQL queries.

    • Solution
      • Check whether a WHERE clause is specified, and make sure that an index is created on the filter columns, as shown in the sketch after this list.
      • Check whether the source table contains a large amount of data. If it does, we recommend that you split the SQL queries across multiple tasks.
      • Check the database logs to find the SQL queries that are slow, and contact the database administrator to resolve the issue.
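
      The following sketch shows the kind of check that you or the database administrator can run. The table orders and the filter column gmt_modified are hypothetical names that are used only for illustration:

      -- List the indexes on the source table to confirm that the filter column is indexed.
      SHOW INDEX FROM orders;
      -- If the filter column is not indexed, the database administrator can create an index, for example:
      ALTER TABLE orders ADD INDEX idx_gmt_modified (gmt_modified);
      -- The sync task can then read the data with an indexed filter, for example:
      SELECT id, buyer_id, amount
      FROM orders
      WHERE gmt_modified >= '2020-07-15 00:00:00';
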
  • Write error
    • Problem description
      The following error message is returned during data writing:
      Caused by: java.util.concurrent.ExecutionException: ERR-CODE: [TDDL-4614][ERR_EXECUTE_ON_MYSQL] Error occurs when execute on GROUP 'xxx' ATOM 'dockerxxxxx_xxxx_trace_shard_xxxx': Communications link failure The last packet successfully received from the server was 12,672 milliseconds ago.
      The last packet sent successfully to the server was 12,013 milliseconds ago. More...
    • Possible causes

      A socket timeout is caused by slow SQL queries. The default value of the SocketTimeout parameter of Taobao Distributed Data Layer (TDDL) connections is 12 seconds. If the running time of an SQL statement on a MySQL client exceeds 12 seconds, a TDDL-4614 error is returned. This error occurs occasionally when the data volume is large or the server is busy.

    • Solution
      • We recommend that you restart the sync node after the database becomes stable.
      • Contact the database administrator to adjust the value of the SocketTimeout parameter.

What can I do when the error message "Semantic analysis exception - Invalid partition value" is returned?

  • Problem description
    Operation Center reports the following error message:
    FAILED: ODPS-0130071:[1,71] Semantic analysis exception - Invalid partition value: '20200715'?
  • Possible causes

    The bizdate parameter of a node is not set.

  • Troubleshooting

    Click the Properties tab in the right-side navigation pane on the node configuration page. Check whether the bizdate parameter is set.

    If the bizdate parameter is not set, set it to bizdate=${yyyymmdd}. If it is already set, make sure that the value is bizdate=${yyyymmdd}.
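
    The following sketch shows how the parameter is typically referenced in the node code so that the partition value resolves to a valid date such as 20200715 at run time. The table and column names are hypothetical:

    -- Scheduling parameter assigned on the Properties tab: bizdate=${yyyymmdd}
    INSERT OVERWRITE TABLE result_table PARTITION (dt='${bizdate}')
    SELECT id, name
    FROM source_table
    WHERE ds = '${bizdate}';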

What can I do when the error message "no available machine resources under the task resource group" is returned?

  • Problem description
    Operation Center reports the following error message:
    no available machine resources under the task resource group
  • Solution

    To resolve this issue, change the resource group of the nodes. Procedure:

    1. Go to the DataStudio page.
      1. Log on to the DataWorks console.
      2. In the left-side navigation pane, click Workspaces.
      3. In the top navigation bar, select a region, find the required workspace, and then click Data Analytics in the Actions column.
    2. On the DataStudio page, click the icon in the upper-left corner and choose All Products > Operation Center.
    3. In the left-side navigation pane, choose Cycle Task Maintenance > Cycle Task.
    4. Select one or more nodes and choose More > Modify Scheduling Resource Group at the bottom of the page.
    5. In the Modify scheduling resource groups in batches dialog box, select a resource group and click OK.
    6. Generate retroactive data for the nodes. For more information, see Retroactive instances.

Why does dry-run scheduling occur?

  • Problem description

    A dry-run usually occurs for nodes that are scheduled by week or by month. On days when such a node is not scheduled to run, a dry-run occurs.

    Examples:
    • Instances of weekly or monthly nodes are generated on days that are not in the specified recurrence.
    • The scheduling log is empty. The scheduling system does not actually run the node but directly returns a success response.
  • Solution

    If the node is to be run at another time, you must change the scheduling time and then recommit the node or generate retroactive data.

    If the node is scheduled on a daily basis, check whether Execution Mode is set to Dry Run for the node on the Properties tab.

    If Next Day is selected for Start Instantiation, the data timestamp is the current day, and the node is to be run on the next day.

    If the node is scheduled to run on the first day of each month, you must set the data timestamp to the last day of the previous month when you generate retroactive data. For example, for the instance that runs on August 1, set the data timestamp to July 31.

    If no node instance is valid in the specified period, check the data timestamp of the node.

Special dry-run circumstances: The running duration is 0s, a success response is returned, and the instance is a dry-run instance. For example, a node that is scheduled by hour generates 24 instances in one day, one for each hour from 00:00 to 23:00. If the node is deployed at 14:40 on that day, the instances that are scheduled to run before 14:40 are dry-run instances, and the instances that are scheduled to run at 15:00 or later run as scheduled.

The scheduling principle of a node that is scheduled by day is the same as that of a node that is scheduled by hour. For example, if a node is scheduled to run at 00:10 and is deployed at 11:00, the instance that is generated for that day is a dry-run instance.

How do I unfreeze a frozen node?

  • Problem description

    How do I unfreeze a frozen node?

  • Solution
    You can use the following methods to unfreeze a frozen node:
    • On the Cycle Task page, open the directed acyclic graph (DAG) of nodes. Right-click the frozen node in the DAG and select Unfreeze.
    • On the DataStudio page, click the Properties tab of the frozen node. Clear Skip Execution in the Schedule section. Then, commit and deploy the node again.
      Note Check whether the frozen nodes are auto triggered nodes.

What can I do if a frozen node is scheduled?

  • Check whether the frozen nodes are auto triggered nodes.
  • You cannot freeze the instances that have already been generated for an auto triggered node. Only the instances that are generated from the next day can be frozen.

Why does a task keep waiting for gateway resources?

  • Problem description

    A task keeps waiting for gateway resources in the cloud.

  • Possible causes

    The number of concurrent tasks in the workspace exceeds the upper limit.

  • Solution
    • Check the workspace for tasks that have a long running time. If a task has a long running time and its resources are not released, other tasks may fail to run.
    • Purchase exclusive resource groups in the DataWorks console to alleviate the resource shortage.

How do I set priorities?

The following priorities are supported:
  • The priority of task instances that are scheduled in DataWorks. A higher value indicates a higher priority. The priority can be regarded as the priority of a task in the scheduling cluster of a workspace.
  • The priority of baselines. Valid values: 1, 3, 5, 7, and 8. A higher value indicates a higher priority.
  • The priority of SQL tasks in MaxCompute. The value ranges from 1 to 9. A smaller value indicates a higher priority. The priority can be converted to a priority parameter later. The priority can be regarded as the priority of a task in the computing cluster. You can view the priority in LogView.
Take note of the following relationship between priorities:
  • A baseline with a priority of 8 corresponds to an SQL task with a priority of 1, which is the highest priority.
  • The higher the priority of the baseline, the higher the scheduling priority of the task in DataWorks, and the higher the priority of the SQL computing task in MaxCompute.
You can use the following methods to adjust priorities:
  • To adjust the priority of an SQL task, add the following command before the first SQL statement: set odps.instance.priority=8;. The priority value in the command is only an example.

    The value ranges from 1 to 9. A smaller value indicates a higher priority. The priority can be converted to a priority parameter later.
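
    The following sketch shows where the command is placed in the node code. The query and the table name are hypothetical and are used only to illustrate the placement:

    -- Set the MaxCompute instance priority before the first SQL statement.
    -- A smaller value indicates a higher priority. The value 8 is only an example.
    set odps.instance.priority=8;
    SELECT dt, COUNT(*) AS cnt
    FROM log_table
    GROUP BY dt;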

  • To adjust the priority of an auto triggered node, adjust the priority of the baseline. The default priority of an auto triggered node is 1.
  • To adjust the priority of a baseline, perform the following steps:
    1. Go to the Operation Center page.
    2. In the left-side navigation pane, choose Alarm > Baseline Management.
    3. Click Change in the Actions column of the baseline.
    4. In the Change Baseline dialog box, set the Priority parameter.
    5. Click OK.

    If you increase the priority of a baseline, the priority of the corresponding instances in DataWorks and the priority of the MaxCompute computing tasks increase accordingly. A higher baseline priority indicates a higher scheduling priority, and the nodes in the baseline are scheduled preferentially.