This topic answers frequently asked questions (FAQ) about custom resource groups.

How do I install a monitor?

If an error occurs when you use a custom resource group for scheduling, check whether a monitor is installed for the agent by following these steps:
  1. Log on to each Elastic Compute Service (ECS) instance and switch to the root user.
  2. Run the following command to download the monitor installation script:
    wget https://alisaproxy.shuju.aliyun.com/install_monitor.sh --no-check-certificate
  3. If no monitor is installed, run the following command to install one:
    sh install_monitor.sh

What can I do if I fail to add an ECS instance to a custom resource group for scheduling?

If you fail to add an ECS instance to a custom resource group for scheduling and the status of the instance is always Stopped, consider the following reasons:
  • The hostname or universally unique identifier (UUID) you entered on the ECS instance registration page is different from the actual one.
    To check the hostname or UUID, follow these steps:
    • If you set the network type to classic network, check whether the hostname and IP address you entered on the registration page are the same as those returned after you run the hostname and hostname -i commands on the ECS instance. Note that you can set the network type to classic network only when the ECS instance is in the China (Shanghai) region.
      Note Check whether you have changed the hostname. If you have changed the hostname, open the /etc/hosts file and check whether the instance is bound to a host. If the instance is bound to a host, enter the name of the bound host on the registration page.
    • If you set the network type to Virtual Private Cloud (VPC), check whether the UUID you entered on the registration page is the same as that returned after you run the dmidecode | grep UUID command on the ECS instance.
      Note
      • If dmidecode is not installed, install it first.
      • UUIDs are case-sensitive, and different versions of dmidecode may return the letters of a UUID in different cases.
      • Hostnames are case-sensitive.
    If the issue is caused by this reason, resolve it by following these steps:
    1. Remove the original instance.
    2. Enter the correct IP address and hostname or UUID and register the instance again.
  • The initialization commands are incorrect.
    To check whether the initialization commands are correct, follow these steps:
    1. Log on to the ECS instance and run the following command:
      cat /home/admin/alisatasknode/target/alisatasknode/conf/config.properties | grep driver
    2. Log on to the DataWorks console.
    3. In the left-side navigation pane, click Resource Groups.
    4. On the Resource Groups page, click the Custom Resource Groups tab.
    5. Find the target resource group and click Initialize Server in the Actions column.
    6. Check whether the username in the output of the preceding command is the same as that in the initialization dialog box.
    If the issue is caused by this reason, run the correct commands listed in the initialization dialog box to re-initialize the instance.
    Note
    • After an instance is registered, you can initialize the instance on the Custom Resource Groups tab of the Resource Groups page.
    • The initialization commands differ between resource groups. Do not use the commands of one resource group for another.
    • Copy the commands in the Initialize Server dialog box and run them in sequence.
    • For instances in a VPC, use the initialization commands for instances on a classic network.
  • The time of the ECS instance differs from the current time in UTC+8 by more than 5 minutes.
    To check the time difference, follow these steps:
    1. Log on to the ECS instance.
    2. Run the date command and check whether the returned time differs from the current time in UTC+8 by more than 5 minutes.

    If this is the cause, confirm that a time adjustment does not affect your business, and then set the time of the ECS instance to the current time in UTC+8.

  • The permissions on relevant directories are incorrect.
    To check the permissions, follow these steps:
    1. Log on to the ECS instance and run the ps -ef | grep zoo | grep -v cdp command.
    2. Check whether the returned processes are owned by the admin user.

      If the processes are owned by the admin user, check whether the admin user has permissions on the /home/admin/alisatasknode directory and its subdirectories.

    If the admin user lacks the required permissions and root permission is needed to grant them, follow these steps:
    1. Switch to the root user and run the chown admin:admin /home/admin -R command.
    2. Switch back to the admin user and run the /home/admin/alisatasknode/target/alisatasknode/bin/serverctl restart command to restart the agent.
  • An error occurs when the install.sh script is run.
    To check whether an error occurs when the install.sh script is run, follow these steps:
    1. Run the install.sh script.
    2. Check whether a log file is generated in the /home/admin/alisatasknode/logs directory. If no log file is generated, the agent is not installed.
    If the issue is caused by this reason, resolve it by following these steps:
    1. Check whether the operating system of the ECS instance is CentOS 5, CentOS 6, or CentOS 7. If the ECS instance does not run one of these operating systems, change the operating system and re-initialize the instance.
    2. Run the /opt/taobao/java/bin/java -version command to check whether the version of the Java Development Kit (JDK) is 1.8.
    3. Run the ls -al /opt/taobao command to check whether the admin user has permissions on the /opt/taobao directory. If the admin user lacks permissions and root permission is needed to grant them, switch to the root user and run the chown admin:admin /opt/taobao -R command. Then, switch back to the admin user and run the initialization commands.
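
The time-difference check described in this section can be scripted. The following is a minimal sketch, not part of the agent tooling: the function names skew_seconds and clock_skew_ok are hypothetical, and the reference timestamp must come from a clock that you trust (how you obtain it is left open here). Unix timestamps do not depend on the time zone, so comparing epoch seconds is equivalent to comparing both clocks in UTC+8.

```shell
#!/bin/sh
# Hypothetical helpers; they are not shipped with the agent.

# skew_seconds LOCAL_EPOCH REFERENCE_EPOCH
# Print the absolute difference between two Unix timestamps, in seconds.
skew_seconds() {
    diff=$(( $1 - $2 ))
    if [ "$diff" -lt 0 ]; then
        diff=$(( 0 - diff ))
    fi
    echo "$diff"
}

# clock_skew_ok LOCAL_EPOCH REFERENCE_EPOCH [LIMIT_SECONDS]
# Succeed if the difference is at most LIMIT_SECONDS (default: 300, that is, 5 minutes).
clock_skew_ok() {
    limit=${3:-300}
    [ "$(skew_seconds "$1" "$2")" -le "$limit" ]
}

# Example usage on the ECS instance (reference_epoch is a value that you supply):
#   TZ=Asia/Shanghai date    # display the instance time in UTC+8
#   clock_skew_ok "$(date +%s)" "$reference_epoch" || echo "Clock differs by more than 5 minutes"
```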

What can I do if an ECS instance is suddenly stopped and fails to be restarted?

If an ECS instance is suddenly stopped after a period of use and fails to be restarted or the issue persists after the restart, consider the following reasons:
  • Different users have started the agent, resulting in inconsistent permissions on relevant directories.
    To check whether different users have started the agent, follow these steps:
    1. Log on to the ECS instance and switch to the root user.
    2. Run the ps -ef | grep zoo | grep -v cdp command.
    If two processes are returned, different users have started the agent. In this case, follow these steps:
    1. Log on to the ECS instance and run the kill -9 command to end the two processes.
    2. Switch to the root user and run the chown admin:admin /home/admin/ -R command.
    3. Switch back to the admin user.
    4. Run the /home/admin/alisatasknode/target/alisatasknode/bin/serverctl restart command to restart the agent.
  • Relevant processes occupy too many handles.
    To check whether relevant processes occupy too many handles, follow these steps:
    • Log on to the ECS instance and run the grep "temporarily unavailable" /home/admin/alisatasknode/logs/alisatasknode.log command. If a result is returned, relevant processes occupy too many handles.
    • Restart the agent. If you fail to restart the agent and the error message Caused by: java.io.IOException: error=11, Resource temporarily unavailable is returned, relevant processes occupy too many handles.
    If the issue is caused by this reason, resolve it by following these steps:
    1. Switch to the root user and run the ps -ef | grep zoo | grep -v cdp command.
    2. Run the kill -9 command to end all the processes returned by the preceding ps command.
    3. Run the chown admin:admin /home/admin/ -R command.
    4. Switch back to the admin user.
    5. Run the /home/admin/alisatasknode/target/alisatasknode/bin/serverctl restart command to restart the agent.
  • The ECS instance is in a VPC and the UUID of the ECS instance is changed.
    1. Log on to the ECS instance and run the dmidecode | grep UUID command. Check the case of the letters in the returned UUID. For example, the UUID may have been registered in uppercase while the command now returns it in lowercase.
    2. Compare the returned UUID with the one in the Manage Server dialog box.

    If the issue is caused by this reason, remove the original instance on the Custom Resource Groups tab and add the instance with the new UUID.

Note If the instance cannot be removed and the error message remove node failed, exception: [3006:ERROR_GATEWAY_EXIST_TASKS]:gateway tasks not empty is returned, record the region where the instance resides, copy the error message, and then submit a ticket for consultation.
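
Because UUIDs are case-sensitive, it can help to distinguish a UUID that differs only in letter case from one that is entirely different. The following is a minimal sketch under stated assumptions: extract_uuid and compare_uuid are hypothetical helper names, and the parsing assumes that dmidecode prints the value on a line of the form "UUID: <value>".

```shell
#!/bin/sh
# Hypothetical helpers; they are not shipped with the agent.

# extract_uuid: read dmidecode output on stdin and print the first UUID value.
# Assumes a line of the form "UUID: <value>".
extract_uuid() {
    sed -n 's/^[[:space:]]*UUID:[[:space:]]*//p' | head -n 1
}

# compare_uuid REGISTERED ACTUAL
# Print "match", "case-mismatch" (same value, different letter case), or "different".
compare_uuid() {
    if [ "$1" = "$2" ]; then
        echo "match"
        return
    fi
    a=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    b=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
    if [ "$a" = "$b" ]; then
        echo "case-mismatch"
    else
        echo "different"
    fi
}

# Example usage as the root user (registered_uuid is the value shown in the
# Manage Server dialog box):
#   compare_uuid "$registered_uuid" "$(dmidecode | extract_uuid)"
```

A result of case-mismatch or different both mean that the registered UUID no longer equals the UUID the instance reports.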

What can I do if a node that is running on a custom resource group for scheduling is waiting for resources for a long period?

If a node that is running on a custom resource group for scheduling is waiting for resources for a long period, consider the following reasons:
  • The ECS instance is stopped.
    To check whether the ECS instance is stopped, follow these steps:
    1. Log on to the DataWorks console. In the left-side navigation pane, click Resource Groups.
    2. On the Resource Groups page, click the Custom Resource Groups tab.
    3. Find the target instance and click Manage Server in the Actions column to check whether the status of the ECS instance is Stopped.

    If the ECS instance is stopped, log on to the instance and start the agent.

  • The ECS instance is temporarily unavailable.
    • To check whether the ECS instance is temporarily unavailable, follow these steps:
      1. Log on to the ECS instance.
      2. View logs in the /home/admin/alisatasknode/logs/alisatasknode_status.log file.

        The logs display the status of the instance in real time. If the instance status is BUSY or HANGUP, a node that is running on the instance is consuming a large amount of resources.

    • To resolve this issue, follow these steps:
      1. Run the ps -ef | grep taskexec command to view the relevant processes of nodes.
      2. Check logs to find the node that occupies many resources.

      If the node is abnormal, terminate it in the DataWorks console. Wait 2 minutes for the instance to recover automatically.

  • The agent is abnormal.
    To check whether the agent is abnormal, follow these steps:
    • Run the df -h command to check whether the disk usage is 100%.
    • Check whether the CPU and memory usage is too high.

    If the issue is caused by this reason, resolve the issue of the instance and then restart the agent.
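
The disk usage check above can be automated. The following is a minimal sketch: full_filesystems is a hypothetical helper name, and it reads the output of df -P (the POSIX output format, which has the same columns as df -h but without human-readable sizes) so that the usage percentage is always in the fifth column.

```shell
#!/bin/sh
# Hypothetical helper; it is not shipped with the agent.

# full_filesystems [THRESHOLD_PERCENT]
# Read "df -P" output on stdin and print each mount point whose usage is
# at or above the threshold (default: 100).
full_filesystems() {
    awk -v limit="${1:-100}" 'NR > 1 {
        use = $5
        sub(/%/, "", use)
        if (use + 0 >= limit + 0) print $6
    }'
}

# Example usage on the ECS instance:
#   df -P | full_filesystems       # file systems that are 100% full
#   df -P | full_filesystems 90    # file systems that are at least 90% full
```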

How do I temporarily disable or initialize the agent?

To temporarily disable the agent, use the procedure that matches how you added the agent:
  • If you added the agent on the Custom Resource Groups tab of the Resource Groups page, find the target instance and click Manage Server in the Actions column. In the Manage Server dialog box that appears, click Freeze.
  • If you added the agent on the Custom Resource Group page in Data Integration, you cannot temporarily disable the agent. In this case, submit a ticket for consultation.
To initialize the agent, follow these steps:
  1. Switch to the root user and run the ps -ef | grep zoo | grep alisa command.
  2. Run the kill -9 command to end the processes returned by the preceding ps command.
  3. Delete the /home/admin/alisatasknode directory.
  4. Run the install.sh script in an empty directory.
    Note Download the install.sh script in the region where the instance resides.
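
Before you run kill -9, you may want to list the process IDs that the ps -ef | grep zoo | grep alisa filter selects and review them. The following is a minimal sketch: agent_pids is a hypothetical helper name, and it assumes the standard ps -ef layout in which the process ID is the second column.

```shell
#!/bin/sh
# Hypothetical helper; it is not shipped with the agent.

# agent_pids: read "ps -ef" output on stdin and print the PID (second column)
# of every line that contains both "zoo" and "alisa".
agent_pids() {
    awk '/zoo/ && /alisa/ { print $2 }'
}

# Example usage as the root user. The snapshot is captured first so that the
# filter process itself does not appear in the list:
#   snapshot=$(ps -ef)
#   printf '%s\n' "$snapshot" | agent_pids
```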

How do I enable the agent to start automatically after I restart an ECS instance?

To enable the agent to start automatically after an ECS instance is restarted, follow these steps:
  1. Log on to the ECS instance and switch to the root user.
  2. Run the wget https://alisaproxy.shuju.aliyun.com/install_monitor.sh --no-check-certificate command.
  3. Run the sh install_monitor.sh command.

What can I do if I fail to start a custom resource group?

If you enter the hostname instead of the UUID to register an ECS instance whose network type is VPC and fail to start a custom resource group, run the tail -f /home/admin/alisatasknode/logs/alisatasknode.log command to view the operational logs to determine the cause.

If you enter the UUID to register an ECS instance whose network type is VPC and fail to start a custom resource group, the initialization commands may be incorrect and you need to correct the commands.

In the latter case, follow these steps to resolve the issue:
  1. Log on to the DataWorks console.
  2. In the left-side navigation pane, click Resource Groups.
  3. On the Resource Groups page, click the Custom Resource Groups tab.
  4. Find the target instance and click Initialize Server in the Actions column.
  5. Follow the steps listed in the Initialize Server dialog box.
    Note In step 3, change enable_uuid=false to enable_uuid=true in the commands.
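
If you save the commands from the Initialize Server dialog box to a file before you run them, the change described in the note can be applied with a text substitution. The following is a minimal sketch: enable_uuid_flag is a hypothetical helper name and init_commands.txt is a hypothetical file name. Only the enable_uuid=false and enable_uuid=true strings come from the note above.

```shell
#!/bin/sh
# Hypothetical helper; it is not shipped with the agent.

# enable_uuid_flag: read commands on stdin and print them with every
# "enable_uuid=false" changed to "enable_uuid=true".
enable_uuid_flag() {
    sed 's/enable_uuid=false/enable_uuid=true/g'
}

# Example usage (init_commands.txt is a hypothetical file that holds the
# commands copied from the Initialize Server dialog box):
#   enable_uuid_flag < init_commands.txt
```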

What are the advantages of custom resource groups?

  • Guarantee sufficient resources: All tenants share the default resource group, so its resource usage may be high and your nodes may wait for resources for a long period. If your nodes have high resource requirements, you can select a custom resource group when you create a node.
  • Connect to data stores in various network environments: The default resource group cannot connect to databases in a VPC. You can use a custom resource group to connect to these databases.
  • Provide scheduling resources: If resources for scheduling are insufficient, you can use a custom resource group.
  • Improve concurrency: The default resource group provides a limited number of slots. You can add custom resource groups to obtain more slots and run more nodes concurrently.

What are the restrictions on custom resource groups?

  • You can add an ECS instance to only one custom resource group but you can add multiple ECS instances to a custom resource group.
  • If you set the network type to classic network when you register an ECS instance, you must enter the hostname of the instance. If you set the network type to VPC, you must enter the UUID of the instance.
  • You can select only one network type for each custom resource group.
  • You cannot run manually triggered node instances on custom resource groups.
  • ECS instances must be able to access the Internet. You can configure a public IP address, an Elastic IP Address (EIP), or a source network address translation (SNAT) IP address of the Network Address Translation (NAT) gateway for an ECS instance.

What information do I need to view after I install a custom resource group?

After you install a custom resource group based on the instructions in the DataWorks console, log on to an ECS instance and view the following information about the agent:
  • Default directory: /home/admin/. This directory usually contains the following folders:
    • alisatasknode: stores agent-related configurations and commands.
    • datax and datax-on-flume: store the synchronization wrapper library and configurations.
  • Agent-related commands: Currently, you can run commands such as stop, start, and restart on the agent process.
    /home/admin/alisatasknode/target/alisatasknode/bin/serverctl start/stop/restart
  • Operational logs: The operational logs of the agent are stored in the following directories:
    • /home/admin/alisatasknode/taskinfo/: stores the operational logs of shell scripts. The logs are the same as the operational logs of DataWorks nodes.
    • /home/admin/alisatasknode/logs: The alisatasknode.log file stores the running information of the agent, such as the node operations and the heartbeat status of the agent.
    • /home/admin/datax3/log: stores detailed operational logs of Data Integration nodes. If a node fails, you can view the logs for troubleshooting.

How do I monitor the status of the agent process?

You can follow these steps to monitor the agent process. If the agent process exits, you can recover it in a timely manner.
  1. Log on to the ECS instance as the root user.
  2. Run the wget https://alisaproxy.shuju.aliyun.com/install_monitor.sh --no-check-certificate command.
  3. Run the sh install_monitor.sh command. By default, monitoring logs are stored in the /home/admin/alisatasknode/monitor/monitor.log file.
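
The steps above can be wrapped in a small script that installs the monitor only when no monitoring log is found. This is a minimal sketch under stated assumptions: monitor_log_present and install_monitor are hypothetical helper names, and treating a non-empty monitor.log file as a sign that the monitor has already run is an assumption, not an official check.

```shell
#!/bin/sh
# Hypothetical helpers; they are not shipped with the agent.

MONITOR_URL="https://alisaproxy.shuju.aliyun.com/install_monitor.sh"
MONITOR_LOG="/home/admin/alisatasknode/monitor/monitor.log"

# monitor_log_present [LOG_FILE]
# Succeed if the monitoring log exists and is not empty.
# Assumption: a non-empty log suggests that the monitor has run at least once.
monitor_log_present() {
    [ -s "${1:-$MONITOR_LOG}" ]
}

# install_monitor: download the installation script and run it (as the root user).
install_monitor() {
    wget "$MONITOR_URL" --no-check-certificate -O install_monitor.sh &&
        sh install_monitor.sh
}

# Example usage as the root user:
#   monitor_log_present || install_monitor
```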

What types of resources does the DataWorks scheduling system provide?

Custom resource groups are used in the DataWorks scheduling system. Currently, the DataWorks scheduling system provides level-1 scheduling resources and level-2 running resources.
  • Level-1 scheduling resources: Go to the Operation Center page and choose Cycle Task Maintenance > Cycle Instance in the left-side navigation pane. On the Cycle Instance page, right-click the target instance in the directed acyclic graph (DAG) on the right and select More. On the General tab that appears at the bottom, you can view the level-1 scheduling resources.
  • Level-2 running resources: Go to the Data Integration page and click Custom Resource Group in the left-side navigation pane. On the Custom Resource Group page, you can view the level-2 running resources.

How do I use custom resource groups?

  • Configure level-1 scheduling resources
    Log on to the DataWorks console. In the left-side navigation pane, click Resource Groups. On the Resource Groups page, click the Custom Resource Groups tab and add a resource group on this tab.
    Note The resource groups you add on this tab are applicable to shell nodes and the resources you configure on this tab are level-1 scheduling resources.
  • Configure level-2 running resources
    1. Log on to the DataWorks console. In the left-side navigation pane, click Workspaces. On the Workspaces page, find the target workspace and click Data Integration in the Actions column. On the page that appears, click Custom Resource Group in the left-side navigation pane. On the Custom Resource Group page, click Add Resource Group in the upper-right corner to add a custom resource group. For more information, see Add a custom resource group.
      Note The resource groups you add on this page are only applicable to sync nodes and the resources you configure on this page are level-2 running resources.
    2. After you add a custom resource group, click Resource Group configuration in the right-side navigation pane. In the Resource Group configuration pane that appears, set Programme to Custom DI Resource Groups and select the custom resource group you added.

What can I do if an error is returned when I add an ECS instance to a custom resource group?

If the error message gateway already exists is returned when you add an ECS instance to a custom resource group, follow these steps to resolve this issue:
  1. The error message indicates that the ECS instance has already been registered in the gateway, and an ECS instance can be added to only one custom resource group. Check whether an ECS instance with the same hostname or UUID exists on the Custom Resource Groups tab of the Resource Groups page or on the Custom Resource Group page in Data Integration.
  2. If you do not find such an ECS instance in your workspace, provide the request ID to Alibaba Cloud staff for consultation.

What can I do if the custom resource group I add is unavailable?

  • View logs in the alisatasknode.log file to check whether the heartbeat status code 302 is returned. If the heartbeat status code 302 is returned, check the following items:
    • Check whether the UUID on the custom resource group page is the same as that returned after you run the dmidecode | grep UUID command on the ECS instance.
      Note The UUID is case-sensitive.
    • If the UUIDs are different, enter the correct UUID and reinstall the agent.
      Note For dmidecode 3.0.5 or earlier, letters in a UUID are in uppercase. If you upgrade dmidecode to 3.1.2 or later, the letters in the UUID change to lowercase ones, which leads to an abnormal heartbeat. In this case, you need to reinstall the agent.
    • Check whether the username and password in the config.properties file are the same as those that appear when you install the agent on the custom resource group page. If not, reinstall the agent by running the commands listed in the agent installation dialog box.
    • If the UUID, username, and password are correct, check the node.uuid.enable parameter in the config.properties file. For an ECS instance in a VPC, the value of this parameter must be true. If node.uuid.enable is set to false for an ECS instance in a VPC, change the value to true and restart the agent process.
  • View logs in the alisatasknode.log file to check whether the logs contain information related to connection timeout. If such information exists, follow these steps:
    1. Check whether the ECS instance can access the Internet, for example, whether the ECS instance is configured with a public IP address, an EIP, or an SNAT IP address of the NAT gateway. To test this, run the ping www.taobao.com command. If www.taobao.com responds, the ECS instance can access the Internet.
    2. If the ECS instance can access the Internet, check whether access control is enabled in the outbound rule of the security group for traffic over the Internet or internal network. If access control is enabled, add the IP address and port number of the gateway to the outbound rule.
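
The node.uuid.enable check described in this section can be done from the command line. The following is a minimal sketch: get_property is a hypothetical helper name, and it assumes that config.properties uses plain key=value lines. The file path is the one used elsewhere in this topic.

```shell
#!/bin/sh
# Hypothetical helper; it is not shipped with the agent.

CONFIG="/home/admin/alisatasknode/target/alisatasknode/conf/config.properties"

# get_property FILE KEY
# Print the value of KEY from a key=value properties file.
# If the key appears more than once, the last assignment is printed.
get_property() {
    # Escape dots so that they match literally in the sed pattern.
    key=$(printf '%s' "$2" | sed 's/\./\\./g')
    sed -n "s/^[[:space:]]*${key}[[:space:]]*=[[:space:]]*//p" "$1" | tail -n 1
}

# Example usage for an ECS instance in a VPC:
#   [ "$(get_property "$CONFIG" node.uuid.enable)" = "true" ] ||
#       echo "Set node.uuid.enable to true and restart the agent process"
```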

What can I do if an ECS instance is normal but a shell node fails?

  • Use the keyword T3_0699121848 to search for detailed error information in the alisatasknode.log file.
  • Log on to the ECS instance, switch to the admin user, and then run the python -V command to check whether the Python version is 2.7 or 2.6.
    Note Currently, the agent supports Python 2.7 or 2.6. If the Python version is not 2.7 or 2.6, the error message replace user hive conf error is returned.

What can I do if I fail to find a specific operational log file of DataWorks?

Log on to the ECS instance, switch to the admin user, and then run the sh -x <script name> command, replacing <script name> with the name of the script. Check whether the script runs successfully. If it fails, resolve the issue based on the returned error message.

What can I do if an OOM error occurs and I fail to allocate memory to relevant threads when I run a node on a custom resource group?

Symptom: An out of memory (OOM) error occurs when a node is run on a custom resource group. The operational logs shown in the following figure indicate that memory cannot be allocated to the relevant threads.

[Figure: Error message]

Cause: The memory size you set when you create a custom resource group determines the slot capability of the resource group. The system and agent processes in a resource group occupy a part of memory. Therefore, the memory of an ECS instance cannot be all used for slots, and an OOM error may occur when too many nodes are run concurrently.

Workaround: We recommend that you decrease the memory size for the custom resource group and reserve 2 GB memory for the system and agent processes. If other processes exist, we recommend that you reserve more memory.
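
The recommended setting is simple arithmetic: the memory of the instance minus the reserved amount. For example, an instance with 16 GB of memory leaves 14 GB for the custom resource group when 2 GB is reserved. The following is a minimal sketch: slot_memory_gb is a hypothetical helper name, and reading the total memory from /proc/meminfo in the usage example is an assumption about how you obtain that figure.

```shell
#!/bin/sh
# Hypothetical helper; it is not shipped with the agent.

# slot_memory_gb TOTAL_GB [RESERVED_GB]
# Print the memory (in GB) to configure for the custom resource group after
# RESERVED_GB (default: 2) is set aside for the system and agent processes.
slot_memory_gb() {
    reserved=${2:-2}
    usable=$(( $1 - reserved ))
    if [ "$usable" -lt 0 ]; then
        usable=0
    fi
    echo "$usable"
}

# Example usage on the ECS instance:
#   total_gb=$(awk '/^MemTotal:/ { print int($2 / 1024 / 1024) }' /proc/meminfo)
#   slot_memory_gb "$total_gb"       # reserve the default 2 GB
#   slot_memory_gb "$total_gb" 4     # reserve more if other processes exist
```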