DataWorks: Use a PyODPS node to reference a third-party package

Last Updated: Aug 17, 2023

This topic describes how to use a PyODPS node in DataWorks to reference a third-party package. You can reference a common Python script or a third-party open source package.

Background information

  • DataWorks lets you create Python resources in its visual interface. If a Python resource depends on a third-party open source package, use an exclusive resource group for scheduling and install the package on the O&M Assistant page of the resource group.
  • A third-party package that you install on the O&M Assistant page can be referenced only when you run a PyODPS node on an exclusive resource group for scheduling. For information about how to reference third-party packages in MaxCompute Python user-defined functions (UDFs), see Reference third-party packages in Python UDFs.
  • If your PyODPS node needs to access a data source or service in a special network environment, such as a virtual private cloud (VPC) or data center, use an exclusive resource group for scheduling to run the node, and establish a network connection between the resource group and the data source or service.
  • For more information about the PyODPS syntax, see PyODPS documentation.
  • PyODPS nodes are classified into two types: PyODPS 2 and PyODPS 3. The two types of PyODPS nodes use different Python versions at the underlying layer. PyODPS 2 nodes use Python 2, and PyODPS 3 nodes use Python 3. You can create a PyODPS node based on the Python version in use. For more information about how to create a PyODPS node, see Create a PyODPS 2 node and Create a PyODPS 3 node.
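
To confirm which interpreter a node actually runs on, you can read sys.version_info at the top of the node code. The following is a minimal sketch. The helper name node_type_for is hypothetical and is not part of DataWorks or PyODPS. It only maps the major Python version to the node type described above.

```python
import sys

def node_type_for(version_info=sys.version_info):
    """Map a Python version tuple to the matching PyODPS node type."""
    # Hypothetical helper for illustration; not a DataWorks or PyODPS API.
    return "PyODPS 3" if version_info[0] >= 3 else "PyODPS 2"

# Report the interpreter version and the node type it corresponds to.
print("Python %d.%d -> %s" % (sys.version_info[0], sys.version_info[1], node_type_for()))
```

The snippet uses only syntax that is valid in both Python 2 and Python 3, so it can be pasted into either node type.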

Limits

  • When you use a PyODPS node to reference a third-party package, you can use only an exclusive resource group for scheduling to run the node. For more information about how to create and use an exclusive resource group for scheduling, see Create and use an exclusive resource group for scheduling.
  • Because of the resource specifications of the resource group that runs a node, we recommend that a PyODPS node process no more than 50 MB of local data. If a node processes more than 50 MB of local data, an out-of-memory (OOM) error may occur and the system may report Got killed. We recommend that you do not write heavy data processing code in a PyODPS node.
  • If the system reports Got killed, memory usage exceeded the limit and the system terminated the related processes. Avoid local data operations where possible. The limits on memory usage and CPU utilization do not apply to SQL or DataFrame tasks that PyODPS initiates, except for to_pandas tasks, which are still subject to the limits.
  • You can use the NumPy and pandas libraries that are pre-installed in DataWorks to run functions other than UDFs. Third-party packages that contain binary code are not supported.
  • For compatibility reasons, options.tunnel.use_instance_tunnel is set to False in DataWorks by default. If you want to globally enable InstanceTunnel, you must set this parameter to True.
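
The InstanceTunnel setting in the last item can be changed at the top of the node code. The following is a minimal sketch that assumes the odps package (PyODPS) is importable, as it is inside a PyODPS node. The try/except guard is an addition for illustration so that the snippet also runs where PyODPS is not installed.

```python
try:
    # Inside a PyODPS node, the odps package is assumed to be available.
    from odps import options
except ImportError:
    # Outside DataWorks, PyODPS may be missing; skip the setting in that case.
    options = None

if options is not None:
    # Enable InstanceTunnel globally for this node run.
    options.tunnel.use_instance_tunnel = True
    print("InstanceTunnel enabled: %s" % options.tunnel.use_instance_tunnel)
else:
    print("PyODPS is not installed in this environment; nothing was changed.")
```

Because the option is global, set it before the code that reads instance results.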

Reference a common Python script

  1. Go to the DataStudio page.

    Log on to the DataWorks console. In the left-side navigation pane, choose Data Modeling and Development > DataStudio. On the page that appears, select the desired workspace from the drop-down list and click Go to DataStudio.

  2. Create a Python resource.
    1. On the DataStudio page, move the pointer over the Create icon and choose Create Resource > MaxCompute > Python.
      Alternatively, you can click the name of the desired workflow in the Business Flow section, right-click MaxCompute, and then choose Create Resource > Python.
    2. In the Create Resource dialog box, configure the Name parameter. In this example, the Name parameter is set to pyodps_packagetest.py.
      Important The resource name can contain letters, digits, periods (.), underscores (_), and hyphens (-). It must end with .py.
    3. Click Create.
    4. On the configuration tab of the newly created Python resource, enter the common Python script that you want to reference. In this example, the following script is used:
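      # Optional debugging aids: uncomment to print the working directory.
      # import os
      # print(os.getcwd())
      # print(os.path.abspath('.'))
      # print(os.path.abspath('..'))
      # print(os.path.abspath(os.curdir))

      def printname():
          # Called from the PyODPS node to confirm that the import works.
          print('test2')

      # Runs once, when the module is imported.
      print(123)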
    5. Click the Submit icon in the top toolbar.
  3. Create a PyODPS 2 node.
    1. In the Business Flow section, find the workflow in which you want to create a PyODPS 2 node, right-click MaxCompute, and then choose Create Node > PyODPS 2.
    2. In the Create Node dialog box, configure the Name parameter. In this example, the Name parameter is set to pyodps_testpackage.
      Note The node name must be 1 to 128 characters in length and can contain letters, digits, underscores (_), and periods (.).
    3. Click Confirm.
  4. Open the configuration tab of the newly created PyODPS 2 node. Then, right-click the name of the Python resource in the Resource folder of your workflow and select Insert Resource Path.
    After the resource is referenced, the ##@resource_reference{"pyodps_packagetest.py"} statement is automatically added to the code editor of the PyODPS 2 node.
  5. Enter the code that is used to reference the common Python script in the code editor of the PyODPS 2 node. In this example, the following code is used:
    ##@resource_reference{"pyodps_packagetest.py"} # This statement is required to reference the created Python resource.

    import sys
    import os
    sys.path.append(os.path.dirname(os.path.abspath('pyodps_packagetest.py'))) # Add the directory that contains the resource to the module search path.
    import pyodps_packagetest # Import the resource as a module. Omit the .py suffix of the resource name.
    pyodps_packagetest.printname() # Call the function that is defined in the resource.
  6. Click the Run icon in the top toolbar and view the results on the Runtime Log tab in the lower part of the configuration tab.
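
The sys.path step in the node code can be tried outside DataWorks. The following is a minimal sketch for Python 3. It assumes a temporary directory as a stand-in for the location where DataWorks places the resource, and the stand-in printname returns its string instead of printing it so that the result can be inspected.

```python
import importlib
import os
import sys
import tempfile

# Stand-in for the uploaded resource. In DataWorks, the platform provides the file.
resource_dir = tempfile.mkdtemp()
resource_path = os.path.join(resource_dir, "pyodps_packagetest.py")
with open(resource_path, "w") as f:
    f.write("def printname():\n    return 'test2'\n")

# Same idea as in the node code: put the resource's directory on the search path.
sys.path.append(os.path.dirname(os.path.abspath(resource_path)))
importlib.invalidate_caches()  # Make sure the newly written file is discovered.

import pyodps_packagetest  # Module name is the resource name without the .py suffix.

result = pyodps_packagetest.printname()
print(result)
```

The import succeeds only because the directory was appended to sys.path first. Without that line, Python has no way to locate the module.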

Reference a third-party open source package

Before you reference a third-party open source package, you must use pip to install the package and make sure that the following requirements are met:
  • An exclusive resource group for scheduling is available. For more information, see Create and use an exclusive resource group for scheduling.
  • The third-party open source package is installed on the O&M Assistant page of the exclusive resource group for scheduling. For more information, see Use the O&M Assistant feature. The installation command depends on whether you use a PyODPS 2 node or a PyODPS 3 node.
    • If you want to use a PyODPS 2 node to reference the third-party open source package, run the following command to install the package:
      pip install <Package that you want to reference> -i https://pypi.tuna.tsinghua.edu.cn/simple
      If you are prompted to upgrade pip after you run the preceding command, run the following command:
      pip install --upgrade pip -i https://pypi.tuna.tsinghua.edu.cn/simple
    • If you want to use a PyODPS 3 node to reference the third-party open source package, run the following command to install the package:
      /home/tops/bin/pip3 install <Package that you want to reference> -i https://pypi.tuna.tsinghua.edu.cn/simple

      After the package is installed, use an import statement in the node code to import the package. For example, run the /home/tops/bin/pip3 install oss2 command on the O&M Assistant page to install the oss2 package. Then, add the import oss2 statement to the code of the PyODPS 3 node to import and reference oss2.

      If you are prompted to upgrade pip after you run the preceding command, run the following command:
      /home/tops/bin/pip3 install --upgrade pip -i https://pypi.tuna.tsinghua.edu.cn/simple
      If the following error occurs when you use the PyODPS 3 node, submit a ticket to apply for permissions.
      "/home/admin/usertools/tools/cmd-0.sh:Line 3: /home/tops/bin/python3: The file or directory does not exist."
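
Before a PyODPS 3 node relies on an installed package, it can check that the package is visible to the interpreter. The following is a minimal sketch for Python 3. The helper name is_installed is hypothetical and is not part of DataWorks or PyODPS, and oss2 is used only because it is the example package in this topic.

```python
import importlib.util

def is_installed(package_name):
    """Return True if the current interpreter can find a top-level package."""
    # Hypothetical helper for illustration; not a DataWorks or PyODPS API.
    return importlib.util.find_spec(package_name) is not None

# oss2 is found only if it was installed for this interpreter; json is in the standard library.
for name in ("oss2", "json"):
    print("%s importable: %s" % (name, is_installed(name)))
```

If the check reports that a package is missing on an exclusive resource group, a likely cause is that it was installed with the pip of the other Python version.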