DataWorks: Use a PyODPS node to reference a third-party package

Last Updated: Jul 10, 2024

This topic describes how to use a PyODPS node in DataWorks to reference a third-party package. You can reference a common Python script or a third-party open source package.

Background information

  • To reference a third-party package in a PyODPS node that runs on a DataWorks resource group, install the package by using the method that applies to the type of resource group that you use.

  • If your PyODPS node needs to access a data source or service in a special network environment, such as a virtual private cloud (VPC) or data center, use a new-version resource group or an old-version exclusive resource group for scheduling to run the node, and establish a network connection between the resource group and the data source or service.

  • For information about the PyODPS syntax, see PyODPS documentation.

  • PyODPS nodes come in two types that run on different Python versions: PyODPS 2 nodes use Python 2, and PyODPS 3 nodes use Python 3. Create the node type that matches the Python version of your code. For more information about how to create a PyODPS node, see Create a PyODPS 2 node and Create a PyODPS 3 node.

Limits

  • The resource group that runs a node has limited resources, so we recommend that a PyODPS node process no more than 50 MB of data locally. If a node processes more than 50 MB of data locally, an out-of-memory (OOM) exception may occur and the system may report Got killed. Avoid writing heavy data processing code in a PyODPS node. For more information, see Overview.

  • If the system reports Got killed, the memory usage exceeded the limit and the system terminated the related processes. To avoid this, do not perform local data operations. The memory limit does not apply to SQL or DataFrame tasks that are initiated by PyODPS, with the exception of to_pandas tasks, which remain subject to the limit.

  • You can use the NumPy and pandas libraries that are pre-installed in DataWorks to run functions other than UDFs on an exclusive resource group for scheduling. Third-party packages that contain binary code are not supported.

  • For compatibility reasons, options.tunnel.use_instance_tunnel is set to False in DataWorks by default. If you want to globally enable InstanceTunnel, you must set this parameter to True.
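
The InstanceTunnel setting in the last item can be changed at the top of a node's code. The following is a minimal sketch, not an official snippet: it assumes that PyODPS is importable as odps (as it is inside a DataWorks PyODPS node), and enable_instance_tunnel is a hypothetical helper name introduced only for illustration.

```python
# Minimal sketch: globally enable InstanceTunnel for the current PyODPS session.
# Assumption: PyODPS is importable as `odps`. Outside such an environment the
# import is skipped so that the sketch still runs.
try:
    from odps import options
except ImportError:
    options = None


def enable_instance_tunnel(opts):
    """Hypothetical helper: set the documented flag on an options object."""
    opts.tunnel.use_instance_tunnel = True
    return opts


if options is not None:
    enable_instance_tunnel(options)
```

Place the setting before any code that reads results, so that later reads in the same node pick it up.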

Reference a common Python script

  1. Go to the DataStudio page.

    Log on to the DataWorks console. In the top navigation bar, select the desired region. Then, choose Data Modeling and Development > DataStudio in the left-side navigation pane. On the page that appears, select the desired workspace from the drop-down list and click Go to DataStudio.

  2. Create a Python resource.

    1. On the DataStudio page, move the pointer over the Create icon and choose Create Resource > MaxCompute > Python.

      Alternatively, you can click the name of the desired workflow in the Business Flow section, right-click MaxCompute, and then choose Create Resource > Python.

    2. In the Create Resource dialog box, configure the Name parameter. In this example, the Name parameter is set to pyodps_packagetest.py.

      Important

      The resource name can contain only letters, digits, periods (.), underscores (_), and hyphens (-). It must end with .py.

    3. Click Create.

    4. On the configuration tab of the newly created Python resource, enter the common Python script that you want to reference. In this example, the following script is used:

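      # Optional: uncomment to inspect the working directory of the node.
      # import os
      # print(os.getcwd())
      # print(os.path.abspath('.'))
      # print(os.path.abspath('..'))
      # print(os.path.abspath(os.curdir))

      def printname():
          # Called from the PyODPS node after this module is imported.
          print('test2')

      # Runs once, when the module is first imported.
      print(123)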

    5. Click the Submit icon in the top toolbar.

  3. Create a PyODPS 2 node.

    1. In the Business Flow section, find the workflow in which you want to create a PyODPS 2 node, right-click MaxCompute, and then choose Create Node > PyODPS 2.

    2. In the Create Node dialog box, configure the Name parameter. In this example, the Name parameter is set to pyodps_testpackage.

      Note

      The node name cannot exceed 128 characters in length and can contain letters, digits, underscores (_), and periods (.).

    3. Click Confirm.

  4. Open the configuration tab of the newly created PyODPS 2 node. Then, right-click the name of the Python resource in the Resource folder of your workflow and select Insert Resource Path.

    After the resource is referenced, the ##@resource_reference{"pyodps_packagetest.py"} statement is automatically written in the code editor of the PyODPS 2 node.

  5. Enter the code that is used to reference the common Python script in the code editor of the PyODPS 2 node. In this example, the following code is used:

    ##@resource_reference{"pyodps_packagetest.py"} # Required. Declares the Python resource that this node references.

    import sys
    import os
    # Add the directory that contains the resource file to the module search path.
    sys.path.append(os.path.dirname(os.path.abspath('pyodps_packagetest.py')))
    import pyodps_packagetest # Import the resource as a module. Omit the .py suffix of the resource name.
    pyodps_packagetest.printname() # Call the function that is defined in the resource.

  6. Click the Run icon in the top toolbar and view the results on the Runtime Log tab in the lower part of the configuration tab.
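
The import pattern used in step 5 can be tried outside DataWorks. The following is a minimal sketch under stated assumptions: a temporary directory and a generated file stand in for the uploaded resource, and the stand-in printname returns its text instead of printing it so that the result can be checked. Only the sys.path.append and import steps mirror the node code.

```python
import importlib
import os
import sys
import tempfile

# Stand-in for the uploaded resource: a module file in a directory that is
# not yet on the module search path. Both are created here only so that the
# sketch runs anywhere.
resource_dir = tempfile.mkdtemp()
resource_path = os.path.join(resource_dir, "pyodps_packagetest.py")
with open(resource_path, "w") as f:
    f.write("def printname():\n    return 'test2'\n")

# Same pattern as in the node: add the directory that holds the file to
# sys.path, then import the module by its name without the .py suffix.
sys.path.append(os.path.dirname(os.path.abspath(resource_path)))
module = importlib.import_module("pyodps_packagetest")

result = module.printname()
print(result)
```

The import works only because the directory, not the file itself, is added to the search path. This is why the node code wraps the resource path in os.path.dirname.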


Reference a third-party open source package

Before you reference a third-party open source package, you must use pip to install the package and configure the package based on the resource group that you select.

Use a new-version resource group (general-purpose resource group) to configure a third-party open source package

New-version resource groups support custom images. When you create a custom image, you can select and install a third-party open source package based on your business requirements. Then, you can use the image when you configure scheduling properties for a PyODPS node.

Configuration description

In the Create Image panel, configure a third-party open source package based on your business requirements.

Key parameters:

  • Image Name/ID: Set the value to dataworks_pyodps_task_pod.

  • Supported Task Type: Select PyODPS 2 or PyODPS 3.

  • Installation Package: Select the desired third-party open source package.

Use an exclusive resource group for scheduling to configure a third-party open source package

If you use an exclusive resource group for scheduling, install the desired third-party open source package on the O&M Assistant page of the resource group. For more information, see Use the O&M Assistant feature.

The installation command depends on whether you use a PyODPS 2 node or a PyODPS 3 node.

  • If you want to use a PyODPS 2 node to reference the third-party open source package, run the following command to install the package:

    pip install <Package that you want to reference> -i https://pypi.tuna.tsinghua.edu.cn/simple

    If you are prompted to upgrade pip after you run the preceding command, run the following command:

    pip install --upgrade pip -i https://pypi.tuna.tsinghua.edu.cn/simple

  • If you want to use a PyODPS 3 node to reference the third-party open source package, run the following command to install the package:

    /home/tops/bin/pip3 install <Package that you want to reference> -i https://pypi.tuna.tsinghua.edu.cn/simple

    After the package is installed, use an import statement to import the package in your node. For example, run the pip3 install oss2 command on the O&M Assistant page to install the oss2 package. Then, add the import oss2 statement to the code of the PyODPS 3 node to import and reference oss2.

    If you are prompted to upgrade pip after you run the preceding command, run the following command:

    /home/tops/bin/pip3 install --upgrade pip -i https://pypi.tuna.tsinghua.edu.cn/simple

    If the following error occurs when you use the PyODPS 3 node, submit a ticket to apply for permissions:

    "/home/admin/usertools/tools/cmd-0.sh:Line 3: /home/tops/bin/python3: The file or directory does not exist."