
Platform for AI: Python Script

Last Updated: Feb 26, 2024

The Python Script component provided by Machine Learning Designer allows you to install custom dependencies and invoke custom Python functions. This topic describes how to configure the Python Script component and provides an example on how to use the component.

Background information

The Python Script component is placed in the UserDefinedScript folder on the left-side pane of a pipeline details page. To open a pipeline details page, go to the Visualized Modeling (Designer) page in the Platform for AI (PAI) console, select the pipeline that you want to use, and click Open.

Prerequisites

  • The permissions required to use Deep Learning Containers (DLC) are granted. For more information, see Grant the permissions that are required to use DLC.

  • The DLC computing resources on which the Python Script component depends are associated with the PAI workspace that you want to use. For more information, see Manage workspaces.

  • An Object Storage Service (OSS) bucket is created to store code for the Python Script component. For more information, see Create buckets.

    Important

    The OSS bucket must be created in the same region as Machine Learning Designer and DLC.

  • The Resource Access Management (RAM) user who manages the Python Script component is assigned the Algorithm Developer role in the workspace. For more information, see Manage members of the workspace. If the RAM user wants to use MaxCompute as a data source, you also need to assign the MaxCompute Developer role to the RAM user.

Configure the component in the PAI console

  • Input ports

    The Python Script component has four input ports that can be used to receive OSS data and MaxCompute table data.

    • Input ports for OSS data

      OSS data from upstream components can be mounted to the Python Script component. The system passes the path of the mounted data as a command line argument. No manual operations are required. For example, the python main.py --input1 /ml/input/data/input1 syntax specifies the path of OSS data that is read by the first OSS input port. The mounted files in the /ml/input/data/input1 path can be read in the same way as on-premises files.

    • Input ports for MaxCompute tables

      MaxCompute tables cannot be directly mounted to the component. The system converts the table metadata into a Uniform Resource Identifier (URI) and then passes the URI to the component as a command line argument. No manual operations are required. For example, the python main.py --input1 odps://some-project-name/tables/table syntax specifies the URI of the MaxCompute table that is read by the first MaxCompute input port. You can use the parse_odps_url function from the code template of this component to parse and obtain metadata such as the project name, table name, and partitions. For more information, see the Examples section in this topic.
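    For illustration, a mounted OSS input can be treated like any local directory. The sketch below is ours, not part of the PAI template: read_input_files is a hypothetical helper, and the mount path is whatever the system passes through --input1.

```python
import argparse
import os


def read_input_files(input_dir):
    """Read every file in the mounted input directory as if it were local."""
    contents = {}
    for name in sorted(os.listdir(input_dir)):
        path = os.path.join(input_dir, name)
        if os.path.isfile(path):
            with open(path) as f:
                contents[name] = f.read()
    return contents


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    # The system passes the mount path, for example /ml/input/data/input1.
    parser.add_argument("--input1", type=str, default=None)
    args, _ = parser.parse_known_args()
    if args.input1:
        for name, text in read_input_files(args.input1).items():
            print(name, len(text))
```

    Because the data is mounted, no OSS SDK calls are needed on the read path; plain file I/O is sufficient.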

  • Output ports

    The Python Script component has four output ports. OSS Output Port 1 and OSS Output Port 2 are used to export OSS data. Table Output Port 1 and Table Output Port 2 are used to export MaxCompute tables.

    • Output ports for OSS data

      The path specified by the Job Output Path parameter on the Code Config tab is automatically mapped to the /ml/output/ path. OSS Output Port 1 and OSS Output Port 2 correspond to the /ml/output/output1 and /ml/output/output2 paths, respectively. You can write files to these paths in the same way as you write on-premises files. The files are then passed to downstream components.

    • Output ports for MaxCompute tables

      If MaxCompute projects are associated with the PAI workspace, a temporary URI is passed as a command line argument by using the python main.py --output3 odps://<some-project-name>/tables/<output-table-name> syntax. You can use PyODPS to create a temporary table that corresponds to the URI, write the data that is processed by the component to the table, and pass the table to downstream components. For more information, see the Examples section in this topic.
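    As a minimal sketch of the OSS output behavior described above: write_result is a hypothetical helper of ours, and /ml/output/output1 is the path mapped to OSS Output Port 1.

```python
import os


def write_result(output_dir, filename, text):
    """Write a file under the mapped output path exactly like a local file.
    The system persists it to the OSS path set by Job Output Path, where
    downstream components connected to the matching port can read it."""
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, filename)
    with open(path, "w") as f:
        f.write(text)
    return path


# In the component, OSS Output Port 1 corresponds to /ml/output/output1:
# write_result("/ml/output/output1", "result.txt", "TestAccuracy=0.88")
```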

  • Component parameters

    Code Config

    Parameter

    Description

    Job Output Path

    The OSS path to which data is exported.

    • This OSS path is mapped to the /ml/output/ path. Data written to the /ml/output/ path is persisted to the mapped OSS path.

    • OSS Output Port 1 and OSS Output Port 2 correspond to the /ml/output/output1 and /ml/output/output2 paths, respectively. Downstream components that are connected to these output ports can read data from the mapped paths.

    Code Source

    (Select a source from the drop-down list)

    Literal Code

    • Python Code: the OSS path to store the script that you write in the code editor. By default, the script is saved as an object named main.py.

      Important

      Before you click Save, make sure that the OSS path that you want to use to store the script does not contain an object that has the same name as the current object. Otherwise, the existing object is overwritten.

    • Code editor: the Python code editor where sample code is provided by default. For more information, see the Examples section in this topic. You can write code in the code editor.

    Specify Git Configuration

    • Git Repository Address: the address of the Git repository.

    • Code branch: the branch where the code is stored. Default value: master.

    • Code Commit: the commit of the code that you want to use. This parameter takes precedence over the Code branch parameter. If you specify this parameter, the Code branch parameter is ignored.

    • Git Username: the Git username. This parameter is required if you want to access a private code repository.

    • Git Access Token: the access token to the Git repository. This parameter is required if you want to access a private code repository. For more information, see Obtain a GitHub account token.

    Select Code Source

    • Select Code Source Repositories: the code build that you created. For more information, see Code builds.

    • Code branch: the branch where the code is stored. Default value: master.

    • Code Commit: the commit of the code that you want to use. This parameter takes precedence over the Code branch parameter. If you specify this parameter, the Code branch parameter is ignored.

    Select OSS Path

    In the OSS Code Path field, you can select the path where the code is stored.

    Command

    Enter the command that you want to run. Example: python main.py.

    Note

    The system automatically generates a command based on the script name and the connected ports. No manual operations are required.

    Advanced Option

    • Third Dependency: the third-party dependencies that you want to install. Specify the dependencies in the same format as a Python requirements.txt file. The following code provides an example. The dependencies are automatically installed before the component runs.

      cycler==0.10.0            # via matplotlib
      kiwisolver==1.2.0         # via matplotlib
      matplotlib==3.2.1
      numpy==1.18.5
      pandas==1.0.4
      pyparsing==2.4.7          # via matplotlib
      python-dateutil==2.8.1    # via matplotlib, pandas
      pytz==2020.1              # via pandas
      scipy==1.4.1              # via seaborn
    • Whether to enable container monitoring: If you select this option, you can enter parameter configurations in the Error Monitoring Arguments field.

    Run Config

    Parameter

    Description

    ResourceGroup

    Public Resource Group is supported.

    • If you select Public Resource Group, set the InstanceType parameter to CPU or GPU and specify the CPU or GPU specifications. Default value: ecs.c6.large.

    By default, the resource group that is used by DLC resources of the current workspace is selected.

    VPC Settings

    You can select an existing virtual private cloud (VPC).

    Security Group

    You can select an existing security group.

    Advanced Option

    If you select this parameter, you can configure the following parameters:

    • Instance Count: the number of instances that you want to create. Specify a value for this parameter based on your business requirements. Default value: 1.

    • Job Image URI: the URI of the job image that you want to use. By default, open source XGBoost 1.6.0 is used. If you want to use deep learning frameworks, you must change the image.

    • Job Type: the job type. You need to modify this parameter only if the script is executed in a distributed manner. Valid values:

      • XGBoost/LightGBM Job

      • TensorFlow Job

      • PyTorch Job

      • MPI Job

Examples

Parse the default sample code

By default, the Python Script component provides the following sample code:

import os
import argparse
import json
"""
Sample code for the Python Script component
"""
# MaxCompute is used in this workspace. The name and endpoint of the MaxCompute project are required. 
# To run the code, make sure that a MaxCompute project is associated with the workspace. 
# Example: {"endpoint": "http://service.cn.maxcompute.aliyun-inc.com/api", "odpsProject": "lq_test_mc_project"}. 
ENV_JOB_MAX_COMPUTE_EXECUTION = "JOB_MAX_COMPUTE_EXECUTION"


def init_odps():
    from odps import ODPS
    # Information about the default MaxCompute project that is associated with the workspace. 
    mc_execution = json.loads(os.environ[ENV_JOB_MAX_COMPUTE_EXECUTION])
    o = ODPS(
        access_id="<YourAccessKeyId>",
        secret_access_key="<YourAccessKeySecret>",
        # Use the region in which the MaxCompute project resides. Example: http://service.cn-shanghai.maxcompute.aliyun-inc.com/api. 
        endpoint=mc_execution["endpoint"],
        project=mc_execution["odpsProject"],
    )
    return o


def parse_odps_url(table_uri):
    from urllib import parse
    parsed = parse.urlparse(table_uri)
    project_name = parsed.hostname
    # Split into at most four parts so that the partition stays separate from the table name.
    r = parsed.path.split("/", 3)
    table_name = r[2]
    if len(r) > 3:
        partition = r[3]
    else:
        partition = None
    return project_name, table_name, partition


def parse_args():
    parser = argparse.ArgumentParser(description="PythonV2 component script example.")
    parser.add_argument("--input1", type=str, default=None, help="Component input port 1.")
    parser.add_argument("--input2", type=str, default=None, help="Component input port 2.")
    parser.add_argument("--input3", type=str, default=None, help="Component input port 3.")
    parser.add_argument("--input4", type=str, default=None, help="Component input port 4.")
    parser.add_argument("--output1", type=str, default=None, help="Output OSS port 1.")
    parser.add_argument("--output2", type=str, default=None, help="Output OSS port 2.")
    parser.add_argument("--output3", type=str, default=None, help="Output MaxComputeTable 1.")
    parser.add_argument("--output4", type=str, default=None, help="Output MaxComputeTable 2.")
    args, _ = parser.parse_known_args()
    return args


def write_table_example(args):
    # Example: Execute an SQL statement to copy the public table data provided by PAI and feed the data to the temporary table of Table Output Port 1. 
    output_table_uri = args.output3
    o = init_odps()
    project_name, table_name, partition = parse_odps_url(output_table_uri)
    o.run_sql(f"create table {project_name}.{table_name} as select * from pai_online_project.heart_disease_prediction;")


def write_output1(args):
    # Example: Write the data to the subpath of OSS Output Port 1 and pass the data to downstream components by connecting to those components. 
    output_path = args.output1
    os.makedirs(output_path, exist_ok=True)
    p = os.path.join(output_path, "result.text")
    with open(p, "w") as f:
        f.write("TestAccuracy=0.88")


if __name__ == "__main__":
    args = parse_args()
    print("Input1={}".format(args.input1))
    print("Output1={}".format(args.output1))
    # write_table_example(args)
    # write_output1(args)

The preceding code includes the following commonly used functions:

  • init_odps(): initializes a MaxCompute instance to read the MaxCompute table data. To initialize the instance, you must enter your AccessKey ID and AccessKey secret. For more information about how to obtain an AccessKey pair, see Create an AccessKey pair.

  • parse_odps_url(table_uri): parses a MaxCompute table URI and returns the project name, table name, and partitions. The URI is in the odps://${your_projectname}/tables/${table_name}/${pt_1}/${pt_2}/ format. Example: odps://test/tables/iris/pa=1/pb=1. In this example, pa=1/pb=1 is a multi-level partition.

  • parse_args(): parses the arguments that are passed to the script. The arguments specify the input and output data of the script.
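The URI parsing described above can be tried standalone. The following sketch reproduces the parse_odps_url logic with only urllib.parse, splitting the partition out of the path so that a multi-level partition such as pa=1/pb=1 is returned separately:

```python
from urllib import parse


def parse_odps_url(table_uri):
    """Return (project, table, partition) from an odps:// table URI."""
    parsed = parse.urlparse(table_uri)
    project_name = parsed.hostname
    # Split into at most four parts: "", "tables", table name, partition.
    r = parsed.path.split("/", 3)
    table_name = r[2]
    partition = r[3] if len(r) > 3 else None
    return project_name, table_name, partition


print(parse_odps_url("odps://test/tables/iris/pa=1/pb=1"))
# ('test', 'iris', 'pa=1/pb=1')
```

For a URI without partitions, such as odps://test/tables/iris, the partition value is None.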

Example 1: Use Python Script with other components

This example uses the heart disease prediction template to show how to use the Python Script component together with other components. To configure the pipeline, perform the following steps:

  1. Create a pipeline based on the heart disease prediction template and open the pipeline. For more information, see Predict heart disease.

  2. Drag the Python Script component to the canvas, rename the component SMOTE, and then enter the following code.

    Important

    The imblearn library is not included in the image that is used in this example. You must specify the imblearn library in the Third Dependency field of the Code Config tab. The library is automatically installed before the component is run.

    import argparse
    import json
    import os
    from odps.df import DataFrame
    from imblearn.over_sampling import SMOTE
    from urllib import parse
    from odps import ODPS
    ENV_JOB_MAX_COMPUTE_EXECUTION = "JOB_MAX_COMPUTE_EXECUTION"
    
    
    def init_odps():
        # Information about the default MaxCompute project that is associated with the workspace. 
        mc_execution = json.loads(os.environ[ENV_JOB_MAX_COMPUTE_EXECUTION])
        o = ODPS(
            access_id="<Replace the value with your AccessKey ID>",
            secret_access_key="<Replace the value with your AccessKey secret>",
            # Use the region in which the MaxCompute project resides. Example: http://service.cn-shanghai.maxcompute.aliyun-inc.com/api. 
            endpoint=mc_execution["endpoint"],
            project=mc_execution["odpsProject"],
        )
        return o
    
    
    def get_max_compute_table(table_uri, odps):
        parsed = parse.urlparse(table_uri)
        project_name = parsed.hostname
        table_name = parsed.path.split('/')[2]
        table = odps.get_table(project_name + "." + table_name)
        return table
    
    
    def run():
        parser = argparse.ArgumentParser(description='PythonV2 component script example.')
        parser.add_argument(
            '--input1', type=str, default=None, help='Component input port 1.'
        )
        parser.add_argument(
            '--output3', type=str, default=None, help='Component output port 3.'
        )
        args, _ = parser.parse_known_args()
        print('Input1={}'.format(args.input1))
        print('output3={}'.format(args.output3))
        o = init_odps()
        imbalanced_table = get_max_compute_table(args.input1, o)
        df = DataFrame(imbalanced_table).to_pandas()
        sm = SMOTE(random_state=2)
        X_train_res, y_train_res = sm.fit_resample(df, df['ifhealth'].ravel())
        new_table = o.create_table(get_max_compute_table(args.output3, o).name, imbalanced_table.schema, if_not_exists=True)
        with new_table.open_writer() as writer:
            writer.write(X_train_res.values.tolist())
    
    
    if __name__ == '__main__':
        run()
    

    Replace access_id and secret_access_key with your AccessKey ID and AccessKey secret. For more information about how to obtain an AccessKey pair, see Obtain an AccessKey pair.

  3. Connect the SMOTE component as a downstream component of the Split component. The SMOTE component then uses the SMOTE algorithm to oversample the minority class in the split dataset and generates new samples to handle class imbalance.

  4. To use the generated samples for training, connect the Logistic Regression for Binary Classification component as a downstream component of the SMOTE component.

  5. Compare the models that are generated from the left and right branches by connecting the Confusion Matrix and Binary Classification Evaluation components as downstream components at the end of the two branches. After you run the pipeline, click the visualization icon to view the evaluation results.

    The evaluation results show that oversampling does not significantly improve model performance. This indicates that the original sample distribution and model already perform well.

Example 2: Use Python Script to orchestrate DLC jobs

In Machine Learning Designer, you can connect multiple Python Script components to orchestrate and schedule a pipeline of DLC jobs. For example, the following directed acyclic graph (DAG) starts four DLC jobs in sequence.

Note

If the DLC jobs do not need to read data from upstream components or pass data to downstream components, the connections between the components indicate only the dependencies among the components and their run order.

You can deploy the entire pipeline in Machine Learning Designer to DataWorks to schedule the pipeline as a periodic task. For more information, see Use DataWorks tasks to schedule pipelines in Machine Learning Designer.

Example 3: Pass global variables to Python Script

  1. Configure global variables.

    On the pipeline details page in Machine Learning Designer, click the blank area on the canvas and configure global variables on the Global Variables tab in the right-side pane.

  2. Use one of the following methods to pass the configured global variables to the Python script component:

    • Click the Python Script component. Select Advanced Option on the Code Config tab and pass the global variables in the Command field.

    • Modify the Python code to use argparse to parse the parameters.

      The following sample code shows how to parse the global variables that you configure in Step 1. You need to modify the code based on the actual global variables that you configure. After you modify the code, you can paste the code to the code editor on the Code Config tab.

      import os
      
      import argparse
      import json
      
      """
      Sample code for the Python Script component
      """
      
      ENV_JOB_MAX_COMPUTE_EXECUTION = "JOB_MAX_COMPUTE_EXECUTION"
      
      
      def init_odps():
      
          from odps import ODPS
      
          mc_execution = json.loads(os.environ[ENV_JOB_MAX_COMPUTE_EXECUTION])
      
          o = ODPS(
              access_id="<YourAccessKeyId>",
              secret_access_key="<YourAccessKeySecret>",
              endpoint=mc_execution["endpoint"],
              project=mc_execution["odpsProject"],
          )
          return o
      
      
      def parse_odps_url(table_uri):
          from urllib import parse
      
          parsed = parse.urlparse(table_uri)
          project_name = parsed.hostname
          # Split into at most four parts so that the partition stays separate from the table name.
          r = parsed.path.split("/", 3)
          table_name = r[2]
          if len(r) > 3:
              partition = r[3]
          else:
              partition = None
          return project_name, table_name, partition
      
      
      def parse_args():
          parser = argparse.ArgumentParser(description="PythonV2 component script example.")
      
          parser.add_argument("--input1", type=str, default=None, help="Component input port 1.")
          parser.add_argument("--input2", type=str, default=None, help="Component input port 2.")
          parser.add_argument("--input3", type=str, default=None, help="Component input port 3.")
          parser.add_argument("--input4", type=str, default=None, help="Component input port 4.")
      
          parser.add_argument("--output1", type=str, default=None, help="Output OSS port 1.")
          parser.add_argument("--output2", type=str, default=None, help="Output OSS port 2.")
          parser.add_argument("--output3", type=str, default=None, help="Output MaxComputeTable 1.")
          parser.add_argument("--output4", type=str, default=None, help="Output MaxComputeTable 2.")
          # Add code based on the configured global variables. 
          parser.add_argument("--arg1", type=str, default=None, help="Argument 1.")
          parser.add_argument("--arg2", type=int, default=None, help="Argument 2.")
          args, _ = parser.parse_known_args()
          return args
      
      
      def write_table_example(args):
      
          output_table_uri = args.output3
      
          o = init_odps()
          project_name, table_name, partition = parse_odps_url(output_table_uri)
          o.run_sql(f"create table {project_name}.{table_name} as select * from pai_online_project.heart_disease_prediction;")
      
      
      def write_output1(args):
          output_path = args.output1
      
          os.makedirs(output_path, exist_ok=True)
          p = os.path.join(output_path, "result.text")
          with open(p, "w") as f:
              f.write("TestAccuracy=0.88")
      
      
      if __name__ == "__main__":
          args = parse_args()
      
          print("Input1={}".format(args.input1))
          print("Output1={}".format(args.output1))
          # Add code based on the configured global variables. 
          print("Argument1={}".format(args.arg1))
          print("Argument2={}".format(args.arg2))
          # write_table_example(args)
          # write_output1(args)