Dataphin: Create Python Computing Task

Last Updated: Jan 21, 2025

Dataphin supports a wide range of Python application scenarios and lets you create computing tasks written in Python syntax. This topic describes how to create a Python computing task in Dataphin.

Background information

Python 3.7 offers enhanced capabilities for big data processing compared to Python 2.7. For instance, Python 3.7 includes the list.clear() method, which Python 2.7 lacks. For more details, see Python.
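
As a small illustration of the difference mentioned above: list.clear() was added in Python 3.3, so it is available in Python 3.7 but not in Python 2.7, where deleting a full slice is the usual equivalent. The variable names below are illustrative only.

```python
# Runs on Python 3.x. On Python 2.7 the list.clear() call raises AttributeError.
rows = [1, 2, 3]
rows.clear()            # available in Python 3.3 and later only

legacy_rows = [1, 2, 3]
del legacy_rows[:]      # works on both Python 2.7 and Python 3.x

print(rows, legacy_rows)
```

Code that must run under both interpreter versions can use the slice-deletion form.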

Limits

  • Python 3.7 is not backward compatible with Python 2.7. Existing Python 2.7 tasks cannot be directly upgraded to Python 3.7.

  • From version 2.9.3 onwards, Dataphin uses Python 3.7 by default for computing task development. You can change the Python version only while a task is in draft status. The supported versions are Python 2.7, Python 3.7, and Python 3.11.

Task execution instructions

  • Once a Python task is edited in Dataphin, it is executed by the Dataphin scheduling cluster. The cluster clones Dataphin's built-in template image, which includes common Python resource packages, to run the task. These pre-installed packages can be utilized during task development. For additional information, refer to Appendix: Pre-installed Python Resource Packages.

  • If the built-in resource packages do not meet your needs, install the required packages through the Python third-party package management center and reference them in the task. At runtime, the system automatically places the referenced packages in the runtime environment. Because Dataphin clones the built-in template image for every run, a package installed with the pip install command in the task code does not persist, and pip install is rerun each time the task is executed. We therefore recommend using Python third-party packages instead. For usage instructions, see Develop Python Computing Tasks Using Third-party Libraries.
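
Because every run starts from a fresh clone of the template image, a task may want to confirm at startup that the modules it depends on are actually present. The following is a minimal sketch using only the Python 3 standard library. It is not a Dataphin API, and the helper name missing_packages and the second module name are hypothetical.

```python
import importlib.util


def missing_packages(module_names):
    """Return the top-level module names that cannot be imported here."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


# "json" ships with Python; the second name is a deliberately nonexistent placeholder.
required = ["json", "hypothetical_pkg_not_installed"]
missing = missing_packages(required)
if missing:
    print("Not available in this runtime: " + ", ".join(missing))
```

A check like this fails fast with a clear message instead of raising ImportError partway through the task.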

Procedure

  1. Navigate to the Dataphin home page, and from the top menu bar, select Development > Data Development.

  2. On the Development page, select a project from the top menu bar. In Dev-Prod mode, you must also select the environment.

  3. In the left-side navigation pane, choose Data Processing > Script Task. In the Script Task list, click the image icon and select PYTHON.

  4. In the New Python Task dialog box, configure the following parameters:

    Parameter

    Description

    Task Name

    Enter the name of the code task. It must not exceed 256 characters. The following characters are not supported: vertical bars (|), forward slashes (/), backslashes (\), colons (:), question marks (?), angle brackets (<>), asterisks (*), and quotation marks (").

    Schedule Type

    Choose the schedule type for the task. Options for Schedule Type include:

    • Recurring Task: Automatically included in the system's periodic scheduling.

    • One-Time Task: Requires manual initiation of task execution.

    Select Directory

    Choose the directory in which to store the task. If no directory exists, create a folder as follows:

    1. Click the image icon above the task list to open the Create Folder dialog box.

    2. In the Create Folder dialog box, enter the folder Name and select the Directory location.

    3. Click Confirm.

    Use Template

    Utilize code templates for efficient development. Template task code is read-only and cannot be edited. Configure the template parameters to complete code development.

    Python Third-party Package

    If Python third-party packages are required, select the necessary ones. For more details, see Install and Manage Python Third-party Packages.

    Note

    After a third-party module is added to a Python package, the task must reference that package before the module can be imported in the code. You can set and edit the referenced modules in the task properties, under the Python third-party package configuration.

    Description

    Provide a brief description of the task, within 1000 characters.

  5. Click Confirm.

  6. In the code editing area of the current Python task tab, write the Python computing task code. Once the code is complete, click Run above the code editing area.

    Note
    • Dataphin pre-installs common resource packages. To use one, add an import <resource package name> statement at the beginning of the code, for example, import configparser. Install any additional resource packages that your business scenario requires. For more information, refer to Appendix: Pre-installed Python Resource Packages.

    • Declare the source encoding in a comment at the start of the Python file to prevent execution errors caused by the system's default encoding.

    • To reference uploaded resource files in Python code, see Upload Resources and References.
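
The first two points in this note can be sketched as follows. The example assumes a Python 3 task. The INI section and key names are invented for illustration. Python 3 reads source files as UTF-8 by default, whereas Python 2.7 assumes ASCII, so the declaration matters most for Python 2.7 tasks that contain non-ASCII characters.

```python
# -*- coding: utf-8 -*-
# The encoding declaration must be on the first or second line of the file (PEP 263).
import configparser  # named in this topic as a pre-installed package

# Parse a small in-memory INI snippet; the section and key names are made up.
parser = configparser.ConfigParser()
parser.read_string(u"[job]\nowner = 数据团队\nretries = 3\n")

owner = parser.get("job", "owner")
retries = parser.getint("job", "retries")
print(owner, retries)
```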

  7. Click on the Attribute option in the page's right sidebar. Within the Attribute panel, set up the task's Basic Information, Running Resources, Python Third-party Package, Runtime Parameter, Scheduling Properties for recurring tasks, Schedule Dependency for recurring tasks, Running Configuration, Resource Configuration, and other relevant details.

    • Basic Information

      Define the task's name, responsible individual, description, and other foundational details. For guidance, see Configure Basic Task Information.

    • Running Resources

      Allocate CPU and memory resources for the task. The default setting is 0.1 CPU cores and 256 MB of memory. For instructions, see Configure Offline Task Running Resources.

    • Python Third-party Package

      Select the Python third-party packages to be included. For more details, see Install Python Module.

    • Runtime Parameter

      Assign values to parameter variables in the attributes panel, allowing for automatic substitution during node scheduling. For configuration instructions, see Parameter Configuration and Use Node Parameters.

    • Scheduling Properties (Recurring Task)

      When the scheduling type of an offline computing task is set to Recurring Task, you must configure the task's scheduling properties in addition to its Basic Information. For guidance on configuring these properties, see Configure Scheduling Properties.

    • Schedule Dependency (Recurring Task)

      When the scheduling type of an offline computing task is set to Recurring Task, you must also set up the task's scheduling dependencies in addition to its Basic Information. For guidance on configuring these dependencies, see Configure Scheduling Dependency.

    • Running Configuration

      Configure task-level running timeout and rerun policy based on the business scenario. If not set, the default tenant-level values will apply. For configuration instructions, see Configure Computing Task Running.

    • Resource Configuration

      Set the resource group for the task. During execution, the task will utilize the resources from this group. For guidance, see Configure Computing Task Resource.

  8. Save and submit the task under the current Python task tab.

    1. Click the image icon to save the code.

    2. Click the image icon to submit the code.

  9. On the Submitting Log page, verify the Submission Content and Pre-check results, and provide any necessary remarks. For more information, see Offline Computing Task Submission Instructions.

  10. After review, click Confirm And Submit.

    Note
    • For data security, if the Python task code contains the statement from dataphin import hivec or import dataphin, submitting the task triggers a code review. A review request is generated automatically, and the task can proceed only after the request is approved.

    • The code reviewer is the project administrator. If the project has multiple administrators, approval from any one of them is sufficient.
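
To anticipate whether a submission is likely to trigger the code review described in the note above, you could scan the task code locally before submitting. The sketch below is only an approximation: this topic does not describe how Dataphin detects these statements. The helper name imports_dataphin is hypothetical, and the check is deliberately broader than the two statements named above, because it flags any import of or from the dataphin module.

```python
import ast


def imports_dataphin(source):
    """Return True if the source imports the dataphin module or anything from it."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # Handles "import dataphin" and "import dataphin.something".
            if any(alias.name.split(".")[0] == "dataphin" for alias in node.names):
                return True
        elif isinstance(node, ast.ImportFrom):
            # level == 0 excludes relative imports such as "from . import x".
            if node.level == 0 and (node.module or "").split(".")[0] == "dataphin":
                return True
    return False


print(imports_dataphin("from dataphin import hivec\n"))  # True
print(imports_dataphin("import configparser\n"))         # False
```

Parsing the code with ast rather than searching for text avoids false positives from comments or string literals that merely mention the module name.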

What to do next

  • In Dev-Prod mode, after successful task submission, proceed to the release list to publish the task to the production environment. For more details, see Manage Release Tasks.

  • In Basic mode, the Python task is ready for scheduling in the production environment as soon as it is submitted successfully. Go to the Operation Center to view published tasks. For more information, see View and Manage Script Tasks and View and Manage One-time Tasks.