Dataphin supports Python computing tasks written in standard Python syntax, which covers a wide range of Python application scenarios. This topic describes how to create a Python computing task in Dataphin.
Background information
Python 3.7 offers enhanced capabilities for big data processing compared to Python 2.7. For instance, Python 3.7 includes the list.clear() method, which Python 2.7 lacks. For more details, see Python.
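As a small illustration of the difference mentioned above, the following sketch empties a list in place. list.clear() is available in Python 3.3 and later; the slice deletion shown in the final comment is the equivalent that also works in Python 2.7. The variable names are illustrative only.

```python
# Runs on Python 3.x (list.clear() was added in Python 3.3).
records = ["a", "b", "c"]
alias = records          # a second reference to the same list object

records.clear()          # empties the list in place

# Both names still point to the same, now empty, list.
print(records, alias)    # prints: [] []

# Python 2.7 has no list.clear(); the in-place equivalent there is:
#     del records[:]
```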
Limits
Python 3.7 is not backward compatible with Python 2.7. Existing Python 2.7 tasks cannot be directly upgraded to Python 3.7.
From version 2.9.3 onwards, Dataphin uses Python 3.7 by default for computing task development. The Python version of a task can be changed only while the task is in draft status. The supported versions are Python 2.7, Python 3.7, and Python 3.11.
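Because the two major versions are incompatible, code moved from a Python 2.7 task into a Python 3.7 task usually needs changes. The sketch below shows two common differences using only the standard library. It is illustrative and not a complete migration checklist.

```python
import sys

# Report the interpreter version the task is actually running on.
major, minor = sys.version_info[:2]
print("Running on Python %d.%d" % (major, minor))

# In Python 3, "/" is true division and "//" is floor division.
# In Python 2.7, 7 / 2 on integers evaluates to 3.
true_div = 7 / 2
floor_div = 7 // 2
print(true_div, floor_div)   # on Python 3: 3.5 3

# In Python 3, print is a function; the Python 2 statement form
# (print "hello") is a SyntaxError.
print("hello")
```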
Task execution instructions
Once a Python task is edited in Dataphin, it is executed by the Dataphin scheduling cluster. The cluster clones Dataphin's built-in template image, which includes common Python resource packages, to run the task. These pre-installed packages can be utilized during task development. For additional information, refer to Appendix: Pre-installed Python Resource Packages.

If the built-in resource packages do not meet your needs, you can install the required resource packages through the Python third-party package management center. A task can then use an installed package by referencing it. At runtime, the system automatically places the referenced packages in the runtime environment. Because Dataphin clones the built-in template image for every run, any pip install command in the task code is rerun each time the task is executed. Using Python third-party packages is therefore recommended. For usage instructions, see Develop Python Computing Tasks Using Third-party Libraries.
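One way to see whether a package is already present in the runtime environment, instead of calling pip install inside the task, is to probe for it with the standard library. This is a minimal sketch: the helper name is_available is hypothetical, and which packages are actually present depends on the template image and on the third-party packages referenced by the task.

```python
import importlib.util


def is_available(module_name):
    """Return True if a top-level module can be imported in this environment."""
    return importlib.util.find_spec(module_name) is not None


# configparser is part of the Python 3 standard library, so this is True.
print(is_available("configparser"))

# A name that is not installed returns False rather than raising.
print(is_available("no_such_package_for_demo"))
```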
Procedure
Navigate to the Dataphin home page, and from the top menu bar, select Development > Data Development.
On the Development page, select a project from the top menu bar. In Dev-Prod mode, also select the environment.
In the left-side navigation pane, choose Data Processing > Script Task. In the Script Task list, click the icon and select PYTHON.
In the New Python Task dialog box, configure the following parameters:
Parameter
Description
Task Name
Enter the name of the code task. It must not exceed 256 characters. The following characters are not supported: vertical bars (|), forward slashes (/), backslashes (\), colons (:), question marks (?), angle brackets (<>), asterisks (*), and quotation marks (").
Schedule Type
Choose the schedule type for the task. Options for Schedule Type include:
Recurring Task: Automatically included in the system's periodic scheduling.
One-Time Task: Requires manual initiation of task execution.
Select Directory
Choose the directory in which to store the task. If no directory exists, create a folder as follows:
Click the icon above the task list to open the Create Folder dialog box.
In the Create Folder dialog box, enter the folder Name and select the Directory location.
Click Confirm.
Use Template
Select a code template to speed up development. Template code is read-only and cannot be edited. You complete code development by configuring the template parameters.
Python Third-party Package
If Python third-party packages are required, select the necessary ones. For more details, see Install and Manage Python Third-party Packages.
Note: After a third-party package is installed, the task must reference it before its modules can be imported in the code. You can set and edit the referenced packages in the task properties under Python Third-party Package.
Description
Provide a brief description of the task, within 1000 characters.
Click Confirm.
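The Task Name rule in the parameter table can be checked locally before a name is entered, for example when task names are generated in bulk. The sketch below is an assumption-based helper (is_valid_task_name is a hypothetical name, not a Dataphin API) that applies the documented limit of 256 characters and the listed unsupported characters. Rejecting empty names is an added assumption.

```python
# Characters the Task Name field does not support, per the parameter table.
FORBIDDEN_CHARS = set('|/\\:?<>*"')
MAX_LENGTH = 256


def is_valid_task_name(name):
    """Check a candidate name against the documented Task Name rule.

    Hypothetical helper for local checks; Dataphin performs its own
    validation in the dialog box.
    """
    if not name or len(name) > MAX_LENGTH:
        return False  # treating an empty name as invalid is an assumption
    return not any(ch in FORBIDDEN_CHARS for ch in name)


print(is_valid_task_name("daily_order_summary"))   # True
print(is_valid_task_name("orders/2024"))           # False: contains "/"
print(is_valid_task_name("x" * 257))               # False: too long
```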
In the code editing area of the current Python task tab, write the Python computing task code. Once the code is complete, click Run above the code editing area.
Note: Dataphin pre-installs common resource packages. Install any others that your business scenario requires. To use a package, add an import <resource package name> statement at the beginning of the code, for instance, import configparser. For more information, refer to Appendix: Pre-installed Python Resource Packages.
Declare the source encoding in a comment at the start of the Python file to prevent execution errors caused by the system encoding.
To reference uploaded resource files in Python code, see Upload Resources and References.
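A minimal sketch of the layout described in the note: an encoding declaration on the first line, followed by an import of a pre-installed package. The configuration text and its section and key names are made up for illustration.

```python
# -*- coding: utf-8 -*-
# The encoding declaration must be on the first or second line of the file.
import configparser

# Illustrative configuration text; the section and keys are hypothetical.
SAMPLE_CONFIG = """
[job]
name = demo_task
retries = 3
"""

parser = configparser.ConfigParser()
parser.read_string(SAMPLE_CONFIG)

job_name = parser.get("job", "name")
retries = parser.getint("job", "retries")
print(job_name, retries)   # prints: demo_task 3
```

In Python 3 the default source encoding is already UTF-8, so the declaration matters most for Python 2.7 tasks whose source contains non-ASCII characters.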
Click Attribute in the right sidebar of the page. In the Attribute panel, configure the task's Basic Information, Running Resources, Python Third-party Package, Runtime Parameter, Scheduling Properties (recurring tasks only), Schedule Dependency (recurring tasks only), Running Configuration, and Resource Configuration.
Basic Information
Define the task's name, responsible individual, description, and other foundational details. For guidance, see Configure Basic Task Information.
Running Resources
Allocate CPU and memory resources for the task. The default is 0.1 CPU cores and 256 MB of memory. For instructions, see Configure Offline Task Running Resources.
Python Third-party Package
Select the Python third-party packages to be included. For more details, see Install Python Module.
Runtime Parameter
Assign values to parameter variables in the attributes panel, allowing for automatic substitution during node scheduling. For configuration instructions, see Parameter Configuration and Use Node Parameters.
Scheduling Properties (Recurring Task)
When the scheduling type of an offline computing task is set to Recurring Task, you must configure the task's scheduling properties in addition to its Basic Information. For guidance on configuring these properties, see Configure Scheduling Properties.
Schedule Dependency (Recurring Task)
When the scheduling type of an offline computing task is set to Recurring Task, you must also set up the task's scheduling dependencies in addition to its Basic Information. For guidance on configuring these dependencies, see Configure Scheduling Dependency.
Running Configuration
Configure task-level running timeout and rerun policy based on the business scenario. If not set, the default tenant-level values will apply. For configuration instructions, see Configure Computing Task Running.
Resource Configuration
Set the resource group for the task. During execution, the task will utilize the resources from this group. For guidance, see Configure Computing Task Resource.
Save and submit the task under the current Python task tab.
Click the icon to save the code.
Click the icon to submit the code.
On the Submitting Log page, verify the Submission Content and Pre-check results, and provide any necessary remarks. For more information, see Offline Computing Task Submission Instructions.
After review, click Confirm And Submit.
Note: For data security, if the Python task code includes from dataphin import hivec or import dataphin, submission initiates a code review. A review request is generated automatically, and the task can proceed after approval.
The code reviewer is the project administrator (if multiple administrators exist, approval from any one suffices).
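To anticipate whether a submission is likely to trigger a code review, the task source can be scanned locally for the two import forms named in the note. The sketch below uses the standard ast module. The function name imports_dataphin is hypothetical, and this is not Dataphin's own detection logic, which this topic does not describe.

```python
import ast


def imports_dataphin(source):
    """Return True if the source imports the dataphin package in any form."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            # Matches "import dataphin" and "import dataphin.something".
            if any(a.name.split(".")[0] == "dataphin" for a in node.names):
                return True
        elif isinstance(node, ast.ImportFrom):
            # Matches "from dataphin import hivec" (absolute imports only).
            if node.level == 0 and (node.module or "").split(".")[0] == "dataphin":
                return True
    return False


print(imports_dataphin("from dataphin import hivec\n"))   # True
print(imports_dataphin("import configparser\n"))          # False
```

ast.parse uses the grammar of the interpreter that runs the check, so source written with Python 2-only syntax raises SyntaxError instead of returning a result.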
What to do next
In Dev-Prod mode, after successful task submission, proceed to the release list to publish the task to the production environment. For more details, see Manage Release Tasks.
In Basic mode, the Python task is ready for scheduling in the production environment as soon as it is submitted successfully. Visit the Operation Center to view published tasks. For more information, see View and Manage Script Tasks and View and Manage One-time Tasks.