Dataphin supports only Python scripts that do not depend on third-party components. To use a third-party component, install it with pip install. This topic describes how to create a Shell task in Dataphin that installs a third-party package and uses Python to read an uploaded file.
Prerequisites
Add the access address mirrors.aliyun.com and port * to your project's sandbox whitelist. For more information, see Step 2: Connect to an ApsaraDB RDS for MySQL instance.
Prepare a file in a format that Python can read, such as TXT, CSV, XLS, XLSX, or PDF.
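As a rough illustration of why the format matters, the following sketch sorts file names by whether the Python standard library can read them directly or a third-party package is typically needed. The mapping and the function names (reader_kind, read_csv_rows) are hypothetical and exist only for this example; they are not part of Dataphin. The sketch assumes Python 3.

```python
import csv
import os

# Hypothetical mapping for illustration: plain-text formats can be read with
# the standard library, while spreadsheet and PDF formats typically need a
# third-party package (for example, openpyxl for XLSX).
STDLIB_FORMATS = {".txt", ".csv"}
THIRD_PARTY_FORMATS = {".xls", ".xlsx", ".pdf"}


def reader_kind(filename):
    """Return 'stdlib', 'third-party', or 'unknown' based on the extension."""
    ext = os.path.splitext(filename)[1].lower()
    if ext in STDLIB_FORMATS:
        return "stdlib"
    if ext in THIRD_PARTY_FORMATS:
        return "third-party"
    return "unknown"


def read_csv_rows(path):
    """Read a CSV file into a list of rows using only the standard library."""
    with open(path, newline="") as handle:
        return list(csv.reader(handle))


print(reader_kind("test.xlsx"))
```

A file such as test.xlsx falls into the third-party group, which is why the Shell task in this topic installs openpyxl before it runs the Python script.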
Step 1: Upload files
Sign in to the Dataphin console.
In the Dataphin console, select the region of your workspace and click Enter Dataphin>>.
Go to the Resource Management page.
On the Dataphin home page, click Development.
On the Data Development page, click Data Processing.
In the left-side navigation pane, click the Resource Management icon.
On the Resource Management page, click the Resource Management icon.
In the New Resource dialog box, set the necessary parameters.
Parameter | Description
Type | Select others.
Name | The resource name must include the file type, such as test.xlsx.
Description | Provide a description for the resource.
Upload File | Choose a local file to upload, such as test.xlsx.
Compute Type | Choose No Affiliated Engine. Important: Because file resources are stored in the Dataphin system, only No Affiliated Engine can be selected.
Select Directory | The default directory is Resource Management.
Click Submit to submit the resource.
In the Submit Remarks dialog box, enter your remarks.
Click Confirm And Submit.
Step 2: Create a Shell task
On the Data Processing tab, click the Script Task icon in the left-side navigation pane.
On the Script Task page, click the Script Task icon next to , and select .
Configure the Shell task.
In the New File dialog box, set the parameters accordingly.
Parameter
Description
Name
Enter a name for the compute task, such as 'Python reading files'.
Schedule Type
Set the task's schedule type to Recurring Task Node.
Description
Provide a description for the task.
Select Directory
The system will automatically select Code Management.
Click Confirm.
Step 3: Write and run Shell task code
On the code editor page, write the following code.
# Create a new directory on the Dataphin Linux server.
# Then install openpyxl from the mirror into /tmp/chars/ so that Python can import it from there.
mkdir -p /tmp/chars/ && \
pip install -i https://mirrors.aliyun.com/pypi/simple/ \
--target=/tmp/chars/ \
openpyxl

# Write the Python source to openfile.py.
cat >openfile.py <<EOF
@resource_reference{"test.xlsx"}
# -*- coding:utf-8 -*-
import os
import sys
# Make the packages installed in /tmp/chars/ importable.
sys.path.append('/tmp/chars/')
import openpyxl
print('========= python execute ok ==========')
print("start===============")
args = sys.argv
# Open the Excel file passed as the first argument.
wb = openpyxl.load_workbook(args[1])
# Print the first worksheet. wb.get_sheet_names is deprecated and raises a warning, so wb.worksheets is used instead.
print(wb.worksheets[0])
EOF

# Invoke the file in Python.
python openfile.py test.xlsx
Replace the test.xlsx parameter with the name of your uploaded file.
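The script relies on two pieces working together: pip install --target places the package in a directory of your choice, and sys.path.append makes that directory visible to the import system. The sketch below demonstrates only the second half, using a temporary directory and a stand-in module instead of /tmp/chars/ and openpyxl. The names import_from_target and demo_pkg_standin are hypothetical, and the sketch assumes Python 3.

```python
import importlib
import os
import sys
import tempfile


def import_from_target(target_dir, module_name):
    """Append target_dir to sys.path if needed, then import module_name."""
    if target_dir not in sys.path:
        sys.path.append(target_dir)
    # Refresh the import system's directory caches before importing.
    importlib.invalidate_caches()
    return importlib.import_module(module_name)


# Stand-in for a directory populated by `pip install --target=...`.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "demo_pkg_standin.py"), "w") as handle:
    handle.write("GREETING = 'imported from target dir'\n")

module = import_from_target(demo_dir, "demo_pkg_standin")
print(module.GREETING)
```

Because the directory is appended rather than inserted at the front of sys.path, a package with the same name that is already installed in the environment would take precedence over the copy in the target directory.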
To run the task code, click Execute in the upper-right corner of the page.
If the run status is SUCCESS, the file was read successfully.
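The sample script reads args[1] without checking it. The following sketch shows one way to validate the argument before opening the file, so that a missing or misnamed file stops the script with a non-zero exit code. It assumes that the run status of the task reflects the exit code of the script, which this topic does not confirm. The function name input_path_from_args is hypothetical.

```python
import os
import sys


def input_path_from_args(argv):
    """Return the file path passed on the command line, or exit with an error.

    Raising SystemExit with a string prints the message to stderr and
    makes the interpreter exit with status 1.
    """
    if len(argv) < 2:
        raise SystemExit("usage: python openfile.py <file>")
    path = argv[1]
    if not os.path.isfile(path):
        raise SystemExit("file not found: " + path)
    return path
```

In openfile.py, the call input_path_from_args(sys.argv) could replace the direct use of args[1] before the workbook is loaded.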