Dataphin: Read files using Python

Last Updated: Jan 21, 2025

By default, Dataphin supports only Python scripts that do not depend on third-party components. To use a third-party component, install it first with pip install. This topic describes how to create a Shell task in Dataphin that installs a third-party Python package and uses it to read an uploaded file.

Prerequisites

  • Add the access address mirrors.aliyun.com and port * to your project's sandbox whitelist. For more information, see Step 2: Connect to an ApsaraDB RDS for MySQL instance.

  • Prepare a file in a format that Python can read, such as TXT, CSV, XLS, XLSX, or PDF.
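Not every format in this list needs a third-party package. As a minimal sketch, assuming UTF-8 encoded input and Python 3, TXT and CSV files can be read with the standard library alone. The function names read_text_lines and read_csv_rows are hypothetical helpers introduced here for illustration and are not part of Dataphin.

```python
import csv


def read_text_lines(path, encoding="utf-8"):
    """Return the lines of a plain-text file without their line endings.

    Hypothetical helper; assumes the file is encoded as `encoding`.
    """
    with open(path, "r", encoding=encoding) as handle:
        return [line.rstrip("\r\n") for line in handle]


def read_csv_rows(path, encoding="utf-8"):
    """Return the rows of a CSV file as lists of strings.

    Hypothetical helper; newline="" lets the csv module handle
    line endings itself, as the csv documentation recommends.
    """
    with open(path, "r", encoding=encoding, newline="") as handle:
        return list(csv.reader(handle))
```

Formats such as XLSX have no reader in the standard library, which is why the task in this topic installs openpyxl with pip.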

Step 1: Upload files

  1. Sign in to the Dataphin console.

  2. In the Dataphin console, select the region of your workspace, and then click Enter Dataphin>>.

  3. Navigate to the Resource Management page.

    1. On the Dataphin home page, click Development.

    2. On the Data Development page, click Data Processing.

    3. In the left-side navigation pane, click the Resource Management icon.

  4. On the Resource Management page, click the New Resource icon.

  5. In the New Resource dialog box, set the necessary parameters.

    Parameter          Description
    Type               Select others.
    Name               The uploaded file's name must include the file type, such as test.xlsx.
    Description        Provide a description for the resource.
    Upload File        Choose a local file to upload, such as test.xlsx.
    Compute Type       Choose No Affiliated Engine.
                       Important: Because file resources are stored within the Dataphin system, only No Affiliated Engine is selectable.
    Select Directory   The default directory is Resource Management.

  6. Click Submit to submit the resource.

  7. In the Submit Remarks dialog box, enter the necessary comments.

  8. Click Confirm And Submit.

Step 2: Create a Shell task

  1. On the Data Processing tab, click the Script Task icon in the left-side navigation pane.

  2. On the Script Task page, click the icon next to Script Task, and select General Script > SHELL.

  3. Create the Shell task.

    1. In the New File dialog box, set the parameters accordingly.

      Parameter          Description
      Name               Enter a name for the compute task, such as 'Python reading files'.
      Schedule Type      Set the task's schedule type to Recurring Task Node.
      Description        Provide a description for the task.
      Select Directory   The system will automatically select Code Management.

    2. Click Confirm.

Step 3: Write and run Shell task code

  1. Write the following code on the code editing page of the task.

    # Create a directory on the Dataphin Linux server, then install openpyxl
    # into it from the Alibaba Cloud PyPI mirror.
    mkdir -p /tmp/chars/ && \
    pip install -i https://mirrors.aliyun.com/pypi/simple/ \
    --target=/tmp/chars/ \
    openpyxl

    # Write the Python source to openfile.py.
    cat >openfile.py <<EOF

    @resource_reference{"test.xlsx"}
    # -*- coding:utf-8 -*-
    import sys
    # Make the packages installed in /tmp/chars/ importable.
    sys.path.append('/tmp/chars/')
    import openpyxl
    print('========= python execute ok ==========')
    print("start===============")
    args = sys.argv
    # Open the Excel file passed as the first command-line argument.
    wb = openpyxl.load_workbook(args[1])
    # Print the first worksheet. wb.get_sheet_names() is deprecated and raises
    # a warning, so use wb.worksheets or wb.sheetnames instead.
    print(wb.worksheets[0])

    EOF
    # Run the script with Python and pass the file name as an argument.
    python openfile.py test.xlsx

    Replace the test.xlsx parameter with the name of your uploaded file.

  2. To run the task code, click Execute in the upper-right corner of the page.

    If the run status is SUCCESS, the file has been read successfully.
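The Python part of the task can be made somewhat more defensive. The following is a sketch under stated assumptions, not the method Dataphin prescribes: it assumes openpyxl was installed into /tmp/chars/ as in the Shell code above, and the names add_package_dir, input_path_from_args, and read_first_sheet are hypothetical helpers introduced for illustration. It validates the command-line argument before opening the workbook and returns the cell values of the first worksheet instead of printing only the worksheet object.

```python
import os
import sys

# Assumption: the directory that pip installed into with --target.
PACKAGE_DIR = "/tmp/chars/"


def add_package_dir(package_dir=PACKAGE_DIR):
    """Append package_dir to sys.path once so packages installed there can be imported."""
    if package_dir not in sys.path:
        sys.path.append(package_dir)
    return package_dir


def input_path_from_args(argv):
    """Return the file path given as the first argument, or raise a clear error."""
    if len(argv) < 2:
        raise ValueError("usage: python openfile.py <file>")
    path = argv[1]
    if not os.path.isfile(path):
        raise IOError("file not found: %s" % path)
    return path


def read_first_sheet(path):
    """Return the title of the first worksheet and its rows as tuples of cell values."""
    add_package_dir()
    # Imported here, after sys.path has been extended, because openpyxl
    # is a third-party package that lives in PACKAGE_DIR.
    import openpyxl

    workbook = openpyxl.load_workbook(path, read_only=True)
    sheet = workbook.worksheets[0]
    rows = list(sheet.iter_rows(values_only=True))
    workbook.close()
    return sheet.title, rows
```

In a script such as openfile.py, these helpers could be combined as read_first_sheet(input_path_from_args(sys.argv)). Opening the workbook with read_only=True is a design choice that keeps memory use low for large files. It can be dropped if the task also needs to modify the workbook.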