All Products
Search
Document Center

Platform For AI:Read data from MaxCompute tables

Last Updated:Feb 05, 2024

This topic describes how to use PyODPS and SQL File to read data from MaxCompute tables.

Prerequisites

Before you perform the operations that are described in this topic, make sure that the following requirements are met:

  • MaxCompute is activated. For more information, see Activate MaxCompute and DataWorks.

  • Optional: A Resource Access Management (RAM) role that is used to log on to the Machine Learning Platform for AI (PAI) console is created. For more information, see Log on to the PAI console through SSO and use PAI.

  • Optional: The RAM role is granted permissions on MaxCompute projects. For more information, see Add workspace members and assign roles to them.

  • The version of Python meets requirements. We recommend that you use Python 3.6 or later. Python 2.7 or earlier is not recommended.

  • The credentials for identity authentication are configured. You can use environment variables or local files to configure credentials. For more information, see Configure credentials.

PyODPS

You can use PyODPS to read data from MaxCompute or Machine Learning Designer. PyODPS is an SDK for Python provided by Alibaba Cloud. For more information, see PyODPS: ODPS Python SDK and data analysis framework.

  1. Install PyODPS.

    On the Terminal interface of Data Science Workshop (DSW), run the following command:

    pip install pyodps
  2. Run the following command to check whether the installation is successful: If no result is returned and no error is reported, the installation is successful.

    python -c "from odps import ODPS"
  3. If the Python version is not the default version, run the following command to switch to the default version after pip is installed:

    /home/tops/bin/python3.7 -m pip install setuptools>=3.0
    #/home/tops/bin/python3.7 is the directory in which Python is installed.
  4. Execute SQL statements to read data from MaxCompute tables. The following code provides an example:

    import numpy as np
    import pandas as pd
    import os
    
    from odps import ODPS
    from odps.df import DataFrame
    # Establish a connection. 
    o = ODPS(
        os.getenv('ALIBABA_CLOUD_ACCESS_KEY_ID'),
        os.getenv('ALIBABA_CLOUD_ACCESS_KEY_SECRET'),
        project='your-default-project',
        endpoint='your-end-point',
    )
    
    # Read data from MaxCompute tables. 
    sql = '''
    SELECT  
        *
    FROM
        your-default-project.<table>
    LIMIT 100
    ;
    '''
    query_job = o.execute_sql(sql)
    result = query_job.open_reader(tunnel=True)
    df = result.to_pandas(n_process=1) # You can configure the n_process parameter based on the server configuration. If you set the n_process parameter to a value greater than 1, multiple threads are used to accelerate data reading.

    Parameters:

    • ALIBABA_CLOUD_ACCESS_KEY_ID and ALIBABA_CLOUD_ACCESS_KEY_SECRET: Set the two environment variables to the AccessKey ID and AccessKey secret of your Alibaba Cloud account separately.

      Note

      We recommend that you use the environment variables rather than the AccessKey ID and AccessKey secret.

    • your-default-project and your-end-point: Replace them with the default project name and endpoint. For more information about the endpoints of each region, see Endpoints.

    For more information about how to use PyODPS to perform other operations such as data writes on MaxCompute tables, see Tables.

SQL File

DSW provides the SQL File feature to let you quickly query data from MaxCompute data sources on which you have read permissions. The following section describes the procedure.

Note

When you execute SQL statements to query data, you are charged for the consumed computing and storage resources in MaxCompute. DSW does not charge additional fees.

  1. Go to the development environment of Data Science Workshop (DSW).

    1. Log on to the PAI console.

    2. In the left-side navigation pane, click Workspaces. On the Workspace list page, click the name of the workspace that you want to manage.

    3. In the upper-left corner of the page, select the region where you want to use the service.

    4. In the left-side navigation pane, choose Model Training > Notebook Service (DSW).

    5. Optional: On the Interactive Modeling (DSW) page, enter the name of a DSW instance or a keyword in the search box to search for the DSW instance.

    6. Find the DSW instance and click Launch in the Actions column.

  2. On the Launcher interface, click SQL File in the Text and terminal section to create an SQL query file.

    SQL File

  3. Configure a MaxCompute data source

    1. In the upper-right corner of the main workbench, click the Data Source icon as shown in the following figure.

    2. In the Data Source Settings panel, configure the required parameters and click Select. The following table describes the parameters.

      Parameter

      Description

      MaxCompute Endpoint

      The endpoint of the MaxCompute project. For more information, see Endpoints. For example, if your MaxCompute project is deployed in the China (Hangzhou) region, the endpoint is http://service.cn-hangzhou.maxcompute.aliyun.com/api.

      Tunnel Endpoint

      The endpoint of the Tunnel service of the MaxCompute project. For more information, see Endpoints. For example, if your MaxCompute project is deployed in the China (Hangzhou) region, the endpoint is http://dt.cn-hangzhou.maxcompute.aliyun.com.

      MaxCompute Project

      The name of the MaxCompute project.

  4. Execute statements.

    In the SQL editor, write the statements that you want to execute to query the data of the MaxCompute data source. Then, click the Run icon in the toolbar and view the query results.

    • Execute statements: You can click the Run icon on the left side of a statement to execute the statement. You can also select the statements that you want to execute and click the 运行 icon in the toolbar. After you execute a statement, you can view the execution results.

    • Format statements: You can click the Format icon in the toolbar to write your statements in the standard format.

  5. View the query results.

    After you execute the SQL statements, you can perform various operations on the query results. The following table describes the operations.

    Operation

    Illustration

    Description

    View the query results

    表格

    You can view the query results of SQL statements.

    Visualize the query results

    可视化图表展示

    You can view the query results in line charts, column charts, bar charts, pie charts, and cross tables. You can drag fields to the X-axis and Y-axis to create charts and tables. The fields on the X-axis are dimensions, and the fields on the Y-axis are metrics. You can click a field to modify the field type and specify an alias for the field.

    View the SQL statements that correspond to a query result

    SQL

    You can view the SQL statements that correspond to a query result. You can also copy and paste the SQL statements to the code editor.

    View the operation logs that are generated during the query process

    操作日志

    You can view the operation logs that are generated for the query statements. If you fail to execute the SQL statements, you can click the logview link in the generated logs to troubleshoot the issue.