In Data Science Workshop (DSW) of Platform for AI (PAI), you can use PyODPS to read data from MaxCompute tables.
Prerequisites
Before you perform the operations that are described in this topic, make sure that the following requirements are met:
MaxCompute is activated. For more information, see Activate MaxCompute and DataWorks.
The account that you use is granted permissions on the MaxCompute project. If you use an Alibaba Cloud account, no authorization is required. If you use a RAM user, grant the RAM user the required permissions on the project before you proceed.
Ensure that you have Python 3.6 or later installed.
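The Python version requirement can be checked from the interpreter itself. The following is a minimal sketch; the helper name meets_minimum and the constant MINIMUM_VERSION are hypothetical and are not part of PyODPS or DSW.

```python
import sys

# Minimum version stated in the prerequisites above.
MINIMUM_VERSION = (3, 6)


def meets_minimum(version_info=None, minimum=MINIMUM_VERSION):
    """Return True if the (major, minor) version is at least `minimum`.

    Hypothetical helper: defaults to the running interpreter's version.
    """
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= tuple(minimum)


# Report whether the current interpreter satisfies the requirement.
if meets_minimum():
    print("Python version OK")
else:
    print("Python 3.6 or later is required")
```

Run the script with the same interpreter that you plan to use for PyODPS, because a DSW instance may have more than one Python installation.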
Procedure
You can use PyODPS to read data from MaxCompute or Machine Learning Designer. For more information, see PyODPS documentation.
Install PyODPS.
In the DSW terminal, run the following command:
    pip install pyodps
Run the following command to verify the installation. The installation is successful if no value or error message is returned.
    # For Windows, use python -c "from odps import ODPS"
    python3 -c "from odps import ODPS"
If the Python version you want to use is not the default version of the system, run the following command to use the required version:
    # /home/tops/bin/python3.7 is the path of the installed Python.
    # The quotation marks prevent the shell from treating ">" as a redirection.
    /home/tops/bin/python3.7 -m pip install 'setuptools>=3.0'
Execute SQL statements to read data from MaxCompute tables.
    import os

    import numpy as np
    import pandas as pd
    from odps import ODPS
    from odps.df import DataFrame

    # Establish a connection.
    o = ODPS(
        os.getenv('ALIBABA_CLOUD_ACCESS_KEY_ID'),
        os.getenv('ALIBABA_CLOUD_ACCESS_KEY_SECRET'),
        project='your-default-project',
        endpoint='your-end-point',
    )

    # Read data from MaxCompute tables.
    sql = '''
    SELECT
        *
    FROM
        your-default-project.<table>
    LIMIT 100
    ;
    '''
    query_job = o.execute_sql(sql)
    result = query_job.open_reader(tunnel=True)
    # Set n_process based on the server configuration. A value greater than 1
    # reads the data with multiple processes in parallel.
    df = result.to_pandas(n_process=1)
Parameters in the preceding code:
ALIBABA_CLOUD_ACCESS_KEY_ID and ALIBABA_CLOUD_ACCESS_KEY_SECRET: The AccessKey ID and AccessKey Secret of your Alibaba Cloud account. We recommend that you set them as environment variables to prevent leakage.
For information about how to obtain an AccessKey pair, see Create an AccessKey pair.
For information about how to set environment variables, see Configure environment variables in Linux, macOS, and Windows.
your-default-project and your-end-point: Replace them with the default project name and endpoint. For more information about the endpoints of each region, see Endpoints.
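Because os.getenv returns None when a variable is not set, a missing AccessKey variable in the preceding code only surfaces later as an authentication failure. The following is a minimal sketch of a fail-fast check that runs before the connection is created. The helper names load_access_key and connect are hypothetical; only the ODPS constructor call mirrors the code above.

```python
import os

# Environment variable names used in the connection code above.
REQUIRED_VARS = (
    "ALIBABA_CLOUD_ACCESS_KEY_ID",
    "ALIBABA_CLOUD_ACCESS_KEY_SECRET",
)


def load_access_key(env=None):
    """Return (access_key_id, access_key_secret), or raise if either is unset.

    Hypothetical helper: `env` defaults to os.environ and can be replaced
    with a plain dict for testing.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(
            "Missing environment variables: " + ", ".join(missing)
        )
    return tuple(env[name] for name in REQUIRED_VARS)


def connect(project, endpoint, env=None):
    """Create the ODPS entry object after the credentials are validated.

    Hypothetical helper: the import is deferred so that the credential
    check can run even where PyODPS is not installed.
    """
    from odps import ODPS

    access_key_id, access_key_secret = load_access_key(env)
    return ODPS(
        access_key_id,
        access_key_secret,
        project=project,
        endpoint=endpoint,
    )
```

With this sketch, connect('your-default-project', 'your-end-point') either returns a connection object or raises an error that names the variable that is not set.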
For information about how to use PyODPS to perform other operations, such as writing data to MaxCompute tables, see Tables.
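The SQL statement in the read step uses placeholders for the project and table names. The following is a minimal sketch of building that statement from variables while rejecting unexpected characters. The helper name build_preview_sql is hypothetical, and the identifier pattern (letters, digits, and underscores) is an assumption; check it against the naming rules of your project.

```python
import re

# Assumption: project and table names contain only letters, digits, and
# underscores, and do not start with a digit.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_preview_sql(project, table, limit=100):
    """Return a SELECT statement that reads at most `limit` rows.

    Hypothetical helper: raises ValueError for names that do not match
    the assumed identifier pattern or for a non-positive limit.
    """
    for name in (project, table):
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError("Invalid identifier: {!r}".format(name))
    if isinstance(limit, bool) or not isinstance(limit, int) or limit <= 0:
        raise ValueError("limit must be a positive integer")
    return "SELECT * FROM {}.{} LIMIT {};".format(project, table, limit)


# The names below are illustrative only.
print(build_preview_sql("my_project", "my_table", 10))
```

The returned string can be passed to o.execute_sql in the same way as the sql variable in the read step.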
References
DSW provides the SQL File feature to help you quickly query data from MaxCompute data sources by using SQL statements. For more information, see Use SQL files to query MaxCompute tables.