PyODPS is a Python SDK that lets you use MaxCompute for big data processing and analysis directly from your local machine. This topic shows you how to connect to MaxCompute, run SQL queries, use the DataFrame API, and set runtime parameters.
This topic covers local Python environments only. If you are using PyODPS in a DataWorks node, see the DataWorks documentation instead.
Prerequisites
Before you begin, ensure that you have:
- PyODPS installed locally and environment variables set
- The pyodps_iris sample table created in your project. See Usage examples for instructions on how to download the dataset, create the table, and write data to it.
Get started
1. Open a Python editor and create a new `.py` file. If no Python editor is installed, create a file with a `.py` extension using any text editor.
2. Write your PyODPS code. The following sections cover the main features with runnable examples.
3. Save the file and run it with `python <your-file>.py`.
Initialize the ODPS entry point
The ODPS class is the entry point for all PyODPS operations. Initialize it with your credentials and project details before calling any other API.
Set the ALIBABA_CLOUD_ACCESS_KEY_ID and ALIBABA_CLOUD_ACCESS_KEY_SECRET environment variables before running your code. Hardcoding credentials in source files is a security risk. If the environment variables are not set, you can pass the AccessKey ID and AccessKey secret to the ODPS constructor explicitly, but this is not recommended.
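For example, in a bash-compatible shell you can set the variables for the current session. The values below are placeholders; substitute your own AccessKey pair:

```shell
# Set credentials for the current shell session only
export ALIBABA_CLOUD_ACCESS_KEY_ID='<your-access-key-id>'
export ALIBABA_CLOUD_ACCESS_KEY_SECRET='<your-access-key-secret>'
```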
```python
import os
from odps import ODPS

o = ODPS(
    # Read credentials from environment variables
    access_id=os.getenv('ALIBABA_CLOUD_ACCESS_KEY_ID'),
    secret_access_key=os.getenv('ALIBABA_CLOUD_ACCESS_KEY_SECRET'),
    # (Not recommended) To specify credentials explicitly instead:
    # access_id='your-aliyun-access-key-id',
    # secret_access_key='your-aliyun-access-key-secret',
    project='<your-default-project>',
    endpoint='<your-endpoint>',
)
```

Replace the following placeholders:
| Placeholder | Description | Example |
|---|---|---|
| `<your-default-project>` | Name of your MaxCompute project. Log in to the MaxCompute console to get a list of projects. | `my_project` |
| `<your-endpoint>` | Endpoint for your region and network type. | `https://service.cn-hangzhou.maxcompute.aliyun.com/api` |
Select the correct network type for your endpoint. An incorrect network type causes connection failures.
After initialization, the o object supports list, get, exist, create, and delete operations on MaxCompute resources. For a full reference, see Overview of basic operations.
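As a minimal sketch of these operations, the following helper lists table names through `list_tables()`. It assumes an initialized entry point; the `prefix` keyword for filtering by name prefix is an assumption based on the PyODPS API and is worth verifying against your SDK version:

```python
def table_names(o, prefix=None):
    """Return the names of tables in the default project.

    `o` is an initialized odps.ODPS entry point. The `prefix`
    keyword (assumed here) filters tables by name prefix.
    """
    return [t.name for t in o.list_tables(prefix=prefix)]

# Usage with a real entry point:
#   print(table_names(o, prefix='pyodps_'))
#   print(o.exist_table('pyodps_iris'))
```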
Execute SQL
MaxCompute supports two SQL execution modes: traditional mode and MaxCompute Query Acceleration (MCQA). Both support Data Definition Language (DDL) and Data Manipulation Language (DML) statements. MCQA caches job results and returns cached results on repeated runs to speed up execution. For billing details, see Compute costs (pay-as-you-go).
`execute_sql()` and `run_sql()` do not support all SQL statement types. For non-DDL and non-DML statements, use the corresponding method directly. For example, use `create_table()` for CREATE TABLE statements, and use `run_xflow()` or `execute_xflow()` for API commands.
Use the traditional mode to execute SQL commands
Use execute_sql() to run a SQL statement and open_reader() to read the results:
```python
# Create a table using the dedicated method
o.create_table('my_t', 'num bigint, id string', if_not_exists=True)

# Run a SELECT query
result = o.execute_sql('SELECT * FROM pyodps_iris LIMIT 3')
with result.open_reader() as reader:
    for record in reader:
        print(record)
```

For more SQL operations, see SQL.
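`execute_sql()` blocks until the job finishes. PyODPS also provides `run_sql()`, which submits the job and returns an instance object immediately so you can wait and read results later. A minimal sketch, wrapped in a helper and based on the documented instance methods `wait_for_success()` and `open_reader()`:

```python
def run_query(o, sql):
    """Submit SQL asynchronously and return all result records.

    run_sql() returns an instance object without waiting;
    wait_for_success() blocks until the job completes.
    """
    instance = o.run_sql(sql)
    instance.wait_for_success()
    with instance.open_reader() as reader:
        return list(reader)

# Usage:
#   rows = run_query(o, 'SELECT * FROM pyodps_iris LIMIT 3')
```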
DataFrame
PyODPS provides a DataFrame API for data processing. Unlike pandas, a PyODPS DataFrame is lazy — it only runs when you call an immediate execution method such as execute() or persist().
```python
from odps.df import DataFrame

# Load the pyodps_iris table as a DataFrame
iris = DataFrame(o.get_table('pyodps_iris'))

# Filter rows and print results; .execute() triggers the actual computation
for record in iris[iris.sepalwidth < 3].execute():
    print(record)
```

By default, PyODPS does not print execution details such as the Logview URL. To enable verbose output:
```python
from odps import options

options.verbose = True
```

For more DataFrame examples, see DataFrame (Not recommended).
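To illustrate lazy evaluation with an aggregation, the following sketch groups the iris data by species and computes a mean sepal width. The column names match the pyodps_iris sample table; the keyword-argument form of `agg()` is assumed from the PyODPS DataFrame API and should be verified in your environment:

```python
def mean_sepal_width(iris):
    """Build a lazy groupby/aggregate expression and run it.

    Nothing is submitted to MaxCompute until .execute() is called.
    """
    expr = iris.groupby('name').agg(mean_width=iris.sepalwidth.mean())
    return expr.execute()

# Usage:
#   from odps.df import DataFrame
#   iris = DataFrame(o.get_table('pyodps_iris'))
#   print(mean_sepal_width(iris))
```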
Set runtime parameters (hints)
Pass a hints dictionary to any SQL call to set per-query runtime parameters:
```python
o.execute_sql('SELECT * FROM pyodps_iris', hints={'odps.sql.mapper.split.size': 16})
```

To apply hints to every SQL call in the session, configure them globally:

```python
from odps import options

# All subsequent execute_sql() calls include these hints automatically
options.sql.settings = {'odps.sql.mapper.split.size': 16}

o.execute_sql('SELECT * FROM pyodps_iris')
```

Complete example
The following example covers the full workflow: connect to MaxCompute, create a table, write data, read it back, query the pyodps_iris table, and clean up.
This example calls table.drop() at the end, which permanently deletes the table my_new_table. If a table with that name already exists and holds data you want to keep, pass a different name to create_table().
1. Create a file named `test-pyodps-local.py`.
2. Add the following code:

```python
import os
from odps import ODPS

o = ODPS(
    access_id=os.getenv('ALIBABA_CLOUD_ACCESS_KEY_ID'),
    secret_access_key=os.getenv('ALIBABA_CLOUD_ACCESS_KEY_SECRET'),
    project='<your-default-project>',
    endpoint='<your-endpoint>',
)

# Create a non-partitioned table with two columns
table = o.create_table('my_new_table', 'num bigint, id string', if_not_exists=True)

# Write four records to the table
records = [[111, 'aaa'], [222, 'bbb'], [333, 'ccc'], [444, 'Chinese']]
o.write_table(table, records)

# Read back the records
for record in o.read_table(table):
    print(record[0], record[1])

# Query the pyodps_iris table using SQL
result = o.execute_sql('SELECT * FROM pyodps_iris LIMIT 3;',
                       hints={'odps.sql.allow.fullscan': 'true'})

# Drop the table to release resources
table.drop()

print('Read data from the pyodps_iris table using open_reader:')

# Print query results
with result.open_reader() as reader:
    for record in reader:
        print(record[0], record[1])
```

3. Run the script:
```shell
python test-pyodps-local.py
```

The expected output is:

```
111 aaa
222 bbb
333 ccc
444 Chinese
Read data from the pyodps_iris table using open_reader:
4.9 3.0
4.7 3.2
4.6 3.1
```
What's next
- Overview of basic operations — table operations, data upload and download, and more
- DataFrame (Not recommended) — advanced data processing with the DataFrame API
- Use a PyODPS node for Jieba Chinese word segmentation — an end-to-end example in DataWorks