PyODPS is a Python SDK from Alibaba Cloud that allows developers to use MaxCompute for big data processing and analysis. This topic describes how to use PyODPS in a local environment to perform table operations, load data, and run SQL queries.
Prerequisites
PyODPS is installed locally, and the environment variables are set.
The examples in this document demonstrate basic operations using the pyodps_iris table. To prepare this table, see Usage examples for instructions on how to download the dataset, create the pyodps_iris table, and write data to it.
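If you only need a small table to follow along with the examples in this topic, the following is a minimal, hedged sketch that creates a pyodps_iris table with the commonly used iris schema and writes a few sample rows. It assumes the ODPS entry point o that is initialized later in this topic; for the full dataset preparation steps, see Usage examples.
# A minimal sketch, not the official preparation steps. It assumes the entry
# point `o` from the "Initialize the ODPS entry point" section and the commonly
# used iris schema (sepallength, sepalwidth, petallength, petalwidth, name).
o.create_table(
    'pyodps_iris',
    'sepallength double, sepalwidth double, petallength double, petalwidth double, name string',
    if_not_exists=True,
)
o.write_table('pyodps_iris', [
    [4.9, 3.0, 1.4, 0.2, 'Iris-setosa'],
    [4.7, 3.2, 1.3, 0.2, 'Iris-setosa'],
    [4.6, 3.1, 1.5, 0.2, 'Iris-setosa'],
])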
Procedure
Open a Python editor and create a Python file.
Note: If a Python editor is not installed locally, you can create a file with a .py extension.
Develop the PyODPS task code.
After you create the file, you can refer to the following sections for simple examples that demonstrate the main features of PyODPS.
For more information about using PyODPS, see Overview of basic operations and DataFrame (Not recommended). You can also refer to the Use a PyODPS node for Jieba Chinese word segmentation topic for a simple end-to-end operation example.
Save the Python file and run it locally.
Initialize the ODPS entry point
Ensure that the ALIBABA_CLOUD_ACCESS_KEY_ID and ALIBABA_CLOUD_ACCESS_KEY_SECRET environment variables are set.
(Not recommended) If the environment variables are not set, you can explicitly specify the keys. To do this, obtain an access_id and secret_access_key.
Manually define the ODPS entry point. The following code provides an example.
import os
from odps import ODPS
o = ODPS(
    # (Recommended) Make sure that the environment variables are set.
    # Make sure that the ALIBABA_CLOUD_ACCESS_KEY_ID environment variable is set to your AccessKey ID.
    access_id=os.getenv('ALIBABA_CLOUD_ACCESS_KEY_ID'),
    # Make sure that the ALIBABA_CLOUD_ACCESS_KEY_SECRET environment variable is set to your AccessKey secret.
    secret_access_key=os.getenv('ALIBABA_CLOUD_ACCESS_KEY_SECRET'),
    # (Not recommended) If the environment variables are not set, use the following configuration instead.
    # access_id='your-aliyun-access-key-id',
    # secret_access_key='your-aliyun-access-key-secret',
    project='your-default-project',
    endpoint='your-end-point',
)

Parameter | Description
your-default-project | Enter the name of your MaxCompute project. Log on to the MaxCompute console to obtain the list of projects.
your-end-point | Enter the endpoint for your region and network environment. For example, the public endpoint for the China (Hangzhou) region is http://service.cn-hangzhou.maxcompute.aliyun.com/api. Important: Make sure that you select the correct network type. Otherwise, the connection fails.
After you complete these configurations, you can use PyODPS in your local environment to perform basic operations on ODPS objects. These operations include list, get, exist, create, and delete. For more information about using PyODPS, see Overview of basic operations and DataFrame (Not recommended).
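The following is a minimal sketch of these basic object operations, assuming the entry point o from the preceding example; my_demo_table is a hypothetical table name used only for illustration.
# Hedged examples of the exist, get, list, create, and delete operations on tables.
print(o.exist_table('pyodps_iris'))        # exist: check whether a table exists
t = o.get_table('pyodps_iris')             # get: obtain a table object
print(t.name)
for tbl in o.list_tables():                # list: iterate over tables in the project
    print(tbl.name)
    break                                  # stop after the first table
o.create_table('my_demo_table', 'num bigint, id string', if_not_exists=True)  # create
o.delete_table('my_demo_table', if_exists=True)                               # delete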
Execute SQL
Execute SQL commands in a PyODPS node.
In a PyODPS node, you can execute SQL statements in the traditional mode or by using MaxCompute Query Acceleration (MCQA). This feature supports Data Definition Language (DDL) and Data Manipulation Language (DML) statements. The MCQA mode writes job results to a temporary cache. When you run the same query again, MaxCompute returns the cached results to accelerate execution. For more information about the billing rules, see Compute costs (pay-as-you-go). You can select an execution mode based on your requirements.
Note: You cannot use methods such as execute_sql() and run_sql() on the entry point object to run all types of SQL statements. You must use other methods to run non-DDL and non-DML statements. For example, use the create_table method to run a `CREATE TABLE` statement, and use the run_xflow or execute_xflow method to call an API command.
Use the traditional mode to execute SQL commands
You can use execute_sql()/run_sql() to execute SQL commands. The following code provides an example:
Use the create_table method to create a new table.
o.create_table('my_t', 'num bigint, id string', if_not_exists=True)
Use the execute_sql method to execute an SQL query.
result = o.execute_sql('SELECT * FROM pyodps_iris LIMIT 3')
with result.open_reader() as reader:
    for record in reader:
        print(record)
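run_sql() submits the statement asynchronously and returns an instance object instead of blocking until the query finishes. The following is a minimal sketch of this asynchronous pattern, assuming the same pyodps_iris table.
# Hedged sketch: run_sql() returns an instance that you can monitor and wait on.
instance = o.run_sql('SELECT * FROM pyodps_iris LIMIT 3')
print(instance.get_logview_address())  # Logview URL for monitoring the job
instance.wait_for_success()            # block until the job completes successfully
with instance.open_reader() as reader:
    for record in reader:
        print(record)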
Read the SQL execution result in a PyODPS node.
Use open_reader() to read the result of an SQL statement. For more information, see Read SQL execution results.
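If pandas is installed locally, a reader can also be converted to a pandas DataFrame. The following is a hedged sketch that assumes a PyODPS version whose readers provide to_pandas() and that the tunnel endpoint is reachable from your environment.
# Hedged sketch: read an SQL result into a pandas DataFrame.
with o.execute_sql('SELECT * FROM pyodps_iris LIMIT 10').open_reader(tunnel=True) as reader:
    pd_df = reader.to_pandas()
print(pd_df.head())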
For more information about SQL-related operations in PyODPS nodes, see SQL.
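The preceding examples use the traditional mode. If MCQA is enabled for your project and your PyODPS version provides execute_sql_interactive(), the accelerated mode can be invoked as in the following sketch; otherwise, fall back to execute_sql().
# Hedged sketch of the MCQA (query acceleration) mode; execute_sql_interactive()
# may be unavailable in older PyODPS versions or in projects without MCQA enabled.
result = o.execute_sql_interactive('SELECT * FROM pyodps_iris LIMIT 3')
with result.open_reader() as reader:
    for record in reader:
        print(record)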
DataFrame
PyODPS provides a DataFrame API that you can use for data processing. For more examples of DataFrame operations, see DataFrame.
Execution
To execute a DataFrame, you must explicitly call an immediate execution method, such as execute or persist. The following code provides an example.
# Call an immediate execution method to process each record and print all data where iris.sepalwidth is less than 3 in the pyodps_iris table.
from odps.df import DataFrame
iris = DataFrame(o.get_table('pyodps_iris'))
for record in iris[iris.sepalwidth < 3].execute():
    print(record)
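persist is the other immediate execution method mentioned above: it writes the result of a DataFrame expression back to a MaxCompute table. The following sketch assumes that pyodps_iris exists; pyodps_iris_narrow is a hypothetical destination table name.
# Hedged sketch: persist the filtered result to a new table.
from odps.df import DataFrame
iris = DataFrame(o.get_table('pyodps_iris'))
iris[iris.sepalwidth < 3].persist('pyodps_iris_narrow')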
Print detailed information
By default, PyODPS running in a local environment does not print detailed information, such as the Logview URL. To print this information, you must manually set the options.verbose option.
from odps import options
options.verbose = True
Set runtime parameters (hints)
To set runtime parameters for a task, use the hints parameter, which is a dictionary.
o.execute_sql('SELECT * FROM pyodps_iris', hints={'odps.sql.mapper.split.size': 16})
You can also configure global SQL settings. After the configuration is complete, the related runtime parameters are added to each run.
from odps import options
options.sql.settings = {'odps.sql.mapper.split.size': 16}
# Hints are added based on the global configuration.
o.execute_sql('SELECT * FROM pyodps_iris')
Complete example
Create a file named test-pyodps-local.py on your local machine.
Write the sample code.
import os
from odps import ODPS
o = ODPS(
    # (Recommended) Make sure that the environment variables are set.
    # Make sure that the ALIBABA_CLOUD_ACCESS_KEY_ID environment variable is set to your AccessKey ID.
    access_id=os.getenv('ALIBABA_CLOUD_ACCESS_KEY_ID'),
    # Make sure that the ALIBABA_CLOUD_ACCESS_KEY_SECRET environment variable is set to your AccessKey secret.
    secret_access_key=os.getenv('ALIBABA_CLOUD_ACCESS_KEY_SECRET'),
    project='your-default-project',
    endpoint='your-end-point',
)
# Create a non-partitioned table named my_new_table by directly specifying the field names and types.
table = o.create_table('my_new_table', 'num bigint, id string', if_not_exists=True)
# Insert data into the non-partitioned table my_new_table.
records = [[111, 'aaa'], [222, 'bbb'], [333, 'ccc'], [444, 'Chinese']]
o.write_table(table, records)
# Read data from the non-partitioned table my_new_table.
for record in o.read_table(table):
    print(record[0], record[1])
# Read data from the table by running an SQL statement.
result = o.execute_sql('SELECT * FROM pyodps_iris LIMIT 3;', hints={'odps.sql.allow.fullscan': 'true'})
# Drop the table to release resources.
table.drop()
print('Read data from the pyodps_iris table using open_reader:')
# Read the SQL execution result.
with result.open_reader() as reader:
    for record in reader:
        print(record[0], record[1])
Run the Python code.
python test-pyodps-local.py
The output is:
111 aaa
222 bbb
333 ccc
444 Chinese
Read data from the pyodps_iris table using open_reader:
4.9 3.0
4.7 3.2
4.6 3.1