PyODPS is MaxCompute SDK for Python. It provides easy-to-use Python programming interfaces. Similar to pandas, PyODPS provides fast, flexible, and expressive data structures. You can use the data processing feature of PyODPS, which is similar to that of pandas, by calling the DataFrame API provided by PyODPS. This topic describes how to use PyODPS in your projects.
Prerequisites
- MaxCompute is activated. For more information, see Activate MaxCompute.
- DataWorks is activated, and a workspace is created. For more information, see Create a project.
Background information
- Interface connection
An external service must provide the required information to complete authentication before the service can access a data record in MaxCompute tables by using HTTP interfaces.
- Asynchronous calls
In most cases, the system creates a node for each of thousands of tasks with similar data processing logic. These nodes are difficult to manage and occupy excessive resources at the same time. PyODPS allows you to use queues to asynchronously run SQL tasks at a high concurrency. Queues also help you manage all nodes in a unified manner.
- UDF development
If the built-in functions of MaxCompute cannot meet your requirement, you can develop UDFs. For example, you can develop a UDF to format data of the DECIMAL type with thousands separators.
- SQL performance improvement
To check whether database transactions follow the first in, first out (FIFO) rule, you must compare each new record with historical records in the only transaction table. This transaction table generally contains a large number of records. If you use SQL statements to complete the comparison, the time complexity is O(N²). In addition, the SQL statements may fail to return the expected result. To resolve this issue, you can use the cache method in Python code to traverse records in the transaction table only once. In this case, the time complexity is O(N), and the efficiency is improved significantly.
- Other scenarios
You can use Python to develop an amount allocation model. For example, you can develop a model to allocate USD 10 to three persons. In this model, you must define the logic for handling the cash over and short.
- Quick start
- Use a PyODPS node to read data from a partitioned table
- Use a PyODPS node to pass parameters
- Reference a third-party package in a PyODPS node
- Use a PyODPS node to read data from the level-1 partition of the specified table
- Use a PyODPS node to query data based on specific criteria
- Use a PyODPS node to perform sequence operations