Downloading MaxCompute table data to a local machine bypasses MaxCompute's massively parallel processing (MPP) capability, turning the local machine into a bottleneck. For data volumes above 10 MB, process data directly on MaxCompute instead of pulling it locally.
PyODPS DataFrame operations are lazy: they translate into MaxCompute SQL and execute inside the MaxCompute engine. Only "action" methods — those that materialize results — transfer data. Choosing the right action method determines whether data moves to your local machine or stays in MaxCompute.
Choose a download method
| Method | Suitable for | Where data lands |
|---|---|---|
| `head` / `tail` | Small data: inspection and debugging | Local memory |
| `to_pandas` | Small data: local analysis | Local pandas DataFrame |
| `open_reader` | Small-to-medium data: runs on a table or an SQL instance; iterative row-by-row reads without loading the full dataset into memory | Local memory (streaming) |
| `persist` | Large volumes (recommended for production) | MaxCompute table |
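The streaming pattern behind `open_reader` can be sketched as a small helper. This is an illustrative sketch, not part of the PyODPS API: the name `stream_rows` is hypothetical, and `o` stands for an existing `odps.ODPS` entry object (connection details omitted).

```python
def stream_rows(o, sql):
    # Run the SQL in MaxCompute, then iterate the result one record at a
    # time. The full result set is never loaded into local memory.
    with o.execute_sql(sql).open_reader() as reader:
        for record in reader:
            yield record

# Usage (requires a real connection):
# for rec in stream_rows(o, 'SELECT * FROM my_table'):
#     handle(rec)
```

Because the reader is consumed lazily, memory use stays flat no matter how many rows the query returns; only per-row work happens locally.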
For large-scale transformations, use a PyODPS DataFrame or MaxCompute SQL. A PyODPS DataFrame is built on top of a MaxCompute table and pushes computation to the MaxCompute engine. See Execution for details.
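Wrapping a table in a DataFrame is cheap, which is what makes this approach viable at scale. The helper below is a hypothetical sketch (the name `table_to_df` and the table name are assumptions); `o` again stands for an existing `odps.ODPS` entry object.

```python
def table_to_df(o, table_name):
    # Only the table schema is fetched here; no rows are downloaded.
    # Subsequent operations on the returned DataFrame compile to
    # MaxCompute SQL and run server-side until an action is called.
    from odps.df import DataFrame  # requires pyodps to be installed
    return DataFrame(o.get_table(table_name))

# Usage (requires a real connection):
# df = table_to_df(o, 'my_input_table')
```

An expression such as a filter or groupby on `df` builds a query plan; data moves only when an action like `head` or `persist` runs.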
Process JSON data with a PyODPS DataFrame
The following example converts a JSON column into multiple rows, one row per key-value pair. The same function runs locally for testing and in MaxCompute for production — only the call site differs.
Define the transformation function once:
```python
from odps.df import output

@output(['k', 'v'], ['string', 'int'])
def h(row):
    import json  # imported inside the function so the UDF is self-contained when shipped to MaxCompute
    for k, v in json.loads(row.json).items():
        yield k, v
```
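Because the function only needs an object with a `.json` attribute, its logic can also be sanity-checked in plain Python before touching MaxCompute at all. The sketch below drops the `@output` decorator (which is only needed by the PyODPS engine) and uses a stand-in row object:

```python
import json
from types import SimpleNamespace

# Same body as h above, minus the @output decorator, so it runs anywhere.
def h_plain(row):
    for k, v in json.loads(row.json).items():
        yield k, v

row = SimpleNamespace(json='{"a": 1, "b": 2}')
print(list(h_plain(row)))  # [('a', 1), ('b', 2)]
```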
Then call it in the appropriate mode:
```
# Local testing: head downloads data to the local machine.
# Keep the sample size small. Do not use head in production pipelines.
In [12]: df.head(2)
               json
0  {"a": 1, "b": 2}
1  {"c": 4, "b": 3}

In [21]: df.apply(h, axis=1).head(4)
   k  v
0  a  1
1  b  2
2  c  4
3  b  3
```

```
# Production: persist runs the transformation inside MaxCompute
# and writes the result to my_table. No data is transferred locally.
In [21]: df.apply(h, axis=1).persist('my_table')
```
The 10 MB threshold applies to local testing as well. If the data exceeds 10 MB, do not download the full dataset for testing; verify the logic against a small `head` sample, then go directly to `persist`.