Downloading MaxCompute table data to a local machine bypasses MaxCompute's massively parallel processing (MPP) capability, turning the local machine into a bottleneck. For data volumes above 10 MB, process data directly on MaxCompute instead of pulling it locally.
PyODPS DataFrame operations are lazy: they translate into MaxCompute SQL and execute inside the MaxCompute engine. Only "action" methods — those that materialize results — transfer data. Choosing the right action method determines whether data moves to your local machine or stays in MaxCompute.
Choose a download method
| Method | Suitable for | Where data lands |
|---|---|---|
| `head` / `tail` | Small data: inspection and debugging | Local memory |
| `to_pandas` | Small data: local analysis | Local pandas DataFrame |
| `open_reader` | Small-to-medium data: runs on a table or an SQL instance; iterative row-by-row reads without loading the full dataset into memory | Local memory (streaming) |
| `persist` | Large volumes (recommended for production) | MaxCompute table |
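The streaming pattern behind `open_reader` can be sketched as a small helper. This is an illustrative sketch, not part of the PyODPS API: the name `stream_rows` is hypothetical, and `o` stands for an existing `odps.ODPS` entry object (connection details omitted).

```python
def stream_rows(o, sql):
    # Run the SQL in MaxCompute, then iterate the result one record at a
    # time. The full result set is never loaded into local memory.
    with o.execute_sql(sql).open_reader() as reader:
        for record in reader:
            yield record

# Usage (requires a real connection):
# for rec in stream_rows(o, 'SELECT * FROM my_table'):
#     handle(rec)
```

Because the reader is consumed lazily, memory use stays flat no matter how many rows the query returns; only per-row work happens locally.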
For large-scale transformations, use a PyODPS DataFrame or MaxCompute SQL. A PyODPS DataFrame is built on top of a MaxCompute table and pushes computation to the MaxCompute engine. See Execution for details.
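Wrapping a table in a DataFrame is cheap, which is what makes this approach viable at scale. The helper below is a hypothetical sketch (the name `table_to_df` and the table name are assumptions); `o` again stands for an existing `odps.ODPS` entry object.

```python
def table_to_df(o, table_name):
    # Only the table schema is fetched here; no rows are downloaded.
    # Subsequent operations on the returned DataFrame compile to
    # MaxCompute SQL and run server-side until an action is called.
    from odps.df import DataFrame  # requires pyodps to be installed
    return DataFrame(o.get_table(table_name))

# Usage (requires a real connection):
# df = table_to_df(o, 'my_input_table')
```

An expression such as a filter or groupby on `df` builds a query plan; data moves only when an action like `head` or `persist` runs.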
Process JSON data with a PyODPS DataFrame
The following example converts a JSON column into multiple rows, one row per key-value pair. The same function runs locally for testing and in MaxCompute for production — only the call site differs.
Define the transformation function once:
```python
from odps.df import output

@output(['k', 'v'], ['string', 'int'])
def h(row):
    import json  # imported inside the function so the UDF is self-contained when shipped to MaxCompute
    for k, v in json.loads(row.json).items():
        yield k, v
```
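Because the function only needs an object with a `.json` attribute, its logic can also be sanity-checked in plain Python before touching MaxCompute at all. The sketch below drops the `@output` decorator (which is only needed by the PyODPS engine) and uses a stand-in row object:

```python
import json
from types import SimpleNamespace

# Same body as h above, minus the @output decorator, so it runs anywhere.
def h_plain(row):
    for k, v in json.loads(row.json).items():
        yield k, v

row = SimpleNamespace(json='{"a": 1, "b": 2}')
print(list(h_plain(row)))  # [('a', 1), ('b', 2)]
```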
Then call it in the appropriate mode:
```
# Local testing: head downloads data to the local machine.
# Keep the sample size small. Do not use head in production pipelines.
In [12]: df.head(2)
               json
0  {"a": 1, "b": 2}
1  {"c": 4, "b": 3}

In [21]: df.apply(h, axis=1).head(4)
   k  v
0  a  1
1  b  2
2  c  4
3  b  3
```

```
# Production: persist runs the transformation inside MaxCompute
# and writes the result to my_table. No data is transferred locally.
In [21]: df.apply(h, axis=1).persist('my_table')
```
The 10 MB threshold applies to local testing as well. If the data exceeds 10 MB, do not download the full dataset for testing; verify the logic against a small `head` sample, then go directly to `persist`.