MaxCompute: PyODPS FAQ

Last Updated: Mar 26, 2026

This page covers common errors and questions when installing and using PyODPS.

Quick reference

Error or question | Section
"Warning: XXX not installed" | Installation errors
"Project Not Found" | Installation errors
"Syntax Error" | Installation errors
"Permission Denied" on macOS | Installation errors
"Operation Not Permitted" on macOS | Installation errors
"No Module Named ODPS" | Import errors
"Cannot Import Name ODPS" | Import errors
"Cannot Import Module odps" | Import errors
"ImportError" in IPython or Jupyter Notebook | Import errors
o.get_table('table_name').size meaning | Usage questions
Configure a Tunnel endpoint | Usage questions
Third-party CPython package | Usage questions
DataFrame size limit | Usage questions
max_pt in DataFrame | Usage questions
open_writer() vs write_table() | Usage questions
DataWorks node returns fewer rows | Usage questions
Get DataFrame row count | Usage questions
"sourceIP is not in the white list" | Usage questions
options.sql.settings not taking effect | Usage questions
"IndexError: list index out of range" | Usage questions
"ODPSError: ODPS entrance should be provided" | Usage questions
"lifecycle is not specified in mandatory mode" | Usage questions
"Perhaps the datastream from server is crushed" | Usage questions
"Project is protected" | Usage questions
"ConnectionError: timed out try catch exception" | Usage questions
"NameError: name 'get_task_cost' is not defined" | Usage questions
Chinese characters appear encoded in logs | Usage questions
DATETIME field returns as STRING | Usage questions
Use Python features in DataFrame | Usage questions
Debug locally with Pandas backend | Usage questions
Nested loops running slowly | Usage questions
Avoid downloading data locally | Usage questions
When to download data locally | Usage questions
open_reader 10,000-record limit | Usage questions
Built-in operators vs UDFs | Usage questions
DataFrame().schema.partitions is empty | Usage questions
Cartesian product in DataFrame | Usage questions
Chinese text segmentation with Jieba | Usage questions
Download all data from a table | Usage questions
Compute null value percentage | Usage questions
Enable new data types | Usage questions
"ValueError" | Usage questions
SQL queries running slowly | Usage questions

Installation errors

"Warning: XXX not installed"

Install the missing component using pip. The error message identifies the component name in the XXX part.

"Project Not Found"

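Check two things:

  • Endpoint: The endpoint may be configured incorrectly. Change it to the endpoint of the target project.

  • Entry object parameters: The arguments passed to the MaxCompute entry object may be in the wrong positions. Verify that each argument is in the correct position.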

"Syntax Error"

PyODPS does not support Python 2.5 or earlier. Use Python 2.6, Python 2.7.6 or later, or Python 3.3 or later.
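To confirm which interpreter is actually running, compare sys.version_info against the ranges listed above. The following is a minimal sketch: the helper name is_supported_python is hypothetical (it is not part of PyODPS), and it encodes only the version ranges stated in this section.

```python
import sys


def is_supported_python(version=None):
    """Return True if `version` falls in the ranges listed in this FAQ.

    `version` is a tuple such as (2, 7, 6); it defaults to the running
    interpreter. Illustrative helper, not part of PyODPS.
    """
    v = tuple(version if version is not None else sys.version_info[:3])
    major, minor = v[0], v[1]
    if major == 2:
        if minor == 6:
            return True           # Python 2.6
        return v >= (2, 7, 6)     # Python 2.7.6 or later
    if major == 3:
        return v >= (3, 3)        # Python 3.3 or later
    return False


if __name__ == "__main__":
    print(sys.version.split()[0], "supported:", is_supported_python())
```

Run the script with the same interpreter that you use to install and import PyODPS.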

"Permission Denied" on macOS

Run the installation with sudo:

sudo pip install pyodps

"Operation Not Permitted" on macOS

This is caused by System Integrity Protection (SIP). To disable it:

  1. Restart your Mac and hold Command (⌘) + R during startup to enter Recovery Mode.

  2. Open Terminal and run:

    csrutil disable
    reboot

For more details, see Operation Not Permitted when on root - El Capitan (rootless disabled).

Import errors

"No Module Named ODPS"

This most commonly happens when there is a naming conflict in your working directory. Check whether your current directory contains a file named odps.py or __init__.py, or a folder named odps. If so, rename it.

Other causes:

  • Conflicting package: If you previously installed a package named odps, remove it with sudo pip uninstall odps.

  • Multiple Python versions: More than one Python version is installed. Make sure only one version is active.

  • PyODPS not installed: Install it under your current Python version. See Install PyODPS.

"Cannot Import Name ODPS"

A file named odps.py exists in your working directory and shadows the package. Rename or move that file, then retry the import.

"Cannot Import Module odps"

This is usually a dependency issue. Join the PyODPS technical support DingTalk group and contact the group administrator for help.

"ImportError" in IPython or Jupyter Notebook

Add from odps import errors at the top of your code. If the error persists, the IPython dependency may be missing — reinstall Jupyter:

sudo pip install -U jupyter

Usage questions

What does o.get_table('table_name').size return?

The size field returns the physical storage size of the table, not the number of rows.
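Because the value is a storage size, it is easier to read once converted to larger units. The sketch below assumes the size is reported as a byte count; format_bytes is a hypothetical helper, not part of PyODPS.

```python
def format_bytes(num_bytes):
    """Convert a byte count into a human-readable string using binary units."""
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB"]
    value = float(num_bytes)
    for unit in units:
        # Stop at the first unit that keeps the value below 1024,
        # or at the largest unit available.
        if value < 1024 or unit == units[-1]:
            return "%.1f %s" % (value, unit)
        value /= 1024


# Hypothetical usage with a MaxCompute entry object `o`:
# print(format_bytes(o.get_table('table_name').size))
```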

How do I configure a Tunnel endpoint?

Set options.tunnel.endpoint to your endpoint URL. For all available options, see the aliyun-odps-python-sdk options reference.

How do I use a third-party package that contains CPython?

Generate a wheel package that includes CPython. For an example, see Create a crcmod that can be used in MaxCompute.

Is there a size limit for PyODPS DataFrame?

PyODPS itself has no table size limit. For DataFrames created from local Pandas, the limit is your available local memory.

How do I use max_pt in a DataFrame?

Use the odps.df.func module to call MaxCompute built-in functions:

from odps.df import func
df = o.get_table('your_table').to_df()
df[df.ds == func.max_pt('your_project.your_table')]  # ds is a partition column.

What is the difference between open_writer() and write_table()?

Each write_table() call creates a new file on the server. Calling it repeatedly with small datasets generates many files, which reduces query performance and can cause memory issues.

Recommended: Pass all records in a single call or use a Generator object to minimize the number of files created:

# o is your MaxCompute entry object.
# Less efficient: one file created per call
o.write_table('your_table', records_batch_1)
o.write_table('your_table', records_batch_2)

# More efficient: all records in a single call
o.write_table('your_table', all_records)

open_writer() writes to a Block directly and is better suited for streaming writes.

For usage details, see Write data to a table.

Why does the DataWorks PyODPS node return fewer rows than local execution?

DataWorks does not enable Instance Tunnel by default. Without it, instance.open_reader uses the Result interface, which is capped at 10,000 records.

To retrieve all records, enable Instance Tunnel and remove the limit:

from odps import options
options.tunnel.use_instance_tunnel = True
options.tunnel.limit_instance_tunnel = False  # Remove the 10,000-record cap.

with instance.open_reader() as reader:
    for record in reader:
        ...

After enabling Instance Tunnel, use reader.count to get the total record count.

How do I get the row count of a DataFrame?

DataFrame uses lazy execution — operations are not run until you explicitly trigger them. To get an immediate count:

from odps.df import DataFrame
iris = DataFrame(o.get_table('pyodps_iris'))
iris.count().execute()

Calling count() without execute() returns a lazy expression, not a number. For more on lazy execution, see Execution and Aggregation.

"sourceIP is not in the white list"

The MaxCompute project has IP whitelist protection enabled. Contact the project owner to add your IP address to the whitelist. See Manage IP address whitelists.

options.sql.settings not taking effect

The parameter name set in PyODPS does not match the one the client uses. The client uses odps.stage.mapper.split.size, whereas the setting that has no effect is odps.sql.mapper.split.size. These are not the same parameter.

Use the client's parameter name in PyODPS:

from odps import options
options.sql.settings = {'odps.stage.mapper.split.size': 32}

"IndexError: list index out of range" when calling head()

The DataFrame has no rows, or the requested index exceeds the available rows. Check whether the DataFrame is empty before calling head().
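One way to make that check explicit is a small wrapper that counts rows before calling head(). This is a sketch: safe_head is a hypothetical helper, df is assumed to be a PyODPS DataFrame, and the count itself runs a job because count().execute() triggers execution.

```python
def safe_head(df, n=10):
    """Return df.head(n), or None if the DataFrame has no rows."""
    row_count = df.count().execute()  # Triggers execution to get the row count.
    if row_count == 0:
        return None
    return df.head(n)
```

Callers can then test the result for None instead of catching IndexError.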

"ODPSError: ODPS entrance should be provided" when uploading a Pandas DataFrame

PyODPS cannot find a global MaxCompute entry object. Fix this with one of the following approaches:

  • Use the %enter magic command (Room mechanism) to set a global entry automatically.

  • Call to_global() on your MaxCompute entry object.

  • Pass the entry object explicitly: DataFrame(pd_df).persist('your_table', odps=odps).

"lifecycle is not specified in mandatory mode"

The project requires a lifecycle value for every table. Set it before writing:

from odps import options
options.lifecycle = 7  # Number of days. Must be an integer.

"Perhaps the datastream from server is crushed"

This indicates dirty data. Check that the number of columns in your data matches the target table schema.

"Project is protected"

The project's security policy restricts direct data reads.

  • To access all data: Ask the project owner to add an exception rule, or export the data to an unprotected project using DataWorks or another masking tool.

  • To preview data: Use o.execute_sql('select * from <table_name>').open_reader() or o.get_table('<table_name>').to_df().

"ConnectionError: timed out try catch exception"

The default connection timeout is 5 seconds, which can cause intermittent failures when the network or server has high latency. Increase the timeout at the top of your script:

from odps import options
options.connect_timeout = 30

If the error occurs on specific machines, sandbox network restrictions may be blocking access. Use a dedicated resource group to run those tasks.

"NameError: name 'get_task_cost' is not defined"

This error appears when calling get_sql_task_cost, which is not a valid function name. Use execute_sql_cost instead.

Chinese characters appear as encoded strings in logs

This only affects Python 2. Use the % format operator when printing strings with Chinese characters:

print("我叫 %s" % ('abc'))

DATETIME field returns as STRING when using open_reader

When options.tunnel.use_instance_tunnel = False, PyODPS calls the legacy Result interface, which returns data in CSV format — so DATETIME values come back as strings.

Enable Instance Tunnel to get correctly typed data:

from odps import options
options.tunnel.use_instance_tunnel = True

How do I use Python language features in PyODPS DataFrame?

PyODPS DataFrame is compatible with standard Python functions and control flow.

Define and reuse functions:

def euclidean_distance(from_x, from_y, to_x, to_y):
    return ((from_x - to_x) ** 2 + (from_y - to_y) ** 2).sqrt()

def manhattan_distance(from_x, from_y, to_x, to_y):
    return (from_x - to_x).abs() + (from_y - to_y).abs()

# Apply to a DataFrame
euclidean_distance(df.from_x, df.from_y, df.to_x, df.to_y).rename('distance')

Use loops and reduce to combine tables:

Instead of writing 30-table UNION ALL SQL statements, use Python's reduce:

from functools import reduce  # reduce is not a builtin on Python 3.

table_names = ['table1', ..., 'tableN']
dfs = [o.get_table(tn).to_df() for tn in table_names]
result = reduce(lambda x, y: x.union(y), dfs)

How do I debug PyODPS locally using the Pandas backend?

Use a DEBUG flag to switch between local Pandas execution and MaxCompute execution without changing any other code:

df = o.get_table('movielens_ratings').to_df()
DEBUG = True
if DEBUG:
    df = df[:100].to_pandas(wrap=True)

# All subsequent code is unchanged — it runs locally when DEBUG=True.

Set DEBUG = False to run the full job on MaxCompute. For a richer local debugging experience, use MaxCompute Studio.

Nested loops are running slowly

The most common cause is putting df = XXX inside an outer loop. This creates a new DataFrame object on every iteration, which is expensive. Instead, collect results in a dict inside the loop, then build the DataFrame once after the loop completes.
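The pattern can be sketched as follows. The function names, the placeholder computation, and the column names are all hypothetical, and the final step assumes that the results go into a local pandas DataFrame.

```python
def collect_results(groups, values):
    """Run a nested loop and record one result per outer iteration in a dict."""
    results = {}
    for group in groups:
        total = 0
        for value in values:
            total += group * value  # Placeholder for the real inner computation.
        results[group] = total      # Cheap: no DataFrame is created here.
    return results


def build_frame(results):
    """Build a single pandas DataFrame after the loops have finished."""
    import pandas as pd  # Imported here so the loop has no pandas dependency.
    return pd.DataFrame({"group": list(results.keys()),
                         "total": list(results.values())})


results = collect_results(groups=[1, 2, 3], values=[10, 20])
```

Call build_frame(results) once, after the loops complete, rather than constructing a DataFrame on every iteration.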

How do I avoid downloading data locally?

See Use a PyODPS node to download data to a local directory for processing or to process data online.

When is it appropriate to download data locally?

Download data for local processing when the data volume is small.

For large-scale operations — especially expanding one row into multiple rows or applying a Python function row-by-row — keep the computation on MaxCompute using PyODPS DataFrame. For example, to expand a JSON string into key-value rows:

from odps.df import output

@output(['k', 'v'], ['string', 'int'])
def h(row):
    import json
    for k, v in json.loads(row.json).items():
        yield k, v

df.apply(h, axis=1)

open_reader only returns 10,000 records — how do I get more?

Save the SQL result as a table, then read from the table:

o.execute_sql('create table result_table as select * from your_table')
o.get_table('result_table').open_reader()

Why use built-in operators instead of UDFs?

Built-in operators run significantly faster than user-defined functions (UDFs). For a job processing millions of rows, a UDF can increase execution time from 7 seconds to 27 seconds. For larger datasets, the gap grows even larger.

Why is DataFrame().schema.partitions empty for a partitioned table?

DataFrame treats partition columns the same as regular columns — it does not distinguish between them. To filter by a partition column, query it directly:

df = o.get_table('your_table').to_df()
print(df[df.ds == '<partition_value>'].execute())  # Replace with a real partition value.

For more on partitions and partition-based reads, see Tables.

How do I perform a Cartesian product in PyODPS DataFrame?

See PyODPS DataFrame handling Cartesian product.

How do I segment Chinese text using Jieba in a PyODPS node?

See Use a PyODPS node to segment Chinese text based on Jieba.

How do I download all data from a table?

By default, PyODPS does not limit the amount of data that can be read from an instance. In a protected project, however, downloads through Tunnel are restricted: if options.tunnel.limit_instance_tunnel is not set, the limit is enabled automatically and the number of downloadable records is capped by the project configuration, in most cases at 10,000 records at a time. To download all data, enable Instance Tunnel and disable the limit:

from odps import options
options.tunnel.use_instance_tunnel = True
options.tunnel.limit_instance_tunnel = False  # Remove the record cap.

with instance.open_reader() as reader:
    for record in reader:
        ...

Can I compute the percentage of null values using execute_sql or DataFrame?

Both work, but DataFrame aggregate operations are generally faster for this type of calculation.
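For the execute_sql route, the query can be assembled with a small helper. This is a sketch only: null_percentage_sql is a hypothetical function, the table and column names are placeholders, and the identifiers are inserted into the SQL text without escaping, so pass only trusted names.

```python
def null_percentage_sql(table_name, column_name):
    """Build a query that returns the percentage of NULL values in one column."""
    return (
        "SELECT 100.0 * SUM(CASE WHEN {col} IS NULL THEN 1 ELSE 0 END) / COUNT(*) "
        "AS null_pct FROM {table}"
    ).format(col=column_name, table=table_name)


sql = null_percentage_sql("your_table", "your_column")
# With a MaxCompute entry object `o`, the result could then be read like this:
# with o.execute_sql(sql).open_reader() as reader:
#     for record in reader:
#         print(record[0])
```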

How do I enable new data types in PyODPS?

For a single query, prepend the set statement to the SQL text and submit it in script mode through hints:

o.execute_sql(
    'set odps.sql.type.system.odps2=true; select * from your_table',
    hints={"odps.sql.submit.mode": "script"}
)

For a single DataFrame job (persist, execute, or to_pandas), pass the hint to that call:

from odps.df import DataFrame
users = DataFrame(o.get_table('odps2_test'))
users.persist('copy_test', hints={'odps.sql.type.system.odps2': 'true'})

For all DataFrame jobs in the session, set the global option:

from odps import options
options.sql.use_odps2_extension = True

"ValueError" when using PyODPS

Upgrade the SDK to V0.8.4 or later. If you cannot upgrade, add the following to your script to set the maximum Decimal precision to 38:

from odps.types import Decimal
Decimal._max_precision = 38

SQL queries through PyODPS are running slowly

Slow SQL execution is usually not caused by PyODPS itself. Work through these steps:

1. Check network and server latency

Verify whether your proxy server or network link is adding delay, and check whether the server-side task queue is backed up.

2. Separate task submission from data reading

Combining submission and reading in one call makes it hard to tell where the delay occurs. Split them to measure each phase independently:

# Before: submission and reading combined
with o.execute_sql('select * from your_table').open_reader() as reader:
    for row in reader:
        print(row)

# After: split into separate steps
inst = o.run_sql('select * from your_table')
inst.wait_for_success()
with inst.open_reader() as reader:
    for row in reader:
        print(row)
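Once the phases are separate, each one can be timed. The sketch below uses only the standard library. The timed helper and the timings dict are hypothetical names, and the time.sleep calls are stand-ins for the real submit, wait, and read steps.

```python
import time
from contextlib import contextmanager

timings = {}


@contextmanager
def timed(phase):
    """Record the wall-clock duration of the enclosed block under `phase`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[phase] = time.perf_counter() - start


# Stand-ins for the real calls: replace the sleeps with o.run_sql(...),
# inst.wait_for_success(), and the open_reader() loop respectively.
with timed("submit"):
    time.sleep(0.01)
with timed("wait"):
    time.sleep(0.02)
with timed("read"):
    time.sleep(0.01)

for phase, seconds in timings.items():
    print("%-6s %.3f s" % (phase, seconds))
```

A long "wait" phase points at queuing or execution on the server, while a long "read" phase points at the network or the volume of data being downloaded.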

3. Check for missing Logview (DataWorks only)

For jobs submitted through DataWorks, confirm that your SQL tasks are generating Logview links. Tasks submitted with execute_sql or run_sql on PyODPS versions below 0.11.6 may fail to generate Logview.

4. Enable debug logging

PyODPS logs all requests and responses when debug logging is enabled. This shows exact timestamps for each stage of the request:

import datetime
import logging
from odps import ODPS

logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
o = ODPS(...)  # Fill in your account credentials.
print("Check time:", datetime.datetime.now())
inst = o.run_sql("select * from your_table")

The log output shows when each phase started and how long it took, helping you identify where the delay occurs.