This page covers common errors and questions when installing and using PyODPS.
Quick reference
| Error or question | Section |
|---|---|
| "Warning: XXX not installed" | Installation errors |
| "Project Not Found" | Installation errors |
| "Syntax Error" | Installation errors |
| "Permission Denied" on macOS | Installation errors |
| "Operation Not Permitted" on macOS | Installation errors |
| "No Module Named ODPS" | Import errors |
| "Cannot Import Name ODPS" | Import errors |
| "Cannot Import Module odps" | Import errors |
| "ImportError" in IPython or Jupyter Notebook | Import errors |
| o.get_table('table_name').size meaning | Usage questions |
| Configure a Tunnel endpoint | Usage questions |
| Third-party CPython package | Usage questions |
| DataFrame size limit | Usage questions |
| max_pt in DataFrame | Usage questions |
| open_writer() vs write_table() | Usage questions |
| DataWorks node returns fewer rows | Usage questions |
| Get DataFrame row count | Usage questions |
| "sourceIP is not in the white list" | Usage questions |
| options.sql.settings not taking effect | Usage questions |
| "IndexError: list index out of range" | Usage questions |
| "ODPSError: ODPS entrance should be provided" | Usage questions |
| "lifecycle is not specified in mandatory mode" | Usage questions |
| "Perhaps the datastream from server is crushed" | Usage questions |
| "Project is protected" | Usage questions |
| "ConnectionError: timed out try catch exception" | Usage questions |
| "NameError: name 'get_task_cost' is not defined" | Usage questions |
| Chinese characters appear encoded in logs | Usage questions |
| DATETIME field returns as STRING | Usage questions |
| Use Python features in DataFrame | Usage questions |
| Debug locally with Pandas backend | Usage questions |
| Nested loops running slowly | Usage questions |
| Avoid downloading data locally | Usage questions |
| When to download data locally | Usage questions |
| open_reader 10,000-record limit | Usage questions |
| Built-in operators vs UDFs | Usage questions |
| DataFrame().schema.partitions is empty | Usage questions |
| Cartesian product in DataFrame | Usage questions |
| Chinese text segmentation with Jieba | Usage questions |
| Download all data from a table | Usage questions |
| Compute null value percentage | Usage questions |
| Enable new data types | Usage questions |
| "ValueError" | Usage questions |
| SQL queries running slowly | Usage questions |
Installation errors
"Warning: XXX not installed"
Install the missing component using pip. The error message identifies the component name in the XXX part.
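Before reinstalling, it can help to confirm whether the current interpreter can actually find the component. The following is a minimal sketch using only the standard library; the helper name is_installed and the module name requests are illustrative assumptions, and the name you pass to pip may differ from the name you import.

```python
import importlib.util


def is_installed(module_name):
    """Return True if `module_name` can be found by the current interpreter."""
    return importlib.util.find_spec(module_name) is not None


# "requests" is only an illustrative name; substitute the component from the warning.
for name in ("json", "requests"):
    print(name, "installed" if is_installed(name) else "missing")
```

If the component shows as missing after installation, pip may have installed it under a different Python version than the one running your script.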
"Project Not Found"
Check two things:
- Endpoint: The endpoint must point to your target project. See Endpoints.
- Entry object parameters: Verify that the MaxCompute entry object parameters are in the correct positions. See Migrate PyODPS nodes from DataWorks to an on-premises environment.
"Syntax Error"
PyODPS does not support Python 2.5 or earlier. Use Python 2.6, Python 2.7.6 or later, or Python 3.3 or later.
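To check an interpreter against the versions listed above, a small helper like the following may be useful. This is a sketch that encodes only the list on this page; the function name is_supported is hypothetical, and the requirements of the PyODPS release you install may differ.

```python
import sys


def is_supported(version):
    """Check a (major, minor, micro) tuple against the versions listed above."""
    major, minor, micro = version[:3]
    if major == 2:
        # Python 2.6.x, or Python 2.7.6 and later.
        return minor == 6 or (minor == 7 and micro >= 6)
    if major == 3:
        # Python 3.3 and later.
        return minor >= 3
    return False


print(sys.version.split()[0], "supported:", is_supported(sys.version_info))
```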
"Permission Denied" on macOS
Run the installation with sudo:
sudo pip install pyodps
"Operation Not Permitted" on macOS
This is caused by System Integrity Protection (SIP). To disable it:
- Restart your Mac and hold Command (⌘) + R during startup to enter Recovery Mode.
- Open Terminal and run the following two commands:
csrutil disable
reboot
For more details, see Operation Not Permitted when on root - El Capitan (rootless disabled).
Import errors
"No Module Named ODPS"
This most commonly happens when there is a naming conflict in your working directory. Check whether your current directory contains a file named odps.py or __init__.py, or a folder named odps. If so, rename it.
Other causes:
- Conflicting package: If you previously installed a package named odps, remove it with sudo pip uninstall odps.
- Multiple Python versions: More than one Python version is installed. Make sure only one version is active.
- PyODPS not installed: Install it under your current Python version. See Install PyODPS.
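The checks above can be partly automated. The sketch below, which uses only the standard library, looks for the conflicting names in a directory, prints which interpreter is running, and shows where the odps module would be loaded from. The helper name find_shadowing is an assumption made for illustration.

```python
import importlib.util
import sys
from pathlib import Path


def find_shadowing(directory="."):
    """Return names in `directory` that could shadow the installed odps package."""
    base = Path(directory)
    found = []
    for name in ("odps.py", "__init__.py"):
        if (base / name).is_file():
            found.append(name)
    # A folder named odps is also picked up before the installed package.
    if (base / "odps").is_dir():
        found.append("odps")
    return found


print("Interpreter:", sys.executable)
print("Possible conflicts:", find_shadowing())
spec = importlib.util.find_spec("odps")
print("odps resolves to:", spec.origin if spec else "not found")
```

If the reported location is inside your working directory rather than site-packages, a local file or folder is shadowing the installed package.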
"Cannot Import Name ODPS"
A file named odps.py exists in your working directory and shadows the package. Rename or move that file, then retry the import.
"Cannot Import Module odps"
This is usually a dependency issue. Join the PyODPS technical support DingTalk group and contact the group administrator for help.
"ImportError" in IPython or Jupyter Notebook
Add from odps import errors at the top of your code. If the error persists, the IPython dependency may be missing — reinstall Jupyter:
sudo pip install -U jupyter
Usage questions
What does o.get_table('table_name').size return?
The size field returns the physical storage size of the table, not the number of rows.
How do I configure a Tunnel endpoint?
Set options.tunnel.endpoint to your endpoint URL. For all available options, see the aliyun-odps-python-sdk options reference.
How do I use a third-party package that contains CPython?
Generate a wheel package that includes CPython. For an example, see Create a crcmod that can be used in MaxCompute.
Is there a size limit for PyODPS DataFrame?
PyODPS itself has no table size limit. For DataFrames created from local Pandas, the limit is your available local memory.
How do I use max_pt in a DataFrame?
Use the odps.df.func module to call MaxCompute built-in functions:
from odps.df import func
df = o.get_table('your_table').to_df()
df[df.ds == func.max_pt('your_project.your_table')] # ds is a partition column.
What is the difference between open_writer() and write_table()?
Each write_table() call creates a new file on the server. Calling it repeatedly with small datasets generates many files, which reduces query performance and can cause memory issues.
Recommended: Pass all records in a single call or use a Generator object to minimize the number of files created:
# Less efficient: one file created per call
write_table(records_batch_1)
write_table(records_batch_2)
# More efficient: all records in a single call
write_table(all_records)
open_writer() writes to a Block directly and is better suited for streaming writes.
For usage details, see Write data to a table.
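The effect of batching can be illustrated without a MaxCompute connection. In the sketch below, fake_write_table is a hypothetical stand-in that only counts how many times it is called; it is not the PyODPS API. It shows how a lazy iterator lets several batches go through a single call.

```python
from itertools import chain

calls = []


def fake_write_table(records):
    """Hypothetical stand-in for a write call; records how many rows each call received."""
    calls.append(len(list(records)))


batches = [[("a", 1), ("b", 2)], [("c", 3)], [("d", 4), ("e", 5)]]

# One call per batch: three calls, which would mean three files on a real server.
for batch in batches:
    fake_write_table(batch)
per_batch_calls = len(calls)

# One call fed by a lazy iterator: batches are streamed without merging them in memory first.
calls.clear()
fake_write_table(chain.from_iterable(batches))
single_calls = len(calls)

print(per_batch_calls, single_calls)
```

A generator function that yields records one at a time would serve the same purpose as chain.from_iterable here.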
Why does the DataWorks PyODPS node return fewer rows than local execution?
DataWorks does not enable Instance Tunnel by default. Without it, instance.open_reader uses the Result interface, which is capped at 10,000 records.
To retrieve all records, enable Instance Tunnel and remove the limit:
options.tunnel.use_instance_tunnel = True
options.tunnel.limit_instance_tunnel = False # Remove the 10,000-record cap.
with instance.open_reader() as reader:
    for record in reader:
        ...
After enabling Instance Tunnel, use reader.count to get the total record count.
How do I get the row count of a DataFrame?
DataFrame uses lazy execution — operations are not run until you explicitly trigger them. To get an immediate count:
iris = DataFrame(o.get_table('pyodps_iris'))
iris.count().execute()
Calling count() without execute() returns a lazy expression, not a number. For more on lazy execution, see Execution and Aggregation.
"sourceIP is not in the white list"
The MaxCompute project has IP whitelist protection enabled. Contact the project owner to add your IP address to the whitelist. See Manage IP address whitelists.
options.sql.settings not taking effect
The parameter names differ between the client and PyODPS. The client uses odps.stage.mapper.split.size, but PyODPS uses odps.sql.mapper.split.size. These are not the same parameter.
Use the correct parameter name in PyODPS:
from odps import options
options.sql.settings = {'odps.stage.mapper.split.size': 32}
"IndexError: list index out of range" when calling head()
The DataFrame has no rows, or the requested index exceeds the available rows. Check whether the DataFrame is empty before calling head().
"ODPSError: ODPS entrance should be provided" when uploading a Pandas DataFrame
PyODPS cannot find a global MaxCompute entry object. Fix this with one of the following approaches:
- Use the %enter magic command (Room mechanism) to set a global entry automatically.
- Call to_global() on your MaxCompute entry object.
- Pass the entry object explicitly: DataFrame(pd_df).persist('your_table', odps=odps).
"lifecycle is not specified in mandatory mode"
The project requires a lifecycle value for every table. Set it before writing:
from odps import options
options.lifecycle = 7 # Number of days. Must be an integer.
"Perhaps the datastream from server is crushed"
This indicates dirty data. Check that the number of columns in your data matches the target table schema.
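A quick local check before uploading can locate the offending rows. The following is a sketch in plain Python; the column list and sample rows are hypothetical, and find_bad_rows is an illustrative helper rather than part of PyODPS.

```python
def find_bad_rows(rows, expected_columns):
    """Return (index, column_count) for every row whose length differs from the schema."""
    expected = len(expected_columns)
    return [(i, len(row)) for i, row in enumerate(rows) if len(row) != expected]


# Hypothetical schema and data, for illustration only.
columns = ["id", "name", "score"]
rows = [
    [1, "alice", 9.5],
    [2, "bob"],              # missing a column
    [3, "carol", 7.0, "x"],  # one column too many
]
print(find_bad_rows(rows, columns))
```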
"Project is protected"
The project's security policy restricts direct data reads.
- To access all data: Ask the project owner to add an exception rule, or export the data to an unprotected project using DataWorks or another masking tool.
- To preview data: Use o.execute_sql('select * from <table_name>').open_reader() or o.get_table('<table_name>').to_df().
"ConnectionError: timed out try catch exception"
The default connection timeout is 5 seconds, so high network or server latency can cause intermittent failures. Increase the timeout at the top of your script:
from odps import options
options.connect_timeout = 30
If the error occurs on specific machines, sandbox network restrictions may be blocking access. Use a dedicated resource group to run those tasks.
"NameError: name 'get_task_cost' is not defined"
The function name get_task_cost is invalid. Use execute_sql_cost instead.
Chinese characters appear as encoded strings in logs
This only affects Python 2. Use the % format operator when printing strings with Chinese characters:
print("我叫 %s" % ('abc'))
DATETIME field returns as STRING when using open_reader
When options.tunnel.use_instance_tunnel = False, PyODPS calls the legacy Result interface, which returns data in CSV format — so DATETIME values come back as strings.
Enable Instance Tunnel to get correctly typed data:
options.tunnel.use_instance_tunnel = True
How do I use Python language features in PyODPS DataFrame?
PyODPS DataFrame is compatible with standard Python functions and control flow.
Define and reuse functions:
def euclidean_distance(from_x, from_y, to_x, to_y):
    return ((from_x - to_x) ** 2 + (from_y - to_y) ** 2).sqrt()
def manhattan_distance(from_x, from_y, to_x, to_y):
    return (from_x - to_x).abs() + (from_y - to_y).abs()
# Apply to a DataFrame
euclidean_distance(df.from_x, df.from_y, df.to_x, df.to_y).rename('distance')
Use loops and reduce to combine tables:
Instead of writing a UNION ALL statement that spans 30 tables, use Python's reduce:
from functools import reduce  # reduce is not a builtin on Python 3.
table_names = ['table1', ..., 'tableN']
dfs = [o.get_table(tn).to_df() for tn in table_names]
result = reduce(lambda x, y: x.union(y), dfs)
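To see how reduce folds a list of tables into one, the same call can be run on a stand-in object. FakeFrame below is a hypothetical class invented for this illustration; it only mimics a union() method that keeps every row and says nothing about how PyODPS builds its expressions internally.

```python
from functools import reduce


class FakeFrame:
    """Hypothetical stand-in exposing only a union() method that keeps every row."""

    def __init__(self, rows):
        self.rows = list(rows)

    def union(self, other):
        return FakeFrame(self.rows + other.rows)


dfs = [FakeFrame([1, 2]), FakeFrame([3]), FakeFrame([4, 5])]

# reduce applies union pairwise from the left: (dfs[0] + dfs[1]) + dfs[2].
result = reduce(lambda x, y: x.union(y), dfs)
print(result.rows)
```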
How do I debug PyODPS locally using the Pandas backend?
Use a DEBUG flag to switch between local Pandas execution and MaxCompute execution without changing any other code:
df = o.get_table('movielens_ratings').to_df()
DEBUG = True
if DEBUG:
    df = df[:100].to_pandas(wrap=True)
# All subsequent code is unchanged — it runs locally when DEBUG=True.
Set DEBUG = False to run the full job on MaxCompute. For a richer local debugging experience, use MaxCompute Studio.
Nested loops are running slowly
The most common cause is putting df = XXX inside an outer loop. This creates a new DataFrame object on every iteration, which is expensive. Instead, collect results in a dict inside the loop, then build the DataFrame once after the loop completes.
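The pattern can be sketched in plain Python. Here score is a hypothetical function standing in for whatever the loop body computes, and the final rows list represents the data from which a single DataFrame would be built once the loops have finished.

```python
def score(user, item):
    """Hypothetical per-pair computation standing in for the real loop body."""
    return len(user) * item


users = ["ann", "bo"]
items = [1, 2, 3]

results = {}
for user in users:
    for item in items:
        # Only a cheap dict assignment happens inside the loops.
        results[(user, item)] = score(user, item)

# Convert once, after the loops; a single DataFrame would be built from these rows.
rows = [(user, item, value) for (user, item), value in results.items()]
print(len(rows), rows[0])
```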
How do I avoid downloading data locally?
See Use a PyODPS node to download data to a local directory for processing or to process data online.
When is it appropriate to download data locally?
Download data for local processing when the data volume is small.
For large-scale operations — especially expanding one row into multiple rows or applying a Python function row-by-row — keep the computation on MaxCompute using PyODPS DataFrame. For example, to expand a JSON string into key-value rows:
from odps.df import output
@output(['k', 'v'], ['string', 'int'])
def h(row):
    import json
    for k, v in json.loads(row.json).items():
        yield k, v
df.apply(h, axis=1)
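Before submitting the job, the row handler can be exercised locally on a stand-in row. The sketch below omits the @output decorator and uses a namedtuple as a hypothetical replacement for the row object that PyODPS supplies; the sample JSON is invented for illustration.

```python
import json
from collections import namedtuple

# Stand-in for a DataFrame row; the real row object is supplied by PyODPS.
Row = namedtuple("Row", ["json"])


def h(row):
    """Yield one (key, value) pair per entry of the JSON object in row.json."""
    for k, v in json.loads(row.json).items():
        yield k, v


sample = Row(json='{"clicks": 3, "views": 10}')
print(list(h(sample)))
```

Each yielded pair becomes one output row, so a single input row with two JSON keys expands into two rows.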
open_reader only returns 10,000 records — how do I get more?
Save the SQL result as a table, then read from the table:
o.execute_sql('create table result_table as select * from your_table')
o.get_table('result_table').open_reader()
Why use built-in operators instead of UDFs?
Built-in operators run significantly faster than user-defined functions (UDFs). For a job processing millions of rows, applying a UDF can increase execution time from 7 seconds to 27 seconds, and the gap widens on larger datasets.
Why is DataFrame().schema.partitions empty for a partitioned table?
DataFrame treats partition columns the same as regular columns — it does not distinguish between them. To filter by a partition column, query it directly:
df = o.get_table('your_table').to_df()
print(df[df.ds == ''].execute())
For more on partitions and partition-based reads, see Tables.
How do I perform a Cartesian product in PyODPS DataFrame?
How do I segment Chinese text using Jieba in a PyODPS node?
See Use a PyODPS node to segment Chinese text based on Jieba.
How do I download all data from a table?
By default, PyODPS does not limit the amount of data that can be read from an instance. However, for a protected project, downloading data through Tunnel is restricted. In that case, if you do not specify options.tunnel.limit_instance_tunnel, the limit is enabled automatically, and the number of records that can be downloaded is capped by the project configuration, in most cases at 10,000 records at a time. To download all data, enable Instance Tunnel and disable the limit:
options.tunnel.use_instance_tunnel = True
options.tunnel.limit_instance_tunnel = False # Remove the record cap.
with instance.open_reader() as reader:
    for record in reader:
        ...
Can I compute the percentage of null values using execute_sql or DataFrame?
Both work, but DataFrame aggregate operations are generally faster for this type of calculation.
How do I enable new data types in PyODPS?
For a single query, pass the setting as a hint with execute_sql:
o.execute_sql(
    'set odps.sql.type.system.odps2=true; select * from your_table',
    hints={"odps.sql.submit.mode": "script"}
)
For a single DataFrame job (persist, execute, or to_pandas), pass the hint to that call:
from odps.df import DataFrame
users = DataFrame(o.get_table('odps2_test'))
users.persist('copy_test', hints={'odps.sql.type.system.odps2': 'true'})
For all DataFrame jobs in the session, set the global option:
options.sql.use_odps2_extension = True
"ValueError" when using PyODPS
Upgrade the SDK to V0.8.4 or later. If you cannot upgrade, add the following to your script:
from odps.types import Decimal
Decimal._max_precision = 38
SQL queries through PyODPS are running slowly
Slow SQL execution is usually not caused by PyODPS itself. Work through these steps:
1. Check network and server latency
Verify whether your proxy server or network link is adding delay, and check whether the server-side task queue is backed up.
2. Separate task submission from data reading
Combining submission and reading in one call makes it hard to tell where the delay occurs. Split them to measure each phase independently:
# Before: submission and reading combined
with o.execute_sql('select * from your_table').open_reader() as reader:
    for row in reader:
        print(row)
# After: split into separate steps
inst = o.run_sql('select * from your_table')
inst.wait_for_success()
with inst.open_reader() as reader:
    for row in reader:
        print(row)
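Once the phases are split, each one can be timed separately. The following sketch uses a small context manager built on the standard library; the names timed and timings are assumptions, and time.sleep stands in for the real submission and read calls.

```python
import time
from contextlib import contextmanager

timings = {}


@contextmanager
def timed(phase):
    """Record the wall-clock duration of the enclosed block under `phase`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[phase] = time.perf_counter() - start


# time.sleep stands in for the real calls: waiting for the instance in the
# "submit" phase and iterating over the reader in the "read" phase.
with timed("submit"):
    time.sleep(0.02)
with timed("read"):
    time.sleep(0.01)

for phase, seconds in timings.items():
    print(f"{phase}: {seconds:.3f}s")
```

If most of the time falls in the submit phase, the delay is likely on the server side or in the queue; if it falls in the read phase, the network link or result size is the more likely cause.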
3. Check for missing Logview (DataWorks only)
For jobs submitted through DataWorks, confirm that your SQL tasks are generating Logview links. Tasks submitted with execute_sql or run_sql on PyODPS versions below 0.11.6 may fail to generate Logview.
4. Enable debug logging
PyODPS logs all requests and responses when debug logging is enabled. This shows exact timestamps for each stage of the request:
import datetime
import logging
from odps import ODPS
logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
o = ODPS(...) # Fill in your account credentials.
print("Check time:", datetime.datetime.now())
inst = o.run_sql("select * from your_table")
The log output shows when each phase started and how long it took, helping you identify where the delay occurs.