This topic provides answers to some frequently asked questions about PyODPS.
What do I do if the error message "Warning: XXX not installed" appears when I install PyODPS?
A required component is not installed when you install PyODPS. Identify the name of the missing component based on XXX in the error message and run the pip install command to install the component. For example, if the warning indicates that pandas is not installed, run the pip install pandas command.
What do I do if the error message "Project Not Found" appears when I install PyODPS?
- The endpoint that you configured is invalid. Change it to the endpoint of the MaxCompute project that you want to access. For more information about the endpoints of MaxCompute, see Endpoints.
- The parameters for the MaxCompute entry object are passed in the wrong positions. Check the positions of the parameters and make sure that you pass valid values, as shown in the sketch after this list. For more information about the parameters for a MaxCompute entry object, see Migrate PyODPS nodes from a data development platform to a local PyODPS environment.
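The following minimal sketch shows how to create a MaxCompute entry object with the parameters in the expected positions. The credential, project, and endpoint values are placeholders that you must replace with your own:
from odps import ODPS

# Placeholders: replace with your own AccessKey pair, project name, and endpoint.
o = ODPS(
    '<access_key_id>',
    '<access_key_secret>',
    project='<project_name>',
    endpoint='<endpoint>',
)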
What do I do if the error message "Syntax Error" appears when I install PyODPS?
The Python version is outdated. Python 2.5 and earlier are not supported. We recommend that you use a version that PyODPS supports, such as Python 2.7.6 or a later minor version, Python 3.3 or a later minor version, or Python 2.6.
What do I do if the error message "Permission Denied" appears when I install PyODPS in macOS?
Run the sudo pip install pyodps command to install PyODPS in macOS.
What do I do if the error message "Operation Not Permitted" appears when I install PyODPS in macOS?
This error is typically caused by System Integrity Protection (SIP) in macOS. Restart the device in recovery mode and run the following commands to disable SIP:
csrutil disable
reboot
For more information, see Operation Not Permitted when on root - El Capitan (rootless disabled).
What do I do if the error message "No Module Named ODPS" appears when I run the from odps import ODPS code?
- Cause 1: Multiple Python versions are installed.
Solution: In the current directory, search for folders that are named odps and contain an odps.py or __init__.py file.
  - If such a folder exists, rename the folder.
  - If you installed a Python package that is named odps, run the pip uninstall odps command to remove the package.
- Cause 2: Both Python 2 and Python 3 are installed.
Solution: Make sure that only Python 2 or Python 3 is installed on your on-premises machine.
- Cause 3: PyODPS is not installed in the Python version that you use.
Solution: Install PyODPS in the Python version that you use. For more information about how to install PyODPS, see Installation guide and limits.
What do I do if the error message "Cannot Import Name ODPS" appears when I run the from odps import ODPS code?
Check whether a file that is named odps.py exists in the current working path. If the file exists, rename the file and run the from odps import ODPS code again.
What do I do if the error message "Cannot Import Module odps" appears when I run the from odps import ODPS code?
Dependency issues occur in PyODPS. Join the DingTalk group of PyODPS technical support to fix the dependency issues. The group ID is 11701793.
What do I do if the error message "ImportError" appears when I use PyODPS in IPython or Jupyter Notebook?
Add from odps import errors to the header of your code.
If the error message still appears, the IPython component is not installed. Run the pip install -U ipython command to install the IPython component.
What does the size field represent in o.get_table('table_name').size?
The size field specifies the physical size of a table.
How do I configure a Tunnel endpoint?
Configure the Tunnel endpoint by using the options.tunnel.endpoint parameter. For more information about the options.tunnel.endpoint parameter, see aliyun-odps-python-sdk.
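The following minimal sketch shows the setting. The endpoint value is a placeholder that you must replace with the Tunnel endpoint of your region:
from odps import options

# Placeholder: replace with the Tunnel endpoint of your region.
options.tunnel.endpoint = '<tunnel_endpoint>'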
How do I use a third-party package that contains CPython in PyODPS?
We recommend that you generate a wheel package that contains CPython. For more information, see Create a crcmod that can be used in MaxCompute.
What is the maximum amount of data that can be processed by a DataFrame in PyODPS? Is the size of a table limited?
The size of a table is not limited in PyODPS. The size of the DataFrame that is created by Pandas is limited by the size of the local memory.
How do I use max_pt in a DataFrame?
Use the odps.df.func module to call built-in functions of MaxCompute:
from odps.df import func

df = o.get_table('your_table').to_df()
df[df.ds == func.max_pt('your_project.your_table')]  # ds is a partition field.
What is the difference between the open_writer() and write_table() methods when you use PyODPS to write data to a table?
Each time you call the write_table() method, MaxCompute generates a file on the server. This operation is time-consuming. If a large number of files are generated, the efficiency of subsequent queries decreases and the memory of the server may be insufficient. We recommend that you write multiple records at the same time or provide a generator object when you use the write_table() method, as shown in the sketch after this paragraph.
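The following minimal sketch writes multiple records in a single write_table() call. The table name and records are hypothetical:
# Hypothetical table with two columns; replace with your own table and data.
records = [[111, 'aaa'], [222, 'bbb'], [333, 'ccc']]
o.write_table('my_new_table', records)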
By default, data is written to blocks when you use the open_writer() method.
Why is the amount of data that is queried on a DataWorks PyODPS node less than the amount of data that is returned in local mode?
By default, Instance Tunnel is disabled in DataWorks. In this case, instance.open_reader is run by using the Result interface, and a maximum of 10,000 data records can be read.
After Instance Tunnel is enabled, you can execute reader.count to obtain the number of data records. If you need to iteratively obtain all data, you must set options.tunnel.limit_instance_tunnel to False to remove the limit.
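The following minimal sketch shows the settings, assuming that instance refers to an instance that you created earlier:
from odps import options

options.tunnel.use_instance_tunnel = True     # Enable Instance Tunnel.
options.tunnel.limit_instance_tunnel = False  # Remove the record limit.

with instance.open_reader() as reader:
    print(reader.count)  # Total number of records in the result.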
How do I call the count function to obtain the total number of rows in a DataFrame?
- After you install PyODPS, run the following commands in the Python environment to initialize a DataFrame from an existing MaxCompute table:
from odps.df import DataFrame
iris = DataFrame(o.get_table('pyodps_iris'))
- Call the count function to compute the total number of rows in the DataFrame.
iris.count()
- DataFrame API operations are not executed immediately. They are executed only when you explicitly call the execute method or a method that is executed immediately. To prevent delayed execution of the count function, run the following command:
iris.count().execute()
For more information about how to obtain the total number of rows in a DataFrame, see Aggregation. For more information about the delayed execution of PyODPS methods, see Execution.
What do I do if the error message "sourceIP is not in the white list" appears when I use PyODPS?
An IP address whitelist is configured for the MaxCompute project that is accessed by PyODPS. Contact the project owner to add the IP address of your device to the IP address whitelist. For more information about how to configure IP address whitelists, see Manage IP address whitelists.
What do I do if I fail to configure the runtime environment of MaxCompute by using options.sql.settings?
- Problem description
When PyODPS is used to execute SQL statements, the following code is used to configure the runtime environment of MaxCompute before the request for a MaxCompute instance is sent:
from odps import options
options.sql.settings = {'odps.sql.mapper.split.size': 32}
Only six mappers are enabled after the job is run, and the settings do not take effect. If you run the set odps.stage.mapper.split.size=32 command on the MaxCompute client, the job is successfully run in one minute.
- Cause
The parameter names on the MaxCompute client and in PyODPS are different. The parameter name on the MaxCompute client is odps.stage.mapper.split.size, whereas the parameter name that was set in PyODPS is odps.sql.mapper.split.size.
- Solution
Change the parameter in PyODPS to odps.stage.mapper.split.size, as shown in the sketch after this list.
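The following minimal sketch shows the corrected configuration:
from odps import options

# Use the same parameter name as the MaxCompute client.
options.sql.settings = {'odps.stage.mapper.split.size': 32}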
Why does the error message "IndexError:listindexoutofrange" appears when I call the head method of a DataFrame?
No elements exist in list[index]
or the number of elements in list[index]
exceeds the upper limit.
What do I do if the error message "ODPSError" appears when I upload a Pandas DataFrame to MaxCompute?
- Problem description
The following error message appears when a Pandas DataFrame is uploaded to MaxCompute:
ODPSError: ODPS entrance should be provided.
- Cause
A global MaxCompute object is not found.
- Solution
Use one of the following methods to configure the entry object:
  - If you use the room mechanism %enter, configure the global MaxCompute object.
  - Call the to_global method to configure the global MaxCompute object, as shown in the sketch after this list.
  - Pass the entry object by using the odps parameter: DataFrame(pd_df).persist('your_table', odps=odps).
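The following minimal sketch shows the to_global method. The credential, project, and endpoint values are placeholders:
from odps import ODPS

# Placeholders: replace with your own AccessKey pair, project name, and endpoint.
o = ODPS('<access_key_id>', '<access_key_secret>', '<project_name>', endpoint='<endpoint>')
o.to_global()  # Subsequent DataFrame operations can find this global entry object.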
What do I do if the error message "lifecycle is not specified in mandatory mode" appears when I use a DataFrame to write data to a table?
- Problem description
The following error message appears when a DataFrame is used to write data to a table:
table lifecycle is not specified in mandatory mode
- Cause
You have not configured a lifecycle for the table.
- Solution
You must configure a lifecycle for each table in a project. Therefore, configure the following information each time you write data to a table:
from odps import options
options.lifecycle = 7  # Configure a lifecycle. The value must be an integer. Unit: days.
What do I do if the error message "Perhaps the datastream from server is crushed" appears when I use PyODPS to write data to a table?
This error is caused by dirty data. Check whether the number of columns in your data matches the number of columns of the table to which you want to write the data.
What do I do if the error message "Project is protected" appears when I use PyODPS to read data from a table?
- Contact the project owner to add an exception policy.
- Use DataWorks or other masking tools to mask the data and export the data to a project for which data protection is not enabled before you read the data.
- Use o.execute_sql('select * from <table_name>').open_reader().
- Use a DataFrame: o.get_table('<table_name>').to_df().
What do I do if the error message "ConnectionError: timed out try catch exception" occasionally appears when I run a PyODPS script task?
- The connection timed out. The default timeout period for PyODPS is 5 seconds. Use one of the following solutions to address the issue:
  - Add the following code to the header of your code to increase the timeout period:
    from odps import options
    options.connect_timeout = 30
  - Capture exceptions and retry, as shown in the sketch after this list.
- Some machines are not allowed to access network services due to sandbox limits. We recommend that you use exclusive resource groups for scheduling to run the script task to address this issue.
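The following minimal sketch shows the retry approach. The query and the retry policy are hypothetical:
import time

for attempt in range(3):  # Retry up to three times.
    try:
        instance = o.execute_sql('select 1')  # Hypothetical query.
        break
    except Exception:  # Hedge: narrow this to the timeout exception that your environment raises.
        time.sleep(5)  # Wait before the next attempt.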
What do I do if the error message "is not defined" appears when I use PyODPS to execute the get_sql_task_cost function?
- Problem description
The following error message appears when PyODPS is used to execute the get_sql_task_cost function:
NameError: name 'get_task_cost' is not defined.
- Cause
The name of the function is invalid.
- Solution
Use the execute_sql_cost function instead of the get_sql_task_cost function.
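The following minimal sketch shows the execute_sql_cost function. The query is hypothetical, and the printed attributes are assumptions about the returned cost object:
# Hypothetical query; replace with your own SQL statement.
cost = o.execute_sql_cost('select * from your_table')
print(cost.complexity, cost.udf_num, cost.input_size)  # Assumed attributes of the cost object.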
If I set options.tunnel.use_instance_tunnel to False, fields of the DATETIME type are defined in MaxCompute, but the data that is obtained by using the SELECT statements is of the STRING type. Why?
By default, PyODPS calls the old Result interface when you call open_reader. In this case, the data that is obtained from the server is in the CSV format, and data of the DATETIME type is converted into the STRING type.
To address this issue, enable Instance Tunnel by setting options.tunnel.use_instance_tunnel to True. This way, PyODPS calls Instance Tunnel by default.
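A minimal sketch of the setting:
from odps import options

options.tunnel.use_instance_tunnel = True  # Read results through Instance Tunnel so that DATETIME values keep their type.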
How do I use Python to achieve various purposes in PyODPS?
- Write Python functions.
You can use multiple methods to compute the distance between two points, such as the Euclidean distance and the Manhattan distance. You can also write a series of functions and call the functions when you compute data based on your business requirements.
def euclidean_distance(from_x, from_y, to_x, to_y):
    return ((from_x - to_x) ** 2 + (from_y - to_y) ** 2).sqrt()

def manhattan_distance(from_x, from_y, to_x, to_y):
    return (from_x - to_x).abs() + (from_y - to_y).abs()
The following sample code shows how to call a function that you write:

In [42]: df
     from_x    from_y      to_x      to_y
0  0.393094  0.427736  0.463035  0.105007
1  0.629571  0.364047  0.972390  0.081533
2  0.460626  0.530383  0.443177  0.706774
3  0.647776  0.192169  0.244621  0.447979
4  0.846044  0.153819  0.873813  0.257627
5  0.702269  0.363977  0.440960  0.639756
6  0.596976  0.978124  0.669283  0.936233
7  0.376831  0.461660  0.707208  0.216863
8  0.632239  0.519418  0.881574  0.972641
9  0.071466  0.294414  0.012949  0.368514

In [43]: euclidean_distance(df.from_x, df.from_y, df.to_x, df.to_y).rename('distance')
   distance
0  0.330221
1  0.444229
2  0.177253
3  0.477465
4  0.107458
5  0.379916
6  0.083565
7  0.411187
8  0.517280
9  0.094420

In [44]: manhattan_distance(df.from_x, df.from_y, df.to_x, df.to_y).rename('distance')
   distance
0  0.392670
1  0.625334
2  0.193841
3  0.658966
4  0.131577
5  0.537088
6  0.114198
7  0.575175
8  0.702558
9  0.132617
- Use conditions and loop statements in Python.
If the tables that you want to compute are stored in a database, you must process the fields of the tables based on configurations and perform UNION or JOIN operations on the tables. If you use SQL statements to perform these operations, the process is complex. We recommend that you use a DataFrame to perform the operations.
For example, if you want to merge 30 tables into one table, you must perform the UNION ALL operation on the 30 tables if you use SQL statements. If you use PyODPS, run the following code:
from functools import reduce  # Required in Python 3.

table_names = ['table1', ..., 'tableN']
dfs = [o.get_table(tn).to_df() for tn in table_names]
reduce(lambda x, y: x.union(y), dfs)

## The reduce statement is equivalent to the following code:
df = dfs[0]
for other_df in dfs[1:]:
    df = df.union(other_df)
How do I use the Pandas DataFrame backend to debug local PyODPS programs?
- A PyODPS DataFrame that is created by using the Pandas DataFrame can use Pandas to debug local PyODPS programs.
- A DataFrame that is created by using MaxCompute tables can be executed in MaxCompute.
df = o.get_table('movielens_ratings').to_df()
DEBUG = True
if DEBUG:
    df = df[:100].to_pandas(wrap=True)
All subsequent code is then executed locally, which makes debugging fast. After the debugging is complete, you can change the value of the DEBUG parameter to False. Then, all data is computed in MaxCompute.
We recommend that you use MaxCompute Studio to debug local PyODPS programs.
What do I do if the nested loop execution is slow?
We recommend that you use the Dict data structure to collect the execution results of the loop and import the results into a DataFrame object in the outer loop, as shown in the sketch after this paragraph. If you generate a DataFrame object by running code such as df=XXX in each loop iteration, the execution speed of the nested loop is slow.
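The following minimal sketch shows the recommended pattern. The loop keys and the per-key computation are hypothetical:
import pandas as pd
from odps.df import DataFrame

results = {}
for key in ['a', 'b', 'c']:   # Hypothetical loop keys.
    results[key] = len(key)   # Hypothetical per-key computation.

# Create the DataFrame once, outside the loop, from the collected results.
df = DataFrame(pd.DataFrame(list(results.items()), columns=['key', 'value']))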
How do I prevent downloading data to a local directory?
For more information, see Use a PyODPS node to download data to a local directory for processing or to process data online.
In which scenarios can I download PyODPS data to my on-premises machine to process the data?
- A small amount of PyODPS data needs to be processed.
- If you need to apply a Python function to a single row of data or perform operations that change one row into multiple rows, you can use the PyODPS DataFrame and make full use of the parallel computing capabilities of MaxCompute.
For example, if you want to expand a JSON string into multiple rows of key-value pairs, run the following code:
In [12]: df
               json
0  {"a": 1, "b": 2}
1  {"c": 4, "b": 3}

In [14]: from odps.df import output

In [16]: @output(['k', 'v'], ['string', 'int'])
    ...: def h(row):
    ...:     import json
    ...:     for k, v in json.loads(row.json).items():
    ...:         yield k, v
    ...:

In [21]: df.apply(h, axis=1)
   k  v
0  a  1
1  b  2
2  c  4
3  b  3
A maximum of 10,000 records can be obtained by using open_reader. How do I obtain more than 10,000 records?
Use create table as select ... to save the SQL execution result as a table, and then use table.open_reader to read data.
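The following minimal sketch shows the approach. The table names are hypothetical:
# Save the query result as a table, then read it through Table Tunnel.
o.execute_sql('create table my_result as select * from source_table')  # Hypothetical names.
with o.get_table('my_result').open_reader() as reader:
    for record in reader:  # Not subject to the 10,000-record limit.
        print(record)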
Why are built-in operators recommended instead of UDFs?
UDFs are executed more slowly than built-in operators during calculation. Therefore, we recommend that you use built-in operators whenever possible.
For example, if you need to process millions of rows of data and you apply a UDF to each row, the execution time may increase from 7 seconds to 27 seconds. For larger datasets or more complex operations, the gap may be even larger.
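The following minimal sketch contrasts the two approaches on a hypothetical numeric field named val:
df = o.get_table('your_table').to_df()  # Hypothetical table with a numeric field val.

fast = df.val + 1                   # Built-in operator: runs as a compiled expression.
slow = df.val.map(lambda v: v + 1)  # Python UDF: executed row by row, much slower.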
Why are the partition values of a partitioned table that is obtained by using DataFrame().schema.partitions empty?
In a DataFrame, partition fields are not processed differently from common fields, so the partition values that are obtained in this way are empty. To filter data by partition, run code similar to the following:
df = o.get_table('<table_name>').to_df()
df[df.ds == '<partition_value>']
For more information about how to configure partitions or read data from partitions, see Tables.
How do I use PyODPS DataFrame to perform the Cartesian product operation?
For more information, see Use PyODPS DataFrame to process Cartesian products.
How do I use a PyODPS node to segment Chinese text based on Jieba?
For more information, see Use a PyODPS node to segment Chinese text based on Jieba.
How do I use PyODPS to download full data?
By default, PyODPS does not limit the amount of data that can be read from an instance. However, for a protected project, the amount of data that can be downloaded by using Tunnel commands is limited. If you do not specify options.tunnel.limit_instance_tunnel, the limit is automatically enabled, and the number of data records that can be downloaded is limited based on the configurations of the MaxCompute project. In most cases, a maximum of 10,000 data records can be downloaded at a time. If you need to iteratively obtain all data, you must disable the limit on the amount of data. You can execute the following statements to enable Instance Tunnel and disable the limit:
from odps import options

options.tunnel.use_instance_tunnel = True
options.tunnel.limit_instance_tunnel = False  # Disable the limit on the amount of data and read all data.
with instance.open_reader() as reader:
    for record in reader:  # Use Instance Tunnel to read all data.
        print(record)
Can I use execute_sql or a DataFrame to compute the percentage of null values of a field?
We recommend that you use a DataFrame to perform aggregate operations due to its high aggregate performance.
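The following minimal sketch computes the percentage of null values of a hypothetical field named col:
df = o.get_table('your_table').to_df()         # Hypothetical table.

total = df.count().execute()                   # Total number of rows.
nulls = df[df.col.isnull()].count().execute()  # Number of rows in which col is null.
print(float(nulls) / total)                    # Percentage of null values.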
How do I configure data types for PyODPS?
- Execute o.execute_sql('set odps.sql.type.system.odps2=true;query_sql', hints={"odps.sql.submit.mode" : "script"}) to enable the new data types that are supported by the MaxCompute V2.0 data type edition.
- Enable the new data types that are supported by the MaxCompute V2.0 data type edition by using a DataFrame. For example, configure the hints parameter for an immediately executed action, such as persist, execute, or to_pandas. The following configurations are valid only for a single job:
from odps.df import DataFrame
users = DataFrame(o.get_table('odps2_test'))
users.persist('copy_test', hints={'odps.sql.type.system.odps2': 'true'})
If you want to enable the new data types that are supported by the MaxCompute V2.0 data type edition by using a DataFrame and you want the setting to take effect globally, set options.sql.use_odps2_extension to True.
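A minimal sketch of the global setting:
from odps import options

options.sql.use_odps2_extension = True  # Enable MaxCompute V2.0 data types globally for DataFrame jobs.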