
MaxCompute: FAQ about PyODPS

Last Updated: Jun 07, 2023

This topic provides answers to some frequently asked questions about PyODPS.

The questions are grouped into the following categories: install PyODPS, import modules, and use PyODPS.

What do I do if the error message "Warning: XXX not installed" appears when I install PyODPS?

This warning indicates that a component was not installed together with PyODPS. Identify the missing component based on the XXX part of the error message and run the pip command to install it.

What do I do if the error message "Project Not Found" appears when I install PyODPS?

Causes

  • The endpoint that you configured is invalid. You must change it to the endpoint of the project that you access by using PyODPS. For more information about endpoints, see Endpoints.

  • The positions of parameters for the MaxCompute entry object are invalid. Check the positions of the parameters for the MaxCompute entry object and make sure that you enter valid parameters. For more information about the parameters for a MaxCompute entry object, see Migrate PyODPS nodes from DataWorks to an on-premises environment.

What do I do if the error message "Syntax Error" appears when I install PyODPS?

The Python version is outdated. Python 2.5 and earlier versions are not supported. We recommend that you use a mainstream version that is supported by PyODPS, such as Python 2.7.6 or a later minor version, Python 3.3 or a later minor version, or Python 2.6.

What do I do if the error message "Permission Denied" appears when I install PyODPS in macOS?

Run the sudo pip install pyodps command to install PyODPS in macOS.

What do I do if the error message "Operation Not Permitted" appears when I install PyODPS in macOS?

This issue is caused by System Integrity Protection (SIP). Restart your device and hold down Command+R during the restart to enter recovery mode. Then, run the following commands in the terminal to address the issue:

csrutil disable
reboot       

For more information, see Operation Not Permitted when on root - El Capitan (rootless disabled).

What do I do if the error message "No Module Named ODPS" appears when I run the from odps import ODPS code?

The PyODPS package cannot be loaded. Causes:

  • Cause 1: Multiple Python versions are installed.

    Solution: In the current directory, search for folders that are named odps and contain an odps.py or __init__.py file. Then, perform one of the following operations:

    • If a folder with that name exists, rename the folder.

    • If you installed a Python package that is named odps, run the sudo pip uninstall odps command to remove the package.

  • Cause 2: Both Python 2 and Python 3 are installed.

    Solution: Make sure that only Python 2 or Python 3 is installed on your on-premises machine.

  • Cause 3: PyODPS is not installed in the Python version that you use.

    Solution: Install PyODPS in the Python version that you use. For more information about how to install PyODPS, see Install PyODPS.

What do I do if the error message "Cannot Import Name ODPS" appears when I run the from odps import ODPS code?

Check whether a file that is named odps.py exists in the current working path. If the file exists, rename the file and run the from odps import ODPS code again.

What do I do if the error message "Cannot Import Module odps" appears when I run the from odps import ODPS code?

A dependency issue occurs in PyODPS. Join the DingTalk group for PyODPS technical support to resolve the issue.

What do I do if the error message "ImportError" appears when I use PyODPS in IPython or Jupyter Notebook?

Add from odps import errors to the header of the code.

If the error message still appears after you add from odps import errors to the header of the code, the IPython component is not installed. Run the sudo pip install -U jupyter command to install the IPython component.

What does the size field represent in o.get_table('table_name').size?

The size field specifies the physical size of a table.

How do I configure a Tunnel endpoint?

Configure the Tunnel endpoint by using the options.tunnel.endpoint parameter. For more information about the options.tunnel.endpoint parameter, see aliyun-odps-python-sdk.
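
The following sketch shows the configuration. The endpoint value is a placeholder; replace it with the Tunnel endpoint of your region:

from odps import options
options.tunnel.endpoint = '<your_tunnel_endpoint>'  # Replace with the Tunnel endpoint of your region.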

How do I use a third-party package that contains CPython in PyODPS?

We recommend that you generate a wheel package that contains CPython. For more information, see Create a crcmod that can be used in MaxCompute.

What is the maximum amount of data that can be processed by a DataFrame in PyODPS? Is the size of a table limited?

The size of a table is not limited in PyODPS. The size of the DataFrame that is created by Pandas is limited by the size of the local memory.

How do I use max_pt in a DataFrame?

Use the odps.df.func module to call built-in functions of MaxCompute.

from odps.df import func
df = o.get_table('your_table').to_df()
df[df.ds == func.max_pt('your_project.your_table')]  # ds is a partition field.      

What is the difference between the open_writer() and write_table() methods when you use PyODPS to write data to a table?

Each time you call the write_table() method, MaxCompute generates a file on the server. This operation is time-consuming. If a large number of files are generated, the efficiency of subsequent queries decreases and the memory of the server may be insufficient. We recommend that you write multiple records at the same time or provide a Generator object when you use the write_table() method. For more information about how to use the write_table() method, see Write data to a table.

By default, data is written to blocks by using the open_writer() method.
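
The following minimal sketch shows both approaches. The table name my_table and the record values are hypothetical:

records = [[111, 'aaa'], [222, 'bbb'], [333, 'ccc']]

# write_table: pass multiple records in one call so that only one file is generated.
o.write_table('my_table', records)

# open_writer: data is written to blocks by default.
with o.get_table('my_table').open_writer() as writer:
    writer.write(records)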

Why is the amount of data that is queried on a DataWorks PyODPS node less than the amount of data that is returned in local mode?

By default, Instance Tunnel is disabled in DataWorks. In this case, instance.open_reader is run by using the Result interface, and a maximum of 10,000 data records can be read.

After Instance Tunnel is enabled, you can execute reader.count to obtain the number of data records. If you need to iteratively obtain all data, you must set options.tunnel.limit_instance_tunnel to False to remove the limit.
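
For example, the following configuration enables Instance Tunnel and removes the limit before you read data:

from odps import options
options.tunnel.use_instance_tunnel = True
options.tunnel.limit_instance_tunnel = False  # Remove the record limit for iterative reads.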

How do I call the count function to obtain the total number of rows in a DataFrame?

  1. After you install PyODPS, run the following command in the Python environment to create a MaxCompute table and initialize the DataFrame:

    iris = DataFrame(o.get_table('pyodps_iris'))        
  2. Call the count function to compute the total number of rows in the DataFrame.

    iris.count()      
  3. DataFrame operations are executed in a delayed manner. They run only when you explicitly call the execute method or a method that triggers immediate execution. To prevent delayed execution of the count function, run the following command:

    iris.count().execute()

For more information about how to obtain the total number of rows in a DataFrame, see Aggregation. For more information about the delayed execution of PyODPS methods, see Execution.

What do I do if the error message "sourceIP is not in the white list" appears when I use PyODPS?

An IP address whitelist is configured for the MaxCompute project that is accessed by PyODPS. Contact the project owner to add the IP address of your device to the IP address whitelist. For more information about how to configure IP address whitelists, see Manage IP address whitelists.

What do I do if I fail to configure the runtime environment of MaxCompute by using from odps import options and options.sql.settings?

  • Issue description

    When PyODPS is used to execute SQL statements, the following code is used to configure the runtime environment of MaxCompute before the request for a MaxCompute instance is sent:

    from odps import options
    options.sql.settings = {'odps.sql.mapper.split.size': 32}     

    After the job is run, only six mappers are enabled, which indicates that the settings do not take effect. If the set odps.stage.mapper.split.size=32 command is run on the MaxCompute client instead, the job is successfully run in one minute.

  • Cause

    The parameters that are configured on the MaxCompute client and in PyODPS are different. The parameter on the MaxCompute client is odps.stage.mapper.split.size, and the parameter in PyODPS is odps.sql.mapper.split.size.

  • Solution

    Change the parameter in PyODPS to odps.stage.mapper.split.size.
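
    The corrected configuration uses the same parameter name as the MaxCompute client:

    from odps import options
    options.sql.settings = {'odps.stage.mapper.split.size': 32}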

Why does the error message "IndexError: list index out of range" appear when I call the head method of a DataFrame?

The list is empty, or the index that is accessed exceeds the number of elements in the list.

What do I do if the error message "ODPSError" appears when I upload a Pandas DataFrame to MaxCompute?

  • Issue description

    The following error message appears when a Pandas DataFrame is uploaded to MaxCompute:

    ODPSError: ODPS entrance should be provided.
  • Cause

    A global MaxCompute entry object is not found.

  • Solution

    Use one of the following methods, as shown in the sketch after this list:

    • Use the room mechanism %enter to configure the global MaxCompute entry object.

    • Call the to_global method to configure the global MaxCompute entry object.

    • Pass the entry object explicitly, for example, DataFrame(pd_df).persist('your_table', odps=odps).
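
    A minimal sketch of the last two methods. The table name your_table and the sample data are hypothetical:

    import pandas as pd
    from odps.df import DataFrame

    o.to_global()  # Register o as the global MaxCompute entry object.
    pd_df = pd.DataFrame({'a': [1, 2, 3]})
    DataFrame(pd_df).persist('your_table', odps=o)  # Or pass the entry object explicitly.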

What do I do if the error message "lifecycle is not specified in mandatory mode" appears when I use a DataFrame to write data to a table?

  • Issue description

    The following error message appears when a DataFrame is used to write data to a table:

    table lifecycle is not specified in mandatory mode
  • Cause

    You have not configured a lifecycle for the table.

  • Solution

    The project requires that a lifecycle is configured for each table. Add the following configuration before you write data to a table:

    from odps import options
    options.lifecycle = 7  # Configure a lifecycle. The value must be an integer. Unit: days.       

What do I do if the error message "Perhaps the datastream from server is crushed" appears when I use PyODPS to write data to a table?

This error is caused by dirty data. Check whether the number of columns that you specify is the same as that of the table to which you want to write data.

What do I do if the error message "Project is protected" appears when I use PyODPS to read data from a table?

The project security policy does not allow you to read data from a table. To read all data from the table, use one of the following solutions:

  • Contact the project owner to add an exception policy.

  • Use DataWorks or other masking tools to mask the data and export the data to a project for which data protection is not enabled before you read the data.

If you want to read only part of the data, use one of the following solutions:

  • Use o.execute_sql('select * from <table_name>').open_reader().

  • Use a DataFrame, that is, o.get_table('<table_name>').to_df().

What do I do if the error message "ConnectionError: timed out try catch exception" occasionally appears when I run a PyODPS script task?

Causes:

  • The connection timed out. By default, PyODPS considers a connection as timed out if no response is received within 5 seconds. Use one of the following solutions to address the issue:

    • Add the following code to the header of your script to increase the timeout period:

      from odps import options
      options.connect_timeout = 30
    • Capture exceptions and try again.

  • Some machines are not allowed to access network services due to sandbox limits. We recommend that you use exclusive resource groups for scheduling to run the script task to address this issue.

What do I do if the error message "is not defined" appears when I use PyODPS to execute the get_sql_task_cost function?

  • Issue description

    The following error message appears when PyODPS is used to execute the get_sql_task_cost function:

    NameError: name 'get_task_cost' is not defined.
  • Cause

    The name of the function is invalid.

  • Solution

    Use the execute_sql_cost function instead of the get_sql_task_cost function.
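
    A minimal sketch; the SQL statement is hypothetical:

    cost = o.execute_sql_cost('select * from your_table')
    print(cost)  # Inspect the returned cost estimate.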

When I use PyODPS to display logs, Chinese characters are displayed in an encoded form. How do I retain Chinese characters in the logs?

Specify Chinese characters in a format similar to print("我叫 %s" % ('abc')). This issue occurs only in Python 2.

If I set options.tunnel.use_instance_tunnel to False, fields of the DATETIME type are defined in MaxCompute, but the data that is obtained by using the SELECT statements is of the STRING type. Why?

By default, PyODPS calls the old Result interface when you call open_reader. In this case, the data that is obtained from the server is in the CSV format, and data of the DATETIME type is converted into the STRING type.

To address this issue, enable Instance Tunnel by setting options.tunnel.use_instance_tunnel to True. This way, PyODPS calls Instance Tunnel by default.
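
A minimal sketch; the table name your_table is hypothetical:

from odps import options
options.tunnel.use_instance_tunnel = True  # Read results through Instance Tunnel so that DATETIME values keep their type.

with o.execute_sql('select * from your_table').open_reader() as reader:
    for record in reader:
        print(record)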

How do I use Python language features to achieve various purposes in PyODPS?

  • Write Python functions.

    You can use multiple methods to compute the distance between two points, such as the Euclidean distance and the Manhattan distance. You can also write a series of functions and call the functions when you compute data based on your business requirements.

    def euclidean_distance(from_x, from_y, to_x, to_y):
        return ((from_x - to_x) ** 2 + (from_y - to_y) ** 2).sqrt()

    def manhattan_distance(from_x, from_y, to_x, to_y):
        return (from_x - to_x).abs() + (from_y - to_y).abs()

    The following sample code shows how to call a function that you write:

    In [42]: df
         from_x    from_y      to_x      to_y
    0  0.393094  0.427736  0.463035  0.105007
    1  0.629571  0.364047  0.972390  0.081533
    2  0.460626  0.530383  0.443177  0.706774
    3  0.647776  0.192169  0.244621  0.447979
    4  0.846044  0.153819  0.873813  0.257627
    5  0.702269  0.363977  0.440960  0.639756
    6  0.596976  0.978124  0.669283  0.936233
    7  0.376831  0.461660  0.707208  0.216863
    8  0.632239  0.519418  0.881574  0.972641
    9  0.071466  0.294414  0.012949  0.368514
    
    In [43]: euclidean_distance(df.from_x, df.from_y, df.to_x, df.to_y).rename('distance')
       distance
    0  0.330221
    1  0.444229
    2  0.177253
    3  0.477465
    4  0.107458
    5  0.379916
    6  0.083565
    7  0.411187
    8  0.517280
    9  0.094420
    
    In [44]: manhattan_distance(df.from_x, df.from_y, df.to_x, df.to_y).rename('distance')
       distance
    0  0.392670
    1  0.625334
    2  0.193841
    3  0.658966
    4  0.131577
    5  0.537088
    6  0.114198
    7  0.575175
    8  0.702558
    9  0.132617                       
  • Use conditions and loop statements in Python.

    If the tables that you want to compute are stored in a database, you must process the fields of the tables based on configurations and perform UNION or JOIN operations on the tables. If you use SQL statements to perform these operations, the process is complex. We recommend that you use a DataFrame to perform the operations.

    For example, if you want to merge 30 tables into one table, you must perform the UNION ALL operation on the 30 tables if you use SQL statements. If you use PyODPS, run the following code:

    from functools import reduce  # Required in Python 3; built in for Python 2.

    table_names = ['table1', ..., 'tableN']
    dfs = [o.get_table(tn).to_df() for tn in table_names]
    reduce(lambda x, y: x.union(y), dfs)

    ## The reduce statement is equivalent to the following code:
    df = dfs[0]
    for other_df in dfs[1:]:
        df = df.union(other_df)

How do I use the Pandas DataFrame backend to debug local PyODPS programs?

You can use one of the following methods to debug local PyODPS programs. The initialization differs between the methods, but the subsequent code is the same.

  • A PyODPS DataFrame that is created from a Pandas DataFrame can be debugged locally by using the Pandas backend.

  • A DataFrame that is created from a MaxCompute table is executed in MaxCompute.

The following code is used in this example.

df = o.get_table('movielens_ratings').to_df()
DEBUG = True
if DEBUG:
    df = df[:100].to_pandas(wrap=True)       

This way, all subsequent code can be debugged locally at a high speed. After the debugging is complete, change the value of the DEBUG parameter to False. Then, all data is computed in MaxCompute.

We recommend that you use MaxCompute Studio to debug local PyODPS programs.

What do I do if the nested loop execution is slow?

We recommend that you collect the execution results of the loop in a Dict data structure and import them into DataFrame objects in the outer loop, as shown in the sketch below. If you place DataFrame object code such as df=XXX inside the loop, a new DataFrame object is generated in each iteration, and the execution of the nested loop slows down.
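
A minimal sketch of this pattern. The DataFrame df, its column col, and the helper compute_value are hypothetical:

# Slow: a new DataFrame expression is built in every iteration.
# for key in keys:
#     df = df[df.col == key]

# Faster: do the loop work in plain Python first, then build the
# DataFrame expression once in the outer scope.
results = {}
for key in ['k1', 'k2', 'k3']:
    results[key] = compute_value(key)  # compute_value is a hypothetical helper.
df = df[df.col.isin(list(results.keys()))]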

How do I prevent downloading data to a local directory?

For information, see Use a PyODPS node to download data to a local directory for processing or to process data online.

In which scenarios can I download PyODPS data to my on-premises machine to process the data?

You can download PyODPS data to your on-premises machine in one of the following scenarios:

  • A small amount of PyODPS data needs to be processed.

  • If you need to use a Python function for a single row of data or perform operations that change one row to multiple rows, you can use a PyODPS DataFrame instead and make full use of the parallel computing capabilities of MaxCompute.

    For example, if you want to expand a column of JSON strings into multiple rows of key-value pairs, run the following code:

    In [12]: df
                   json
    0  {"a": 1, "b": 2}
    1  {"c": 4, "b": 3}
    
    In [14]: from odps.df import output
    
    In [16]: @output(['k', 'v'], ['string', 'int'])
        ...: def h(row):
        ...:     import json
        ...:     for k, v in json.loads(row.json).items():
        ...:         yield k, v
        ...:   
    
    In [21]: df.apply(h, axis=1)
       k  v
    0  a  1
    1  b  2
    2  c  4
    3  b  3                          

A maximum of 10,000 records can be obtained by using open_reader. How do I obtain more than 10,000 records?

Use create table as select ... to save the SQL execution result as a table, and use table.open_reader to read data.
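
A minimal sketch; the table names and the query are hypothetical:

# Save the query result as a table, then read the table through Table Tunnel.
o.execute_sql('create table result_table lifecycle 1 as select * from source_table')
t = o.get_table('result_table')
with t.open_reader() as reader:
    print(reader.count)  # The table reader is not subject to the 10,000-record limit.
    for record in reader:
        print(record)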

Why am I recommended to use built-in operators instead of UDFs?

UDFs are executed slower than built-in operators during calculation. Therefore, we recommend that you use built-in operators.

For example, if you need to process millions of rows of data and you use a UDF to process each row, the execution time increases from 7 seconds to 27 seconds. For larger datasets or more complex operations, the gap may be even larger.
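
A minimal sketch of the difference. The DataFrame df and its column val are hypothetical:

# Slower: a Python UDF is executed for every row.
squared = df.val.map(lambda x: x ** 2, rtype='float')

# Faster: the built-in operator runs natively in MaxCompute.
squared = df.val ** 2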

Why are the partition values of a partitioned table that are obtained by using DataFrame().schema.partitions empty?

A DataFrame does not distinguish between partition fields and common fields. Therefore, partition fields are processed as common fields, and the partitions property of the schema is empty. You can filter data by the partition field in the following way:

df = o.get_table('<table_name>').to_df()
df[df.ds == '<partition_value>']  # ds is the partition field.

For more information about how to configure partitions or read data from partitions, see Tables.

How do I use PyODPS DataFrame to perform the Cartesian product operation?

For more information, see Use PyODPS DataFrame to process Cartesian products.

How do I use a PyODPS node to segment Chinese text based on Jieba?

For information, see Use a PyODPS node to segment Chinese text based on Jieba.

How do I use PyODPS to download full data?

By default, PyODPS does not limit the amount of data that can be read from an instance. However, the amount of data that can be downloaded for a protected project by using Tunnel commands is limited. If you do not specify options.tunnel.limit_instance_tunnel, the limit is automatically enabled, and the number of data records that can be downloaded is limited based on the configurations of the MaxCompute project. In most cases, a maximum of 10,000 data records can be downloaded at a time. If you need to iteratively obtain all data, you must disable the limit on the amount of data. You can execute the following statements to enable Instance Tunnel and disable the limit:

options.tunnel.use_instance_tunnel = True
options.tunnel.limit_instance_tunnel = False  # Disable the limit on the amount of data and read all data.

with instance.open_reader() as reader:
    # Use Instance Tunnel to read all data.
    for record in reader:
        print(record)

Can I use execute_sql or a DataFrame to compute the percentage of null values of a field?

We recommend that you use a DataFrame to perform aggregate operations because of its high aggregate performance.
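
A minimal sketch of the DataFrame approach. The table your_table and the column col1 are hypothetical:

df = o.get_table('your_table').to_df()
total = df.count().execute()                    # Total number of rows.
nulls = df[df.col1.isnull()].count().execute()  # Rows in which col1 is NULL.
print(float(nulls) / total)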

How do I configure data types for PyODPS?

If you use PyODPS, you can use one of the following methods to enable MaxCompute V2.0 data types:

  • Execute o.execute_sql('set odps.sql.type.system.odps2=true;query_sql', hints={"odps.sql.submit.mode" : "script"}) to enable MaxCompute V2.0 data types.

  • Enable MaxCompute V2.0 data types by using a DataFrame. For example, call an immediately executed method, such as persist, execute, or to_pandas, with the hints parameter. The following configuration is valid only for a single job:

    from odps.df import DataFrame
    users = DataFrame(o.get_table('odps2_test'))
    users.persist('copy_test', hints={'odps.sql.type.system.odps2': 'true'})

    If you want to enable MaxCompute V2.0 data types by using a DataFrame and you want the setting to take effect globally, set options.sql.use_odps2_extension to True.
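
    The global setting is a single option:

    from odps import options
    options.sql.use_odps2_extension = True  # Enable MaxCompute V2.0 data types globally.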

What do I do if the error message "ValueError" appears when I use PyODPS?

You can use one of the following methods to resolve this issue:

  • Upgrade your SDK version to V0.8.4 or later.

  • Add the following code before you execute the SQL statement:

    from odps.types import Decimal
    Decimal._max_precision = 38