This topic describes how to use PyODPS in DataWorks.

Go to the Data Analytics page of DataWorks

You can go to the Data Analytics page of DataWorks to create a PyODPS node.

Create a node

PyODPS nodes are classified into two types: PyODPS 2 and PyODPS 3. The two types use different Python versions at the underlying layer: PyODPS 2 nodes use Python 2, and PyODPS 3 nodes use Python 3. Create the type of PyODPS node that matches the Python version that you use.

For more information about how to create a PyODPS node, see Create a PyODPS 2 node and Create a PyODPS 3 node.

For more examples, see Use a PyODPS node to segment Chinese text based on Jieba.

Limits

  • Each PyODPS node can process a maximum of 50 MB of data and can occupy a maximum of 1 GB of memory. If a node exceeds either limit, DataWorks terminates it. Therefore, do not write unnecessary Python data processing code in PyODPS nodes.
  • The efficiency of writing and debugging code in DataWorks is low. We recommend that you install an integrated development environment (IDE) on your machine to write code.
  • To reduce pressure on the DataWorks gateway, DataWorks limits the CPU utilization and memory usage of PyODPS nodes. If the system displays Got killed, the memory usage has exceeded the limit and the related process has been terminated. Therefore, we recommend that you do not perform operations on local data. These limits do not apply to SQL or DataFrame tasks (except to_pandas) that are initiated by PyODPS, because such tasks are executed in MaxCompute rather than on the gateway.
  • Features may be limited in the following aspects because packages such as matplotlib are not available:
    • The plot function of DataFrame cannot be used properly.
    • DataFrame user-defined functions (UDFs) are run only after they are submitted to MaxCompute. As required by the Python sandbox, UDFs can use only pure Python libraries and the NumPy library; other third-party libraries, such as pandas, cannot be used in UDFs.
    • Code other than UDFs can use the NumPy and pandas libraries that are pre-installed in DataWorks. Third-party packages that contain binary code are not supported.
  • For compatibility reasons, options.tunnel.use_instance_tunnel is set to False in DataWorks by default. If you want to enable InstanceTunnel globally, you must set this parameter to True.
  • For implementation reasons, the Python atexit package is not supported. Use try-finally to implement the same logic, as shown in the sketch after this list.
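
The following is a minimal sketch of the try-finally pattern that can replace atexit-based cleanup. It assumes that the pyodps_iris table used elsewhere in this topic exists in your project:

try:
    # Main logic of the node.
    o.execute_sql('SELECT * FROM pyodps_iris LIMIT 10')
finally:
    # Cleanup logic that you would otherwise register with atexit.register.
    print('Node finished. Cleanup is complete.')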

MaxCompute entry

Each PyODPS node in DataWorks contains the global variable odps or o, which is the MaxCompute entry. You do not need to specify the MaxCompute entry.
print(o.exist_table('pyodps_iris'))
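
The entry can also be used to access tables directly. The following is a minimal sketch that assumes the pyodps_iris table exists in the current project:

t = o.get_table('pyodps_iris')
print(t.schema)  # Print the table schema.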

Execute SQL statements

You can execute SQL statements in the PyODPS node. For more information, see SQL.
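
The following is a minimal sketch of running a SQL statement and reading its result. It assumes that the pyodps_iris table exists; the instance variable defined here is also used in the snippets below:

instance = o.execute_sql('SELECT * FROM pyodps_iris LIMIT 10')  # Runs synchronously and returns an instance object.
with instance.open_reader() as reader:
    print(reader.count)  # Number of records in the result.
    for record in reader:
        print(record)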

By default, InstanceTunnel is disabled in DataWorks. In this case, instance.open_reader is run by using the Result interface, and a maximum of 10,000 data records can be read. You can use reader.count to obtain the number of data records. If you need to iteratively obtain all data, you must remove the limit on the amount of data. You can execute the following statements to enable InstanceTunnel and remove the limit.
from odps import options

options.tunnel.use_instance_tunnel = True
options.tunnel.limit_instance_tunnel = False  # Remove the limit on the amount of data.

with instance.open_reader() as reader:
    # Use InstanceTunnel to read all data.
    for record in reader:
        print(record)

You can also add tunnel=True to open_reader to enable InstanceTunnel for the current open_reader operation. You can add limit=False to open_reader to remove the limit on the amount of data for the current open_reader operation.

with instance.open_reader(tunnel=True, limit=False) as reader:
    # The current open_reader operation uses InstanceTunnel, and all data can be read.
    for record in reader:
        print(record)

DataFrame

  • Execution method

    DataFrame API operations are not executed automatically. They are executed only when you explicitly call an immediately executed method, such as execute or head.

    from odps.df import DataFrame

    iris = DataFrame(o.get_table('pyodps_iris'))
    for record in iris[iris.sepal_width < 3].execute():  # Call an immediately executed method to process each data record.
        print(record)

    If you set options.interactive to True, an immediately executed method is called automatically when data needs to be displayed.

    from odps import options
    from odps.df import DataFrame
    options.interactive = True  # Set options.interactive to True at the beginning of the code. 
    iris = DataFrame(o.get_table('pyodps_iris'))
    print(iris.sepal_width.sum())  # The expression is executed immediately when its result is displayed.
  • Display details

    To display details, set options.verbose to True. In DataWorks, this parameter is set to True by default, so the system displays details such as the Logview URL while a node is running.
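
    The following is a minimal sketch; in DataWorks you normally do not need to set options.verbose because it is already True:

    from odps import options
    from odps.df import DataFrame
    options.verbose = True  # Enabled by default in DataWorks.
    iris = DataFrame(o.get_table('pyodps_iris'))
    print(iris.head(5))  # The Logview URL of the job is displayed while it runs.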

Obtain scheduling parameters

Different from SQL nodes in DataWorks, a PyODPS node does not replace strings such as ${param_name} in the code. Instead, before the code is run, a dictionary named args is added as a global variable, and you can obtain the scheduling parameters from this dictionary. This way, the Python code itself is not affected. For example, if you set ds=${yyyymmdd} under Schedule > Parameter in DataWorks, you can run the following command to obtain the parameter value:

print('ds=' + args['ds'])
# Output: ds=20161116
Note You can run the following command to obtain the partition named ds=${yyyymmdd}:
o.get_table('table_name').get_partition('ds=' + args['ds'])
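
The partition value obtained from args can also be used to read data. The following is a minimal sketch; table_name is a placeholder for your own partitioned table:

t = o.get_table('table_name')
partition_spec = 'ds=' + args['ds']
if t.exist_partition(partition_spec):
    with t.open_reader(partition=partition_spec) as reader:
        print(reader.count)  # Number of records in the partition.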