PyODPS is the MaxCompute SDK for Python. It provides easy-to-use Python programming interfaces. Similar to pandas, PyODPS provides fast, flexible, and expressive data structures: its DataFrame API offers data processing features similar to those of pandas. This topic describes how to use PyODPS in your projects.

Prerequisites

Background information

You can develop most programs in MaxCompute by using SQL statements. However, complex business logic and user-defined functions (UDFs) require Python. For example, you must use Python in the following scenarios:
  • Interface connection

    An external service must provide the required information to complete authentication before it can access data records in MaxCompute tables over HTTP interfaces.

  • Asynchronous calls

    In most cases, the system creates a separate node for each of thousands of tasks that share similar data processing logic. These nodes are difficult to manage and consume excessive resources when they run at the same time. PyODPS allows you to use queues to run SQL tasks asynchronously at high concurrency and to manage all tasks in a unified manner.
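    The fan-out pattern described above can be sketched in plain Python. The `submit_sql` function below is a hypothetical stand-in; in a real PyODPS node it would wrap `o.run_sql(sql)` (which submits the statement asynchronously) followed by `wait_for_success()` on the returned instance.

```python
# Sketch: run many similar SQL tasks from one PyODPS node instead of
# creating one DataWorks node per task.
from __future__ import print_function
import threading

try:
    import queue            # Python 3
except ImportError:
    import Queue as queue   # Python 2.7, the version used by PyODPS nodes

def submit_sql(sql):
    # Hypothetical stand-in for o.run_sql(sql).wait_for_success().
    return 'finished: ' + sql

def worker(tasks, results):
    # Drain the shared queue until it is empty.
    while True:
        try:
            sql = tasks.get_nowait()
        except queue.Empty:
            return
        results.append(submit_sql(sql))

tasks = queue.Queue()
for i in range(20):
    tasks.put('SELECT {0};'.format(i))

results = []
threads = [threading.Thread(target=worker, args=(tasks, results)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(results))  # 20
```

    Because all 20 tasks run inside a single node, they can be throttled by the pool size (4 workers here) and monitored in one place.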

  • UDF development

    If the built-in functions of MaxCompute cannot meet your requirement, you can develop UDFs. For example, you can develop a UDF to format data of the DECIMAL type with thousands separators.
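    The formatting logic of such a UDF can be sketched as a plain Python function. In MaxCompute you would register it as a function decorated with `odps.udf.annotate`; the exact output format below (`'{:,}'`) is only one assumption about the desired separators.

```python
# Sketch of the formatting logic for a thousands-separator UDF.
# In MaxCompute this would be wrapped with odps.udf.annotate and
# registered as a function; here it is plain Python so it runs anywhere.
def format_thousands(value):
    # '{:,}' inserts comma thousands separators, e.g. 1234567 -> '1,234,567'.
    return '{:,}'.format(value)

print(format_thousands(1234567))     # 1,234,567
print(format_thousands(1234567.89))  # 1,234,567.89
```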

  • SQL performance improvement

    To check whether database transactions follow the first-in, first-out (FIFO) rule, you must compare each new record with the historical records in a single transaction table, which generally contains a large number of records. If you perform the comparison with SQL statements, the time complexity is O(N²), and the statements may fail to return the expected result. With Python, you can cache the records and traverse the transaction table only once. In this case, the time complexity is O(N), and the efficiency is improved significantly.
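    The single-pass idea can be sketched as follows. The event format, a sequence of `(action, item_id)` tuples, is a hypothetical simplification of the transaction table; the point is that a cached deque lets each record be examined once, for O(N) total work.

```python
from collections import deque

def follows_fifo(events):
    """One-pass FIFO check: each 'out' must remove the oldest cached 'in'.

    events is a hypothetical simplification of the transaction table:
    a sequence of (action, item_id) tuples, where action is 'in' or 'out'.
    """
    pending = deque()          # cache of items that entered but have not left
    for action, item_id in events:
        if action == 'in':
            pending.append(item_id)
        elif not pending or pending.popleft() != item_id:
            return False       # left out of order, or left without entering
    return True

print(follows_fifo([('in', 'a'), ('in', 'b'), ('out', 'a'), ('out', 'b')]))  # True
print(follows_fifo([('in', 'a'), ('in', 'b'), ('out', 'b')]))                # False
```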

  • Other scenarios

    You can use Python to develop an amount allocation model. For example, you can develop a model that allocates USD 10 among three persons. In this model, you must define the logic that handles the cash over and short, that is, the remainder left over after rounding.
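    A minimal sketch of such an allocation, under the assumption that each share is rounded down to the cent and the cash over and short is assigned to the last person:

```python
from decimal import Decimal, ROUND_DOWN

def allocate(total, people):
    """Split `total` evenly, assigning the rounding remainder to the last person."""
    total = Decimal(total)
    share = (total / people).quantize(Decimal('0.01'), rounding=ROUND_DOWN)
    shares = [share] * people
    shares[-1] += total - share * people   # the cash over and short
    return shares

print(allocate('10.00', 3))  # [Decimal('3.33'), Decimal('3.33'), Decimal('3.34')]
```

    Using Decimal rather than float keeps the shares exact, so the allocated amounts always sum to the original total.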

Procedure

  1. Create a PyODPS node.
    This section describes how to create a PyODPS 2 node in the DataWorks console. For more information, see Create a PyODPS 2 node.
    Note
    • The Python version of a PyODPS node is 2.7.
    • Each PyODPS node can process a maximum of 50 MB of data and occupy a maximum of 1 GB of memory. If a node exceeds these limits, DataWorks terminates it. Do not write Python code that processes a large amount of data in a PyODPS node.
    • Writing and debugging code in DataWorks is inefficient. We recommend that you install IntelliJ IDEA locally to write code.
    1. Create a workflow.
      Log on to the DataWorks console and go to the DataStudio page. On the left side of the page, right-click Business Flow and click Create Workflow.
    2. Create a PyODPS 2 node.
      Find the newly created workflow, right-click the workflow name, and then choose Create > MaxCompute > PyODPS 2. In the Create Node dialog box, specify Node Name and click Commit.
  2. Configure and run the PyODPS 2 node.
    1. Write the code of the PyODPS 2 node.
      Write the test code in the code editor of the PyODPS 2 node. In this example, write the following code, which demonstrates common table operations. For more information about table operations and SQL operations, see Tables and SQL.
      from odps import ODPS
      import sys

      reload(sys)
      # Set UTF-8 as the default encoding format. This statement is required if the data contains Chinese characters.
      sys.setdefaultencoding('utf8')

      # Create a non-partitioned table named my_new_table with the specified field names and data types.
      # Each PyODPS node in DataWorks contains the global variable odps or o, which is the MaxCompute entry. You can use the entry without defining it. For more information, see Use PyODPS in DataWorks.
      table = o.create_table('my_new_table', 'num bigint, id string', if_not_exists=True)

      # Write data to the my_new_table table.
      records = [[111, 'aaa'],
                 [222, 'bbb'],
                 [333, 'ccc'],
                 [444, 'Chinese']]
      o.write_table(table, records)

      # Read data from the my_new_table table.
      for record in o.read_table(table):
          print record[0], record[1]

      # Read data from the my_new_table table by executing an SQL statement.
      result = o.execute_sql('select * from my_new_table;', hints={'odps.sql.allow.fullscan': 'true'})

      # Obtain the execution result of the SQL statement.
      with result.open_reader() as reader:
          for record in reader:
              print record[0], record[1]

      # Delete the table.
      table.drop()
    2. Run the code.
      After you write the code, click the Run icon in the top toolbar. When the run completes, you can view the result of the PyODPS 2 node on the Runtime Log tab. If the log reports no errors, the code ran successfully.