This topic describes how to use a PyODPS node in DataWorks to segment Chinese text by using the open source Jieba tool, and how to write the segmented words and phrases to a new table. This topic also describes how to use a closure function to segment Chinese text based on a custom dictionary.

Prerequisites

  • A DataWorks workspace is created. In this example, a workspace in basic mode is used. The workspace is associated with multiple MaxCompute compute engines. For more information, see Create a workspace.
  • The jieba-master.zip package of the open source Jieba tool is downloaded from GitHub.

Background information

PyODPS nodes integrate MaxCompute SDK for Python. You can directly edit Python code and use MaxCompute SDK for Python in PyODPS nodes of DataWorks. For more information about PyODPS nodes, see Create a PyODPS 2 node.
Notice Sample code in this topic is for reference only. We recommend that you do not use the code in your production environment.
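In a PyODPS node, DataWorks preconfigures the MaxCompute entry object o, so you do not need to create a connection in your code. The following minimal sketch only illustrates how the entry object is used; it assumes that a table named jieba_test already exists in the current project:

    df = o.get_table('jieba_test').to_df()  # Convert the MaxCompute table to a PyODPS DataFrame.
    print(df.head(5))  # Preview the first five rows of the table.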

Procedure

  1. Create a workflow.
    1. Log on to the DataWorks console.
    2. In the left-side navigation pane, click Workspaces.
    3. Select the region where the required workspace resides, find the workspace, and then click Data Analytics in the Actions column.
    4. Move the pointer over the Create icon and click Workflow.
    5. In the Create Workflow dialog box, set the Workflow Name and Description parameters.
      Notice The Workflow Name parameter must be 1 to 128 characters in length, and can contain letters, digits, underscores (_), and periods (.).
    6. Click Create.
  2. Upload the jieba-master.zip package.
    1. Click the created workflow, right-click Resource under MaxCompute, and then choose Create > Archive.
    2. In the Create Resource dialog box, set the parameters.
      • Resource Name: The name of the resource. It can be different from the name of the uploaded file. However, the resource name must comply with the following conventions:
        • The name can contain only letters, digits, periods (.), underscores (_), and hyphens (-).
        • If Archive is selected from the Resource Type drop-down list, the extension of the resource name must be the same as that of the file name. The extension can be .zip, .tgz, .tar.gz, or .tar.
      • Location: The folder for storing the resource. The default value is the path of the current folder. You can change the path based on your business requirements.
      • Resource Type: Select Archive from the Resource Type drop-down list.
        Note: If the resource package has been uploaded to the MaxCompute client, clear Upload to MaxCompute. Otherwise, an error occurs during the upload process.
      • Engine Instance: Select the compute engine where the resource resides from the drop-down list. If only one compute engine instance is bound to your workspace, this parameter is not displayed.
      • Upload: Click Upload, select the downloaded jieba-master.zip file from the on-premises machine, and then click Open.
    3. In the Create Resource dialog box, click Create.
    4. Click the Commit icon in the toolbar. In the Commit Node dialog box, set the Change description parameter.
    5. Click OK.
  3. Create a table for storing test data.
    1. Click the created workflow, right-click Table under MaxCompute, and then select Create Table.
    2. In the Create Table dialog box, set Table Name to an appropriate value, such as jieba_test, and select MaxCompute from the Please select an Engine type drop-down list.
    3. Click Create.
    4. Click DDL Statement and enter the following DDL statement for creating a table.
      The table in this example contains two columns. You can segment text in one column during data development.
      CREATE TABLE jieba_test (
          `chinese` string,
          `content` string
      );
    5. Click Generate Table Schema.
    6. In the Confirm message, click OK.
    7. In the General section, set the Display Name parameter for the table.
    8. Click Commit to Production Environment.
    9. In the Commit to Production Environment dialog box, select I am aware of the risk and confirm the commission, and then click OK.
  4. Use the same method to create a table for storing the test result.
    In this example, only text in the chinese column of the test data is segmented. Therefore, the result table contains only one column. The following DDL statement is used to create a table for storing the test result:
    CREATE TABLE jieba_result (
        `chinese` string
    );
  5. Download the test data for text segmentation.
  6. Upload test data.
    1. Click the Import icon on the DataStudio page.
    2. In the Data Import Wizard dialog box, enter the name of the test table (jieba_test) to which you want to import data, select the table, and then click Next.
    3. Click Browse, upload the jieba_test.csv file from the on-premises machine, and then click Next.
    4. Select By Name and click Import Data.
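    If you prefer to use the MaxCompute client (odpscmd) instead of the import wizard, a Tunnel upload is a possible alternative. The following command is a sketch that assumes the file name and table name used in this topic:

      tunnel upload jieba_test.csv jieba_test;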
  7. Create a PyODPS 2 node.
    1. Click the required workflow, right-click Data Analytics under MaxCompute, and then choose Create > PyODPS 2.
    2. In the Create Node dialog box, set Node Name to an appropriate value, such as word_split, and set Location.
      Note The node name must be 1 to 128 characters in length, and can contain letters, digits, underscores (_), and periods (.).
    3. Click Commit.
    4. On the configuration tab of the node, select the MaxCompute compute engine and enter the following PyODPS code:
      def test(input_var):
          import jieba
          import sys
          reload(sys)                      # Required in Python 2 before changing the default encoding.
          sys.setdefaultencoding('utf-8')  # Ensure that Chinese text is processed as UTF-8.
          result = jieba.cut(input_var, cut_all=False)
          return "/ ".join(result)
      hints = {
          'odps.isolation.session.enable': True
      }
      libraries = ['jieba-master.zip']  # Reference the jieba-master.zip package. 
      iris = o.get_table('jieba_test').to_df()  # Reference data in the jieba_test table. 
      example = iris.chinese.map(test).execute(hints=hints, libraries=libraries)
      print(example)  # Display the text segmentation result. 
      abci = list(example)  # Convert the text segmentation result to a list. 
      for item in abci:
          # Each record is a one-column list because the jieba_result table has a single column. 
          o.write_table('jieba_result', [[str(item)]])  # Write the records to the jieba_result table one by one. 
      print("done")
    5. Click the Save icon in the toolbar to save the code.
    6. Click the Run icon in the toolbar. In the Arguments dialog box, select a resource group from the Resource Group drop-down list.
    7. Click OK to test the PyODPS 2 node.
    8. View the running result of the Jieba segmentation program on the Runtime Log tab in the lower part of the page.
  8. Create and run an ODPS SQL node.
    1. Click the required workflow, right-click Data Analytics under MaxCompute, and then choose Create > ODPS SQL.
    2. In the Create Node dialog box, set the Node Name and Location parameters.
      Note The node name must be 1 to 128 characters in length, and can contain letters, digits, underscores (_), and periods (.).
    3. Click Commit.
    4. On the configuration tab of the node, enter the SQL statement select * from jieba_result;.
    5. Click the Save icon in the toolbar to save the SELECT statement.
    6. Click the Run icon in the toolbar. In the Arguments dialog box, select a resource group from the Resource Group drop-down list.
    7. Click OK to execute the SELECT statement.
    8. In the Expense Estimate dialog box, check the estimated cost and click Run.
    9. View the running result on the Runtime Log tab in the lower part of the page.
  9. If the dictionary of the Jieba tool does not meet your requirements, use a custom dictionary.
    You can use a PyODPS user-defined function (UDF) to read resources uploaded to MaxCompute. The resources can be tables or files. In this case, you must write the UDF as a closure function or callable class. If you need to reference complex UDFs, you can create a MaxCompute function in DataWorks. For more information, see Create a MaxCompute function.

    In this topic, a closure function is used to reference the custom dictionary file key_words.txt that is uploaded to MaxCompute.
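    As an alternative to a closure function, PyODPS also lets you reference resources from a callable class whose constructor receives the resource list. The following sketch is provided for illustration only and mirrors the closure used later in this step:

      class Segment(object):
          def __init__(self, resources):
              self.fileobj = resources[0]  # The file object of the referenced key_words.txt resource.

          def __call__(self, input_var):
              import jieba
              jieba.load_userdict(self.fileobj)
              return "/ ".join(jieba.cut(input_var, cut_all=False))

      # Usage sketch: iris.chinese.map(Segment, resources=[file_object])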

    1. Click the required workflow, right-click Resource under MaxCompute, and then choose Create > File.
    2. In the Create Resource dialog box, set the parameters.
      • Resource Name: The name of the resource. It can contain letters, digits, periods (.), underscores (_), and hyphens (-).
      • Location: The folder for storing the resource. The default value is the path of the current folder. You can change the path based on your business requirements.
      • Resource Type: Select File from the Resource Type drop-down list. Make sure that Upload to MaxCompute is selected when you create the resource file used in this topic.
      • Engine Instance: Select the compute engine where the resource resides from the Engine Instance drop-down list.
    3. Click Create.
    4. On the configuration tab of the resource, select the MaxCompute compute engine and enter the content of the custom dictionary.
      Dictionary format:
      • Each word occupies one line.
      • Each line contains up to three parts in a fixed order: the word, the word frequency, and the part of speech. The frequency and the part of speech are optional. Separate every two parts with a space.

      If you upload a dictionary file from the on-premises machine to DataWorks, the file must be encoded in UTF-8.
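      The following sample entries are provided for illustration only. The words, frequencies, and parts of speech are hypothetical and follow the format described above:

        云计算 5
        创新办 3 i
        自定义词 n
        台中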

    5. Click the Commit icon in the toolbar to commit the resource.
    6. Create a PyODPS 2 node and enter the following code:
      def test(resources):
          import jieba
          import sys
          reload(sys)                      # Required in Python 2 before changing the default encoding.
          sys.setdefaultencoding('utf-8')  # Ensure that Chinese text is processed as UTF-8.
          fileobj = resources[0]           # The file object of the referenced key_words.txt resource.
      
          def h(input_var):  # Use the nested h() function to load the dictionary and segment text. 
              import jieba
              jieba.load_userdict(fileobj)
              result = jieba.cut(input_var, cut_all=False)
              return "/ ".join(result)
          return h
      hints = {
          'odps.isolation.session.enable': True
      }
      libraries = ['jieba-master.zip']  # Reference the jieba-master.zip package. 
      iris = o.get_table('jieba_test').to_df()  # Reference data in the jieba_test table. 
      
      file_object = o.get_resource('key_words.txt') # Use the get_resource() function to reference the MaxCompute resource. 
      example = iris.chinese.map(test, resources=[file_object]).execute(hints=hints, libraries=libraries)  # Call the map function and pass the resources parameter. 
      
      print(example)  # Display the text segmentation result. 
      abci = list(example)  # Convert the text segmentation result to a list. 
      for item in abci:
          # Each record is a one-column list because the jieba_result table has a single column. 
          o.write_table('jieba_result', [[str(item)]])  # Write the records to the jieba_result table one by one. 
      print("done")
    7. Run the code and compare the results before and after the custom dictionary is referenced.