This topic describes how to use a PyODPS node in DataWorks to segment Chinese text by using Jieba, an open source word segmentation tool, and write the segmented words and phrases to a new table. This topic also describes how to use a closure function to segment Chinese text based on a custom dictionary.

Prerequisites

  • A DataWorks workspace is created. In this example, a workspace in basic mode is used. The workspace is bound to multiple MaxCompute compute engines. For more information, see Create a workspace.
  • The open source Jieba package is downloaded from GitHub.

Background information

PyODPS integrates MaxCompute SDK for Python. You can directly edit Python code and use MaxCompute SDK for Python in PyODPS nodes of DataWorks. For more information about PyODPS nodes, see Create a PyODPS 2 node.
Notice Sample code in this topic is for reference only. We recommend that you do not use the code in your production environment.
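
In a PyODPS node, the MaxCompute entry object o is preset, so you can call MaxCompute SDK for Python methods without creating a connection first. The following lines are a minimal sketch of this; the jieba_test table that they reference is created later in this topic:

    print(o.exist_table('jieba_test'))  # Check whether a table exists in the current project.
    for table in o.list_tables():  # Iterate over the tables in the current project.
        print(table.name)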

Procedure

  1. Create a workflow.
    1. Log on to the DataWorks console.
    2. In the left-side navigation pane, click Workspaces.
    3. After you select the region where the workspace resides, find the required workspace and click Data Analytics.
    4. Move the pointer over the Create icon and click Workflow.
    5. In the Create Workflow dialog box, set the Workflow Name and Description parameters.
      Notice The workflow name must not exceed 128 characters in length and can contain uppercase and lowercase letters, digits, underscores (_), and periods (.).
    6. Click Create.
  2. Upload the jieba-master.zip package.
    1. Click the created workflow, right-click Resource under MaxCompute, and then choose Create > Archive.
    2. In the Create Resource dialog box, set the parameters as required.
      • Resource Name: The name of the resource. It can be different from the name of the uploaded file. However, the resource name must comply with the following conventions:
        • The name can contain only letters, digits, periods (.), underscores (_), and hyphens (-).
        • If the resource type is Archive, the extension of the resource name must be the same as that of the file name. The resource name must end with .zip, .tgz, .tar.gz, or .tar.
      • Location: The default path is the path of the current folder. The path can be modified.
      • Resource Type: Select Archive.
        Note If the resource package has been uploaded from the MaxCompute client, clear Upload to MaxCompute. Otherwise, an error occurs during the upload process.
      • Engine Instance MaxCompute: Select the compute engine where the resource resides from the drop-down list. If only one instance is bound to your workspace, this parameter is not displayed.
      • Upload: Click Upload, select the downloaded jieba-master.zip file from the on-premises machine, and then click Open.
    3. In the Create Resource dialog box, click OK.
    4. Click the Submit icon in the toolbar. In the Commit Node dialog box, set the Change description parameter.
    5. Click OK.
  3. Create a table for storing test data.
    1. Click the created workflow, right-click Table under MaxCompute, and then select Create Table.
    2. In the Create Table dialog box, set the Table Name parameter to a name such as jieba_test and set the Engine Instance MaxCompute parameter.
    3. Click Commit.
    4. Click DDL Statement and enter the following DDL statement for creating a table.
      The table in this example contains two columns. During development, you segment the text in one of the columns.
      CREATE TABLE jieba_test (
          `chinese` string,
          `content` string
      );
    5. Click Generate Table Schema.
    6. In the Confirm message, click OK.
    7. In the General section, set the Display Name parameter for the table.
    8. Click Commit to Production Environment.
    9. In the Commit to Production Environment dialog box, select I am aware of the risk and confirm the commission and click OK.
  4. Use the same method to create a table for storing the test result.
    In this example, only text in the chinese column in the test data is segmented. Therefore, the result table contains only one column. Use the following DDL statement:
    CREATE TABLE jieba_result (
        `chinese` string
    );
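    If you prefer to create both tables from a PyODPS node instead of the DDL dialog box, you can use the create_table() method of the entry object. The following lines are a minimal sketch that uses the same schemas as the preceding DDL statements:

    o.create_table('jieba_test', 'chinese string, content string', if_not_exists=True)  # Test data table.
    o.create_table('jieba_result', 'chinese string', if_not_exists=True)  # Result table.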
  5. Download the test data for word segmentation.
  6. Upload test data.
    1. Click the Import icon on the Data Analytics tab.
    2. In the Data Import Wizard dialog box, enter the name of the table to which you want to import the test data (jieba_test in this example), select the table, and then click Next.
    3. Click Browse, upload the jieba_test.csv file from the on-premises machine, and then click Next.
    4. Select By Name and click Import Data.
  7. Create a PyODPS 2 node.
    1. Click the required workflow, right-click Data Analytics under MaxCompute, and then choose Create > PyODPS 2.
    2. In the Create Node dialog box, set Node Name, for example, to word_split, and set Location.
      Note The node name must be 1 to 128 characters in length. It can contain letters, digits, underscores (_), and periods (.).
    3. Click Commit.
    4. On the configuration tab of the node, select the MaxCompute compute engine and enter the following PyODPS code:
      def test(input_var):
          import jieba
          import sys
          reload(sys)  # PyODPS 2 nodes run Python 2. reload(sys) is required before sys.setdefaultencoding() can be called.
          sys.setdefaultencoding('utf-8')
          result = jieba.cut(input_var, cut_all=False)  # Segment the text in accurate mode.
          return "/ ".join(result)
      hints = {
          'odps.isolation.session.enable': True  # Enable isolation so that third-party libraries can be used.
      }
      libraries = ['jieba-master.zip']  # Reference the jieba-master.zip package.
      iris = o.get_table('jieba_test').to_df()  # Reference data in the jieba_test table.
      example = iris.chinese.map(test).execute(hints=hints, libraries=libraries)
      print(example)  # Display the text segmentation result.
      words = list(example)  # Convert the text segmentation result to a list.
      o.write_table('jieba_result', [[str(word)] for word in words])  # Write one row per segmented string to the jieba_result table.
      print("done")
    5. Click the Save icon in the toolbar to save the code.
    6. Click the Run icon in the toolbar. In the Arguments dialog box, select a resource group from the Resource Group drop-down list.
    7. Click OK to test the PyODPS 2 node.
    8. View the running result of the Jieba segmentation program on the Runtime Log tab in the lower part of the page.
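    Optionally, you can verify the written rows directly in a PyODPS node by reading the result table back. The following lines are a minimal sketch, assuming that the preceding node has finished:

    for record in o.read_table('jieba_result', 10):  # Read the first 10 rows of the jieba_result table.
        print(record[0])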
  8. Create and run an ODPS SQL node.
    1. Click the required workflow, right-click Data Analytics under MaxCompute, and then choose Create > ODPS SQL.
    2. In the Create Node dialog box, set the Node Name and Location parameters.
      Note The node name must be 1 to 128 characters in length. It can contain letters, digits, underscores (_), and periods (.).
    3. Click Commit.
    4. On the configuration tab of the node, enter the SQL statement select * from jieba_result;.
    5. Click the Save icon in the toolbar to save the query statement.
    6. Click the Run icon in the toolbar. In the Arguments dialog box, select a resource group from the Resource Group drop-down list.
    7. Click OK to run the query statement.
    8. In the Expense Estimate dialog box, check the estimated cost and click Run.
    9. View the running result on the Runtime Log tab in the lower part of the page.
  9. If the dictionary of open source Jieba does not meet your requirements, use a custom dictionary.
    You can use a PyODPS user-defined function (UDF) to read resources uploaded to MaxCompute. In this case, you must write the UDF as a closure function or callable class.

    In this topic, a closure function is used to reference the custom dictionary file key_words.txt uploaded to MaxCompute.
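    For comparison, the callable-class form is sketched below. The class and its name are hypothetical, and the sketch assumes that PyODPS passes the resource list to the constructor and calls the instance once per row; see the PyODPS documentation for the exact contract. This topic uses the closure form in the code that follows.

    class SegmentWithDict(object):  # Hypothetical class; this topic uses the closure form instead.
        def __init__(self, resources):
            self.fileobj = resources[0]  # Assumed: the file object of the key_words.txt resource.

        def __call__(self, input_var):
            import jieba
            jieba.load_userdict(self.fileobj)  # Load the custom dictionary.
            result = jieba.cut(input_var, cut_all=False)
            return "/ ".join(result)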

    1. Click the required workflow, right-click Resource under MaxCompute, and then choose Create > File.
    2. In the Create Resource dialog box, set the parameters as required.
      • Resource Name: The name of the resource. It can contain letters, digits, periods (.), underscores (_), and hyphens (-).
      • Location: The default path is the path of the current folder. The path can be modified.
      • Resource Type: Select File.
        Note Make sure that Upload to MaxCompute is selected when you create the resource file used in this topic.
      • Engine Instance MaxCompute: Select the compute engine where the resource resides from the drop-down list.
    3. Click OK.
    4. On the configuration tab of the resource, select the MaxCompute compute engine and enter the content of the custom dictionary.
      Dictionary format:
      • Each word occupies a line.
      • Each line contains the following parts in order: word, frequency, and part of speech. The frequency and part of speech are optional. Separate the parts with spaces. The order of the parts cannot be changed.

      If you upload a dictionary file from the on-premises machine to DataWorks, the file must be encoded in UTF-8.
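      For example, the following entries, which are taken from the custom dictionary example in the Jieba documentation, are all valid:

      创新办 3 i
      云计算 5
      凱特琳 nz
      台中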

    5. Click the Submit icon in the toolbar to commit the resource.
    6. Create a PyODPS 2 node and enter the following code:
      def test(resources):
          import jieba
          import sys
          reload(sys)  # PyODPS 2 nodes run Python 2. reload(sys) is required before sys.setdefaultencoding() can be called.
          sys.setdefaultencoding('utf-8')
          fileobj = resources[0]  # The file object of the key_words.txt resource.

          def h(input_var):  # Use the nested h() function to load the dictionary and segment text.
              import jieba
              jieba.load_userdict(fileobj)  # Load the custom dictionary.
              result = jieba.cut(input_var, cut_all=False)
              return "/ ".join(result)
          return h
      hints = {
          'odps.isolation.session.enable': True  # Enable isolation so that third-party libraries can be used.
      }
      libraries = ['jieba-master.zip']  # Reference the jieba-master.zip package.
      iris = o.get_table('jieba_test').to_df()  # Reference data in the jieba_test table.

      file_object = o.get_resource('key_words.txt')  # Use the get_resource() method to reference the MaxCompute resource.
      example = iris.chinese.map(test, resources=[file_object]).execute(hints=hints, libraries=libraries)  # Pass the resource to map() by using the resources parameter.

      print(example)  # Display the text segmentation result.
      words = list(example)  # Convert the text segmentation result to a list.
      o.write_table('jieba_result', [[str(word)] for word in words])  # Write one row per segmented string to the jieba_result table.
      print("done")
    7. Run the code and compare the results before and after the custom dictionary is referenced.