This topic describes how to use a PyODPS node in DataWorks to segment Chinese text based on Jieba, an open-source Chinese text segmentation tool, and write the segmented words and phrases to a new table. This topic also describes how to use a closure function to segment Chinese text based on a custom dictionary.

Prerequisites

  • A DataWorks workspace is created. This topic uses a workspace in basic mode as an example. For more information, see Create a workspace. To go to the DataStudio page of the workspace, log on to the DataWorks console and click Workspaces in the left-side navigation pane. On the page that appears, find the workspace and click Data Analytics in the Actions column.
  • The jieba-master.zip package of Jieba is downloaded from GitHub.

Background information

PyODPS nodes are integrated with the Python SDK of MaxCompute. You can directly edit Python code and use the Python SDK of MaxCompute in PyODPS nodes. For more information about how to create a PyODPS node, see PyODPS node.

Notice Sample code in this topic is for reference only. We recommend that you do not use the code in your production environment.
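
In a PyODPS node, the MaxCompute entry object o is preconfigured, so you can use the Python SDK without creating a connection yourself. The following minimal sketch, which assumes that the jieba_test table from this topic already exists, shows how the entry object is typically used in a node:

    # The entry object o is provided by the PyODPS node; no explicit connection is required.
    print(o.exist_table('jieba_test'))  # Check whether a table exists.
    for record in o.read_table('jieba_test', 3):  # Read up to three records from the table.
        print(record)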

Procedure

  1. Create a workflow.
    1. On the Data Analytics tab, right-click Business Flow and select Create Workflow.
    2. In the Create Workflow dialog box that appears, set Workflow Name and Description and click Create.
  2. Upload the jieba-master.zip package.
    1. Click the created workflow, right-click Resource under MaxCompute, and then choose Create > Archive.
    2. In the Create Resource dialog box that appears, set Resource Name, select Upload to MaxCompute, and then click Upload to upload the jieba-master.zip package. Then, click OK.
    3. Click the Submit icon in the toolbar to commit the resource.
  3. Create a source table for storing test data.
    1. Click the target workflow, right-click Table under MaxCompute, and then select Create Table.
    2. In the Create Table dialog box that appears, set Table Name, for example, to jieba_test, and click Commit.
    3. On the configuration tab of the table, click DDL Statement. In the DDL Statement dialog box that appears, enter the following table creation DDL statement.
      The source table in this example contains two columns. You can segment text in one column.
      CREATE TABLE `jieba_test` (
          `chinese` string,
          `content` string
      );
    4. Click Generate Table Schema.
    5. Click Commit to Production Environment.
  4. Create a result table to store the segmented words and phrases.
    This example assumes that only text in the chinese column in the source table is segmented. Therefore, the result table contains only one column. Use the following DDL statement to create the result table:
    CREATE TABLE `jieba_result` (
        `chinese` string
    );
  5. Upload test data.
    Download the jieba_test.csv file to your PC, and then import it by performing the following steps:
    1. On the Data Analytics tab, click the Import icon.
    2. In the Data Import Wizard dialog box that appears, set Select Table to the source table jieba_test and click Next.
    3. Click Browse, upload the jieba_test.csv file from your PC, and then click Next.
    4. Select By Name and click Import Data.
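    If you prefer not to use the import wizard, you can also write a few test rows from a PyODPS node. The following minimal sketch assumes the jieba_test table exists; the sample sentences are hypothetical placeholders, not the contents of jieba_test.csv:
      # Each record is a [chinese, content] pair that matches the jieba_test schema.
      # The rows below are hypothetical placeholders.
      records = [
          [u'今天天气很好', u'placeholder'],
          [u'我喜欢自然语言处理', u'placeholder'],
      ]
      o.write_table('jieba_test', records)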
  6. Create a PyODPS 2 node.
    1. Click the target workflow, right-click Data Analytics under MaxCompute, and then choose Create > PyODPS 2.
    2. In the Create Node dialog box that appears, set Node Name to word_split and click Commit. A PyODPS 2 node is created.
    3. On the configuration tab of the PyODPS 2 node, enter the node code.
      In this example, enter the following code:
      def test(input_var):
          import jieba
          import sys
          reload(sys)  # Required in Python 2 before sys.setdefaultencoding can be called.
          sys.setdefaultencoding('utf-8')
          result = jieba.cut(input_var, cut_all=False)  # Segment the text in accurate mode.
          return "/ ".join(result)
      hints = {
          'odps.isolation.session.enable': True
      }
      libraries = ['jieba-master.zip']  # Reference the jieba-master.zip package.
      iris = o.get_table('jieba_test').to_df()  # Reference data in the jieba_test table.
      example = iris.chinese.map(test).execute(hints=hints, libraries=libraries)
      print(example)  # Display the segmentation result returned by map().
      results = list(example)  # Convert the segmentation result to a list.
      for item in results:
          # jieba_result has only one column, so each record is a one-element list.
          o.write_table('jieba_result', [[str(item)]])  # Write the segmented words and phrases to the jieba_result table one by one.
      print("done")
    4. Click the Run icon in the toolbar to test the PyODPS 2 node.
    5. View the running result of the PyODPS 2 node on the Runtime Log tab.
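    The cut_all parameter of jieba.cut controls the segmentation mode. The following local sketch, which you can run in any environment where Jieba is installed, compares accurate mode (used in this topic) with full mode; the sample sentence comes from the Jieba documentation:
      # -*- coding: utf-8 -*-
      import jieba

      text = u'我来到北京清华大学'
      print('/ '.join(jieba.cut(text, cut_all=False)))  # Accurate mode: the most reasonable segmentation.
      print('/ '.join(jieba.cut(text, cut_all=True)))   # Full mode: all possible words.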
  7. On the Data Analytics tab, create an ODPS SQL node, enter the statement select * from jieba_result; in the node, and then click the Run icon to run the ODPS SQL node.
  8. Check whether the data is written to the result table based on the result on the Result tab.
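    Alternatively, you can query the result table directly from a PyODPS node instead of creating a separate ODPS SQL node. The following is a minimal sketch that assumes the jieba_result table has been written:
      # Run a query and iterate over the result set.
      with o.execute_sql('select * from jieba_result limit 10').open_reader() as reader:
          for record in reader:
              print(record)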
  9. If the default dictionary of Jieba does not meet your requirements, use a custom dictionary by following these steps.
    In a PyODPS node, a user-defined function (UDF) can reference table or file resources that are uploaded to MaxCompute. The UDF must be defined as a closure function or a callable class. In this example, a UDF is defined as a closure function that references the custom dictionary file key_words.txt uploaded to MaxCompute.
    1. Click the target workflow, right-click Resource under MaxCompute, and then choose Create > File. In the Create Resource dialog box that appears, set the required parameters and click OK.
      Note Select Upload to MaxCompute in the Create Resource dialog box.
    2. On the configuration tab of the file, enter the file content and click the Submit icon in the toolbar to commit the file. A custom dictionary file is created.
      The content you enter on the configuration tab of the file must meet the following requirements:
      • Each word or phrase occupies its own line.
      • Each line contains up to three parts in sequence: the word or phrase, its frequency, and its part of speech, separated by spaces. The frequency and part of speech are optional, but the order of the parts cannot be changed.

      If you upload a local dictionary file to DataWorks, the file must be encoded in UTF-8.
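      For example, a custom dictionary file may contain the following entries. The entries are hypothetical: the first line declares the word 云计算 with a frequency of 5 and the part of speech n (noun), and the last line omits the optional parts.
        云计算 5 n
        大数据 3 n
        自然语言处理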

    3. Create another PyODPS 2 node and enter the following code in the node:
      def test(resources):
          import jieba
          import sys
          reload(sys)  # Required in Python 2 before sys.setdefaultencoding can be called.
          sys.setdefaultencoding('utf-8')
          fileobj = resources[0]  # The first referenced resource: the key_words.txt file object.

          def h(input_var):  # The closure function h() loads the custom dictionary file and segments text.
              import jieba
              jieba.load_userdict(fileobj)
              result = jieba.cut(input_var, cut_all=False)
              return "/ ".join(result)
          return h
      hints = {
          'odps.isolation.session.enable': True
      }
      libraries = ['jieba-master.zip']  # Reference the jieba-master.zip package.
      iris = o.get_table('jieba_test').to_df()  # Reference data in the jieba_test table.
      
      file_object = o.get_resource('key_words.txt') # Use the get_resource() function to reference the key_words.txt file uploaded to MaxCompute.
      example = iris.chinese.map(test, resources=[file_object]).execute(hints=hints, libraries=libraries) # Call the map function to transfer the file by using the resources parameter.
      
      print(example)  # Display the segmentation result returned by map().
      results = list(example)  # Convert the segmentation result to a list.
      for item in results:
          # jieba_result has only one column, so each record is a one-element list.
          o.write_table('jieba_result', [[str(item)]])  # Write the segmented words and phrases to the jieba_result table one by one.
      print("done")
    4. Run the PyODPS 2 node and compare the segmentation results before and after the custom dictionary is referenced.
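      To preview the effect of the custom dictionary before you upload it, you can run a local sketch in any environment where Jieba is installed. The file path and sample sentence below are hypothetical:
        # -*- coding: utf-8 -*-
        import jieba

        text = u'我在学习自然语言处理'
        print('/ '.join(jieba.cut(text)))     # Segmentation with the default dictionary.
        jieba.load_userdict('key_words.txt')  # Load a local copy of the custom dictionary.
        print('/ '.join(jieba.cut(text)))     # Segmentation after the custom dictionary is loaded.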