This topic describes how to use a PyODPS node in DataWorks to segment Chinese text based on the open source segmentation tool Jieba and write the segmented words and phrases to a new table. This topic also describes how to use closure functions to segment Chinese text based on a custom dictionary.

Prerequisites

  • A DataWorks workspace is created. In this example, a workspace in basic mode is used. The workspace is associated with multiple MaxCompute compute engines. For more information, see Create a workspace.
  • The open source Jieba package is downloaded from GitHub. You can download the ZIP archive of the master branch of the https://github.com/fxsjy/jieba repository, which is named jieba-master.zip.

Background information

PyODPS nodes are integrated with MaxCompute SDK for Python. You can directly edit Python code and use MaxCompute SDK for Python on PyODPS nodes of DataWorks. For more information about PyODPS nodes, see Create a PyODPS 2 node.

This topic describes how to use a PyODPS node to segment Chinese text based on Jieba.
Notice Sample code in this topic is for reference only. We recommend that you do not use the code in your production environment.
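
For example, in a PyODPS node, the MaxCompute entry point is preconfigured as the global variable o, so your code can access project objects without additional setup. The following is a minimal sketch, assuming that a table named jieba_test already exists in the project:

  print(o.exist_table('jieba_test'))  # Check whether the table exists.
  df = o.get_table('jieba_test').to_df()  # Open the table as a PyODPS DataFrame.
  print(df.head(5))  # Preview the first five rows.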

Use open source packages to segment Chinese text based on Jieba

  1. Create a workflow.
    1. Log on to the DataWorks console.
    2. In the left-side navigation pane, click Workspaces.
    3. Select the region where the workspace resides, find the workspace, and then click Data Development in the Actions column.
    4. Move the pointer over the Create icon and click Workflow.
    5. In the Create Workflow dialog box, specify the Workflow Name and Description parameters. Then, click Create.
      Notice The workflow name must be 1 to 128 characters in length, and can contain letters, digits, underscores (_), and periods (.).
  2. Upload the jieba-master.zip package.
    1. Click the workflow that you created, expand MaxCompute, right-click Resource, and then choose Create > Archive.
    2. In the Create Resource dialog box, configure the parameters and click Create. The following list describes the parameters.
      • Engine Type: The compute engine where the resource resides. Select the engine type from the drop-down list.
        Note If only one instance is bound to your workspace, this parameter is not displayed.
      • Engine Instance: The name of the MaxCompute engine to which the task is bound.
      • Location: The folder that is used to store the resource. The default value is the path of the current folder. You can modify the path based on your business requirements.
      • File Type: Select Archive from the File Type drop-down list.
        Note If the resource package has been uploaded to the MaxCompute client, clear Upload to MaxCompute. Otherwise, an error occurs during the upload process.
      • File: Click Upload, select the downloaded file named jieba-master.zip from your on-premises machine, and then click Open.
      • Resource Name: The name of the resource. The resource name can be different from the name of the file that you uploaded but must comply with the following conventions:
        • The resource name can contain only letters, digits, periods (.), underscores (_), and hyphens (-).
        • If you select Archive from the File Type drop-down list, the extension of the resource name must be the same as that of the file name. The extension can be .zip, .tgz, .tar.gz, or .tar.
    3. In the toolbar, click the Submit icon.
    4. In the Commit dialog box, specify the Change description parameter and click Commit.
  3. Create a table that is used to store test data.
    1. Click the workflow that you created, expand MaxCompute, right-click Table, and then select Create Table.
    2. In the Create Table dialog box, specify the Table Name parameter and click Create.
      Note In this example, the table name is jieba_test.
    3. Click DDL Statement and enter the following DDL statement to create a table:
      CREATE TABLE jieba_test (
          `chinese` string,
          `content` string
      );
      Note The table in this example contains two columns. You can segment text in one column during data development.
    4. In the Confirm message, click OK.
    5. In the General section, specify the Display Name parameter for the table, and then click Commit to Production Environment.
    6. In the Commit to Production Environment dialog box, select I am aware of the risk and confirm the commission, and then click OK.
  4. Use the same method to create a table named jieba_result. This table is used to store the test result. Sample DDL statement:
    CREATE TABLE jieba_result (
        `chinese` string
    );
    Note In this example, only the text in the chinese column of the test data is segmented. Therefore, the result table contains only one column.
  5. Click Test Data to download the test data.
  6. Upload test data.
    1. Click the Import icon on the DataStudio page.
    2. In the Data Import Wizard dialog box, enter the name of the test table jieba_test to which you want to import data, select the table, and then click Next.
    3. Click Browse, upload the jieba_test.csv file from your on-premises machine, and then click Next.
    4. Select By Name and click Import Data.
  7. Create a PyODPS 2 node.
    1. Click the workflow, expand MaxCompute, right-click Data Analytics, and then choose Create > PyODPS 2.
    2. In the Create Node dialog box, specify the Node Name and Location parameters and click Commit.
      Note
      • The node name must be 1 to 128 characters in length, and can contain letters, digits, underscores (_), and periods (.).
      • In this example, the Node Name parameter is set to word_split.
    3. On the configuration tab of the node, enter the following PyODPS code:
      def test(input_var):
          import jieba
          import sys
          reload(sys)  # PyODPS 2 nodes run Python 2, in which reload() is a built-in function.
          sys.setdefaultencoding('utf-8')
          result = jieba.cut(input_var, cut_all=False)  # Segment the text in precise mode.
          return "/ ".join(result)
      hints = {
          'odps.isolation.session.enable': True
      }
      libraries = ['jieba-master.zip']  # Reference the jieba-master.zip package.
      iris = o.get_table('jieba_test').to_df()  # Reference the data in the jieba_test table.
      example = iris.chinese.map(test).execute(hints=hints, libraries=libraries)
      print(example)  # Display the text segmentation result.
      abci = list(example)  # Convert the text segmentation result into a list.
      for item in abci:
          o.write_table('jieba_result', [[str(item)]])  # Write the records to the jieba_result table one by one.
      print("done")
    4. Click the Save icon in the toolbar to save the code.
    5. Click the Run icon in the toolbar. In the Parameters dialog box, select a resource group from the Resource Group drop-down list and click OK.
      Note For more information about resource groups for scheduling, see Overview.
    6. View the execution result of the Jieba segmentation program on the Runtime Log tab in the lower part of the page.
  8. Create and run an ODPS SQL node.
    1. Click the workflow, expand MaxCompute, right-click Data Analytics, and then choose Create > ODPS SQL.
    2. In the Create Node dialog box, specify the Node Name and Location parameters and click Commit.
      Note The node name must be 1 to 128 characters in length and can contain letters, digits, underscores (_), and periods (.).
    3. On the configuration tab of the node, enter the following SQL statement:
      select * from jieba_result;
    4. Click the Save icon in the toolbar to save the code.
    5. Click the Run icon in the toolbar. In the Parameters dialog box, select a resource group from the Resource Group drop-down list and click OK.
      Note For more information about resource groups for scheduling, see Overview.
    6. In the Expense Estimate dialog box, check the estimated cost and click Run.
    7. View the execution result on the Runtime Log tab in the lower part of the page.

Use custom dictionaries to segment Chinese text based on Jieba

If the dictionary of the Jieba tool does not meet your requirements, you can use a custom dictionary.

You can use a PyODPS user-defined function (UDF) to read table or file resources that are uploaded to MaxCompute. In this case, you must write the UDF as a closure function or a callable class. If you need to reference complex UDFs, you can create a MaxCompute function in DataWorks. For more information, see Create a MaxCompute function.
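
For reference, the following is a minimal sketch of the callable-class form, assuming that the dictionary file is passed as the first element of the resources parameter, as in the closure example later in this section. The class name Segmenter is chosen for illustration only.

  class Segmenter(object):
      def __init__(self, resources):  # The resources are passed to the constructor.
          import jieba
          jieba.load_userdict(resources[0])  # jieba keeps module-level state, so loading the dictionary once here is enough.

      def __call__(self, input_var):
          import jieba
          return "/ ".join(jieba.cut(input_var, cut_all=False))

You can pass the class to map() in the same way as a closure, for example iris.chinese.map(Segmenter, resources=[file_object]).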

In this topic, a closure function is used to reference the custom dictionary file key_words.txt that is uploaded to MaxCompute.
  1. Click the workflow, expand MaxCompute, right-click Resource, and then choose Create > File.
  2. In the Create Resource dialog box, configure the parameters and click Create. The following list describes the parameters.
    • Engine Type: The compute engine where the resource resides. Select the engine type from the drop-down list.
      Note If only one instance is bound to your workspace, this parameter is not displayed.
    • Engine Instance: The name of the MaxCompute engine to which the task is bound.
    • Location: The folder that is used to store the resource. The default value is the path of the current folder. You can modify the path based on your business requirements.
    • File Type: Select File from the File Type drop-down list.
      Note If you want to upload a dictionary file from your on-premises machine to DataWorks, the file must be encoded in UTF-8.
    • Upload: Click Upload, select the key_words.txt file from your on-premises machine, and then click Open.
    • Resource Name: The name of the resource. The resource name can contain only letters, digits, periods (.), underscores (_), and hyphens (-).
  3. On the configuration tab of the key_words.txt resource, enter the content of the custom dictionary. The dictionary uses the following format, as shown in the sample below:
    • Each word occupies one line.
    • Each line contains up to three parts in sequence: the word, the word frequency, and the part of speech. The frequency and the part of speech are optional. Separate adjacent parts with a space. The order of the three parts cannot be changed.
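    The following sample shows what the dictionary content might look like. The words, frequencies, and part-of-speech tags are hypothetical examples; replace them with your own entries:
    云计算 5
    创新办 3 i
    大数据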
  4. Click the Submit icon in the toolbar to commit the resource.
  5. Create a PyODPS 2 node and enter the following code:
    def test(resources):
        import jieba
        import sys
        reload(sys)
        sys.setdefaultencoding('utf-8')
        fileobj = resources[0]
        jieba.load_userdict(fileobj)  # Load the custom dictionary once, instead of once for each row.

        def h(input_var):  # Use the nested function h() to segment the text.
            result = jieba.cut(input_var, cut_all=False)
            return "/ ".join(result)
        return h
    hints = {
        'odps.isolation.session.enable': True
    }
    libraries = ['jieba-master.zip']  # Reference the jieba-master.zip package.
    iris = o.get_table('jieba_test').to_df()  # Reference the data in the jieba_test table.
    
    file_object = o.get_resource('key_words.txt')  # Use the get_resource() function to reference the MaxCompute resource.
    example = iris.chinese.map(test, resources=[file_object]).execute(hints=hints, libraries=libraries)  # Call the map function and pass the resources parameter.
    
    print(example)  # Display the text segmentation result.
    abci = list(example)  # Convert the text segmentation result into a list.
    for item in abci:
        o.write_table('jieba_result', [[str(item)]])  # Write the records to the jieba_result table one by one.
    print("done")
  6. Run the code and compare the results before and after the custom dictionary is referenced.
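    For example, the following minimal sketch reads the result table back in a PyODPS node so that you can compare the output of the two runs:
    result_df = o.get_table('jieba_result').to_df()  # Read the result table as a DataFrame.
    print(result_df.head(10))  # Preview the first ten segmented rows.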