Segment Chinese text at scale on MaxCompute using the open source Jieba library on a PyODPS 3 node in DataWorks. When the default Jieba dictionary doesn't cover your domain-specific terms, upload a custom dictionary as a MaxCompute resource to extend segmentation results.
Typical use cases include text analysis, information retrieval, text mining, feature extraction, search engine building, machine translation, and language model training.
The steps in this topic are for reference only. Do not use them in a production environment.
Prerequisites
Before you begin, make sure you have:
- A DataWorks workspace. See Create and manage workspaces.
- A MaxCompute data source added and associated with the workspace. See Add a MaxCompute data source and associate the data source with a workspace.
Background
DataWorks PyODPS nodes let you write Python code and use the MaxCompute SDK for Python directly in your workflows. Two node types are available:
| Node type | Python support | Notes |
|---|---|---|
| PyODPS 3 | Python 3.X only | Supports pip installation; recommended for new development |
| PyODPS 2 | Python 2.X and 3.X | Use only if you require Python 2.X |
For new projects, use PyODPS 3 nodes. See Develop a PyODPS 3 task.
Prepare: download the Jieba package
Download the open source Jieba package from GitHub.
Practice 1: Segment text with the default Jieba dictionary
Step 1: Create a workflow
See Create a workflow.
Step 2: Upload the Jieba package as a MaxCompute resource
1. Right-click the workflow name and choose Create Resource > MaxCompute > Archive.
2. In the Create Resource dialog box, set the following parameters and click Create.

   | Parameter | Description |
   |---|---|
   | File | Click Upload and select the jieba-master.zip file you downloaded. |
   | Name | Enter a resource name. In this practice, use jieba-master.zip. The resource name can differ from the file name but must follow naming conventions. |

3. Click the commit icon in the top toolbar to commit the resource.
Step 3: Create the input and output tables
Create two tables:
- jieba_test — stores the input text data
- jieba_result — stores the segmentation output
To create a table, right-click the workflow name and choose Create Table > MaxCompute > Table. Configure the table in the dialog box and run the DDL statements below to define the schema. After creating each table, commit it to the development environment. See Create and manage MaxCompute tables.
```sql
-- Input table: stores test data
CREATE TABLE jieba_test (
    `chinese` string,
    `content` string
);

-- Output table: stores segmentation results
CREATE TABLE jieba_result (
    `chinese` string
);
```
Step 4: Import test data
1. Download jieba_test.csv to your local machine.
2. In the Scheduled Workflow pane on the DataStudio page, click the import icon to open the Data Import Wizard.
3. Enter jieba_test in the table name field, select the table, and click Next.
4. Upload jieba_test.csv, configure the upload settings, preview the data, and click Next.
5. Select By Name and click Import Data.
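If you want to inspect the expected file layout before importing, a stand-in CSV with the same two-column schema (chinese, content) can be generated locally. The rows below are purely illustrative samples, not the contents of the real jieba_test.csv download:

```python
import csv
import io

# Hypothetical sample rows; the actual jieba_test.csv from the download differs.
rows = [
    ("欢迎使用MaxCompute", "测试数据1"),
    ("中文分词示例文本", "测试数据2"),
]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["chinese", "content"])  # header row lets By Name matching work
for chinese, content in rows:
    writer.writerow([chinese, content])

print(buf.getvalue())
```

Selecting By Name in the wizard maps CSV columns to table columns via the header row, so keep the header consistent with the table schema.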
Step 5: Create a PyODPS 3 node
1. Right-click the workflow name and choose Create Node > MaxCompute > PyODPS 3.
2. In the Create Node dialog box, set Name to word_split and click Confirm.
Step 6: Run the segmentation code
Paste the following code into the word_split node and run it. The code segments the chinese column in jieba_test, writes the results to jieba_result, and prints the first 10 rows.
```python
def test(input_var):
    import jieba
    result = jieba.cut(input_var, cut_all=False)
    return "/ ".join(result)

hints = {
    'odps.isolation.session.enable': True,
    'odps.stage.mapper.split.size': 64,  # Controls input split size to improve execution parallelism
}
libraries = ['jieba-master.zip']  # Makes the Jieba package available to all worker nodes

src_df = o.get_table('jieba_test').to_df()
result_df = src_df.chinese.map(test).persist('jieba_result', hints=hints, libraries=libraries)
print(result_df.head(10))
```
The odps.stage.mapper.split.size hint controls how much input each mapper reads; smaller splits start more mappers and increase execution parallelism. See Flag parameters.
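As a rough rule of thumb (this assumes the commonly documented 256 MB default split size and is not an exact model of the MaxCompute scheduler), the mapper count grows as the split size shrinks:

```python
import math

def estimated_mappers(input_size_mb, split_size_mb=256):
    """Rough estimate: one mapper per input split."""
    return max(1, math.ceil(input_size_mb / split_size_mb))

# For a 1 GB input, lowering the split size from the assumed 256 MB default
# to the 64 MB used in this topic raises the estimate from 4 mappers to 16.
print(estimated_mappers(1024))      # default split size
print(estimated_mappers(1024, 64))  # split size used in this topic
```

Smaller splits help short, CPU-bound tasks like segmentation; very small splits add scheduling overhead, so tune rather than minimize.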
Step 7: View results
Two options are available:
- Runtime Log tab at the bottom of the page — suitable for quickly checking output during development.
- Ad Hoc Query in the left-side navigation pane of DataStudio — lets you query the full result set in the jieba_result table:

  ```sql
  select * from jieba_result;
  ```
Practice 2: Segment text with a custom dictionary
When Jieba's default dictionary misses domain-specific terms, upload a custom dictionary as a MaxCompute File resource.
In PyODPS, user-defined functions (UDFs) that read MaxCompute resources must be written as closure functions or callable class functions.
To create MaxCompute functions that reference complex UDFs, see Create and use a MaxCompute function.
Step 1: Create the custom dictionary resource
1. Right-click the workflow name and choose Create Resource > MaxCompute > File.
2. In the Create Resource dialog box, set Name to key_words.txt and click Create.
3. On the key_words.txt configuration tab, enter the custom dictionary content, then save and commit the resource. The following example adds two domain-specific terms, one per line: 增量备份 ("incremental backup") and 安全合规 ("security compliance"). Adjust the content based on your requirements.

   ```
   增量备份
   安全合规
   ```
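Jieba's user dictionary format is one entry per line: the word, optionally followed by a frequency and a part-of-speech tag (`word [freq] [tag]`), whitespace-separated. The following standard-library-only sketch (no jieba required; the term 数据工场 is a hypothetical example) checks that your dictionary lines parse as intended:

```python
def parse_userdict_line(line):
    """Split a jieba-style dictionary line into (word, freq, tag).

    Format: word [freq] [tag]; freq and tag are both optional.
    """
    parts = line.strip().split()
    word = parts[0]
    freq = int(parts[1]) if len(parts) > 1 and parts[1].isdigit() else None
    if len(parts) > 2:
        tag = parts[2]
    elif len(parts) > 1 and not parts[1].isdigit():
        tag = parts[1]
    else:
        tag = None
    return word, freq, tag

entries = ["增量备份", "安全合规 100", "数据工场 3 n"]
parsed = [parse_userdict_line(e) for e in entries]
print(parsed)
```

Lines that fail this shape (for example, two terms on one line) will not behave as separate dictionary entries when loaded with jieba.load_userdict.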
Step 2: Run the segmentation code with the custom dictionary
Paste the following code into a PyODPS 3 node and run it:

```python
def test(resources):
    import jieba
    fileobj = resources[0]
    jieba.load_userdict(fileobj)  # Runs once per worker: load the custom dictionary.
    def h(input_var):
        # The nested h() function segments each row with the loaded dictionary.
        result = jieba.cut(input_var, cut_all=False)
        return "/ ".join(result)
    return h

hints = {
    'odps.isolation.session.enable': True,
    'odps.stage.mapper.split.size': 64,
}
libraries = ['jieba-master.zip']

src_df = o.get_table('jieba_test').to_df()
file_object = o.get_resource('key_words.txt')  # Reference the MaxCompute resource with get_resource().
mapped_df = src_df.chinese.map(test, resources=[file_object])  # Pass the resource via the resources parameter.
result_df = mapped_df.persist('jieba_result2', hints=hints, libraries=libraries)
print(result_df.head(10))
```
As in Practice 1, the odps.stage.mapper.split.size hint controls the input split size per mapper and therefore the execution parallelism. See Flag parameters.
Step 3: View results
- Runtime Log tab at the bottom of the page — suitable for quickly checking output during development.
- Ad Hoc Query in the left-side navigation pane of DataStudio — lets you query the full result set in the jieba_result2 table:

  ```sql
  select * from jieba_result2;
  ```