Segment Chinese text at scale on MaxCompute using the open source Jieba library on a PyODPS 3 node in DataWorks. When the default Jieba dictionary doesn't cover your domain-specific terms, upload a custom dictionary as a MaxCompute resource to extend segmentation results.
Typical use cases include text analysis, information retrieval, text mining, feature extraction, search engine building, machine translation, and language model training.
The steps in this topic are for reference only. Do not use them in a production environment.
Prerequisites
Before you begin, make sure you have:
- A DataWorks workspace. See Create and manage workspaces.
- A MaxCompute data source added and associated with the workspace. See Add a MaxCompute data source and associate the data source with a workspace.
Background
DataWorks PyODPS nodes let you write Python code and use the MaxCompute SDK for Python directly in your workflows. Two node types are available:
| Node type | Python support | Notes |
|---|---|---|
| PyODPS 3 | Python 3.X only | Supports pip installation; recommended for new development |
| PyODPS 2 | Python 2.X and 3.X | Use only if you require Python 2.X |
For new projects, use PyODPS 3 nodes. See Develop a PyODPS 3 task.
Prepare: download the Jieba package
Download the open source Jieba package from GitHub.
Practice 1: Segment text with the default Jieba dictionary
Step 1: Create a workflow
See Create a workflow.
Step 2: Upload the Jieba package as a MaxCompute resource
1. Right-click the workflow name and choose Create Resource > MaxCompute > Archive.
2. In the Create Resource dialog box, set the following parameters and click Create.

   | Parameter | Description |
   |---|---|
   | File | Click Upload and select the jieba-master.zip file you downloaded. |
   | Name | Enter a resource name. In this practice, use jieba-master.zip. The resource name can differ from the file name but must follow naming conventions. |

3. Click the commit icon in the top toolbar to commit the resource.
Step 3: Create the input and output tables
Create two tables:
- jieba_test — stores the input text data
- jieba_result — stores the segmentation output
To create a table, right-click the workflow name and choose Create Table > MaxCompute > Table. Configure the table in the dialog box and run the DDL statements below to define the schema. After creating each table, commit it to the development environment. See Create and manage MaxCompute tables.
```sql
-- Input table: stores test data
CREATE TABLE jieba_test (
    `chinese` string,
    `content` string
);

-- Output table: stores segmentation results
CREATE TABLE jieba_result (
    `chinese` string
);
```
Step 4: Import test data
1. Download jieba_test.csv to your local machine.
2. In the Scheduled Workflow pane on the DataStudio page, click the import icon to open the Data Import Wizard.
3. Enter jieba_test in the table name field, select the table, and click Next.
4. Upload jieba_test.csv, configure the upload settings, preview the data, and click Next.
5. Select By Name and click Import Data.
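If you want to inspect the expected file layout before importing, a stand-in CSV with the same two-column schema (chinese, content) can be generated locally. The rows below are purely illustrative samples, not the contents of the real jieba_test.csv download:

```python
import csv
import io

# Hypothetical sample rows; the actual jieba_test.csv from the download differs.
rows = [
    ("欢迎使用MaxCompute", "测试数据1"),
    ("中文分词示例文本", "测试数据2"),
]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["chinese", "content"])  # header row lets By Name matching work
for chinese, content in rows:
    writer.writerow([chinese, content])

print(buf.getvalue())
```

Selecting By Name in the wizard maps CSV columns to table columns via the header row, so keep the header consistent with the table schema.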
Step 5: Create a PyODPS 3 node
1. Right-click the workflow name and choose Create Node > MaxCompute > PyODPS 3.
2. In the Create Node dialog box, set Name to word_split and click Confirm.
Step 6: Run the segmentation code
Paste the following code into the word_split node and run it. The code segments the chinese column in jieba_test, writes the results to jieba_result, and prints the first 10 rows.
```python
def test(input_var):
    import jieba
    result = jieba.cut(input_var, cut_all=False)
    return "/ ".join(result)

hints = {
    'odps.isolation.session.enable': True,
    'odps.stage.mapper.split.size': 64,  # Controls input split size to improve execution parallelism
}
libraries = ['jieba-master.zip']  # Makes the Jieba package available to all worker nodes

src_df = o.get_table('jieba_test').to_df()
result_df = src_df.chinese.map(test).persist('jieba_result', hints=hints, libraries=libraries)
print(result_df.head(10))
```
The odps.stage.mapper.split.size hint controls how much input each mapper reads; smaller splits start more mappers and increase execution parallelism. See Flag parameters.
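As a rough rule of thumb (this assumes the commonly documented 256 MB default split size and is not an exact model of the MaxCompute scheduler), the mapper count grows as the split size shrinks:

```python
import math

def estimated_mappers(input_size_mb, split_size_mb=256):
    """Rough estimate: one mapper per input split."""
    return max(1, math.ceil(input_size_mb / split_size_mb))

# For a 1 GB input, lowering the split size from the assumed 256 MB default
# to the 64 MB used in this topic raises the estimate from 4 mappers to 16.
print(estimated_mappers(1024))      # default split size
print(estimated_mappers(1024, 64))  # split size used in this topic
```

Smaller splits help short, CPU-bound tasks like segmentation; very small splits add scheduling overhead, so tune rather than minimize.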
Step 7: View results
Two options are available:
- Runtime Log tab at the bottom of the page — suitable for quickly checking output during development.
- Ad Hoc Query in the left-side navigation pane of DataStudio — lets you query the full result set in the jieba_result table:

  ```sql
  select * from jieba_result;
  ```
Practice 2: Segment text with a custom dictionary
When Jieba's default dictionary misses domain-specific terms, upload a custom dictionary as a MaxCompute File resource.
In PyODPS, user-defined functions (UDFs) that read MaxCompute resources must be written as closure functions or callable class functions.
To create MaxCompute functions that reference complex UDFs, see Create and use a MaxCompute function.
Step 1: Create the custom dictionary resource
1. Right-click the workflow name and choose Create Resource > MaxCompute > File.
2. In the Create Resource dialog box, set Name to key_words.txt and click Create.
3. On the key_words.txt configuration tab, enter the custom dictionary content, then save and commit the resource. The following example adds two domain-specific terms, one per line: 增量备份 ("incremental backup") and 安全合规 ("security compliance"). Adjust the content based on your requirements.

   ```
   增量备份
   安全合规
   ```
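Jieba's user dictionary format is one entry per line: the word, optionally followed by a frequency and a part-of-speech tag (`word [freq] [tag]`), whitespace-separated. The following standard-library-only sketch (no jieba required; the term 数据工场 is a hypothetical example) checks that your dictionary lines parse as intended:

```python
def parse_userdict_line(line):
    """Split a jieba-style dictionary line into (word, freq, tag).

    Format: word [freq] [tag]; freq and tag are both optional.
    """
    parts = line.strip().split()
    word = parts[0]
    freq = int(parts[1]) if len(parts) > 1 and parts[1].isdigit() else None
    if len(parts) > 2:
        tag = parts[2]
    elif len(parts) > 1 and not parts[1].isdigit():
        tag = parts[1]
    else:
        tag = None
    return word, freq, tag

entries = ["增量备份", "安全合规 100", "数据工场 3 n"]
parsed = [parse_userdict_line(e) for e in entries]
print(parsed)
```

Lines that fail this shape (for example, two terms on one line) will not behave as separate dictionary entries when loaded with jieba.load_userdict.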
Step 2: Run the segmentation code with the custom dictionary
Paste the following code into a PyODPS 3 node and run it:

```python
def test(resources):
    import jieba
    fileobj = resources[0]
    jieba.load_userdict(fileobj)  # Runs once per worker: load the custom dictionary.
    def h(input_var):
        # The nested h() function segments each row with the loaded dictionary.
        result = jieba.cut(input_var, cut_all=False)
        return "/ ".join(result)
    return h

hints = {
    'odps.isolation.session.enable': True,
    'odps.stage.mapper.split.size': 64,
}
libraries = ['jieba-master.zip']

src_df = o.get_table('jieba_test').to_df()
file_object = o.get_resource('key_words.txt')  # Reference the MaxCompute resource with get_resource().
mapped_df = src_df.chinese.map(test, resources=[file_object])  # Pass the resource via the resources parameter.
result_df = mapped_df.persist('jieba_result2', hints=hints, libraries=libraries)
print(result_df.head(10))
```
As in Practice 1, the odps.stage.mapper.split.size hint controls the input split size per mapper and therefore the execution parallelism. See Flag parameters.
Step 3: View results
- Runtime Log tab at the bottom of the page — suitable for quickly checking output during development.
- Ad Hoc Query in the left-side navigation pane of DataStudio — lets you query the full result set in the jieba_result2 table:

  ```sql
  select * from jieba_result2;
  ```