In scenarios in which you want to perform text analysis, information retrieval, text mining, feature extraction, search engine building, machine translation, or language model training, you can use a PyODPS node in DataWorks to segment Chinese text based on the open source segmentation tool Jieba and analyze and process text. If the default dictionary of Jieba cannot meet your business requirements, you can create a custom dictionary to add entries or modify segmentation results.
Background information
DataWorks provides PyODPS nodes. You can edit Python code and use MaxCompute SDK for Python on PyODPS nodes of DataWorks for data development. PyODPS nodes of DataWorks include PyODPS 2 nodes and PyODPS 3 nodes. PyODPS 3 provides a simple, easy-to-use API and can be installed by using pip. This way, you can better use the resources and features of MaxCompute. We recommend that you use PyODPS 3 nodes for data development. For more information, see Develop a PyODPS 3 task.
PyODPS 3 nodes support only Python 3.X. PyODPS 2 nodes support Python 2.X and Python 3.X. If you want to use Python 2.X, you can select only PyODPS 2 nodes.
The operations in this topic are only for reference. We recommend that you do not use them in the production environment.
Prerequisites
A DataWorks workspace is created. For more information, see Create and manage workspaces.
You have a MaxCompute compute engine bound to DataStudio. For more information, see Create a MaxCompute data source and bind it to a workspace.
Preparation: Download the open source Jieba package
Download the open source Jieba package from its GitHub repository at https://github.com/fxsjy/jieba. Click the Code button in the upper-right corner and select Download ZIP to download the source code archive.
Practice 1: Use the open source Jieba package to segment Chinese text
Create a workflow. For more information, see Create a workflow.
Create a MaxCompute resource and upload the jieba-master.zip package.
Right-click the workflow you created and choose .
In the Create Resource dialog box, configure the parameters and click Create.
In this example, set Engine Type to MaxCompute, Engine Instance to doc_test, and Resource Type to Archive. Select Upload as ODPS Resource.
Parameter
Description
Upload File
Click Click Upload and select the downloaded jieba-master.zip file.
Name
The name of the resource. It must follow naming conventions but does not need to match the uploaded file name. In this example, the name is set to jieba-master.zip.
Click the
icon in the top toolbar to commit the resource.
Create a table named jieba_test and a table named jieba_result. The jieba_test table is used to store test data. The jieba_result table is used to store the test result.
Right-click the workflow you created and choose . Create the tables and use the DDL mode to configure their fields. After the tables are created, commit them to the development environment. For more information about creating tables, see Create and use MaxCompute tables.
The following table describes the DDL statements that are used to configure fields in the two tables.
Table
DDL statement
Description
jieba_test
CREATE TABLE jieba_test ( `chinese` string, `content` string );Stores test data.
jieba_result
CREATE TABLE jieba_result ( `chinese` string ) ;Stores the test result.
Download test data and import the test data to the jieba_test table.
Download the jieba_test.csv file that contains test data to your on-premises machine.
On the DataStudio page, click the
icon.In the Data Import Wizard dialog box, enter jieba_test as the destination table, select it, and click Next.
Upload the jieba_test.csv file from your computer, configure the upload settings, preview the data, and click Next.
Select By Name and click Import Data.
Create a PyODPS 3 node.
Right-click the workflow you created and choose .
In the Create Node dialog box, enter a Name (for example, Name) and click OK.
Use the open source Jieba package to run segmentation code.
Run the following sample code on the PyODPS 3 node to segment the test data in the jieba_test table and return the first 10 rows of segmentation result data:
def test(input_var): import jieba result = jieba.cut(input_var, cut_all=False) return "/ ".join(result) # odps.stage.mapper.split.size can be used to increase parallelism. hints = { 'odps.isolation.session.enable': True, 'odps.stage.mapper.split.size': 64, } libraries =['jieba-master.zip'] # Reference your jieba-master.zip archive. src_df = o.get_table('jieba_test').to_df() # Reference the data in your jieba_test table. result_df = src_df.chinese.map(test).persist('jieba_result', hints=hints, libraries=libraries) print(result_df.head(10)) # View the first 10 rows of the segmentation results. For more data, query the jieba_result table.Noteodps.stage.mapper.split.size can be used to improve the execution parallelism. For more information, see Flag parameters.
View the result.
You can use one of the following methods to view the execution result of the Jieba segmentation program:
Method 1: View the results in the Operational Logs area at the bottom of the page.
Method 2: In the left-side navigation pane, click Ad Hoc Query to create an Ad Hoc Query node. Then, view the data in the jieba_result table.
select * from jieba_result;
Practice 2: Use a custom dictionary to segment Chinese text
If the default dictionary of the Jieba tool does not meet your requirements, you can use a custom dictionary. This section provides an example on how to use a custom dictionary to segment Chinese text.
Create a MaxCompute resource.
You can use a PyODPS user-defined function (UDF) to read resources uploaded to MaxCompute. The resources can be tables or files. In this case, you must write the UDF as a closure function or a function of the callable class.
NoteYou can create MaxCompute functions in DataWorks to reference complex UDFs. For more information, see Create and use a MaxCompute function.
In this section, a closure function is called to reference the custom dictionary file key_words.txt that is uploaded to MaxCompute.
Create a MaxCompute function of the File type.
Right-click the workflow you created and choose . Enter the resource name key_words.txt and click Create.
On the configuration tab of the key_words.txt resource, enter the content of the custom dictionary and save and commit the resource.
The following content is the example content of the custom dictionary. You can enter the content of the custom dictionary based on your test requirements.
Use the custom dictionary to run segmentation code.
Run the following sample code on the PyODPS 3 node to segment the test data in the jieba_test table and return the first 10 rows of segmentation result data:
def test(resources): import jieba fileobj = resources[0] jieba.load_userdict(fileobj) def h(input_var): # In the nested function h(), load the dictionary and perform segmentation. result = jieba.cut(input_var, cut_all=False) return "/ ".join(result) return h # odps.stage.mapper.split.size can be used to increase parallelism. hints = { 'odps.isolation.session.enable': True, 'odps.stage.mapper.split.size': 64, } libraries =['jieba-master.zip'] # Reference your jieba-master.zip archive. src_df = o.get_table('jieba_test').to_df() # Reference the data in your jieba_test table. file_object = o.get_resource('key_words.txt') # get_resource() references a MaxCompute resource. mapped_df = src_df.chinese.map(test, resources=[file_object]) # The map function calls the function and passes the resources parameter. result_df = mapped_df.persist('jieba_result2', hints=hints, libraries=libraries) print(result_df.head(10)) # View the first 10 rows of the segmentation results. For more data, query the jieba_result2 table.Noteodps.stage.mapper.split.size can be used to improve the execution parallelism. For more information, see Flag parameters.
View the result.
You can use one of the following methods to view the execution result of the custom dictionary:
Method 1: View the results in the Operational Logs area at the bottom of the page.
Method 2: In the left-side navigation pane, click Ad Hoc Query to create an Ad Hoc Query node. Then, view the data in the jieba_result2 table.
select * from jieba_result2;