This topic describes how to use a PyODPS node in DataWorks to segment Chinese text by using the open source segmentation tool Jieba and write the resulting words and phrases to a new table. This topic also describes how to use closure functions to segment Chinese text based on a custom dictionary.
Prerequisites
- A DataWorks workspace is created. In this example, a workspace in basic mode is used. The workspace is associated with multiple MaxCompute compute engines. For more information, see Create a workspace.
- The open source Jieba package is downloaded from GitHub.
Background information
PyODPS nodes are integrated with MaxCompute SDK for Python. You can directly edit Python code and use MaxCompute SDK for Python on PyODPS nodes of DataWorks. For more information about PyODPS nodes, see Create a PyODPS 2 node.
This topic covers the following scenarios:
- Use open source packages to segment Chinese text based on Jieba
- Use custom dictionaries to segment Chinese text based on Jieba
Use open source packages to segment Chinese text based on Jieba
Use custom dictionaries to segment Chinese text based on Jieba
If the built-in dictionary of Jieba does not meet your requirements, you can use a custom dictionary.
You can use a PyODPS user-defined function (UDF) to read table or file resources that are uploaded to MaxCompute. In this case, you must write the UDF as a closure function or a callable class. If you need to reference complex UDFs, you can create a MaxCompute function in DataWorks. For more information, see Create a MaxCompute function.
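The closure pattern described above can be sketched as follows. This is a minimal illustration, not the implementation used in DataWorks: the names `make_segmenter`, `segment`, and `custom_dict` are hypothetical, the file resource is simulated with an in-memory `io.StringIO` object, and a simple forward maximum matching routine stands in for Jieba so that the example runs without extra packages. The sketch also assumes that the dictionary file contains one word per line.

```python
# -*- coding: utf-8 -*-
import io


def make_segmenter(resources):
    """Outer function: receives the resources and loads the dictionary once."""
    # Assumption: resources[0] is a file-like object with one word per line.
    dict_file = resources[0]
    words = set()
    for line in dict_file:
        word = line.strip()
        if word:
            words.add(word)
    # Longest dictionary entry; falls back to 1 if the dictionary is empty.
    max_len = max([len(w) for w in words] or [1])

    def segment(sentence):
        """Inner function: called once per row, reuses the loaded dictionary."""
        # Stand-in for Jieba: forward maximum matching against the dictionary.
        tokens = []
        i = 0
        while i < len(sentence):
            # Try the longest candidate first; a single character always matches.
            for size in range(min(max_len, len(sentence) - i), 0, -1):
                piece = sentence[i:i + size]
                if size == 1 or piece in words:
                    tokens.append(piece)
                    i += size
                    break
        return u'/'.join(tokens)

    return segment


# Simulated file resource holding a hypothetical custom dictionary.
custom_dict = io.StringIO(u"增长\n系统\n")
handler = make_segmenter([custom_dict])
result = handler(u"系统增长快")
print(result)
```

Because the dictionary is loaded in the outer function, it is read once rather than once per row. In a PyODPS node, a factory of this shape would typically be passed to the DataFrame `map` method together with a `resources` list, and the inner function would call Jieba after loading the custom dictionary. Treat those details as assumptions to verify against the PyODPS documentation.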