The LLM-Document Deduplicator (DLC) component of Platform for AI (PAI) removes duplicate texts by using the SimHash algorithm to calculate the similarity between texts. The input data file must be stored in Object Storage Service (OSS) and be in the JSON Lines format: each line in the file is a valid JSON object, and the file consists of multiple such lines, but the file as a whole is not a single valid JSON object. For more information, see Example.
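The JSON Lines requirement can be checked locally before you upload a file. The following is a minimal sketch, not part of the component: the helper name `check_json_lines` and the field name `text` are hypothetical and stand in for whatever you configure as Target Process Field.

```python
import json


def check_json_lines(lines, field):
    """Return (line_number, problem) pairs for lines that break the format."""
    problems = []
    for number, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            # Each line must be a JSON object, not an array or a scalar.
            problems.append((number, "not a JSON object"))
        elif field not in record:
            problems.append((number, f"missing field: {field}"))
    return problems


# Hypothetical sample: two lines, each one a standalone JSON object.
sample = [
    '{"text": "the cute alibaba mascot"}',
    '{"text": "the cute alibaba mascot waves"}',
]
print(check_json_lines(sample, "text"))
```

To check a real file, pass an open file handle instead of the list, because iterating over a file yields one line at a time.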
Supported computing resources
Configure the component
On the pipeline page of Machine Learning Designer, configure the parameters of the LLM-Document Deduplicator (DLC) component.
Tab | Parameter | Required | Description | Default value
Fields Setting | Target Process Field | Yes | The name of the field that you want to process. | N/A
Fields Setting | Text Separator, default is space | No | The delimiter that the algorithm uses to split the text into a list of words. By default, a space is used. If you leave this parameter empty, the algorithm does not split the text into words and deduplicates the text based on single characters. Enclose the delimiter in double quotation marks (""). | " "
Fields Setting | window_size | Yes | The length of the substrings that constitute the features of a document. For example, if the document content is "the cute alibaba mascot" and you set window_size to 2, the substrings are ["the cute", "cute alibaba", "alibaba mascot"]. The algorithm then calculates the SimHash value of the text based on the hash values of the substrings. The value of window_size affects the granularity of the SimHash value. A small window_size value may generate distinct text features, but the hash value is more susceptible to edit operations. A large window_size value uses a longer context as input, but may ignore details. | 6
Fields Setting | num_blocks | Yes | The number of blocks into which the SimHash value is divided when the algorithm checks document similarity. For example, if the SimHash value is a 64-bit integer and you set num_blocks to 4, the SimHash value is divided into four separate 16-bit blocks. A large number of blocks results in a finer-grained similarity comparison. This may reduce false positives (unrelated texts recognized as similar), but may increase false negatives (similar texts not recognized as similar). In most cases, the num_blocks value must be smaller than the number of bits in the SimHash value. | 6
Fields Setting | hamming_distance | Yes | The threshold of the Hamming distance between two SimHash values, which is used to determine whether two texts are similar. The Hamming distance between SimHash values A and B is the number of bits that differ between them. If this distance is less than or equal to the hamming_distance value, the algorithm recognizes A and B as similar. If you set hamming_distance to a small value, the algorithm recognizes only highly similar texts as duplicates and may miss some texts with duplicated content. If you set hamming_distance to a large value, the algorithm recognizes more texts as similar, but the chance of false positives may increase. In most cases, we recommend that you set the parameter to 3, 4, or 5. | 4
Fields Setting | OSS Directory for Saving OutputData | No | The OSS directory in which the generated data is stored. If you do not specify this parameter, the default path of the workspace is used. | N/A
Tuning | Number of Processes | No | The number of processes. | 8
Tuning | Select Resource Group: Public Resource Group | No | The instance type (CPU or GPU), number of instances, and virtual private cloud (VPC) that you want to use. | N/A
Tuning | Select Resource Group: Dedicated resource group | No | The number of vCPUs, memory, shared memory, number of GPUs, and number of instances that you want to use. | N/A
Tuning | Maximum Running Duration | No | The maximum period of time for which the component can run. If this period of time is exceeded, the job is terminated. | N/A
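To make the interaction of window_size, num_blocks, and hamming_distance more concrete, the following is a minimal sketch of a generic SimHash pipeline. It is an illustration under stated assumptions, not the component's implementation: the function names are hypothetical, the 64-bit width follows the example in the table, and BLAKE2b is an arbitrary choice of per-feature hash because the source does not say which hash the component uses.

```python
import hashlib

BITS = 64  # assumed SimHash width, matching the 64-bit example in the table


def shingles(text, window_size, separator=" "):
    """Return the window_size-token substrings used as document features."""
    # An empty separator falls back to single characters.
    tokens = text.split(separator) if separator else list(text)
    if len(tokens) < window_size:
        return [separator.join(tokens)]
    return [
        separator.join(tokens[i:i + window_size])
        for i in range(len(tokens) - window_size + 1)
    ]


def simhash(text, window_size=6, separator=" "):
    """Combine per-feature hashes into one fingerprint by bitwise majority vote."""
    counts = [0] * BITS
    for feature in shingles(text, window_size, separator):
        digest = hashlib.blake2b(feature.encode("utf-8"), digest_size=BITS // 8).digest()
        value = int.from_bytes(digest, "big")
        for bit in range(BITS):
            counts[bit] += 1 if (value >> bit) & 1 else -1
    return sum(1 << bit for bit in range(BITS) if counts[bit] > 0)


def split_blocks(value, num_blocks):
    """Split a fingerprint into num_blocks equal blocks, lowest bits first.

    Assumes BITS is divisible by num_blocks, as in the 64-bit / 4-block example.
    """
    width = BITS // num_blocks
    mask = (1 << width) - 1
    return [(value >> (i * width)) & mask for i in range(num_blocks)]


def hamming(a, b):
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a, b, hamming_distance=4):
    return hamming(a, b) <= hamming_distance


print(shingles("the cute alibaba mascot", 2))
first = simhash("the cute alibaba mascot waves at the crowd", window_size=2)
second = simhash("the cute alibaba mascot waves at the crowds", window_size=2)
print(hamming(first, second), is_near_duplicate(first, second))
```

In block-based SimHash schemes in general, the blocks serve as lookup keys for candidate pairs: if two fingerprints differ in at most k bits and are split into more than k blocks, at least one block must be identical in both, so only fingerprints that share a block need a full Hamming distance comparison. Whether the component applies exactly this strategy is not stated in this document.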