
Platform for AI: N-gram Counting

Last Updated: Feb 06, 2024

This topic describes the N-gram Counting component provided by Machine Learning Designer (formerly known as Machine Learning Studio).

N-gram counting is a step in language model training. The component generates N-grams from the words in the input sentences and counts how often each N-gram occurs. The counts are aggregated over the entire corpus (all documents) rather than computed for each document separately. For more information, see ngram-count.
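The corpus-wide counting described above can be sketched in a few lines of Python. This is an illustration only, not the component's implementation: the function name `count_ngrams` is hypothetical, words are assumed to be separated by whitespace, and no sentence boundary markers are added.

```python
from collections import Counter


def count_ngrams(sentences, order=3):
    """Count every N-gram of length 1..order across all sentences.

    Assumption: words are whitespace-separated. The real component may
    tokenize or mark sentence boundaries differently.
    """
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for n in range(1, order + 1):
            # Slide a window of n words over the sentence.
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    return counts


# Two "documents"; the result is one global table, not one table per document.
corpus = ["a b a", "a b"]
counts = count_ngrams(corpus, order=2)
for ngram, count in sorted(counts.items()):
    print(ngram, count)
```

The unigram `a` occurs twice in the first sentence and once in the second, so it appears once in the result with a count of 3.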

Configure the component

You can use one of the following methods to configure the N-gram Counting component.

Method 1: Configure the component on the pipeline page

You can configure the parameters of the N-gram Counting component on the pipeline page of Machine Learning Designer in Platform for AI (PAI). The following table describes the parameters.

| Tab | Parameter | Description |
| --- | --- | --- |
| Fields Setting | Column of Sentences in Input Table | The column that contains the sentences in the input table. |
| | Column of Words in the Bag-of-Words | The column that contains the words in the bag of words. |
| | Words Column in Input Counting Result Table | The word column in the input counting result table. |
| | Count Column in Input Counting Result Table | The count column in the input counting result table. |
| | Sentence Weight Column | The column that contains the weights of the input sentences. |
| Parameters Setting | Maximum N-gram Length | The maximum length of N-grams. Default value: 3. |
| Tuning | Number of cores | Optional. The number of cores. By default, the system determines the value. |
| | Memory size per core | Optional. The memory size of each core. By default, the system determines the value. Unit: MB. |

Method 2: Use PAI commands

Configure the component parameters by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.

PAI -name ngram_count    
    -project algo_public    
    -DinputTableName=pai_ngram_input    
    -DoutputTableName=pai_ngram_output    
    -DinputSelectedColNames=col0    
    -DweightColName=weight    
    -DcoreNum=2    
    -DmemSizePerCore=1000;

| Parameter | Required | Default value | Description |
| --- | --- | --- | --- |
| inputTableName | Yes | No default value | The name of the input table. |
| outputTableName | Yes | No default value | The name of the output table. |
| inputSelectedColNames | No | Name of the first STRING column | The names of the columns selected from the input table. |
| weightColName | No | 1 | The name of the weight column. |
| inputTablePartitions | No | All partitions | The partitions selected from the input table. |
| countTableName | No | No default value | The N-gram counting output table previously generated. The table is merged into the output result. |
| countWordColName | No | Second column | The name of the word column in the counting table. |
| countCountColName | No | Third column | The name of the count column in the counting table. |
| countTablePartitions | No | No default value | The partitions in the counting table. |
| vocabTableName | No | No default value | The name of the bag-of-words table. The words that are not contained in the bag of words are marked as `<unk>`. |
| vocabSelectedColName | No | First STRING column | The name of the column that contains the words in the bag of words. |
| vocabTablePartitions | No | No default value | The partitions in the bag-of-words table. |
| order | No | 3 | The maximum length of N-grams. |
| lifecycle | No | No default value | The lifecycle of the output table. |
| coreNum | No | No default value | The number of cores. |
| memSizePerCore | No | No default value | The memory size for each core. Unit: MB. |
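To make the roles of `vocabTableName`, `weightColName`, and `countTableName` more concrete, the following Python sketch mimics them on in-memory data. It is a rough model under stated assumptions, not the component's actual logic: the function name `count_weighted_ngrams` is hypothetical, each N-gram occurrence is assumed to add the sentence's weight to its count, and merging a previous counting table is assumed to mean adding its counts to the new ones.

```python
from collections import Counter


def count_weighted_ngrams(rows, order=3, vocab=None, previous=None):
    """Count N-grams over (sentence, weight) rows.

    Assumptions (not confirmed by the component documentation):
    - each occurrence of an N-gram adds the sentence weight to its count;
    - counts from a previous counting table are added to the new counts.
    """
    counts = Counter(previous or {})
    for sentence, weight in rows:
        words = sentence.split()
        if vocab is not None:
            # Words outside the bag of words are marked as <unk>.
            words = [w if w in vocab else "<unk>" for w in words]
        for n in range(1, order + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += weight
    return counts


rows = [("a b c", 2), ("a x", 1)]   # sentence column and weight column
vocab = {"a", "b"}                  # bag of words
previous = {"a": 5}                 # earlier counting result to merge
merged = count_weighted_ngrams(rows, order=2, vocab=vocab, previous=previous)
for ngram, count in sorted(merged.items()):
    print(ngram, count)
```

In this example, `c` and `x` are not in the bag of words, so both are counted under `<unk>`, and the count for `a` combines the previous value of 5 with the weighted occurrences in the new rows.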