This topic describes the N-gram Counting component provided by Machine Learning Studio.
N-gram counting is a step in language model training. The component generates N-grams from the words in the input text and counts how many times each N-gram occurs across the entire corpus. The result is the total for all documents, not a per-document count. For more information, see ngram-count.
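The counting described above can be illustrated with a minimal Python sketch. It is not the component's implementation: the function name `count_ngrams`, whitespace tokenization, and the choice to count every length from 1 up to the maximum are assumptions made for illustration, and sentence-boundary markers are left out.

```python
from collections import Counter

def count_ngrams(sentences, order=3, weights=None):
    """Count n-grams of length 1..order over all sentences (hypothetical helper)."""
    if weights is None:
        # Assumption: a sentence without an explicit weight counts once.
        weights = [1] * len(sentences)
    counts = Counter()
    for sentence, weight in zip(sentences, weights):
        words = sentence.split()  # assumption: words are separated by whitespace
        for n in range(1, order + 1):
            for i in range(len(words) - n + 1):
                # Each occurrence adds the sentence weight to the corpus-wide total.
                counts[" ".join(words[i:i + n])] += weight
    return counts

corpus = ["the cat sat", "the cat ran"]
counts = count_ngrams(corpus, order=2)
```

Here `counts["the cat"]` is 2 because the bigram appears once in each sentence, which shows that the totals are taken over the whole corpus rather than per document.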
Configure the component
- Machine Learning Platform for AI console

| Tab | Parameter | Description |
| --- | --- | --- |
| Fields Setting | Column of Sentences in Input Table | The column that contains the sentences in the input table. |
| Fields Setting | Column of Words in the Bag-of-Words | The column that contains the words in the bag of words. |
| Fields Setting | Words Column in Input Counting Result Table | The word column in the input counting result table. |
| Fields Setting | Count Column in Input Counting Result Table | The count column in the input counting result table. |
| Fields Setting | Sentence Weight Column | The column that contains the weights of the input sentences. |
| Parameters Setting | Maximum N-gram Length | The maximum length of N-grams. Default value: 3. |
| Tuning | Number of cores | Optional. The number of cores. Automatically allocated. |
| Tuning | Memory size per core | Optional. The memory size per core. Automatically allocated. |

- PAI command
PAI -name ngram_count -project algo_public -DinputTableName=pai_ngram_input -DoutputTableName=pai_ngram_output -DinputSelectedColNames=col0 -DweightColName=weight -DcoreNum=2 -DmemSizePerCore=1000;

| Parameter | Required | Default value | Description |
| --- | --- | --- | --- |
| inputTableName | Yes | No default value | The input table. |
| outputTableName | Yes | No default value | The output table. |
| inputSelectedColNames | No | The first STRING column | The columns selected from the input table. |
| weightColName | No | 1 | The weight column. |
| inputTablePartitions | No | Full table | The partitions selected from the input table. |
| countTableName | No | No default value | A previously generated N-gram counting output table. The table is merged into the output result. |
| countWordColName | No | The second column | The word column in the counting table. |
| countCountColName | No | The third column | The count column in the counting table. |
| countTablePartitions | No | No default value | The partitions in the counting table. |
| vocabTableName | No | No default value | The bag-of-words table. Words that are not contained in the bag of words are marked as \<unk\>. |
| vocabSelectedColName | No | The first STRING column | The column that contains the words in the bag of words. |
| vocabTablePartitions | No | No default value | The partitions in the bag-of-words table. |
| order | No | 3 | The maximum length of N-grams. |
| lifecycle | No | No default value | The lifecycle of the output table. |
| coreNum | No | No default value | The number of cores. |
| memSizePerCore | No | No default value | The memory size for each core. |

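The effect of the bag-of-words table and of a previously generated counting table can be sketched in Python as follows. This is an illustration only: the helper names `apply_vocab` and `merge_counts` are hypothetical, and the sketch assumes that merging means adding the counts of identical N-grams, which the parameter descriptions do not spell out.

```python
from collections import Counter

UNK = "<unk>"  # marker for words that are not in the bag of words

def apply_vocab(words, vocab):
    """Replace every out-of-vocabulary word with the <unk> marker."""
    return [w if w in vocab else UNK for w in words]

def merge_counts(new_counts, previous_counts):
    """Combine two count tables; assumes counts of the same N-gram are summed."""
    merged = Counter(previous_counts)
    merged.update(new_counts)  # Counter.update adds counts instead of replacing them
    return merged

vocab = {"the", "cat"}
words = apply_vocab("the cat sat".split(), vocab)

previous = {"the cat": 5}
current = {"the cat": 2, "cat <unk>": 1}
merged = merge_counts(current, previous)
```

With this input, `words` becomes `["the", "cat", "<unk>"]`, and `merged` holds 7 for `the cat` and 1 for `cat <unk>`.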