
Text analysis

Last Updated: Aug 15, 2018


Word frequency statistics

Function overview

Based on the word splitting result, this component counts word frequency and outputs the words of each document in their order of appearance. Documents are identified by the document ID column (docId), and their content is taken from the document content column (docContent).

Parameter settings

Input parameters: the docId column and docContent column generated by the word splitting component.

Two output ports:

The first output port: The output table contains the id, word, and count fields.

count: the number of times the word appears in the corresponding document.

The second output port: The output table contains the id and word fields.

The table at this output port lists words in their order of appearance in each document without counting frequency, so a word may appear in multiple records for the same document. This output format is designed for compatibility with the Word2Vec component.
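To make the two output formats concrete, here is a minimal Python sketch (illustrative only, not the component's implementation) that produces both tables from hypothetical (docId, docContent) rows:

  # Sketch of the two output ports, assuming space-separated docContent.
  from collections import Counter

  docs = [("doc0", "a b a c b a")]  # hypothetical input rows

  triple_rows = []  # first output port: (id, word, count)
  multi_rows = []   # second output port: (id, word), in order of appearance
  for doc_id, content in docs:
      words = content.split()
      for word, count in Counter(words).items():
          triple_rows.append((doc_id, word, count))
      for word in words:
          multi_rows.append((doc_id, word))

  print(triple_rows)  # [('doc0', 'a', 3), ('doc0', 'b', 2), ('doc0', 'c', 1)]
  print(multi_rows)   # one record per occurrence, in document order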

Example

In the example that uses Alibaba Cloud word splitting, the two columns of the word splitting output table are used as the input for word frequency statistics: select id as the docId column and text as the docContent column. The word frequency statistics follow the format of the first output port described above.

PAI command

  PAI -name doc_word_stat
  -project algo_public
  -DinputTableName=doc_test_split_word
  -DdocId=id
  -DdocContent=content
  -DoutputTableNameMulti=doc_test_stat_multi
  -DoutputTableNameTriple=doc_test_stat_triple
  -DinputTablePartitions="region=cctv_news"

Algorithm parameters

Parameter key Description Option Default value
inputTableName Name of the input table - -
docId Name of the docId column Only one column can be specified. -
docContent Name of the docContent column Only one column can be specified. -
outputTableNameMulti Name of the output table listing words in order - -
outputTableNameTriple Name of the output table listing word frequency statistics - -
inputTablePartitions Partitions used for word splitting in the input table, in the format of partition_name=value. The multilevel format is name1=value1/name2=value2. Multiple partitions are separated by commas (,). - All partitions in the input table

Note: Both output tables are obtained by splitting the document content (the docContent column) corresponding to each docId. The table specified by outputTableNameMulti lists the docIds and the words in order of appearance in each document; the table specified by outputTableNameTriple lists the docIds, the words, and the number of times each word appears in its document.

TF-IDF

  • Term Frequency-Inverse Document Frequency (TF-IDF) is a commonly used weighting technique for information retrieval and text mining. TF-IDF is a statistical method for assessing the importance of a word for a document in a collection or corpus.

    The importance of a word increases proportionally to the number of times it appears in the document and is offset by the frequency of the word in the corpus. Variations of the TF-IDF weighting scheme are often used by search engines as a tool in scoring and ranking a document’s relevance to a user query.

  • For details, see tf-idf on Wikipedia.

  • The TF-IDF component is used to calculate the tf-idf value of each word that appears in a collection of documents based on word frequency statistics.

Parameter settings

Omitted

Example

The output table in the example of the word frequency statistics component is used as the input table for the TF-IDF component. The corresponding parameter settings are as follows:

  • Select the docId column: id
  • Select the word column: word
  • Select the word count column: count

The output table has nine columns: docid, word, word_count (the number of times the current word appears in the current document), total_word_count (total number of words in the current document), doc_count (total number of documents that contain the current word), total_doc_count (total number of documents), tf, idf, and tfidf.
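The exact tf/idf formulas used by the component are not documented here, but the relationship between the output columns can be illustrated under a commonly used convention (hypothetical values; the component's actual formulas may differ):

  import math

  word_count = 3          # times the word appears in the current document
  total_word_count = 10   # total words in the current document
  doc_count = 2           # documents containing the word
  total_doc_count = 100   # total documents

  tf = word_count / total_word_count           # assumed term frequency
  idf = math.log(total_doc_count / doc_count)  # assumed inverse document frequency
  tfidf = tf * idf
  print(tf, idf, tfidf)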

PAI command

  PAI -name tfidf
  -project algo_public
  -DinputTableName=rgdoc_split_triple_out
  -DdocIdCol=id
  -DwordCol=word
  -DcountCol=count
  -DoutputTableName=rg_tfidf_out;

Algorithm parameters

Parameter key Description Required/Optional Default value
inputTableName Name of the input table Required -
inputTablePartitions Partition in the input table Optional All partitions in the input table
docIdCol Name of the column listing document IDs. Only one column can be specified. Required -
wordCol Name of the word column. Only one column can be specified. Required -
countCol Name of the count column. Only one column can be specified. Required -
outputTableName Name of the output table Required -
lifecycle Life cycle of the output table, in the unit of days Optional No life cycle restriction
coreNum Number of cores, which must be set together with memSizePerCore Optional Automatically calculated
memSizePerCore Size of memory, which must be set together with coreNum Optional Automatically calculated

PLDA

  • A topic model returns the topic corresponding to a document.

  • Latent Dirichlet Allocation (LDA) is a topic model that gives the topic of each document in a document set in the form of a probability distribution. LDA is an unsupervised learning algorithm that requires no manually tagged training set; it needs only the document set and the number of topics k. LDA was first proposed by David M. Blei, Andrew Y. Ng, and Michael I. Jordan in 2003 and is now applied in text mining, including text topic recognition, text classification, and text similarity calculation.

Parameter settings

  • Number of topics: Number of topics output by LDA.
  • Alpha: Prior Dirichlet distribution parameter of P(z/d).
  • Beta: Prior Dirichlet distribution parameter of P(w/z).
  • Burn-in: Number of burn-in iterations, which must be less than the total number of iterations. Default value: 100.
  • Total number of iterations: Positive integer. Optional; default value: 150.
  • Note: z indicates the topic, w indicates the word, and d indicates the document.

Input/Output settings

Input: Data must be in the sparse matrix (key-value) format. For more information about the format, see the data format description section. Currently, you have to write a MapReduce job for data conversion.

The input format is as follows: the first column is the docid, and the second column contains the word:frequency key-value (kv) pairs.
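For illustration only (hypothetical words and counts), with the default delimiters (a colon between key and value, a space between pairs), input rows might look as follows:

  0   apple:2 banana:1 cherry:5
  1   banana:3 date:1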

The following are output in order:

  • Output column 1: Word frequency, indicating the number of times the word is assigned to the topic after internal sampling by the algorithm.

  • Output column 2: P(word|topic), the probability of the word given a topic.

  • Output column 3: P(topic|word), the probability of a topic given the word.

  • Output column 4: P(document|topic), the probability of a document given the topic.

  • Output column 5: P(topic|document), the probability of a topic given the document.

  • Output column 6: P(topic), the probability of each topic, that is, its weight across the whole document collection.


PAI command

  PAI -name PLDA
  -project algo_public
  -DinputTableName=lda_input
  -DtopicNum=10
  -DtopicWordTableName=lda_output;

Algorithm parameters

Parameter key Description Value range Required/Optional, default value/act
inputTableName Name of the input table Table name Required
inputTablePartitions Partitions used for word splitting in the input table Format: partition_name=value. The multilevel format is name1=value1/name2=value2. Multiple partitions are separated by commas. Optional; default: all partitions in the input table
selectedColNames Names of the columns used for LDA in the input table Column name, separated by commas (,) Optional, default: all columns in the input table
topicNum Number of topics [2, 500] Required
kvDelimiter Delimiter between the key and the value Space, comma, and colon Optional, default: colon
itemDelimiter Delimiter between keys Space, comma, and colon Optional, default: space
alpha Prior Dirichlet distribution parameter of P(z/d) (0, ∞) Optional, default: 0.1
beta Prior Dirichlet distribution parameter of P(w/z) (0, ∞) Optional, default: 0.01
topicWordTableName Name of the topic-word frequency contribution table Table name Required
pwzTableName P(w/z) output table Table name Optional, default act: no P(w/z) output table
pzwTableName P(z/w) output table Table name Optional, default act: no P(z/w) output table
pdzTableName P(d/z) output table Table name Optional, default act: no P(d/z) output table
pzdTableName P(z/d) output table Table name Optional, default act: no P(z/d) output table
pzTableName P(z) output table Table name Optional, default act: no P(z) output table
burnInIterations Number of burn in iterations Positive integer Optional; its value must be less than the value of totalIterations; default: 100
totalIterations Total number of iterations Positive integer Optional, default: 150

Note: z indicates the topic, w indicates the word, and d indicates the document.

Word2Vec

Function overview

  • Word2Vec is an algorithm open-sourced by Google in 2013 for converting words into vectors. Using a neural network, Word2Vec maps words into a K-dimensional vector space through training, in which the vectors capture word semantics: semantically similar words lie close together. Word2Vec is favored by many users for its simplicity and efficiency.

  • For the Google Word2Vec toolkit, visit https://code.google.com/p/word2vec/.

Parameter settings

Algorithm parameters:

  • Word feature dimension: A value in the range of 0 to 1000 is recommended.
  • Downward sampling threshold: A value in the range of 1e-5 to 1e-3 is recommended.
  • Input: Word column and vocabulary.
  • Output: Output word vector table and vocabulary.
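For readers more familiar with the open-source ecosystem, the following gensim call (assuming gensim 4.x; an analogue for orientation, not the PAI implementation) maps roughly onto the component's parameters:

  from gensim.models import Word2Vec

  sentences = [["the", "quick", "brown", "fox"], ["the", "lazy", "dog"]]
  model = Word2Vec(
      sentences=sentences,
      vector_size=100,  # layerSize: word feature dimension
      window=5,         # window: size of word window
      min_count=1,      # minCount: minimum word truncation frequency
      sg=1,             # cbow=0 corresponds to skip-gram (sg=1)
      hs=1,             # hs: hierarchical softmax
      negative=0,       # negative: negative sampling disabled
      sample=0,         # sample: downward sampling threshold
      alpha=0.025,      # alpha: initial learning rate
      epochs=1,         # iterTrain: number of training iterations
  )
  print(model.wv["fox"][:5])  # first 5 dimensions of a word vector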

PAI command

  PAI -name Word2Vec
  -project algo_public
  -DinputTableName=w2v_input
  -DwordColName=word
  -DoutputTableName=w2v_output;

Algorithm parameters

Parameter key Description Value range Required/Optional, default value/act
inputTableName Name of the input table Table name Required
inputTablePartitions Partitions used for word splitting in the input table Format: partition_name=value.The multilevel format is name1=value1/name2=value2. Multiple partitions are separated by commas. Optional; default: all partitions in the input table
wordColName Name of the word column. Each row in the word column is a word, and a linefeed in the corpus is represented by </s>. Column name Required
inVocabularyTableName Input vocabulary, which is the wordcount output of inputTableName Table name Optional, default act: The program performs wordcount on the output table.
inVocabularyPartitions Partitions of the input vocabulary Partition name Optional, default: all partitions in the corresponding table of inVocabularyTableName
layerSize Word feature dimension 0 to 1000 Optional, default: 100
cbow Language model 1: continuous bag-of-words (CBOW) model; 0: skip-gram model Optional, default: 0
window Size of word window Positive integer Optional, default: 5
minCount Minimum word truncation frequency Positive integer Optional, default: 5
hs Use of hierarchical softmax 1: to use hierarchical softmax; 0: not to use hierarchical softmax Optional, default: 1
negative Negative sampling 0: negative sampling unavailable; recommended value range: 5 to 10 Optional, default: 0
sample Downward sampling threshold 0 or a smaller value: downward sampling disabled; recommended range: 1e-5 to 1e-3 Optional, default: 0
alpha Initial learning rate Greater than 0 Optional, default: 0.025
iterTrain Number of training iterations greater than or equal to 1 Optional, default: 1
randomWindow Random size of window 1: the window has a random size ranging from 1 to 5; 0: the window size is determined by the window parameter Optional, default: 1
outVocabularyTableName Output vocabulary Table name Optional, default act: no ‘output vocabulary’ output
outVocabularyPartition Partitions of the output vocabulary Partition name Optional, default act: The output vocabulary is not partitioned.
outputTableName Name of the output table Table name Required
outputPartition Output table partition information Partition name Optional, default act: The output table is not partitioned.

Doc2Vec

Function overview

Doc2Vec can map a document to a vector.

Parameter settings

Algorithm parameters:

  • Word feature dimension: A value in the range of 0 to 1000 is recommended.
  • Downward sampling threshold: A value in the range of 1e-5 to 1e-3 is recommended.
  • Input: Word column and vocabulary.
  • Output: Output document vector table, word vector table, and word table.
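As with Word2Vec above, a rough open-source analogue (gensim 4.x, assumed; not the PAI implementation) shows the idea of mapping a docId plus pre-split words to a document vector:

  from gensim.models.doc2vec import Doc2Vec, TaggedDocument

  docs = [
      TaggedDocument(words=["beijing", "shanghai"], tags=["doc0"]),
      TaggedDocument(words=["beijing", "shanghai", "hong", "kong"], tags=["doc1"]),
  ]
  model = Doc2Vec(docs, vector_size=100, window=5, min_count=1, epochs=10)
  print(model.dv["doc0"][:5])     # document vector
  print(model.wv["beijing"][:5])  # word vector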

PAI command

  PAI -name pai_doc2vec
  -project algo_public
  -DinputTableName=d2v_input
  -DdocIdColName=docid
  -DdocColName=text_seg
  -DoutputWordTableName=d2v_word_output
  -DoutputDocTableName=d2v_doc_output;

Algorithm parameters

Parameter key Description Value range Required/Optional, default value/act
inputTableName Name of the input table Table name Required
inputTablePartitions Partitions used for word splitting in the input table Format: partition_name=value.The multilevel format is name1=value1/name2=value2. Multiple partitions are separated by commas. Optional; default: all partitions in the input table
docIdColName Name of the docId column Column name Required
docColName Name of the docContent column, whose content can be the space-separated word splitting result Column name Required
layerSize Word feature dimension 0 to 1000 Optional, default: 100
cbow Language model 1: continuous bag-of-words (CBOW) model; 0: skip-gram model Optional, default: 0
window Size of word window Positive integer Optional, default: 5
minCount Minimum word truncation frequency Positive integer Optional, default: 5
hs Use of hierarchical softmax 1: to use hierarchical softmax; 0: not to use hierarchical softmax Optional, default: 1
negative Negative sampling 0: negative sampling unavailable; recommended value range: 5 to 10 Optional, default: 0
sample Downward sampling threshold 0 or a smaller value: downward sampling disabled; recommended range: 1e-5 to 1e-3 Optional, default: 0
alpha Initial learning rate Greater than 0 Optional, default: 0.025
iterTrain Number of training iterations greater than or equal to 1 Optional, default: 1
randomWindow Random size of window 1: the window has a random size ranging from 1 to 5; 0: the window size is determined by the window parameter Optional, default: 1
outVocabularyTableName Output vocabulary Table name Optional, default act: no ‘output vocabulary’ output
outputWordTableName Output word vector table Table name Required
outputDocTableName Output document vector table Table name Required
lifecycle Life cycle of the output table - Optional, default: no life cycle
coreNum Number of cores, which must be set together with memSizePerCore - Automatically calculated
memSizePerCore Size of memory, which must be set together with coreNum - Automatically calculated

SplitWord

Function overview

Based on Alibaba Word Segmenter (AliWS), this component performs word splitting on the documents in the specified columns, separating the resulting words with spaces. If part-of-speech tagging or semantic tagging is enabled, the component also outputs the part-of-speech tagging and semantic tagging results: forward slashes (/) delimit part-of-speech tags, and vertical bars (|) delimit semantic tags. Currently, only the Chinese Taobao library and the Internet library are supported.

Field settings

Omitted

Parameter settings:
  • Word splitting algorithm: CRF and UNIGRAM.
  • Recognition option: whether to recognize nouns with special meanings during word splitting.
  • Merge option: whether to treat nouns used in special fields as a whole without splitting them.
  • String split length: a value greater than 0 indicates that a numerical string is split into retrieval units of the specified length. The default value is 0, indicating that numerical strings are not split by length.
  • Use word frequency correction: whether to use the word correction dictionary.
  • Part-of-speech tagging: whether to mark each word in the output with its part of speech.

Example

The input table consists of two columns: the docId column and the text column (figure omitted). The output table contains the word splitting result for each document (figure omitted).

PAI command example

  PAI -name split_word
  -project algo_public
  -DinputTableName=doc_test
  -DselectedColNames=content1,content2
  -DoutputTableName=doc_test_split_word
  -DinputTablePartitions="region=cctv_news"
  -DoutputTablePartition="region=news"
  -Dtokenizer=TAOBAO_CHN
  -DenableDfa=true
  -DenablePersonNameTagger=false
  -DenableOrgnizationTagger=false
  -DenablePosTagger=false
  -DenableTelephoneRetrievalUnit=true
  -DenableTimeRetrievalUnit=true
  -DenableDateRetrievalUnit=true
  -DenableNumberLetterRetrievalUnit=true
  -DenableChnNumMerge=false
  -DenableNumMerge=true
  -DenableChnTimeMerge=false
  -DenableChnDateMerge=false
  -DenableSemanticTagger=true

Algorithm parameters

Parameter key Description Option Default value
inputTableName Name of the input table - -
selectedColNames Names of the columns used for word splitting in the input table The names of multiple columns are separated by commas (,). -
outputTableName Name of the output table - -
inputTablePartitions Partitions used for word splitting in the input table, in the format of partition_name=value.The multilevel format is name1=value1/name2=value2. Multiple partitions are separated by commas (,). - All partitions in the input table
outputTablePartition Partition in the output table - The output table is not partitioned.
tokenizer Type of the tokenizer TAOBAO_CHN (Chinese Taobao library) and INTERNET_CHN (Internet library) TAOBAO_CHN
enableDfa Simple entity recognition True/False True
enablePersonNameTagger Personal name recognition True/False False
enableOrgnizationTagger Organization name recognition True/False False
enablePosTagger Whether to enable part-of-speech tagging True/False False
enableTelephoneRetrievalUnit Retrieval unit configuration - telephone number recognition True/False True
enableTimeRetrievalUnit Retrieval unit configuration - time ID recognition True/False True
enableDateRetrievalUnit Retrieval unit configuration - date ID recognition True/False True
enableNumberLetterRetrievalUnit Retrieval unit configuration - number and letter recognition True/False True
enableChnNumMerge Merge Chinese numbers into a retrieval unit True/False False
enableNumMerge Merge regular numbers into a retrieval unit True/False True
enableChnTimeMerge Merge the Chinese time into a semantic unit True/False False
enableChnDateMerge Merge Chinese dates into a semantic unit True/False False
enableSemanticTagger Whether to enable semantic tagging True/False False

Triples to KV

Function overview

  • Given triples (row, col, value) of type XXD or XXL, where X represents an arbitrary type, D represents double, and L represents bigint, this component converts them to the KV format (row, [col_id:value]). The row and value types stay consistent with the original input data; col_id is bigint, and each col is mapped to a col_id according to the index table. A minimal sketch of the conversion follows the example tables below.
  • The input table format is as follows:
id word count
01 a 10
01 b 20
01 c 30
  • The output KV table is as follows, which can contain custom KV delimiters:
id key_value
01 1:10;2:20;3:30
  • The output word index table is as follows:
key key_id
a 1
b 2
c 3
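As referenced above, a minimal Python sketch of the conversion (illustrative, not the component itself):

  # Build a key index in order of first appearance, then fold the
  # (row, col, value) triples into kv strings with ':' and ';' delimiters.
  triples = [("01", "a", 10), ("01", "b", 20), ("01", "c", 30)]

  index = {}  # key -> key_id
  kv = {}     # row id -> list of "key_id:value"
  for row_id, key, value in triples:
      key_id = index.setdefault(key, len(index) + 1)
      kv.setdefault(row_id, []).append(f"{key_id}:{value}")

  kv_table = [(row_id, ";".join(pairs)) for row_id, pairs in kv.items()]
  print(kv_table)               # [('01', '1:10;2:20;3:30')]
  print(sorted(index.items()))  # [('a', 1), ('b', 2), ('c', 3)]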

PAI command

  PAI -name triple_to_kv
  -project algo_public
  -DinputTableName=test_data
  -DoutputTableName=test_kv_out
  -DindexOutputTableName=test_index_out
  -DidColName=id
  -DkeyColName=word
  -DvalueColName=count
  -DinputTablePartitions=ds=test1
  -DindexInputTableName=test_index_input
  -DindexInputKeyColName=word
  -DindexInputKeyIdColName=word_id
  -DkvDelimiter=:
  -DpairDelimiter=;
  -Dlifecycle=3

Algorithm parameters

Parameter Description Option Default value Remarks
inputTableName (Required) Name of the input table - - The table cannot be empty.
idColName (Required) Columns reserved during sparse conversion - - -
keyColName (Required) Key in the KV pair - - -
valueColName (Required) Value in the KV pair - - -
outputTableName (Required) Name of the output KV table - - -
indexOutputTableName (Required) Table that outputs the key index - - -
indexInputTableName (Optional) Existing index table in the input - “” The table cannot be empty; an index covering only part of the keys is allowed.
indexInputKeyColName (Optional) Name of the column that inputs the index table key - “” This parameter is required if indexInputTableName is set.
indexInputKeyIdColName (Optional) Name of the column that inputs the index ID of the index table key - “” This parameter is required if indexInputTableName is set.
inputTablePartitions (Optional) Partitions in the input table - “” Only a single partition can be input.
kvDelimiter (Optional) Delimiter between the key and the value - : -
pairDelimiter (Optional) Delimiter between KV pairs - ; -
lifecycle (Optional) Life cycle of the output table - Life cycle is not set. -
coreNum (Optional) Total number of instances - -1 Calculated based on the size of the input data by default
memSizePerCore (Optional) Memory size. Value range: 100 to 64*1024 - -1 Calculated based on the size of the input data by default

Example

Test data

SQL statement for data generation

  drop table if exists triple2kv_test_input;
  create table triple2kv_test_input as
  select
  *
  from
  (
  select '01' as id, 'a' as word, 10 as count from dual
  union all
  select '01' as id, 'b' as word, 20 as count from dual
  union all
  select '01' as id, 'c' as word, 30 as count from dual
  union all
  select '02' as id, 'a' as word, 100 as count from dual
  union all
  select '02' as id, 'd' as word, 200 as count from dual
  union all
  select '02' as id, 'e' as word, 300 as count from dual
  ) tmp;

Running command

  PAI -name triple_to_kv
  -project algo_public
  -DinputTableName=triple2kv_test_input
  -DoutputTableName=triple2kv_test_input_out
  -DindexOutputTableName=triple2kv_test_input_index_out
  -DidColName=id
  -DkeyColName=word
  -DvalueColName=count
  -Dlifecycle=1;

Running result

triple2kv_test_input_out

  +------------+------------+
  | id | key_value |
  +------------+------------+
  | 02 | 1:100;4:200;5:300 |
  | 01 | 1:10;2:20;3:30 |
  +------------+------------+

triple2kv_test_input_index_out

  +------------+------------+
  | key | key_id |
  +------------+------------+
  | a | 1 |
  | b | 2 |
  | c | 3 |
  | d | 4 |
  | e | 5 |
  +------------+------------+

String similarity

Function overview

String similarity calculation is a basic operation in machine learning, mainly used in information retrieval, natural language processing, and bioinformatics. This algorithm supports five similarity calculation methods: Levenshtein Distance, Longest Common SubString, String Subsequence Kernel, Cosine, and simhash_hamming. It also supports two input methods: string-to-string calculation and top N calculation.

Levenshtein Distance (Levenshtein) supports two outputs: distance and similarity, where Similarity = 1 - Distance. Distance is represented as levenshtein in the parameters, and similarity as levenshtein_sim.

Longest Common SubString (LCS) supports two outputs: distance and similarity, where Similarity = 1 - Distance. Distance is represented as lcs in the parameters, and similarity as lcs_sim.

String Subsequence Kernel (SSK) supports similarity calculation, represented as ssk in the parameters.

See Lodhi, Huma; Saunders, Craig; Shawe-Taylor, John; Cristianini, Nello; Watkins, Chris (2002). “Text classification using string kernels”. Journal of Machine Learning Research: 419–444.

Cosine supports similarity calculation, represented as cosine in the parameters.

See Leslie, C.; Eskin, E.; Noble, W.S. (2002), The spectrum kernel: A string kernel for SVM protein classification 7, pp. 566–575

For simhash_hamming, the SimHash algorithm first maps the original text to a 64-bit binary fingerprint, and the Hamming distance is the number of bit positions at which two fingerprints differ. Simhash_hamming supports two outputs: distance and similarity, where Similarity = 1 - Distance/64.0. Distance is represented as simhash_hamming in the parameters, and similarity as simhash_hamming_sim.

For more information about SimHash, see the original paper.

For more information about Hamming distance, see Wikipedia.
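As an illustration of the levenshtein/levenshtein_sim pair, here is a minimal Python sketch (assuming similarity = 1 - distance/max(len(A), len(B)), as stated in the document similarity section below):

  def levenshtein(a: str, b: str) -> int:
      # Classic dynamic-programming edit distance, computed row by row.
      prev = list(range(len(b) + 1))
      for i, ca in enumerate(a, 1):
          cur = [i]
          for j, cb in enumerate(b, 1):
              cur.append(min(prev[j] + 1,                # deletion
                             cur[j - 1] + 1,             # insertion
                             prev[j - 1] + (ca != cb)))  # substitution
          prev = cur
      return prev[-1]

  def levenshtein_sim(a: str, b: str) -> float:
      if not a and not b:
          return 1.0
      return 1.0 - levenshtein(a, b) / max(len(a), len(b))

  print(levenshtein("Beijing", "Beijing Shanghai"))      # 9
  print(levenshtein_sim("Beijing", "Beijing Shanghai"))  # 0.4375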

String-to-string calculation

PAI command
  PAI -name string_similarity
  -project algo_public
  -DinputTableName="pai_test_string_similarity"
  -DoutputTableName="pai_test_string_similarity_output"
  -DinputSelectedColName1="col0"
  -DinputSelectedColName2="col1";
Algorithm parameters
Parameter Description Option Default value
inputTableName (Required) Name of the input table - -
outputTableName (Required) Name of the output table - -
inputSelectedColName1 (Optional) Name of the first column for similarity calculation - Name of the first column in string type in the table
inputSelectedColName2 (Optional) Name of the second column for similarity calculation - Name of the second column in string type in the table
inputAppendColNames (Optional) Names of the columns appended to the output table - Not appended
inputTablePartitions (Optional) Partitions selected in the input table - Entire table selected
outputColName (Optional) Name of the similarity column in the output table. A column name must not contain any special character. It can contain only lower- and upper-case letters, numbers, and underscores (_) and must start with a letter. The column name length is no greater than 128 bytes. - output
method (Optional) Similarity calculation method levenshtein, levenshtein_sim, lcs, lcs_sim, ssk, cosine, simhash_hamming, simhash_hamming_sim levenshtein_sim
lambda (Optional) Weight of the matching string, which is available in SSK (0, 1) 0.5
k (Optional) Length of the sub-string, which is available in SSK and Cosine (0, 100) 2
lifecycle (Optional) Life cycle of the output table Positive integer No life cycle
coreNum (Optional) Number of cores for calculation Positive integer Automatically assigned
memSizePerCore (Optional) Memory size for each core, in MB Positive integer in the range of (0, 65536) Automatically assigned
Example

Test data

  create table pai_ft_string_similarity_input as select * from
  (select 0 as id, "Beijing" as col0, "Beijing" as col1 from dual
  union all
  select 1 as id, "Beijing" as col0, "Beijing Shanghai" as col1 from dual
  union all
  select 2 as id, "Beijing" as col0, "Beijing Shanghai Hong Kong" as col1 from dual
  )tmp;

PAI command

  PAI -name string_similarity
  -project sre_mpi_algo_dev
  -DinputTableName=pai_ft_string_similarity_input
  -DoutputTableName=pai_ft_string_similarity_output
  -DinputSelectedColName1=col0
  -DinputSelectedColName2=col1
  -Dmethod=simhash_hamming
  -DinputAppendColNames=col0,col1;

Output description

Output results obtained by using the simhash_hamming method (figure omitted).

Output results obtained by using the simhash_hamming_sim method (figure omitted).

String similarity - topN

PAI command

  PAI -name string_similarity_topn
  -project algo_public
  -DinputTableName="pai_test_string_similarity_topn"
  -DoutputTableName="pai_test_string_similarity_topn_output"
  -DmapTableName="pai_test_string_similarity_map_topn"
  -DinputSelectedColName="col0"
  -DmapSelectedColName="col1";

Algorithm parameters

Parameter Description Option Default value
inputTableName (Required) Name of the input table - -
mapTableName (Required) Name of the mapping table - -
outputTableName (Required) Name of the output table - -
inputSelectedColName (Optional) Name of the column from the left table for similarity calculation - Name of the first column in string type in the table
mapSelectedColName (Optional) Name of the column in the mapping table for similarity calculation. The similarities between each row in the left table and all strings in the mapping table are calculated, and the top N results are given in the end. - Name of the first column in string type in the table
inputAppendColNames (Optional) Names of the columns from the input table appended to the output table - Not appended
inputAppendRenameColNames (Optional) Aliases of the columns from the input table appended to the output table. The parameter is valid when inputAppendColNames is not null. - Aliases not used
mapAppendColNames (Optional) Names of the columns from the mapping table appended to the output table - Not appended
mapAppendRenameColNames (Optional) Aliases of the columns from the mapping table appended to the output table - Aliases not used
inputTablePartitions (Optional) Partitions selected in the input table - Entire table selected
mapTablePartitions (Optional) Partitions in the mapping table - All partitions in the mapping table
outputColName (Optional) Name of the similarity column in the output table. A column name must not contain any special character. It can contain only lower- and upper-case letters, numbers, and underscores (_) and must start with a letter. The column name length is no greater than 128 bytes. - output
method (Optional) Similarity calculation method levenshtein_sim, lcs_sim, ssk, cosine, simhash_hamming_sim levenshtein_sim
lambda (Optional) Weight of the matching string, which is available in SSK (0, 1) 0.5
k (Optional) Length of the sub-string, which is available in SSK and Cosine (0, 100) 2
topN (Optional) Number of top similarity results to output (0, +∞) 10
lifecycle (Optional) Life cycle of the output table Positive integer No life cycle
coreNum (Optional) Number of cores for calculation Positive integer Automatically assigned
memSizePerCore (Optional) Memory size for each core, in MB Positive integer in the range of (0, 65536) Automatically assigned

Example

Test data

  create table pai_ft_string_similarity_topn_input as select * from
  (select 0 as id, "Beijing" as col0 from dual
  union all
  select 1 as id, "Beijing Shanghai" as col0 from dual
  union all
  select 2 as id, "Beijing Shanghai Hong Kong" as col0 from dual
  )tmp;

PAI command

  PAI -name string_similarity_topn
  -project sre_mpi_algo_dev
  -DinputTableName=pai_ft_string_similarity_topn_input
  -DmapTableName=pai_ft_string_similarity_topn_input
  -DoutputTableName=pai_ft_string_similarity_topn_output
  -DinputSelectedColName=col0
  -DmapSelectedColName=col0
  -DinputAppendColNames=col0
  -DinputAppendRenameColNames=input_col0
  -DmapAppendColNames=col0
  -DmapAppendRenameColNames=map_col0
  -Dmethod=simhash_hamming_sim;

Output description

(figure omitted)

Deprecated word filtering

Function overview

Deprecated word filtering is a preprocessing step in text analysis. It filters out noise words from word splitting results (for example, "of", "yes", and "ah").
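A minimal Python sketch of the filtering step (illustrative table contents; not the component itself):

  stop_words = {"of", "yes", "ah"}  # the one-column deprecated word table

  rows = [("doc0", "this is of course yes a test ah")]  # word splitting result
  filtered = [
      (doc_id, " ".join(w for w in text.split() if w not in stop_words))
      for doc_id, text in rows
  ]
  print(filtered)  # [('doc0', 'this is course a test')]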

Parameter settings

Component description

The component has two input ports, from left to right:

  • Input table, which is a word splitting result table to be filtered; parameter: inputTableName

  • Deprecated word table, which is a one-column table with each row containing a deprecated word; parameter: noiseTableName

Parameter UI description

Select the columns to be filtered.

Fine-tune description

You can configure the number of concurrent computing cores and the memory size. By default, they are automatically assigned.

PAI command

  PAI -name FilterNoise -project algo_public \
  -DinputTableName="test_input" -DnoiseTableName="noise_input" \
  -DoutputTableName="test_output" \
  -DselectedColNames="words_seg1,words_seg2" \
  -Dlifecycle=30

Algorithm parameters

Parameter Description Option Default value
inputTableName (Required) Name of the input table - -
inputTablePartitions (Optional) Partitions used for calculation in the input table - All partitions in the input table
noiseTableName (Required) Deprecated word table One-column table with each row containing a deprecated word -
noiseTablePartitions (Optional) Partitions in the deprecated word table - All partitions in the table
outputTableName (Required) Name of the output table - -
selectedColNames (Required) Names of the columns to be filtered, separated by commas in case of more than one column - -
lifecycle (Optional) Life cycle of the output table Positive integer No life cycle
coreNum (Optional) Number of cores for calculation Positive integer Automatically assigned
memSizePerCore (Optional) Memory size for each core, in MB Positive integer in the range of (0, 65536) Automatically assigned

Example

Source data

  • Word splitting result table temp_word_seg_input (figure omitted)

  • Deprecated word table temp_word_noise_input (figure omitted)

  1. Create an experiment.

  2. Select seg as the column to be filtered.

  3. View the running result (figure omitted).

Text summarization

Automatic summarization uses a computer to automatically extract a summary from a source document: a short, coherent text that fully and accurately reflects the content of that document. Based on TextRank, this algorithm forms summaries by extracting existing sentences from the document.

For detailed principles of this algorithm, see the document TextRank: Bringing Order into Texts.
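The following simplified Python sketch shows the TextRank idea applied to sentences: build a pairwise-similarity graph, then score sentences by damped power iteration. The similarity function and normalization here are illustrative; the component's actual choices (see similarityType below) may differ.

  import numpy as np

  def lcs_len(a, b):
      # Longest common subsequence length over words.
      wa, wb = a.split(), b.split()
      dp = [[0] * (len(wb) + 1) for _ in range(len(wa) + 1)]
      for i, x in enumerate(wa, 1):
          for j, y in enumerate(wb, 1):
              dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
      return dp[-1][-1]

  def textrank(sentences, d=0.85, max_iter=100, eps=1e-6):
      n = len(sentences)
      sim = np.zeros((n, n))
      for i in range(n):
          for j in range(n):
              if i != j:
                  sim[i, j] = lcs_len(sentences[i], sentences[j])
      col = sim.sum(axis=0)  # column-normalize so each sentence
      col[col == 0] = 1.0    # distributes its score to its neighbors
      m = sim / col
      score = np.ones(n) / n
      for _ in range(max_iter):  # damped power iteration, as in PageRank
          new = (1 - d) / n + d * m @ score
          if np.abs(new - score).sum() < eps:
              break
          score = new
      return score

  print(textrank(["a b c d", "a b c", "x y z"]))  # higher = more central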

PAI command

  PAI -name TextSummarization
  -project algo_public
  -DinputTableName="test_input"
  -DoutputTableName="test_output"
  -DdocIdCol="doc_id"
  -DsentenceCol="sentence"
  -DtopN=2
  -Dlifecycle=30;

Parameter description

Parameter Description Option Default value
inputTableName (Required) Name of the input table - -
inputTablePartitions (Optional) Partitions used for calculation in the input table - All partitions in the input table
outputTableName (Required) Name of the output table - -
docIdCol (Required) Name of the column listing document IDs - -
sentenceCol (Required) Sentence column. Only one column can be specified. - -
topN (Optional) Top N key sentences that are output - 3
similarityType (Optional) Sentence similarity calculation method lcs_sim, levenshtein_sim, cosine, ssk lcs_sim
lambda (Optional) Weight of the matching string, which is available in SSK (0, 1) 0.5
k (Optional) Length of the sub-string, which is available in SSK and Cosine (0, 100) 2
dampingFactor (Optional) Damping factor (0, 1) 0.85
maxIter (Optional) Maximum number of iterations [1, +∞) 100
epsilon (Optional) Convergence factor (0, ∞) 0.000001
lifecycle (Optional) Life cycle of the input/output table Positive integer No life cycle
coreNum (Optional) Number of cores for calculation Positive integer Automatically calculated
memSizePerCore (Optional) Memory size for each core Positive integer Automatically calculated

The sentence similarity has the following value options: (Symbol description: A and B indicate two strings, and len(A) indicates the length of string A.)

  • lcs_sim: Formula: 1.0 - (Length of the longest common subsequence)/max(len(A), len(B)).

  • levenshtein_sim: Formula: 1.0 - (Levenshtein distance)/max(len(A), len(B)).

  • cosine: See Leslie, C.; Eskin, E.; Noble, W.S. (2002), The spectrum kernel: A string kernel for SVM protein classification 7, pp. 566–575.

  • ssk: See Lodhi, Huma; Saunders, Craig; Shawe-Taylor, John; Cristianini, Nello; Watkins, Chris (2002). “Text classification using string kernels”. Journal of Machine Learning Research: 419–444.

Output format description

The output table contains the doc_id and abstract columns. For example:

doc_id abstract
1000894 Shanghai Stock Exchange published the guidelines for disclosure of listed companies’ corporate social responsibility (CSR) in 2008, urging three types of companies to disclose their CSR reports and encouraging other qualified listed companies to voluntarily disclose their CSR reports. Statistics show that in 2012, a total of 379 listed companies on Shanghai Stock Exchange disclosed their CSR reports, including 305 mandatory companies and 74 voluntary companies, which account for 40% of all listed companies on Shanghai Stock Exchange. According to Hu Ruyin, Shanghai Stock Exchange will explore how to expand the scope of CSR report disclosure, revise, and refine the guidelines on disclosure of the CSR reports, and encourage more organizations to promote CSR product innovation.

Document similarity

Word-based document similarity compares documents or sentences whose content has been separated into space-delimited words, using each word as the unit of comparison. It is calculated in the same way as string similarity, and the supported similarity types are the same as those of string similarity.

Levenshtein Distance (Levenshtein) supports two outputs: distance and similarity, where Similarity = 1 - Distance. Distance is represented as levenshtein in the parameters, and similarity as levenshtein_sim.

Longest Common SubString (LCS) supports two outputs: distance and similarity, where Similarity = 1 - Distance. Distance is represented as lcs in the parameters, and similarity as lcs_sim.

String Subsequence Kernel (SSK) supports similarity calculation, represented as ssk in the parameters.

See Lodhi, Huma; Saunders, Craig; Shawe-Taylor, John; Cristianini, Nello; Watkins, Chris (2002). “Text classification using string kernels”. Journal of Machine Learning Research: 419–444.

Cosine supports similarity calculation, represented as cosine in the parameters.

See Leslie, C.; Eskin, E.; Noble, W.S. (2002), The spectrum kernel: A string kernel for SVM protein classification 7, pp. 566–575.

For simhash_hamming, the SimHash algorithm first maps the original text to a 64-bit binary fingerprint, and the Hamming distance is the number of bit positions at which two fingerprints differ. Simhash_hamming supports two outputs: distance and similarity, where Similarity = 1 - Distance/64.0. Distance is represented as simhash_hamming in the parameters, and similarity as simhash_hamming_sim.

For more information about SimHash, see the original paper.

For more information about Hamming distance, see Wikipedia.

PAI command

  PAI -name doc_similarity
  -project algo_public
  -DinputTableName="pai_test_doc_similarity"
  -DoutputTableName="pai_test_doc_similarity_output"
  -DinputSelectedColName1="col0"
  -DinputSelectedColName2="col1"

Parameter description

Parameter Description Option Default value
inputTableName (Required) Name of the input table - -
outputTableName (Required) Name of the output table - -
inputSelectedColName1 (Optional) Name of the first column for similarity calculation - Name of the first column in string type in the table
inputSelectedColName2 (Optional) Name of the second column for similarity calculation - Name of the second column in string type in the table
inputAppendColNames (Optional) Names of the columns appended to the output table - Not appended
inputTablePartitions (Optional) Partitions selected in the input table - Entire table selected
outputColName (Optional) Name of the similarity column in the output table.A column name must not contain any special character. It can contain only lower- and upper-case letters, numbers, and underscores (_) and must start with a letter. The column name length is no greater than 128 bytes. - output
method (Optional) Similarity calculation method levenshtein, levenshtein_sim, lcs, lcs_sim, ssk, cosine, simhash_hamming, simhash_hamming_sim levenshtein_sim
lambda (Optional) Weight of the matching word combination, which is available in SSK (0, 1) 0.5
k (Optional) Length of the sub-string, which is available in SSK and Cosine (0, 100) 2
lifecycle (Optional) Life cycle of the output table Positive integer No life cycle
coreNum (Optional) Number of cores for calculation Positive integer Automatically assigned
memSizePerCore (Optional) Memory size for each core, in MB Positive integer in the range of (0, 65536) Automatically assigned

Example

Data generation

  drop table if exists pai_doc_similarity_input;
  create table pai_doc_similarity_input as
  select * from
  (
  select 0 as id, "Beijing Shanghai" as col0, "Beijing Shanghai" as col1 from dual
  union all
  select 1 as id, "Beijing Shanghai" as col0, "Beijing Shanghai Hong Kong" as col1 from dual
  )tmp;

PAI command line

  drop table if exists pai_doc_similarity_output;
  PAI -name doc_similarity
  -project algo_public
  -DinputTableName=pai_doc_similarity_input
  -DoutputTableName=pai_doc_similarity_output
  -DinputSelectedColName1=col0
  -DinputSelectedColName2=col1
  -Dmethod=levenshtein_sim
  -DinputAppendColNames=id,col0,col1;

Input description
pai_doc_similarity_input

id col0 col1
1 Beijing Shanghai Beijing Shanghai Hong Kong
0 Beijing Shanghai Beijing Shanghai

Output description
pai_doc_similarity_output

id col0 col1 output
1 Beijing Shanghai Beijing Shanghai Hong Kong 0.6666666666666667
0 Beijing Shanghai Beijing Shanghai 1.0

Important notes

The similarity calculation is based on the word splitting result; each space-separated word is used as the unit for similarity calculation. If the input is a whole string rather than space-separated words, use the string similarity calculation method instead.

In the method parameter, levenshtein, lcs, and simhash_hamming calculate distances, whereas levenshtein_sim, lcs_sim, ssk, cosine, and simhash_hamming_sim calculate similarities, where Distance = 1.0 - Similarity.

The parameter k in the cosine and ssk methods means that combinations of k words are used as the unit of similarity calculation. Therefore, when k is greater than the number of words in a document, the similarity output may be 0 even if the two documents are identical. In this case, reduce k to no more than the minimum number of words. A worked example follows.
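The sketch below computes cosine similarity over word k-grams (k=2) to illustrate this note; it is illustrative, not the component's exact implementation:

  import math
  from collections import Counter

  def kgram_cosine(a: str, b: str, k: int) -> float:
      wa, wb = a.split(), b.split()
      ga = Counter(tuple(wa[i:i+k]) for i in range(len(wa) - k + 1))
      gb = Counter(tuple(wb[i:i+k]) for i in range(len(wb) - k + 1))
      dot = sum(ga[g] * gb[g] for g in ga)
      na = math.sqrt(sum(v * v for v in ga.values()))
      nb = math.sqrt(sum(v * v for v in gb.values()))
      return dot / (na * nb) if na and nb else 0.0

  print(kgram_cosine("Beijing Shanghai", "Beijing Shanghai Hong Kong", 2))  # ~0.577
  print(kgram_cosine("Beijing", "Beijing", 2))  # 0.0: k exceeds the word count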

Sentence splitting

Component description

A piece of text is split into sentences by punctuation.

This component is used for preprocessing of text summarization. It splits a piece of text so that each row contains only one sentence.
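A minimal Python sketch of the splitting step (the delimiter set here is illustrative; see the delimiter parameter below):

  import re

  def split_sentences(doc_id, content, delimiters=".!?"):
      parts = re.split("[" + re.escape(delimiters) + "]", content)
      return [(doc_id, s.strip()) for s in parts if s.strip()]

  print(split_sentences("doc0", "First sentence. Second one! Third?"))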

PAI command

  PAI -name SplitSentences
  -project algo_public
  -DinputTableName="test_input"
  -DoutputTableName="test_output"
  -DdocIdCol="doc_id"
  -DdocContent="content"
  -Dlifecycle=30

Parameter description

Parameter Description Option Default value
inputTableName (Required) Name of the input table - -
inputTablePartitions (Optional) Partitions used for calculation in the input table - All partitions in the input table
outputTableName (Required) Name of the output table - -
docIdCol (Required) Name of the column listing document IDs - -
docContent (Required) Name of the column displaying the document content. Only one column can be specified. - -
delimiter (Optional) Delimiter set of a sentence - !?
lifecycle (Optional) Life cycle of the input/output table Positive integer No life cycle
coreNum (Optional) Number of cores for calculation Positive integer Automatically calculated
memSizePerCore (Optional) Memory size for each core Positive integer Automatically calculated

Output format description

The output table contains the doc_id and sentence columns. For example:

doc_id sentence
1000894 Shanghai Stock Exchange published the guidelines for disclosure of listed companies’ corporate social responsibility (CSR) in 2008, urging three types of companies to disclose their CSR reports and encouraging other qualified listed companies to voluntarily disclose their CSR reports.
1000894 Statistics show that in 2012, a total of 379 listed companies on Shanghai Stock Exchange disclosed their CSR reports, including 305 mandatory companies and 74 voluntary companies, which account for 40% of all listed companies on Shanghai Stock Exchange.

Conditional random field

A conditional random field (CRF) is a conditional probability distribution model of a group of output random variables based on a group of input random variables. This model presumes that the output random variables constitute a Markov random field (MRF). CRFs can be used in different prediction scenarios; the linear chain CRF is most commonly used, especially in annotation scenarios.

For more information about CRFs, see wiki.

PAI command

  PAI -name=linearcrf
  -project=algo_public
  -DinputTableName=crf_input_table
  -DidColName=sentence_id
  -DfeatureColNames=word,f1
  -DlabelColName=label
  -DoutputTableName=crf_model
  -Dlifecycle=28
  -DcoreNum=10

Algorithm parameters

Parameter Description Option Default value
inputTableName (Required) Name of the input feature data table - -
inputTablePartitions (Optional) Partitions selected in the input feature data table - All partitions in the table by default
featureColNames (Optional) Feature columns selected in the input table - All columns but the label column are selected by default.
labelColName (Required) Target column - -
idColName (Required) Sample ID column - -
outputTableName (Required) Name of the output model table - -
outputTablePartitions (Optional) Partitions selected in the output model table - All partitions in the table by default
template (Optional) Feature template generated from algorithm features Definition format: see below Default value: see below
freq (Optional) Feature filtering parameter. Only features that appear at least freq times are retained. Positive integer Default value: 1
iterations (Optional) Maximum number of optimization iterations - Default value: 100
l1Weight (Optional) L1 norm parameter weight - Default value: 1.0
l2Weight (Optional) L2 norm parameter weight - Default value: 1.0
epsilon (Optional) Convergence deviation, which is the termination condition for L-BFGS, that is, the log-likelihood deviation between two iterations - Default value: 0.0001
lbfgsStep (Optional) Historical size during lbfgs optimization, valid only for lbfgs - default value: 10
threadNum (Optional) Number of the threads enabled in parallel during model training - Default value: 3
lifecycle (Optional) Life cycle of the output table - Unspecified by default
coreNum (Optional) Number of cores - Automatically calculated by default
memSizePerCore (Optional) Size of memory - Automatically calculated by default

Feature template definition

  <template> ::= <template_item>,<template_item>,...,<template_item>
  <template_item> ::= [row_offset:col_index]/[row_offset:col_index]/.../[row_offset:col_index]
  row_offset ::= integer
  col_index ::= integer

Each [row_offset:col_index] item selects the cell at the given row offset (relative to the current row) and column index; items joined by forward slashes are concatenated into a single feature.

Default algorithm template

  [-2:0],[-1:0],[0:0],[1:0],[2:0],[-1:0]/[0:0],[0:0]/[1:0],[-2:1],[-1:1],[0:1],[1:1],[2:1],[-2:1]/[-1:1],[-1:1]/[0:1],[0:1]/[1:1],[1:1]/[2:1],[-2:1]/[-1:1]/[0:1],[-1:1]/[0:1]/[1:1],[0:1]/[1:1]/[2:1]

Data example

sentence_id word f1 label
1 Rockwell NNP B-NP
1 International NNP I-NP
1 Corp NNP I-NP
1 ‘s POS B-NP
823 Ohio NNP B-NP
823 grew VBD B-VP
823 3.8 CD B-NP
823 % NN I-NP
823 . . O
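To see how template items turn rows into features, the following Python sketch expands a few items against the example rows above (the '_B' boundary padding for out-of-range offsets is an assumption, not documented behavior):

  rows = [("Rockwell", "NNP"), ("International", "NNP"), ("Corp", "NNP"), ("'s", "POS")]

  def expand(item, t):
      # item: list of (row_offset, col_index) pairs; t: current row index.
      vals = []
      for off, col in item:
          i = t + off
          vals.append(rows[i][col] if 0 <= i < len(rows) else "_B")
      return "/".join(vals)

  print(expand([(0, 0)], 1))           # [0:0]        -> 'International'
  print(expand([(-1, 0), (0, 0)], 1))  # [-1:0]/[0:0] -> 'Rockwell/International'
  print(expand([(0, 1), (1, 1)], 1))   # [0:1]/[1:1]  -> 'NNP/NNP'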

PAI command example for prediction algorithm

  PAI -name=crf_predict
  -project=algo_public
  -DinputTableName=crf_test_input_table
  -DmodelTableName=crf_model
  -DidColName=sentence_id
  -DfeatureColNames=word,f1
  -DlabelColName=label
  -DoutputTableName=crf_predict_result
  -DdetailColName=prediction_detail
  -Dlifecycle=28
  -DcoreNum=10

Prediction algorithm parameters

Parameter Description Option Default value
inputTableName (Required) Name of the input feature data table - -
inputTablePartitions (Optional) Partitions selected in the input feature data table - All partitions in the table by default
featureColNames (Optional) Feature columns selected in the input table - All columns but the label column are selected by default.
labelColName (Optional) Target column - -
idColName (Required) Sample ID column - -
resultColName (Optional) Name of the result column in the output table - Default: prediction_result
scoreColName (Optional) Name of the score column in the output table - Default: prediction_score
detailColName (Optional) Name of the detail column in the output table - Default: null
outputTableName (Required) Name of the output prediction result table - -
outputTablePartitions (Optional) Partitions selected in the output prediction result table - All partitions in the table by default
modelTableName (Required) Name of the algorithm model table - -
modelTablePartitions (Optional) Partitions selected in the algorithm model table - All partitions in the table by default
lifecycle (Optional) Life cycle of the output table - Unspecified by default
coreNum (Optional) Number of cores - Automatically calculated by default
memSizePerCore (Optional) Size of memory - Automatically calculated by default

Test data

sentence_id word f1 label
1 Confidence NN B-NP
1 in IN B-PP
1 the DT B-NP
1 pound NN I-NP
77 have VBP B-VP
77 announced VBN I-VP
77 similar JJ B-NP
77 increases NNS I-NP
77 . . O

Note: The label column can be absent.

Keyword extraction

Keyword extraction is an important natural language processing technique; it extracts the words most relevant to the meaning of a document. This algorithm is based on TextRank and inspired by PageRank (an algorithm that ranks webpages by their link relationships). It builds a network from the co-occurrence of words within a local window, computes the importance of each word, and selects the words with the greatest weights as keywords. A simplified sketch follows below.
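The following simplified Python sketch shows the core idea: link words that co-occur within a window, then run a damped PageRank-style iteration over the graph (window handling and convergence details are illustrative, not the component's exact implementation):

  from collections import defaultdict

  def build_graph(words, window=2):
      # Undirected co-occurrence graph over a sliding window.
      edges = defaultdict(set)
      for i, w in enumerate(words):
          for j in range(i + 1, min(i + window, len(words))):
              edges[w].add(words[j])
              edges[words[j]].add(w)
      return edges

  def textrank_words(words, window=2, d=0.85, iters=100):
      graph = build_graph(words, window)
      score = {w: 1.0 for w in graph}
      for _ in range(iters):
          for w in graph:
              score[w] = (1 - d) + d * sum(
                  score[u] / len(graph[u]) for u in graph[w])
      return sorted(score.items(), key=lambda kv: -kv[1])

  words = "finite difference solver finite element solver panel method solver".split()
  print(textrank_words(words)[:3])  # top words by weight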

PAI command

  PAI -name KeywordsExtraction
  -DinputTableName=maple_test_keywords_basic_input
  -DdocIdCol=docid -DdocContent=word
  -DoutputTableName=maple_test_keywords_basic_output
  -DtopN=19;

Parameter description

Parameter Description Value range Required/Optional, default value
inputTableName Input table Table name Required
inputTablePartitions Partitions used for training in the input table, in the format of partition_name=value. The multilevel partition name format is name1=value1/name2=value2. If you specify multiple partitions, separate them with a comma (,). (Optional) Default: all partitions
outputTableName Name of the output table Table name Required
docIdCol Name of the column listing document IDs. Only one column can be specified. Required
docContent Word column. Only one column can be specified. Required
topN Number of top keywords to be output. If this number is greater than the total number of words, all words are output. (Optional) Default value: 5
windowSize Size of the window of the TextRank algorithm (Optional) Default value: 2
dumpingFactor Damping factor of the TextRank algorithm (Optional) Default value: 0.85
maxIter Maximum number of iterations of the TextRank algorithm (Optional) Default value: 100
epsilon Convergence residual threshold of the TextRank algorithm (Optional) Default value: 0.000001
lifecycle (Optional) Life cycle of the output table Positive integer No life cycle
coreNum Number of nodes Used together with the memSizePerCore parameter. Positive integer in the range of [1, 9999]. Detailed description (Optional) Automatically calculated by default
memSizePerCore Memory size of each node, in MB Positive integer in the range of [1024, 64*1024] Detailed description (Optional) Automatically calculated by default

Example

Data generation

The words in the input table are separated by spaces; deprecated words and all punctuation marks have been filtered out.

docid:string word:string
doc0 The blended-wing-body aircraft is a new direction for the future development in the aviation field Many research institutions in and outside China have carried out researches on the blended-wing-body aircraft while its fully automated shape optimization algorithm has become a new hot topic Based on the existing research achievements in and outside China common modeling and flow solver tools have been analyzed and compared The geometric modeling meshing flow field solver and shape optimization modules have been designed The pros and cons between different algorithms have been compared to get the optimized shape of the blended-wing-body aircraft in the conceptual design stage Geometric modeling and mesh generation module include the transfinite interpolation algorithm and spline based mesh generation method The flow solver module includes the finite difference solver the finite element solver and the panel method solver wherein the finite difference solver includes mathematical modeling of the potential flow the derivation of the Cartesian grid based variable step length difference scheme Cartesian grid generation and indexing algorithm the Cartesian grid based Neumann boundary conditions The aerodynamic parameters of a two-dimensional airfoil are calculated based on the finite difference solver Finite element solver includes potential flow modeling based on the variational principle of the finite element theory the derivation of the two-dimensional finite element Kutta conditional least squares based speed solving algorithm Gmsh based two-dimensional field mesh generator of airfoil with wakes design The aerodynamic parameters of a two-dimensional airfoil are calculated based on the finite element solver The panel method solver includes modeling and automatic wake generation the design of the three-dimensional flow solver of the blended-wing-body drag estimation based on the Blasius solution solver implemented in the Fortran language a mixed compilation of Python and Fortran OpenMP and CUDA based acceleration algorithm The aerodynamic parameters of a three-dimensional wing body are calculated based on the panel method solver The shape optimization module includes a free form deformation algorithm genetic algorithms differential evolution algorithm aircraft surface area calculation algorithm and the moments integration algorithm which can calculate the volume of an aircraft.

PAI command line

  PAI -name KeywordsExtraction
  -DinputTableName=maple_test_keywords_basic_input
  -DdocIdCol=docid -DdocContent=word
  -DoutputTableName=maple_test_keywords_basic_output
  -DtopN=19;

Output description

Output table

docid keywords weight
doc0 Based on 0.041306752223538405
doc0 Algorithm 0.03089845626854151
doc0 Modeling 0.021782865850562882
doc0 Grid 0.020669749212693957
doc0 Solver 0.020245609506360847
doc0 Aircraft 0.019850761705313365
doc0 Research 0.014193732541852615
doc0 Finite element 0.013831122054200538
doc0 Solving 0.012924593244133104
doc0 Module 0.01280216562287212
doc0 Derivation 0.011907588923852495
doc0 Shape 0.011505456605632607
doc0 Difference 0.011477831662367547
doc0 Flow 0.010969269350293957
doc0 Design 0.010830986516637251
doc0 Implementation 0.010747536556701583
doc0 Two-dimensional 0.010695570768457084
doc0 Development 0.010527342662670088
doc0 New 0.010096978306668461

Algorithm scale

Sina data set, 1,080,162 documents, 30 initial nodes, 10 minutes.

ngram-count

Ngram-count is one of the steps in building a language model. It generates n-grams from words and counts the occurrences of each n-gram across the whole corpus, rather than within a single document. Its functionality is a subset of the standard ngram-count tool.
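A minimal Python sketch of global n-gram counting (order up to 3, matching the default; illustrative only):

  from collections import Counter

  corpus = ["a b c", "a b d"]  # hypothetical corpus lines
  counts = Counter()
  for line in corpus:
      words = line.split()
      for n in range(1, 4):  # order: maximum n-gram length
          for i in range(len(words) - n + 1):
              counts[" ".join(words[i:i + n])] += 1

  print(counts["a b"])  # 2: counted across the whole corpus, not per document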

PAI command

  PAI -name ngram_count
  -project algo_public
  -DinputTableName=pai_ngram_input
  -DoutputTableName=pai_ngram_output
  -DinputSelectedColNames=col0
  -DweightColName=weight
  -DcoreNum=2
  -DmemSizePerCore=1000;

Parameter description

| Parameter | Description | Option | Default value |
| --- | --- | --- | --- |
| inputTableName | (Required) Name of the input table | - | - |
| outputTableName | (Required) Name of the output table | - | - |
| inputSelectedColNames | (Optional) Names of the columns selected in the input table | - | The first column of character type |
| weightColName | (Optional) Name of the weight column | - | Default weight: 1 |
| inputTablePartitions | (Optional) Partitions in the input table | - | Entire table selected by default |
| countTableName | (Optional) Output table of a previous ngram-count run, which is merged into the final result | - | - |
| countWordColName | (Optional) Name of the word column in the count table | - | The second column is selected by default. |
| countCountColName | (Optional) Name of the count column in the count table | - | The third column is selected by default. |
| countTablePartitions | (Optional) Partitions in the count table | - | - |
| vocabTableName | (Optional) Bag-of-words table. Words not contained in the bag-of-words are output as `<unk>` in the result. | - | - |
| vocabSelectedColName | (Optional) Name of the bag-of-words column | - | The first column of character type is selected by default. |
| vocabTablePartitions | (Optional) Partitions in the bag-of-words table | - | - |
| order | (Optional) Maximum length of n-grams | - | Default value: 3 |
| lifecycle | (Optional) Life cycle of the output table | - | - |
| coreNum | (Optional) Number of cores | - | - |
| memSizePerCore | (Optional) Memory size for each core | - | - |

Semantic vector distance

Calculates the extension words (or sentences) of specified words (or sentences) based on precomputed semantic vectors, such as the word vectors calculated by the Word2Vec component. The extension words of a vector are the set of vectors closest to it. The following example shows how to generate a list of the words that are most similar to an input word based on the word vector results calculated by the Word2Vec component.
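
As a minimal sketch of the search this component performs, the following Python snippet (using NumPy) ranks, for every input vector, the topN closest other vectors under the three supported distance types. The data and names are illustrative only:

```python
import numpy as np

def semantic_vector_distance(ids, vectors, top_n=5, distance_type="euclidean"):
    """For every vector, find the top_n closest other vectors.

    ids          : list of row IDs (cf. the idColName column)
    vectors      : (n, d) array built from the vectorColNames columns
    distance_type: 'euclidean', 'cosine', or 'manhattan', as in the component
    Returns rows of (original_id, near_id, distance, rank).
    """
    v = np.asarray(vectors, dtype=float)
    if distance_type == "euclidean":
        dist = np.sqrt(((v[:, None, :] - v[None, :, :]) ** 2).sum(-1))
    elif distance_type == "manhattan":
        dist = np.abs(v[:, None, :] - v[None, :, :]).sum(-1)
    else:  # cosine distance = 1 - cosine similarity
        norm = v / np.linalg.norm(v, axis=1, keepdims=True)
        dist = 1.0 - norm @ norm.T
    rows = []
    for i, src in enumerate(ids):
        order = np.argsort(dist[i])
        neighbors = [j for j in order if j != i][:top_n]
        rows.extend((src, ids[j], dist[i, j], rank + 1)
                    for rank, j in enumerate(neighbors))
    return rows

ids = ["hello", "hi", "man", "woman"]
vecs = [[0.1, 0.2], [0.1, 0.3], [0.9, 0.1], [0.8, 0.2]]
for row in semantic_vector_distance(ids, vecs, top_n=2):
    print(*row)
```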

PAI command

  1. PAI -name SemanticVectorDistance -project algo_public
  2. -DinputTableName="test_input"
  3. -DoutputTableName="test_output"
  4. -DidColName="word"
  5. -DvectorColNames="f0,f1,f2,f3,f4,f5"
  6. -Dlifecycle=30;

Parameter description

| Parameter | Description | Option | Default value |
| --- | --- | --- | --- |
| inputTableName | (Required) Name of the input table | - | - |
| inputTablePartitions | (Optional) Partitions used for calculation in the input table | - | All partitions in the input table |
| outputTableName | (Required) Name of the output table | - | - |
| idTableName | (Optional) Name of the table listing the IDs of the vectors to search for. It is a one-column table with an ID in each row. If it is not specified, all vectors in the input table are calculated. | - | - |
| idTablePartitions | (Optional) Partitions used for calculation in the ID table | - | All partitions are selected by default. |
| idColName | (Required) Name of the ID column | - | - |
| vectorColNames | (Optional) List of vector column names, such as f1,f2,… | - | - |
| topN | (Optional) Number of the closest vectors to output | [1, ∞) | 5 |
| distanceType | (Optional) Distance calculation method | euclidean, cosine, and manhattan | euclidean |
| distanceThreshold | (Optional) Distance threshold. The distance between two vectors is output only if it is smaller than this threshold. | (0, ∞) | - |
| lifecycle | (Optional) Life cycle of the output table | Positive integer | No life cycle |
| coreNum | (Optional) Number of cores for calculation | Positive integer | Automatically calculated |
| memSizePerCore | (Optional) Memory size for each core | Positive integer | Automatically calculated |

Example

The output table contains the original_id, near_id, distance, and rank columns. For example:

| original_id | near_id | distance | rank |
| --- | --- | --- | --- |
| hello | hi | 0.2 | 1 |
| hello | xxx | xx | 2 |
| Man | Woman | 0.3 | 1 |
| Man | xx | xx | 2 |
| … | | | |

Conditional random field

A conditional random field (CRF) is a conditional probability distribution model of a group of output random variables given a group of input random variables. This model assumes that the output random variables constitute a Markov random field (MRF). CRFs can be used in various prediction scenarios; the linear-chain CRF is the most commonly used, especially in sequence annotation scenarios.

For more information about CRFs, see [Conditional random field in Wikipedia].
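
Training a CRF is beyond the scope of a short example, but prediction with a linear-chain CRF is commonly done with the Viterbi algorithm, which is easy to illustrate. The following minimal sketch finds the highest-scoring label sequence given hypothetical emission and transition score matrices; a trained model would supply real feature weights:

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Find the highest-scoring label sequence for a linear-chain CRF.

    emissions  : (T, L) per-position label scores (features x weights)
    transitions: (L, L) score of moving from label i to label j
    """
    T, L = emissions.shape
    score = emissions[0].copy()           # best score ending in each label
    backptr = np.zeros((T, L), dtype=int)
    for t in range(1, T):
        # candidate scores for every (previous label, current label) pair
        cand = score[:, None] + transitions + emissions[t][None, :]
        backptr[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    # follow back-pointers from the best final label
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]

# toy example: 3 tokens, labels 0='B-NP', 1='I-NP'
emissions = np.array([[2.0, 0.1], [0.3, 1.5], [0.2, 1.2]])
transitions = np.array([[0.1, 1.0], [0.5, 0.8]])
print(viterbi_decode(emissions, transitions))  # -> [0, 1, 1]
```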

PAI command

  1. PAI -name=linearcrf
  2. -project=algo_public
  3. -DinputTableName=crf_input_table
  4. -DidColName=sentence_id
  5. -DfeatureColNames=word,f1
  6. -DlabelColName=label
  7. -DoutputTableName=crf_model
  8. -Dlifecycle=28
  9. -DcoreNum=10;

Algorithm parameters

| Parameter | Description | Option | Default value |
| --- | --- | --- | --- |
| inputTableName | (Required) Name of the input feature data table | - | - |
| inputTablePartitions | (Optional) Partitions selected in the input feature data table | - | All partitions in the table by default |
| featureColNames | (Optional) Feature columns selected in the input table | - | All columns except the label column are selected by default. |
| labelColName | (Required) Target column | - | - |
| idColName | (Required) Sample ID column | - | - |
| outputTableName | (Required) Name of the output model table | - | - |
| outputTablePartitions | (Optional) Partitions selected in the output model table | - | All partitions in the table by default |
| template | (Optional) Feature template of the algorithm | See the feature template definition below | See the default algorithm template below |
| freq | (Optional) Feature frequency threshold. Only features that appear at least freq times are retained. | Positive integer | Default value: 1 |
| iterations | (Optional) Maximum number of optimization iterations | - | Default value: 100 |
| l1Weight | (Optional) Weight of the L1 norm | - | Default value: 1.0 |
| l2Weight | (Optional) Weight of the L2 norm | - | Default value: 1.0 |
| epsilon | (Optional) Convergence tolerance, the termination condition for L-BFGS: the log-likelihood difference between two iterations | - | Default value: 0.0001 |
| lbfgsStep | (Optional) History size used by L-BFGS optimization; valid only for L-BFGS | - | Default value: 10 |
| threadNum | (Optional) Number of threads run in parallel during model training | - | Default value: 3 |
| lifecycle | (Optional) Life cycle of the output table | - | Unspecified by default |
| coreNum | (Optional) Number of cores | - | Automatically calculated by default |
| memSizePerCore | (Optional) Memory size for each core | - | Automatically calculated by default |

Feature template definition

  1. <template> .=. <template_item>,<template_item>,...,<template_item>
  2. <template_item> .=. [row_offset:col_index]/[row_offset:col_index]/.../[row_offset:col_index]
  3. row_offset .=. integer
  4. col_index .=. integer

Default algorithm template

  1. [-2:0],[-1:0],[0:0],[1:0],[2:0],[-1:0]/[0:0],[0:0]/[1:0],[-2:1],[-1:1],[0:1],[1:1],[2:1],[-2:1]/[-1:1],[-1:1]/[0:1],[0:1]/[1:1],[1:1]/[2:1],[-2:1]/[-1:1]/[0:1],[-1:1]/[0:1]/[1:1],[0:1]/[1:1]/[2:1]
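
As a sketch of how such a template is interpreted, the snippet below expands a single `<template_item>` for one token: each [row_offset:col_index] reference selects a cell relative to the current row, and '/'-joined references are concatenated into one feature string (the data layout follows the training data example below). The `_B_` boundary marker for out-of-range offsets is an assumption for illustration, not necessarily the component's actual behavior:

```python
def expand_template(rows, pos, template_item):
    """Expand one <template_item> for the token at index `pos`.

    rows          : list of per-token feature tuples, e.g. [('Rockwell', 'NNP'), ...]
    template_item : e.g. '[-1:0]/[0:0]' -> previous word joined with current word
    Cells outside the sentence are rendered as a boundary marker.
    """
    parts = []
    for ref in template_item.split("/"):
        row_offset, col_index = map(int, ref.strip("[]").split(":"))
        i = pos + row_offset
        parts.append(rows[i][col_index] if 0 <= i < len(rows) else "_B_")
    return "/".join(parts)

sentence = [("Rockwell", "NNP"), ("International", "NNP"), ("Corp", "NNP")]
print(expand_template(sentence, 1, "[0:0]"))         # International
print(expand_template(sentence, 1, "[-1:0]/[0:0]"))  # Rockwell/International
print(expand_template(sentence, 1, "[-2:1]"))        # _B_ (out of range)
```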

Data example

Training data

| sentence_id | word | f1 | label |
| --- | --- | --- | --- |
| 1 | Rockwell | NNP | B-NP |
| 1 | International | NNP | I-NP |
| 1 | Corp | NNP | I-NP |
| 1 | 's | POS | B-NP |
| … | … | … | … |
| 823 | Ohio | NNP | B-NP |
| 823 | grew | VBD | B-VP |
| 823 | 3.8 | CD | B-NP |
| 823 | % | NN | I-NP |
| 823 | . | . | O |

PAI command example for the prediction algorithm

  1. PAI -name=crf_predict
  2. -project=algo_public
  3. -DinputTableName=crf_test_input_table
  4. -DmodelTableName=crf_model
  5. -DidColName=sentence_id
  6. -DfeatureColNames=word,f1
  7. -DlabelColName=label
  8. -DoutputTableName=crf_predict_result
  9. -DdetailColName=prediction_detail
  10. -Dlifecycle=28
  11. -DcoreNum=10;

Prediction algorithm parameters

| Parameter | Description | Option | Default value |
| --- | --- | --- | --- |
| inputTableName | (Required) Name of the input feature data table | - | - |
| inputTablePartitions | (Optional) Partitions selected in the input feature data table | - | All partitions in the table by default |
| featureColNames | (Optional) Feature columns selected in the input table | - | All columns except the label column are selected by default. |
| labelColName | (Optional) Target column | - | - |
| idColName | (Required) Sample ID column | - | - |
| resultColName | (Optional) Name of the result column in the output table | - | Default: prediction_result |
| scoreColName | (Optional) Name of the score column in the output table | - | Default: prediction_score |
| detailColName | (Optional) Name of the detail column in the output table | - | Default: null |
| outputTableName | (Required) Name of the output prediction result table | - | - |
| outputTablePartitions | (Optional) Partitions selected in the output prediction result table | - | All partitions in the table by default |
| modelTableName | (Required) Name of the algorithm model table | - | - |
| modelTablePartitions | (Optional) Partitions selected in the algorithm model table | - | All partitions in the table by default |
| lifecycle | (Optional) Life cycle of the output table | - | Unspecified by default |
| coreNum | (Optional) Number of cores | - | Automatically calculated by default |
| memSizePerCore | (Optional) Memory size for each core | - | Automatically calculated by default |

Test data

| sentence_id | word | f1 | label |
| --- | --- | --- | --- |
| 1 | Confidence | NN | B-NP |
| 1 | in | IN | B-PP |
| 1 | the | DT | B-NP |
| 1 | pound | NN | I-NP |
| … | … | … | … |
| 77 | have | VBP | B-VP |
| 77 | announced | VBN | I-VP |
| 77 | similar | JJ | B-NP |
| 77 | increases | NNS | I-NP |
| 77 | . | . | O |

Note: The label column can be absent from the test data.

PMI

  • Mutual information (MI) is a useful measure of dependence in information theory. It can be regarded as the amount of information contained in one random variable about another, or the reduction in the uncertainty of one random variable due to knowledge of the other.

  • This algorithm counts the co-occurrences of all words in a set of documents and calculates the pointwise mutual information (PMI) of each word pair:
    PMI(x, y) = ln(p(x, y) / (p(x) p(y))) = ln(#(x, y) · D / (#x · #y))

  • Here, #(x, y) is the number of times the pair (x, y) is counted and D is the total number of pairs. Whenever x and y appear in the same window, the counters are updated: #x += 1, #y += 1, #(x, y) += 1. A runnable sketch of this counting scheme is shown below.
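
The following minimal Python sketch implements this counting scheme, pairing each word with the windowSize words to its right; applied to the data example later in this section, it reproduces the counts and PMI values in the output table. The function name is illustrative only:

```python
from collections import Counter
from math import log

def pmi_counts(docs, window_size=2):
    """Count word and pair frequencies as described above.

    Each word is paired with the next window_size words; every pair
    increments #(x, y), #x, and #y, and D is the total number of pairs.
    """
    pair_counts, word_counts = Counter(), Counter()
    total_pairs = 0
    for doc in docs:
        tokens = doc.split()
        for i, x in enumerate(tokens):
            for y in tokens[i + 1:i + 1 + window_size]:
                pair = tuple(sorted((x, y)))
                pair_counts[pair] += 1
                word_counts[x] += 1
                word_counts[y] += 1
                total_pairs += 1
    return pair_counts, word_counts, total_pairs

docs = ["w1 w2 w3 w4 w5 w6 w7 w8 w8 w9",
        "w1 w3 w5 w6 w9",
        "w0", "w0 w0",
        "w9 w1 w9 w1 w9"]
pairs, words, D = pmi_counts(docs, window_size=2)
for (x, y), n in sorted(pairs.items()):
    # columns: word1 word2 word1_count word2_count co_occurrences_count pmi
    print(x, y, words[x], words[y], n, log(n * D / (words[x] * words[y])))
```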

PAI command line

  1. PAI -name PointwiseMutualInformation
  2. -project algo_public
  3. -DinputTableName=maple_test_pmi_basic_input
  4. -DdocColName=doc
  5. -DoutputTableName=maple_test_pmi_basic_output
  6. -DminCount=0
  7. -DwindowSize=2
  8. -DcoreNum=1
  9. -DmemSizePerCore=110;

Parameter description

| Parameter | Description | Value range | Required/Optional, default value |
| --- | --- | --- | --- |
| inputTableName | Name of the input table | Table name | Required |
| outputTableName | Name of the output table | Table name | Required |
| docColName | Name of the column containing the documents after word splitting, where words are separated by spaces | Column name | Required |
| windowSize | Size of the window. For example, the value 5 refers to the five adjacent words to the right of the current word (excluding the current word). Words appearing in the window are considered related to the current word. | [1, sentence length] | (Optional) Entire row by default |
| minCount | Minimum word frequency. Words that appear fewer times than this value are filtered out. | [0, 2^63] | (Optional) Default value: 5 |
| inputTablePartitions | Partitions used for training in the input table, in the format of partition_name=value. The multilevel partition name format is name1=value1/name2=value2. If you specify multiple partitions, separate them with commas (,). | - | (Optional) All partitions by default |
| lifecycle | Life cycle of the output table | Positive integer | (Optional) No life cycle by default |
| coreNum | Number of nodes | Positive integer in the range of [1, 9999], used together with memSizePerCore | (Optional) Automatically calculated by default |
| memSizePerCore | Memory size of each node, in MB | Positive integer in the range of [1024, 64×1024] | (Optional) Automatically calculated by default |

Example

Data generation

doc:string
w1 w2 w3 w4 w5 w6 w7 w8 w8 w9
w1 w3 w5 w6 w9
w0
w0 w0
w9 w1 w9 w1 w9

PAI command line

  1. PAI -name PointwiseMutualInformation
  2. -project algo_public
  3. -DinputTableName=maple_test_pmi_basic_input
  4. -DdocColName=doc
  5. -DoutputTableName=maple_test_pmi_basic_output
  6. -DminCount=0
  7. -DwindowSize=2
  8. -DcoreNum=1
  9. -DmemSizePerCore=110;

Output description

Output table

| word1 | word2 | word1_count | word2_count | co_occurrences_count | pmi |
| --- | --- | --- | --- | --- | --- |
| w0 | w0 | 2 | 2 | 1 | 2.0794415416798357 |
| w1 | w1 | 10 | 10 | 1 | -1.1394342831883648 |
| w1 | w2 | 10 | 3 | 1 | 0.06453852113757116 |
| w1 | w3 | 10 | 7 | 2 | -0.08961215868968704 |
| w1 | w5 | 10 | 8 | 1 | -0.916290731874155 |
| w1 | w9 | 10 | 12 | 4 | 0.06453852113757116 |
| w2 | w3 | 3 | 7 | 1 | 0.4212134650763035 |
| w2 | w4 | 3 | 4 | 1 | 0.9808292530117262 |
| w3 | w4 | 7 | 4 | 1 | 0.13353139262452257 |
| w3 | w5 | 7 | 8 | 2 | 0.13353139262452257 |
| w3 | w6 | 7 | 7 | 1 | -0.42608439531090014 |
| w4 | w5 | 4 | 8 | 1 | 0 |
| w4 | w6 | 4 | 7 | 1 | 0.13353139262452257 |
| w5 | w6 | 8 | 7 | 2 | 0.13353139262452257 |
| w5 | w7 | 8 | 4 | 1 | 0 |
| w5 | w9 | 8 | 12 | 1 | -1.0986122886681098 |
| w6 | w7 | 7 | 4 | 1 | 0.13353139262452257 |
| w6 | w8 | 7 | 7 | 1 | -0.42608439531090014 |
| w6 | w9 | 7 | 12 | 1 | -0.9650808960435872 |
| w7 | w8 | 4 | 7 | 2 | 0.8266785731844679 |
| w8 | w8 | 7 | 7 | 1 | -0.42608439531090014 |
| w8 | w9 | 7 | 12 | 2 | -0.2719337154836418 |
| w9 | w9 | 12 | 12 | 2 | -0.8109302162163288 |
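
As a quick check of the formula against this output: the five input rows produce D = 32 pairs in total (the co_occurrences_count column sums to 32). For the pair (w0, w0), #(w0, w0) = 1 and #w0 = 2, so PMI = ln(1 × 32 / (2 × 2)) = ln 8 ≈ 2.0794, matching the first row. Likewise, for (w1, w9), PMI = ln(4 × 32 / (10 × 12)) = ln(16/15) ≈ 0.0645.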