This topic describes the Keyword Extraction component provided by Machine Learning Studio.

Keyword extraction is a core technique in natural language processing. It identifies the words that best represent a document. The component is based on TextRank, a variation of the PageRank algorithm. It builds a network from the relationships between words, calculates the importance of each word, and returns the words with the largest weights as keywords.
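The network-and-weights idea can be sketched in a few lines of Python. This is a minimal, unweighted illustration and not the component's actual implementation: the function name `textrank_keywords` is hypothetical, and its arguments only mirror the component's `topN`, `windowSize`, damping, `maxIter`, and `epsilon` settings under the assumption that they play their usual TextRank roles.

```python
from collections import defaultdict


def textrank_keywords(tokens, top_n=5, window_size=2, damping=0.85,
                      max_iter=100, epsilon=1e-6):
    """Rank tokens with a simplified, unweighted TextRank (illustrative only)."""
    # Build an undirected co-occurrence graph: two words are linked when they
    # appear within `window_size` consecutive tokens of each other.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1:i + window_size]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    words = list(neighbors)
    if not words:
        return []

    # Start from a uniform score and iterate the PageRank-style update.
    scores = {w: 1.0 / len(words) for w in words}
    for _ in range(max_iter):
        new_scores = {}
        for w in words:
            incoming = sum(scores[n] / len(neighbors[n]) for n in neighbors[w])
            new_scores[w] = (1.0 - damping) / len(words) + damping * incoming
        # Stop when the largest per-word change falls below the threshold.
        delta = max(abs(new_scores[w] - scores[w]) for w in words)
        scores = new_scores
        if delta < epsilon:
            break

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]
```

Calling `textrank_keywords(text.split(), top_n=3)` on a space-separated document returns up to three `(word, weight)` pairs. Words that co-occur with many different neighbors tend to receive larger weights.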

The keyword extraction process involves the following steps:

- Raw corpora preparation
- Tokenization
- Word-based filtering
- Keyword extraction
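The tokenization and filtering steps that precede keyword extraction might look like the following sketch. It assumes whitespace-delimited text and a tiny, hypothetical stop-word list; `prepare_document` and `STOP_WORDS` are illustrative names, not part of the component.

```python
import string

# Hypothetical, deliberately tiny stop-word list; a real pipeline would use a
# much larger, language-specific list.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}


def prepare_document(raw_text, stop_words=STOP_WORDS):
    """Tokenize on whitespace, strip punctuation, and drop stop words."""
    kept = []
    for token in raw_text.split():
        # Remove punctuation from both ends; inner hyphens are kept,
        # so "two-dimensional" stays a single word.
        cleaned = token.strip(string.punctuation)
        if cleaned and cleaned.lower() not in stop_words:
            kept.append(cleaned)
    # Return the words joined by single spaces, ready for a word column.
    return " ".join(kept)
```

The returned string can be stored in the word column of the input table, one row per document.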

## Configure the component

You can configure the component by using one of the following methods:

- Machine Learning Platform for AI console

| Tab | Parameter | Description |
| --- | --- | --- |
| Fields Setting | Column of Marked Document IDs | The document ID column. |
| Fields Setting | Word Splitting Result of Marked Documents | The word splitting results of marked documents. |
| Parameters Setting | Output First N Keywords | The value must be an integer. Default value: 5. |
| Parameters Setting | Window Size | The value must be an integer. Default value: 2. |
| Parameters Setting | Damping Coefficient | Default value: 0.85. |
| Parameters Setting | Maximum Iterations | Default value: 100. |
| Parameters Setting | Convergence Coefficient | Default value: 0.000001. |
| Tuning | Cores | Allocated by the system by default. |
| Tuning | Memory size per core | Allocated by the system by default. |

- PAI command
`PAI -name KeywordsExtraction -DinputTableName=maple_test_keywords_basic_input -DdocIdCol=docid -DdocContent=word -DoutputTableName=maple_test_keywords_basic_output -DtopN=19;`

| Parameter | Required | Description | Default value |
| --- | --- | --- | --- |
| inputTableName | Yes | The name of the input table. | No default value |
| inputTablePartitions | No | The partitions selected from the input table for training, in the format of "Partition_name=value". To specify multi-level partitions, use the following format: name1=value1/name2=value2. Separate multiple partitions with commas (,). | All partitions |
| outputTableName | Yes | The name of the output table. | No default value |
| docIdCol | Yes | The document ID column. You can specify only one column. | No default value |
| docContent | Yes | The word column. You can specify only one column. | No default value |
| topN | No | The number of top keywords to return. If the value is greater than the total number of keywords, all keywords are returned. | 5 |
| windowSize | No | The window size of the TextRank algorithm. | 2 |
| dumpingFactor | No | The damping coefficient of the TextRank algorithm. | 0.85 |
| maxIter | No | The maximum number of iterations of the TextRank algorithm. | 100 |
| epsilon | No | The convergence residual threshold of the TextRank algorithm. | 0.000001 |
| lifecycle | No | The lifecycle of the output table. | No default value |
| coreNum | No | The number of cores. | Automatically calculated |
| memSizePerCore | No | The memory size of each core. Unit: MB. | Automatically calculated |
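When commands are generated from scripts, the parameters can be assembled programmatically. The helper below is a hypothetical sketch, not part of any PAI SDK: it only renders the `-D<name>=<value>` form shown in the command on this page and checks that the four required parameters are present. It does not validate values or quote values that contain spaces.

```python
def build_pai_command(params):
    """Render a KeywordsExtraction PAI command from a parameter dict."""
    required = ("inputTableName", "outputTableName", "docIdCol", "docContent")
    missing = [name for name in required if name not in params]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))
    # Each parameter is passed as -D<name>=<value>; the statement ends with ';'.
    flags = " ".join(f"-D{name}={value}" for name, value in params.items())
    return f"PAI -name KeywordsExtraction {flags};"


command = build_pai_command({
    "inputTableName": "maple_test_keywords_basic_input",
    "docIdCol": "docid",
    "docContent": "word",
    "outputTableName": "maple_test_keywords_basic_output",
    "topN": 19,
})
print(command)
```

Optional parameters such as `windowSize` or `maxIter` can be added to the dictionary in the same way. Parameters that are omitted fall back to the defaults listed in the table.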

## Example

- Input
Separate words in the input table with spaces, and filter out stop words (such as "of") and all punctuation marks.

| docid:string | word:string |
| --- | --- |
| doc0 | Blended Wing Body (BWB) aircraft new direction development aviation future Many research institutions home abroad have carried out research on BWB aircraft fully automatic shape optimization algorithm has become new research hotspot Based existing results home abroad methods characteristics commonly used modeling solution platforms analyzed compared Geometric modeling meshing flow field solution shape optimization BWB aircraft shape optimization designed optimization module compares advantages disadvantages different algorithms realizes shape optimization BWB aircraft conceptual design geometric modeling mesh generation module implements mesh generation algorithm based on over-limit interpolation modeling method based on splines flow field solving module includes finite difference solver finite element solver surface element method solver Among them finite difference solver mainly includes mathematical modeling potential flow based finite difference method variable step difference format derivation based Cartesian grid generation index algorithm Cartesian grid Neumann based Cartesian grid boundary condition expression form deduced calculation example two-dimensional airfoil aerodynamic parameters based finite difference solver realized finite element solver mainly includes theoretical modeling potential flow finite element based variational principle derivation two-dimensional finite element Kutta conditional expressions design velocity solving algorithm based on least squares two-dimensional airfoil space with wake based on Gmsh grid generator developed calculation example two-dimensional airfoil aerodynamic parameters based finite element solver realized panel method solver mainly includes potential flow theory modeling based panel method automatic wake generation algorithm design surface-based development three-dimensional BWB flow field solver based element method design resistance estimation algorithm based Brasius plate solution transplantation solver Fortran language mixed compilation Python Fortran code parallelization based on OpenMP CUDA Accelerate design development algorithm realize calculation example aerodynamic parameters three-dimensional BWB based panel method solver shape optimization module implements mesh deformation algorithm based on free shape deformation genetic algorithm differential evolution algorithm aircraft surface area calculation algorithm aircraft volume calculation algorithm based on moment integral Developed data visualization format tool based on VTK |

- PAI command
`PAI -name KeywordsExtraction -DinputTableName=maple_test_keywords_basic_input -DdocIdCol=docid -DdocContent=word -DoutputTableName=maple_test_keywords_basic_output -DtopN=19;`

- Output

| docid | keywords | weight |
| --- | --- | --- |
| doc0 | based on | 0.041306752223538405 |
| doc0 | algorithm | 0.03089845626854151 |
| doc0 | modeling | 0.021782865850562882 |
| doc0 | grid | 0.020669749212693957 |
| doc0 | solver | 0.020245609506360847 |
| doc0 | aircraft | 0.019850761705313365 |
| doc0 | research | 0.014193732541852615 |
| doc0 | finite element | 0.013831122054200538 |
| doc0 | solving | 0.012924593244133104 |
| doc0 | module | 0.01280216562287212 |
| doc0 | derivation | 0.011907588923852495 |
| doc0 | shape | 0.011505456605632607 |
| doc0 | difference | 0.011477831662367547 |
| doc0 | flow | 0.010969269350293957 |
| doc0 | design | 0.010830986516637251 |
| doc0 | implementation | 0.010747536556701583 |
| doc0 | two-dimensional | 0.010695570768457084 |
| doc0 | development | 0.010527342662670088 |
| doc0 | new | 0.010096978306668461 |