This topic describes the Keyword Extraction component provided by Machine Learning Studio.

Keyword extraction is a core technique in natural language processing that identifies the words that best represent a document. The component is based on TextRank, a variant of the PageRank algorithm. TextRank builds a network from the relationships between neighboring words, iteratively calculates the importance of each word, and returns the words with the highest weights as keywords.
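The network construction described above can be sketched as follows. This is a minimal illustration, not the component's actual implementation: the function name build_cooccurrence_graph is hypothetical, and it assumes that a window size of 2 links each word only to the word immediately after it. The component may define the window differently.

```python
from collections import defaultdict


def build_cooccurrence_graph(words, window_size=2):
    """Link distinct words that occur within `window_size` positions of each other.

    Assumption: edges are undirected and weighted by co-occurrence count.
    """
    graph = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(words):
        # Look ahead at the next (window_size - 1) words.
        for j in range(i + 1, min(i + window_size, len(words))):
            other = words[j]
            if other == word:
                continue  # skip self-loops
            graph[word][other] += 1.0
            graph[other][word] += 1.0
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {w: dict(neighbors) for w, neighbors in graph.items()}


words = "finite difference solver finite element solver".split()
graph = build_cooccurrence_graph(words, window_size=2)
for word, neighbors in graph.items():
    print(word, neighbors)
```

In this toy input, "finite" and "solver" each end up with three neighbors, while "difference" and "element" have two, which is why the ranking step tends to favor them.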

The keyword extraction process involves the following steps:
  1. Raw corpora preparation
  2. Tokenization
  3. Word-based filtering
  4. Keyword extraction
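Steps 2 and 3 produce the space-separated word column that the component reads. The sketch below shows one way to prepare English text outside the platform. The stop-word list and the function name prepare_document are illustrative assumptions, and the regular expression is only a stand-in for a real tokenizer.

```python
import re

# Hypothetical, deliberately short stop-word list for illustration only.
STOP_WORDS = {"of", "the", "a", "an", "and", "is", "in", "for", "to"}


def prepare_document(raw_text, stop_words=STOP_WORDS):
    """Tokenize raw text, drop punctuation and stop words, join with spaces."""
    # Keep alphanumeric runs, allowing internal hyphens (e.g. "two-dimensional").
    tokens = re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", raw_text)
    kept = [t for t in tokens if t.lower() not in stop_words]
    return " ".join(kept)


sample = "The shape optimization of the BWB aircraft, in two-dimensional flow."
print(prepare_document(sample))
```

The resulting string can be stored as the value of the word column for one document ID.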

Configure the component

You can configure the component by using one of the following methods:
  • Machine Learning Platform for AI console
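    Tab | Parameter | Description
    Fields Setting | Column of Marked Document IDs | The document ID column.
    Fields Setting | Word Splitting Result of Marked Documents | The column that contains the word splitting results of the documents.
    Parameters Setting | Output First N Keywords | The number of top keywords to output. The value must be an integer. Default value: 5.
    Parameters Setting | Window Size | The window size of the TextRank algorithm. The value must be an integer. Default value: 2.
    Parameters Setting | Damping Coefficient | The damping coefficient of the TextRank algorithm. Default value: 0.85.
    Parameters Setting | Maximum Iterations | The maximum number of iterations of the TextRank algorithm. Default value: 100.
    Parameters Setting | Convergence Coefficient | The convergence residual threshold of the TextRank algorithm. Default value: 0.000001.
    Tuning | Cores | The number of cores. Allocated by the system by default.
    Tuning | Memory Size per Core | The memory size of each core. Allocated by the system by default.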
  • PAI command
    PAI -name KeywordsExtraction      
        -DinputTableName=maple_test_keywords_basic_input    
        -DdocIdCol=docid -DdocContent=word    
        -DoutputTableName=maple_test_keywords_basic_output    
        -DtopN=19;
    Parameter | Required | Description | Default value
    inputTableName | Yes | The name of the input table. | No default value
    inputTablePartitions | No | The partitions to read from the input table, in the format partition_name=value. For multi-level partitions, use the format name1=value1/name2=value2. Separate multiple partitions with commas (,). | All partitions
    outputTableName | Yes | The name of the output table. | No default value
    docIdCol | Yes | The document ID column. You can specify only one column. | No default value
    docContent | Yes | The word column. You can specify only one column. | No default value
    topN | No | The number of top keywords to output. If the value is greater than the total number of keywords, all keywords are returned. | 5
    windowSize | No | The window size of the TextRank algorithm. | 2
    dumpingFactor | No | The damping coefficient of the TextRank algorithm. | 0.85
    maxIter | No | The maximum number of iterations of the TextRank algorithm. | 100
    epsilon | No | The convergence residual threshold of the TextRank algorithm. | 0.000001
    lifecycle | No | The lifecycle of the output table. | No default value
    coreNum | No | The number of cores. | Automatically calculated
    memSizePerCore | No | The memory size of each core. Unit: MB. | Automatically calculated
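To show how dumpingFactor, maxIter, epsilon, and topN interact, the following sketch runs a PageRank-style iteration over a small word graph. It is an illustration under stated assumptions, not the component's code: the function name rank_words is hypothetical, the graph is assumed to be undirected (symmetric), and the update uses the normalized form (1 - d) / N + d * sum(...), which may differ from the formula the component applies.

```python
def rank_words(graph, damping=0.85, max_iter=100, epsilon=1e-6, top_n=5):
    """Score words in a symmetric weighted graph and return the top_n (word, score) pairs."""
    nodes = list(graph)
    n = len(nodes)
    if n == 0:
        return []
    scores = {w: 1.0 / n for w in nodes}
    # Total edge weight leaving each node, used to split its score among neighbors.
    out_weight = {w: sum(graph[w].values()) for w in nodes}
    for _ in range(max_iter):
        new_scores = {}
        for w in nodes:
            # Because the graph is symmetric, the neighbors of w are also its in-links.
            incoming = sum(
                scores[v] * graph[v][w] / out_weight[v]
                for v in graph[w]
                if out_weight[v] > 0
            )
            new_scores[w] = (1.0 - damping) / n + damping * incoming
        # Stop early once the largest score change falls below epsilon.
        delta = max(abs(new_scores[w] - scores[w]) for w in nodes)
        scores = new_scores
        if delta < epsilon:
            break
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    # Slicing returns every word when top_n exceeds the number of words.
    return ranked[:top_n]


toy_graph = {
    "solver": {"finite": 1.0, "element": 1.0, "difference": 1.0},
    "finite": {"solver": 1.0},
    "element": {"solver": 1.0},
    "difference": {"solver": 1.0},
}
top_two = rank_words(toy_graph, top_n=2)
all_words = rank_words(toy_graph, top_n=10)
print(top_two)
```

A larger epsilon or a smaller maxIter ends the loop sooner at the cost of less settled scores. The damping value controls how much of each score comes from neighbors rather than from the uniform baseline.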

Example

  1. Input
    Separate the words in the input table with spaces, and filter out stop words (such as "of") and all punctuation marks.
    docid:string word:string
    doc0 Blended Wing Body (BWB) aircraft new direction development aviation future Many research institutions home abroad have carried out research on BWB aircraft fully automatic shape optimization algorithm has become new research hotspot Based existing results home abroad methods characteristics commonly used modeling solution platforms analyzed compared Geometric modeling meshing flow field solution shape optimization BWB aircraft shape optimization designed optimization module compares advantages disadvantages different algorithms realizes shape optimization BWB aircraft conceptual design geometric modeling mesh generation module implements mesh generation algorithm based on over-limit interpolation modeling method based on splines flow field solving module includes finite difference solver finite element solver surface element method solver Among them finite difference solver mainly includes mathematical modeling potential flow based finite difference method variable step difference format derivation based Cartesian grid generation index algorithm Cartesian grid Neumann based Cartesian grid boundary condition expression form deduced calculation example two-dimensional airfoil aerodynamic parameters based finite difference solver realized finite element solver mainly includes theoretical modeling potential flow finite element based variational principle derivation two-dimensional finite element Kutta conditional expressions design velocity solving algorithm based on least squares two-dimensional airfoil space with wake based on Gmsh grid generator developed calculation example two-dimensional airfoil aerodynamic parameters based finite element solver realized panel method solver mainly includes potential flow theory modeling based panel method automatic wake generation algorithm design surface-based development three-dimensional BWB flow field solver based element method design resistance estimation algorithm based Brasius plate solution transplantation 
solver Fortran language mixed compilation Python Fortran code parallelization based on OpenMP CUDA Accelerate design development algorithm realize calculation example aerodynamic parameters three-dimensional BWB based panel method solver shape optimization module implements mesh deformation algorithm based on free shape deformation genetic algorithm differential evolution algorithm aircraft surface area calculation algorithm aircraft volume calculation algorithm based on moment integral Developed data visualization format tool based on VTK
  2. PAI command
    PAI -name KeywordsExtraction      
        -DinputTableName=maple_test_keywords_basic_input    
        -DdocIdCol=docid -DdocContent=word    
        -DoutputTableName=maple_test_keywords_basic_output    
        -DtopN=19;
  3. Output
    docid keywords weight
    doc0 based on 0.041306752223538405
    doc0 algorithm 0.03089845626854151
    doc0 modeling 0.021782865850562882
    doc0 grid 0.020669749212693957
    doc0 solver 0.020245609506360847
    doc0 aircraft 0.019850761705313365
    doc0 research 0.014193732541852615
    doc0 finite element 0.013831122054200538
    doc0 solving 0.012924593244133104
    doc0 module 0.01280216562287212
    doc0 derivation 0.011907588923852495
    doc0 shape 0.011505456605632607
    doc0 difference 0.011477831662367547
    doc0 flow 0.010969269350293957
    doc0 design 0.010830986516637251
    doc0 implementation 0.010747536556701583
    doc0 two-dimensional 0.010695570768457084
    doc0 development 0.010527342662670088
    doc0 new 0.010096978306668461