This topic describes the Keyword Extraction component provided by Machine Learning Designer (formerly known as Machine Learning Studio).

Keyword extraction is a core technique in natural language processing that identifies the most representative words in a document. The Keyword Extraction component is based on TextRank, a variation of the PageRank algorithm. TextRank builds a network from the co-occurrence relationships between words within a sliding window, calculates the importance of each word, and selects the words with the largest weights as keywords.
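
The following Python sketch illustrates the general TextRank idea described above. It is not the component's actual implementation: the function name, the undirected and unweighted co-occurrence graph, the uniform initial scores, and the interpretation of the window size as "number of consecutive tokens covered" are all assumptions made for illustration. The default values mirror the defaults documented for the component.

```python
from collections import defaultdict


def textrank_keywords(words, top_n=5, window_size=2, damping=0.85,
                      max_iter=100, epsilon=1e-6):
    """Rank the words of one tokenized document with a simplified TextRank."""
    # Build an undirected co-occurrence graph: two distinct words are linked
    # when they fall inside the same window of `window_size` tokens (assumption).
    neighbors = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1:i + window_size]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    if not neighbors:
        return []

    # Start from uniform scores, then apply the PageRank update until the
    # largest per-word change drops below `epsilon` or `max_iter` is reached.
    count = len(neighbors)
    scores = {word: 1.0 / count for word in neighbors}
    for _ in range(max_iter):
        updated = {}
        for word in neighbors:
            incoming = sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            updated[word] = (1.0 - damping) / count + damping * incoming
        delta = max(abs(updated[w] - scores[w]) for w in neighbors)
        scores = updated
        if delta < epsilon:
            break

    # Words with the largest weights are returned as keywords.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]
```

For example, `textrank_keywords("hub a hub b hub c".split(), top_n=2)` ranks `hub` first, because it co-occurs with every other word in the document.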

The keyword extraction process includes the following steps:
  1. Raw corpora preparation
  2. Tokenization
  3. Word-based filtering
  4. Keyword extraction
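
Steps 2 and 3 above can be sketched in Python as follows. This is a minimal illustration, assuming English text, a regular-expression tokenizer, and a tiny hypothetical stop-word list. The function name `prepare_document` and the `STOP_WORDS` set are not part of PAI. The output is a single space-separated string per document, which is the form the component expects in its word column.

```python
import re

# Hypothetical stop-word list for illustration only; a real pipeline would
# use a much larger, language-specific list.
STOP_WORDS = {"a", "an", "and", "for", "in", "is", "of", "the"}


def prepare_document(raw_text, stop_words=STOP_WORDS):
    """Tokenize raw text, drop stop words, and return space-separated words."""
    # Step 2: tokenization. Keep runs of letters and digits, allowing inner
    # hyphens so that compounds such as "blended-wing-body" stay one token.
    # Punctuation marks never match, so they are dropped here.
    tokens = re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", raw_text)
    # Step 3: word-based filtering. Remove stop words (case-insensitive).
    kept = [token for token in tokens if token.lower() not in stop_words]
    # One space-separated string per document.
    return " ".join(kept)
```

A call such as `prepare_document("The blended-wing-body aircraft is a new direction.")` keeps the hyphenated compound intact and drops "The", "is", "a", and the final period.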

Configure the component

You can use one of the following methods to configure the Keyword Extraction component.

Method 1: Configure the component on the pipeline page

You can configure the parameters of the Keyword Extraction component on the pipeline page of Machine Learning Designer in Machine Learning Platform for AI (PAI). The following table describes the parameters.
| Tab | Parameter | Description |
| --- | --- | --- |
| Fields Setting | Column of Marked Document IDs | The name of the document ID column. |
| Fields Setting | Word Splitting Result of Marked Documents | The column that contains the word splitting results of the documents. |
| Parameters Setting | Output First N Keywords | The number of top-ranked keywords to output. The value must be an integer. Default value: 5. |
| Parameters Setting | Window Size | The window size. The value must be an integer. Default value: 2. |
| Parameters Setting | Damping Coefficient | The damping coefficient. Default value: 0.85. |
| Parameters Setting | Maximum Iterations | The maximum number of iterations. Default value: 100. |
| Parameters Setting | Convergence Coefficient | The convergence coefficient. Default value: 0.000001. |
| Tuning | Cores | The number of cores. By default, the system determines the value. |
| Tuning | Memory size per core | The memory size of each core. By default, the system determines the value. |

Method 2: Use PAI commands

Configure the component parameters by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.
PAI -name KeywordsExtraction      
    -DinputTableName=maple_test_keywords_basic_input    
    -DdocIdCol=docid -DdocContent=word    
    -DoutputTableName=maple_test_keywords_basic_output    
    -DtopN=19;
| Parameter | Required | Description | Default value |
| --- | --- | --- | --- |
| inputTableName | Yes | The name of the input table. | No default value |
| inputTablePartitions | No | The partitions of the input table to use, in the format Partition_name=value. For multi-level partitions, use the format name1=value1/name2=value2. To specify multiple partitions, separate them with commas (,). | All partitions |
| outputTableName | Yes | The name of the output table. | No default value |
| docIdCol | Yes | The name of the document ID column. You can specify only one column. | No default value |
| docContent | Yes | The name of the word column. You can specify only one column. | No default value |
| topN | No | The number of top-ranked keywords to output. If the value is greater than the total number of keywords, all keywords are output. | 5 |
| windowSize | No | The window size of the TextRank algorithm. | 2 |
| dumpingFactor | No | The damping coefficient of the TextRank algorithm. | 0.85 |
| maxIter | No | The maximum number of iterations of the TextRank algorithm. | 100 |
| epsilon | No | The convergence residual threshold of the TextRank algorithm. | 0.000001 |
| lifecycle | No | The lifecycle of the output table. | No default value |
| coreNum | No | The number of cores. | Determined by the system |
| memSizePerCore | No | The memory size of each core. Unit: MB. | Determined by the system |
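
If you generate PAI commands from a script, a small helper such as the following Python sketch can assemble the command text from the parameters in the preceding table. The helper name `build_keywords_extraction_command` is hypothetical and is not part of any PAI SDK. It only builds a string and does not submit anything. The accepted parameter names are taken from the table, and the output layout follows the sample command.

```python
def build_keywords_extraction_command(input_table, output_table, doc_id_col,
                                      doc_content_col, **optional):
    """Assemble a PAI KeywordsExtraction command string (illustrative helper)."""
    # Optional parameter names copied from the parameter table.
    allowed = {"inputTablePartitions", "topN", "windowSize", "dumpingFactor",
               "maxIter", "epsilon", "lifecycle", "coreNum", "memSizePerCore"}
    unknown = set(optional) - allowed
    if unknown:
        raise ValueError(f"Unsupported parameters: {sorted(unknown)}")

    # Required parameters first, then any optional ones the caller supplied.
    params = {
        "inputTableName": input_table,
        "docIdCol": doc_id_col,
        "docContent": doc_content_col,
        "outputTableName": output_table,
    }
    params.update(optional)

    lines = ["PAI -name KeywordsExtraction"]
    lines += [f"    -D{name}={value}" for name, value in params.items()]
    return "\n".join(lines) + ";"
```

Values are inserted with Python's default string conversion, so pass small decimals as strings (for example, `epsilon="0.000001"`) if you want to avoid scientific notation in the generated command.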

Example

  1. Input data
    Separate words in the input table with spaces, and filter out stop words such as "of" and all punctuation marks.
    | docid:string | word:string |
    | --- | --- |
    | doc0 | The blended-wing-body aircraft is a new direction for the future development in the aviation field Many research institutions inside and outside China have carried out research on the blended-wing-body aircraft while its fully automated shape optimization algorithm has become a new hot topic Based on the existing research achievements inside and outside China common modeling and flow solver tools have been analyzed and compared The geometric modeling grid flow field solver and shape optimization modules have been designed The pros and cons between different algorithms have been compared to achieve the optimized shape of the blended-wing-body aircraft in the conceptual design stage Geometric modeling and grid generation module are achieved based on the transfinite interpolation algorithm and spline based grid generation method The flow solver module includes the finite difference solver the finite element solver and the panel method solver The finite difference solver includes mathematical modeling of the potential flow the derivation of the Cartesian grid based variable step length difference scheme Cartesian grid generation and indexing algorithm the Cartesian grid based Neumann boundary conditions expression form derivation are achieved based on finite element difference solver The aerodynamic parameters of a two-dimensional airfoil are calculated based on the finite difference solver The finite element solver includes potential flow modeling based on the variational principle of the finite element theory the derivation of the two-dimensional finite element Kutta conditional least squares based speed solving algorithm Gmsh based two-dimensional field grid generator of airfoil with wakes design The aerodynamic parameters of a two-dimensional airfoil are calculated based on the finite element solver The panel method solver includes modeling and automatic wake generation the design of the three-dimensional flow solver of the blended-wing-body drag estimation based on the Blasius solution solver implemented in the Fortran language a mixed compilation of Python and Fortran OpenMP and CUDA based acceleration algorithm The aerodynamic parameters of a three-dimensional wing body are calculated based on the panel method solver The shape optimization module includes free form deformation algorithm genetic algorithms differential evolution algorithm Aircraft surface area calculation algorithm is based on the moments integration algorithm The volume of an aircraft calculation algorithm is based on VKT data visualization format tool |
  2. PAI command
    PAI -name KeywordsExtraction      
        -DinputTableName=maple_test_keywords_basic_input    
        -DdocIdCol=docid -DdocContent=word    
        -DoutputTableName=maple_test_keywords_basic_output    
        -DtopN=19;
  3. Output description
    | docid | keywords | weight |
    | --- | --- | --- |
    | doc0 | based on | 0.041306752223538405 |
    | doc0 | algorithm | 0.03089845626854151 |
    | doc0 | modeling | 0.021782865850562882 |
    | doc0 | grid | 0.020669749212693957 |
    | doc0 | solver | 0.020245609506360847 |
    | doc0 | aircraft | 0.019850761705313365 |
    | doc0 | research | 0.014193732541852615 |
    | doc0 | finite element | 0.013831122054200538 |
    | doc0 | solving | 0.012924593244133104 |
    | doc0 | module | 0.01280216562287212 |
    | doc0 | derivation | 0.011907588923852495 |
    | doc0 | shape | 0.011505456605632607 |
    | doc0 | difference | 0.011477831662367547 |
    | doc0 | flow | 0.010969269350293957 |
    | doc0 | design | 0.010830986516637251 |
    | doc0 | implementation | 0.010747536556701583 |
    | doc0 | two-dimensional | 0.010695570768457084 |
    | doc0 | development | 0.010527342662670088 |
    | doc0 | new | 0.010096978306668461 |