This topic describes how to use the text analysis components that Machine Learning Platform for AI (PAI) provides to automatically classify commodity tags.

Background information

The description of a commodity may contain multiple tags that describe the commodity from different dimensions. For example, the description of a pair of shoes may be "Korean Girl Dr. Martens Women's Preppy/British-style Lace-up Dull-polish Ankle High Platform Leather Boots." A bag may be described as "Discount Every Day 2016 Autumn and Winter New Arrival Women's Korean-style Seashell-shaped Tassel Three-way Bag as a Messenger Bag, Hand Carry Bag, and Shoulder Bag." The tags describe commodities from dimensions such as production time, style, and place of origin. On an e-commerce platform, tens of thousands of commodities involve a vast number of tags. How to classify the tags based on dimensions is a key issue that an e-commerce platform must resolve. The text analysis components that PAI provides can automatically learn tag words and classify tags.

Dataset

The experiment described in this topic is based on a dataset that comes from real-life shopping data. The dataset contains more than 2,000 commodity descriptions. Each description is a group of tags.

Procedure

  1. Go to the Machine Learning Studio console.
    1. Log on to the PAI console.
    2. In the left-side navigation pane, choose Model Training > Studio-Modeling Visualization.
    3. On the PAI Visualization Modeling page, find the project in which you want to create an experiment and click Machine Learning in the Operation column.
  2. Create an experiment.
    1. In the left-side navigation pane, click Home.
    2. In the upper-right corner of the Templates page, choose New > New Experiment.
    3. In the New Experiment dialog box, specify the experiment parameters.
      Parameter     Description
      Name          The name of the experiment. Enter Auto Tag Classification.
      Project       The project in which you want to create the experiment. You cannot change the value of this parameter.
      Description   The description of the experiment. Enter Use text analysis components.
      Save To       The directory for storing the experiment. Select My Experiments.
    4. Click OK.
  3. Configure and run the experiment.
    1. In the left-side navigation pane, click Components.
    2. In the left-side Component Descriptions pane, click Data Source/Target and drag the Read MaxCompute Table component to the canvas. Rename the component shopping_data-1.
    3. Click Text Analysis and drag the Word Splitting, Word Frequency Statistics, and Word2Vec components to the canvas.
    4. Click Data Preprocessing and drag the Add ID Column and Data Type Conversion components to the canvas.
    5. Choose Machine Learning Platform for AI > Clustering and drag the K-means Clustering component to the canvas.
    6. Click Tools and drag the SQL Script component to the canvas.
    7. Drag directed lines to connect the preceding components.
      Area No. Description
      1 The shopping_data-1 component imports data from the dataset. The Word Splitting-1 component splits the commodity description in each data record into words separated by spaces.
      2 The Add ID Column-1 component adds an ID column to the data read from the dataset. The data imported from the dataset contains only one column. Therefore, an ID column is required to serve as the primary key of the data records.
      3 The Word Frequency Statistics-1 component counts the number of occurrences of each word that appears in the commodity descriptions.
      4 The Word2Vector-1 component assigns each word a vector in a 100-dimensional vector space. Word vectors indicate the semantic similarity between words.
      • Words that have similar vector values are semantically similar.
      • Different words have different vector values. The distance between two vectors indicates how closely the words are semantically related.
      This way, the Word2Vector-1 component maps all words to a vector space.
      5 The K-means Clustering-1 component divides the word vectors into clusters by calculating the distances between word vectors. This way, the words are automatically classified based on their semantic meanings. The clustering result shows the cluster to which each word belongs.
      6 The SQL Script-1 component verifies the classification result. You can query the data in a cluster to check whether words are correctly classified. In the following example, data is queried in the cluster whose cluster_index value is 10. The result shows that the cluster contains geographical words. However, some unrelated words, such as nut, are also assigned to the cluster. This mistake is probably caused by insufficient training samples. A larger set of training samples can produce more accurate classification results.
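
The data flow described in the table above can be sketched outside of PAI in a few lines of Python. This is only an illustration, not the implementation that the PAI components use. The sample descriptions, the two-dimensional word vectors (a hand-written stand-in for the 100-dimensional Word2Vec output), and all function names are assumptions made for this sketch.

```python
from collections import Counter

# Hypothetical descriptions, already split into space-separated words
# (the role of the word splitting step).
descriptions = [
    "korean style leather boots",
    "british style leather boots",
    "korean style shoulder bag",
]

def add_id_column(rows):
    """Pair each row with a sequential ID that acts as a primary key."""
    return list(enumerate(rows))

def word_frequencies(id_rows):
    """Count the occurrences of each word across all descriptions."""
    counts = Counter()
    for _row_id, text in id_rows:
        counts.update(text.split())
    return counts

def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iterations=10):
    """Plain Lloyd's k-means; the first k vectors seed the centroids."""
    centroids = [list(v) for v in vectors[:k]]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assign every vector to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment

# Made-up 2-D vectors standing in for real Word2Vec output.
word_vectors = {
    "korean": (0.9, 0.1),
    "boots": (0.1, 0.9),
    "british": (0.8, 0.2),
    "bag": (0.2, 0.8),
}

frequencies = word_frequencies(add_id_column(descriptions))
words = list(word_vectors)
clusters = dict(zip(words, kmeans([word_vectors[w] for w in words], k=2)))

# Inspect one cluster, similar in spirit to filtering on a cluster index.
cluster_0 = sorted(w for w, c in clusters.items() if c == 0)
print(frequencies.most_common(3))
print(cluster_0)
```

In this toy data, the place-of-origin words end up in one cluster and the product-type words in the other because their stand-in vectors lie close together. With real embeddings trained on few samples, unrelated words can land in the same cluster, as the nut example shows.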