In most cases, you must prepare and preprocess the data that is required to build and test a model, and then further process the prepared data based on your business requirements for model development. This topic describes how to prepare and preprocess data in Machine Learning Platform for AI (PAI). In this example, public data provided by PAI is used.

Prerequisites

A pipeline is created. For more information, see Create a custom pipeline.

Step 1: Go to the pipeline configuration page

  1. Log on to the PAI console. In the left-side navigation pane, click Workspaces. On the page that appears, click the name of the workspace that you want to use.
  2. In the left-side navigation pane, choose Model Training > Visualized Modeling (Designer). The Visualized Modeling (Machine Learning Designer) page appears.
  3. On the Visualized Modeling (Machine Learning Designer) page, select the pipeline that you have created and click Enter pipeline.

Step 2: Prepare data

In this example, public data provided by PAI on heart disease cases is used. You can use the Read Table component to read the public data without the need to create a table or write data to the table.
Note During your own development, you often need to prepare a table in MaxCompute or Object Storage Service (OSS). Then, you need to use a Data Source/Target component such as Read Table, Write Table, or Read File Data to read data from or write data to the table. For more information, see the topics in Component reference: data source or destination.
  1. In the left-side component list, enter a keyword in the search box to search for the Read Table component.
  2. Drag the Read Table component to the canvas on the right. A pipeline node named Read Table-1 is automatically generated.
  3. Click the Read Table-1 node. On the Select Table tab in the right-side pane of the canvas, set the Table Name parameter to pai_online_project.heart_disease_prediction to read public data on heart disease cases.
    You can click the Fields Information tab to view the details of the fields in the public data.

Step 3: Preprocess data

In this example, the public data on heart disease cases is used as raw data, and all field values of the raw data are normalized during preprocessing. To normalize the field values, perform the following steps:
  1. Convert all non-numeric fields in the raw data into numeric fields by using an SQL statement. This ensures that all fields are numeric fields after preprocessing.
  2. Convert all fields into the DOUBLE type. This ensures that the data type of all fields meets the requirements of normalization.
  3. Normalize the values of all fields in the table.
The following section describes the detailed operations.
  1. Convert non-numeric fields into numeric fields.
    1. In the left-side component list, enter a keyword in the search box to search for the SQL Script component.
    2. Drag the SQL Script component to the canvas on the right. A pipeline node named SQL Script-1 is automatically generated.
    3. Draw a line from the Read Table-1 node to the t1 input port of the SQL Script-1 node. This makes the SQL Script-1 node a downstream node of the Read Table-1 node and passes the output of Read Table-1 to the t1 input port.
    4. Select the SQL Script-1 node. On the Parameter Setting tab in the right-side pane, t1 is automatically populated into the Input Source field. Then, enter the following SQL code in the SQL Script code editor:
      select age,
        (case sex when 'male' then 1 else 0 end) as sex,
        (case cp when 'angina' then 0 when 'notang' then 1 else 2 end) as cp,
        trestbps,
        chol,
        (case fbs when 'true' then 1 else 0 end) as fbs,
        (case restecg when 'norm' then 0 when 'abn' then 1 else 2 end) as restecg,
        thalach,
        (case exang when 'true' then 1 else 0 end) as exang,
        oldpeak,
        (case slop when 'up' then 0 when 'flat' then 1 else 2 end) as slop,
        ca,
        (case thal when 'norm' then 0 when 'fix' then 1 else 2 end) as thal,
        (case status when 'sick' then 1 else 0 end) as status
      from ${t1};
      Note The SQL Script-1 node has four input ports: t1, t2, t3, and t4. In the preceding sample code, ${t1} indicates that the t1 input port is used. If you use a different input port, the port name is automatically populated into the Input Source field on the Parameter Setting tab of the SQL Script-1 node. In this case, you must change the port name in the preceding code accordingly.
    5. Click the Run icon in the upper part of the canvas. The Read Table-1 and SQL Script-1 nodes are run in sequence.
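The CASE ... WHEN mappings in the preceding SQL script can be sketched in plain Python for local verification. This is only an illustration of the same encoding logic; the `CASE_MAPS` table and `encode_row` helper are hypothetical and are not part of PAI:

```python
# Local sketch of the CASE ... WHEN mappings in the SQL script above.
# Values that are not listed fall through to the default, matching the
# ELSE branch in SQL.
CASE_MAPS = {
    "sex":     ({"male": 1}, 0),
    "cp":      ({"angina": 0, "notang": 1}, 2),
    "fbs":     ({"true": 1}, 0),
    "restecg": ({"norm": 0, "abn": 1}, 2),
    "exang":   ({"true": 1}, 0),
    "slop":    ({"up": 0, "flat": 1}, 2),
    "thal":    ({"norm": 0, "fix": 1}, 2),
    "status":  ({"sick": 1}, 0),
}

def encode_row(row):
    """Map categorical string fields to numbers; keep numeric fields as-is."""
    out = {}
    for col, value in row.items():
        if col in CASE_MAPS:
            mapping, default = CASE_MAPS[col]
            out[col] = mapping.get(value, default)
        else:
            out[col] = value
    return out

row = {"age": 63, "sex": "male", "cp": "angina", "fbs": "true", "status": "sick"}
print(encode_row(row))
# {'age': 63, 'sex': 1, 'cp': 0, 'fbs': 1, 'status': 1}
```

After this step, every field in the output table holds a numeric value, which is the precondition for the type conversion and normalization steps that follow.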
  2. Convert all fields into the DOUBLE type.
    1. In the left-side component list, enter a keyword in the search box to search for the Data Type Conversion component.
    2. Drag the Data Type Conversion component to the canvas on the right. A pipeline node named Data Type Conversion-1 is automatically generated.
    3. Draw a line from the SQL Script-1 node to the Data Type Conversion-1 node.
    4. Click the Data Type Conversion-1 component on the canvas. On the Fields Setting tab in the right-side pane, click Select Fields in the Convert to Double Type Columns section and select all fields to convert them into the DOUBLE type.
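Locally, the effect of this step can be sketched as casting every field to `float`, the Python counterpart of the DOUBLE type. The `to_double` helper below is an illustrative stand-in for the component, not PAI code:

```python
def to_double(row):
    """Cast every field value to float (the Python counterpart of DOUBLE)."""
    return {col: float(value) for col, value in row.items()}

print(to_double({"age": 63, "sex": 1, "trestbps": 145}))
# {'age': 63.0, 'sex': 1.0, 'trestbps': 145.0}
```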
  3. Normalize field values.
    1. In the left-side component list, enter a keyword in the search box to search for the Normalization component.
    2. Drag the Normalization component to the canvas on the right. A pipeline node named Normalization-1 is automatically generated.
    3. Draw a line from the Data Type Conversion-1 node to the Normalization-1 node.
    4. Click the Normalization-1 component on the canvas. On the Fields Setting tab in the right-side pane, click Select Fields and select all fields in the dialog box that is displayed.
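Normalization rescales each field so that all fields share a comparable value range. A common formulation is min-max scaling to [0, 1]; the sketch below assumes that formulation for illustration, and the exact method used by the Normalization component may differ:

```python
def min_max_normalize(values):
    """Rescale a list of numbers to [0, 1] via min-max normalization."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: map every value to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_normalize([10.0, 20.0, 30.0]))
# [0.0, 0.5, 1.0]
```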
  4. In the left-side component list, enter a keyword in the search box to search for the Split component. Drag the Split component to the canvas on the right. Draw a line from the Normalization-1 node to the Split-1 node that is generated.
    By default, the Split component splits the raw data into a model training set and a model prediction set at a ratio of 4:1. To change the ratio, you can click the Split component and set the Splitting Fraction parameter on the Parameter Setting tab.
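A 4:1 ratio corresponds to a splitting fraction of 0.8 for the training set. The following is a local sketch of such a random split; the shuffling, seed, and `split_rows` helper are illustrative assumptions, not how the Split component is implemented:

```python
import random

def split_rows(rows, fraction=0.8, seed=42):
    """Randomly split rows into a training set and a prediction set."""
    rows = rows[:]                      # copy so the input list is untouched
    random.Random(seed).shuffle(rows)   # deterministic shuffle for the sketch
    cut = int(len(rows) * fraction)
    return rows[:cut], rows[cut:]

train, pred = split_rows(list(range(100)))
print(len(train), len(pred))
# 80 20
```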
  5. In the top toolbar of the canvas, click Save.

Debug and run the pipeline

In the upper part of the canvas, click the Run icon. All nodes are then run in sequence. After a node is successfully run, a success icon appears in the node box. You can right-click a successfully run node and select View Data to check whether the output data is correct.
Note If the pipeline is complex, you can save and run the pipeline each time you add a node to the pipeline. If a node fails to run, you can right-click the node and select View Log to troubleshoot the failure.

What to do next

After the data is preprocessed, you can visualize the data. For more information, see Data visualization.