Platform for AI: Predict heart disease

Last Updated: Jan 23, 2024

Heart disease threatens human lives. Analyzing the body indicators associated with heart disease in medical examination data can help prevent it. This topic describes how to use data mining algorithms in Platform for AI (PAI) to build a heart disease prediction model from the medical examination data of heart disease patients.

Prerequisites

Data mining procedure

[Figure: data mining procedure]

Datasets

The pipeline described in this topic uses an open source dataset from the UCI Machine Learning Repository. For more information, see Heart Disease Data Set. The dataset contains the medical examination data of 303 heart disease patients in an area of the United States. The following table describes the fields in the dataset.

| Field | Type | Description |
| --- | --- | --- |
| age | STRING | The age of the patient. |
| sex | STRING | The gender of the patient. Valid values: female and male. |
| cp | STRING | The type of chest pain that the patient has. Valid values: typical, atypical, non-anginal, and asymptomatic. |
| trestbps | STRING | The resting blood pressure level of the patient. |
| chol | STRING | The serum cholesterol level of the patient. |
| fbs | STRING | The fasting blood sugar level of the patient. If the fasting blood sugar level is greater than 120 mg/dl, the value is set to true. Otherwise, the value is set to false. |
| restecg | STRING | The resting electrocardiogram (ECG) result of the patient. Valid values: norm and hyp. |
| thalach | STRING | The maximum heart rate of the patient. |
| exang | STRING | Indicates whether the patient has exercise-induced angina. Valid values: true and false. |
| oldpeak | STRING | The ST depression that is induced by exercise relative to rest. |
| slop | STRING | The slope of the peak exercise ST segment. Valid values: down, flat, and up. |
| ca | STRING | The number of major vessels that are colored by fluoroscopy. |
| thal | STRING | The type of defect that the patient has. Valid values: norm, fix, and rev. |
| status | STRING | The presence of heart disease in the patient. Valid values: buff and sick. |

Predict heart disease

  1. Go to the Machine Learning Designer page.

    1. Log on to the PAI console.

    2. In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to manage.

    3. In the left-side navigation pane, choose Model Training > Visualized Modeling (Designer) to go to the Machine Learning Designer page.

  2. Build a pipeline.

    1. On the Visualized Modeling (Designer) page, click the Preset Templates tab.

    2. In the Heart Disease Prediction section of the Pipeline Template tab, click Create.

    3. In the Create Pipeline dialog box, configure the required parameters. You can use the default values.

      The value specified for the Pipeline Data Path parameter is the Object Storage Service (OSS) bucket path of the temporary data and models generated during the runtime of the pipeline.

    4. Click OK.

      It takes about 10 seconds to create the pipeline.

    5. On the Pipeline list tab, find the pipeline named Heart Disease Prediction and click Open.

    6. View the components of the pipeline on the canvas as shown in the following figure. The system automatically creates the pipeline based on the preset template.

      [Figure: heart disease prediction pipeline]

      Area

      Description

      The components displayed in this section preprocess data. For example, the components denoise the data, fill missing values, and convert values to numbers. In each sample, the patient is either healthy or sick. Therefore, heart disease prediction in this pipeline is a binary classification problem. The dataset used in this pipeline contains 13 feature fields and one target field (status). During data preprocessing, the values of each field must be converted to numbers based on the meaning of the field. Conversion rules:

      • Two-valued field: The component converts one value to 0 and the other value to 1. For example, the value of the sex field is female or male. After the conversion, 0 specifies female and 1 specifies male.

      • Multi-valued field: The component converts the values to 0, 1, 2, or 3. For example, the cp field has four values that specify the type of chest pain from mild to severe. After the conversion, values 0 to 3 specify the level of chest pain from mild to severe.

      Sample SQL script:

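      -- Convert categorical string fields to numeric codes.
      -- Numeric fields (age, trestbps, chol, thalach, oldpeak, ca) pass through unchanged.
      select age,
      (case sex when 'male' then 1 else 0 end) as sex,                               -- male = 1, otherwise 0
      (case cp when 'angina' then 0 when 'notang' then 1 else 2 end) as cp,          -- any other value = 2
      trestbps,
      chol,
      (case fbs when 'true' then 1 else 0 end) as fbs,                               -- true = 1, otherwise 0
      (case restecg when 'norm' then 0 when 'abn' then 1 else 2 end) as restecg,     -- any other value = 2
      thalach,
      (case exang when 'true' then 1 else 0 end) as exang,                           -- true = 1, otherwise 0
      oldpeak,
      (case slop when 'up' then 0 when 'flat' then 1 else 2 end) as slop,            -- any other value = 2
      ca,
      (case thal when 'norm' then 0 when 'fix' then 1 else 2 end) as thal,           -- any other value = 2
      (case status when 'sick' then 1 else 0 end) as ifHealth                        -- target: sick = 1, otherwise 0
      from ${t1};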
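For readers who want to try the same conversion outside the pipeline, the following is a minimal Python sketch of the CASE expressions in the SQL script. The helper name `encode_row`, the lookup tables, and the sample record are hypothetical and for illustration only; the sample values are not quoted from the dataset.

```python
# Hypothetical helper that mirrors the CASE expressions in the SQL script.

# Two-valued fields: the listed value maps to 1, anything else to 0.
BINARY_POSITIVE = {"sex": "male", "fbs": "true", "exang": "true"}

# Multi-valued fields: first listed value -> 0, second -> 1, anything else -> 2.
ORDERED_VALUES = {
    "cp": ("angina", "notang"),
    "restecg": ("norm", "abn"),
    "slop": ("up", "flat"),
    "thal": ("norm", "fix"),
}

# Fields that the SQL script selects without modification.
PASS_THROUGH = ("age", "trestbps", "chol", "thalach", "oldpeak", "ca")


def encode_row(row):
    """Return a new dict with categorical fields replaced by numeric codes."""
    encoded = {name: row[name] for name in PASS_THROUGH}
    for name, positive in BINARY_POSITIVE.items():
        encoded[name] = 1 if row[name] == positive else 0
    for name, (first, second) in ORDERED_VALUES.items():
        if row[name] == first:
            encoded[name] = 0
        elif row[name] == second:
            encoded[name] = 1
        else:
            encoded[name] = 2
    # Target column: 1 means sick, 0 means any other status.
    encoded["ifHealth"] = 1 if row["status"] == "sick" else 0
    return encoded


# Illustrative record (made-up values).
sample = {
    "age": "63", "sex": "male", "cp": "angina", "trestbps": "145",
    "chol": "233", "fbs": "true", "restecg": "hyp", "thalach": "150",
    "exang": "false", "oldpeak": "2.3", "slop": "down", "ca": "0",
    "thal": "fix", "status": "buff",
}
print(encode_row(sample))
```

As in the SQL script, the numeric fields stay as strings at this stage; converting them to a numeric type is left to a later step.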

      The components displayed in this section perform feature engineering, including feature derivation and scaling. The Type Transform component converts input feature data to the DOUBLE type because a logistic regression model accepts only input data of the DOUBLE type. Then, the Feature Select Runner component measures the impact of each feature on the prediction results by using the entropy and Gini index. The Normalize component rescales the values of each feature to the range from 0 to 1. This removes the effect of differing units and value ranges on the prediction results. The normalization formula is result=(val-min)/(max-min).
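The normalization formula and the two impurity measures named above can be sketched in a few lines of Python. This is an illustration of the textbook definitions only; how the Feature Select Runner component combines entropy and the Gini index internally is not described here, and the function names and the handling of a constant column are assumptions.

```python
import math
from collections import Counter


def min_max_normalize(values):
    """Apply result = (val - min) / (max - min) to a list of numbers."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # Assumption: a constant column is mapped to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())


print(min_max_normalize([120, 140, 200]))
print(entropy([0, 0, 1, 1]), gini([0, 0, 1, 1]))
```

A feature that splits the samples into groups with lower entropy or Gini impurity than the full dataset carries more information about the target.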

      The components displayed in this section train the model and perform prediction.

      1. The Split component divides the dataset into a training dataset and a prediction dataset at a 7:3 ratio.

      2. The Logistic Regression component trains the model.

        Note

        If you want to export PMML model files, select the Whether To Generate PMML check box on the Field Setting tab. Click a blank area on the canvas and specify Data Storage on the Pipeline Attributes tab.

      3. The trained model and the prediction dataset are passed to the Prediction component, which generates the prediction results.
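The split, train, and predict steps above can be approximated with a small, dependency-free Python sketch. It is not the implementation used by the PAI components: the function names, the batch gradient descent optimizer, the learning rate, the number of epochs, and the toy data are all assumptions chosen to keep the example short.

```python
import math
import random


def split_rows(rows, train_fraction=0.7, seed=0):
    """Shuffle the rows and split them at the given ratio (7:3 by default)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, bias, x):
    """Probability that sample x belongs to class 1."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train_logistic_regression(features, labels, learning_rate=0.5, epochs=500):
    """Fit weights and bias with plain batch gradient descent on the log loss."""
    n_samples, n_features = len(features), len(features[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(features, labels):
            error = predict_proba(weights, bias, x) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - learning_rate * g / n_samples
                   for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n_samples
    return weights, bias


# Toy, already-normalized data with one feature (made-up values).
toy_x = [[0.0], [0.1], [0.2], [0.8], [0.9], [1.0]]
toy_y = [0, 0, 0, 1, 1, 1]
w, b = train_logistic_regression(toy_x, toy_y)
print(predict_proba(w, b, [0.0]), predict_proba(w, b, [1.0]))
```

With 303 rows, `split_rows` would place 212 rows in the training set and 91 in the prediction set.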

      The Confusion Matrix and Evaluate components evaluate the model.
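As a rough illustration of what a confusion matrix summarizes for a binary classifier, the sketch below counts the four outcome types and derives accuracy, precision, and recall from them. The function names and the sample labels are hypothetical, and the exact set of statistics shown by the Confusion Matrix component may differ.

```python
def confusion_matrix(actual, predicted):
    """Count true/false positives and negatives for 0/1 labels (1 = sick)."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["tp"] += 1
        elif a == 0 and p == 1:
            counts["fp"] += 1
        elif a == 1 and p == 0:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


def summarize(counts):
    """Derive accuracy, precision, and recall from the four counts."""
    tp, fp, fn, tn = counts["tp"], counts["fp"], counts["fn"], counts["tn"]
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up labels for illustration.
actual = [1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 0, 1, 1]
matrix = confusion_matrix(actual, predicted)
print(matrix, summarize(matrix))
```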

  3. Run the pipeline and view the results.

    1. In the upper-left corner of the canvas, click the Run icon.

    2. After the pipeline is run, right-click the Logistic Regression component on the canvas and choose Model Options > Export to PMML Files to export the trained heart disease prediction model.

    3. Right-click Prediction on the canvas and choose View Data > Prediction Result Output Port to view the prediction results.

  4. View the evaluation report of the model.

    1. Right-click Evaluate on the canvas and click Visual Analysis.

    2. In the Evaluate dialog box, click the Index data tab to view the metrics that are used to evaluate the model.

      [Figure: index data] In the evaluation report, the AUC value is greater than 0.9, which indicates that the model has strong predictive performance.

    3. Right-click Confusion Matrix on the canvas and click Visual Analysis.

    4. In the Confusion Matrix dialog box, click the Statistics tab to view the model statistics, such as model accuracy.
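AUC and accuracy measure different things: accuracy depends on a decision threshold, whereas AUC is the probability that a randomly chosen sick sample receives a higher score than a randomly chosen healthy one. The following sketch computes AUC directly from that pairwise definition. The function name and the sample scores are hypothetical, and the Evaluate component may use a different but equivalent calculation.

```python
def auc(labels, scores):
    """Area under the ROC curve from pairwise comparisons (ties count as 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels and predicted probabilities for illustration.
print(auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a dataset of a few hundred rows.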