Heart disease is a serious threat to health. Analyzing how different physiological indicators relate to heart disease, based on physical examination data, can support early prevention. This topic describes how to use data mining algorithms to build a heart disease prediction model from the physical examination data of heart disease patients.

Dataset

The experiment described in this topic uses an open source dataset from the UCI Machine Learning Repository. For more information, see Heart Disease Data Set. The dataset contains the physical examination data of 303 heart disease patients in an area of the United States. The following table describes the fields in the dataset.
Field | Data type | Description
--- | --- | ---
age | STRING | The age of the patient.
sex | STRING | The gender of the patient. Valid values: female and male.
cp | STRING | The type of chest pain that the patient has. Valid values: typical, atypical, non-anginal, and asymptomatic.
trestbps | STRING | The resting blood pressure level of the patient.
chol | STRING | The serum cholesterol level of the patient.
fbs | STRING | Indicates whether the fasting blood sugar level of the patient is greater than 120 mg/dl. Valid values: true and false.
restecg | STRING | The resting electrocardiogram (ECG) result of the patient. Valid values: norm and hyp.
thalach | STRING | The maximum heart rate that is achieved by the patient.
exang | STRING | Indicates whether the patient has exercise-induced angina. Valid values: true and false.
oldpeak | STRING | The ST depression that is induced by exercise relative to rest.
slop | STRING | The slope of the peak exercise ST segment. Valid values: down, flat, and up.
ca | STRING | The number of major vessels colored by fluoroscopy.
thal | STRING | The type of defect that the patient has. Valid values: norm, fix, and rev.
status | STRING | The presence of heart disease in the patient. Valid values: buff (healthy) and sick.

Procedure

  1. Go to the Machine Learning Studio console.
    1. Log on to the PAI console.
    2. In the left-side navigation pane, choose Model Training > Studio-Modeling Visualization.
    3. On the PAI Visualization Modeling page, find the project in which you want to create an experiment and click Machine Learning in the Operation column.
  2. Create an experiment.
    1. In the left-side navigation pane, click Home.
    2. In the Templates section, click Create below Heart Disease Prediction.
    3. In the New Experiment dialog box, set the experiment parameters. You can use the default values of the parameters.
      Parameter | Description
      --- | ---
      Name | The name of the experiment. Default value: Heart Disease Prediction.
      Project | The project in which you want to create the experiment. You cannot change the value of this parameter.
      Description | The description of the experiment. Default value: Create a heart disease prediction experiment, including data preprocessing, feature engineering, model training, and prediction.
      Save To | The directory for storing the experiment. Default value: My Experiments.
    4. Click OK.
    5. Optional: Wait about 10 seconds. Then, click Experiments in the left-side navigation pane.
    6. Optional: Click Heart Disease Prediction_XX under My Experiments.
      My Experiments is the directory for storing the experiment that you created and Heart Disease Prediction_XX is the name of the experiment. In the experiment name, _XX is the ID that the system automatically creates for the experiment.
    7. View the components of the experiment on the canvas, as shown in the following figure. The system automatically creates the experiment based on the preset template.
      Figure: Experiment on heart disease prediction
      Area No. Description
      1 The components in this area preprocess data. For example, the components denoise the data, fill missing values, and convert values to numbers. In each sample, the patient is either healthy or sick with heart disease. Therefore, heart disease prediction in this experiment is a binary classification problem. The dataset used in this experiment contains 13 feature fields and one target field (status). During data preprocessing, the values of each field must be converted to numbers based on the meaning of the field. The SQL Script-1 component converts the values of each field based on the following rules:
      • Two-valued field: The component converts one value to 0 and the other value to 1. For example, the value of the sex field is female or male. After the conversion, 0 indicates female and 1 indicates male.
      • Multi-valued field: The component converts the values to 0, 1, 2, or 3. For example, the cp field has four values that indicate the type of chest pain from mild to severe. After the conversion, values 0 to 3 indicate the chest pain from mild to severe.
      Sample SQL script:
      -- Convert the string-valued fields to numeric codes.
      select age,
             (case sex when 'male' then 1 else 0 end) as sex,
             (case cp when 'angina' then 0 when 'notang' then 1 else 2 end) as cp,
             trestbps,
             chol,
             (case fbs when 'true' then 1 else 0 end) as fbs,
             (case restecg when 'norm' then 0 when 'abn' then 1 else 2 end) as restecg,
             thalach,
             (case exang when 'true' then 1 else 0 end) as exang,
             oldpeak,
             (case slop when 'up' then 0 when 'flat' then 1 else 2 end) as slop,
             ca,
             (case thal when 'norm' then 0 when 'fix' then 1 else 2 end) as thal,
             -- Target field: 1 indicates sick and 0 indicates any other status.
             (case status when 'sick' then 1 else 0 end) as ifHealth
      from ${t1};
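For readers who want to check the encoding rules outside the console, the following Python sketch mirrors the case expressions of the sample SQL script for a single record. It is only an illustration: the helper names (`encode_row`, `_binary`, `_ordinal`) and the sample record are hypothetical and are not part of the experiment template.

```python
# Illustrative mirror of the sample SQL script. The function names and the
# sample record below are hypothetical; only the mapping rules come from the script.

def _binary(value, positive):
    """Return 1 if value equals the positive label, else 0 (SQL: when ... then 1 else 0)."""
    return 1 if value == positive else 0


def _ordinal(value, first, second):
    """Return 0 for first, 1 for second, and 2 for any other value (SQL: else 2)."""
    if value == first:
        return 0
    if value == second:
        return 1
    return 2


def encode_row(row):
    """Encode one record (a dict of strings) with the same rules as the SQL script."""
    return {
        "age": row["age"],
        "sex": _binary(row["sex"], "male"),
        "cp": _ordinal(row["cp"], "angina", "notang"),
        "trestbps": row["trestbps"],
        "chol": row["chol"],
        "fbs": _binary(row["fbs"], "true"),
        "restecg": _ordinal(row["restecg"], "norm", "abn"),
        "thalach": row["thalach"],
        "exang": _binary(row["exang"], "true"),
        "oldpeak": row["oldpeak"],
        "slop": _ordinal(row["slop"], "up", "flat"),
        "ca": row["ca"],
        "thal": _ordinal(row["thal"], "norm", "fix"),
        "ifHealth": _binary(row["status"], "sick"),
    }


# A made-up record used only to exercise the mapping.
sample_row = {
    "age": "54", "sex": "male", "cp": "angina", "trestbps": "130",
    "chol": "240", "fbs": "true", "restecg": "hyp", "thalach": "150",
    "exang": "false", "oldpeak": "1.2", "slop": "down", "ca": "0",
    "thal": "fix", "status": "buff",
}
encoded = encode_row(sample_row)
```

As in the SQL script, any value that is not listed explicitly falls into the else branch, so unexpected or misspelled values are silently encoded as 0 (two-valued fields) or 2 (multi-valued fields).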
      2 The components in this area perform feature engineering, including feature derivation and scaling. The Data Type Conversion-1 component converts the input feature data to the DOUBLE type because a logistic regression model accepts only input data of the DOUBLE type. Then, the Feature Selection (Filter Method)-1 component measures the impact of each feature on the prediction result by using entropy and the Gini index. The Normalization-1 component rescales the values of each feature to the range from 0 to 1, which removes the impact of differing units and value ranges on the prediction result. The normalization formula is result = (val - min)/(max - min), where min and max are the minimum and maximum values of the feature.
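The quantities named in this area can be sketched in a few lines of Python. The block below computes the entropy and Gini index of a label column and applies the normalization formula result=(val-min)/(max-min) to one feature. It is a minimal sketch with hypothetical function names; how the Feature Selection (Filter Method)-1 component combines these measures internally is not described here, and the handling of a constant column is an assumption.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def gini(labels):
    """Gini index (impurity) of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def min_max_normalize(values):
    """Rescale values to [0, 1] with result = (val - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column carries no information, so map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Made-up values for illustration only.
balanced_labels = [0, 0, 1, 1]
resting_bp = [120.0, 140.0, 160.0]
normalized_bp = min_max_normalize(resting_bp)
```

A perfectly balanced two-class column has the highest impurity (entropy 1 bit, Gini index 0.5), and a column with a single class has impurity 0. Filter-style feature selection typically compares how much a feature reduces such impurity.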
      3 The components in this area train the model and perform prediction.
      1. The Split-1 component divides the dataset into a training dataset and a prediction dataset at a ratio of 7:3.
      2. The Logistic Regression for Binary Classification-1 component trains the model on the training dataset.
      3. The trained model and the prediction dataset are passed to the Prediction-1 component, which generates the prediction result.
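The three steps in this area (split, train, predict) can be illustrated with a small standard-library Python sketch. It uses plain batch gradient descent for logistic regression, which is an assumption made for illustration; the optimizer, regularization, and defaults of the Logistic Regression for Binary Classification-1 component may differ. All function names, parameters, and the toy data are hypothetical.

```python
import math
import random


def split_rows(rows, train_fraction=0.7, seed=0):
    """Shuffle rows and split them into training and prediction subsets (7:3 by default)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_regression(features, labels, learning_rate=0.5, epochs=2000):
    """Fit weights and a bias by batch gradient descent on the logistic loss."""
    n_rows, n_features = len(features), len(features[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(features, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - learning_rate * g / n_rows for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n_rows
    return weights, bias


def predict_proba(model, x):
    """Return the predicted probability of the positive class (sick) for one row."""
    weights, bias = model
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Toy, already-normalized single-feature data; not taken from the dataset.
toy_features = [[0.0], [0.1], [0.2], [0.8], [0.9], [1.0]]
toy_labels = [0, 0, 0, 1, 1, 1]
model = train_logistic_regression(toy_features, toy_labels)
train_part, predict_part = split_rows(list(range(10)))
```

Note that prediction needs the trained model plus the rows to score. The training rows themselves are not an input to the prediction step.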
      4 The Confusion Matrix-1 and Binary Classification Evaluation-1 components evaluate the model.
  3. Run the experiment and view the result.
    1. In the top toolbar of the canvas, click Run.
    2. After the experiment is run, right-click Logistic Regression for Binary Classification-1 on the canvas and choose Model Option > Show Model to view the heart disease prediction model that has been trained.
    3. Right-click Prediction-1 on the canvas and select View Data to view the prediction result.
  4. View the evaluation report of the model.
    1. Right-click Binary Classification Evaluation-1 on the canvas and select View Evaluation Report.
    2. In the Evaluation Report dialog box, click the Indexes tab to view the model evaluation indexes.
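      Figure: Model evaluation result
      In the evaluation report, the value of AUC is greater than 0.9, which indicates that the model separates sick patients from healthy ones well. AUC measures how well the model ranks positive samples above negative samples. It is not the same as prediction accuracy, which depends on the classification threshold.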
    3. Right-click Confusion Matrix-1 on the canvas and select View Evaluation Report.
    4. In the Confusion Matrix dialog box, click the Statistics tab to view the model statistics such as Accuracy.
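To make the evaluation metrics concrete, the following Python sketch computes a confusion matrix, the accuracy derived from it, and the AUC from predicted scores. The function names and the toy scores are hypothetical, and the 0.5 threshold is an assumption. The pairwise AUC computation is chosen for readability, not efficiency, and is not necessarily how the Binary Classification Evaluation-1 component computes it.

```python
def confusion_matrix(labels, scores, threshold=0.5):
    """Count true/false positives and negatives at the given score threshold."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for y, s in zip(labels, scores):
        predicted = 1 if s >= threshold else 0
        if predicted == 1 and y == 1:
            counts["tp"] += 1
        elif predicted == 1 and y == 0:
            counts["fp"] += 1
        elif predicted == 0 and y == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts


def accuracy(counts):
    """Fraction of samples that are classified correctly."""
    return (counts["tp"] + counts["tn"]) / sum(counts.values())


def auc(labels, scores):
    """Probability that a random positive sample is scored above a random negative one."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Toy scores: every sick sample is ranked above every healthy one,
# but all scores fall below the 0.5 threshold.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.2, 0.3, 0.4]
toy_counts = confusion_matrix(toy_labels, toy_scores)
toy_accuracy = accuracy(toy_counts)
toy_auc = auc(toy_labels, toy_scores)
```

In this toy case the ranking is perfect, so the AUC is 1.0, yet the accuracy at a threshold of 0.5 is only 0.5 because both sick samples are predicted as healthy. This is why AUC and Accuracy are reported separately and should be read together.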