Platform For AI: Heart disease prediction

Last Updated: Apr 21, 2026

Heart disease is a major health risk. By analyzing patient examination data, you can identify key risk factors and enable early prevention. This tutorial demonstrates how to create a heart disease prediction model using a data mining pipeline on real patient data.

Prerequisites

Data mining process

Heart disease prediction

  1. Go to the Machine Learning Designer page.

    1. Log on to the PAI console.

    2. In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to manage.

    3. In the left-side navigation pane, choose Model Training > Visualized Modeling (Designer).

  2. Create the pipeline.

    1. On the Designer page, click the Preset Templates tab.

    2. In the Heart Disease Prediction section of the template list, click Create.

    3. In the Create Pipeline dialog box, configure the parameters. You can use the default settings.

      The Data Storage parameter specifies an Object Storage Service (OSS) bucket path to store temporary data and models generated during the pipeline run.

    4. Click Confirm.

      Wait for about 10 seconds for the pipeline to be created.

    5. In the pipeline list, select the heart disease prediction pipeline and click Open.

    6. The system automatically builds the pipeline based on the preset template, as shown in the following figure.

      [Figure: Heart disease prediction pipeline]

      Area

      Description

      Data preprocessing: This stage involves data cleaning, filling missing values, and transforming data types. Because each patient is either sick or healthy, heart disease prediction is a binary classification problem. The input data for this pipeline includes 13 feature columns and 1 target column. For more information about the fields, see Appendix: Heart disease dataset. During data preprocessing, string values are converted to numeric types based on the meaning of each field. Examples:

      • Binary data: For example, the sex field takes the values female and male. You can use 0 to represent female and 1 to represent male.

      • Multi-valued data: For example, the cp field represents the chest pain type, and its values can be mapped to integers from 0 to 3 in increasing order of severity. The sample script below groups them into three codes (0 to 2).

      The following is a sample SQL script for data preprocessing.

      -- Convert string fields to numeric codes. ${t1} is the input table.
      select age,
        (case sex when 'male' then 1 else 0 end) as sex,
        (case cp when 'angina' then 0 when 'notang' then 1 else 2 end) as cp,
        trestbps,
        chol,
        (case fbs when 'true' then 1 else 0 end) as fbs,
        (case restecg when 'norm' then 0 when 'abn' then 1 else 2 end) as restecg,
        thalach,
        (case exang when 'true' then 1 else 0 end) as exang,
        oldpeak,
        (case slop when 'up' then 0 when 'flat' then 1 else 2 end) as slop,
        ca,
        (case thal when 'norm' then 0 when 'fix' then 1 else 2 end) as thal,
        -- Target column: 1 = sick, 0 = healthy.
        (case status when 'sick' then 1 else 0 end) as ifHealth
      from ${t1};
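To make the mapping rules in the SQL script easier to check by hand, the following is a minimal Python sketch of the same CASE logic applied to a single record. The function name `encode_row`, the helper `three_way`, and the sample record are hypothetical and exist only for illustration. They are not part of PAI, and the sample values are made up rather than taken from the dataset.

```python
def three_way(value, first, second):
    """Mirror of: case x when first then 0 when second then 1 else 2 end."""
    if value == first:
        return 0
    if value == second:
        return 1
    return 2


def encode_row(row):
    """Convert one raw record (a dict of strings) to the coded form used by the SQL script."""
    return {
        "age": row["age"],
        "sex": 1 if row["sex"] == "male" else 0,
        "cp": three_way(row["cp"], "angina", "notang"),
        "trestbps": row["trestbps"],
        "chol": row["chol"],
        "fbs": 1 if row["fbs"] == "true" else 0,
        "restecg": three_way(row["restecg"], "norm", "abn"),
        "thalach": row["thalach"],
        "exang": 1 if row["exang"] == "true" else 0,
        "oldpeak": row["oldpeak"],
        "slop": three_way(row["slop"], "up", "flat"),
        "ca": row["ca"],
        "thal": three_way(row["thal"], "norm", "fix"),
        # As in the SQL script, ifHealth is 1 for a sick patient and 0 otherwise.
        "ifHealth": 1 if row["status"] == "sick" else 0,
    }


# Made-up record for illustration only.
sample = {
    "age": "63", "sex": "male", "cp": "angina", "trestbps": "145",
    "chol": "233", "fbs": "true", "restecg": "hyp", "thalach": "150",
    "exang": "false", "oldpeak": "2.3", "slop": "down", "ca": "0",
    "thal": "fix", "status": "buff",
}
encoded = encode_row(sample)
```

Note that any value not named in a CASE branch falls into the final `else` code, so unexpected or misspelled strings are silently mapped to the last category.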

      Feature engineering: This stage involves deriving new features and scaling existing ones. This pipeline first uses the Type Transform component to convert input features to the DOUBLE type because the logistic regression model requires input data of the DOUBLE type. Then, the pipeline uses the Feature Select Runner component to determine the impact of each feature on the outcome, which is reflected by information entropy and the Gini index. The pipeline also uses the Normalize component to scale each feature to a 0-to-1 range. This process, known as normalization, eliminates the impact of different units and scales on the result. The formula is result=(val-min)/(max-min).
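The normalization formula and the two impurity measures named above can be sketched in a few lines of Python. This is an illustration using the textbook definitions of min-max scaling, information entropy, and the Gini index. The source does not describe how the Normalize or Feature Select Runner components compute these values internally, so the function names and the handling of a constant column are assumptions.

```python
import math
from collections import Counter


def min_max_normalize(values):
    """Scale values to the 0-to-1 range: result = (val - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column carries no information, so map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def entropy(labels):
    """Information entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def gini(labels):
    """Gini index of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


scaled = min_max_normalize([100, 150, 200])
balanced_entropy = entropy([0, 0, 1, 1])
balanced_gini = gini([0, 0, 1, 1])
```

For a perfectly balanced binary label column, entropy is 1 bit and the Gini index is 0.5, which are the maximum values for two classes. A feature that splits the data into groups with lower entropy or Gini values than the whole dataset is more informative about the outcome.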

      Model training and prediction:

      1. Use the Split component to divide the dataset into a training set and a prediction set at a 7:3 ratio.

      2. Use the Logistic Regression component to train the model.

        Note

        If you want to export a PMML model file, on the Field Setting tab of this component, select the Whether To Generate PMML checkbox. Then, click a blank area of the canvas and configure the data storage path on the Pipeline Attributes tab.

      3. Pass the trained model and the prediction set to the Predicted component to generate predictions.

      Model evaluation: Use the Confusion Matrix and Evaluate components to evaluate the model.
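The split, train, and predict steps can be sketched outside Designer as follows. This is a minimal illustration in plain Python, assuming batch gradient descent on the logistic loss. The source does not state which optimizer or hyperparameters the Logistic Regression component uses, so the function names, `learning_rate`, `epochs`, and `seed` are all hypothetical choices.

```python
import math
import random


def split_rows(rows, train_fraction=0.7, seed=0):
    """Shuffle rows and split them into a training set and a prediction set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_probability(weights, bias, x):
    """Return the predicted probability that the label is 1 for feature vector x."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train_logistic_regression(features, labels, learning_rate=0.5, epochs=500):
    """Fit weights and bias with batch gradient descent on the logistic loss."""
    n = len(features)
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            error = predict_probability(weights, bias, x) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


# Toy data for illustration only: one normalized feature, label 1 when it is large.
toy_features = [[0.0], [0.1], [0.2], [0.3], [0.7], [0.8], [0.9], [1.0]]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]
weights, bias = train_logistic_regression(toy_features, toy_labels)
```

Features are expected to be numeric and already scaled to the 0-to-1 range, which matches the type conversion and normalization performed in the feature engineering stage.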

  3. Run the pipeline and view the results.

    1. Click the Run icon at the top of the canvas.

    2. After the pipeline finishes running, right-click the Logistic Regression component on the canvas and choose Model Options > Export to PMML Files to export the trained heart disease prediction model.

    3. Right-click the Predicted component on the canvas and choose View Data > Prediction Result Output Port to view the prediction results.

  4. Evaluate the model performance.

    1. Right-click the Evaluate component on the canvas and click Visual Analysis.

    2. In the Evaluate dialog box, click the Indicator Data tab to view the model evaluation metrics.

      The AUC value for this model exceeds 0.9, which indicates excellent predictive performance.

    3. Right-click the Confusion Matrix component on the canvas and click Visual Analysis.

    4. In the Confusion Matrix dialog box, click the Summary tab to view information such as the model accuracy.
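To clarify what the AUC and accuracy figures in these dialog boxes measure, the following Python sketch computes both from labels and predicted scores. It uses the standard pairwise definition of AUC and a 0.5 decision threshold. Both are assumptions for illustration, because the source does not describe how the Evaluate and Confusion Matrix components calculate their metrics. The labels and scores are made up.

```python
def confusion_matrix(labels, predictions):
    """Count true/false positives and negatives for binary labels (1 = sick)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for y, p in zip(labels, predictions):
        if p == 1:
            counts["tp" if y == 1 else "fp"] += 1
        else:
            counts["tn" if y == 0 else "fn"] += 1
    return counts


def accuracy(counts):
    """Fraction of records whose predicted label matches the true label."""
    total = sum(counts.values())
    return (counts["tp"] + counts["tn"]) / total


def auc(labels, scores):
    """Probability that a random positive record scores higher than a random negative one.

    Ties count as half a win. Requires at least one positive and one negative label.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Made-up labels and predicted probabilities for illustration only.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.2, 0.6, 0.1]
predictions = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold
counts = confusion_matrix(labels, predictions)
```

AUC depends only on how the scores rank sick patients relative to healthy ones, so it does not change with the decision threshold. Accuracy and the confusion matrix counts do change when the threshold moves.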

Appendix: Heart disease dataset

This pipeline uses an open-source dataset from the UCI Machine Learning Repository. It contains physical examination records of 303 patients from a region in the United States. The following table describes the fields.

| Parameter | Type | Description |
| --- | --- | --- |
| age | STRING | Age of the patient. |
| sex | STRING | Gender of the patient. Valid values: female and male. |
| cp | STRING | Chest pain type. The types, in descending order of severity, are typical, atypical, non-anginal, and asymptomatic. |
| trestbps | STRING | Resting blood pressure. |
| chol | STRING | Cholesterol. |
| fbs | STRING | Fasting blood glucose. If the blood glucose level is greater than 120 mg/dL, the value is true. Otherwise, the value is false. |
| restecg | STRING | The possible results for the T wave of an electrocardiogram (ECG), from mild to severe, are norm, abn, and hyp. |
| thalach | STRING | Maximum heart rate achieved. |
| exang | STRING | Indicates whether the patient has exercise-induced angina. true indicates the presence of angina, and false indicates the absence of angina. |
| oldpeak | STRING | ST depression induced by exercise relative to rest. |
| slop | STRING | The slope of the ST segment of an ECG. Valid values: down, flat, and up. |
| ca | STRING | Number of major vessels colored by fluoroscopy. |
| thal | STRING | Defect type. The values, in ascending order of severity, are norm (normal), fix (fixed defect), and rev (reversible defect). |
| status | STRING | Health status. buff indicates healthy, and sick indicates sick. |

Note

Pipelines created from the template include this dataset. To download the dataset or learn more, see Heart Disease Data Set.

Next steps

Once the pipeline runs successfully, you can deploy the model as an online service for inference. For more information about deployment, see Deploy a model as an online service and PMML.