Heart disease prediction

Last Updated: Aug 17, 2018

Overview

Heart disease is the leading cause of death in humans and accounts for 33% of deaths worldwide. In China, hundreds of thousands of people die of heart disease every year. Data mining has therefore become extremely important for heart disease prediction and treatment: it takes the relevant health exam indicators and analyzes their influence on heart disease. This document describes how to use Alibaba Cloud Machine Learning Platform for AI to create a heart disease prediction model based on data collected from heart disease patients.

Datasets

Data source: UCI Heart Disease Dataset. This dataset is based on 303 cases of heart disease in the United States. The attributes are as follows:

Name | Definition | Data type | Description
age | Age | string | Age of a patient. The value contains digits only.
sex | Gender | string | Gender of a patient: female or male.
cp | Chest pain type | string | Chest pain types, including typical, atypical, non-anginal, and asymptomatic.
trestbps | Blood pressure | string | Blood pressure of a patient.
chol | Cholesterol | string | Cholesterol of a patient.
fbs | Fasting blood sugar | string | True means that a patient's fasting blood sugar is greater than 120 mg/dl. False means that it is equal to or less than 120 mg/dl.
restecg | Resting electrocardiographic result | string | Resting electrocardiographic results, including normal, having ST-T wave abnormality, and showing probable or definite left ventricular hypertrophy.
thalach | Maximum heart rate achieved | string | Maximum heart rate of a patient.
exang | Exercise induced angina | string | True means that a patient has exercise induced angina. False means that a patient does not have exercise induced angina.
oldpeak | ST depression induced by exercise relative to rest | string | ST depression of a patient.
slop | Slope of the peak exercise ST segment | string | Slopes of the peak exercise ST segment, including down, flat, and up.
ca | Number of major vessels colored by fluoroscopy | string | Number of major vessels colored by fluoroscopy.
thal | Defect type | string | Defect types, including norm, fix, and rev.
status | Heart disease status | string | Health means that a patient does not have heart disease. Sick means that a patient has heart disease.

Data exploration procedure

The following figure shows the procedure of data mining:

image

The following figure shows the workflow of the project:

image

Data pre-processing

Data pre-processing, also known as data cleaning, is the process of analyzing and changing the source data. It includes removing irrelevant data, fixing incomplete data, and converting data types. The dataset has 14 fields: 13 indicators and one target field. This project predicts the presence or absence of heart disease in patients based on their health exam indicators. It uses a generalized linear model, logistic regression, which requires all input indicators to be of the double type.

All input data:

image

During data pre-processing, we must convert data of string and text types to numeric type based on the definition of the data.

  • Boolean data
    For example, you can set the sex attribute to 0 to indicate female and set the attribute to 1 to indicate male.

  • Multivalued data
    For example, you can use 0 through 3 to numerically rate the chest pain in ascending order for the cp attribute.

Data pre-processing is implemented with SQL scripts. The script in the SQL script-1 component is as follows:

    -- Convert the string-valued fields to numeric codes.
    -- The remaining fields pass through unchanged.
    select age,
      (case sex when 'male' then 1 else 0 end) as sex,
      -- 'angina' -> 0, 'notang' -> 1, any other chest pain type -> 2
      (case cp when 'angina' then 0 when 'notang' then 1 else 2 end) as cp,
      trestbps,
      chol,
      (case fbs when 'true' then 1 else 0 end) as fbs,
      (case restecg when 'norm' then 0 when 'abn' then 1 else 2 end) as restecg,
      thalach,
      (case exang when 'true' then 1 else 0 end) as exang,
      oldpeak,
      (case slop when 'up' then 0 when 'flat' then 1 else 2 end) as slop,
      ca,
      (case thal when 'norm' then 0 when 'fix' then 1 else 2 end) as thal,
      -- Target field: 1 means sick, 0 means healthy
      (case status when 'sick' then 1 else 0 end) as ifHealth
    from ${t1};
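
For readers who want to check the mapping outside the platform, the following is a minimal Python sketch of the same encoding rules. The function name encode_record and the lookup dictionaries are hypothetical names introduced for illustration. The sketch also converts the pass-through fields to float, which the SQL script itself does not do.

```python
# Illustrative re-implementation of the SQL mapping; all names here are
# assumptions for this sketch and are not part of the platform.
CP_CODES = {"angina": 0, "notang": 1}       # any other value -> 2
RESTECG_CODES = {"norm": 0, "abn": 1}       # any other value -> 2
SLOP_CODES = {"up": 0, "flat": 1}           # any other value -> 2
THAL_CODES = {"norm": 0, "fix": 1}          # any other value -> 2


def encode_record(row):
    """Convert one raw record (all values are strings) to numeric values."""
    return {
        "age": float(row["age"]),
        "sex": 1 if row["sex"] == "male" else 0,
        "cp": CP_CODES.get(row["cp"], 2),
        "trestbps": float(row["trestbps"]),
        "chol": float(row["chol"]),
        "fbs": 1 if row["fbs"] == "true" else 0,
        "restecg": RESTECG_CODES.get(row["restecg"], 2),
        "thalach": float(row["thalach"]),
        "exang": 1 if row["exang"] == "true" else 0,
        "oldpeak": float(row["oldpeak"]),
        "slop": SLOP_CODES.get(row["slop"], 2),
        "ca": float(row["ca"]),
        "thal": THAL_CODES.get(row["thal"], 2),
        # Same convention as the SQL script: 1 means sick, 0 means healthy.
        "ifHealth": 1 if row["status"] == "sick" else 0,
    }
```

As in the SQL script, any category that is not listed explicitly falls into the last code (2), so distinct categories can share one code.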

Feature engineering

Feature engineering includes feature derivation and scale change. This project uses the feature selection and data normalization components for feature engineering.

  • Filter-based feature selection

    This component uses information entropy and the Gini coefficient to measure how strongly each indicator influences the prediction result. You can view the final results in the assessment report.
    image

  • Data normalization

    This project trains the model by using dichotomous (binary) logistic regression, so the indicators must not be measured on different scales. Data normalization applies the following formula so that every indicator takes a value between 0 and 1: result = (val - min) / (max - min), where min and max are the minimum and maximum values of the indicator.

    The following figure shows the results of data normalization:
    image
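
The two feature engineering steps above can be approximated in a few lines of Python. This is a minimal sketch, not the algorithm used by the platform components, whose internals are not described here. It assumes discrete feature values for the entropy and Gini scores, and all function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def impurity_reduction(feature, labels, impurity):
    """Impurity of the labels minus the weighted impurity after grouping
    the records by feature value. A larger score suggests a more
    informative feature."""
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    weighted = sum(len(g) / len(labels) * impurity(g)
                   for g in groups.values())
    return impurity(labels) - weighted


def min_max_normalize(values):
    """Apply result = (val - min) / (max - min) to one indicator column."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

For example, min_max_normalize([100, 150, 200]) maps the smallest value to 0, the largest to 1, and the midpoint to 0.5. A feature that separates the two classes perfectly gets the maximum score from impurity_reduction, and a feature that is independent of the class gets a score of 0.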

Model training and prediction

In supervised learning, you train a model on data whose outcomes are already known, and then compare the model's predictions with those known outcomes. In this project, supervised learning is used to train a model that predicts the presence or absence of heart disease in a group of patients.

  1. Data split

    Use the split component to split the data into a training dataset and a predicting dataset at a ratio of 7:3. The training dataset is fed to the dichotomous logistic regression component for model training. The predicting dataset is fed to the prediction component.

  2. Dichotomous logistic regression

    Logistic regression is a linear model. In this project, dichotomous classification (determining the presence or absence of heart disease) is achieved by comparing the predicted probability with a threshold. For more information about logistic regression, see the relevant documentation. You can view the trained logistic regression model on the Model page.
    image

  3. Prediction

    The prediction component has two inputs: the model and the predicting dataset. For each record, the prediction results show the predicted value, the actual value, and the associated probability.
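
The three steps above can be sketched end to end in plain Python. This is an illustrative sketch under stated assumptions, not the implementation used by the platform components: it trains by batch gradient descent, and the function names, learning rate, epoch count, and 0.5 threshold are all hypothetical choices.

```python
import math
import random


def split_dataset(rows, train_fraction=0.7, seed=0):
    """Shuffle a copy of rows and split it 7:3 into training and
    predicting datasets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_regression(features, labels, learning_rate=0.5, epochs=2000):
    """Fit weights and a bias by batch gradient descent on the log loss."""
    n = len(features)
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            error = sigmoid(z) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


def predict(weights, bias, x, threshold=0.5):
    """Return (label, probability); label is 1 when probability >= threshold."""
    probability = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
    return (1 if probability >= threshold else 0), probability
```

A typical use is to call split_dataset on the normalized records, pass the training part to train_logistic_regression, and call predict on each record of the predicting part. Raising the threshold makes the model more conservative about labeling a patient as sick.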

Assessment

You can use the confusion matrix to assess model metrics such as accuracy.

image

Based on the accuracy of the prediction result, you can determine whether your model is well trained or not.
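
As a minimal sketch of how accuracy follows from a confusion matrix, the Python below counts the four outcome types for 0/1 labels, where 1 means sick. The function names are hypothetical, and the platform's assessment component may report additional metrics.

```python
def confusion_matrix(actual, predicted):
    """Count true/false positives and negatives for 0/1 labels (1 = sick)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p == 1:
            counts["tp" if a == 1 else "fp"] += 1
        else:
            counts["fn" if a == 1 else "tn"] += 1
    return counts


def accuracy(counts):
    """Fraction of records whose predicted label matches the actual label."""
    total = sum(counts.values())
    return (counts["tp"] + counts["tn"]) / total
```

Accuracy alone can hide an imbalance between the two error types. In a medical setting, a false negative (a sick patient predicted as healthy) is usually the costlier error, so it is worth reading the fn count directly as well.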

Conclusions

Based on the data exploration workflow above, the following conclusions can be drawn:

  • Feature weight

    • You can obtain the weight of each indicator in the prediction by using filter-based feature selection.
    • The maximum heart rate achieved (thalach) indicator has the greatest impact on heart disease prediction.
    • The gender indicator does not have any impact on heart disease prediction.
  • Prediction results

    Based on the 13 indicators, the model predicts heart disease with an accuracy of over 80%. This model can be used to assist in heart disease prediction and treatment.
