Capstone Project: An Algorithm for Predicting the Probability of the Presence of Heart Disease from Cardiovascular Results and Demographics

The following is a final report in completion of the Data Analysis & Interpretation Specialization by Wesleyan University.  You can view a PDF version of this report here.



The purpose of this study is to identify the best predictors of the presence of heart disease using multiple health and demographic factors such as results from an electrocardiographic (EKG) test, the presence of exercise-induced angina (i.e. chest pain caused by reduced blood flow to the heart), age, and sex. The end result of this project is an algorithmic model that predicts the probability of the presence of heart disease in a patient.

As a prospective graduate student in Data Analytics, and as someone deeply interested in health and nutrition, this is an opportunity for me to apply my analytical skills in an area that is of interest to me.

Predicting the probability that a patient has heart disease can prompt earlier intervention, which could translate to more lives saved and lower healthcare costs.

Source: The data for this project is provided by the Cleveland Heart Disease Database via the UCI Machine Learning repository and is hosted by DrivenData through their predictive analytics competition, “Machine Learning with a Heart”.

For information on heart disease, read this article by the Centers for Disease Control and Prevention.




The data consist of n=180 observations of patients who have undergone cardiovascular tests and have been diagnosed with or without heart disease.  There are 14 features/fields in the data set, including the binary target indicating whether or not heart disease is present.


The presence of heart disease for each of the patients in the data set was determined through various cardiovascular tests and professional clinical assessments.

The primary, clinical predictors include two variables that measure the quality of blood flow to the heart: the slope of the peak exercise ST segment (integer) and the result of a nuclear imaging method called a thallium stress test (categorical).  Others include resting blood pressure (integer), type of chest pain (integer), and whether or not the patient experienced exercise-induced angina, i.e. chest pain caused by reduced blood flow to the heart during exercise (binary).  There are only two secondary, non-clinical predictors: sex (binary) and age (integer).  I define primary predictors as features derived from clinical measurements, which are more directly associated with one’s health status.  Sex and age are demographic features rather than clinical measurements, but they may still be useful in predicting the probability of the presence of heart disease.


Counts, proportions, and distributions were used to summarize the data, especially by sex, age, and the presence of heart disease.  The mean, standard deviation, min, and max were calculated for numerical features.  Two-dimensional scatter plots, color-coded by presence of heart disease, were used to visualize any notable clustering between patients with heart disease and those without.  When no discernible clustering was evident, three-dimensional scatter plots were used for more depth.

The proportion of patients with heart disease was calculated for each group in the categorical/binary features.  Features whose groups show markedly different proportions are likely to be strong predictors in the model.  For instance, nearly 80% of patients who experienced angina have heart disease, whereas only about 30% of patients who did not experience angina have heart disease.  This suggests that a patient with angina is more likely to have heart disease.  ANOVAs were then performed on all but one of these categorical/binary features to establish whether or not the visualized differences are statistically significant.  The exception was fasting blood sugar greater than 120 mg/dl, which showed no meaningful difference in the visual: in both binary groups, about 43% of the patients have heart disease.
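The group-wise proportions described above can be sketched in a few lines of Python.  This is only an illustration: the rows below are hypothetical toy values rather than the study data, and the helper name disease_proportion_by_group is my own.

```python
from collections import defaultdict

# Hypothetical toy rows (NOT the study data):
# (exercise_induced_angina, heart disease present), both coded 0/1.
rows = [(1, 1), (1, 1), (1, 1), (1, 0), (0, 0), (0, 0), (0, 1), (0, 0)]


def disease_proportion_by_group(rows):
    """Return {group value: share of rows in that group with heart disease}."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for group, has_disease in rows:
        totals[group] += 1
        positives[group] += has_disease
    return {group: positives[group] / totals[group] for group in totals}


proportions = disease_proportion_by_group(rows)
print(proportions)
```

A large gap between the groups' proportions is what flags a feature as a candidate predictor; nearly equal proportions suggest the feature carries little information on its own.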

A Gradient Boosted Trees (Classification) algorithm was utilized in this project for predicting the probability of the presence of heart disease.


Feature Descriptions

Following is a brief description of the features, which will prove useful as you continue to read through the analysis.

  • slope_of_peak_exercise_st_segment (type: int): the slope of the peak exercise ST segment, an electrocardiography read out indicating quality of blood flow to the heart
  • thal (type: categorical): results of thallium stress test measuring blood flow to the heart, with possible values normal, fixed_defect, reversible_defect
  • resting_blood_pressure (type: int): resting blood pressure
  • chest_pain_type (type: int): chest pain type (4 values)
  • num_major_vessels (type: int): number of major vessels (0-3) colored by fluoroscopy[1]
  • fasting_blood_sugar_gt_120_mg_per_dl (type: binary): fasting blood sugar > 120 mg/dl
  • resting_ekg_results (type: int): resting electrocardiographic results (values 0,1,2)
  • serum_cholesterol_mg_per_dl (type: int): serum cholesterol in mg/dl
  • oldpeak_eq_st_depression (type: float): oldpeak = ST depression induced by exercise relative to rest, a measure of abnormality in electrocardiograms
  • sex (type: binary): 0: female, 1: male
  • age (type: int): age in years
  • max_heart_rate_achieved (type: int): maximum heart rate achieved (beats per minute)
  • exercise_induced_angina (type: binary): exercise-induced chest pain (0: False, 1: True)



Descriptive Statistics

The data comprise 180 patients: 124 males (69%) and 56 females (31%).  Of these, 80 patients have been diagnosed with heart disease (44%).  Of those diagnosed with heart disease, 78% were 50 years or older.

Table 1 shows key descriptive statistics of the numerical (non-binary) features.


The max heart rate (MHR) that a person can be expected to achieve can be roughly estimated by subtracting their age from 220.  Given the minimum and maximum ages in the data, the expected maximum MHR is 220 – 29 = 191 and the expected minimum MHR is 220 – 77 = 143.  Already we can see that the actual min and max MHR are well outside the expected range.

The max resting blood pressure in the data is 180 mm Hg.  Readings above 180 are classified as a hypertensive crisis.  The mean resting blood pressure is within the range that’s considered elevated blood pressure.  A blood pressure chart by the American Heart Association and American Stroke Association is provided for further reference.
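The arithmetic above can be written as a small Python sketch.  The function name expected_max_heart_rate is hypothetical, and the "220 minus age" rule is only the rough rule of thumb described in the text.

```python
def expected_max_heart_rate(age_years):
    """Rule-of-thumb estimate of max heart rate: 220 minus age."""
    return 220 - age_years


# Minimum and maximum patient ages as stated in the text.
youngest_age, oldest_age = 29, 77

expected_high = expected_max_heart_rate(youngest_age)
expected_low = expected_max_heart_rate(oldest_age)
print(expected_low, expected_high)
```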


Bivariate Analyses

Because the label (i.e. the value being predicted; presence of heart disease) is binary, it was used as the legend when plotting associations between the numerical (non-binary) features, to see if there was any distinct clustering between patients with heart disease and those without.  Figure 1 shows these results:



There is a high frequency of patients between the ages of 50 and 70, peaking at 60, who have been diagnosed with heart disease.  Patients without heart disease tended to have higher max heart rates, whereas those with heart disease tended to have lower max heart rates.  Additionally, patients without heart disease tended to have a low ST depression reading (peaks at 0), whereas patients with heart disease tended to have a higher ST depression reading (peaks at 2).

While the density plots (diagonal) show slight variation in the distributions of patients for age, max heart rate achieved, and ST depression, there is no clear clustering between patients with heart disease and those without.  We especially see this in the 2D scatter plots, where the blue data points (patients with heart disease) are largely mixed in with the gray data points (patients without heart disease).  Figures 2, 3, and 4 show 3D scatter plots that provide a clearer view into potential clustering.


While there does appear to be less overlap in the 3D plots, significant overlap remains.

Figure 5 shows the proportion of patients with heart disease by categorical/binary features, such as sex and the number of major vessels colored by a fluoroscopy.  All but one of the subplots reveal major differences between two or more groups within each feature.  Whether or not a patient had fasting blood sugar > 120 mg/dl shows virtually no difference in the proportion of patients with heart disease; both groups have about 43% of patients with heart disease.  Considering this, I removed this feature from the data during the preprocessing stage of the model.



ANOVAs were performed on all eight features containing major differences.  The two features with only two groups, sex and exercise-induced angina, had statistically significant differences with p-values less than 0.001.  The remaining features had more than two groups and therefore required post hoc pairwise comparisons.  Tukey's Honestly Significant Difference (HSD) test was used for the post hoc tests.  Each feature had statistically significant differences in two or more comparisons at a significance level (alpha) of 0.05.
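A one-way ANOVA followed by a Tukey HSD post hoc test can be sketched as below.  This is an illustration under assumptions, not the exact procedure used for this report: the 0/1 labels are hypothetical toy values, and it relies on scipy.stats.f_oneway and scipy.stats.tukey_hsd (the latter requires SciPy 1.8 or newer).

```python
from scipy.stats import f_oneway, tukey_hsd

# Hypothetical 0/1 heart-disease labels (NOT the study data) for the
# three groups of a single categorical feature.
group_a = [0, 0, 0, 1, 0, 0, 0, 0, 1, 0]
group_b = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
group_c = [0, 1, 0, 1, 0, 1, 1, 0, 0, 1]

# Omnibus test: do the group means (proportions) differ at all?
anova = f_oneway(group_a, group_b, group_c)
print(f"F = {anova.statistic:.3f}, p = {anova.pvalue:.4f}")

# Post hoc test: which specific pairs of groups differ?
tukey = tukey_hsd(group_a, group_b, group_c)
alpha = 0.05
# tukey.pvalue[i, j] is the adjusted p-value for group i versus group j.
significant_pairs = [
    (i, j)
    for i in range(3)
    for j in range(i + 1, 3)
    if tukey.pvalue[i, j] < alpha
]
print("pairs differing at alpha = 0.05:", significant_pairs)
```

The ANOVA only indicates that at least one group differs; the post hoc comparison is what identifies which pairs of groups drive the difference.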

Algorithm: Gradient Boosted Trees (Classification)

Because the data mix continuous and binary features, and because patients with and without heart disease overlap substantially, a flexible algorithm was needed to achieve high accuracy.  The GradientBoostingClassifier() from Scikit-learn was utilized for this machine learning project.  In short, gradient boosted trees is an ensemble method in which each tree tries to correct the errors of the trees before it.  Two key parameters in the GradientBoostingClassifier() are learning_rate and n_estimators.  The learning rate controls how strongly each tree corrects the mistakes of the previous trees.  The number of estimators is the number of trees utilized in the model.  Parameters must be chosen carefully to prevent over-fitting or under-fitting.  Since the purpose of this project was not simply to classify whether or not a patient has heart disease, but to predict the probability of the presence of heart disease, the predict_proba() method of the GradientBoostingClassifier() was utilized.  This returns a probability (between 0 and 1) for each class of the binary label (presence of heart disease; 0=False, 1=True) for each patient.

Scoring the model as a binary classifier resulted in a score of 0.911.  However, the model was ultimately scored on logistic loss (or log loss), which is used for classification models that predict probabilities rather than classes (e.g. 0, 1).  A perfect model would result in a log loss of 0.  Log loss applies a weighted penalty: confident predictions that are wrong, at either end of the spectrum (close to 100% or 0%), are penalized the most.  For example, predicting a low probability of 0.01 (1%) when the actual observation label is 1 (True) would result in a high log loss.
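To make the penalty concrete, here is a minimal sketch of binary log loss written from its standard definition, the mean of -(y·ln(p) + (1-y)·ln(1-p)).  The function below is my own illustration rather than the competition's scoring code, and the clipping constant eps is an assumption to avoid taking ln(0).

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary log loss; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# The true label is 1 in all three cases; only the predicted probability changes.
confident_wrong = log_loss([1], [0.01])
unsure = log_loss([1], [0.5])
confident_right = log_loss([1], [0.99])
print(confident_wrong, unsure, confident_right)
```

The confidently wrong prediction is penalized far more heavily than the unsure one, which is why a low average log loss requires predictions that are both correct and appropriately confident.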

Preprocessing consisted of converting the thallium feature categories to dummy variables and one-hot encoding the following features: ST segment, chest pain type, and number of major vessels colored during fluoroscopy.  Although one-hot encoding adds columns rather than reducing dimensionality, it improved the performance of the model.  Standardization of the data is not required for Gradient Boosted Trees and therefore was not utilized.
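The encoding step can be sketched with pandas.get_dummies.  This is an assumption about tooling, since the report does not say which function was used, and the rows below are hypothetical toy values; only the column names follow the feature list above.

```python
import pandas as pd

# Hypothetical toy rows (NOT the study data).
df = pd.DataFrame({
    "thal": ["normal", "fixed_defect", "reversible_defect", "normal"],
    "chest_pain_type": [1, 4, 3, 2],
    "age": [45, 60, 52, 70],
})

# Each listed column is replaced by one indicator column per distinct value.
encoded = pd.get_dummies(df, columns=["thal", "chest_pain_type"])
print(list(encoded.columns))
```

Encoding this way lets the trees split on a single category (e.g. one chest pain type versus all others) instead of treating the integer codes as an ordered quantity.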

Plotting the feature importances (Figure 6) revealed that age had the greatest importance in the model, with serum cholesterol (mg/dl) and max heart rate achieved ranking second and third, respectively.  These features performed the best at splitting the data into subsets of patients with heart disease and those without, thereby reducing the entropy or disorder present in the data and allowing for more accurate predictions.  Only four features had an importance of 0 and were not utilized by the model.


The parameters utilized to fit the model are as follows: learning_rate=0.1, n_estimators=500, max_depth=1, min_samples_leaf=5.  With these parameters, along with the preprocessing, I was able to achieve a log loss score of 0.31653, which at the time of this writing ranked 25th out of about 800 participants in the machine learning competition by DrivenData where the data was obtained.  The #1 ranking participant had a log loss score of 0.27883, a difference of 0.0377 (see Figure 7).
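The following is a minimal sketch of fitting a model with the parameters listed above and obtaining predicted probabilities.  It is not the original project code: the data are a synthetic stand-in generated with make_classification, and the train/test split and random_state values are my own assumptions, so the printed log loss will not match the competition score.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the heart disease data): 180 rows, 13 features.
X, y = make_classification(
    n_samples=180, n_features=13, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Parameters as reported in the text; random_state is added for repeatability.
model = GradientBoostingClassifier(
    learning_rate=0.1,
    n_estimators=500,
    max_depth=1,
    min_samples_leaf=5,
    random_state=0,
)
model.fit(X_train, y_train)

# Column 1 of predict_proba holds the predicted probability of class 1.
proba = model.predict_proba(X_test)[:, 1]
print("held-out log loss:", log_loss(y_test, proba))

# Normalized importances, one value per input feature.
importances = model.feature_importances_
print("largest importance:", importances.max())
```

With max_depth=1 each tree is a single split (a "stump"), so the model relies on many small corrections, which tends to limit over-fitting on a data set this small.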




This project utilized Gradient Boosted Trees (classification) to predict the probability of the presence of heart disease in patients using cardiovascular results and demographics (i.e. age, sex).  Data for 180 patients were used to fit the model, with 80 patients having heart disease (44%).  Although this is a small sample of data to work with, the model achieved a log loss score of 0.31653 (the closer to 0 the better), which ranked 25th out of about 800 participants in the machine learning competition.  This log loss score suggests that correct predictions primarily consisted of confident predicted probabilities (close to 0% or 100%), whereas incorrect predictions primarily consisted of unconfident predicted probabilities (close to 50%).

Plotting the feature importances revealed that the most important feature utilized in the model, which relies on splitting the data by feature subsets to make predictions, was age.  Of those patients with heart disease, 78% were 50 years or older.  Additional features with high importance in the model were serum cholesterol mg/dl and max heart rate achieved.  Only four features had an importance of 0 in the model: resting ekg results 1 (a one-hot-encoded feature), ST segment, and chest pain type 2 and type 3 (one-hot-encoded features).  These features did not sufficiently reduce the entropy in the data to be utilized by the model for predictions.

This project revealed that demographic features, such as age and sex, coupled with cardiovascular results, such as max heart rate achieved and resting blood pressure, are highly valuable in predicting the probability of the presence of heart disease.  Such an algorithm could be utilized by hospitals and clinics to make more confident and reliable diagnoses of heart disease, resulting in earlier treatment and lower healthcare costs.

While the results of this model are promising, it could be improved with additional data, both in the number of features, such as additional demographics (e.g. ethnicity), and in the number of observations (or patients).  More advanced parameterization and feature engineering could also be utilized to improve the accuracy of the model.



[1] A type of medical imaging that shows a continuous X-ray image on a monitor.

