Data Analysis & Interpretation 5.2: Methods (Sample, Measures, Analysis)

For this week’s assignment, we are required to post a draft version of a segment of our report detailing the methods of our research.



The data consist of n = 180 observations of patients who have undergone cardiovascular tests and have been diagnosed as either having or not having heart disease.  There are 14 features/fields in the data set, including the binary target indicating whether or not heart disease is present in the patient.


The presence of heart disease in each patient in the data set was determined through various cardiovascular tests and professional clinical assessments.

The primary, clinical predictors include two variables that measure the quality of blood flow to the heart: the slope of the peak exercise ST segment (integer) and the result of a nuclear imaging method called a Thallium stress test (categorical).  Other primary predictors include resting blood pressure (integer), type of chest pain (integer), and whether or not the patient experienced angina (chest pain caused by reduced blood flow to the heart) as a result of exercise (binary).  There are only two secondary, non-clinical predictors: sex (binary) and age (integer).  I define primary predictors as features derived from clinical measurements, the results of which are more directly associated with one’s health status.  Sex and age, by contrast, are demographic features; they are not the result of clinical measurements but may still be useful in predicting the probability of heart disease.


Counts, proportions, and distributions were used to summarize the data, especially by sex, age, and the presence of heart disease.  Two-dimensional scatter plots, color coded by presence of heart disease, were used to look for any notable clustering that separates patients with heart disease from those without.  When no discernible clustering was evident, three-dimensional scatter plots were used to add a third feature.  While significant overlap was still evident in the 3D plots, there did seem to be less overlap than in the 2D plots.  This suggests that a high level of accuracy in the predictive model will not be achieved with a select few features but will require several features.
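As a rough illustration of the count and proportion summaries described above, here is a minimal sketch in Python with pandas.  The column names (sex, age, heart_disease) and the eight rows of data are made up for the example and are not taken from the actual data set.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "sex": [1, 1, 1, 1, 1, 0, 0, 0],
    "age": [63, 58, 67, 45, 52, 41, 50, 62],
    "heart_disease": [1, 1, 1, 0, 0, 0, 0, 1],
})

# Counts of patients by sex and presence of heart disease.
counts = pd.crosstab(df["sex"], df["heart_disease"])

# The same table as row proportions (each sex sums to 1).
row_props = pd.crosstab(df["sex"], df["heart_disease"], normalize="index")

# Distribution of age within each heart disease group.
age_summary = df.groupby("heart_disease")["age"].describe()

print(counts)
print(row_props)
print(age_summary)
```

The same pattern would extend to any of the other categorical or binary features by swapping the column passed to crosstab.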

The proportion of patients having heart disease was calculated for each group in the categorical/binary features.  Features whose groups show markedly different proportions are likely to be strong predictors in the model.  For instance, nearly 80% of patients who experienced angina have heart disease, whereas only about 30% of patients who did not experience angina have heart disease.  This suggests that a patient with angina is more likely to have heart disease.  ANOVAs were then performed on all but one of these categorical/binary features to establish whether or not the differences in proportions were statistically significant.  The exception was whether or not a patient had a fasting blood sugar greater than 120 mg/dl: the two groups were virtually equal, with about 43% of the patients in each having heart disease, so no ANOVA was performed on this feature.  Categorical features will be converted to dummy variables in order to help improve the performance of the algorithm.
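The steps in the paragraph above (group proportions, an ANOVA on a binary feature, and dummy variables) can be sketched as follows.  This is only an illustration: the column names exercise_angina, thal, and heart_disease and the ten rows of data are hypothetical, and using scipy's f_oneway for the ANOVA is my assumption about tooling rather than a record of what was actually run.

```python
import pandas as pd
from scipy.stats import f_oneway

# Hypothetical toy data; names and values are illustrative only.
df = pd.DataFrame({
    "exercise_angina": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "thal": ["normal", "reversible", "reversible", "fixed", "normal",
             "normal", "reversible", "normal", "fixed", "normal"],
    "heart_disease": [1, 1, 1, 0, 0, 0, 1, 0, 1, 0],
})

# Proportion of patients with heart disease in each angina group
# (the mean of a 0/1 column is the proportion of 1s).
proportions = df.groupby("exercise_angina")["heart_disease"].mean()

# One-way ANOVA comparing the 0/1 outcome across the angina groups.
groups = [g["heart_disease"].to_numpy()
          for _, g in df.groupby("exercise_angina")]
f_stat, p_value = f_oneway(*groups)

# Convert the categorical feature to dummy (one-hot) variables.
encoded = pd.get_dummies(df, columns=["thal"], prefix="thal")

print(proportions)
print(f_stat, p_value)
print(encoded.columns.tolist())
```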

A Gradient Boosted Trees algorithm will be used to predict the probability of heart disease in each patient, with the predict_proba function supplying the predicted probabilities.
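A minimal sketch of this modeling step is shown below.  It assumes scikit-learn's GradientBoostingClassifier as the Gradient Boosted Trees implementation (other libraries also expose a predict_proba function) and uses synthetic data in place of the real patient records; the feature matrix, the rule generating the target, and the 75/25 split are all placeholders chosen for the example.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the data: 180 rows, 5 numeric features.
rng = np.random.default_rng(0)
X = rng.normal(size=(180, 5))
# Made-up rule for the binary target, with some noise added.
noise = rng.normal(scale=0.5, size=180)
y = (X[:, 0] + 0.5 * X[:, 1] + noise > 0).astype(int)

# Hold out 25% of the rows for evaluation, keeping class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit gradient boosted trees with default hyperparameters.
model = GradientBoostingClassifier(random_state=0)
model.fit(X_train, y_train)

# predict_proba returns one column per class; column 1 holds the
# predicted probability of the positive class (heart disease present).
proba = model.predict_proba(X_test)[:, 1]
print(proba[:5])
```

Working with probabilities rather than hard class labels leaves the choice of decision threshold open until the model is evaluated.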

