This week’s assignment is as follows: Submit a blog entry that includes 1) a description of your preliminary statistical analyses and 2) some plots or graphs to help you convey the message.
The data set comprises 180 patients: 124 males (69%) and 56 females (31%). Of these, 80 patients (44%) have been diagnosed with heart disease.
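As a quick check on the percentages above, here is a minimal sketch that recomputes them from the reported counts. The `percent` helper and the dictionary keys are names I made up for illustration.

```python
# Counts reported in this post.
total = 180
counts = {"male": 124, "female": 56, "heart disease": 80}

def percent(part, whole):
    """Return part/whole as a percentage rounded to the nearest integer."""
    return round(100 * part / whole)

summary = {name: percent(n, total) for name, n in counts.items()}
print(summary)  # {'male': 69, 'female': 31, 'heart disease': 44}
```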
Table 1 shows the descriptive statistics of the numerical (non-binary) features.
Because the label (the value being predicted: presence of heart disease) is binary, I used it as the legend when plotting the associations among the numerical (non-binary) features, to see whether patients with heart disease cluster separately from those without. Figure 1 shows these results:
While the density plots (diagonal) show slight variations in the distributions of age, max heart rate achieved, and ST depression, there is no clear clustering that separates patients with heart disease from those without. This is especially visible in the 2D scatter plots, where the blue data points (patients with heart disease) are largely mixed in with the gray data points (patients without heart disease). Figures 2, 3, and 4 show 3D scatter plots that provide a clearer view of potential clustering.
Although the 3D plots appear to show less overlap, the two groups still overlap substantially.
Figure 5 shows the proportion of patients with heart disease by categorical/binary features, such as sex and the number of major vessels colored by fluoroscopy. All but one of the subplots reveal major differences between two or more groups within each feature. The exception is fasting blood sugar > 120 mg/dl, which shows virtually no difference: about 43% of patients have heart disease in both groups. Given this, I removed the feature from the data during the preprocessing stage of the algorithmic model.
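The per-group proportions behind a plot like Figure 5 can be sketched as follows. The records below are made-up toy rows, not the capstone data, and `proportion_by_group` is a hypothetical helper name; the idea is simply to count, for each level of a feature, the share of patients whose label is 1.

```python
from collections import defaultdict

# Hypothetical toy records: (fasting_blood_sugar_gt_120, heart_disease).
# These rows are invented for illustration and are NOT the capstone data.
records = [
    (0, 1), (0, 0), (0, 0), (0, 1),
    (1, 1), (1, 0),
]

def proportion_by_group(rows):
    """Map each group value to the share of rows where the label is 1."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for group, label in rows:
        totals[group] += 1
        positives[group] += label
    return {g: positives[g] / totals[g] for g in totals}

proportions = proportion_by_group(records)
print(proportions)  # {0: 0.5, 1: 0.5}
```

When the proportions are nearly identical across groups, as in this toy case, the feature says little about the label, which is the reasoning I applied to fasting blood sugar.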
ANOVAs were performed on all eight features containing major differences. The two features with only two groups, sex and exercise-induced angina, had statistically significant differences with p-values less than 0.001. The remaining features had more than two groups and therefore required post hoc pairwise comparisons, for which I used Tukey's Honestly Significant Difference (HSD) test. Each of these features showed statistically significant differences in two or more pairwise comparisons at a significance level (alpha) of 0.05.
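For readers who want to see what the ANOVA computes, here is a minimal sketch of the one-way F statistic. It assumes the heart disease indicator (0/1) is treated as the response and each level of a feature forms a group. The function name and the toy values are hypothetical and are not the project data.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k = len(groups)        # number of groups
    n = len(all_values)    # total number of observations

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical toy groups: heart disease indicator for two levels of a feature.
group_a = [1, 1, 1, 0]
group_b = [0, 0, 0, 1]
f_stat, df_b, df_w = one_way_anova_f([group_a, group_b])
print(f_stat, df_b, df_w)  # 2.0 1 6
```

Turning F into a p-value requires the F distribution, which the standard library does not provide. In practice, a library routine such as `scipy.stats.f_oneway` returns both values, and `scipy.stats.tukey_hsd` or statsmodels' `pairwise_tukeyhsd` can handle the post hoc comparisons.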
The next post in this lengthy series will be the final report of the capstone project and will include details regarding the model utilized to predict the probability of the presence of heart disease from cardiovascular results and demographics.