Data Analysis & Interpretation 4.4: K-Means Cluster Analysis

Week 4

This week’s assignment involves running a k-means cluster analysis. Cluster analysis is an unsupervised machine learning method that partitions the observations in a data set into a smaller set of clusters where each observation belongs to only one cluster. The goal of cluster analysis is to group, or cluster, observations into subsets based on their similarity of responses on multiple variables. Clustering variables should be primarily quantitative variables, but binary variables may also be included.
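As a minimal sketch of the idea on made-up two-dimensional data (not the assignment data), k-means assigns each observation to exactly one of k clusters built around the nearest cluster center:

# minimal k-means sketch on made-up 2D data (illustrative only)
import numpy as np
from sklearn.cluster import KMeans

toy = np.array([[1.0, 1.1], [0.9, 1.0], [5.0, 5.2], [5.1, 4.9]])
toy_km = KMeans(n_clusters=2, random_state=0).fit(toy)

print(toy_km.labels_)           # e.g. [1 1 0 0] -- each point belongs to exactly one cluster
print(toy_km.cluster_centers_)  # coordinates of the two cluster centers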

Run a k-means cluster analysis to identify subgroups of observations in your data set that have similar patterns of response on a set of clustering variables.

Data

In [192]:
# create subset of data
sub16 = sub15.drop(['South','Northeast','Midwest','West'], axis=1)
sub16.head()
Out[192]:
Metropolitan White Black Hispanic Native Asian Pacific Poverty ChildPoverty Professional WorkAtHome MeanCommute PrivateWork PublicWork SelfEmployed FamilyWork Unemployment Employee Rate Crime Rate Employed Rate
0 1 83.1 9.5 4.5 0.6 0.7 0.0 13.4 19.2 33.1 3.9 26.4 81.5 12.3 5.8 0.4 7.5 147.60 391.04 44051.13
1 1 74.5 21.4 2.2 0.4 0.1 0.0 16.8 27.9 21.5 0.7 28.8 76.8 16.1 6.7 0.4 8.3 53.09 212.35 36692.62
2 0 22.2 70.7 4.4 1.2 0.2 0.0 24.6 38.4 18.8 2.8 27.5 79.5 15.1 5.4 0.0 18.0 121.75 683.65 36195.92
3 0 79.9 14.4 3.2 0.7 0.0 0.0 16.7 22.5 21.5 2.1 30.3 77.5 14.7 7.8 0.0 9.4 184.68 177.29 38265.49
4 0 92.5 2.9 2.3 0.2 0.4 0.0 17.0 26.3 28.9 3.4 33.3 76.3 15.3 8.4 0.0 8.3 179.98 993.20 40427.94

5 rows × 29 columns

 

In [193]:
# import the scaler used for standardization
from sklearn import preprocessing

# standardize the data since the variables are on different scales
sub16_scaled = sub16.apply(lambda x: preprocessing.scale(x).astype('float64'), axis=0)
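As a quick sanity check, each standardized column should now have a mean of roughly 0 and a standard deviation of roughly 1:

# verify the standardization: column means ~0 and standard deviations ~1
print(sub16_scaled.mean().round(2))
print(sub16_scaled.std().round(2))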

In [194]:

sub16_scaled.head()
Out[194]:
Metropolitan White Black Hispanic Native Asian Pacific Poverty ChildPoverty Professional WorkAtHome MeanCommute PrivateWork PublicWork SelfEmployed FamilyWork Unemployment Employee Rate Crime Rate Employed Rate
0 1.309876 0.209502 0.182603 -0.316518 -0.162429 -0.205435 -0.34949 -0.489944 -0.376973 0.360065 -0.284268 0.607450 0.957954 -0.811076 -0.589856 0.213812 -0.029628 -0.208476 -0.600790 0.108763
1 1.309876 -0.258333 1.171681 -0.488479 -0.195342 -0.476472 -0.34949 0.065290 0.525477 -1.545652 -1.314620 1.059329 0.316643 -0.166732 -0.362675 0.213812 0.211680 -0.773872 -0.886156 -1.085727
2 -0.763431 -3.103421 5.269289 -0.323995 -0.063692 -0.431299 -0.34949 1.339064 1.614641 -1.989224 -0.638452 0.814561 0.685056 -0.336296 -0.690826 -0.657753 3.137533 -0.363121 -0.133496 -1.166355
3 -0.763431 0.035424 0.589871 -0.413714 -0.145973 -0.521644 -0.34949 0.048960 -0.034665 -1.545652 -0.863841 1.341753 0.412158 -0.404122 -0.085010 -0.657753 0.543478 0.013351 -0.942146 -0.830407
4 -0.763431 0.720856 -0.365961 -0.481003 -0.228254 -0.340953 -0.34949 0.097951 0.359509 -0.329936 -0.445261 1.906602 0.248419 -0.302383 0.066444 -0.657753 0.211680 -0.014766 0.360851 -0.479381

5 rows × 29 columns

Run Model

In [195]:
# import KMeans
from sklearn.cluster import KMeans
In [196]:
# create X variable
X = sub16_scaled

# iterate through different numbers of clusters
import numpy as np
from scipy.spatial.distance import cdist

n_clust = range(1,10)
mean_dist = []

for i in n_clust:
    # initiate model
    km = KMeans(n_clusters=i)
    # fit model
    km = km.fit(X)
    # clusters variable
    clusters = km.labels_
    # centers variable
    centers = km.cluster_centers_
    # mean distance
    mean_dist.append(sum(np.min(cdist(X, centers, 'euclidean'), axis=1)) / X.shape[0])
In [197]:
# plotting libraries (if not already imported earlier in the notebook)
import matplotlib.pyplot as plt
import seaborn as sns

# plot elbow curve
plt.plot(n_clust, mean_dist)
plt.xlabel('Number of Clusters')
plt.ylabel('Mean Distance')
plt.title('Selecting K with Elbow Method', loc='left', fontweight='bold', y=1.02)
sns.despine()
plt.tight_layout()
plt.savefig('kmeans_elbow.png')

[Figure: kmeans_elbow.png, elbow plot of mean distance vs. number of clusters]

Choosing K

We want to choose the fewest number of clusters that still gives a low mean distance, so we look for a bend, or "elbow", in the curve. There is not a strong bend, but there is a slight bend at both 2 and 3 clusters, so I will run models for both 2 and 3 clusters.
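Since the bend is weak, one supplementary check is the average silhouette score, which tends to be higher when clusters are better separated (a quick sketch, not a requirement of the assignment):

# supplementary check: average silhouette score for 2 and 3 clusters
from sklearn.metrics import silhouette_score

for k in (2, 3):
    labels_k = KMeans(n_clusters=k).fit_predict(X)
    print(k, silhouette_score(X, labels_k))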

Re-Run Model with 2 Clusters

In [198]:
km2 = KMeans(n_clusters=2).fit(X)
In [199]:
assignments2 = km2.labels_

Variable Reduction with Canonical Discriminant Analysis

We can’t visualize all of the clustering variables in a single scatterplot, so we apply a data reduction technique that collapses them into a small number of canonical variables we can plot. Here, principal component analysis (PCA) is used as a stand-in for canonical discriminant analysis.

In [200]:
# import decomposition function
from sklearn.decomposition import PCA
In [201]:
# keep the first 2 components (canonical variables)
pca_2 = PCA(2)

# apply the decomposition to the features
plot_columns = pca_2.fit_transform(X)

# plot
plt.scatter(x=plot_columns[:,0], y=plot_columns[:,1], c=assignments2, alpha=0.4)
plt.xlabel('Canonical Variable 1')
plt.ylabel('Canonical Variable 2')
plt.title('Scatterplot of Canonical Variables for 2 Clusters', loc='left', fontweight='bold', y=1.02)
sns.despine()
plt.tight_layout()
plt.savefig('canonical_scatter.png')

[Figure: canonical_scatter.png, scatterplot of the first two canonical variables colored by 2-cluster assignment]
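The two canonical variables plotted above are the first two principal components; the fitted pca_2 object can show how much of the total variance they capture:

# share of total variance captured by the two plotted components
print(pca_2.explained_variance_ratio_)        # variance share per component
print(pca_2.explained_variance_ratio_.sum())  # total share shown in the 2D plot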

In [202]:

# import 3D plotting support (registers the 3d projection)
from mpl_toolkits.mplot3d import Axes3D

# keep the first 3 components (canonical variables)
pca_3 = PCA(3)

# apply the decomposition to the features
plot_columns3 = pca_3.fit_transform(X)

# plot 3D
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(plot_columns3[:,0], plot_columns3[:,1], plot_columns3[:,2], c=assignments2, alpha=0.4)
plt.title('Scatterplot of Canonical Variables for 2 Clusters', loc='left', fontweight='bold', y=1.02)
plt.savefig('canonical_scatter_3D.png')

[Figure: canonical_scatter_3D.png, 3D scatterplot of the first three canonical variables colored by 2-cluster assignment]

Re-Run Model with 3 Clusters

In [203]:
# re-run with 3 clusters
km3 = KMeans(n_clusters=3).fit(X)
In [204]:
# assignments
assignments3 = km3.labels_
In [205]:
# keep the first 2 components (canonical variables)
pca2 = PCA(2)

# apply the decomposition to the features
plot_columns2 = pca2.fit_transform(X)

# plot
plt.scatter(x=plot_columns2[:,0], y=plot_columns2[:,1], c=assignments3, alpha=0.4)
plt.xlabel('Canonical Variable 1')
plt.ylabel('Canonical Variable 2')
plt.title('Scatterplot of Canonical Variables for 3 Clusters', loc='left', fontweight='bold', y=1.02)
sns.despine()
plt.tight_layout()
plt.savefig('canonical_scatter_3clust.png')

[Figure: canonical_scatter_3clust.png, scatterplot of the first two canonical variables colored by 3-cluster assignment]

In [206]:

# keep the first 3 components (canonical variables)
pca3 = PCA(3)

# apply the decomposition to the features
plotcolumns3 = pca3.fit_transform(X)

# plot 3D
fig2 = plt.figure()
ax2 = fig2.add_subplot(111, projection='3d')
ax2.scatter(plotcolumns3[:,0], plotcolumns3[:,1], plotcolumns3[:,2], c=assignments3, alpha=0.4)
plt.title('Scatterplot of Canonical Variables for 3 Clusters', loc='left', fontweight='bold', y=1.02)
plt.savefig('canonical_scatter_3clust_3D.png')

[Figure: canonical_scatter_3clust_3D.png, 3D scatterplot of the first three canonical variables colored by 3-cluster assignment]

Examine Cluster Variable Means by Cluster Assignments

In [207]:
# add cluster assignments to the scaled dataframe
sub16_scaled['assignments2'] = assignments2
sub16_scaled['assignments3'] = assignments3
In [208]:
sub16_scaled.head()
Out[208]:
Metropolitan White Black Hispanic Native Asian Pacific Poverty ChildPoverty Professional PrivateWork PublicWork SelfEmployed FamilyWork Unemployment Employee Rate Crime Rate Employed Rate assignments2 assignments3
0 1.309876 0.209502 0.182603 -0.316518 -0.162429 -0.205435 -0.34949 -0.489944 -0.376973 0.360065 0.957954 -0.811076 -0.589856 0.213812 -0.029628 -0.208476 -0.600790 0.108763 0 0
1 1.309876 -0.258333 1.171681 -0.488479 -0.195342 -0.476472 -0.34949 0.065290 0.525477 -1.545652 0.316643 -0.166732 -0.362675 0.213812 0.211680 -0.773872 -0.886156 -1.085727 1 1
2 -0.763431 -3.103421 5.269289 -0.323995 -0.063692 -0.431299 -0.34949 1.339064 1.614641 -1.989224 0.685056 -0.336296 -0.690826 -0.657753 3.137533 -0.363121 -0.133496 -1.166355 1 1
3 -0.763431 0.035424 0.589871 -0.413714 -0.145973 -0.521644 -0.34949 0.048960 -0.034665 -1.545652 0.412158 -0.404122 -0.085010 -0.657753 0.543478 0.013351 -0.942146 -0.830407 1 1
4 -0.763431 0.720856 -0.365961 -0.481003 -0.228254 -0.340953 -0.34949 0.097951 0.359509 -0.329936 0.248419 -0.302383 0.066444 -0.657753 0.211680 -0.014766 0.360851 -0.479381 1 1

5 rows × 31 columns

 

In [209]:
# 2 clusters group
cluster2grp = sub16_scaled.drop('assignments3', axis=1).groupby('assignments2').mean()

# 3 clusters group
cluster3grp = sub16_scaled.drop('assignments2', axis=1).groupby('assignments3').mean()
In [210]:
# print cluster2grp
cluster2grp
Out[210]:
Metropolitan White Black Hispanic Native Asian Pacific Poverty ChildPoverty Professional WorkAtHome MeanCommute PrivateWork PublicWork SelfEmployed FamilyWork Unemployment Employee Rate Crime Rate Employed Rate
assignments2
0 0.176442 0.345882 -0.322019 -0.170491 -0.098473 0.195511 0.038224 -0.653877 -0.682153 0.582696 0.377211 -0.213765 0.045452 -0.202636 0.20665 0.096163 -0.585169 -0.130221 -0.316077 0.713216
1 -0.175597 -0.344226 0.320477 0.169674 0.098002 -0.194575 -0.038041 0.650746 0.678886 -0.579906 -0.375405 0.212741 -0.045235 0.201665 -0.20566 -0.095702 0.582366 0.129598 0.314564 -0.709801

2 rows × 29 columns

Summary of 2 Clusters Group

Examples of how to interpret the cluster means table above: counties in the first cluster group (assignment 0) are more likely to be metropolitan than counties in the second cluster group (0.18 vs. -0.18). They are also more likely to have a higher Asian population (0.20 vs. -0.19). Counties in the second cluster group, on the other hand, have higher full-time law enforcement employee rates than counties in the first cluster group (0.13 vs. -0.13).
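Because the table shows standardized (z-score) means, the differences can also be read in the original units by grouping the unscaled sub16 frame by the same assignments (the columns selected below are just examples):

# cluster means in original units for a few illustrative variables
cluster2grp_raw = sub16.assign(assignments2=assignments2).groupby('assignments2').mean()
print(cluster2grp_raw[['Metropolitan', 'Poverty', 'Employee Rate']])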

 

In [211]:
# print cluster3grp
cluster3grp
Out[211]:
Metropolitan White Black Hispanic Native Asian Pacific Poverty ChildPoverty Professional WorkAtHome MeanCommute PrivateWork PublicWork SelfEmployed FamilyWork Unemployment Employee Rate Crime Rate Employed Rate
assignments3
0 0.570613 0.225107 -0.172883 -0.148091 -0.143469 0.326984 -0.009867 -0.586466 -0.579698 0.439517 -0.103306 0.083363 0.527894 -0.391459 -0.368040 -0.216656 -0.320857 -0.288300 -0.313744 0.618230
1 -0.295571 -0.374155 0.380843 0.190750 0.032821 -0.209937 -0.039323 0.737438 0.766717 -0.670928 -0.426746 0.229374 -0.030091 0.192889 -0.216802 -0.131620 0.613306 0.092708 0.341062 -0.789349
2 -0.668380 0.380571 -0.526143 -0.111801 0.271846 -0.281543 0.122705 -0.397811 -0.487747 0.593972 1.322336 -0.779424 -1.227069 0.483249 1.450289 0.863739 -0.742574 0.479348 -0.079123 0.449300

3 rows × 29 columns

Summary of 3 Clusters Group

Counties in the first cluster group (assignment 0) are much more likely to be metropolitan than counties in the other two cluster groups (0.57 vs. -0.30 and -0.67). Counties in the second cluster group (assignment 1) are much more likely to have high poverty than the other cluster groups (0.74 vs. -0.59 and -0.40). Counties in the third cluster group (assignment 2) are much more likely to have people who work from home than those in the other cluster groups (1.32 vs. -0.10 and -0.43).
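It is also worth checking how many counties fall into each cluster, since a very small cluster can exaggerate mean differences:

# number of counties assigned to each of the 3 clusters
import pandas as pd
print(pd.Series(assignments3).value_counts().sort_index())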

 
