Classifying the Iris (Flower) Species Using K-Nearest Neighbor

Iris Species Classification

This was a tutorial machine learning project from the book Introduction to Machine Learning with Python (Ch. 1). It uses the k-nearest neighbors algorithm to predict the species of an iris flower from its sepal and petal length and width.


In [25]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import scipy as sp
import sklearn as sk
%matplotlib notebook
In [26]:
print("scikit-learn version: {}".format(sk.__version__))
scikit-learn version: 0.19.1
In [27]:
# Import the iris dataset from the scikit-learn datasets module
from sklearn.datasets import load_iris
iris_dataset = load_iris()
In [28]:
# Print the keys of the iris_dataset
print('Keys of iris_dataset: \n{}'.format(iris_dataset.keys()))
Keys of iris_dataset: 
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names'])
In [29]:
print(iris_dataset['DESCR'])
Iris Plants Database
====================

Notes
-----
Data Set Characteristics:
    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
    :Summary Statistics:

    ============== ==== ==== ======= ===== ====================
                    Min  Max   Mean    SD   Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20  0.76     0.9565  (high!)
    ============== ==== ==== ======= ===== ====================

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

This is a copy of UCI ML iris datasets.
http://archive.ics.uci.edu/ml/datasets/Iris

The famous Iris database, first used by Sir R.A Fisher

This is perhaps the best known database to be found in the
pattern recognition literature.  Fisher's paper is a classic in the field and
is referenced frequently to this day.  (See Duda & Hart, for example.)  The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant.  One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

References
----------
   - Fisher,R.A. "The use of multiple measurements in taxonomic problems"
     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
     Mathematical Statistics" (John Wiley, NY, 1950).
   - Duda,R.O., & Hart,P.E. (1973) Pattern Classification and Scene Analysis.
     (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.
   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
     Structure and Classification Rule for Recognition in Partially Exposed
     Environments".  IEEE Transactions on Pattern Analysis and Machine
     Intelligence, Vol. PAMI-2, No. 1, 67-71.
   - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions
     on Information Theory, May 1972, 431-433.
   - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II
     conceptual clustering system finds 3 classes in the data.
   - Many, many more ...

In [30]:
# The target names are the species we want to predict
print('Target names: {}'.format(iris_dataset['target_names']))
Target names: ['setosa' 'versicolor' 'virginica']
In [31]:
# Shape of data
print('Shape of data: {}'.format(iris_dataset['data'].shape))
Shape of data: (150, 4)
In [32]:
# Print feature names and first 5 samples
print(iris_dataset['feature_names'])
print(iris_dataset['data'][:5])
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
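
As a quick sanity check, the summary statistics quoted in DESCR can be recomputed from the data array itself. The sketch below assumes nothing beyond the cells above; ddof=1 is passed to std so the result matches the sample SD in the table:

# Recompute min/max/mean/SD per feature to compare against the DESCR table
for name, col in zip(iris_dataset['feature_names'], iris_dataset['data'].T):
    print('{:20s} min={:.1f} max={:.1f} mean={:.2f} sd={:.2f}'.format(
        name, col.min(), col.max(), col.mean(), col.std(ddof=1)))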
In [33]:
# The species are encoded as integers from 0 to 2
# 0=setosa, 1=versicolor, 2=virginica
print('Target: \n{}'.format(iris_dataset['target']))
Target: 
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
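
Since the integers are just indices into target_names, the labels can be decoded and the class balance confirmed in a couple of lines (a minimal sketch):

# Count samples per class (should be 50/50/50) and decode the first few labels
print(np.bincount(iris_dataset['target']))
print(iris_dataset['target_names'][iris_dataset['target'][:5]])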
In [34]:
# Divide the dataset into a training set and a testing set using scikit-learn
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(iris_dataset['data'], iris_dataset['target'], random_state=0)

train_test_split

This function shuffles the data and then splits it into training and test sets: 75% of the samples go to training and the remaining 25% to testing. X_train and X_test hold the features, while y_train and y_test hold the corresponding target labels. All four are NumPy arrays.
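
The 75/25 split is only the default. train_test_split also accepts a test_size argument and a stratify argument that preserves the class proportions in both splits; the call below is a sketch of an alternative split, not the one used in this notebook:

# 80/20 split that keeps the 1/3-per-class balance in both sets
X_tr, X_te, y_tr, y_te = train_test_split(
    iris_dataset['data'], iris_dataset['target'],
    test_size=0.2, stratify=iris_dataset['target'], random_state=0)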


In [35]:
print('X_train shape: {}'.format(X_train.shape))
print('y_train shape: {}'.format(y_train.shape))
X_train shape: (112, 4)
y_train shape: (112,)
In [36]:
print('X_test shape: {}'.format(X_test.shape))
print('y_test shape: {}'.format(y_test.shape))
X_test shape: (38, 4)
y_test shape: (38,)
In [37]:
# Create a dataframe from X_train
# Label the columns using the strings in iris_dataset['feature_names']
iris_df = pd.DataFrame(X_train, columns=iris_dataset['feature_names'])
iris_df.head()
Out[37]:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.9               3.0                4.2               1.5
1                5.8               2.6                4.0               1.2
2                6.8               3.0                5.5               2.1
3                4.7               3.2                1.3               0.2
4                6.9               3.1                5.1               2.3
In [53]:
# Create a scatter matrix with the df and color by y_train
pd.plotting.scatter_matrix(iris_df, c=y_train, figsize=(10,10), hist_kwds={'bins':15, 'edgecolor':'white', 'linewidth':0.25}, 
                           s=25, alpha=0.8)

[Scatter matrix of the four iris features, colored by species]

Scatter Matrix Findings

The scatter matrix shows that the three classes (setosa, versicolor, virginica) are relatively well separated: points of the same color cluster together with little overlap. One class (colored purple) is clearly distinct from the other two, which is visible both in the scatterplots and in two of the histograms, where it produces bimodal distributions.
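
Since the petal measurements do most of the separating, a single scatterplot of those two features already shows the structure; a minimal sketch using the same training data:

# Petal length vs. petal width, colored by species
plt.figure()
plt.scatter(X_train[:, 2], X_train[:, 3], c=y_train, s=25, alpha=0.8)
plt.xlabel('petal length (cm)')
plt.ylabel('petal width (cm)')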


In [54]:
# Use k-nearest neighbors as the model
# It predicts the label of a new data point from the labels of the closest training points.  The k refers to the
# number of neighbors consulted, which can be 1 or more.
from sklearn.neighbors import KNeighborsClassifier

# Instantiate a knn object using 1 as the number of neighbors
knn = KNeighborsClassifier(n_neighbors=1)
In [55]:
# Fit the model
knn.fit(X_train, y_train)
Out[55]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=1, p=2,
           weights='uniform')
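
With the model fit, it can classify a single new flower. The measurements below are made-up values for illustration only (the petals are small, so the nearest training points should be setosa):

# Hypothetical new flower: sepal 5.0 x 2.9 cm, petal 1.0 x 0.2 cm
X_new = np.array([[5.0, 2.9, 1.0, 0.2]])
prediction = knn.predict(X_new)
print('Predicted species: {}'.format(iris_dataset['target_names'][prediction]))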
In [56]:
# Make predictions using the predict method of the knn object, using X_test
y_pred = knn.predict(X_test)
print('Test set predictions: {}'.format(y_pred))
Test set predictions: [2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0 2 2 1 0
 2]
In [57]:
# Print test set score
print('Test set score: {:.2f}'.format(np.mean(y_pred==y_test)))
Test set score: 0.97
In [58]:
# We can also compute the accuracy using the score method of the knn object
print('Test set score: {:.2f}'.format(knn.score(X_test, y_test)))
Test set score: 0.97
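
The choice n_neighbors=1 was arbitrary; a quick sweep over a few values of k (a sketch reusing the split above) shows how sensitive the test score is to that setting:

# Refit and score the classifier for several neighbor counts
for k in (1, 3, 5, 7):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print('k={}: test set score {:.2f}'.format(k, clf.score(X_test, y_test)))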