Predicting Breast Cancer Using Logistic Regression

This dataset is part of the scikit-learn dataset package. It comes from the Breast Cancer Wisconsin (Diagnostic) Database and contains 569 tumor instances, labeled as either benign (357 instances) or malignant (212 instances). This machine learning project seeks to predict whether a breast tumor is malignant or benign. More information about the data can be found below.


In [1]:
# setup environment
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
%matplotlib inline
In [2]:
# create dataset variable
cancer = load_breast_cancer()

Data Details

In [3]:
# view details of the data
print(cancer.DESCR)
Breast Cancer Wisconsin (Diagnostic) Database
=============================================

Notes
-----
Data Set Characteristics:
    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry 
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 3 is Mean Radius, field
        13 is Radius SE, field 23 is Worst Radius.

        - class:
                - WDBC-Malignant
                - WDBC-Benign

    :Summary Statistics:

    ===================================== ====== ======
                                           Min    Max
    ===================================== ====== ======
    radius (mean):                        6.981  28.11
    texture (mean):                       9.71   39.28
    perimeter (mean):                     43.79  188.5
    area (mean):                          143.5  2501.0
    smoothness (mean):                    0.053  0.163
    compactness (mean):                   0.019  0.345
    concavity (mean):                     0.0    0.427
    concave points (mean):                0.0    0.201
    symmetry (mean):                      0.106  0.304
    fractal dimension (mean):             0.05   0.097
    radius (standard error):              0.112  2.873
    texture (standard error):             0.36   4.885
    perimeter (standard error):           0.757  21.98
    area (standard error):                6.802  542.2
    smoothness (standard error):          0.002  0.031
    compactness (standard error):         0.002  0.135
    concavity (standard error):           0.0    0.396
    concave points (standard error):      0.0    0.053
    symmetry (standard error):            0.008  0.079
    fractal dimension (standard error):   0.001  0.03
    radius (worst):                       7.93   36.04
    texture (worst):                      12.02  49.54
    perimeter (worst):                    50.41  251.2
    area (worst):                         185.2  4254.0
    smoothness (worst):                   0.071  0.223
    compactness (worst):                  0.027  1.058
    concavity (worst):                    0.0    1.252
    concave points (worst):               0.0    0.291
    symmetry (worst):                     0.156  0.664
    fractal dimension (worst):            0.055  0.208
    ===================================== ====== ======

    :Missing Attribute Values: None

    :Class Distribution: 212 - Malignant, 357 - Benign

    :Creator:  Dr. William H. Wolberg, W. Nick Street, Olvi L. Mangasarian

    :Donor: Nick Street

    :Date: November, 1995

This is a copy of UCI ML Breast Cancer Wisconsin (Diagnostic) datasets.
https://goo.gl/U2Uwz2

Features are computed from a digitized image of a fine needle
aspirate (FNA) of a breast mass.  They describe
characteristics of the cell nuclei present in the image.

Separating plane described above was obtained using
Multisurface Method-Tree (MSM-T) [K. P. Bennett, "Decision Tree
Construction Via Linear Programming." Proceedings of the 4th
Midwest Artificial Intelligence and Cognitive Science Society,
pp. 97-101, 1992], a classification method which uses linear
programming to construct a decision tree.  Relevant features
were selected using an exhaustive search in the space of 1-4
features and 1-3 separating planes.

The actual linear program used to obtain the separating plane
in the 3-dimensional space is that described in:
[K. P. Bennett and O. L. Mangasarian: "Robust Linear
Programming Discrimination of Two Linearly Inseparable Sets",
Optimization Methods and Software 1, 1992, 23-34].

This database is also available through the UW CS ftp server:

ftp ftp.cs.wisc.edu
cd math-prog/cpo-dataset/machine-learn/WDBC/

References
----------
   - W.N. Street, W.H. Wolberg and O.L. Mangasarian. Nuclear feature extraction 
     for breast tumor diagnosis. IS&T/SPIE 1993 International Symposium on 
     Electronic Imaging: Science and Technology, volume 1905, pages 861-870,
     San Jose, CA, 1993.
   - O.L. Mangasarian, W.N. Street and W.H. Wolberg. Breast cancer diagnosis and 
     prognosis via linear programming. Operations Research, 43(4), pages 570-577, 
     July-August 1995.
   - W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Machine learning techniques
     to diagnose breast cancer from fine-needle aspirates. Cancer Letters 77 (1994) 
     163-171.

In [4]:

# count of observations by target
import numpy as np

unique, counts = np.unique(cancer.target, return_counts=True)
target_count = dict(zip(unique, counts))
target_count
Out[4]:
{0: 212, 1: 357}

Target Encoding

There are 212 malignant observations and 357 benign observations, and the counts above show 212 zeros and 357 ones. So we know that malignant is encoded as 0 and benign is encoded as 1.
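This can also be confirmed directly from the dataset's target_names attribute, which lists the class names in label order:

# class names in label order: index 0 -> 'malignant', index 1 -> 'benign'
print(cancer.target_names)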

Create Dataframe

Scikit-learn datasets are stored as Bunch objects, which essentially work like dictionaries. However, it is more common to work with data in table format, so I want to create a dataframe of the data before moving forward with the model.
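For example, a Bunch supports both attribute and dictionary-style access:

# Bunch objects behave like dictionaries with attribute access
print(cancer.keys())         # includes 'data', 'target', 'target_names', 'feature_names', 'DESCR', ...
print(cancer['data'].shape)  # same as cancer.data.shape -> (569, 30)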


In [5]:
# create dataframe
cancer_df = pd.DataFrame(cancer.data, columns=cancer.feature_names)

# add target
cancer_df['benign'] = cancer.target  # 1=True (benign); 0=False (malignant)

# view head
cancer_df.head()
Out[5]:
   mean radius  mean texture  mean perimeter  mean area  mean smoothness  \
0        17.99         10.38          122.80     1001.0          0.11840
1        20.57         17.77          132.90     1326.0          0.08474
2        19.69         21.25          130.00     1203.0          0.10960
3        11.42         20.38           77.58      386.1          0.14250
4        20.29         14.34          135.10     1297.0          0.10030

   mean compactness  mean concavity  mean concave points  mean symmetry  \
0           0.27760          0.3001              0.14710         0.2419
1           0.07864          0.0869              0.07017         0.1812
2           0.15990          0.1974              0.12790         0.2069
3           0.28390          0.2414              0.10520         0.2597
4           0.13280          0.1980              0.10430         0.1809

   mean fractal dimension  ...  worst texture  worst perimeter  worst area  \
0                 0.07871  ...          17.33           184.60      2019.0
1                 0.05667  ...          23.41           158.80      1956.0
2                 0.05999  ...          25.53           152.50      1709.0
3                 0.09744  ...          26.50            98.87       567.7
4                 0.05883  ...          16.67           152.20      1575.0

   worst smoothness  worst compactness  worst concavity  worst concave points  \
0            0.1622             0.6656           0.7119                0.2654
1            0.1238             0.1866           0.2416                0.1860
2            0.1444             0.4245           0.4504                0.2430
3            0.2098             0.8663           0.6869                0.2575
4            0.1374             0.2050           0.4000                0.1625

   worst symmetry  worst fractal dimension  benign
0          0.4601                  0.11890       0
1          0.2750                  0.08902       0
2          0.3613                  0.08758       0
3          0.6638                  0.17300       0
4          0.2364                  0.07678       0

5 rows × 31 columns

Train, Test, Split

In [6]:
# import train_test_split
from sklearn.model_selection import train_test_split
In [7]:
# create X and y variables
X = cancer_df.drop('benign', axis=1)
y = cancer_df['benign']

# train, test, split; use stratification since there are significantly more benign instances
# than there are malignant instances
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)
In [8]:
print('Training Set Length')
print(len(X_train))
print('')
print('Test Set Length')
print(len(X_test))
Training Set Length
398

Test Set Length
171
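Since the split was stratified on y, both sets should mirror the full dataset's class balance (357/569, or roughly 63%, benign). A quick check:

# verify that stratification preserved the class proportions
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))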

Run Logistic Regression

In [9]:
# import LogisticRegression
from sklearn.linear_model import LogisticRegression
In [10]:
# run model
lr = LogisticRegression().fit(X_train, y_train)
y_pred = lr.predict(X_test)
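A note on versions: scikit-learn 0.22 changed the default solver from liblinear to lbfgs, so on newer releases the cell above may emit a convergence warning on these unscaled features. A minimal sketch of one common remedy, standardizing the features inside a pipeline (note that scaling changes the fitted coefficients, so results would shift slightly under this variant):

# sketch: scale features so the lbfgs solver converges cleanly
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

scaled_lr = make_pipeline(StandardScaler(), LogisticRegression())
scaled_lr.fit(X_train, y_train)
print(scaled_lr.score(X_test, y_test))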

View Accuracy

In [11]:
# view confusion matrix
from sklearn.metrics import confusion_matrix

lr_matrix = confusion_matrix(y_test, y_pred)
# rows and columns follow label order (0, 1), i.e. malignant first, then benign
lr_matrix = pd.DataFrame(lr_matrix, columns=['Malignant', 'Benign'], index=['Malignant', 'Benign'])
lr_matrix
Out[11]:
           Malignant  Benign
Malignant         57       7
Benign             5     102
In [12]:
# view accuracy
from sklearn.metrics import accuracy_score

lr_accuracy = accuracy_score(y_test, y_pred)
lr_accuracy
Out[12]:
0.9298245614035088
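Accuracy alone does not show which class the errors fall on, and in a diagnostic setting missed malignant tumors (tumors wrongly predicted benign) are the costliest mistakes. scikit-learn's classification_report breaks precision and recall out per class:

# per-class precision and recall; label 0 is malignant, label 1 is benign
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred, target_names=['malignant', 'benign']))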

Re-run with Different C Parameter Values

The C parameter controls the amount of regularization applied to the model; it is the inverse of the regularization strength, with a default value of 1. The higher the C value, the less regularization (a more complex model); the lower the C value, the more regularization (a simpler model).
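Before trying specific values, a quick sweep gives a feel for how sensitive the model is to C. A minimal sketch using the estimator's built-in score method, which reports mean accuracy:

# sketch: compare test accuracy across a range of C values
for C in [0.01, 0.1, 1, 10, 100]:
    model = LogisticRegression(C=C).fit(X_train, y_train)
    print(f'C={C}: test accuracy = {model.score(X_test, y_test):.4f}')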


In [13]:
# re-run the model with a C parameter value of 100
lr100 = LogisticRegression(C=100).fit(X_train, y_train)
y_pred100 = lr100.predict(X_test)
In [14]:
# view confusion matrix
lr_matrix100 = confusion_matrix(y_test, y_pred100)
lr_matrix100 = pd.DataFrame(lr_matrix100, columns=['Malignant', 'Benign'], index=['Malignant', 'Benign'])
lr_matrix100
Out[14]:
           Malignant  Benign
Malignant         60       4
Benign             6     101
In [15]:
# view accuracy
lr100_accuracy = accuracy_score(y_test, y_pred100)
lr100_accuracy
Out[15]:
0.9415204678362573
In [16]:
# re-run the model with a C parameter value of 0.01
lr01 = LogisticRegression(C=0.01).fit(X_train, y_train)
y_pred01 = lr01.predict(X_test)
In [17]:
# view confusion matrix
lr_matrix01 = confusion_matrix(y_test, y_pred01)
lr_matrix01 = pd.DataFrame(lr_matrix01, columns=['Malignant', 'Benign'], index=['Malignant', 'Benign'])
lr_matrix01
Out[17]:
           Malignant  Benign
Malignant         56       8
Benign             5     102
In [18]:
# view accuracy
lr01_accuracy = accuracy_score(y_test, y_pred01)
lr01_accuracy
Out[18]:
0.9239766081871345

Plot Model Coefficients for Each C Parameter

In [19]:
fig = plt.figure(figsize=(12,5))
plt.plot(lr.coef_.T, 'o', label='C=1')
plt.plot(lr100.coef_.T, '^', label='C=100')
plt.plot(lr01.coef_.T, '*', label='C=0.01', color='purple')
plt.xticks(range(cancer.data.shape[1]), cancer.feature_names, rotation=90)
plt.axhline(0, color='gray', lw=1, linestyle='--')
plt.ylim(-5, 5)
plt.xlabel('Features')
plt.ylabel('Coefficient magnitude')
plt.legend()
plt.title('Breast Cancer Feature Coefficients from Logistic Regression with Different C Parameters', 
          loc='left', fontweight='bold')
plt.tight_layout()
plt.savefig('bc_coeffs_log_reg.png')

[Figure: bc_coeffs_log_reg.png, logistic regression coefficients for each feature at C=1, C=100, and C=0.01]

The smaller the C value, the stronger the regularization and the closer the coefficients are pulled toward 0.
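One way to see this numerically is to compare the mean absolute coefficient magnitude across the three fitted models; it should shrink as C decreases:

# mean |coefficient| for each fitted model, largest C first
for name, model in [('C=100', lr100), ('C=1', lr), ('C=0.01', lr01)]:
    print(name, np.abs(model.coef_).mean())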

Summary

Using logistic regression, I was able to correctly classify roughly 92-94% of the breast tumors in the test set (depending on the C parameter value used) as either malignant or benign.

