How are Cancer Rates Trending? 1990-2016

The data comes from IHME through its GBD Results Tool. It covers 29 cancer types broken down by three measures (Incidence, Prevalence, Deaths) for the years 1990-2016.

 

In [1]:
# setup environment
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Data

In [2]:
# read in data
cancer = pd.read_csv('IHME-GBD_2016_DATA-b922583c-1.csv')
In [3]:
# view head
cancer.head()
Out[3]:
   measure_id  measure_name  location_id  location_name  sex_id  sex_name  age_id          age_name  cause_id                            cause_name  metric_id  metric_name  year       val     upper     lower
0           1        Deaths          102  United States       1      Male      27  Age-standardized       450                  Other pharynx cancer          3         Rate  1991  1.431495  1.477536  1.385585
1           1        Deaths          102  United States       2    Female      27  Age-standardized       450                  Other pharynx cancer          3         Rate  1991  0.528434  0.543960  0.513026
2           1        Deaths          102  United States       3      Both      27  Age-standardized       450                  Other pharynx cancer          3         Rate  1991  0.925713  0.948379  0.903631
3           1        Deaths          102  United States       1      Male      27  Age-standardized       453  Gallbladder and biliary tract cancer          3         Rate  1991  1.288725  1.357985  1.148603
4           1        Deaths          102  United States       2    Female      27  Age-standardized       453  Gallbladder and biliary tract cancer          3         Rate  1991  1.438552  1.475629  1.399575
In [4]:
# data info
cancer.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6696 entries, 0 to 6695
Data columns (total 16 columns):
measure_id       6696 non-null int64
measure_name     6696 non-null object
location_id      6696 non-null int64
location_name    6696 non-null object
sex_id           6696 non-null int64
sex_name         6696 non-null object
age_id           6696 non-null int64
age_name         6696 non-null object
cause_id         6696 non-null int64
cause_name       6696 non-null object
metric_id        6696 non-null int64
metric_name      6696 non-null object
year             6696 non-null int64
val              6696 non-null float64
upper            6696 non-null float64
lower            6696 non-null float64
dtypes: float64(3), int64(7), object(6)
memory usage: 837.1+ KB
In [5]:
# unique of location
cancer['location_name'].unique()
Out[5]:
array(['United States'], dtype=object)
In [6]:
# unique of measure_name
cancer['measure_name'].unique()
Out[6]:
array(['Deaths', 'Incidence', 'Prevalence'], dtype=object)
In [7]:
# unique of age_name
cancer['age_name'].unique()
Out[7]:
array(['Age-standardized'], dtype=object)
In [8]:
# unique of metric_name
cancer['metric_name'].unique()
Out[8]:
array(['Rate'], dtype=object)
In [9]:
# unique of sex_name
cancer['sex_name'].unique()
Out[9]:
array(['Male', 'Female', 'Both'], dtype=object)
In [10]:
# unique of year
sorted(cancer['year'].unique().tolist())
Out[10]:
[1990,
 1991,
 1992,
 1993,
 1994,
 1995,
 1996,
 1997,
 1998,
 1999,
 2000,
 2001,
 2002,
 2003,
 2004,
 2005,
 2006,
 2007,
 2008,
 2009,
 2010,
 2011,
 2012,
 2013,
 2014,
 2015,
 2016]
In [11]:
# only keep relevant columns (select with a list and copy so the rename below acts on an independent frame)
cancer_clean = cancer.loc[:, ['measure_name','sex_name','cause_name','year','val','upper','lower']].copy()
In [12]:
# rename columns
cancer_clean.columns = ['measure','sex','cancer_type','year','rate','upper','lower']

Data Description

The data consists of age-standardized cancer rates (per 100,000 population) in the United States from 1990 to 2016 for males, females, and both sexes combined. Three measures are available: Deaths, Incidence, and Prevalence. I cleaned the data by dropping the ID columns and the fields that never vary (location, age group, and metric), then renaming the remaining columns. The cleaned data is used for the subsequent analysis and visualization and can be filtered or subset by measure and sex as needed.
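As a minimal sketch of the filtering mentioned above, the helper below pulls out one measure for one sex. The function name `subset_rates` and the toy frame are assumptions added for illustration; the toy values are made up and are not GBD figures.

```python
import pandas as pd


def subset_rates(df, measure, sex='Both'):
    """Return the rows for one measure and one sex, ordered by cancer type and year."""
    mask = (df['measure'] == measure) & (df['sex'] == sex)
    return df.loc[mask].sort_values(['cancer_type', 'year']).reset_index(drop=True)


# toy frame with the cleaned column names; values are made up for illustration
toy = pd.DataFrame({
    'measure': ['Deaths', 'Deaths', 'Incidence', 'Deaths'],
    'sex': ['Both', 'Male', 'Both', 'Both'],
    'cancer_type': ['Type A', 'Type A', 'Type A', 'Type A'],
    'year': [1991, 1990, 1990, 1990],
    'rate': [2.0, 3.0, 5.0, 2.5],
})

deaths_both = subset_rates(toy, 'Deaths')
print(deaths_both)
```

With the real data, the same call would be `subset_rates(cancer_clean, 'Deaths')`.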

 

In [13]:
# view head of cleaned data
cancer_clean.head()
Out[13]:
   measure     sex                           cancer_type  year      rate     upper     lower
0   Deaths    Male                  Other pharynx cancer  1991  1.431495  1.477536  1.385585
1   Deaths  Female                  Other pharynx cancer  1991  0.528434  0.543960  0.513026
2   Deaths    Both                  Other pharynx cancer  1991  0.925713  0.948379  0.903631
3   Deaths    Male  Gallbladder and biliary tract cancer  1991  1.288725  1.357985  1.148603
4   Deaths  Female  Gallbladder and biliary tract cancer  1991  1.438552  1.475629  1.399575

Exploratory Analysis

In [14]:
# summarize data
cancer_clean.describe()
Out[14]:
              year         rate        upper        lower
count  6696.000000  6696.000000  6696.000000  6696.000000
mean   2003.000000    29.383218    32.220306    27.113216
std       7.789463    71.059104    78.915067    66.355748
min    1990.000000     0.000000     0.000000     0.000000
25%    1996.000000     2.651761     2.878871     2.531163
50%    2003.000000     7.534496     7.920939     7.189958
75%    2010.000000    22.169924    23.758952    20.951590
max    2016.000000   762.516625   932.816320   713.879146
In [15]:
# cancer rate distributions by measure
sns.boxenplot(y='measure', x='rate', data=cancer_clean, orient='h')
sns.despine()
plt.title('Cancer Rate Distributions by Measure', loc='left', fontweight='bold', y=1.02)
plt.savefig('rate_dist_by_measure.png')

Figure: cancer rate distributions by measure (rate_dist_by_measure.png)

In [16]:

# avg cancer rate by measure
measure_grp = cancer_clean.groupby('measure')
print('Average Cancer Rate by Measure')
print('------------------------------')
print(measure_grp['rate'].mean())
Average Cancer Rate by Measure
------------------------------
measure
Deaths         6.068340
Incidence     20.882108
Prevalence    60.918306
Name: rate, dtype: float64
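The averages above are taken over every row for a measure, so the Male, Female, and Both rows are pooled together. A minimal sketch of one way to keep them apart is to group by measure and sex at once. The function name `mean_rate_by_measure_and_sex` and the toy frame are assumptions for illustration, and the toy values are made up.

```python
import pandas as pd


def mean_rate_by_measure_and_sex(df):
    """Average rate for each measure, with one column per sex."""
    return df.groupby(['measure', 'sex'])['rate'].mean().unstack('sex')


# toy frame with the cleaned column names; values are made up for illustration
toy = pd.DataFrame({
    'measure': ['Deaths', 'Deaths', 'Deaths', 'Incidence'],
    'sex': ['Both', 'Both', 'Male', 'Both'],
    'rate': [2.0, 4.0, 6.0, 10.0],
})

by_sex = mean_rate_by_measure_and_sex(toy)
print(by_sex)
```

Combinations that do not occur in the data show up as NaN rather than being averaged in.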
In [17]:
print('Average Cancer Prevalence Rate by Cancer Type')
print('---------------------------------------------')
cancer_avg_rate = cancer_clean[cancer_clean['measure']=='Prevalence'].pivot_table(values='rate', index='cancer_type', aggfunc='mean')
cancer_avg_rate.sort_values(by='rate', ascending=False)
Average Cancer Prevalence Rate by Cancer Type
---------------------------------------------
Out[17]:
                                            rate
cancer_type
Prostate cancer                       335.501417
Breast cancer                         304.255883
Colon and rectum cancer               193.597531
Uterine cancer                         99.455362
Non-melanoma skin cancer               95.349373
Tracheal, bronchus, and lung cancer    93.235849
Malignant skin melanoma                88.131638
Non-Hodgkin lymphoma                   72.713050
Kidney cancer                          54.780670
Bladder cancer                         51.545631
Other neoplasms                        49.869394
Leukemia                               40.002602
Thyroid cancer                         39.529259
Cervical cancer                        33.650794
Ovarian cancer                         27.715949
Lip and oral cavity cancer             27.126268
Testicular cancer                      27.004048
Hodgkin lymphoma                       17.617261
Stomach cancer                         15.999151
Larynx cancer                          15.903920
Brain and nervous system cancer        14.313561
Multiple myeloma                       13.513982
Other pharynx cancer                   11.206072
Pancreatic cancer                      10.192249
Esophageal cancer                       5.647933
Liver cancer                            3.926270
Gallbladder and biliary tract cancer    2.553970
Nasopharynx cancer                      2.392813
Mesothelioma                            1.283272

Main Plot

Below is a plot of cancer rate trends for select cancer types and each measure (Incidence, Prevalence, and Deaths). I've chosen to zero in on five cancer types: prostate, breast, colon and rectum, stomach, and leukemia.

 

In [18]:
# subset data where sex = Both and cancers are of the 5 selected cancer types
cancer_sub = cancer_clean[(cancer_clean['sex']=='Both')&(cancer_clean['cancer_type'].isin(['Prostate cancer','Breast cancer','Colon and rectum cancer','Stomach cancer','Leukemia']))]
cancer_sub = cancer_sub.sort_values(['cancer_type','measure','year'])

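# set style (matplotlib 3.6 renamed 'seaborn-dark' to 'seaborn-v0_8-dark')
style = 'seaborn-v0_8-dark' if 'seaborn-v0_8-dark' in plt.style.available else 'seaborn-dark'
plt.style.use(style)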

# define variables
num = 0
cancers = cancer_sub['cancer_type'].unique().tolist()
measures = cancer_sub['measure'].unique().tolist()

# initiate figure
fig, axes = plt.subplots(5, 3, figsize=(16,16), sharex=True)

# loop thru cancers and measures to generate plots
for c in cancers:
    for m in measures:
        # subplot num
        num += 1
    
        # select subplot to plot on
        plt.subplot(5, 3, num)
    
        # create axes variables
        x = cancer_sub['year'][(cancer_sub['cancer_type']==c)&(cancer_sub['measure']==m)]
        y = cancer_sub['rate'][(cancer_sub['cancer_type']==c)&(cancer_sub['measure']==m)]
    
        # plot cancer rates
        plt.plot(x, y, color='orangered', linewidth=1.5, alpha=0.9)
        
        # first column gets cancer type plus measure; other columns get the measure only
        if num in range(1, 15, 3):
            plt.title(c + ' ' + m, loc='left', fontweight='bold')
        else:
            plt.title(m, loc='left', fontweight='bold')

        plt.axhline(cancer_sub['rate'][(cancer_sub['cancer_type']==c)&(cancer_sub['measure']==m)].mean(), 
                    color='gray', linestyle='--', lw=0.7)
    
        # set x ticks
        plt.xticks(range(1990, 2021, 5))
    
        # only keep x tick labels on the bottom row of subplots
        if num in range(1,13):
            plt.tick_params(labelbottom=False)

        # increase label size
        plt.tick_params(labelsize=11)
            
plt.savefig('cancer_rate_trends_by_measure.png')

Figure: cancer rate trends by measure (cancer_rate_trends_by_measure.png)

Findings:

Death rates for all five of the cancers plotted above have been decreasing steadily, which is a good thing. For leukemia, however, the incidence and prevalence rates have been rising steadily. For prostate and stomach cancers, the incidence and prevalence rates began trending upward around 2006 after a period of decline.
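One way to put numbers on these trends is the percent change in rate between two years for each cancer type and measure. This is a minimal sketch, not part of the original analysis: the function name `percent_change` is an assumption, the input is assumed to be already restricted to one sex, and the toy values are made up rather than taken from the GBD data.

```python
import pandas as pd


def percent_change(df, start_year, end_year):
    """Percent change in rate from start_year to end_year per cancer type and measure."""
    wide = df.pivot_table(index=['cancer_type', 'measure'], columns='year', values='rate')
    change = (wide[end_year] - wide[start_year]) / wide[start_year] * 100
    return change.rename('pct_change')


# toy frame with the cleaned column names; values are made up for illustration
toy = pd.DataFrame({
    'cancer_type': ['Type A'] * 4,
    'measure': ['Deaths', 'Deaths', 'Incidence', 'Incidence'],
    'year': [1990, 2016, 1990, 2016],
    'rate': [10.0, 5.0, 4.0, 6.0],
})

change = percent_change(toy, 1990, 2016)
print(change)
```

Calling it with `start_year=2006` instead would isolate the more recent part of each trend.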

Another interesting finding is that the trends for some cancer types have similar shapes. In other words, certain cancers seem to rise and fall together, and I have to wonder whether this points to a common cause.
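The similarity in shape could be checked with pairwise correlations between the yearly series of each cancer type. The sketch below is an illustration under assumptions, not part of the original analysis: `trend_correlations` is a hypothetical helper, the cancer type labels are placeholders, and the toy values are made up.

```python
import pandas as pd


def trend_correlations(df, measure, sex='Both'):
    """Pearson correlation between cancer types' yearly rates for one measure and sex."""
    sub = df[(df['measure'] == measure) & (df['sex'] == sex)]
    wide = sub.pivot(index='year', columns='cancer_type', values='rate')
    return wide.corr()


# toy frame with the cleaned column names; values are made up for illustration
years = [1990, 1991, 1992, 1993]
toy = pd.DataFrame({
    'measure': ['Incidence'] * 12,
    'sex': ['Both'] * 12,
    'cancer_type': ['Type A'] * 4 + ['Type B'] * 4 + ['Type C'] * 4,
    'year': years * 3,
    'rate': [1.0, 2.0, 3.0, 4.0, 2.0, 4.0, 6.0, 8.0, 4.0, 3.0, 2.0, 1.0],
})

corr = trend_correlations(toy, 'Incidence')
print(corr)
```

A high correlation only says that two series move together. Any two series that both drift in the same direction over time will correlate, so it is weak evidence of a common cause on its own.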

Source:

GBD Results tool: Global Burden of Disease Collaborative Network. Global Burden of Disease Study 2016 (GBD 2016) Results. Seattle, United States: Institute for Health Metrics and Evaluation (IHME), 2017. Available from http://ghdx.healthdata.org/gbd-results-tool.
