Gapminder: Life Expectancy and Per Capita GDP by Continent

The data used, Gapminder, is the version provided by Jennifer Bryan on her GitHub page. The purpose of this project is to demonstrate how to read in data using the pandas read_csv function, inspect the data, perform basic manipulation and visualization, and communicate my findings.

Note: “Per-capita GDP (Gross domestic product) is given in units of international dollars, ‘a hypothetical unit of currency that has the same purchasing power parity that the U.S. dollar had in the United States at a given point in time’ — 2005, in this case.” (from Jennifer Bryan’s GitHub page) GDP per capita is a measurement of a country’s standard of living that accounts for its number of people. It divides the country’s gross domestic product by its total population. An upward trend in GDP per capita signals economic growth.
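As a quick illustration of the definition above, here is a minimal sketch that recovers a country's implied total GDP from the two columns in this dataset and then divides back by the population. The figures are the Afghanistan 1952 row shown further down; the variable names are my own and not part of the dataset.

```python
# Figures for Afghanistan in 1952, copied from the first row of the dataset.
gdp_per_capita = 779.445314   # international dollars per person
population = 8_425_333

# GDP per capita = total GDP / population, so the implied total GDP is
# recovered by multiplying the two columns together.
total_gdp = gdp_per_capita * population

# Dividing the total back by the population returns the per-capita figure.
recovered = total_gdp / population

print(f"Implied total GDP: {total_gdp:,.0f} international dollars")
print(f"GDP per capita:    {recovered:.6f}")
```

The same multiplication could be applied to whole columns (gdpPercap times pop) to get a total GDP column, if one were needed.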

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# Read in data and view the first five rows
df = pd.read_csv("gapminder.tsv", sep="\t")
df.head()
       country continent  year  lifeExp       pop   gdpPercap
0  Afghanistan      Asia  1952   28.801   8425333  779.445314
1  Afghanistan      Asia  1957   30.332   9240934  820.853030
2  Afghanistan      Asia  1962   31.997  10267083  853.100710
3  Afghanistan      Asia  1967   34.020  11537966  836.197138
4  Afghanistan      Asia  1972   36.088  13079460  739.981106
# View the last five rows
df.tail()
       country continent  year  lifeExp       pop   gdpPercap
1699  Zimbabwe    Africa  1987   62.351   9216418  706.157306
1700  Zimbabwe    Africa  1992   60.377  10704340  693.420786
1701  Zimbabwe    Africa  1997   46.809  11404948  792.449960
1702  Zimbabwe    Africa  2002   39.989  11926563  672.038623
1703  Zimbabwe    Africa  2007   43.487  12311143  469.709298
# How many rows and columns are there?
df.shape
(1704, 6)
# What are the data types of the columns?
df.dtypes
country       object
continent     object
year           int64
lifeExp      float64
pop            int64
gdpPercap    float64
dtype: object
# What is the range of years in the data?
years = sorted(set(df['year']))
max_yr = max(years)
min_yr = min(years)
yr_range = max_yr - min_yr
"Year range: " + str(yr_range)
'Year range: 55'
# Which years are present?
years
[1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, 2002, 2007]


The years range from 1952 to 2007 (a 55-year span), but the data is recorded only every 5 years, which gives 12 distinct years.
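The 5-year spacing can be confirmed directly rather than by eye. This is a small sketch using the list of years printed above; the names steps and span are just illustrative.

```python
# The distinct years printed in the cell output above.
years = [1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, 2002, 2007]

# Difference between each year and the one before it.
steps = [later - earlier for earlier, later in zip(years, years[1:])]

# Distance between the first and last year.
span = years[-1] - years[0]

print("Distinct step sizes:", set(steps))
print("Span in years:", span)
print("Number of distinct years:", len(years))
```

If the set of step sizes contains a single value, the years are evenly spaced.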

# How many countries are represented in the data?
countries = set(df['country'])
countries = list(countries)
"Number of countries: " + str(len(countries))
'Number of countries: 142'


Based on Worldometers (accessed on June 25, 2018), there are 195 countries in the world, which “comprises 193 members states of the United Nations and 2 countries that are non-member observer states: the Holy See and the State of Palestine.” This dataset covers 142 of them, or 73% (rounded). This matters because it limits the accuracy of the insights that can be gained. For instance, average life expectancy or average GDP per capita may be skewed by the missing countries, and summing the population gives the total for the represented countries only, not for the whole world. I noticed, for example, that Russia, which has a population of over 100 million, is not included in this data. That’s a huge factor. It’s important to know these things up front, as they may influence how you go about analyzing the data. Either way, interesting insights can still be gleaned from this dataset. In this project, I’ll just be focusing on life expectancy and per capita GDP.
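For reference, the coverage figure is simple arithmetic on the two counts mentioned above. A minimal sketch follows, with variable names of my own choosing.

```python
# Counts taken from the text: countries in this dataset, and countries in
# the world according to Worldometers (accessed June 25, 2018).
countries_in_dataset = 142
countries_in_world = 195

# Share of the world's countries that appear in the dataset.
coverage = countries_in_dataset / countries_in_world

# Countries that are not represented at all.
missing = countries_in_world - countries_in_dataset

print(f"Coverage: {coverage:.0%}")
print(f"Countries not represented: {missing}")
```

Note that this is a share of countries, not of population, so it says nothing on its own about how much of the world's population is covered.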

# Are there any nulls in this dataset?
is_null = df.isnull()
is_null.any()
country      False
continent    False
year         False
lifeExp      False
pop          False
gdpPercap    False
dtype: bool


There are no null values.
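If a dataset did contain nulls, a per-column count is often more useful than a True/False flag. The sketch below assumes a tiny stand-in frame built from the first two Afghanistan rows, with one lifeExp value deliberately blanked out for illustration; that blank is not in the real data.

```python
import pandas as pd

# Two rows from the dataset, with one lifeExp value deliberately set to
# None to illustrate the check. The real data has no missing values.
sample = pd.DataFrame({
    "country": ["Afghanistan", "Afghanistan"],
    "year": [1952, 1957],
    "lifeExp": [28.801, None],
})

# isnull() marks each missing cell; sum() counts them per column.
null_counts = sample.isnull().sum()

# Total number of missing cells in the whole frame.
total_nulls = int(null_counts.sum())

print(null_counts)
print("Total nulls:", total_nulls)
```

On the full Gapminder frame, the same expression should report zero for every column, matching the result above.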

# Distribution of life expectancy
plt.hist(df['lifeExp'], edgecolor='white', bins=15)
plt.title('Life Expectancy Distribution Among Countries')
plt.text(42, 165, 'Peak 1')
plt.text(62, 238, 'Peak 2')



It looks like life expectancy has a bimodal (two-peak) distribution. Because each country appears once for every year in the data, this essentially means the observations fall into two groups: a large number with a low life expectancy and a large number with a high life expectancy.

# What is the average life expectancy by year?
avg_lifeExp_by_yr = df.groupby('year')['lifeExp'].mean()
avg_lifeExp_by_yr
1952    49.057620
1957    51.507401
1962    53.609249
1967    55.678290
1972    57.647386
1977    59.570157
1982    61.533197
1987    63.212613
1992    64.160338
1997    65.014676
2002    65.694923
2007    67.007423
Name: lifeExp, dtype: float64
# Plot the above result
avg_lifeExp_by_yr.plot()
plt.ylabel('life expectancy')
plt.title('Average Life Expectancy by Year')



There has been a steady increase in the average life expectancy from 1952 to 2007. Let’s break this out by continent.
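To put numbers on that increase, here is a small sketch that computes the gain between consecutive years from the averages printed above. The dictionary simply restates that output; the names gains and total_gain are illustrative.

```python
# Average life expectancy by year, copied from the output above.
avg_life_exp = {
    1952: 49.057620, 1957: 51.507401, 1962: 53.609249, 1967: 55.678290,
    1972: 57.647386, 1977: 59.570157, 1982: 61.533197, 1987: 63.212613,
    1992: 64.160338, 1997: 65.014676, 2002: 65.694923, 2007: 67.007423,
}

years = sorted(avg_life_exp)

# Gain over each 5-year period, keyed by the year the period ends.
gains = {
    later: avg_life_exp[later] - avg_life_exp[earlier]
    for earlier, later in zip(years, years[1:])
}

# Overall gain from the first year to the last.
total_gain = avg_life_exp[years[-1]] - avg_life_exp[years[0]]

# Period with the smallest gain.
slowest_period_end = min(gains, key=gains.get)

for year, gain in gains.items():
    print(f"{year - 5}-{year}: +{gain:.2f} years")
print(f"Total gain: {total_gain:.2f} years")
print("Smallest gain ends in:", slowest_period_end)
```

Every period shows a positive gain, which is what "steady increase" means here, although the gains are not all the same size.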

# Are there any continents that stand out as main contributors to the increase in life expectancy?
# Are there any continents that contributed little or contributed negatively?
avg_continent = df.groupby(['year','continent'])['lifeExp'].mean()
avg_continent = avg_continent.reset_index()
avg_continent = avg_continent.pivot_table(values='lifeExp', index='year', columns='continent')
avg_continent.plot()
plt.legend(loc='best', ncol=2)
plt.title('Average Life Expectancy by Continent and Year')
plt.text(1988, 56, 'HIV/AIDS Epidemic')



It appears that Asia has made the greatest contribution to the increase in life expectancy, as it has the steepest increase. While Oceania has consistently had the highest average life expectancy, it has contributed less to the increase, as it has the least steep increase (which would be expected, as it began high). Overall, every continent has an upward trend. However, Africa, which has consistently had the lowest average life expectancy, leveled off from about 1987 to 2002 and then started climbing again. This is likely due to the HIV/AIDS epidemic of the 1980s and 1990s. It wasn’t until around 2000 that global initiatives and cheaper AIDS drugs were made widely available to address this. This epidemic seems to have widened the gap between the life expectancy of Africa and that of other continents, as you can see in the plot for 2007. You can read more about this epidemic here.
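"Steepest increase" above is judged by eye from the plot. One way to make it concrete is to subtract the first row of the year-by-continent table from the last. The sketch below uses a toy table with made-up region names and numbers, purely to show the mechanics; none of these values come from the Gapminder data.

```python
import pandas as pd

# Illustrative numbers only; these are NOT values from the Gapminder data.
# The layout mirrors a pivot table with years as rows and regions as columns.
toy = pd.DataFrame(
    {"Region A": [40.0, 50.0, 60.0], "Region B": [65.0, 68.0, 70.0]},
    index=pd.Index([1952, 1977, 2007], name="year"),
)

# Change from the first year to the last, per column.
total_change = toy.iloc[-1] - toy.iloc[0]

# Column with the largest overall change.
steepest = total_change.idxmax()

print(total_change)
print("Steepest increase:", steepest)
```

Applied to the avg_continent table built above, the same two lines would rank the continents by their overall change in average life expectancy.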

# How many countries are represented by continent?
country_cnt_by_continent = df.groupby('continent')['country'].nunique()
country_cnt_by_continent
Africa      52
Americas    25
Asia        33
Europe      30
Oceania      2
Name: country, dtype: int64
# Let's create the same plots as above, but this time we'll look at GDP per capita
avg_gdp_by_yr = df.groupby('year')['gdpPercap'].mean()
avg_gdp_by_yr.plot()
plt.title('Average GDP per Capita by Year')
plt.ylabel('GDP per capita')


# Average GDP per capita by continent and year
continent_gdp = df.groupby(['year','continent'])['gdpPercap'].mean()
continent_gdp = continent_gdp.reset_index()
continent_gdp = continent_gdp.pivot_table(values='gdpPercap', index='year', columns='continent')
continent_gdp.plot()
plt.legend(loc='best', ncol=2)
plt.title('GDP per Capita by Continent and Year')



Oceania and Europe have similar trends, as do the Americas and Asia. Oceania and Europe have the steepest upward trends, as well as the highest GDP per capita. Africa has consistently had the lowest GDP per capita and shows only a slight upward trend. We see a slight dip in Africa’s GDP per capita from about 1980 to 2000, which could be related to the HIV/AIDS epidemic of that time.
