The Gapminder data used here originally comes from www.gapminder.org. This version is provided by Jennifer Bryan and can be found here. The purpose of this project is to demonstrate how to read in data with the pandas read_csv function, inspect it, perform basic manipulation and visualization, and communicate my findings.
Note: “Per-capita GDP (Gross domestic product) is given in units of international dollars, ‘a hypothetical unit of currency that has the same purchasing power parity that the U.S. dollar had in the United States at a given point in time’ — 2005, in this case.” (from Jennifer Bryan’s GitHub page) GDP per capita is a measurement of a country’s standard of living that accounts for its number of people. It divides the country’s gross domestic product by its total population. An upward trend in GDP per capita signals economic growth.
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# Read in data and view summary
df = pd.read_csv("gapminder.tsv", sep="\t")
df.head()
# How many rows and columns are there?
df.shape
# What are the data types of the columns?
df.dtypes
country       object
continent     object
year           int64
lifeExp      float64
pop            int64
gdpPercap    float64
dtype: object
# What is the range of years in the data?
years = set(df['year'])
years = list(years)
max_yr = max(years)
min_yr = min(years)
yr_range = max_yr - min_yr
"Year range: " + str(yr_range)
'Year range: 55'
[1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, 2002, 2007]
The years range from 1952 to 2007 (a 55-year span) in 5-year increments, which gives 12 survey years.
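The spacing can also be checked directly rather than read off the list. This is a minimal sketch that assumes the twelve years printed above; on the full data, the same series could be built from sorted(df['year'].unique()).

```python
import pandas as pd

# Survey years as printed in the output above
years = pd.Series([1952, 1957, 1962, 1967, 1972, 1977,
                   1982, 1987, 1992, 1997, 2002, 2007])

# Difference between each year and the one before it; the first
# entry has no predecessor, so it is dropped
steps = years.diff().dropna().astype(int)

# A single distinct step means the years are evenly spaced
unique_steps = sorted(set(steps.tolist()))
print(unique_steps)  # prints [5]
```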
# How many countries are represented in the data?
countries = set(df['country'])
countries = list(countries)
"Number of countries: " + str(len(countries))
'Number of countries: 142'
According to Worldometers (accessed on June 25, 2018), there are 195 countries in the world, a figure that “comprises 193 members states of the United Nations and 2 countries that are non-member observer states: the Holy See and the State of Palestine.” With 142 countries, this dataset covers about 73% of them. This matters because it limits the accuracy of any insight drawn from the data. For instance, average life expectancy and average GDP per capita may be slightly skewed. Likewise, summing the population column gives the total for the represented countries only, not for the whole world. For example, I noticed that Russia, with a population of over 100 million, is not included. It’s important to know these things up front, as they may influence how you go about analyzing the data. Either way, interesting insights can still be gleaned from this dataset. In this project, I’ll focus on life expectancy and per capita GDP.
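The coverage figure is simple arithmetic on the two counts quoted above. The sketch below reproduces it and adds a small helper, missing_from, for checking whether particular countries are absent. The helper name is hypothetical and not part of pandas.

```python
# Counts taken from the text above
n_dataset = 142   # countries in this dataset
n_world = 195     # countries per Worldometers (June 25, 2018)

coverage = n_dataset / n_world
print(f"Coverage: {coverage:.0%}")  # prints "Coverage: 73%"


def missing_from(present, expected):
    """Return the expected names that do not appear in `present`, sorted.

    Hypothetical helper: names are compared as exact strings, so
    spelling variants of a country name are treated as different.
    """
    return sorted(set(expected) - set(present))
```

With the data loaded, a call such as missing_from(df['country'], ['Russia']) would return the names that are not found. Because the comparison is an exact string match, it is worth trying alternative spellings before concluding that a country is absent.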
# Are there any nulls in this dataset?
is_null = df.isnull()
is_null.any()
country      False
continent    False
year         False
lifeExp      False
pop          False
gdpPercap    False
dtype: bool
There are no null values.
# Distribution of life expectancy
plt.hist(df['lifeExp'], edgecolor='white', bins=15)
plt.title('Life Expectancy Distribution Among Countries')
plt.text(42, 165, 'Peak 1')
plt.text(62, 238, 'Peak 2')
It looks like life expectancy has a bimodal (two-peak) distribution. This essentially means there are two groups: a large number of observations with a low life expectancy and a large number with a high life expectancy. Note that the histogram pools all years, so each observation is one country in one year rather than one country.
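The peak labels in the plot were placed by eye. As a rough cross-check, the bin counts can be computed with numpy.histogram and scanned for local maxima. The sketch below is illustrative only: local_peaks is a hypothetical helper, and on noisy data it may also report small bumps that are not meaningful modes.

```python
import numpy as np


def local_peaks(values, bins=15):
    """Return (bin_center, count) for every bin that is a local maximum.

    Hypothetical helper: a bin counts as a peak when it is non-empty,
    taller than its left neighbor, and at least as tall as its right one.
    """
    counts, edges = np.histogram(values, bins=bins)
    centers = (edges[:-1] + edges[1:]) / 2
    peaks = []
    for i in range(len(counts)):
        # Treat the outside of the histogram as lower than any bin
        left = counts[i - 1] if i > 0 else -1
        right = counts[i + 1] if i < len(counts) - 1 else -1
        if counts[i] > 0 and counts[i] > left and counts[i] >= right:
            peaks.append((float(centers[i]), int(counts[i])))
    return peaks

# With the data loaded: local_peaks(df['lifeExp'], bins=15)
```

Using the same bins value as the histogram keeps the reported counts comparable to the bar heights in the plot.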
# What is the average life expectancy by year?
avg_lifeExp_by_yr = df.groupby('year')['lifeExp'].mean()
avg_lifeExp_by_yr
year
1952    49.057620
1957    51.507401
1962    53.609249
1967    55.678290
1972    57.647386
1977    59.570157
1982    61.533197
1987    63.212613
1992    64.160338
1997    65.014676
2002    65.694923
2007    67.007423
Name: lifeExp, dtype: float64
# Plot the above result
avg_lifeExp_by_yr.plot()
plt.xticks(range(1952,2010,5))
plt.ylabel('life expectancy')
plt.title('Average Life Expectancy by Year')
plt.tight_layout()
plt.show()
There has been a steady increase in the average life expectancy from 1952 to 2007. Let’s break this out by continent.
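Before splitting by continent, the word "steady" can be checked against the yearly averages printed earlier. This minimal sketch copies those twelve values by hand; with the data loaded, avg_lifeExp_by_yr could be used in place of the hand-typed series.

```python
import pandas as pd

# Yearly averages copied from the groupby output shown earlier
avg_life_exp = pd.Series(
    [49.057620, 51.507401, 53.609249, 55.678290, 57.647386, 59.570157,
     61.533197, 63.212613, 64.160338, 65.014676, 65.694923, 67.007423],
    index=range(1952, 2008, 5),
)

# Gain over each 5-year period (the first year has no predecessor)
gains = avg_life_exp.diff().dropna()

# True only if the average rose in every period
always_rising = bool((gains > 0).all())
total_gain = avg_life_exp.iloc[-1] - avg_life_exp.iloc[0]
print(always_rising, round(total_gain, 2))  # prints: True 17.95
```

Printing gains shows that every period is positive, but the gains from 1987 onward are smaller than those before it.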
# Are there any continents that stand out as main contributors to the increase of life expectancy?
# Are there any continents that contributed little or negatively contributed?
avg_continent = df.groupby(['year','continent'])['lifeExp'].mean()
avg_continent = avg_continent.reset_index()
avg_continent = avg_continent.pivot_table(values='lifeExp', index='year', columns='continent')
avg_continent.plot()
plt.legend(loc='best', ncol=2)
plt.title('Average Life Expectancy by Continent and Year')
plt.xticks(range(1952,2010,5))
plt.tight_layout()
plt.text(1988, 56, 'HIV/AIDS Epidemic')
plt.show()
It appears that Asia has contributed the most to the increase in life expectancy, as it has the steepest increase. While Oceania has consistently had the highest average life expectancy, it has contributed less to the increase, as it has the least steep slope (which would be expected, as it began high). Overall, every continent has an upward trend. However, Africa, which has consistently had the lowest average life expectancy, leveled off from about 1987 to 2002 and then started climbing again. This is likely due to the HIV/AIDS epidemic of the 1980s and 1990s. It wasn’t until around 2000 that global initiatives and cheaper AIDS drugs were made widely available to address this. This epidemic seems to have widened the gap between the life expectancy of Africa and that of other continents, as you can see in the plot for 2007. You can read more about this epidemic here.
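The "steepest" and "least steep" readings above come from the plot. One way to put numbers on them is to subtract each continent's first-year average from its last-year average. The sketch below assumes the column names used in this dataset; change_by_group is a hypothetical helper name.

```python
import pandas as pd


def change_by_group(df, value, group="continent", time="year"):
    """Change in the group mean of `value` between the first and last period.

    Hypothetical helper: returns a Series indexed by group, sorted from
    the largest increase to the smallest.
    """
    # Rows are periods (sorted ascending by groupby), columns are groups
    means = df.groupby([time, group])[value].mean().unstack(group)
    change = means.iloc[-1] - means.iloc[0]
    return change.sort_values(ascending=False)

# With the data loaded: change_by_group(df, 'lifeExp')
```

This compares only the endpoints, so it will not show a plateau like the one described for Africa. It is meant as a complement to the plot, not a replacement.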
# How many countries are represented by continent?
country_cnt_by_continent = df.groupby('continent')['country'].nunique()
country_cnt_by_continent
continent
Africa      52
Americas    25
Asia        33
Europe      30
Oceania      2
Name: country, dtype: int64
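These counts matter when reading the averages in this project. Each average is an unweighted mean over countries, so Oceania's line rests on just 2 countries while Africa's rests on 52, and a small country counts as much as a large one. As an optional comparison, the sketch below computes a population-weighted average by year. The function name is hypothetical, and the column names are those of this dataset.

```python
import pandas as pd


def weighted_life_exp_by_year(df):
    """Population-weighted mean life expectancy for each year.

    Hypothetical helper: weights each country's lifeExp by its pop,
    then divides by the total population recorded for that year.
    """
    weighted_sum = (df["lifeExp"] * df["pop"]).groupby(df["year"]).sum()
    total_pop = df.groupby("year")["pop"].sum()
    return weighted_sum / total_pop

# With the data loaded: weighted_life_exp_by_year(df)
```

The weighted figure describes the average person in the represented countries, while the unweighted one describes the average country. The two answer different questions.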
# Let's create the same plots as above, but this time we'll look at GDP per capita
avg_gdp_by_yr = df.groupby('year')['gdpPercap'].mean()
avg_gdp_by_yr.plot()
plt.title('Average GDP per Capita by Year')
plt.ylabel('gdp per capita')
plt.xticks(range(1952,2010,5))
# Adjust the layout after the title and labels are set so they are accounted for
plt.tight_layout()
plt.show()
continent_gdp = df.groupby(['year','continent'])['gdpPercap'].mean()
continent_gdp = continent_gdp.reset_index()
continent_gdp = continent_gdp.pivot_table(values='gdpPercap', index='year', columns='continent')
continent_gdp.plot()
plt.legend(loc='best', ncol=2)
plt.title('GDP per Capita by Continent and Year')
plt.xticks(range(1952,2010,5))
plt.tight_layout()
plt.show()
Oceania and Europe have similar trends, as do the Americas and Asia. Oceania and Europe have the steepest upward trends, as well as the highest GDP per capita. Africa has consistently had the lowest GDP per capita and shows just a slight upward trend. We see a slight dip in Africa’s GDP per capita from about 1980 to 2000, which could be related to the HIV/AIDS epidemic of that time.
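A dip like the one described for Africa can be located without relying on the plot by listing the survey years in which a continent's average fell below the previous survey year's value. The sketch below assumes this dataset's column names; declining_periods is a hypothetical helper name.

```python
import pandas as pd


def declining_periods(df, value, group="continent", time="year"):
    """Map each group to the periods where its mean fell versus the prior one.

    Hypothetical helper: the first period never appears in the result,
    because it has no earlier period to compare against.
    """
    # Rows are periods (sorted ascending by groupby), columns are groups
    means = df.groupby([time, group])[value].mean().unstack(group)
    diffs = means.diff()
    return {
        g: diffs.index[(diffs[g] < 0).to_numpy()].tolist()
        for g in diffs.columns
    }

# With the data loaded: declining_periods(df, 'gdpPercap')
```

The same call with 'lifeExp' as the value would flag any years in which a continent's average life expectancy dropped.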