Many of us work with large and complicated datasets. The Pandas library is a useful tool for data analysis in Python. In this lesson,we will introduce some tools that can be used to analyze a population data set from Gapminder http://www.gapminder.org/.
From you desktop navigiate to WiSE-swc/pandas
What is tabular data? Who works with tabular data?
Some data is stored in a tabular format, meaning the data is organized into rows and columns.
Pandas should be installed with the Anaconda distribution.
Documentation: http://pandas.pydata.org/pandas-docs/stable/index.html
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
To read the Gapminder text file we will use a pandas funtion called read_table().
gDat = pd.read_table('gapminderDataFiveYear.txt')
It is often useful get familiar with data we are working with. What is recorded in our dataset? How is it organized? We can gain some insight by taking a quick peak at the dataset.
gDat
In each row, there is a recording for a country, year, population, continent, life expectancy, and GDP per capita. Further, each row has an interger label.
Since the data set is large, it would be nice to read the column headings without printing the entire dataset. First, we need to know more about how Python treats this data set. What type of object are we working with?
type(gDat)
Pandas loads this tabular data as a DataFrame object. This is a special type of object that is particularly useful for organizing data. Some more information can be found at the documentation webpage.
DataFrame objects have a number of attributes.
gDat.shape
gDat.columns
DataFrame objects also a number of methods.
Look at the first few lines.
gDat.head()
Pass an integer into head() to specify the number of lines to print
gDat.head(7)
Look at the end
gDat.tail()
Summary statistics with describe() * Count, mean,std, min, quartiles, max * Note: maybe not useful for every column
gDat.describe()
More resources: http://pandas.pydata.org/pandas-docs/stable/basics.html
It would be useful to look at sections of the data, rather than the entire dataset all at once. We might also want to use simple plots to take a closer look at the data.
Select a row of data with ix. * select row with label 0.
gDat.ix[[0]]
gDat.ix[[0,5,10]]
gDat[0:5]
gDat.lifeExp
or use an indexing-like notation
gDat['lifeExp']
Many of operations we saw earlier can apply to single column.
gDat['lifeExp'].head()
gDat['lifeExp'].describe()
In this dataset, the year column functions a bit like categorical data since there are only a few unique values.
gDat['year'].describe()
But the continent and country columns are truly categorical
gDat['continent'].describe()
gDat['country'].describe()
These categorical columns display a different set of summary statistics. Note that country, continent were not included in the gDat.describe() command.
Several plots can be easily generated for DataFrame objects using the plot method.
First, display the plots inline with a magic command:
%matplotlib inline
We can specify the type of plot with the kind argument. Also, choose the independent and dependent variables with x and y arguments. * Plot year vs life expectancy in a scatter plot.
gDat.plot(x='year',y='lifeExp',kind='scatter' )
gDat.plot(x='gdpPercap',y='lifeExp',kind='scatter')
What other kinds of plots can be generate with the plot() command?
We can also generate plots for specific columns.
gDat['lifeExp'].plot(kind='kde')
Histograms are also availble.
gDat.hist(column='lifeExp')
Sometimes we need to look at only parts of the data. For example, we might want to look at the data for a particular country or in a particular year. Or we may want to group the data by continent.
We can use the groupby method to split up the data according to repeated values in each column. For example, group the data by continent.
continents=gDat.groupby('continent')
len(continents)
How many unique countries are there? Years?
continents.size()
continents.size().plot(kind='bar')
continents.size().plot(kind='barh')
Retrieve a particular group with the get_group() command.
continents.get_group('Africa').describe()
What is the minimum life expectancy for Asia?
continents.get_group('Asia')['lifeExp'].min()
continents.get_group('Asia').describe()
We can also isolate sets of data with the isin() fuction. isin() returns a list of indices where the data matches a condition. These indices can be passed into the DataFrame object as an argument. * Isolate all of the data with year 2007
gDat[gDat['year'].isin([2007])]
You can also use "==" instead of isin().
gDat[gDat['country'] =='Uruguay']
We could have also accomplished this by accessing the indices of these rows:
gDat[1620:1632]
But this isn't very helpful for a few reasons 1. We need to know the rows that are associated with Urugray ahead of time. How are we supposed to find that? 2. If the DataFrame changes then we need to change the indices.
Write function that will take two countries as an argument and plot the life expectancy vs year for each country on the same axis.
def compare_lifeExp(country1, country2):
"""Plot life expectancy vs year for country1 and country2"""
for c in [country1,country2]:
gDat[gDat['country']==c].plot(x='year',y='lifeExp')
plt.legend((country1,country2))
compare_lifeExp('Canada','Mexico')
Another option for subsetting is to use a pivot table to group the data. * http://pandas.pydata.org/pandas-docs/dev/generated/pandas.tools.pivot.pivot_table.html
The pivot_table allows us to organize the information in a different way. For instance, we might prefer to think of the year column as a single set of observations and index the data in that way. For simplicity, we will view the population of each country indexed by year. We will send these arguments in the pivot_table function: * index = 'year' * columns = 'country' * values = 'pop'
pd.pivot_table(gDat,index=['year'],columns='country', values='pop')
We could may also be interested in more than one value for each country * index by year and print each county's pop and gdp.
pd.pivot_table(gDat,index=['year'],columns=['country'], values=['pop','gdpPercap'])
We could also index by year but apply an aggregation function to each column. The aggregation function defines how the information in the column should be summarized. Here are some examples: * len * mean * min * max
pd.pivot_table(gDat,index=['year'],aggfunc=[len])
pd.pivot_table(gDat,index=['country'],values=['gdpPercap','lifeExp','pop'],aggfunc=[np.mean,np.min])
We can also subset a subset.
spec=['year','lifeExp']
gDat[gDat['country']=='Mexico'][spec]
Or pass multiple conditions for subsetting. * Display Mexico's life expectancy in 2002.
gDat[(gDat['country']=='Mexico') & (gDat['year']==2002)]['lifeExp']
Write a function that will print a given country's life expectancy, population and gdp per capita in a given year. (note data is available only for every 5 years between 1952 and 2007).
def print_stats(gDat,country,year):
""" Prints the life expectancy, gdp per capita and population
of country in year. """
spec = ['lifeExp','pop','gdpPercap']
print 'Statisitcs for', country, 'in', year
print gDat[(gDat['country']==country) & (gDat['year']==year)][spec]
print_stats(gDat,'Canada',2007)
Somtimes we want to calculate useful pieces of information that apply to groups of the data. We can accomplish this using groupby and the agg() or apply() functions.
agg() allows use to aggregate the data in a certain way.
For example, we may want to look up the minimum life expectancy for each continent.
gDat.groupby('continent')['lifeExp'].agg(np.min)
But it would also be helpful to know where and when this occured. For this we can use apply() and indexing.
We will break this down by looking up when the minimum life expectancy occured for Africa.
subset= gDat.groupby('continent').get_group('Africa')
ind=subset['lifeExp'].idxmin()
subset.year[ind]
or even faster
subset.year[subset['lifeExp'].idxmin()]
Great! So Africa had a minimum life expecency in 1992. But we don't want to write this again for each contintent. We can apply this aggregation to each continent in a single line using the apply() method.
We will pass our own aggregation function called lambda as an argument in the apply() method.
Note: A lambda function is a function that is not bound to a name at run time.
gDat.groupby('continent').apply(lambda subset: subset.year[subset['lifeExp'].idxmin()])
Repeat again for country.
gDat.groupby('continent').apply(lambda subset: subset.country[subset['lifeExp'].idxmin()])
Or apply to both country and year at once wih a slightly different lambda function
spec=['year','country','lifeExp'] #columns we want to return
gDat.groupby('continent').apply(lambda subset: subset.ix[subset['lifeExp'].idxmin()])[spec]
Suzy wrote some code to determine which country had the lowest life expectancy in 1982.
What is wrong with her solution?
spec=['country','lifeExp']
gDat[gDat['year']==1982][spec].min()
We can do a quick check to look up Afghanistan's life expectancy in 1982.
gDat[(gDat['year']==1982) & (gDat['country']=='Afghanistan')]
This doesnt match with the answer above because the min() function was applied to each column (country and lifeExp).
She should have tried this.
gDat.ix[gDat[gDat['year']==1982]['lifeExp'].idxmin()]['country']
We can use all of these ideas to generate a plot that looks at a subset of the data. * Plot GDP per capita vs life expectancy in 2007 for each continent.
c=gDat.groupby(['continent'])
for cont in c.groups:
group=c.get_group(cont)
group[group['year']==2007].plot(kind='scatter',x='gdpPercap',y='lifeExp',title=cont)
#plt.axis([-10000,60000,30,90]) #add this in
Or plot each on a different axis.
fig,ax=plt.subplots(1,1)
cols=['m','b','r','g','y']
for cont, col in zip(c.groups,cols):
group=c.get_group(cont)
group[group['year']==2007].plot(kind='scatter',x='gdpPercap',y='lifeExp',label=cont,ax=ax,color=col,alpha=0.5)
ax.set_title(2007)
Write a function the takes a country as an argument and plots the life expectancy against GDP per capita for all years in a scatter plot. Also print the year of the minimum/maximum lifeExp and the year of the miniimim/maximum GDP per capita.
def compare_gdp_lifeExp(gDat,country):
""" plot GDP per capita against life expectancy for a given country.
print year of min/max gdp per capita and life expectancy
"""
sub=gDat[gDat['country']==country]
sub.plot(x='gdpPercap',y='lifeExp',kind='scatter',title=country)
print 'Year of Min/Max GDP per capita'
print gDat.ix[[sub['gdpPercap'].idxmin(),sub['gdpPercap'].idxmax()]]['year']
print 'Year of Min/Max life expectancy'
print gDat.ix[[sub['lifeExp'].idxmin(),sub['lifeExp'].idxmax()]]['year']
compare_gdp_lifeExp(gDat,'Afghanistan')
compare_gdp_lifeExp(gDat,'Canada')
Other resources for plotting: * http://pandas.pydata.org/pandas-docs/stable/visualization.html#trellis-plotting-interface