Intro to Pandas

Many of us work with large and complicated datasets. The Pandas library is a useful tool for data analysis in Python. In this lesson,we will introduce some tools that can be used to analyze a population data set from Gapminder http://www.gapminder.org/.

Objectives

  • Read tabular data into an IPython notebook
  • Access columns of the data
  • Isolate subsets of the data
  • Generate plots based on subsetted data

Resources

  • http://pandas.pydata.org/pandas-docs/stable/index.html
  • http://pandas.pydata.org/pandas-docs/stable/basics.html
  • http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.html
  • https://github.com/datacarpentry/datacarpentry/blob/master/cheatsheets/R_pandas_compare.md
  • http://pandas.pydata.org/pandas-docs/stable/visualization.html

Preparing for the lesson

From you desktop navigiate to WiSE-swc/pandas

Working with tabular data

Question

What is tabular data? Who works with tabular data?

Some data is stored in a tabular format, meaning the data is organized into rows and columns.

  • We will use a libarary called pandas to read this dataset into our notebook.
  • Pandas should be installed with the Anaconda distribution.

  • Documentation: http://pandas.pydata.org/pandas-docs/stable/index.html

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

To read the Gapminder text file we will use a pandas funtion called read_table().

In [2]:
gDat = pd.read_table('gapminderDataFiveYear.txt')

Familiarizing yourself with the data

It is often useful get familiar with data we are working with. What is recorded in our dataset? How is it organized? We can gain some insight by taking a quick peak at the dataset.

In [3]:
gDat
Out[3]:
country year pop continent lifeExp gdpPercap
0 Afghanistan 1952 8425333 Asia 28.801 779.445314
1 Afghanistan 1957 9240934 Asia 30.332 820.853030
2 Afghanistan 1962 10267083 Asia 31.997 853.100710
3 Afghanistan 1967 11537966 Asia 34.020 836.197138
4 Afghanistan 1972 13079460 Asia 36.088 739.981106
5 Afghanistan 1977 14880372 Asia 38.438 786.113360
6 Afghanistan 1982 12881816 Asia 39.854 978.011439
7 Afghanistan 1987 13867957 Asia 40.822 852.395945
8 Afghanistan 1992 16317921 Asia 41.674 649.341395
9 Afghanistan 1997 22227415 Asia 41.763 635.341351
10 Afghanistan 2002 25268405 Asia 42.129 726.734055
11 Afghanistan 2007 31889923 Asia 43.828 974.580338
12 Albania 1952 1282697 Europe 55.230 1601.056136
13 Albania 1957 1476505 Europe 59.280 1942.284244
14 Albania 1962 1728137 Europe 64.820 2312.888958
15 Albania 1967 1984060 Europe 66.220 2760.196931
16 Albania 1972 2263554 Europe 67.690 3313.422188
17 Albania 1977 2509048 Europe 68.930 3533.003910
18 Albania 1982 2780097 Europe 70.420 3630.880722
19 Albania 1987 3075321 Europe 72.000 3738.932735
20 Albania 1992 3326498 Europe 71.581 2497.437901
21 Albania 1997 3428038 Europe 72.950 3193.054604
22 Albania 2002 3508512 Europe 75.651 4604.211737
23 Albania 2007 3600523 Europe 76.423 5937.029526
24 Algeria 1952 9279525 Africa 43.077 2449.008185
25 Algeria 1957 10270856 Africa 45.685 3013.976023
26 Algeria 1962 11000948 Africa 48.303 2550.816880
27 Algeria 1967 12760499 Africa 51.407 3246.991771
28 Algeria 1972 14760787 Africa 54.518 4182.663766
29 Algeria 1977 17152804 Africa 58.014 4910.416756
... ... ... ... ... ... ...
1674 Yemen, Rep. 1982 9657618 Asia 49.113 1977.557010
1675 Yemen, Rep. 1987 11219340 Asia 52.922 1971.741538
1676 Yemen, Rep. 1992 13367997 Asia 55.599 1879.496673
1677 Yemen, Rep. 1997 15826497 Asia 58.020 2117.484526
1678 Yemen, Rep. 2002 18701257 Asia 60.308 2234.820827
1679 Yemen, Rep. 2007 22211743 Asia 62.698 2280.769906
1680 Zambia 1952 2672000 Africa 42.038 1147.388831
1681 Zambia 1957 3016000 Africa 44.077 1311.956766
1682 Zambia 1962 3421000 Africa 46.023 1452.725766
1683 Zambia 1967 3900000 Africa 47.768 1777.077318
1684 Zambia 1972 4506497 Africa 50.107 1773.498265
1685 Zambia 1977 5216550 Africa 51.386 1588.688299
1686 Zambia 1982 6100407 Africa 51.821 1408.678565
1687 Zambia 1987 7272406 Africa 50.821 1213.315116
1688 Zambia 1992 8381163 Africa 46.100 1210.884633
1689 Zambia 1997 9417789 Africa 40.238 1071.353818
1690 Zambia 2002 10595811 Africa 39.193 1071.613938
1691 Zambia 2007 11746035 Africa 42.384 1271.211593
1692 Zimbabwe 1952 3080907 Africa 48.451 406.884115
1693 Zimbabwe 1957 3646340 Africa 50.469 518.764268
1694 Zimbabwe 1962 4277736 Africa 52.358 527.272182
1695 Zimbabwe 1967 4995432 Africa 53.995 569.795071
1696 Zimbabwe 1972 5861135 Africa 55.635 799.362176
1697 Zimbabwe 1977 6642107 Africa 57.674 685.587682
1698 Zimbabwe 1982 7636524 Africa 60.363 788.855041
1699 Zimbabwe 1987 9216418 Africa 62.351 706.157306
1700 Zimbabwe 1992 10704340 Africa 60.377 693.420786
1701 Zimbabwe 1997 11404948 Africa 46.809 792.449960
1702 Zimbabwe 2002 11926563 Africa 39.989 672.038623
1703 Zimbabwe 2007 12311143 Africa 43.487 469.709298

1704 rows × 6 columns

In each row, there is a recording for a country, year, population, continent, life expectancy, and GDP per capita. Further, each row has an interger label.

Since the data set is large, it would be nice to read the column headings without printing the entire dataset. First, we need to know more about how Python treats this data set. What type of object are we working with?

In [4]:
type(gDat)
Out[4]:
pandas.core.frame.DataFrame

Pandas loads this tabular data as a DataFrame object. This is a special type of object that is particularly useful for organizing data. Some more information can be found at the documentation webpage.

  • http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.html

DataFrame objects have a number of attributes.

In [5]:
gDat.shape
Out[5]:
(1704, 6)
In [6]:
gDat.columns
Out[6]:
Index([u'country', u'year', u'pop', u'continent', u'lifeExp', u'gdpPercap'], dtype='object')

DataFrame objects also a number of methods.

Look at the first few lines.

In [7]:
gDat.head()
Out[7]:
country year pop continent lifeExp gdpPercap
0 Afghanistan 1952 8425333 Asia 28.801 779.445314
1 Afghanistan 1957 9240934 Asia 30.332 820.853030
2 Afghanistan 1962 10267083 Asia 31.997 853.100710
3 Afghanistan 1967 11537966 Asia 34.020 836.197138
4 Afghanistan 1972 13079460 Asia 36.088 739.981106

Pass an integer into head() to specify the number of lines to print

In [8]:
gDat.head(7)
Out[8]:
country year pop continent lifeExp gdpPercap
0 Afghanistan 1952 8425333 Asia 28.801 779.445314
1 Afghanistan 1957 9240934 Asia 30.332 820.853030
2 Afghanistan 1962 10267083 Asia 31.997 853.100710
3 Afghanistan 1967 11537966 Asia 34.020 836.197138
4 Afghanistan 1972 13079460 Asia 36.088 739.981106
5 Afghanistan 1977 14880372 Asia 38.438 786.113360
6 Afghanistan 1982 12881816 Asia 39.854 978.011439

Look at the end

In [9]:
gDat.tail()
Out[9]:
country year pop continent lifeExp gdpPercap
1699 Zimbabwe 1987 9216418 Africa 62.351 706.157306
1700 Zimbabwe 1992 10704340 Africa 60.377 693.420786
1701 Zimbabwe 1997 11404948 Africa 46.809 792.449960
1702 Zimbabwe 2002 11926563 Africa 39.989 672.038623
1703 Zimbabwe 2007 12311143 Africa 43.487 469.709298

Summary statistics with describe() * Count, mean,std, min, quartiles, max * Note: maybe not useful for every column

In [10]:
gDat.describe()
Out[10]:
year pop lifeExp gdpPercap
count 1704.00000 1.704000e+03 1704.000000 1704.000000
mean 1979.50000 2.960121e+07 59.474439 7215.327081
std 17.26533 1.061579e+08 12.917107 9857.454543
min 1952.00000 6.001100e+04 23.599000 241.165877
25% 1965.75000 2.793664e+06 48.198000 1202.060309
50% 1979.50000 7.023596e+06 60.712500 3531.846989
75% 1993.25000 1.958522e+07 70.845500 9325.462346
max 2007.00000 1.318683e+09 82.603000 113523.132900

More resources: http://pandas.pydata.org/pandas-docs/stable/basics.html

A closer look at the data

It would be useful to look at sections of the data, rather than the entire dataset all at once. We might also want to use simple plots to take a closer look at the data.

Select a row of data with ix. * select row with label 0.

In [11]:
gDat.ix[[0]]
Out[11]:
country year pop continent lifeExp gdpPercap
0 Afghanistan 1952 8425333 Asia 28.801 779.445314
  • select a set of rows
In [12]:
gDat.ix[[0,5,10]]
Out[12]:
country year pop continent lifeExp gdpPercap
0 Afghanistan 1952 8425333 Asia 28.801 779.445314
5 Afghanistan 1977 14880372 Asia 38.438 786.113360
10 Afghanistan 2002 25268405 Asia 42.129 726.734055
  • select a range of rows like in numpy indexing
In [13]:
gDat[0:5]
Out[13]:
country year pop continent lifeExp gdpPercap
0 Afghanistan 1952 8425333 Asia 28.801 779.445314
1 Afghanistan 1957 9240934 Asia 30.332 820.853030
2 Afghanistan 1962 10267083 Asia 31.997 853.100710
3 Afghanistan 1967 11537966 Asia 34.020 836.197138
4 Afghanistan 1972 13079460 Asia 36.088 739.981106
  • extract a column
In [14]:
gDat.lifeExp
Out[14]:
0     28.801
1     30.332
2     31.997
3     34.020
4     36.088
5     38.438
6     39.854
7     40.822
8     41.674
9     41.763
10    42.129
11    43.828
12    55.230
13    59.280
14    64.820
...
1689    40.238
1690    39.193
1691    42.384
1692    48.451
1693    50.469
1694    52.358
1695    53.995
1696    55.635
1697    57.674
1698    60.363
1699    62.351
1700    60.377
1701    46.809
1702    39.989
1703    43.487
Name: lifeExp, Length: 1704, dtype: float64

or use an indexing-like notation

In [15]:
gDat['lifeExp']
Out[15]:
0     28.801
1     30.332
2     31.997
3     34.020
4     36.088
5     38.438
6     39.854
7     40.822
8     41.674
9     41.763
10    42.129
11    43.828
12    55.230
13    59.280
14    64.820
...
1689    40.238
1690    39.193
1691    42.384
1692    48.451
1693    50.469
1694    52.358
1695    53.995
1696    55.635
1697    57.674
1698    60.363
1699    62.351
1700    60.377
1701    46.809
1702    39.989
1703    43.487
Name: lifeExp, Length: 1704, dtype: float64

Many of operations we saw earlier can apply to single column.

In [16]:
gDat['lifeExp'].head()
Out[16]:
0    28.801
1    30.332
2    31.997
3    34.020
4    36.088
Name: lifeExp, dtype: float64
In [17]:
gDat['lifeExp'].describe()
Out[17]:
count    1704.000000
mean       59.474439
std        12.917107
min        23.599000
25%        48.198000
50%        60.712500
75%        70.845500
max        82.603000
dtype: float64

In this dataset, the year column functions a bit like categorical data since there are only a few unique values.

In [18]:
gDat['year'].describe()
Out[18]:
count    1704.00000
mean     1979.50000
std        17.26533
min      1952.00000
25%      1965.75000
50%      1979.50000
75%      1993.25000
max      2007.00000
dtype: float64

But the continent and country columns are truly categorical

In [19]:
gDat['continent'].describe()
Out[19]:
count       1704
unique         5
top       Africa
freq         624
dtype: object
In [20]:
gDat['country'].describe()
Out[20]:
count               1704
unique               142
top       Czech Republic
freq                  12
dtype: object

These categorical columns display a different set of summary statistics. Note that country, continent were not included in the gDat.describe() command.

Some visualualization tools

Several plots can be easily generated for DataFrame objects using the plot method.

First, display the plots inline with a magic command:

In [21]:
%matplotlib inline

We can specify the type of plot with the kind argument. Also, choose the independent and dependent variables with x and y arguments. * Plot year vs life expectancy in a scatter plot.

In [22]:
gDat.plot(x='year',y='lifeExp',kind='scatter' )
Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0xb94fda0>
  • Plot gdp per capity vs life expectancy in a scatter plot
In [23]:
gDat.plot(x='gdpPercap',y='lifeExp',kind='scatter')
Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0xbacdbe0>

Question

What other kinds of plots can be generate with the plot() command?

We can also generate plots for specific columns.

In [24]:
gDat['lifeExp'].plot(kind='kde')
Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0xbc1f080>

Histograms are also availble.

In [25]:
gDat.hist(column='lifeExp')
Out[25]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x0000000012DF6160>]], dtype=object)

Subsetting and grouping

Sometimes we need to look at only parts of the data. For example, we might want to look at the data for a particular country or in a particular year. Or we may want to group the data by continent.

Groupby

We can use the groupby method to split up the data according to repeated values in each column. For example, group the data by continent.

In [26]:
continents=gDat.groupby('continent')
  • list the number of unique continents.
In [27]:
len(continents)
Out[27]:
5

Question

How many unique countries are there? Years?

  • list the names of the continents and the number of data points in each.
In [28]:
continents.size()
Out[28]:
continent
Africa       624
Americas     300
Asia         396
Europe       360
Oceania       24
dtype: int64
  • Plot in a barchart.
In [29]:
continents.size().plot(kind='bar')
Out[29]:
<matplotlib.axes._subplots.AxesSubplot at 0x12d12b38>
  • Horizontal bar plot
In [30]:
continents.size().plot(kind='barh')
Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0x132046a0>

Retrieve a particular group with the get_group() command.

In [31]:
continents.get_group('Africa').describe()
Out[31]:
year pop lifeExp gdpPercap
count 624.00000 6.240000e+02 624.00000 624.000000
mean 1979.50000 9.916003e+06 48.86533 2193.754578
std 17.27411 1.549092e+07 9.15021 2827.929863
min 1952.00000 6.001100e+04 23.59900 241.165877
25% 1965.75000 1.342075e+06 42.37250 761.247010
50% 1979.50000 4.579311e+06 47.79200 1192.138217
75% 1993.25000 1.080149e+07 54.41150 2377.417422
max 2007.00000 1.350312e+08 76.44200 21951.211760

Question

What is the minimum life expectancy for Asia?

In [32]:
continents.get_group('Asia')['lifeExp'].min()
Out[32]:
28.800999999999998
In [33]:
continents.get_group('Asia').describe()
Out[33]:
year pop lifeExp gdpPercap
count 396.000000 3.960000e+02 396.000000 396.000000
mean 1979.500000 7.703872e+07 60.064903 7902.150428
std 17.282097 2.068852e+08 11.864532 14045.373112
min 1952.000000 1.204470e+05 28.801000 331.000000
25% 1965.750000 3.844393e+06 51.426250 1056.993223
50% 1979.500000 1.453083e+07 61.791500 2646.786844
75% 1993.250000 4.630035e+07 69.505250 8549.255654
max 2007.000000 1.318683e+09 82.603000 113523.132900

Subset

We can also isolate sets of data with the isin() fuction. isin() returns a list of indices where the data matches a condition. These indices can be passed into the DataFrame object as an argument. * Isolate all of the data with year 2007

In [34]:
gDat[gDat['year'].isin([2007])]
Out[34]:
country year pop continent lifeExp gdpPercap
11 Afghanistan 2007 31889923 Asia 43.828 974.580338
23 Albania 2007 3600523 Europe 76.423 5937.029526
35 Algeria 2007 33333216 Africa 72.301 6223.367465
47 Angola 2007 12420476 Africa 42.731 4797.231267
59 Argentina 2007 40301927 Americas 75.320 12779.379640
71 Australia 2007 20434176 Oceania 81.235 34435.367440
83 Austria 2007 8199783 Europe 79.829 36126.492700
95 Bahrain 2007 708573 Asia 75.635 29796.048340
107 Bangladesh 2007 150448339 Asia 64.062 1391.253792
119 Belgium 2007 10392226 Europe 79.441 33692.605080
131 Benin 2007 8078314 Africa 56.728 1441.284873
143 Bolivia 2007 9119152 Americas 65.554 3822.137084
155 Bosnia and Herzegovina 2007 4552198 Europe 74.852 7446.298803
167 Botswana 2007 1639131 Africa 50.728 12569.851770
179 Brazil 2007 190010647 Americas 72.390 9065.800825
191 Bulgaria 2007 7322858 Europe 73.005 10680.792820
203 Burkina Faso 2007 14326203 Africa 52.295 1217.032994
215 Burundi 2007 8390505 Africa 49.580 430.070692
227 Cambodia 2007 14131858 Asia 59.723 1713.778686
239 Cameroon 2007 17696293 Africa 50.430 2042.095240
251 Canada 2007 33390141 Americas 80.653 36319.235010
263 Central African Republic 2007 4369038 Africa 44.741 706.016537
275 Chad 2007 10238807 Africa 50.651 1704.063724
287 Chile 2007 16284741 Americas 78.553 13171.638850
299 China 2007 1318683096 Asia 72.961 4959.114854
311 Colombia 2007 44227550 Americas 72.889 7006.580419
323 Comoros 2007 710960 Africa 65.152 986.147879
335 Congo, Dem. Rep. 2007 64606759 Africa 46.462 277.551859
347 Congo, Rep. 2007 3800610 Africa 55.322 3632.557798
359 Costa Rica 2007 4133884 Americas 78.782 9645.061420
... ... ... ... ... ... ...
1355 Sierra Leone 2007 6144562 Africa 42.568 862.540756
1367 Singapore 2007 4553009 Asia 79.972 47143.179640
1379 Slovak Republic 2007 5447502 Europe 74.663 18678.314350
1391 Slovenia 2007 2009245 Europe 77.926 25768.257590
1403 Somalia 2007 9118773 Africa 48.159 926.141068
1415 South Africa 2007 43997828 Africa 49.339 9269.657808
1427 Spain 2007 40448191 Europe 80.941 28821.063700
1439 Sri Lanka 2007 20378239 Asia 72.396 3970.095407
1451 Sudan 2007 42292929 Africa 58.556 2602.394995
1463 Swaziland 2007 1133066 Africa 39.613 4513.480643
1475 Sweden 2007 9031088 Europe 80.884 33859.748350
1487 Switzerland 2007 7554661 Europe 81.701 37506.419070
1499 Syria 2007 19314747 Asia 74.143 4184.548089
1511 Taiwan 2007 23174294 Asia 78.400 28718.276840
1523 Tanzania 2007 38139640 Africa 52.517 1107.482182
1535 Thailand 2007 65068149 Asia 70.616 7458.396327
1547 Togo 2007 5701579 Africa 58.420 882.969944
1559 Trinidad and Tobago 2007 1056608 Americas 69.819 18008.509240
1571 Tunisia 2007 10276158 Africa 73.923 7092.923025
1583 Turkey 2007 71158647 Europe 71.777 8458.276384
1595 Uganda 2007 29170398 Africa 51.542 1056.380121
1607 United Kingdom 2007 60776238 Europe 79.425 33203.261280
1619 United States 2007 301139947 Americas 78.242 42951.653090
1631 Uruguay 2007 3447496 Americas 76.384 10611.462990
1643 Venezuela 2007 26084662 Americas 73.747 11415.805690
1655 Vietnam 2007 85262356 Asia 74.249 2441.576404
1667 West Bank and Gaza 2007 4018332 Asia 73.422 3025.349798
1679 Yemen, Rep. 2007 22211743 Asia 62.698 2280.769906
1691 Zambia 2007 11746035 Africa 42.384 1271.211593
1703 Zimbabwe 2007 12311143 Africa 43.487 469.709298

142 rows × 6 columns

You can also use "==" instead of isin().

  • Isolate a country (Uruguay).
In [35]:
gDat[gDat['country'] =='Uruguay']
Out[35]:
country year pop continent lifeExp gdpPercap
1620 Uruguay 1952 2252965 Americas 66.071 5716.766744
1621 Uruguay 1957 2424959 Americas 67.044 6150.772969
1622 Uruguay 1962 2598466 Americas 68.253 5603.357717
1623 Uruguay 1967 2748579 Americas 68.468 5444.619620
1624 Uruguay 1972 2829526 Americas 68.673 5703.408898
1625 Uruguay 1977 2873520 Americas 69.481 6504.339663
1626 Uruguay 1982 2953997 Americas 70.805 6920.223051
1627 Uruguay 1987 3045153 Americas 71.918 7452.398969
1628 Uruguay 1992 3149262 Americas 72.752 8137.004775
1629 Uruguay 1997 3262838 Americas 74.223 9230.240708
1630 Uruguay 2002 3363085 Americas 75.307 7727.002004
1631 Uruguay 2007 3447496 Americas 76.384 10611.462990

We could have also accomplished this by accessing the indices of these rows:

In [36]:
gDat[1620:1632]
Out[36]:
country year pop continent lifeExp gdpPercap
1620 Uruguay 1952 2252965 Americas 66.071 5716.766744
1621 Uruguay 1957 2424959 Americas 67.044 6150.772969
1622 Uruguay 1962 2598466 Americas 68.253 5603.357717
1623 Uruguay 1967 2748579 Americas 68.468 5444.619620
1624 Uruguay 1972 2829526 Americas 68.673 5703.408898
1625 Uruguay 1977 2873520 Americas 69.481 6504.339663
1626 Uruguay 1982 2953997 Americas 70.805 6920.223051
1627 Uruguay 1987 3045153 Americas 71.918 7452.398969
1628 Uruguay 1992 3149262 Americas 72.752 8137.004775
1629 Uruguay 1997 3262838 Americas 74.223 9230.240708
1630 Uruguay 2002 3363085 Americas 75.307 7727.002004
1631 Uruguay 2007 3447496 Americas 76.384 10611.462990

But this isn't very helpful for a few reasons 1. We need to know the rows that are associated with Urugray ahead of time. How are we supposed to find that? 2. If the DataFrame changes then we need to change the indices.

Exercise

Write function that will take two countries as an argument and plot the life expectancy vs year for each country on the same axis.

In [37]:
def compare_lifeExp(country1, country2):
    """Plot life expectancy vs year for country1 and country2"""

    for c in [country1,country2]:
        gDat[gDat['country']==c].plot(x='year',y='lifeExp')
    plt.legend((country1,country2))
In [38]:
compare_lifeExp('Canada','Mexico')

Pivot Table

Another option for subsetting is to use a pivot table to group the data. * http://pandas.pydata.org/pandas-docs/dev/generated/pandas.tools.pivot.pivot_table.html

The pivot_table allows us to organize the information in a different way. For instance, we might prefer to think of the year column as a single set of observations and index the data in that way. For simplicity, we will view the population of each country indexed by year. We will send these arguments in the pivot_table function: * index = 'year' * columns = 'country' * values = 'pop'

In [39]:
pd.pivot_table(gDat,index=['year'],columns='country', values='pop') 
Out[39]:
country Afghanistan Albania Algeria Angola Argentina Australia Austria Bahrain Bangladesh Belgium ... Uganda United Kingdom United States Uruguay Venezuela Vietnam West Bank and Gaza Yemen, Rep. Zambia Zimbabwe
year
1952 8425333 1282697 9279525 4232095 17876956 8691212 6927772 120447 46886859 8730405 ... 5824797 50430000 157553000 2252965 5439568 26246839 1030585 4963829 2672000 3080907
1957 9240934 1476505 10270856 4561361 19610538 9712569 6965860 138655 51365468 8989111 ... 6675501 51430000 171984000 2424959 6702668 28998543 1070439 5498090 3016000 3646340
1962 10267083 1728137 11000948 4826015 21283783 10794968 7129864 171863 56839289 9218400 ... 7688797 53292000 186538000 2598466 8143375 33796140 1133134 6120081 3421000 4277736
1967 11537966 1984060 12760499 5247469 22934225 11872264 7376998 202182 62821884 9556500 ... 8900294 54959000 198712000 2748579 9709552 39463910 1142636 6740785 3900000 4995432
1972 13079460 2263554 14760787 5894858 24779799 13177000 7544201 230800 70759295 9709100 ... 10190285 56079000 209896000 2829526 11515649 44655014 1089572 7407075 4506497 5861135
1977 14880372 2509048 17152804 6162675 26983828 14074100 7568430 297410 80428306 9821800 ... 11457758 56179000 220239000 2873520 13503563 50533506 1261091 8403990 5216550 6642107
1982 12881816 2780097 20033753 7016384 29341374 15184200 7574613 377967 93074406 9856303 ... 12939400 56339704 232187835 2953997 15620766 56142181 1425876 9657618 6100407 7636524
1987 13867957 3075321 23254956 7874230 31620918 16257249 7578903 454612 103764241 9870200 ... 15283050 56981620 242803533 3045153 17910182 62826491 1691210 11219340 7272406 9216418
1992 16317921 3326498 26298373 8735988 33958947 17481977 7914969 529491 113704579 10045622 ... 18252190 57866349 256894189 3149262 20265563 69940728 2104779 13367997 8381163 10704340
1997 22227415 3428038 29072015 9875024 36203463 18565243 8069876 598561 123315288 10199787 ... 21210254 58808266 272911760 3262838 22374398 76048996 2826046 15826497 9417789 11404948
2002 25268405 3508512 31287142 10866106 38331121 19546792 8148312 656397 135656790 10311970 ... 24739869 59912431 287675526 3363085 24287670 80908147 3389578 18701257 10595811 11926563
2007 31889923 3600523 33333216 12420476 40301927 20434176 8199783 708573 150448339 10392226 ... 29170398 60776238 301139947 3447496 26084662 85262356 4018332 22211743 11746035 12311143

12 rows × 142 columns

We could may also be interested in more than one value for each country * index by year and print each county's pop and gdp.

In [40]:
pd.pivot_table(gDat,index=['year'],columns=['country'], values=['pop','gdpPercap']) 
Out[40]:
pop ... gdpPercap
country Afghanistan Albania Algeria Angola Argentina Australia Austria Bahrain Bangladesh Belgium ... Uganda United Kingdom United States Uruguay Venezuela Vietnam West Bank and Gaza Yemen, Rep. Zambia Zimbabwe
year
1952 8425333 1282697 9279525 4232095 17876956 8691212 6927772 120447 46886859 8730405 ... 734.753484 9979.508487 13990.48208 5716.766744 7689.799761 605.066492 1515.592329 781.717576 1147.388831 406.884115
1957 9240934 1476505 10270856 4561361 19610538 9712569 6965860 138655 51365468 8989111 ... 774.371069 11283.177950 14847.12712 6150.772969 9802.466526 676.285448 1827.067742 804.830455 1311.956766 518.764268
1962 10267083 1728137 11000948 4826015 21283783 10794968 7129864 171863 56839289 9218400 ... 767.271740 12477.177070 16173.14586 5603.357717 8422.974165 772.049160 2198.956312 825.623201 1452.725766 527.272182
1967 11537966 1984060 12760499 5247469 22934225 11872264 7376998 202182 62821884 9556500 ... 908.918522 14142.850890 19530.36557 5444.619620 9541.474188 637.123289 2649.715007 862.442146 1777.077318 569.795071
1972 13079460 2263554 14760787 5894858 24779799 13177000 7544201 230800 70759295 9709100 ... 950.735869 15895.116410 21806.03594 5703.408898 10505.259660 699.501644 3133.409277 1265.047031 1773.498265 799.362176
1977 14880372 2509048 17152804 6162675 26983828 14074100 7568430 297410 80428306 9821800 ... 843.733137 17428.748460 24072.63213 6504.339663 13143.950950 713.537120 3682.831494 1829.765177 1588.688299 685.587682
1982 12881816 2780097 20033753 7016384 29341374 15184200 7574613 377967 93074406 9856303 ... 682.266227 18232.424520 25009.55914 6920.223051 11152.410110 707.235786 4336.032082 1977.557010 1408.678565 788.855041
1987 13867957 3075321 23254956 7874230 31620918 16257249 7578903 454612 103764241 9870200 ... 617.724406 21664.787670 29884.35041 7452.398969 9883.584648 820.799445 5107.197384 1971.741538 1213.315116 706.157306
1992 16317921 3326498 26298373 8735988 33958947 17481977 7914969 529491 113704579 10045622 ... 644.170797 22705.092540 32003.93224 8137.004775 10733.926310 989.023149 6017.654756 1879.496673 1210.884633 693.420786
1997 22227415 3428038 29072015 9875024 36203463 18565243 8069876 598561 123315288 10199787 ... 816.559081 26074.531360 35767.43303 9230.240708 10165.495180 1385.896769 7110.667619 2117.484526 1071.353818 792.449960
2002 25268405 3508512 31287142 10866106 38331121 19546792 8148312 656397 135656790 10311970 ... 927.721002 29478.999190 39097.09955 7727.002004 8605.047831 1764.456677 4515.487575 2234.820827 1071.613938 672.038623
2007 31889923 3600523 33333216 12420476 40301927 20434176 8199783 708573 150448339 10392226 ... 1056.380121 33203.261280 42951.65309 10611.462990 11415.805690 2441.576404 3025.349798 2280.769906 1271.211593 469.709298

12 rows × 284 columns

We could also index by year but apply an aggregation function to each column. The aggregation function defines how the information in the column should be summarized. Here are some examples: * len * mean * min * max

In [41]:
pd.pivot_table(gDat,index=['year'],aggfunc=[len]) 
Out[41]:
len
continent country gdpPercap lifeExp pop
year
1952 142 142 142 142 142
1957 142 142 142 142 142
1962 142 142 142 142 142
1967 142 142 142 142 142
1972 142 142 142 142 142
1977 142 142 142 142 142
1982 142 142 142 142 142
1987 142 142 142 142 142
1992 142 142 142 142 142
1997 142 142 142 142 142
2002 142 142 142 142 142
2007 142 142 142 142 142
  • Index by country and only look at gdpPerCap, lifeExp, and pop. Calculate the mean and minimum.
In [42]:
pd.pivot_table(gDat,index=['country'],values=['gdpPercap','lifeExp','pop'],aggfunc=[np.mean,np.min])
Out[42]:
mean amin
gdpPercap lifeExp pop gdpPercap lifeExp pop
country
Afghanistan 802.674598 37.478833 1.582372e+07 635.341351 28.801 8.425333e+06
Albania 3255.366633 68.432917 2.580249e+06 1601.056136 55.230 1.282697e+06
Algeria 4426.025973 59.030167 1.987541e+07 2449.008185 43.077 9.279525e+06
Angola 3607.100529 37.883500 7.309390e+06 2277.140884 30.015 4.232095e+06
Argentina 8955.553783 69.060417 2.860224e+07 5911.315053 62.485 1.787696e+07
Australia 19980.595634 74.662917 1.464931e+07 10039.595640 69.120 8.691212e+06
Austria 20411.916279 73.103250 7.583298e+06 6137.076492 66.800 6.927772e+06
Bahrain 18077.663945 65.605667 3.739132e+05 9867.084765 50.939 1.204470e+05
Bangladesh 817.558818 49.834083 9.075540e+07 630.233627 37.484 4.688686e+07
Belgium 19900.758072 73.641750 9.725119e+06 8343.105127 68.000 8.730405e+06
Benin 1155.395107 48.779917 4.017497e+06 949.499064 38.223 1.738315e+06
Bolivia 2961.228754 52.504583 5.610395e+06 2127.686326 40.414 2.883315e+06
Bosnia and Herzegovina 3484.779069 67.707833 3.816525e+06 973.533195 53.820 2.791000e+06
Botswana 5031.503557 54.597500 9.711862e+05 851.241141 46.634 4.423080e+05
Brazil 5829.316653 62.239500 1.223121e+08 2108.944355 50.917 5.660256e+07
Bulgaria 6384.055172 69.743750 8.182985e+06 2444.286648 59.600 7.274900e+06
Burkina Faso 843.990665 44.694000 7.548677e+06 543.255241 31.975 4.469979e+06
Burundi 471.662990 44.817333 4.651608e+06 339.296459 39.031 2.445618e+06
Cambodia 675.367824 47.902750 8.510431e+06 368.469286 31.220 4.693836e+06
Cameroon 1774.634222 48.128500 9.816648e+06 1172.667655 38.523 5.009067e+06
Canada 22410.746340 74.902750 2.446297e+07 11367.161120 68.750 1.478558e+07
Central African Republic 958.784697 43.866917 2.560963e+06 706.016537 35.463 1.291695e+06
Chad 1165.453674 46.773583 5.329256e+06 797.908101 38.092 2.682462e+06
Chile 6703.289147 67.430917 1.120573e+07 3939.978789 54.745 6.377619e+06
China 1488.307694 61.785140 9.581601e+08 400.448611 44.000 5.562635e+08
Colombia 4195.342920 63.897750 2.725610e+07 2144.115096 50.643 1.235077e+07
Comoros 1314.380339 52.381750 3.616839e+05 986.147879 40.715 1.539360e+05
Congo, Dem. Rep. 648.342646 44.543750 3.268166e+07 241.165877 39.143 1.410000e+07
Congo, Rep. 3312.788215 52.501917 1.923209e+06 2125.621418 42.111 8.548850e+05
Costa Rica 5448.610779 70.181417 2.400008e+06 2627.009471 57.206 9.263170e+05
... ... ... ... ... ... ...
Sierra Leone 1072.819493 36.769167 3.605425e+06 574.648158 30.331 2.143249e+06
Singapore 17425.382267 71.220250 2.667817e+06 2315.138227 60.396 1.127000e+06
Slovak Republic 10415.530689 70.696083 4.774507e+06 5074.659104 64.360 3.558137e+06
Slovenia 14074.582109 71.600750 1.794381e+06 4215.041741 65.570 1.489518e+06
Somalia 1140.793252 40.988667 5.197198e+06 882.081822 32.978 2.526994e+06
South Africa 7247.431074 53.993167 2.992835e+07 4725.295531 45.009 1.426494e+07
Spain 14029.826479 74.203417 3.585180e+07 3834.034742 64.940 2.854987e+07
Sri Lanka 1854.731119 66.526083 1.454583e+07 1072.546602 57.593 7.982342e+06
Sudan 1835.010430 48.400500 2.156033e+07 1492.197043 38.635 8.504667e+06
Swaziland 3163.352358 49.002417 6.790520e+05 1148.376626 39.613 2.902430e+05
Sweden 19943.126104 76.177000 8.220029e+06 8527.844662 71.860 7.124673e+06
Switzerland 27074.334405 75.565083 6.384293e+06 14734.232750 69.620 4.815000e+06
Syria 3009.287981 61.346167 9.865379e+06 1643.485354 45.883 3.661549e+06
Taiwan 10224.807181 70.336667 1.687472e+07 1206.947913 58.500 8.550362e+06
Tanzania 849.281271 47.912333 2.049950e+07 698.535607 41.215 8.322925e+06
Thailand 3045.966474 62.200250 4.496163e+07 757.797418 50.848 2.128940e+07
Togo 1153.820116 51.498750 2.895964e+06 859.808657 38.596 1.219113e+06
Trinidad and Tobago 7866.871946 66.828000 1.006470e+06 3023.271928 59.100 6.628500e+05
Tunisia 3477.210351 60.721000 6.686770e+06 1395.232468 44.600 3.647735e+06
Turkey 4469.453380 59.696417 4.590901e+07 1969.100980 43.585 2.223568e+07
Uganda 810.383788 47.618833 1.436105e+07 617.724406 39.978 5.824797e+06
United Kingdom 19380.472986 73.922583 5.608780e+07 9979.508487 69.180 5.043000e+07
United States 26261.151347 73.478500 2.282112e+08 13990.482080 68.440 1.575530e+08
Uruguay 7100.133176 70.781583 2.912487e+06 5444.619620 66.071 2.252965e+06
Venezuela 10088.516252 66.580667 1.512980e+07 7689.799761 55.088 5.439568e+06
Vietnam 1017.712615 57.479500 5.456857e+07 605.066492 40.412 2.624684e+07
West Bank and Gaza 3759.996781 60.328667 1.848606e+06 1515.592329 43.160 1.030585e+06
Yemen, Rep. 1569.274672 46.780417 1.084319e+07 781.717576 32.548 4.963829e+06
Zambia 1358.199409 45.996333 6.353805e+06 1071.353818 39.193 2.672000e+06
Zimbabwe 635.858042 52.663167 7.641966e+06 406.884115 39.989 3.080907e+06

142 rows × 6 columns

Fancier subsetting

We can also subset a subset.

  • Isolate Mexico's year and life expectancy.
In [43]:
spec=['year','lifeExp']
gDat[gDat['country']=='Mexico'][spec]
Out[43]:
year lifeExp
984 1952 50.789
985 1957 55.190
986 1962 58.299
987 1967 60.110
988 1972 62.361
989 1977 65.032
990 1982 67.405
991 1987 69.498
992 1992 71.455
993 1997 73.670
994 2002 74.902
995 2007 76.195

Or pass multiple conditions for subsetting. * Display Mexico's life expectancy in 2002.

In [44]:
gDat[(gDat['country']=='Mexico') & (gDat['year']==2002)]['lifeExp']
Out[44]:
994    74.902
Name: lifeExp, dtype: float64

Exercise

Write a function that will print a given country's life expectancy, population and gdp per capita in a given year. (note data is available only for every 5 years between 1952 and 2007).

In [45]:
def print_stats(gDat,country,year):
    """ Prints the life expectancy, gdp per capita and population
    of country in year. """
    
    spec = ['lifeExp','pop','gdpPercap']
    
    print 'Statisitcs for', country, 'in', year
    print gDat[(gDat['country']==country) & (gDat['year']==year)][spec]
In [46]:
print_stats(gDat,'Canada',2007)
Statisitcs for Canada in 2007
     lifeExp       pop    gdpPercap
251   80.653  33390141  36319.23501

Calculations on subsets

Somtimes we want to calculate useful pieces of information that apply to groups of the data. We can accomplish this using groupby and the agg() or apply() functions.

agg() allows use to aggregate the data in a certain way.

For example, we may want to look up the minimum life expectancy for each continent.

In [47]:
gDat.groupby('continent')['lifeExp'].agg(np.min)
Out[47]:
continent
Africa       23.599
Americas     37.579
Asia         28.801
Europe       43.585
Oceania      69.120
Name: lifeExp, dtype: float64

But it would also be helpful to know where and when this occured. For this we can use apply() and indexing.

We will break this down by looking up when the minimum life expectancy occured for Africa.

  • Isolate the Africa group and call it subset.
In [48]:
subset= gDat.groupby('continent').get_group('Africa')
  • Find the index of the minimum life expectancy. Use the idxmin() method.
In [49]:
ind=subset['lifeExp'].idxmin()
  • Now we pass that index into the year column of subset
In [50]:
subset.year[ind]
Out[50]:
1992

or even faster

In [51]:
subset.year[subset['lifeExp'].idxmin()]
Out[51]:
1992

Great! So Africa had a minimum life expecency in 1992. But we don't want to write this again for each contintent. We can apply this aggregation to each continent in a single line using the apply() method.

We will pass our own aggregation function called lambda as an argument in the apply() method.

Note: A lambda function is a function that is not bound to a name at run time.

In [52]:
gDat.groupby('continent').apply(lambda subset: subset.year[subset['lifeExp'].idxmin()])
Out[52]:
continent
Africa       1992
Americas     1952
Asia         1952
Europe       1952
Oceania      1952
dtype: int64

Repeat again for country.

In [53]:
gDat.groupby('continent').apply(lambda subset: subset.country[subset['lifeExp'].idxmin()])
Out[53]:
continent
Africa            Rwanda
Americas           Haiti
Asia         Afghanistan
Europe            Turkey
Oceania        Australia
dtype: object

Or apply to both country and year at once wih a slightly different lambda function

In [54]:
spec=['year','country','lifeExp'] #columns we want to return
gDat.groupby('continent').apply(lambda subset: subset.ix[subset['lifeExp'].idxmin()])[spec]
Out[54]:
year country lifeExp
continent
Africa 1992 Rwanda 23.599
Americas 1952 Haiti 37.579
Asia 1952 Afghanistan 28.801
Europe 1952 Turkey 43.585
Oceania 1952 Australia 69.120

Exercise

Suzy wrote some code to determine which country had the lowest life expectancy in 1982.

What is wrong with her solution?

In [55]:
spec=['country','lifeExp']
gDat[gDat['year']==1982][spec].min()
Out[55]:
country    Afghanistan
lifeExp         38.445
dtype: object

We can do a quick check to look up Afghanistan's life expectancy in 1982.

In [56]:
gDat[(gDat['year']==1982) & (gDat['country']=='Afghanistan')]
Out[56]:
country year pop continent lifeExp gdpPercap
6 Afghanistan 1982 12881816 Asia 39.854 978.011439

This doesnt match with the answer above because the min() function was applied to each column (country and lifeExp).

She should have tried this.

In [57]:
gDat.ix[gDat[gDat['year']==1982]['lifeExp'].idxmin()]['country']
Out[57]:
'Sierra Leone'

Putting it together - Plots

We can use all of these ideas to generate a plot that looks at a subset of the data. * Plot GDP per capita vs life expectancy in 2007 for each continent.

In [58]:
c=gDat.groupby(['continent'])
for cont in c.groups:
    group=c.get_group(cont)
    group[group['year']==2007].plot(kind='scatter',x='gdpPercap',y='lifeExp',title=cont)
    #plt.axis([-10000,60000,30,90]) #add this in

Or plot each on a different axis.

In [59]:
fig,ax=plt.subplots(1,1)
cols=['m','b','r','g','y']
for cont, col in zip(c.groups,cols):
    group=c.get_group(cont)
    group[group['year']==2007].plot(kind='scatter',x='gdpPercap',y='lifeExp',label=cont,ax=ax,color=col,alpha=0.5)
ax.set_title(2007)
Out[59]:
<matplotlib.text.Text at 0x13633518>

Exercise

Write a function the takes a country as an argument and plots the life expectancy against GDP per capita for all years in a scatter plot. Also print the year of the minimum/maximum lifeExp and the year of the miniimim/maximum GDP per capita.

In [60]:
def compare_gdp_lifeExp(gDat,country):
    """ plot GDP per capita against life expectancy for a given country.
    print year of min/max gdp per capita and life expectancy
    """
    
    sub=gDat[gDat['country']==country]
    sub.plot(x='gdpPercap',y='lifeExp',kind='scatter',title=country)

    print 'Year of Min/Max GDP per capita'
    print gDat.ix[[sub['gdpPercap'].idxmin(),sub['gdpPercap'].idxmax()]]['year']
    print 'Year of Min/Max life expectancy'
    print gDat.ix[[sub['lifeExp'].idxmin(),sub['lifeExp'].idxmax()]]['year']
In [61]:
compare_gdp_lifeExp(gDat,'Afghanistan')
Year of Min/Max GDP per capita
9    1997
6    1982
Name: year, dtype: int64
Year of Min/Max life expectancy
0     1952
11    2007
Name: year, dtype: int64

In [62]:
compare_gdp_lifeExp(gDat,'Canada')
Year of Min/Max GDP per capita
240    1952
251    2007
Name: year, dtype: int64
Year of Min/Max life expectancy
240    1952
251    2007
Name: year, dtype: int64

Other resources for plotting: * http://pandas.pydata.org/pandas-docs/stable/visualization.html#trellis-plotting-interface

Summary

  • pandas is a powerful tool for reading and manipulating tabular data
  • subsets of data can be created using == and isin()
  • data can be grouped by similar values using groupby
  • pivot tables can be used to calculate aggregate statistics over certain variables
  • pandas supports some simple plotting features
In [62]: