In [119]:
%matplotlib inline

Working with Pandas:

Python data analysis library

Matt Shirley

May 1 2014

Overview

  1. What is (are?) Pandas
  2. Pandas Series & Dataframe
  3. Reading data from CSV
  4. (RE)indexing your data
  5. Using Pandas to plot your data

What is Pandas?

  • Pandas: Panel Data System
  • Pandas provides tools for:
    • Cleaning data
    • Analyzing data
    • Simple regression of data
    • Ploting data
  • Data in Pandas are organized in table-like structures
  • Pandas is built on Numpy as a datastructure backend

What is Pandas?

  • Pandas allows creation of indexed two-dimensional arrays
  • Similar to dataframe object in R language
  • Basic building block is a Series
    • dictionary-like object with an index
  • Series objects are combined as columns to produce a DataFrame
  • DataFrame is an indexed, named two-dimensional array
    • think of it like an Excel workbook

Installation

pip install pandas or conda install pandas

Pandas

In [120]:
from pandas import Series, DataFrame

Recipe for a Series:
s = Series(data, index=index)

  • data can be a dict or a numpy ndarray
In [121]:
fruits = {'red': 'apple',
          'yellow': 'banana',
          'green': 'kiwi',
          'purple': 'grape'}
series_fruits = Series(fruits, index=['red', 'yellow', 'green', 'purple'])
series_fruits
Out[121]:
red        apple
yellow    banana
green       kiwi
purple     grape
dtype: object

Series are like list and dict:

In [122]:
series_fruits[:2]
Out[122]:
red        apple
yellow    banana
dtype: object
In [123]:
series_fruits['red']
Out[123]:
'apple'
In [124]:
try:
    series_fruits['orange']
except KeyError:
    print('Key does not exist!')
Key does not exist!

Series with numeric content can be treated like a Numpy ndarray:

In [125]:
from string import ascii_letters
from numpy.random import randn
numbers = Series(randn(10), index=tuple(ascii_letters[:10]))
In [126]:
numbers
Out[126]:
a   -0.031160
b    1.301435
c    2.039165
d   -0.232808
e   -0.899883
f    0.465924
g   -0.233621
h   -0.422786
i    0.231951
j   -0.400064
dtype: float64
In [127]:
numbers + numbers
Out[127]:
a   -0.062319
b    2.602870
c    4.078330
d   -0.465616
e   -1.799765
f    0.931847
g   -0.467242
h   -0.845571
i    0.463902
j   -0.800128
dtype: float64
In [128]:
numbers*4
Out[128]:
a   -0.124638
b    5.205740
c    8.156660
d   -0.931233
e   -3.599530
f    1.863695
g   -0.934484
h   -1.691143
i    0.927804
j   -1.600255
dtype: float64
In [129]:
numbers**2
Out[129]:
a    0.000971
b    1.693733
c    4.158194
d    0.054200
e    0.809789
f    0.217085
g    0.054579
h    0.178748
i    0.053801
j    0.160051
dtype: float64
In [130]:
numbers.sum()
Out[130]:
1.818153908972731
In [131]:
numbers.describe()
Out[131]:
count    10.000000
mean      0.181815
std       0.885098
min      -0.899883
25%      -0.358453
50%      -0.131984
75%       0.407431
max       2.039165
dtype: float64
In [132]:
veg = {'red': 'tomato',
          'yellow': 'squash',
          'purple': 'eggplant'}
series_veg = Series(veg, index=['red', 'yellow', 'purple'])
series_veg
Out[132]:
red         tomato
yellow      squash
purple    eggplant
dtype: object

Combine Series instances to make a DataFrame:

In [133]:
produce = DataFrame({'fruits': series_fruits, 'vegetables': series_veg})
produce
Out[133]:
fruits vegetables
green kiwi NaN
purple grape eggplant
red apple tomato
yellow banana squash

4 rows × 2 columns

  • You'll notice that Pandas DataFrame handles the missing 'green' vegetable by substituting NaN.
  • NaN, "not a number" is a stand-in for missingness
  • NaN is excluded from operations performed on a dataframe
    • for example, missing numbers would be excluded from a column sum or mean
In [134]:
produce.dropna()
Out[134]:
fruits vegetables
purple grape eggplant
red apple tomato
yellow banana squash

3 rows × 2 columns

In [135]:
produce.fillna('brocolli')
Out[135]:
fruits vegetables
green kiwi brocolli
purple grape eggplant
red apple tomato
yellow banana squash

4 rows × 2 columns

In [136]:
produce.fillna(method='ffill')
Out[136]:
fruits vegetables
green kiwi NaN
purple grape eggplant
red apple tomato
yellow banana squash

4 rows × 2 columns

In [198]:
produce.unstack()
Out[198]:
fruits      green         kiwi
            purple       grape
            red          apple
            yellow      banana
vegetables  green          NaN
            purple    eggplant
            red         tomato
            yellow      squash
dtype: object
In [199]:
produce.reindex(['yellow', 'red', 'purple', 'green'])
Out[199]:
fruits vegetables
yellow banana squash
red apple tomato
purple grape eggplant
green kiwi NaN

4 rows × 2 columns

In [137]:
from pandas import read_csv

movies = read_csv('movies.csv')
movies.head()
Out[137]:
title year length budget rating votes r1 r2 r3 r4 r5 r6 r7 r8 r9 r10 mpaa Action Animation Comedy
0 $ 1971 121 NaN 6.4 348 4.5 4.5 4.5 4.5 14.5 24.5 24.5 14.5 4.5 4.5 NaN 0 0 1 ...
1 $1000 a Touchdown 1939 71 NaN 6.0 20 0.0 14.5 4.5 24.5 14.5 14.5 14.5 4.5 4.5 14.5 NaN 0 0 1 ...
2 $21 a Day Once a Month 1941 7 NaN 8.2 5 0.0 0.0 0.0 0.0 0.0 24.5 0.0 44.5 24.5 24.5 NaN 0 1 0 ...
3 $40,000 1996 70 NaN 8.2 6 14.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 34.5 45.5 NaN 0 0 1 ...
4 $50,000 Climax Show, The 1975 71 NaN 3.4 17 24.5 4.5 0.0 14.5 14.5 4.5 0.0 0.0 0.0 24.5 NaN 0 0 0 ...

5 rows × 24 columns

In [138]:
movies['rating'].mean()
Out[138]:
5.9328502415459319
In [139]:
movies['length'].mean()
Out[139]:
82.33787507654624
In [189]:
list(movies.columns.values)
Out[189]:
['title',
 'year',
 'length',
 'budget',
 'rating',
 'votes',
 'r1',
 'r2',
 'r3',
 'r4',
 'r5',
 'r6',
 'r7',
 'r8',
 'r9',
 'r10',
 'mpaa',
 'Action',
 'Animation',
 'Comedy',
 'Drama',
 'Documentary',
 'Romance',
 'Short']
In [140]:
movies.describe()
Out[140]:
year length budget rating votes r1 r2 r3 r4 r5 r6 r7 r8 r9 r10 Action Animation Comedy Drama Documentary
count 58788.000000 58788.000000 5.215000e+03 58788.000000 58788.000000 58788.000000 58788.000000 58788.000000 58788.000000 58788.000000 58788.000000 58788.000000 58788.000000 58788.000000 58788.000000 58788.000000 58788.000000 58788.000000 58788.000000 58788.000000 ...
mean 1976.133582 82.337875 1.341251e+07 5.932850 632.130384 7.014382 4.022377 4.721159 6.374847 9.796693 13.039158 15.548088 13.875995 8.954208 16.854018 0.079744 0.062768 0.293784 0.371011 0.059060 ...
std 23.735125 44.347717 2.335008e+07 1.553031 3829.621413 10.936759 5.960056 6.451629 7.586141 9.733612 10.977600 11.589577 11.316963 9.440597 15.650722 0.270899 0.242547 0.455498 0.483080 0.235738 ...
min 1893.000000 1.000000 0.000000e+00 1.000000 5.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ...
25% 1958.000000 74.000000 2.500000e+05 5.000000 11.000000 0.000000 0.000000 0.000000 0.000000 4.500000 4.500000 4.500000 4.500000 4.500000 4.500000 0.000000 0.000000 0.000000 0.000000 0.000000 ...
50% 1983.000000 90.000000 3.000000e+06 6.100000 30.000000 4.500000 4.500000 4.500000 4.500000 4.500000 14.500000 14.500000 14.500000 4.500000 14.500000 0.000000 0.000000 0.000000 0.000000 0.000000 ...
75% 1997.000000 100.000000 1.500000e+07 7.000000 112.000000 4.500000 4.500000 4.500000 4.500000 14.500000 14.500000 24.500000 24.500000 14.500000 24.500000 0.000000 0.000000 1.000000 1.000000 0.000000 ...
max 2005.000000 5220.000000 2.000000e+08 10.000000 157608.000000 100.000000 84.500000 84.500000 100.000000 100.000000 84.500000 100.000000 100.000000 100.000000 100.000000 1.000000 1.000000 1.000000 1.000000 1.000000 ...

8 rows × 22 columns

In [158]:
by_year = movies.set_index('year')
by_year.sort_index(inplace=True)
by_year.loc[1990:1991].head()
Out[158]:
title length budget rating votes r1 r2 r3 r4 r5 r6 r7 r8 r9 r10 mpaa Action Animation Comedy Drama
year
1990 Tsarskaya okhota 134 NaN 4.6 9 0.0 0.0 0.0 24.5 0.0 14.5 14.5 34.5 0.0 24.5 NaN 0 0 0 1 ...
1990 Maniac Cop 2 90 4000000 4.4 360 4.5 4.5 4.5 14.5 14.5 24.5 14.5 4.5 4.5 4.5 NaN 1 0 0 0 ...
1990 Jin pai shi jie 87 NaN 4.2 8 14.5 0.0 0.0 0.0 34.5 24.5 14.5 0.0 14.5 0.0 NaN 1 0 0 0 ...
1990 Handmaid's Tale, The 108 NaN 5.8 1261 4.5 4.5 4.5 4.5 14.5 14.5 24.5 14.5 4.5 4.5 NaN 0 0 0 1 ...
1990 Ragazzi fuori 110 NaN 7.5 29 4.5 0.0 0.0 4.5 0.0 24.5 14.5 24.5 24.5 4.5 NaN 0 0 0 1 ...

5 rows × 23 columns

In [142]:
movies[['rating', 'year']].boxplot(by='year')
Out[142]:
<matplotlib.axes.AxesSubplot at 0x1295083c8>
In [143]:
movies[['budget', 'year']].boxplot(by='year')
Out[143]:
<matplotlib.axes.AxesSubplot at 0x127c369e8>
In [156]:
by_year.loc[1990:1999][['budget']].hist(bins=20)
Out[156]:
array([[<matplotlib.axes.AxesSubplot object at 0x12ea77a58>]], dtype=object)
In [157]:
by_year.loc[1960:1969][['budget']].hist(bins=20)
Out[157]:
array([[<matplotlib.axes.AxesSubplot object at 0x12e99ebe0>]], dtype=object)
In [196]:
movies[['rating', 'mpaa']].boxplot(by='mpaa')
Out[196]:
<matplotlib.axes.AxesSubplot at 0x131240208>
In [192]:
movies[(movies['Drama'] == 1) & (movies['Romance'] == 1)][['rating', 'mpaa']].boxplot(by='mpaa')
Out[192]:
<matplotlib.axes.AxesSubplot at 0x130e19438>
In [194]:
movies[(movies['Drama'] == 1) & (movies['Comedy'] == 1)][['budget', 'mpaa']].boxplot(by='mpaa')
Out[194]:
<matplotlib.axes.AxesSubplot at 0x1310475f8>