%matplotlib inline
dataframe
object in R languageSeries
Series
objects are combined as columns to produce a DataFrame
DataFrame
is an indexed, named two-dimensional arraypip install pandas
or conda install pandas
from pandas import Series, DataFrame
Recipe for a Series
:s = Series(data, index=index)
dict
or a numpy ndarray
fruits = {'red': 'apple',
'yellow': 'banana',
'green': 'kiwi',
'purple': 'grape'}
series_fruits = Series(fruits, index=['red', 'yellow', 'green', 'purple'])
series_fruits
red apple yellow banana green kiwi purple grape dtype: object
Series are like list
and dict
:
series_fruits[:2]
red apple yellow banana dtype: object
series_fruits['red']
'apple'
try:
series_fruits['orange']
except KeyError:
print('Key does not exist!')
Key does not exist!
Series with numeric content can be treated like a Numpy ndarray
:
from string import ascii_letters
from numpy.random import randn
numbers = Series(randn(10), index=tuple(ascii_letters[:10]))
numbers
a -0.031160 b 1.301435 c 2.039165 d -0.232808 e -0.899883 f 0.465924 g -0.233621 h -0.422786 i 0.231951 j -0.400064 dtype: float64
numbers + numbers
a -0.062319 b 2.602870 c 4.078330 d -0.465616 e -1.799765 f 0.931847 g -0.467242 h -0.845571 i 0.463902 j -0.800128 dtype: float64
numbers*4
a -0.124638 b 5.205740 c 8.156660 d -0.931233 e -3.599530 f 1.863695 g -0.934484 h -1.691143 i 0.927804 j -1.600255 dtype: float64
numbers**2
a 0.000971 b 1.693733 c 4.158194 d 0.054200 e 0.809789 f 0.217085 g 0.054579 h 0.178748 i 0.053801 j 0.160051 dtype: float64
numbers.sum()
1.818153908972731
numbers.describe()
count 10.000000 mean 0.181815 std 0.885098 min -0.899883 25% -0.358453 50% -0.131984 75% 0.407431 max 2.039165 dtype: float64
veg = {'red': 'tomato',
'yellow': 'squash',
'purple': 'eggplant'}
series_veg = Series(veg, index=['red', 'yellow', 'purple'])
series_veg
red tomato yellow squash purple eggplant dtype: object
Combine Series
instances to make a DataFrame
:
produce = DataFrame({'fruits': series_fruits, 'vegetables': series_veg})
produce
fruits | vegetables | |
---|---|---|
green | kiwi | NaN |
purple | grape | eggplant |
red | apple | tomato |
yellow | banana | squash |
4 rows × 2 columns
DataFrame
handles the missing 'green' vegetable by substituting NaN
.NaN
, "not a number" is a stand-in for missingnessNaN
is excluded from operations performed on a dataframeproduce.dropna()
fruits | vegetables | |
---|---|---|
purple | grape | eggplant |
red | apple | tomato |
yellow | banana | squash |
3 rows × 2 columns
produce.fillna('brocolli')
fruits | vegetables | |
---|---|---|
green | kiwi | brocolli |
purple | grape | eggplant |
red | apple | tomato |
yellow | banana | squash |
4 rows × 2 columns
produce.fillna(method='ffill')
fruits | vegetables | |
---|---|---|
green | kiwi | NaN |
purple | grape | eggplant |
red | apple | tomato |
yellow | banana | squash |
4 rows × 2 columns
produce.unstack()
fruits green kiwi purple grape red apple yellow banana vegetables green NaN purple eggplant red tomato yellow squash dtype: object
produce.reindex(['yellow', 'red', 'purple', 'green'])
fruits | vegetables | |
---|---|---|
yellow | banana | squash |
red | apple | tomato |
purple | grape | eggplant |
green | kiwi | NaN |
4 rows × 2 columns
from pandas import read_csv
movies = read_csv('movies.csv')
movies.head()
title | year | length | budget | rating | votes | r1 | r2 | r3 | r4 | r5 | r6 | r7 | r8 | r9 | r10 | mpaa | Action | Animation | Comedy | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | $ | 1971 | 121 | NaN | 6.4 | 348 | 4.5 | 4.5 | 4.5 | 4.5 | 14.5 | 24.5 | 24.5 | 14.5 | 4.5 | 4.5 | NaN | 0 | 0 | 1 | ... |
1 | $1000 a Touchdown | 1939 | 71 | NaN | 6.0 | 20 | 0.0 | 14.5 | 4.5 | 24.5 | 14.5 | 14.5 | 14.5 | 4.5 | 4.5 | 14.5 | NaN | 0 | 0 | 1 | ... |
2 | $21 a Day Once a Month | 1941 | 7 | NaN | 8.2 | 5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 24.5 | 0.0 | 44.5 | 24.5 | 24.5 | NaN | 0 | 1 | 0 | ... |
3 | $40,000 | 1996 | 70 | NaN | 8.2 | 6 | 14.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 34.5 | 45.5 | NaN | 0 | 0 | 1 | ... |
4 | $50,000 Climax Show, The | 1975 | 71 | NaN | 3.4 | 17 | 24.5 | 4.5 | 0.0 | 14.5 | 14.5 | 4.5 | 0.0 | 0.0 | 0.0 | 24.5 | NaN | 0 | 0 | 0 | ... |
5 rows × 24 columns
movies['rating'].mean()
5.9328502415459319
movies['length'].mean()
82.33787507654624
list(movies.columns.values)
['title', 'year', 'length', 'budget', 'rating', 'votes', 'r1', 'r2', 'r3', 'r4', 'r5', 'r6', 'r7', 'r8', 'r9', 'r10', 'mpaa', 'Action', 'Animation', 'Comedy', 'Drama', 'Documentary', 'Romance', 'Short']
movies.describe()
year | length | budget | rating | votes | r1 | r2 | r3 | r4 | r5 | r6 | r7 | r8 | r9 | r10 | Action | Animation | Comedy | Drama | Documentary | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 58788.000000 | 58788.000000 | 5.215000e+03 | 58788.000000 | 58788.000000 | 58788.000000 | 58788.000000 | 58788.000000 | 58788.000000 | 58788.000000 | 58788.000000 | 58788.000000 | 58788.000000 | 58788.000000 | 58788.000000 | 58788.000000 | 58788.000000 | 58788.000000 | 58788.000000 | 58788.000000 | ... |
mean | 1976.133582 | 82.337875 | 1.341251e+07 | 5.932850 | 632.130384 | 7.014382 | 4.022377 | 4.721159 | 6.374847 | 9.796693 | 13.039158 | 15.548088 | 13.875995 | 8.954208 | 16.854018 | 0.079744 | 0.062768 | 0.293784 | 0.371011 | 0.059060 | ... |
std | 23.735125 | 44.347717 | 2.335008e+07 | 1.553031 | 3829.621413 | 10.936759 | 5.960056 | 6.451629 | 7.586141 | 9.733612 | 10.977600 | 11.589577 | 11.316963 | 9.440597 | 15.650722 | 0.270899 | 0.242547 | 0.455498 | 0.483080 | 0.235738 | ... |
min | 1893.000000 | 1.000000 | 0.000000e+00 | 1.000000 | 5.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... |
25% | 1958.000000 | 74.000000 | 2.500000e+05 | 5.000000 | 11.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 4.500000 | 4.500000 | 4.500000 | 4.500000 | 4.500000 | 4.500000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... |
50% | 1983.000000 | 90.000000 | 3.000000e+06 | 6.100000 | 30.000000 | 4.500000 | 4.500000 | 4.500000 | 4.500000 | 4.500000 | 14.500000 | 14.500000 | 14.500000 | 4.500000 | 14.500000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... |
75% | 1997.000000 | 100.000000 | 1.500000e+07 | 7.000000 | 112.000000 | 4.500000 | 4.500000 | 4.500000 | 4.500000 | 14.500000 | 14.500000 | 24.500000 | 24.500000 | 14.500000 | 24.500000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | ... |
max | 2005.000000 | 5220.000000 | 2.000000e+08 | 10.000000 | 157608.000000 | 100.000000 | 84.500000 | 84.500000 | 100.000000 | 100.000000 | 84.500000 | 100.000000 | 100.000000 | 100.000000 | 100.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | ... |
8 rows × 22 columns
by_year = movies.set_index('year')
by_year.sort_index(inplace=True)
by_year.loc[1990:1991].head()
title | length | budget | rating | votes | r1 | r2 | r3 | r4 | r5 | r6 | r7 | r8 | r9 | r10 | mpaa | Action | Animation | Comedy | Drama | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
year | |||||||||||||||||||||
1990 | Tsarskaya okhota | 134 | NaN | 4.6 | 9 | 0.0 | 0.0 | 0.0 | 24.5 | 0.0 | 14.5 | 14.5 | 34.5 | 0.0 | 24.5 | NaN | 0 | 0 | 0 | 1 | ... |
1990 | Maniac Cop 2 | 90 | 4000000 | 4.4 | 360 | 4.5 | 4.5 | 4.5 | 14.5 | 14.5 | 24.5 | 14.5 | 4.5 | 4.5 | 4.5 | NaN | 1 | 0 | 0 | 0 | ... |
1990 | Jin pai shi jie | 87 | NaN | 4.2 | 8 | 14.5 | 0.0 | 0.0 | 0.0 | 34.5 | 24.5 | 14.5 | 0.0 | 14.5 | 0.0 | NaN | 1 | 0 | 0 | 0 | ... |
1990 | Handmaid's Tale, The | 108 | NaN | 5.8 | 1261 | 4.5 | 4.5 | 4.5 | 4.5 | 14.5 | 14.5 | 24.5 | 14.5 | 4.5 | 4.5 | NaN | 0 | 0 | 0 | 1 | ... |
1990 | Ragazzi fuori | 110 | NaN | 7.5 | 29 | 4.5 | 0.0 | 0.0 | 4.5 | 0.0 | 24.5 | 14.5 | 24.5 | 24.5 | 4.5 | NaN | 0 | 0 | 0 | 1 | ... |
5 rows × 23 columns
movies[['rating', 'year']].boxplot(by='year')
<matplotlib.axes.AxesSubplot at 0x1295083c8>
movies[['budget', 'year']].boxplot(by='year')
<matplotlib.axes.AxesSubplot at 0x127c369e8>
by_year.loc[1990:1999][['budget']].hist(bins=20)
array([[<matplotlib.axes.AxesSubplot object at 0x12ea77a58>]], dtype=object)
by_year.loc[1960:1969][['budget']].hist(bins=20)
array([[<matplotlib.axes.AxesSubplot object at 0x12e99ebe0>]], dtype=object)
movies[['rating', 'mpaa']].boxplot(by='mpaa')
<matplotlib.axes.AxesSubplot at 0x131240208>
movies[(movies['Drama'] == 1) & (movies['Romance'] == 1)][['rating', 'mpaa']].boxplot(by='mpaa')
<matplotlib.axes.AxesSubplot at 0x130e19438>
movies[(movies['Drama'] == 1) & (movies['Comedy'] == 1)][['budget', 'mpaa']].boxplot(by='mpaa')
<matplotlib.axes.AxesSubplot at 0x1310475f8>