name: inverse layout: true class: center, middle, inverse --- # Data handling: pandas ## Olav Vahtras Computational Python --- layout: false ## Datahandling in Python ### The `pandas` module Setup: ``` >>> import pandas as pd >>> import numpy as np >>> import matplotlib.pyplot as plt >>> import seaborn ``` Two main data structures * Series * Data frames --- ### Series One-dimensional labeled data ``` >>> s = pd.Series([0.1, 0.2, 0.3, 0.4]) >>> print(s) 0 0.1 1 0.2 2 0.3 3 0.4 dtype: float64 ``` -- ``` >>> print(s.index) RangeIndex(start=0, stop=4, step=1) ``` -- ``` >>> print(s.values) [0.1 0.2 0.3 0.4] ``` --- * indices can be labels (like a dict with order) ``` >>> s = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd']) >>> print(s) a 0 b 1 c 2 d 3 dtype: int64 >>> print(s['d']) 3 >>> ``` -- * Initialize with dict ``` >>> s = pd.Series({'a': 1, 'b': 2, 'c': 3, 'd': 4}) >>> print(s) a 1 b 2 c 3 d 4 dtype: int64 >>> ``` -- * Indexing as a dict ``` >>> print(s['a']) 1 ``` --- * Elementwise operations ``` >>> print(s * 100) a 100 b 200 c 300 d 400 dtype: int64 >>> ``` -- * Slicing ``` >>> s['b': 'c'] b 2 c 3 dtype: int64 >>> ``` --- * List indexing ``` >>> print(s[['b', 'c']]) b 2 c 3 dtype: int64 >>> ``` -- * Bool indexing ``` >>> print(s[s>2]) c 3 d 4 dtype: int64 >>> ``` -- * Other operations ``` >>> s.mean() 2.5 >>> ``` --- * Alignment on indices ``` >>> s['a':'b'] + s['b':'c'] a NaN b 4.0 c NaN dtype: float64 >>> ``` --- ### DataFrames * Tabular data structure (like spreadsheet, sql table) * Multiple series with common index ``` >>> data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'], ... 'population': [11.3, 64.3, 81.3, 16.9, 64.9], ... 'area': [30510, 671308, 357050, 41526, 244820], ... 'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']} >>> ``` -- ``` >>> countries = pd.DataFrame(data) >>> print(countries) country population area capital 0 Belgium 11.3 30510 Brussels 1 France 64.3 671308 Paris 2 Germany 81.3 357050 Berlin 3 Netherlands 16.9 41526 Amsterdam 4 United Kingdom 64.9 244820 London >>> ``` --- * Attributes: index, columns, dtypes, values ``` >>> countries.index RangeIndex(start=0, stop=5, step=1) >>> ``` -- ``` >>> countries.columns Index(['country', 'population', 'area', 'capital'], dtype='object') >>> ``` -- ``` >>> countries.dtypes country object population float64 area int64 capital object dtype: object >>> ``` -- ``` >>> countries.values array([['Belgium', 11.3, 30510, 'Brussels'], ['France', 64.3, 671308, 'Paris'], ['Germany', 81.3, 357050, 'Berlin'], ['Netherlands', 16.9, 41526, 'Amsterdam'], ['United Kingdom', 64.9, 244820, 'London']], dtype=object) ``` --- * Info ``` >>> countries.info()
RangeIndex: 5 entries, 0 to 4 Data columns (total 4 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 country 5 non-null object 1 population 5 non-null float64 2 area 5 non-null int64 3 capital 5 non-null object dtypes: float64(1), int64(1), object(2) memory usage: 288.0+ bytes ``` --- * Set a column as index ``` >>> print(countries) country population area capital 0 Belgium 11.3 30510 Brussels 1 France 64.3 671308 Paris 2 Germany 81.3 357050 Berlin 3 Netherlands 16.9 41526 Amsterdam 4 United Kingdom 64.9 244820 London ``` -- ``` >>> countries = countries.set_index('country') >>> ``` -- ``` >>> print(countries) population area capital country Belgium 11.3 30510 Brussels France 64.3 671308 Paris Germany 81.3 357050 Berlin Netherlands 16.9 41526 Amsterdam United Kingdom 64.9 244820 London ``` --- * Access a single series in a table ``` >>> print(countries['area']) country Belgium 30510 France 671308 Germany 357050 Netherlands 41526 United Kingdom 244820 Name: area, dtype: int64 >>> ``` -- ``` >>> print(countries['capital']['France']) Paris >>> ``` -- * Arithmetic (population density) ``` >>> print(countries['population']/countries['area']*10**6) country Belgium 370.370370 France 95.783158 Germany 227.699202 Netherlands 406.973944 United Kingdom 265.092721 dtype: float64 >>> ``` --- * Add new column ``` >>> countries['density'] = countries['population']/countries['area']*10**6 >>> print(countries) population area capital density country Belgium 11.3 30510 Brussels 370.370370 France 64.3 671308 Paris 95.783158 Germany 81.3 357050 Berlin 227.699202 Netherlands 16.9 41526 Amsterdam 406.973944 United Kingdom 64.9 244820 London 265.092721 ``` -- * Filter data ``` >>> print(countries[countries['density'] > 300]) population area capital density country Belgium 11.3 30510 Brussels 370.370370 Netherlands 16.9 41526 Amsterdam 406.973944 ``` --- * Sort data ``` >>> print(countries.sort_values('density', ascending=False)) population area capital density country Netherlands 16.9 41526 Amsterdam 406.973944 Belgium 11.3 30510 Brussels 370.370370 United Kingdom 64.9 244820 London 265.092721 Germany 81.3 357050 Berlin 227.699202 France 64.3 671308 Paris 95.783158 ``` -- * Statistics ``` >>> print(countries.describe()) population area density count 5.000000 5.000000 5.000000 mean 47.740000 269042.800000 273.183879 std 31.519645 264012.827994 123.440607 min 11.300000 30510.000000 95.783158 25% 16.900000 41526.000000 227.699202 50% 64.300000 244820.000000 265.092721 75% 64.900000 357050.000000 370.370370 max 81.300000 671308.000000 406.973944 ``` --- ### Features * like numpy arrays with labels * supported import/export formats: CSV, SQL, Excel... * support for missing data * support for heterogeneous data * merging data * reshaping data * easy plotting --- ### Example A movie database in csv format `titles.csv` * Titles ```bash $ head titles.csv title,year The Rising Son,1990 Ashes of Kukulcan,2016 The Thousand Plane Raid,1969 Crucea de piatra,1993 The 86,2015 Gaiking II,2011 Medusa (IV),2015 Cultural Menace,2004 Alex in Wonderland,1970 ``` -- ``` >>> titles = pd.read_csv('titles.csv') >>> titles.head() title year 0 The Rising Son 1990 1 Ashes of Kukulcan 2016 2 The Thousand Plane Raid 1969 3 Crucea de piatra 1993 4 The 86 2015 >>> ``` --- * How many movies? -- ``` >>> len(titles) 215981 >>> ``` -- * Two earliest films? -- ``` >>> titles.sort('year').head(2) title year 165182 Miss Jerry 1894 85708 Reproduction of the Corbett and Fitzsimmons Fight 1897 >>> ``` -- * Number of movies with title *Hamlet*? -- ``` >>> len(titles[titles.title == "Hamlet"]) 19 >>> ``` --- * List all *Treasure Island* movies sorted by year -- ``` >>> titles[titles.title == "Treasure Island"].sort('year') title year 191379 Treasure Island 1918 47769 Treasure Island 1920 192917 Treasure Island 1934 90175 Treasure Island 1950 104714 Treasure Island 1972 103646 Treasure Island 1973 190792 Treasure Island 1985 166675 Treasure Island 1999 >>> ``` --- #### Acknowledgement Based on [EuroSciPy tutorial](https://github.com/jorisvandenbossche/2015-EuroScipy-pandas-tutorial.git) by Joris Van den Bossche and [PyCon tutorial](https://github.com/brandon-rhodes/pycon-pandas-tutorial) by Brandon Rhodes