import addutils.toc ; addutils.toc.js(ipy_notebook=True)
from addutils import css_notebook
css_notebook()
See the pandas documentation for more information and examples. Run the code at the end of the notebook to generate the example files.
import numpy as np
import pandas as pd
from numpy import NaN
from addutils import side_by_side2
from addutils import read_txt
from IPython.core.display import HTML
from faker import Factory
Pandas reads and formats data from many different file formats: txt, csv, web, xls, mat. In this case we use read_csv to read two text data files.
First, have a look at the file in its original form:
read_txt('temp/p01_prices.txt')
Date,AAPL,GOOG,JNJ,XOM
2015-09-21,115.209999,635.440002,93.129997,73.389999
2015-09-22,113.400002,622.690002,93.239998,72.739998
2015-09-23,114.32,622.359985,92.989998,72.300003
2015-09-24,115.0,625.799988,92.480003,72.730003
This file can be read and parsed at the same time using read_csv. Let's read the two files p01_prices.txt and p01_volumes.txt:
prices = pd.read_csv('temp/p01_prices.txt', index_col=0, parse_dates=[0])
volumes = pd.read_csv('temp/p01_volumes.txt', index_col=0, parse_dates=[0])
HTML(side_by_side2(prices, volumes))
AAPL | GOOG | JNJ | XOM | |
---|---|---|---|---|
Date | ||||
2015-09-21 | 115.209999 | 635.440002 | 93.129997 | 73.389999 |
2015-09-22 | 113.400002 | 622.690002 | 93.239998 | 72.739998 |
2015-09-23 | 114.320000 | 622.359985 | 92.989998 | 72.300003 |
2015-09-24 | 115.000000 | 625.799988 | 92.480003 | 72.730003 |
2015-09-25 | 114.709999 | 611.969971 | 91.000000 | 73.230003 |
2015-09-28 | 112.440002 | 594.890015 | 91.370003 | 72.599998 |
AAPL | JNJ | XOM | |
---|---|---|---|
Date | |||
2015-09-21 | 46554300 | 6971400 | 10585500 |
2015-09-22 | 49809000 | 10607800 | 14104200 |
2015-09-23 | 35645700 | 5597300 | 13777500 |
2015-09-24 | 49810600 | 7178400 | 14283800 |
Both "prices" and "volumes" datasets are 2D DataFrame objects:
type(prices)
pandas.core.frame.DataFrame
fakeIT = Factory.create('it_IT')
data = {'Name' : [fakeIT.name() for i in range(5)],
'Company' : [fakeIT.company() for i in range(5)],
'City' : [fakeIT.city() for i in range(5)]}
df = pd.DataFrame(data, columns = ['Name','Company','City'])
df
Name | Company | City | |
---|---|---|---|
0 | Ruth Milani | Basile-Gentile SPA | Quarto Elda |
1 | Albino Marchetti | De luca SPA | Donati calabro |
2 | Vienna Cattaneo | Ferretti, Coppola e Colombo SPA | Gioacchino sardo |
3 | Elga Vitale | Bianchi, Battaglia e Fontana e figli | San Cecco umbro |
4 | Diana Greco | Lombardi s.r.l. | Borgo Sarita |
df = pd.DataFrame.from_items([('Name', [fakeIT.name() for i in range(5)]),
('Company', [fakeIT.company() for i in range(5)])])
df
Name | Company | |
---|---|---|
0 | Dott. Fatima Carbone | Caputo s.r.l. |
1 | Sig. Alighiero Sorrentino | De rosa Group |
2 | Sig. Pablo Monti | Cattaneo SPA |
3 | Demian Palmieri | Mazza-Martinelli SPA |
4 | Sig.ra Loretta Martino | Conte-Sartori Group |
df = pd.DataFrame(np.array([[2,5],[3,6]]).T, index=list('ab'), columns=['ONE','TWO'])
df
ONE | TWO | |
---|---|---|
a | 2 | 3 |
b | 5 | 6 |
np.asarray(df)
array([[2, 3],
       [5, 6]])
In pandas there are three main data structure types ("data container" objects): Series (1D), DataFrame (2D) and Panel (3D).
For simplicity, in this course we will describe only pandas Series and DataFrames.
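A minimal sketch of the two containers used in this course, on made-up toy data:

```python
import pandas as pd

# A Series is a 1D labeled array of homogeneous data
s = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])

# A DataFrame is a 2D labeled table whose columns are Series
df = pd.DataFrame({'x': s, 'y': s * 2})

print(type(s).__name__)   # Series
print(type(df).__name__)  # DataFrame
print(df.shape)           # (3, 2)
```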
As you can see, the dates have been interpreted correctly and used as the row index. Notice that the rows and columns in the two data files are misaligned: this is not a problem in pandas because of the Automatic Data Alignment feature, an operation involving the two datasets will simply use NaN for the undefined (misaligned) values
prices*volumes
AAPL | GOOG | JNJ | XOM | |
---|---|---|---|---|
Date | ||||
2015-09-21 | 5.363521e+09 | NaN | 6.492465e+08 | 7.768698e+08 |
2015-09-22 | 5.648341e+09 | NaN | 9.890713e+08 | 1.025939e+09 |
2015-09-23 | 4.075016e+09 | NaN | 5.204929e+08 | 9.961133e+08 |
2015-09-24 | 5.728219e+09 | NaN | 6.638585e+08 | 1.038861e+09 |
2015-09-25 | NaN | NaN | NaN | NaN |
2015-09-28 | NaN | NaN | NaN | NaN |
Which can be better formatted as a "1 decimal place float number" with comma as thousands separator (see Package Options):
pd.set_option('display.float_format', lambda x: '{:,.1f}'.format(x)) # formatting
(prices*volumes).fillna('-')                                         # replacing NaN with a dash for display
AAPL | GOOG | JNJ | XOM | |
---|---|---|---|---|
Date | ||||
2015-09-21 | 5,363,520,856.4 | - | 649,246,461.1 | 776,869,834.4 |
2015-09-22 | 5,648,340,699.6 | - | 989,071,250.8 | 1,025,939,479.8 |
2015-09-23 | 4,075,016,424.0 | - | 520,492,915.8 | 996,113,291.3 |
2015-09-24 | 5,728,219,000.0 | - | 663,858,453.5 | 1,038,860,816.9 |
2015-09-25 | - | - | - | - |
2015-09-28 | - | - | - | - |
Example: calculate the volume-weighted average price
vwap = (prices*volumes).sum()/volumes.sum()
vwap.dropna()
AAPL   114.5
JNJ     93.0
XOM     72.8
dtype: float64
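The same formula can be verified by hand on a toy example (the ticker name and values here are made up):

```python
import pandas as pd

prices = pd.DataFrame({'XYZ': [10.0, 20.0]})
volumes = pd.DataFrame({'XYZ': [100, 300]})

# volume-weighted average price: sum(price * volume) / sum(volume)
vwap = (prices * volumes).sum() / volumes.sum()
print(vwap['XYZ'])  # (10*100 + 20*300) / 400 = 17.5
```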
.loc is strictly label based and raises KeyError when the items are not found. Allowed inputs are: a single label; a list or array of labels; a slice with labels (note that, contrary to usual Python slices, both the start and the stop are included); a boolean array.
HTML(side_by_side2(prices, prices.loc['2015-09-22':'2015-09-24',['AAPL', 'GOOG']]))
AAPL | GOOG | JNJ | XOM | |
---|---|---|---|---|
Date | ||||
2015-09-21 | 115.2 | 635.4 | 93.1 | 73.4 |
2015-09-22 | 113.4 | 622.7 | 93.2 | 72.7 |
2015-09-23 | 114.3 | 622.4 | 93.0 | 72.3 |
2015-09-24 | 115.0 | 625.8 | 92.5 | 72.7 |
2015-09-25 | 114.7 | 612.0 | 91.0 | 73.2 |
2015-09-28 | 112.4 | 594.9 | 91.4 | 72.6 |
AAPL | GOOG | |
---|---|---|
Date | ||
2015-09-22 | 113.4 | 622.7 |
2015-09-23 | 114.3 | 622.4 |
2015-09-24 | 115.0 | 625.8 |
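A minimal sketch of the allowed .loc inputs on a toy DataFrame (labels and values are made up):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['x', 'y', 'z'])

print(df.loc['y', 'A'])                   # single labels: 2
print(df.loc[['x', 'z'], 'B'].tolist())   # list of labels: [4, 6]
print(df.loc['x':'y', 'A'].tolist())      # label slice, stop INCLUDED: [1, 2]
print(df.loc[df['A'] > 1, 'B'].tolist())  # boolean array: [5, 6]
```

Note the label slice: unlike a plain Python slice, 'x':'y' includes the 'y' row.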
Columns can be selected without specifying the index:
HTML(side_by_side2(prices, prices[['AAPL', 'GOOG']]))
AAPL | GOOG | JNJ | XOM | |
---|---|---|---|---|
Date | ||||
2015-09-21 | 115.2 | 635.4 | 93.1 | 73.4 |
2015-09-22 | 113.4 | 622.7 | 93.2 | 72.7 |
2015-09-23 | 114.3 | 622.4 | 93.0 | 72.3 |
2015-09-24 | 115.0 | 625.8 | 92.5 | 72.7 |
2015-09-25 | 114.7 | 612.0 | 91.0 | 73.2 |
2015-09-28 | 112.4 | 594.9 | 91.4 | 72.6 |
AAPL | GOOG | |
---|---|---|
Date | ||
2015-09-21 | 115.2 | 635.4 |
2015-09-22 | 113.4 | 622.7 |
2015-09-23 | 114.3 | 622.4 |
2015-09-24 | 115.0 | 625.8 |
2015-09-25 | 114.7 | 612.0 |
2015-09-28 | 112.4 | 594.9 |
.iloc is strictly position based and raises IndexError when a requested indexer is out-of-bounds (except slice indexers, which are clipped as in NumPy):
HTML(side_by_side2(prices, prices.iloc[1:5,[0, 1]]))
AAPL | GOOG | JNJ | XOM | |
---|---|---|---|---|
Date | ||||
2015-09-21 | 115.2 | 635.4 | 93.1 | 73.4 |
2015-09-22 | 113.4 | 622.7 | 93.2 | 72.7 |
2015-09-23 | 114.3 | 622.4 | 93.0 | 72.3 |
2015-09-24 | 115.0 | 625.8 | 92.5 | 72.7 |
2015-09-25 | 114.7 | 612.0 | 91.0 | 73.2 |
2015-09-28 | 112.4 | 594.9 | 91.4 | 72.6 |
AAPL | GOOG | |
---|---|---|
Date | ||
2015-09-22 | 113.4 | 622.7 |
2015-09-23 | 114.3 | 622.4 |
2015-09-24 | 115.0 | 625.8 |
2015-09-25 | 114.7 | 612.0 |
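The out-of-bounds behaviour can be checked on a toy frame (data here is made up):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3]})

print(df.iloc[0, 0])        # first row, first column: 1
print(df.iloc[0:10].shape)  # slices are clipped, like in NumPy: (3, 1)

try:
    df.iloc[10, 0]          # a scalar indexer out of bounds raises
except IndexError:
    print('IndexError')
```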
3.3 Mixed Indexing
.ix will always try first to resolve the label-based index (like .loc), then fall back on positional indexing (like .iloc). Note that .ix is deprecated in recent versions of pandas in favor of the stricter .loc and .iloc.
Rows can be indexed using the ix method. Try by yourself:
prices.ix[1:4,0:2] # Position-based indexing
prices.ix[:'2015-09-23'] # Label-based indexing on index (rows)
prices.ix[:,[2,2,1]] # Duplicated values on columns
prices.ix[::2] # One value every two rows
prices.ix[::-1] # Reverse rows
prices.ix[prices['AAPL'] > 114] # Boolean indexing on index
prices.ix[:,[len(c)<4 for c in prices.columns]] # Boolean indexing on columns
prices.ix[:,[len(c)<4 for c in prices.columns]]
JNJ | XOM | |
---|---|---|
Date | ||
2015-09-21 | 93.1 | 73.4 |
2015-09-22 | 93.2 | 72.7 |
2015-09-23 | 93.0 | 72.3 |
2015-09-24 | 92.5 | 72.7 |
2015-09-25 | 91.0 | 73.2 |
2015-09-28 | 91.4 | 72.6 |
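Since .ix is deprecated, the same selections can be written with .loc and .iloc in recent pandas. A sketch on a synthetic prices frame (values here are made up, only the shape mirrors the example above):

```python
import numpy as np
import pandas as pd

prices = pd.DataFrame(np.arange(12.0).reshape(3, 4),
                      index=pd.date_range('2015-09-21', periods=3),
                      columns=['AAPL', 'GOOG', 'JNJ', 'XOM'])

pos = prices.iloc[1:3, 0:2]      # position-based indexing
lab = prices.loc[:'2015-09-22']  # label-based indexing on the rows
boo = prices.loc[:, [len(c) < 4 for c in prices.columns]]  # boolean on columns

print(pos.shape, lab.shape, list(boo.columns))  # (2, 2) (2, 4) ['JNJ', 'XOM']
```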
data = np.array([[2,5,8,11],[3,6,9,12]])
d1 = pd.DataFrame(data.T, index=list('abce'), columns=['K','W'])
HTML(side_by_side2(d1, pd.DataFrame(d1, index=list('baez'), columns=['W','K','T'])) )
K | W | |
---|---|---|
a | 2 | 3 |
b | 5 | 6 |
c | 8 | 9 |
e | 11 | 12 |
W | K | T | |
---|---|---|---|
b | 6.0 | 5.0 | nan |
a | 3.0 | 2.0 | nan |
e | 12.0 | 11.0 | nan |
z | nan | nan | nan |
d1['Z'] = d1['W']-d1['K']
d1['B'] = d1['W']>4
d1
K | W | Z | B | |
---|---|---|---|---|
a | 2 | 3 | 1 | False |
b | 5 | 6 | 1 | True |
c | 8 | 9 | 1 | True |
e | 11 | 12 | 1 | True |
d1['SUM'] = d1.sum(axis=1)
HTML(side_by_side2(d1, d1.drop(['b', 'c'], axis=0), d1.drop(['Z', 'B'], axis=1)))
K | W | Z | B | SUM | |
---|---|---|---|---|---|
a | 2 | 3 | 1 | False | 6.0 |
b | 5 | 6 | 1 | True | 13.0 |
c | 8 | 9 | 1 | True | 19.0 |
e | 11 | 12 | 1 | True | 25.0 |
K | W | Z | B | SUM | |
---|---|---|---|---|---|
a | 2 | 3 | 1 | False | 6.0 |
e | 11 | 12 | 1 | True | 25.0 |
K | W | SUM | |
---|---|---|---|
a | 2 | 3 | 6.0 |
b | 5 | 6 | 13.0 |
c | 8 | 9 | 19.0 |
e | 11 | 12 | 25.0 |
d2 = d1.copy() # .copy() method is needed to create a new object.
d2.insert(1, 'Exp(W)', np.exp(d1['W']))
HTML(side_by_side2(d1, d2, space=10))
K | W | Z | B | SUM | |
---|---|---|---|---|---|
a | 2 | 3 | 1 | False | 6.0 |
b | 5 | 6 | 1 | True | 13.0 |
c | 8 | 9 | 1 | True | 19.0 |
e | 11 | 12 | 1 | True | 25.0 |
K | Exp(W) | W | Z | B | SUM | |
---|---|---|---|---|---|---|
a | 2 | 20.1 | 3 | 1 | False | 6.0 |
b | 5 | 403.4 | 6 | 1 | True | 13.0 |
c | 8 | 8,103.1 | 9 | 1 | True | 19.0 |
e | 11 | 162,754.8 | 12 | 1 | True | 25.0 |
Example: index rows to create a new column with missing values, then use forward fill padding to fill the gaps:
d1['part'] = d1['K'].ix[:2]
d1['part'] = d1['part'].fillna(method='ffill')
d1
K | W | Z | B | SUM | part | |
---|---|---|---|---|---|---|
a | 2 | 3 | 1 | False | 6.0 | 2.0 |
b | 5 | 6 | 1 | True | 13.0 | 5.0 |
c | 8 | 9 | 1 | True | 19.0 | 5.0 |
e | 11 | 12 | 1 | True | 25.0 | 5.0 |
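The fillna(method='ffill') call can also be written with the ffill method; a self-contained sketch on a toy Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([2.0, 5.0, np.nan, np.nan], index=list('abce'))

# forward fill propagates the last valid observation into the gaps
print(s.ffill().tolist())  # [2.0, 5.0, 5.0, 5.0]
```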
# TODO: Upgrade side_by_side2 to include series
HTML(side_by_side2(d1, d1['K'].isin([3, 8])))
K | W | Z | B | SUM | part | |
---|---|---|---|---|---|---|
a | 2 | 3 | 1 | False | 6.0 | 2.0 |
b | 5 | 6 | 1 | True | 13.0 | 5.0 |
c | 8 | 9 | 1 | True | 19.0 | 5.0 |
e | 11 | 12 | 1 | True | 25.0 | 5.0 |
K | |
---|---|
a | False |
b | False |
c | True |
e | False |
HTML(side_by_side2(d1, d1.rename(columns={'K':'ONE','W':'TWO','Z':'THREE'})))
K | W | Z | B | SUM | part | |
---|---|---|---|---|---|---|
a | 2 | 3 | 1 | False | 6.0 | 2.0 |
b | 5 | 6 | 1 | True | 13.0 | 5.0 |
c | 8 | 9 | 1 | True | 19.0 | 5.0 |
e | 11 | 12 | 1 | True | 25.0 | 5.0 |
ONE | TWO | THREE | B | SUM | part | |
---|---|---|---|---|---|---|
a | 2 | 3 | 1 | False | 6.0 | 2.0 |
b | 5 | 6 | 1 | True | 13.0 | 5.0 |
c | 8 | 9 | 1 | True | 19.0 | 5.0 |
e | 11 | 12 | 1 | True | 25.0 | 5.0 |
iterrows
returns an iterator yielding each index value along with a Series containing the data in each row:
for row_index, row in d1.iterrows():
print(row_index, '**', ' - '.join([str(item) for item in row]))
a ** 2 - 3 - 1 - False - 6.0 - 2.0
b ** 5 - 6 - 1 - True - 13.0 - 5.0
c ** 8 - 9 - 1 - True - 19.0 - 5.0
e ** 11 - 12 - 1 - True - 25.0 - 5.0
itertuples
returns an iterator yielding a tuple for each row in the DataFrame. The first element of the tuple is the row’s corresponding index value, while the remaining elements are the row values:
for t in d1.itertuples():
print(t)
('a', 2, 3, 1, False, 6.0, 2.0)
('b', 5, 6, 1, True, 13.0, 5.0)
('c', 8, 9, 1, True, 19.0, 5.0)
('e', 11, 12, 1, True, 25.0, 5.0)
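In recent pandas versions itertuples yields namedtuples, so the fields can also be accessed by name (the first field is called Index); a small sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({'K': [2, 5], 'W': [3, 6]}, index=['a', 'b'])

for t in df.itertuples():
    # t.Index is the row label, then one field per column
    print(t.Index, t.K, t.W)
```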
d3 = pd.read_csv('temp/p01_d2.csv', index_col=0)
d3['a dup'] = d3.duplicated(['a'])
d3['a+b dup'] = d3.duplicated(['a', 'b'])
d3['a+b dup - take last'] = d3.duplicated(['a', 'b'], keep='last')
d3
a | b | c | a dup | a+b dup | a+b dup - take last | |
---|---|---|---|---|---|---|
0 | one | x | -0.3 | False | False | False |
1 | one | y | -0.1 | True | False | False |
2 | two | y | 1.2 | False | False | True |
3 | three | x | -0.5 | False | False | False |
4 | two | y | 1.3 | True | True | False |
d3.drop_duplicates(['a', 'b'],keep='last')
a | b | c | a dup | a+b dup | a+b dup - take last | |
---|---|---|---|---|---|---|
0 | one | x | -0.3 | False | False | False |
1 | one | y | -0.1 | True | False | False |
3 | three | x | -0.5 | False | False | False |
4 | two | y | 1.3 | True | True | False |
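The effect of the keep argument can be seen on a minimal frame (data here is made up):

```python
import pandas as pd

df = pd.DataFrame({'a': ['one', 'one', 'two'], 'b': ['x', 'y', 'y']})

# keep='first' (the default) marks later repeats as duplicates;
# keep='last' marks the earlier occurrences instead
print(df.duplicated(['a']).tolist())                          # [False, True, False]
print(df.duplicated(['a'], keep='last').tolist())             # [True, False, False]
print(df.drop_duplicates(['a'], keep='last').index.tolist())  # [1, 2]
```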
Let's start by generating a DataFrame from a NumPy array. We'll see that there is no memory overhead on the DataFrame values:
rows, cols = 100, 100
np_array = np.array(np.random.randn(rows, cols), dtype=np.float64)
d4 = pd.DataFrame(np_array)
print ('Rows x Cols x 8: ', rows*cols*8)
print ('np Array Memory Occupation: ', np_array.nbytes)
print ('Dataframe Values Memory Occupation: ', d4.values.nbytes)
print ('Dataframe Index Memory Occupation: ', d4.index.nbytes)
print ('Dataframe Columns Memory Occupation: ', d4.columns.nbytes)
Rows x Cols x 8:  80000
np Array Memory Occupation:  80000
Dataframe Values Memory Occupation:  80000
Dataframe Index Memory Occupation:  800
Dataframe Columns Memory Occupation:  800
To reduce the memory occupation it is possible to change the values' dtype:
d4 = d4.astype(dtype=np.float16)
print ('Dataframe Values Memory Occupation: ', d4.values.nbytes)
Dataframe Values Memory Occupation: 20000
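The same saving can be reproduced in a self-contained sketch: float64 uses 8 bytes per value, float16 only 2.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100, 100))  # float64 by default
print(df.values.nbytes)                       # 100 * 100 * 8 = 80000

small = df.astype(np.float16)                 # 2 bytes per value
print(small.values.nbytes)                    # 100 * 100 * 2 = 20000
```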
If the data is sparse, the DataFrame can be sparsified as well to save further resources with the to_sparse() method:
d4.ix[2:,4:] = np.nan
print ('Dataframe Values Memory Occupation: ', d4.values.nbytes)
d4 = d4.to_sparse()
print ('Dataframe Values Memory Occupation: ', d4.values.nbytes)
Dataframe Values Memory Occupation:  20000
Dataframe Values Memory Occupation:  80000
In this case the row and column indexes are np.int64 arrays:
d4.columns
Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99], dtype='int64')
Working with large arrays: in Excel it is difficult to explore arrays with thousands of rows and columns. Explore the pandas capabilities with the following code, which loads the whole compressed file and prints the column names; the commands below then let you inspect slices of it. Try to do the same in Excel for comparison.
d3 = pd.read_csv('example_data/p01_d3.csv.gz', compression='gzip')
for col in d3.columns:
print (col, end=' - ')
Istat - Comune - Provincia - Regione - Prefisso - CAP - CodFisco - Abitanti - Link -
Try by yourself:
d3.head()
d3[d3.columns[:3]].head()
d3[d3.columns[-4:-1]].tail()
d3.ix[1000:1010, :7]
d3.ix[:, 'Abitanti'].describe()
d3[d3.columns[-4:-1]].tail()
CAP | CodFisco | Abitanti | |
---|---|---|---|
8087 | 33020 | M200 | 607 |
8088 | 13848 | M201 | 1152 |
8089 | 87040 | M202 | 2413 |
8090 | 83030 | M203 | 1232 |
8091 | 89867 | M204 | 2055 |
It is possible to add multiple new columns to a DataFrame.
data = np.array([[3, 5, 7, 10, 13, 16, 56, 72],
[8, 16, 28, 37, 45, 57, 69, 90],
[3, 6, NaN, NaN, 15, 18, NaN, NaN],
[1, 2, 4, 7, 11, 16, 65, 88],
[NaN, NaN, NaN, NaN, 16, 19, 82, 91]])
d4 = pd.DataFrame(data.T, columns=['one', 'two', 'three', 'four', 'five'])
d4
one | two | three | four | five | |
---|---|---|---|---|---|
0 | 3.0 | 8.0 | 3.0 | 1.0 | nan |
1 | 5.0 | 16.0 | 6.0 | 2.0 | nan |
2 | 7.0 | 28.0 | nan | 4.0 | nan |
3 | 10.0 | 37.0 | nan | 7.0 | nan |
4 | 13.0 | 45.0 | 15.0 | 11.0 | 16.0 |
5 | 16.0 | 57.0 | 18.0 | 16.0 | 19.0 |
6 | 56.0 | 69.0 | nan | 65.0 | 82.0 |
7 | 72.0 | 90.0 | nan | 88.0 | 91.0 |
d4[['one ret','two ret']] = d4[['one','two']].pct_change()+1
d4
one | two | three | four | five | one ret | two ret | |
---|---|---|---|---|---|---|---|
0 | 3.0 | 8.0 | 3.0 | 1.0 | nan | nan | nan |
1 | 5.0 | 16.0 | 6.0 | 2.0 | nan | 1.7 | 2.0 |
2 | 7.0 | 28.0 | nan | 4.0 | nan | 1.4 | 1.8 |
3 | 10.0 | 37.0 | nan | 7.0 | nan | 1.4 | 1.3 |
4 | 13.0 | 45.0 | 15.0 | 11.0 | 16.0 | 1.3 | 1.2 |
5 | 16.0 | 57.0 | 18.0 | 16.0 | 19.0 | 1.2 | 1.3 |
6 | 56.0 | 69.0 | nan | 65.0 | 82.0 | 3.5 | 1.2 |
7 | 72.0 | 90.0 | nan | 88.0 | 91.0 | 1.3 | 1.3 |
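pct_change computes (x[t] - x[t-1]) / x[t-1]; adding 1 gives the gross return shown above, which can be verified by hand on a toy Series:

```python
import pandas as pd

s = pd.Series([3.0, 5.0, 7.0])
ret = s.pct_change() + 1

print(round(ret.iloc[1], 4))  # 5/3 -> 1.6667
print(round(ret.iloc[2], 4))  # 7/5 -> 1.4
```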
d4['four var'] = np.log(d4['four'] - d4['four'].shift())
d4
one | two | three | four | five | one ret | two ret | four var | |
---|---|---|---|---|---|---|---|---|
0 | 3.0 | 8.0 | 3.0 | 1.0 | nan | nan | nan | nan |
1 | 5.0 | 16.0 | 6.0 | 2.0 | nan | 1.7 | 2.0 | 0.0 |
2 | 7.0 | 28.0 | nan | 4.0 | nan | 1.4 | 1.8 | 0.7 |
3 | 10.0 | 37.0 | nan | 7.0 | nan | 1.4 | 1.3 | 1.1 |
4 | 13.0 | 45.0 | 15.0 | 11.0 | 16.0 | 1.3 | 1.2 | 1.4 |
5 | 16.0 | 57.0 | 18.0 | 16.0 | 19.0 | 1.2 | 1.3 | 1.6 |
6 | 56.0 | 69.0 | nan | 65.0 | 82.0 | 3.5 | 1.2 | 3.9 |
7 | 72.0 | 90.0 | nan | 88.0 | 91.0 | 1.3 | 1.3 | 3.1 |
The DataFrame.reindex method conforms a DataFrame to a new index, filling the cells that have no value with NaN. It is possible to use this method to rearrange columns.
d5 = d4.reindex(columns=['one','one ret','two','two ret','four','four var'])
d5
one | one ret | two | two ret | four | four var | |
---|---|---|---|---|---|---|
0 | 3.0 | nan | 8.0 | nan | 1.0 | nan |
1 | 5.0 | 1.7 | 16.0 | 2.0 | 2.0 | 0.0 |
2 | 7.0 | 1.4 | 28.0 | 1.8 | 4.0 | 0.7 |
3 | 10.0 | 1.4 | 37.0 | 1.3 | 7.0 | 1.1 |
4 | 13.0 | 1.3 | 45.0 | 1.2 | 11.0 | 1.4 |
5 | 16.0 | 1.2 | 57.0 | 1.3 | 16.0 | 1.6 |
6 | 56.0 | 3.5 | 69.0 | 1.2 | 65.0 | 3.9 |
7 | 72.0 | 1.3 | 90.0 | 1.3 | 88.0 | 3.1 |
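reindex also works on rows and accepts a fill_value for the missing cells; a minimal sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({'one': [1.0, 2.0]}, index=['a', 'b'])

# reindex aligns the data to the new labels: missing cells get NaN
# by default, or fill_value when provided
r = df.reindex(index=['b', 'a', 'z'], fill_value=0.0)
print(r['one'].tolist())  # [2.0, 1.0, 0.0]
```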
Notice that DataFrame.reindex returns a new object, hence d4 isn't changed.
d4
one | two | three | four | five | one ret | two ret | four var | |
---|---|---|---|---|---|---|---|---|
0 | 3.0 | 8.0 | 3.0 | 1.0 | nan | nan | nan | nan |
1 | 5.0 | 16.0 | 6.0 | 2.0 | nan | 1.7 | 2.0 | 0.0 |
2 | 7.0 | 28.0 | nan | 4.0 | nan | 1.4 | 1.8 | 0.7 |
3 | 10.0 | 37.0 | nan | 7.0 | nan | 1.4 | 1.3 | 1.1 |
4 | 13.0 | 45.0 | 15.0 | 11.0 | 16.0 | 1.3 | 1.2 | 1.4 |
5 | 16.0 | 57.0 | 18.0 | 16.0 | 19.0 | 1.2 | 1.3 | 1.6 |
6 | 56.0 | 69.0 | nan | 65.0 | 82.0 | 3.5 | 1.2 | 3.9 |
7 | 72.0 | 90.0 | nan | 88.0 | 91.0 | 1.3 | 1.3 | 3.1 |
d6 = pd.read_csv('temp/p01_d4.csv', index_col=['Country',
'Number',
'Dir'])
d6 = d6.sortlevel()
d6
0 | 1 | 2 | |||
---|---|---|---|---|---|
Country | Number | Dir | |||
Fra | one | x | -0.1 | 0.3 | -0.2 |
two | y | 0.3 | -0.9 | 0.3 | |
z | 1.8 | 1.0 | -1.4 | ||
Ger | one | x | -0.7 | -0.5 | 0.4 |
Jap | one | x | 1.2 | 1.2 | -0.2 |
two | x | -0.4 | 0.5 | -0.6 | |
USA | one | y | -1.9 | -0.9 | 0.5 |
z | -0.1 | -1.0 | 0.7 |
Try by yourself:
d6.ix['Fra']
d6.ix['Fra', 'two']
d6.ix['Fra':'Ger']
d6.reorder_levels([2,1,0], axis=0).sortlevel(0)
d6.reset_index(level=1)
d6.ix['Fra']
0 | 1 | 2 | ||
---|---|---|---|---|
Number | Dir | |||
one | x | -0.1 | 0.3 | -0.2 |
two | y | 0.3 | -0.9 | 0.3 |
z | 1.8 | 1.0 | -1.4 |
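With recent pandas the same hierarchical selections are written with .loc; a sketch on a small MultiIndex frame (labels and values here are made up):

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [('Fra', 'one'), ('Fra', 'two'), ('Ger', 'one')],
    names=['Country', 'Number'])
df = pd.DataFrame({'v': [1, 2, 3]}, index=idx)

print(df.loc['Fra', 'v'].tolist())  # select the outer level: [1, 2]
print(df.loc[('Fra', 'two'), 'v'])  # select with the full key: 2
```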
The way DataFrames are displayed can be customized in many ways (see the documentation):
print ('Display Max Rows: \t', pd.get_option('display.max_rows'))
pd.describe_option('display.max_rows')
Display Max Rows: 	 60
display.max_rows : int
    If max_rows is exceeded, switch to truncate view. Depending on `large_repr`, objects are either centrally truncated or printed as a summary view. 'None' value means unlimited. In case python/IPython is running in a terminal and `large_repr` equals 'truncate' this can be set to 0 and pandas will auto-detect the height of the terminal and print a truncated object which fits the screen height. The IPython notebook, IPython qtconsole, or IDLE do not run in a terminal and hence it is not possible to do correct auto-detection.
    [default: 60] [currently: 60]
pd.set_option('display.max_rows', 10)
pd.get_option('display.max_rows')
10
pd.reset_option('display.max_rows')
pd.get_option('display.max_rows')
60
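When an option should apply only to a single cell, pd.option_context sets it temporarily and restores the previous value on exit:

```python
import pandas as pd

before = pd.get_option('display.max_rows')

# the option is changed only inside the with block
with pd.option_context('display.max_rows', 5):
    print(pd.get_option('display.max_rows'))        # 5

print(pd.get_option('display.max_rows') == before)  # True
```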
Try by yourself:
pd.describe_option('display.chop_threshold')
pd.describe_option('display.colheader_justify')
pd.describe_option('display.column_space')
pd.describe_option('display.date_dayfirst')
pd.describe_option('display.date_yearfirst')
pd.describe_option('display.encoding')
pd.describe_option('display.expand_frame_repr')
pd.describe_option('display.float_format')
pd.describe_option('display.max_columns')
pd.describe_option('display.max_colwidth')
pd.describe_option('display.max_rows')
pd.describe_option('display.notebook_repr_html')
pd.describe_option('display.precision')
# for more options see documentation...
pd.set_option('display.precision', 2)
pd.set_option('display.notebook_repr_html', False)
prices
             AAPL   GOOG   JNJ   XOM
Date
2015-09-21  115.2  635.4  93.1  73.4
2015-09-22  113.4  622.7  93.2  72.7
2015-09-23  114.3  622.4  93.0  72.3
2015-09-24  115.0  625.8  92.5  72.7
2015-09-25  114.7  612.0  91.0  73.2
2015-09-28  112.4  594.9  91.4  72.6
pd.set_option('display.notebook_repr_html', True)
prices
AAPL | GOOG | JNJ | XOM | |
---|---|---|---|---|
Date | ||||
2015-09-21 | 115.2 | 635.4 | 93.1 | 73.4 |
2015-09-22 | 113.4 | 622.7 | 93.2 | 72.7 |
2015-09-23 | 114.3 | 622.4 | 93.0 | 72.3 |
2015-09-24 | 115.0 | 625.8 | 92.5 | 72.7 |
2015-09-25 | 114.7 | 612.0 | 91.0 | 73.2 |
2015-09-28 | 112.4 | 594.9 | 91.4 | 72.6 |
Visit www.add-for.com for more tutorials and updates.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.