pandas¶

Pandas objects can be thought of as enhanced versions of NumPy structured arrays in which the rows and columns are identified with labels rather than simple integer indices.

Tree fundamental Pandas data structures:

the Series, 
DataFrame, 
and Index.

In [260]:

import pandas as pd
pd.__version__

Out[260]:

'1.3.5'

In [261]:

print(help(pd))

Help on package pandas:

NAME
    pandas

DESCRIPTION
    pandas - a powerful data analysis and manipulation library for Python
    =====================================================================
    
    **pandas** is a Python package providing fast, flexible, and expressive data
    structures designed to make working with "relational" or "labeled" data both
    easy and intuitive. It aims to be the fundamental high-level building block for
    doing practical, **real world** data analysis in Python. Additionally, it has
    the broader goal of becoming **the most powerful and flexible open source data
    analysis / manipulation tool available in any language**. It is already well on
    its way toward this goal.
    
    Main Features
    -------------
    Here are just a few of the things that pandas does well:
    
      - Easy handling of missing data in floating point as well as non-floating
        point data.
      - Size mutability: columns can be inserted and deleted from DataFrame and
        higher dimensional objects
      - Automatic and explicit data alignment: objects can be explicitly aligned
        to a set of labels, or the user can simply ignore the labels and let
        `Series`, `DataFrame`, etc. automatically align the data for you in
        computations.
      - Powerful, flexible group by functionality to perform split-apply-combine
        operations on data sets, for both aggregating and transforming data.
      - Make it easy to convert ragged, differently-indexed data in other Python
        and NumPy data structures into DataFrame objects.
      - Intelligent label-based slicing, fancy indexing, and subsetting of large
        data sets.
      - Intuitive merging and joining data sets.
      - Flexible reshaping and pivoting of data sets.
      - Hierarchical labeling of axes (possible to have multiple labels per tick).
      - Robust IO tools for loading data from flat files (CSV and delimited),
        Excel files, databases, and saving/loading data from the ultrafast HDF5
        format.
      - Time series-specific functionality: date range generation and frequency
        conversion, moving window statistics, date shifting and lagging.

PACKAGE CONTENTS
    _config (package)
    _libs (package)
    _testing (package)
    _typing
    _version
    api (package)
    arrays (package)
    compat (package)
    conftest
    core (package)
    errors (package)
    io (package)
    plotting (package)
    testing
    tests (package)
    tseries (package)
    util (package)

SUBMODULES
    _hashtable
    _lib
    _tslib
    offsets

FUNCTIONS
    __getattr__(name)
        # GH 27101

DATA
    IndexSlice = <pandas.core.indexing._IndexSlice object>
    NA = <NA>
    NaT = NaT
    __docformat__ = 'restructuredtext'
    __git_version__ = '66e3805b8cabe977f40c05259cc3fcf7ead5687d'
    describe_option = <pandas._config.config.CallableDynamicDoc object>
    get_option = <pandas._config.config.CallableDynamicDoc object>
    options = <pandas._config.config.DictWrapper object>
    reset_option = <pandas._config.config.CallableDynamicDoc object>
    set_option = <pandas._config.config.CallableDynamicDoc object>

VERSION
    1.3.5

FILE
    /usr/local/lib/python3.7/dist-packages/pandas/__init__.py


None

In [262]:

dir(pd)

Out[262]:

['BooleanDtype',
 'Categorical',
 'CategoricalDtype',
 'CategoricalIndex',
 'DataFrame',
 'DateOffset',
 'DatetimeIndex',
 'DatetimeTZDtype',
 'ExcelFile',
 'ExcelWriter',
 'Flags',
 'Float32Dtype',
 'Float64Dtype',
 'Float64Index',
 'Grouper',
 'HDFStore',
 'Index',
 'IndexSlice',
 'Int16Dtype',
 'Int32Dtype',
 'Int64Dtype',
 'Int64Index',
 'Int8Dtype',
 'Interval',
 'IntervalDtype',
 'IntervalIndex',
 'MultiIndex',
 'NA',
 'NaT',
 'NamedAgg',
 'Period',
 'PeriodDtype',
 'PeriodIndex',
 'RangeIndex',
 'Series',
 'SparseDtype',
 'StringDtype',
 'Timedelta',
 'TimedeltaIndex',
 'Timestamp',
 'UInt16Dtype',
 'UInt32Dtype',
 'UInt64Dtype',
 'UInt64Index',
 'UInt8Dtype',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__docformat__',
 '__file__',
 '__getattr__',
 '__git_version__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '__version__',
 '_config',
 '_hashtable',
 '_is_numpy_dev',
 '_lib',
 '_libs',
 '_np_version_under1p18',
 '_testing',
 '_tslib',
 '_typing',
 '_version',
 'api',
 'array',
 'arrays',
 'bdate_range',
 'compat',
 'concat',
 'core',
 'crosstab',
 'cut',
 'date_range',
 'describe_option',
 'errors',
 'eval',
 'factorize',
 'get_dummies',
 'get_option',
 'infer_freq',
 'interval_range',
 'io',
 'isna',
 'isnull',
 'json_normalize',
 'lreshape',
 'melt',
 'merge',
 'merge_asof',
 'merge_ordered',
 'notna',
 'notnull',
 'offsets',
 'option_context',
 'options',
 'pandas',
 'period_range',
 'pivot',
 'pivot_table',
 'plotting',
 'qcut',
 'read_clipboard',
 'read_csv',
 'read_excel',
 'read_feather',
 'read_fwf',
 'read_gbq',
 'read_hdf',
 'read_html',
 'read_json',
 'read_orc',
 'read_parquet',
 'read_pickle',
 'read_sas',
 'read_spss',
 'read_sql',
 'read_sql_query',
 'read_sql_table',
 'read_stata',
 'read_table',
 'read_xml',
 'reset_option',
 'set_eng_float_format',
 'set_option',
 'show_versions',
 'test',
 'testing',
 'timedelta_range',
 'to_datetime',
 'to_numeric',
 'to_pickle',
 'to_timedelta',
 'tseries',
 'unique',
 'util',
 'value_counts',
 'wide_to_long']

Creating a pandas Series¶

A Pandas Series is a one-dimensional array of indexed data where data can be many different things:

    a Python dict
    an ndarray
    a scalar value (like 5)

In [263]:

import pandas as pd;   import numpy as np

In [264]:

pd.Series([1, 2, 3, 4, 5])

Out[264]:

0    1
1    2
2    3
3    4
4    5
dtype: int64

In [265]:

pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])

Out[265]:

a   -0.203503
b    1.013045
c    2.181949
d   -1.122588
e    1.189388
dtype: float64

In [266]:

pd.Series(np.array([1, 2, 3, 4, 5]))

Out[266]:

0    1
1    2
2    3
3    4
4    5
dtype: int64

In [267]:

pd.Series(np.array([1, 2, 3, 4, 5])) + pd.Series(np.array([5, 4, 3, 2, 1]))

Out[267]:

0    6
1    6
2    6
3    6
4    6
dtype: int64

In [268]:

d = {"b": 1, "a": 0, "c": 2}; pd.Series(d)

Out[268]:

b    1
a    0
c    2
dtype: int64

The Series object is basically interchangeable with a one-dimensional NumPy array. The essential difference is the presence of the index: while the Numpy Array has an implicitly defined integer index used to access the values, the Pandas Series has an explicitly defined index associated with the values.

Creating a pandas DataFrame¶

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. Like Series, DataFrame accepts many different kinds of input:

Dict of 1D ndarrays, lists, dicts, or Series

2-D numpy.ndarray

Structured or record ndarray

A Series

Another DataFrame

In [269]:

df = pd.DataFrame(np.random.randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split())
df

Out[269]:

	W	X	Y	Z
A	-1.055250	-0.255963	0.523574	-0.216817
B	-1.587131	-0.128045	-0.767518	-0.055505
C	-0.589701	-1.098669	1.061195	-0.407758
D	-0.692038	-0.886125	0.937507	0.980382
E	-1.218705	-0.135298	1.196934	1.009017

In [270]:

import pandas as pd
d = {"one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
    "two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"])
    }
df = pd.DataFrame(d); print(type(df));  print('\n'); df.info();   print('\n');  df

<class 'pandas.core.frame.DataFrame'>


<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, a to d
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   one     3 non-null      float64
 1   two     4 non-null      float64
dtypes: float64(2)
memory usage: 96.0+ bytes

Out[270]:

	one	two
a	1.0	1.0
b	2.0	2.0
c	3.0	3.0
d	NaN	4.0

In [271]:

import pandas as pd
import numpy as np

df=pd.DataFrame([[1, 2, 3, 4],
                [5, 6, 7, 8],
                [9, 20, 22, 34]],
                index=[0,1,2],
                columns=['A', 'B', 'C', 'D'])
df

Out[271]:

	A	B	C	D
0	1	2	3	4
1	5	6	7	8
2	9	20	22	34

Yahoo Finance Data¶

creating dataframe¶

In [272]:

# source: Yahoo Finance
stock = pd.DataFrame([['tech', 'AAPL', 157.25, 1.23, 26.11, 2539, 0.58, 184.01],
                     ['tech', 'MSFT', 261.47, 0.93, 26.98, 1942, 0.95, 333.00],
                     ['engr', 'XOM', 93.87, 1.08, 10.27, 391, 3.68, 103.28],
                     ['engr', 'COP', 109.45, 1.38, 8.87, 137, 1,75],
                     ['fin', 'BAC',33.47, 1.40, 10.46, 269, 2.62, 42.41],
                     ['fin', 'JPM',114.51, 1.10, 9.18, 336, 3.52, 138.80],
                     ['fin', 'WMT',335.50, 1.39, 7.53, 114, 3.00, 394.07]])
stock

Out[272]:

	0	1	2	3	4	5	6	7
0	tech	AAPL	157.25	1.23	26.11	2539	0.58	184.01
1	tech	MSFT	261.47	0.93	26.98	1942	0.95	333.00
2	engr	XOM	93.87	1.08	10.27	391	3.68	103.28
3	engr	COP	109.45	1.38	8.87	137	1.00	75.00
4	fin	BAC	33.47	1.40	10.46	269	2.62	42.41
5	fin	JPM	114.51	1.10	9.18	336	3.52	138.80
6	fin	WMT	335.50	1.39	7.53	114	3.00	394.07

assign column names and index values¶

In [273]:

stock.columns=['sector', 'ticker', 'price', 'beta', 'pe', 'mktcap', 'dy', 'target']
stock.index = ["a", "b", "c", "d", "e", "f", "g"]
stock

Out[273]:

	sector	ticker	price	beta	pe	mktcap	dy	target
a	tech	AAPL	157.25	1.23	26.11	2539	0.58	184.01
b	tech	MSFT	261.47	0.93	26.98	1942	0.95	333.00
c	engr	XOM	93.87	1.08	10.27	391	3.68	103.28
d	engr	COP	109.45	1.38	8.87	137	1.00	75.00
e	fin	BAC	33.47	1.40	10.46	269	2.62	42.41
f	fin	JPM	114.51	1.10	9.18	336	3.52	138.80
g	fin	WMT	335.50	1.39	7.53	114	3.00	394.07

chekcking data contents¶

In [274]:

stock.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7 entries, a to g
Data columns (total 8 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   sector  7 non-null      object 
 1   ticker  7 non-null      object 
 2   price   7 non-null      float64
 3   beta    7 non-null      float64
 4   pe      7 non-null      float64
 5   mktcap  7 non-null      int64  
 6   dy      7 non-null      float64
 7   target  7 non-null      float64
dtypes: float64(5), int64(1), object(2)
memory usage: 504.0+ bytes

In [275]:

stock.index = [0, 1, 2, 3, 4, 5, 6]
stock

Out[275]:

	sector	ticker	price	beta	pe	mktcap	dy	target
0	tech	AAPL	157.25	1.23	26.11	2539	0.58	184.01
1	tech	MSFT	261.47	0.93	26.98	1942	0.95	333.00
2	engr	XOM	93.87	1.08	10.27	391	3.68	103.28
3	engr	COP	109.45	1.38	8.87	137	1.00	75.00
4	fin	BAC	33.47	1.40	10.46	269	2.62	42.41
5	fin	JPM	114.51	1.10	9.18	336	3.52	138.80
6	fin	WMT	335.50	1.39	7.53	114	3.00	394.07

dropping column(s)¶

In [276]:

stock.drop('target',axis=1)

Out[276]:

	sector	ticker	price	beta	pe	mktcap	dy
0	tech	AAPL	157.25	1.23	26.11	2539	0.58
1	tech	MSFT	261.47	0.93	26.98	1942	0.95
2	engr	XOM	93.87	1.08	10.27	391	3.68
3	engr	COP	109.45	1.38	8.87	137	1.00
4	fin	BAC	33.47	1.40	10.46	269	2.62
5	fin	JPM	114.51	1.10	9.18	336	3.52
6	fin	WMT	335.50	1.39	7.53	114	3.00

In [277]:

# but this wont affect the orginal data
stock

Out[277]:

	sector	ticker	price	beta	pe	mktcap	dy	target
0	tech	AAPL	157.25	1.23	26.11	2539	0.58	184.01
1	tech	MSFT	261.47	0.93	26.98	1942	0.95	333.00
2	engr	XOM	93.87	1.08	10.27	391	3.68	103.28
3	engr	COP	109.45	1.38	8.87	137	1.00	75.00
4	fin	BAC	33.47	1.40	10.46	269	2.62	42.41
5	fin	JPM	114.51	1.10	9.18	336	3.52	138.80
6	fin	WMT	335.50	1.39	7.53	114	3.00	394.07

In [278]:

# must put down inplace option
stock.drop('target', axis=1, inplace=True)
stock

Out[278]:

	sector	ticker	price	beta	pe	mktcap	dy
0	tech	AAPL	157.25	1.23	26.11	2539	0.58
1	tech	MSFT	261.47	0.93	26.98	1942	0.95
2	engr	XOM	93.87	1.08	10.27	391	3.68
3	engr	COP	109.45	1.38	8.87	137	1.00
4	fin	BAC	33.47	1.40	10.46	269	2.62
5	fin	JPM	114.51	1.10	9.18	336	3.52
6	fin	WMT	335.50	1.39	7.53	114	3.00

In [279]:

stock['target'] = [184.01, 333.00, 103.28, 75.00, 42.41, 138.80, 394.07]
stock

Out[279]:

	sector	ticker	price	beta	pe	mktcap	dy	target
0	tech	AAPL	157.25	1.23	26.11	2539	0.58	184.01
1	tech	MSFT	261.47	0.93	26.98	1942	0.95	333.00
2	engr	XOM	93.87	1.08	10.27	391	3.68	103.28
3	engr	COP	109.45	1.38	8.87	137	1.00	75.00
4	fin	BAC	33.47	1.40	10.46	269	2.62	42.41
5	fin	JPM	114.51	1.10	9.18	336	3.52	138.80
6	fin	WMT	335.50	1.39	7.53	114	3.00	394.07

In [280]:

stock2 = stock.drop(6, axis=0)
stock2

Out[280]:

	sector	ticker	price	beta	pe	mktcap	dy	target
0	tech	AAPL	157.25	1.23	26.11	2539	0.58	184.01
1	tech	MSFT	261.47	0.93	26.98	1942	0.95	333.00
2	engr	XOM	93.87	1.08	10.27	391	3.68	103.28
3	engr	COP	109.45	1.38	8.87	137	1.00	75.00
4	fin	BAC	33.47	1.40	10.46	269	2.62	42.41
5	fin	JPM	114.51	1.10	9.18	336	3.52	138.80

In [281]:

newstock = pd.DataFrame([['ind', 184.71, 1.00, 14.61, 98, 2.63, 214.00]], 
                        index=['CAT'],
                        columns=['sector', 'price', 'beta', 'pe', 'mktcap', 'dy', 'target'])
                        
newstock

Out[281]:

	sector	price	beta	pe	mktcap	dy	target
CAT	ind	184.71	1.0	14.61	98	2.63	214.0

In [282]:

stock3 = stock.append(newstock)
stock3

Out[282]:

	sector	ticker	price	beta	pe	mktcap	dy	target
0	tech	AAPL	157.25	1.23	26.11	2539	0.58	184.01
1	tech	MSFT	261.47	0.93	26.98	1942	0.95	333.00
2	engr	XOM	93.87	1.08	10.27	391	3.68	103.28
3	engr	COP	109.45	1.38	8.87	137	1.00	75.00
4	fin	BAC	33.47	1.40	10.46	269	2.62	42.41
5	fin	JPM	114.51	1.10	9.18	336	3.52	138.80
6	fin	WMT	335.50	1.39	7.53	114	3.00	394.07
CAT	ind	NaN	184.71	1.00	14.61	98	2.63	214.00

In [283]:

stock3 = pd.concat([stock, newstock])
stock3

Out[283]:

	sector	ticker	price	beta	pe	mktcap	dy	target
0	tech	AAPL	157.25	1.23	26.11	2539	0.58	184.01
1	tech	MSFT	261.47	0.93	26.98	1942	0.95	333.00
2	engr	XOM	93.87	1.08	10.27	391	3.68	103.28
3	engr	COP	109.45	1.38	8.87	137	1.00	75.00
4	fin	BAC	33.47	1.40	10.46	269	2.62	42.41
5	fin	JPM	114.51	1.10	9.18	336	3.52	138.80
6	fin	WMT	335.50	1.39	7.53	114	3.00	394.07
CAT	ind	NaN	184.71	1.00	14.61	98	2.63	214.00

replacing value¶

In [284]:

stock.loc[6, 'ticker'] = 'GS'
stock

Out[284]:

	sector	ticker	price	beta	pe	mktcap	dy	target
0	tech	AAPL	157.25	1.23	26.11	2539	0.58	184.01
1	tech	MSFT	261.47	0.93	26.98	1942	0.95	333.00
2	engr	XOM	93.87	1.08	10.27	391	3.68	103.28
3	engr	COP	109.45	1.38	8.87	137	1.00	75.00
4	fin	BAC	33.47	1.40	10.46	269	2.62	42.41
5	fin	JPM	114.51	1.10	9.18	336	3.52	138.80
6	fin	GS	335.50	1.39	7.53	114	3.00	394.07

In [285]:

stock['price'].replace(335.50, 435.50, inplace=True)

In [286]:

stock

Out[286]:

	sector	ticker	price	beta	pe	mktcap	dy	target
0	tech	AAPL	157.25	1.23	26.11	2539	0.58	184.01
1	tech	MSFT	261.47	0.93	26.98	1942	0.95	333.00
2	engr	XOM	93.87	1.08	10.27	391	3.68	103.28
3	engr	COP	109.45	1.38	8.87	137	1.00	75.00
4	fin	BAC	33.47	1.40	10.46	269	2.62	42.41
5	fin	JPM	114.51	1.10	9.18	336	3.52	138.80
6	fin	GS	435.50	1.39	7.53	114	3.00	394.07

selecting row or column or both¶

In [287]:

print(stock['ticker']); print('\n');
print(stock[['ticker']]); print('\n');
print(stock.ticker); print('\n');
print(stock.loc[:, 'ticker']); print('\n');
print(stock.iloc[:, 1]); print('\n');
print(stock.iloc[:, -1]); print('\n');

0    AAPL
1    MSFT
2     XOM
3     COP
4     BAC
5     JPM
6      GS
Name: ticker, dtype: object


  ticker
0   AAPL
1   MSFT
2    XOM
3    COP
4    BAC
5    JPM
6     GS


0    AAPL
1    MSFT
2     XOM
3     COP
4     BAC
5     JPM
6      GS
Name: ticker, dtype: object


0    AAPL
1    MSFT
2     XOM
3     COP
4     BAC
5     JPM
6      GS
Name: ticker, dtype: object


0    AAPL
1    MSFT
2     XOM
3     COP
4     BAC
5     JPM
6      GS
Name: ticker, dtype: object


0    184.01
1    333.00
2    103.28
3     75.00
4     42.41
5    138.80
6    394.07
Name: target, dtype: float64

In [288]:

stock.loc[:, 'ticker':'beta']

Out[288]:

	ticker	price	beta
0	AAPL	157.25	1.23
1	MSFT	261.47	0.93
2	XOM	93.87	1.08
3	COP	109.45	1.38
4	BAC	33.47	1.40
5	JPM	114.51	1.10
6	GS	435.50	1.39

In [289]:

stock.loc[:, :]

Out[289]:

	sector	ticker	price	beta	pe	mktcap	dy	target
0	tech	AAPL	157.25	1.23	26.11	2539	0.58	184.01
1	tech	MSFT	261.47	0.93	26.98	1942	0.95	333.00
2	engr	XOM	93.87	1.08	10.27	391	3.68	103.28
3	engr	COP	109.45	1.38	8.87	137	1.00	75.00
4	fin	BAC	33.47	1.40	10.46	269	2.62	42.41
5	fin	JPM	114.51	1.10	9.18	336	3.52	138.80
6	fin	GS	435.50	1.39	7.53	114	3.00	394.07

In [290]:

stock.loc[0:2, 'beta']

Out[290]:

0    1.23
1    0.93
2    1.08
Name: beta, dtype: float64

In [291]:

stock.loc[3, "ticker"]

Out[291]:

'COP'

In [292]:

stock.loc[5, "mktcap"] > stock.loc[6, "mktcap"]

Out[292]:

True

In [293]:

stock.index

Out[293]:

Int64Index([0, 1, 2, 3, 4, 5, 6], dtype='int64')

data summary¶

In [294]:

stock.describe()

Out[294]:

	price	beta	pe	mktcap	dy	target
count	7.000000	7.000000	7.000000	7.000000	7.000000	7.000000
mean	172.217143	1.215714	14.200000	818.285714	2.192857	181.510000
std	135.503274	0.184829	8.492224	991.686058	1.315025	133.400706
min	33.470000	0.930000	7.530000	114.000000	0.580000	42.410000
25%	101.660000	1.090000	9.025000	203.000000	0.975000	89.140000
50%	114.510000	1.230000	10.270000	336.000000	2.620000	138.800000
75%	209.360000	1.385000	18.285000	1166.500000	3.260000	258.505000
max	435.500000	1.400000	26.980000	2539.000000	3.680000	394.070000

In [295]:

print(stock['price'].mean()); 
print(stock['beta'].median()); 
print(stock['dy'].std()); 
print(stock['mktcap'].max());

172.21714285714285
1.23
1.315024895654252
2539

In [296]:

stock.groupby('sector').mean().round(2)

Out[296]:

	price	beta	pe	mktcap	dy	target
sector
engr	101.66	1.23	9.57	264.00	2.34	89.14
fin	194.49	1.30	9.06	239.67	3.05	191.76
tech	209.36	1.08	26.54	2240.50	0.76	258.50

In [297]:

stock.set_index('ticker', inplace=True)

In [298]:

stock.loc['JPM', 'mktcap'] > stock.loc['GS', 'mktcap']

Out[298]:

True

In [299]:

stock

Out[299]:

	sector	price	beta	pe	mktcap	dy	target
ticker
AAPL	tech	157.25	1.23	26.11	2539	0.58	184.01
MSFT	tech	261.47	0.93	26.98	1942	0.95	333.00
XOM	engr	93.87	1.08	10.27	391	3.68	103.28
COP	engr	109.45	1.38	8.87	137	1.00	75.00
BAC	fin	33.47	1.40	10.46	269	2.62	42.41
JPM	fin	114.51	1.10	9.18	336	3.52	138.80
GS	fin	435.50	1.39	7.53	114	3.00	394.07