This notebook serves as explanation for a stackoverflow solution:
More answers can be found at https://stackoverflow.com/search?q=user:333403+[pandas].
Please remember: upvoting answers keeps people motivated. (And reputation helps to place bounties and thus get answers to hard questions).
import time
print(time.ctime())
try:
import ipydex
# this optional and enables the magic comment `##:` to view intermediate results
%load_ext ipydex.displaytools
except ImportError:
pass
Sun Dec 25 10:54:25 2022
# original data
import numpy as np
import pandas as pd
df = pd.DataFrame(
{"Timestamp" : ['1993-11-01' ,'1993-11-02', '1993-11-03', '1993-11-04','1993-11-15'],
"Austria" : [6.18 ,6.18, 6.17, 6.17, 6.40],
"Belgium" : [7.05, 7.05, 7.2, 7.5, 7.6],"France" : [7.69, 7.61, 7.67, 7.91, 8.61]
},
index = [1, 2, 3,4,5]
) ##:
Timestamp | Austria | Belgium | France | |
---|---|---|---|---|
1 | 1993-11-01 | 6.18 | 7.05 | 7.69 |
2 | 1993-11-02 | 6.18 | 7.05 | 7.61 |
3 | 1993-11-03 | 6.17 | 7.20 | 7.67 |
4 | 1993-11-04 | 6.17 | 7.50 | 7.91 |
5 | 1993-11-15 | 6.40 | 7.60 | 8.61 |
---
import itertools as it
# omit timestamp-col
cols = list(df.columns)[1:] ##:
col_pairs = list(it.combinations(cols, 2)) ##:
# new data frame (yet empty)
res = pd.DataFrame()
cols := ['Austria', 'Belgium', 'France']
---
col_pairs := [('Austria', 'Belgium'), ('Austria', 'France'), ('Belgium', 'France')]
---
# complete solution
for pair in col_pairs:
# select the first three letters of each name of the pair
corr_name = f"{pair[0][:3]}_{pair[1][:3]}_corr"
res[corr_name] = df[list(pair)].\
rolling(min_periods=1, window=3).\
corr().iloc[0::2, -1].reset_index(drop=True)
print(str(res))
Aus_Bel_corr Aus_Fra_corr Bel_Fra_corr 0 NaN NaN NaN 1 NaN NaN NaN 2 -1.000000 -0.277350 0.277350 3 -0.755929 -0.654654 0.989743 4 0.693375 0.969346 0.849167
# one iteration (i.e. one column):
pair = col_pairs[0] ##:
corr_name = f"{pair[0][:3]}_{pair[1][:3]}_corr" ##:
pair := ('Austria', 'Belgium')
---
corr_name := 'Aus_Bel_corr'
---
# select the corresponding columns
tmp = df[list(pair)] ##:
Austria | Belgium | |
---|---|---|
1 | 6.18 | 7.05 |
2 | 6.18 | 7.05 |
3 | 6.17 | 7.20 |
4 | 6.17 | 7.50 |
5 | 6.40 | 7.60 |
---
# this rolling objects does not yet specify the operation
tmp2 = tmp.rolling(min_periods=1, window=3) ##:
tmp2 := Rolling [window=3,min_periods=1,center=False,axis=0,method=single]
---
# now specify the operation to `corr`
tmp3 = tmp2.corr() ##:
Austria | Belgium | ||
---|---|---|---|
1 | Austria | NaN | NaN |
Belgium | NaN | NaN | |
2 | Austria | NaN | NaN |
Belgium | NaN | NaN | |
3 | Austria | 1.000000 | -1.000000 |
Belgium | -1.000000 | 1.000000 | |
4 | Austria | 1.000000 | -0.755929 |
Belgium | -0.755929 | 1.000000 | |
5 | Austria | 1.000000 | 0.693375 |
Belgium | 0.693375 | 1.000000 |
---
# The previous result contains to much data. We are interested only in every second row.
# It also has a "messed up" index:
tmp3.index
MultiIndex([(1, 'Austria'), (1, 'Belgium'), (2, 'Austria'), (2, 'Belgium'), (3, 'Austria'), (3, 'Belgium'), (4, 'Austria'), (4, 'Belgium'), (5, 'Austria'), (5, 'Belgium')], )
# To only select every second row, we use the iloc property which is documentend here:
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html
# in short: it allows access to the dataframe cells by purely integer indexing (without using column names etc.)
print(tmp3.iloc.__doc__[:800])
Purely integer-location based indexing for selection by position. ``.iloc[]`` is primarily integer position based (from ``0`` to ``length-1`` of the axis), but may also be used with a boolean array. Allowed inputs are: - An integer, e.g. ``5``. - A list or array of integers, e.g. ``[4, 3, 0]``. - A slice object with ints, e.g. ``1:7``. - A boolean array. - A ``callable`` function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above). This is useful in method chains, when you don't have a reference to the calling object, but would like to base your selection on some value. ``.iloc`` will raise ``IndexError`` if a requested indexer is out-of-bounds, except *slice* indexers which allow out-of-bounds indexing (this conforms
# the slicing-syntax of iloc is like for numpy arrays: `start:stop:step` for rows and columns
# `0::2` means: beginning with row index 0 until the end (no stop index) with a stepsizt of 2
# the `-1` means: just select the last column (index -1), no colon -> no slice
# for more information see: https://numpy.org/doc/stable/user/basics.indexing.html#slicing-and-striding
tmp4 = tmp3.iloc[0::2, -1] ##:\n
type(tmp4) ##:
tmp4 := 1 Austria NaN 2 Austria NaN 3 Austria -1.000000 4 Austria -0.755929 5 Austria 0.693375 Name: Belgium, dtype: float64
---
(type(tmp4)) := pandas.core.series.Series
---
# finally, we install a new index beginning at 0:
tmp5 = tmp4.reset_index(drop=True)##:\n
tmp5 := 0 NaN 1 NaN 2 -1.000000 3 -0.755929 4 0.693375 Name: Belgium, dtype: float64
---