import pandas as pd
cities = {'name': ["Copenhagen", "London"],
'population': [1.5, 11.2],
'dist_to_coast': [0.0, 2.3]}
df = pd.DataFrame(cities)
df
df[df.name=='London']
df.population.mean()
df.iloc[0]
df.iloc[1]
df = df.set_index('name')
df
df.loc["London"]
df.index
df.columns
We can transpose the dataframe, (rows -> columns)
df.T
df.loc["London"].population
Delimited files, separated by comma, semi-colon, tabs, spaces or any other special character, is a very common data format for tabular data. Comma separated value (csv) files can be read by the pandas read_csv
function. It is a very powerful function, with a lot of options. It is very rare, that you have to write your own python function to parse csv files.
Below is an example of csv file:
df = pd.read_csv("data/boogus.csv",
comment="%",
sep=";",
decimal=",",
parse_dates=[["dato","tid"]],
usecols=[1,2,3])
df
Most functions in Pandas returns a copy, so even though the below line, looks like it changes the name, since it is printed to the screen, the df
variable is not changed.
df.rename(columns={"temperatur": "air_temperature"})
df # not modfied
df_new_name = df.rename(columns={"temperatur": "air_temperature"})
df_new_name
Inline exercise
air_temperature
column from the df_new_name
dataframe like this: df_new_name.air_temperature
and call the .min()
and max()
methods.pd.read_csv?
df = pd.read_csv("data/oceandata.csv", parse_dates=True, index_col='datetime')
df.head()
type(df)
type(df.waterlevel)
df.plot()
df.waterlevel.plot()
df.index
df.describe()
df.loc['2015-02-18'].plot()
df.loc['2015-02-18'].interpolate().plot()
df.loc['2015-02-18 14:00':'2015-02-18 15:20']
df_interp = df.interpolate()
df_interp.loc['2015-02-18 14:00':'2015-02-18 15:20']
Aggregate temporal data
df.resample('H')
Resampling requires an aggregation function, e.g., sum, mean, median,...
df.resample('D').sum().head()
The sum
function doesn't make sense in this example. Better to use mean
.
df.resample('H').mean().head()
df.resample('H').first().head()
df.resample('H').median().head()
df_h = df.resample('H').interpolate().dropna()
df_h.head()
Note: resample will use either the left or the right end-point depending on the resampling frequency (e.g. for hours the beginning of the hour but for months the end of the month). If you want to make sure you are resampling right - specify the closed argument.
Please find the maximum value for every 6 hour period.
# insert your code here
rng = pd.date_range("2015-02-17","2015-02-20",freq='H')
ix = pd.DatetimeIndex(rng)
dfr = df_interp.reindex(ix)
dfr.plot()
dfr.ffill().plot()
df_extra = dfr.bfill().ffill()
df_extra.plot()
df_extra
from IPython.display import YouTubeVideo
YouTubeVideo("8upGdZMlkYM")
For more tips and tricks on how to use Pandas for timeseries data see this talk: Ian Ozsvald: A gentle introduction to Pandas timeseries and Seaborn | PyData London 2019