Licensed under CC BY 4.0 Creative Commons
If you want to follow along, this is a notebook that you can view or run yourself:
You will need pandas > 0.15 (an easy solution is using Anaconda). Some imports:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn
pd.options.display.max_rows = 8
AirBase (The European Air quality dataBase): hourly measurements of all air quality monitoring stations from Europe
Starting from these hourly data for different stations:
data
BETR801 | BETN029 | FR04037 | FR04012 | |
---|---|---|---|---|
1990-01-01 00:00:00 | NaN | 16.0 | NaN | NaN |
1990-01-01 01:00:00 | NaN | 18.0 | NaN | NaN |
1990-01-01 02:00:00 | NaN | 21.0 | NaN | NaN |
1990-01-01 03:00:00 | NaN | 26.0 | NaN | NaN |
... | ... | ... | ... | ... |
2012-12-31 20:00:00 | 16.5 | 2.0 | 16 | 47 |
2012-12-31 21:00:00 | 14.5 | 2.5 | 13 | 43 |
2012-12-31 22:00:00 | 16.5 | 3.5 | 14 | 42 |
2012-12-31 23:00:00 | 15.0 | 3.0 | 13 | 49 |
198895 rows × 4 columns
to answering questions about these data in a few lines of code:
Does the air pollution show a decreasing trend over the years?
data['1999':].resample('A').plot(ylim=[0,100])
<matplotlib.axes._subplots.AxesSubplot at 0xab4c292c>
How many exceedances of the limit values?
exceedances = data > 200
exceedances = exceedances.groupby(exceedances.index.year).sum()
ax = exceedances.loc[2005:].plot(kind='bar')
ax.axhline(18, color='k', linestyle='--')
<matplotlib.lines.Line2D at 0xab02004c>
What is the difference in diurnal profile between weekdays and weekend?
data['weekday'] = data.index.weekday
data['weekend'] = data['weekday'].isin([5, 6])
data_weekend = data.groupby(['weekend', data.index.hour])['FR04012'].mean().unstack(level=0)
data_weekend.plot()
<matplotlib.axes._subplots.AxesSubplot at 0xab3cb5cc>
We will come back to these examples, and build them up step by step.
When working with tabular or structured data (like an R dataframe, SQL table, Excel spreadsheet, ...), the pandas library has become essential for data-intensive work in Python.

What is pandas? You can think of it as R's data.frame in Python. Its documentation: http://pandas.pydata.org/pandas-docs/stable/

Some of the functionality pandas provides:

  * handling of missing data (.dropna(), pd.isnull())
  * merging and joining of datasets (concat, join)
  * groupby functionality
  * reshaping of data (stack, pivot)

Pandas does this through two fundamental object types, both built upon NumPy arrays: the Series object and the DataFrame object.
A Series is a basic holder for one-dimensional labeled data. It can be created much as a NumPy array is created:
s = pd.Series([0.1, 0.2, 0.3, 0.4])
s
0    0.1
1    0.2
2    0.3
3    0.4
dtype: float64
index and values

The series has a built-in concept of an index, which by default is the numbers 0 through N - 1:
s.index
Int64Index([0, 1, 2, 3], dtype='int64')
You can access the underlying numpy array representation with the .values
attribute:
s.values
array([ 0.1, 0.2, 0.3, 0.4])
We can access series values via the index, just like for NumPy arrays:
s[0]
0.10000000000000001
Unlike the NumPy array, though, this index can be something other than integers:
s2 = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])
s2
a    0
b    1
c    2
d    3
dtype: int32
s2['c']
2
In this way, a Series
object can be thought of as similar to an ordered dictionary mapping one typed value to another typed value:
population = pd.Series({'Germany': 81.3, 'Belgium': 11.3, 'France': 64.3, 'United Kingdom': 64.9, 'Netherlands': 16.9})
population
Belgium           11.3
France            64.3
Germany           81.3
Netherlands       16.9
United Kingdom    64.9
dtype: float64
population['France']
64.299999999999997
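The dict analogy extends to membership tests and lookups with a default; a small sketch reusing the same population series (the 'Spain' lookup is just an illustrative missing label):

```python
import pandas as pd

population = pd.Series({'Germany': 81.3, 'Belgium': 11.3, 'France': 64.3,
                        'United Kingdom': 64.9, 'Netherlands': 16.9})

# membership tests check the index, like a dict's keys
print('France' in population)       # True
# .get returns a default instead of raising KeyError for missing labels
print(population.get('Spain', 0))   # 0
```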
but with the power of numpy arrays:
population * 1000
Belgium           11300
France            64300
Germany           81300
Netherlands       16900
United Kingdom    64900
dtype: float64
We can index or slice the populations as expected:
population['Belgium']
11.300000000000001
population['Belgium':'Germany']
Belgium    11.3
France     64.3
Germany    81.3
dtype: float64
Many things you can do with numpy arrays can also be applied to Series objects.
Fancy indexing, like indexing with a list or boolean indexing:
population[['France', 'Netherlands']]
France         64.3
Netherlands    16.9
dtype: float64
population[population > 20]
France            64.3
Germany           81.3
United Kingdom    64.9
dtype: float64
Element-wise operations:
population / 100
Belgium           0.113
France            0.643
Germany           0.813
Netherlands       0.169
United Kingdom    0.649
dtype: float64
A range of methods:
population.mean()
47.739999999999995
One thing to pay attention to is alignment: operations between series will align on the index:
s1 = population[['Belgium', 'France']]
s2 = population[['France', 'Germany']]
s1
Belgium    11.3
France     64.3
dtype: float64
s2
France     64.3
Germany    81.3
dtype: float64
s1 + s2
Belgium      NaN
France     128.6
Germany      NaN
dtype: float64
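If NaN for the non-overlapping labels is not what you want, the add method takes a fill_value; a small sketch with the same two series:

```python
import pandas as pd

population = pd.Series({'Belgium': 11.3, 'France': 64.3, 'Germany': 81.3})
s1 = population[['Belgium', 'France']]
s2 = population[['France', 'Germany']]

# fill_value substitutes 0 for labels missing on either side before adding
total = s1.add(s2, fill_value=0)
print(total)
```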
A DataFrame is a tabular data structure (a multi-dimensional object to hold labeled data) comprised of rows and columns, akin to a spreadsheet, database table, or R's data.frame object. You can think of it as multiple Series objects which share the same index.
One of the most common ways of creating a dataframe is from a dictionary of arrays or lists.
Note that in the IPython notebook, the dataframe will display in a rich HTML view:
data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
'population': [11.3, 64.3, 81.3, 16.9, 64.9],
'area': [30510, 671308, 357050, 41526, 244820],
'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']}
countries = pd.DataFrame(data)
countries
area | capital | country | population | |
---|---|---|---|---|
0 | 30510 | Brussels | Belgium | 11.3 |
1 | 671308 | Paris | France | 64.3 |
2 | 357050 | Berlin | Germany | 81.3 |
3 | 41526 | Amsterdam | Netherlands | 16.9 |
4 | 244820 | London | United Kingdom | 64.9 |
Besides an index attribute, a DataFrame also has a columns attribute:
countries.index
Int64Index([0, 1, 2, 3, 4], dtype='int64')
countries.columns
Index([u'area', u'capital', u'country', u'population'], dtype='object')
To check the data types of the different columns:
countries.dtypes
area            int64
capital        object
country        object
population    float64
dtype: object
An overview of that information can be given with the info() method:
countries.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 4 columns):
area          5 non-null int64
capital       5 non-null object
country       5 non-null object
population    5 non-null float64
dtypes: float64(1), int64(1), object(2)
memory usage: 160.0+ bytes
A DataFrame also has a values attribute, but beware: when you have heterogeneous data, all values will be upcast to a common type (here: object):
countries.values
array([[30510L, 'Brussels', 'Belgium', 11.3],
       [671308L, 'Paris', 'France', 64.3],
       [357050L, 'Berlin', 'Germany', 81.3],
       [41526L, 'Amsterdam', 'Netherlands', 16.9],
       [244820L, 'London', 'United Kingdom', 64.9]], dtype=object)
If we don't like what the index looks like, we can reset it and set one of our columns as the index:
countries = countries.set_index('country')
countries
area | capital | population | |
---|---|---|---|
country | |||
Belgium | 30510 | Brussels | 11.3 |
France | 671308 | Paris | 64.3 |
Germany | 357050 | Berlin | 81.3 |
Netherlands | 41526 | Amsterdam | 16.9 |
United Kingdom | 244820 | London | 64.9 |
To access a Series representing a column in the data, use typical indexing syntax:
countries['area']
country
Belgium            30510
France            671308
Germany           357050
Netherlands        41526
United Kingdom    244820
Name: area, dtype: int64
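As an aside: columns whose names are valid Python identifiers can also be accessed as attributes. This is a convenience, not a replacement for [] indexing (it fails for names with spaces or names that clash with DataFrame methods); a minimal sketch on a reduced version of the countries frame:

```python
import pandas as pd

countries = pd.DataFrame({'area': [30510, 671308]},
                         index=pd.Index(['Belgium', 'France'], name='country'))

# attribute access returns the same Series as countries['area']
same = countries.area.equals(countries['area'])
print(same)   # True
```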
As you play around with DataFrames, you'll notice that many operations which work on NumPy arrays will also work on dataframes.
Let's compute density of each country:
countries['population']*1000000 / countries['area']
country
Belgium           370.370370
France             95.783158
Germany           227.699202
Netherlands       406.973944
United Kingdom    265.092721
dtype: float64
Adding a new column to the dataframe is very simple:
countries['density'] = countries['population']*1000000 / countries['area']
countries
area | capital | population | density | |
---|---|---|---|---|
country | ||||
Belgium | 30510 | Brussels | 11.3 | 370.370370 |
France | 671308 | Paris | 64.3 | 95.783158 |
Germany | 357050 | Berlin | 81.3 | 227.699202 |
Netherlands | 41526 | Amsterdam | 16.9 | 406.973944 |
United Kingdom | 244820 | London | 64.9 | 265.092721 |
We can use masking to select certain data:
countries[countries['density'] > 300]
area | capital | population | density | |
---|---|---|---|---|
country | ||||
Belgium | 30510 | Brussels | 11.3 | 370.370370 |
Netherlands | 41526 | Amsterdam | 16.9 | 406.973944 |
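Multiple conditions can be combined with & and |; note the required parentheses around each comparison. A sketch with made-up threshold values:

```python
import pandas as pd

countries = pd.DataFrame(
    {'area': [30510, 671308, 357050, 41526, 244820],
     'density': [370.4, 95.8, 227.7, 407.0, 265.1]},
    index=['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'])

# each condition in parentheses; & is the element-wise "and" for boolean Series
dense_and_small = countries[(countries['density'] > 300) & (countries['area'] < 40000)]
print(dense_and_small.index.tolist())   # ['Belgium']
```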
And we can do things like sorting the rows by one of the columns:
countries.sort_values(by='density', ascending=False)
area | capital | population | density | |
---|---|---|---|---|
country | ||||
Netherlands | 41526 | Amsterdam | 16.9 | 406.973944 |
Belgium | 30510 | Brussels | 11.3 | 370.370370 |
United Kingdom | 244820 | London | 64.9 | 265.092721 |
Germany | 357050 | Berlin | 81.3 | 227.699202 |
France | 671308 | Paris | 64.3 | 95.783158 |
One useful method is describe, which computes summary statistics for each column:
countries.describe()
area | population | density | |
---|---|---|---|
count | 5.000000 | 5.000000 | 5.000000 |
mean | 269042.800000 | 47.740000 | 273.183879 |
std | 264012.827994 | 31.519645 | 123.440607 |
min | 30510.000000 | 11.300000 | 95.783158 |
25% | 41526.000000 | 16.900000 | 227.699202 |
50% | 244820.000000 | 64.300000 | 265.092721 |
75% | 357050.000000 | 64.900000 | 370.370370 |
max | 671308.000000 | 81.300000 | 406.973944 |
The plot method can be used to quickly visualize the data in different ways:
countries.plot()
<matplotlib.axes._subplots.AxesSubplot at 0xab20740c>
However, for this dataset, it does not say that much.
countries['population'].plot(kind='bar')
<matplotlib.axes._subplots.AxesSubplot at 0xab258d8c>
countries.plot(kind='scatter', x='population', y='area')
<matplotlib.axes._subplots.AxesSubplot at 0xab12eaec>
The available plotting types: ‘line’ (default), ‘bar’, ‘barh’, ‘hist’, ‘box’ , ‘kde’, ‘area’, ‘pie’, ‘scatter’, ‘hexbin’.
One of pandas' basic features is the labeling of rows and columns, but this also makes indexing a bit more complex compared to numpy. We now have to distinguish between selection by label and selection by position.
For a DataFrame, basic indexing selects the columns.
Selecting a single column:
countries['area']
country
Belgium            30510
France            671308
Germany           357050
Netherlands        41526
United Kingdom    244820
Name: area, dtype: int64
or multiple columns:
countries[['area', 'density']]
area | density | |
---|---|---|
country | ||
Belgium | 30510 | 370.370370 |
France | 671308 | 95.783158 |
Germany | 357050 | 227.699202 |
Netherlands | 41526 | 406.973944 |
United Kingdom | 244820 | 265.092721 |
But, slicing accesses the rows:
countries['France':'Netherlands']
area | capital | population | density | |
---|---|---|---|---|
country | ||||
France | 671308 | Paris | 64.3 | 95.783158 |
Germany | 357050 | Berlin | 81.3 | 227.699202 |
Netherlands | 41526 | Amsterdam | 16.9 | 406.973944 |
For more advanced indexing, you have some extra attributes:

  * loc: selection by label
  * iloc: selection by position

countries.loc['Germany', 'area']
357050
countries.loc['France':'Germany', :]
area | capital | population | density | |
---|---|---|---|---|
country | ||||
France | 671308 | Paris | 64.3 | 95.783158 |
Germany | 357050 | Berlin | 81.3 | 227.699202 |
countries.loc[countries['density']>300, ['capital', 'population']]
capital | population | |
---|---|---|
country | ||
Belgium | Brussels | 11.3 |
Netherlands | Amsterdam | 16.9 |
Selecting by position with iloc works similarly to indexing numpy arrays:
countries.iloc[0:2,1:3]
capital | population | |
---|---|---|
country | ||
Belgium | Brussels | 11.3 |
France | Paris | 64.3 |
The different indexing methods can also be used to assign data:
countries.loc['Belgium':'Germany', 'population'] = 10
countries
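This label-based assignment also works with a boolean mask on the left-hand side; a minimal sketch on a toy frame (the threshold is made up):

```python
import pandas as pd

df = pd.DataFrame({'population': [11.3, 64.3, 81.3],
                   'density': [370.4, 95.8, 227.7]},
                  index=['Belgium', 'France', 'Germany'])

# assign only to the rows where the mask is True, in a single column
df.loc[df['density'] > 300, 'population'] = 0
print(df['population'].tolist())   # [0.0, 64.3, 81.3]
```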
There are many, many more interesting operations that can be done on Series and DataFrame objects, but rather than continue using this toy data, we'll instead move to a real-world example, and illustrate some of the advanced concepts along the way.
AirBase: hourly measurements of all air quality monitoring stations from Europe.
from IPython.display import HTML
HTML('<iframe src=http://www.eea.europa.eu/data-and-maps/data/airbase-the-european-air-quality-database-8#tab-data-by-country width=700 height=350></iframe>')
A wide range of input/output formats are natively supported by pandas: try tab-completing pd.read_<TAB> and countries.to_<TAB> to see the available read functions and write methods.
I downloaded some of the raw data files of AirBase and included it in the repo:
station code: BETR801, pollutant code: 8 (nitrogen dioxide)
!head -1 ./data/BETR8010000800100hour.1-1-1990.31-12-2012
Just reading the tab-delimited data:
data = pd.read_csv("data/BETR8010000800100hour.1-1-1990.31-12-2012", sep='\t')
data.head()
1990-01-01 | -999.000 | 0 | -999.000.1 | 0.1 | -999.000.2 | 0.2 | -999.000.3 | 0.3 | -999.000.4 | ... | -999.000.19 | 0.19 | -999.000.20 | 0.20 | -999.000.21 | 0.21 | -999.000.22 | 0.22 | -999.000.23 | 0.23 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1990-01-02 | -999 | 0 | -999 | 0 | -999 | 0 | -999 | 0 | -999 | ... | 57 | 1 | 58 | 1 | 54 | 1 | 49 | 1 | 48 | 1 |
1 | 1990-01-03 | 51 | 1 | 50 | 1 | 47 | 1 | 48 | 1 | 51 | ... | 84 | 1 | 75 | 1 | -999 | 0 | -999 | 0 | -999 | 0 |
2 | 1990-01-04 | -999 | 0 | -999 | 0 | -999 | 0 | -999 | 0 | -999 | ... | 69 | 1 | 65 | 1 | 64 | 1 | 60 | 1 | 59 | 1 |
3 | 1990-01-05 | 51 | 1 | 51 | 1 | 48 | 1 | 50 | 1 | 51 | ... | -999 | 0 | -999 | 0 | -999 | 0 | -999 | 0 | -999 | 0 |
4 | 1990-01-06 | -999 | 0 | -999 | 0 | -999 | 0 | -999 | 0 | -999 | ... | -999 | 0 | -999 | 0 | -999 | 0 | -999 | 0 | -999 | 0 |
5 rows × 49 columns
Not really what we want.
Using some more options of read_csv:
colnames = ['date'] + [item for pair in zip(["{:02d}".format(i) for i in range(24)], ['flag']*24) for item in pair]
data = pd.read_csv("data/BETR8010000800100hour.1-1-1990.31-12-2012",
sep='\t', header=None, na_values=[-999, -9999], names=colnames)
data.head()
date | 00 | flag | 01 | flag | 02 | flag | 03 | flag | 04 | ... | 19 | flag | 20 | flag | 21 | flag | 22 | flag | 23 | flag | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1990-01-01 | NaN | 0 | NaN | 0 | NaN | 0 | NaN | 0 | NaN | ... | NaN | 0 | NaN | 0 | NaN | 0 | NaN | 0 | NaN | 0 |
1 | 1990-01-02 | NaN | 1 | NaN | 1 | NaN | 1 | NaN | 1 | NaN | ... | 57 | 1 | 58 | 1 | 54 | 1 | 49 | 1 | 48 | 1 |
2 | 1990-01-03 | 51 | 0 | 50 | 0 | 47 | 0 | 48 | 0 | 51 | ... | 84 | 0 | 75 | 0 | NaN | 0 | NaN | 0 | NaN | 0 |
3 | 1990-01-04 | NaN | 1 | NaN | 1 | NaN | 1 | NaN | 1 | NaN | ... | 69 | 1 | 65 | 1 | 64 | 1 | 60 | 1 | 59 | 1 |
4 | 1990-01-05 | 51 | 0 | 51 | 0 | 48 | 0 | 50 | 0 | 51 | ... | NaN | 0 | NaN | 0 | NaN | 0 | NaN | 0 | NaN | 0 |
5 rows × 49 columns
So what did we do:

  * told read_csv there is no header row (header=None)
  * specified which values denote missing data (na_values=[-999, -9999])
  * gave the columns proper names (names=colnames)

For now, we disregard the 'flag' columns:
data = data.drop('flag', axis=1)
data
date | 00 | 01 | 02 | 03 | 04 | 05 | 06 | 07 | 08 | ... | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1990-01-01 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 1990-01-02 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 55.0 | 59.0 | 58 | 59.0 | 58.0 | 57.0 | 58.0 | 54.0 | 49.0 | 48.0 |
2 | 1990-01-03 | 51.0 | 50.0 | 47.0 | 48.0 | 51.0 | 52.0 | 58.0 | 57.0 | NaN | ... | 69.0 | 74.0 | NaN | NaN | 103.0 | 84.0 | 75.0 | NaN | NaN | NaN |
3 | 1990-01-04 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | 71.0 | 74 | 70.0 | 70.0 | 69.0 | 65.0 | 64.0 | 60.0 | 59.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
8388 | 2012-12-28 | 26.5 | 28.5 | 35.5 | 32.0 | 35.5 | 50.5 | 62.5 | 74.5 | 76.0 | ... | 56.5 | 52.0 | 55 | 53.5 | 49.0 | 46.5 | 42.5 | 38.5 | 30.5 | 26.5 |
8389 | 2012-12-29 | 21.5 | 16.5 | 13.0 | 13.0 | 16.0 | 23.5 | 23.5 | 27.5 | 46.0 | ... | 48.0 | 41.5 | 36 | 33.0 | 25.5 | 21.0 | 22.0 | 20.5 | 20.0 | 15.0 |
8390 | 2012-12-30 | 11.5 | 9.5 | 7.5 | 7.5 | 10.0 | 11.0 | 13.5 | 13.5 | 17.5 | ... | NaN | 25.0 | 25 | 25.5 | 24.5 | 25.0 | 18.5 | 17.0 | 15.5 | 12.5 |
8391 | 2012-12-31 | 9.5 | 8.5 | 8.5 | 8.5 | 10.5 | 15.5 | 18.0 | 23.0 | 25.0 | ... | NaN | NaN | 28 | 27.5 | 26.0 | 21.0 | 16.5 | 14.5 | 16.5 | 15.0 |
8392 rows × 25 columns
Now, we want to reshape it: our goal is to have the different hours as row indices, merged with the date into a datetime-index.
stack, unstack and pivot

The docs say:

Pivot a level of the (possibly hierarchical) column labels, returning a DataFrame (or Series in the case of an object with a single level of column labels) having a hierarchical index with a new inner-most level of row labels.
df = pd.DataFrame({'A':['one', 'one', 'two', 'two'], 'B':['a', 'b', 'a', 'b'], 'C':range(4)})
df
A | B | C | |
---|---|---|---|
0 | one | a | 0 |
1 | one | b | 1 |
2 | two | a | 2 |
3 | two | b | 3 |
To use stack/unstack, we need the values we want to shift from rows to columns (or the other way around) as the index:
df = df.set_index(['A', 'B'])
df
C | ||
---|---|---|
A | B | |
one | a | 0 |
b | 1 | |
two | a | 2 |
b | 3 |
result = df['C'].unstack()
result
B | a | b |
---|---|---|
A | ||
one | 0 | 1 |
two | 2 | 3 |
df = result.stack().reset_index(name='C')
df
A | B | C | |
---|---|---|---|
0 | one | a | 0 |
1 | one | b | 1 |
2 | two | a | 2 |
3 | two | b | 3 |
pivot is similar to unstack, but lets you specify column names:
df.pivot(index='A', columns='B', values='C')
B | a | b |
---|---|---|
A | ||
one | 0 | 1 |
two | 2 | 3 |
pivot_table is similar to pivot, but can work with duplicate indices and lets you specify an aggregation function:
df = pd.DataFrame({'A':['one', 'one', 'two', 'two', 'one', 'two'], 'B':['a', 'b', 'a', 'b', 'a', 'b'], 'C':range(6)})
df
A | B | C | |
---|---|---|---|
0 | one | a | 0 |
1 | one | b | 1 |
2 | two | a | 2 |
3 | two | b | 3 |
4 | one | a | 4 |
5 | two | b | 5 |
df.pivot_table(index='A', columns='B', values='C', aggfunc='count') #'mean'
B | a | b |
---|---|---|
A | ||
one | 2 | 1 |
two | 1 | 2 |
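Swapping in aggfunc='mean' (as the comment in the cell above hints) averages the duplicate entries instead of counting them:

```python
import pandas as pd

df = pd.DataFrame({'A': ['one', 'one', 'two', 'two', 'one', 'two'],
                   'B': ['a', 'b', 'a', 'b', 'a', 'b'],
                   'C': range(6)})

# mean of C within each (A, B) combination; ('one', 'a') averages rows 0 and 4
table = df.pivot_table(index='A', columns='B', values='C', aggfunc='mean')
print(table)
```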
We can now use stack to create a timeseries:
data = data.set_index('date')
data_stacked = data.stack()
data_stacked
date
1990-01-02  09    48.0
            12    48.0
            13    50.0
            14    55.0
                  ...
2012-12-31  20    16.5
            21    14.5
            22    16.5
            23    15.0
dtype: float64
Now, let's combine the two levels of the index:
data_stacked = data_stacked.reset_index(name='BETR801')
data_stacked.index = pd.to_datetime(data_stacked['date'] + data_stacked['level_1'], format="%Y-%m-%d%H")
data_stacked = data_stacked.drop(['date', 'level_1'], axis=1)
data_stacked
BETR801 | |
---|---|
1990-01-02 09:00:00 | 48.0 |
1990-01-02 12:00:00 | 48.0 |
1990-01-02 13:00:00 | 50.0 |
1990-01-02 14:00:00 | 55.0 |
... | ... |
2012-12-31 20:00:00 | 16.5 |
2012-12-31 21:00:00 | 14.5 |
2012-12-31 22:00:00 | 16.5 |
2012-12-31 23:00:00 | 15.0 |
170794 rows × 1 columns
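The same date-plus-hour trick can be sketched on a toy frame (the column names and values here are made up for illustration):

```python
import pandas as pd

wide = pd.DataFrame({'date': ['1990-01-01', '1990-01-02'],
                     '00': [1.0, 3.0],
                     '01': [2.0, 4.0]}).set_index('date')

# stack moves the hour columns into an inner index level ...
long = wide.stack()
# ... and concatenating the date and hour strings gives a datetime index
long.index = pd.to_datetime([d + h for d, h in long.index], format='%Y-%m-%d%H')
print(long)
```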
For this talk, I put the above code in a separate function, and repeated this for some different monitoring stations:
import airbase
no2 = airbase.load_data()
Some useful methods: head and tail
no2.head(3)
BETR801 | BETN029 | FR04037 | FR04012 | |
---|---|---|---|---|
1990-01-01 00:00:00 | NaN | 16 | NaN | NaN |
1990-01-01 01:00:00 | NaN | 18 | NaN | NaN |
1990-01-01 02:00:00 | NaN | 21 | NaN | NaN |
no2.tail()
BETR801 | BETN029 | FR04037 | FR04012 | |
---|---|---|---|---|
2012-12-31 19:00:00 | 21.0 | 2.5 | 28 | 67 |
2012-12-31 20:00:00 | 16.5 | 2.0 | 16 | 47 |
2012-12-31 21:00:00 | 14.5 | 2.5 | 13 | 43 |
2012-12-31 22:00:00 | 16.5 | 3.5 | 14 | 42 |
2012-12-31 23:00:00 | 15.0 | 3.0 | 13 | 49 |
info()
no2.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 198895 entries, 1990-01-01 00:00:00 to 2012-12-31 23:00:00
Data columns (total 4 columns):
BETR801    170794 non-null float64
BETN029    174807 non-null float64
FR04037    120384 non-null float64
FR04012    119448 non-null float64
dtypes: float64(4)
memory usage: 7.6 MB
Getting some basic summary statistics about the data with describe:
no2.describe()
BETR801 | BETN029 | FR04037 | FR04012 | |
---|---|---|---|---|
count | 170794.000000 | 174807.000000 | 120384.000000 | 119448.000000 |
mean | 47.914561 | 16.687756 | 40.040005 | 87.993261 |
std | 22.230921 | 13.106549 | 23.024347 | 41.317684 |
min | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 32.000000 | 7.000000 | 23.000000 | 61.000000 |
50% | 46.000000 | 12.000000 | 37.000000 | 88.000000 |
75% | 61.000000 | 23.000000 | 54.000000 | 115.000000 |
max | 339.000000 | 115.000000 | 256.000000 | 358.000000 |
Quickly visualizing the data
no2.plot(kind='box', ylim=[0,250])
<matplotlib.axes._subplots.AxesSubplot at 0xaa1d544c>
no2['BETR801'].plot(kind='hist', bins=50)
<matplotlib.axes._subplots.AxesSubplot at 0xaabe97ec>
no2.plot(figsize=(12,6))
<matplotlib.axes._subplots.AxesSubplot at 0xa7b3e74c>
This does not say too much...
We can select part of the data (e.g. the latest 500 data points):
no2[-500:].plot(figsize=(12,6))
<matplotlib.axes._subplots.AxesSubplot at 0xa7c325ec>