Smoothing¶

Smoothing can help to discover trends that otherwise might be hard to see in raw data.

In [1]:

import pandas as pd

from lets_plot import *
LetsPlot.setup_html()

In [2]:

mpg_df = pd.read_csv('https://raw.githubusercontent.com/JetBrains/lets-plot-docs/master/data/mpg.csv')
mpg_df

Out[2]:

	Unnamed: 0	manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	class
0	1	audi	a4	1.8	1999	4	auto(l5)	f	18	29	p	compact
1	2	audi	a4	1.8	1999	4	manual(m5)	f	21	29	p	compact
2	3	audi	a4	2.0	2008	4	manual(m6)	f	20	31	p	compact
3	4	audi	a4	2.0	2008	4	auto(av)	f	21	30	p	compact
4	5	audi	a4	2.8	1999	6	auto(l5)	f	16	26	p	compact
...	...	...	...	...	...	...	...	...	...	...	...	...
229	230	volkswagen	passat	2.0	2008	4	auto(s6)	f	19	28	p	midsize
230	231	volkswagen	passat	2.0	2008	4	manual(m6)	f	21	29	p	midsize
231	232	volkswagen	passat	2.8	1999	6	auto(l5)	f	16	26	p	midsize
232	233	volkswagen	passat	2.8	1999	6	manual(m5)	f	18	26	p	midsize
233	234	volkswagen	passat	3.6	2008	6	auto(s6)	f	17	26	p	midsize

234 rows × 12 columns

The default smoothing method is the linear model (`'lm'`)¶

In [3]:

mpg_plot = ggplot(mpg_df, aes(x='displ', y='hwy'))
mpg_plot + geom_point() + geom_smooth()

Out[3]:

The `LOESS` model seems to fit the MPG data better than the linear model¶

In [4]:

mpg_plot + geom_point() + geom_smooth(method='loess', size=1)

Out[4]:

Applying smoothing to groups¶

Let's map the vehicle drivetrain type (variable 'drv') to the color of the points.

This makes it easy to see that points with the same drivetrain type form groups or clusters.

In [5]:

mpg_plot + geom_point(aes(color='drv'))\
         + geom_smooth(aes(color='drv'), method='loess', size=1)

Out[5]:

Apply a linear model with a 2nd-degree polynomial¶

As the LOESS prediction looks a bit odd for some of the groups, let's try 2nd-degree polynomial regression.

In [6]:

mpg_plot + geom_point(aes(color='drv'))\
         + geom_smooth(aes(color='drv'), method='lm', deg=2, size=1)

Out[6]:
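To make the effect of the polynomial degree concrete, here is a minimal sketch that fits a degree 1 and a degree 2 polynomial to a small sample with NumPy and compares the residual sum of squares. The sample values and the helper `fit_polynomial` are made up for illustration; this is not how Lets-Plot computes the smoother internally.

```python
import numpy as np

# Hypothetical sample with a curved trend, loosely shaped like hwy vs. displ.
x = np.array([1.8, 2.0, 2.8, 3.6, 4.6, 5.4, 6.2, 7.0])
y = np.array([29.0, 30.0, 26.0, 25.0, 22.0, 19.0, 20.0, 24.0])

def fit_polynomial(x, y, deg):
    """Least-squares polynomial fit.

    Returns the fitted values and the residual sum of squares (RSS).
    """
    coeffs = np.polyfit(x, y, deg)      # highest power first
    fitted = np.polyval(coeffs, x)
    rss = float(np.sum((y - fitted) ** 2))
    return fitted, rss

fitted_linear, rss_linear = fit_polynomial(x, y, deg=1)
fitted_quadratic, rss_quadratic = fit_polynomial(x, y, deg=2)

print(f"RSS, degree 1: {rss_linear:.2f}")
print(f"RSS, degree 2: {rss_quadratic:.2f}")
```

The degree 2 fit can never have a larger RSS than the degree 1 fit on the same data, because a straight line is a special case of a parabola. A lower RSS alone does not mean the curve is a better description of the trend.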

Using `as_discrete()` function with numeric data series¶

In the previous examples we used a discrete (or categorical) variable 'drv' to split the data into groups.

Now let's try to use a numeric variable 'cyl' for the same purpose.

In [7]:

mpg_plot + geom_point(aes(color='cyl'))\
         + geom_smooth(aes(color='cyl'), method='lm', deg=2, size=1)

Out[7]:

It is easy to see that the data wasn't split into groups. Lets-Plot offers two solutions in this situation:

- Use the group aesthetic
- Use the as_discrete() function

The group aesthetic splits the data into groups.

In [8]:

mpg_plot + geom_point(aes(color='cyl'))\
         + geom_smooth(aes(color='cyl', group='cyl'), method='lm', deg=2, size=1)

Out[8]:
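What "splitting into groups" means for a smoother can be sketched with pandas: one fit over all rows versus a separate fit for each value of the grouping variable. The data and the helper `slope` below are hypothetical, and a straight-line fit is used instead of the 2nd-degree polynomial to keep the example short.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two groups with opposite trends.
df = pd.DataFrame({
    'x':   [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0],
    'y':   [10.0, 12.0, 14.0, 16.0, 30.0, 27.0, 24.0, 21.0],
    'cyl': [4, 4, 4, 4, 8, 8, 8, 8],
})

def slope(frame):
    """Slope of a straight-line least-squares fit of y on x."""
    return float(np.polyfit(frame['x'], frame['y'], 1)[0])

# Without grouping: a single fit over all rows.
overall_slope = slope(df)

# With grouping: one fit per distinct value of 'cyl'.
group_slopes = {key: slope(group) for key, group in df.groupby('cyl')}

print("overall:", overall_slope)
print("per group:", group_slopes)
```

The single fit averages the two trends away, while the per-group fits recover each one.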

The as_discrete('cyl') function "annotates" the 'cyl' variable as discrete.

This creates the groups and assigns a discrete color scale instead of a continuous one.

In [9]:

from lets_plot.mapping import as_discrete

mpg_plot + geom_point(aes(color='cyl'))\
         + geom_smooth(aes(color=as_discrete('cyl')), method='lm', deg=2, size=1)

Out[9]:
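The reason 'cyl' needs this treatment while 'drv' did not is the column type: 'drv' holds strings, 'cyl' holds numbers. A quick pandas check on a hypothetical stand-in for the column shows the difference. Converting the values to strings is shown only as an assumed alternative that should make the column behave like 'drv'; it changes the data itself, which as_discrete() avoids.

```python
import pandas as pd

# Hypothetical stand-in for the 'cyl' column of the MPG data.
cyl = pd.Series([4, 4, 6, 8, 5, 6, 8, 4], name='cyl')

# A numeric dtype is why the variable is treated as continuous by default.
is_numeric = pd.api.types.is_numeric_dtype(cyl)

# Only a handful of distinct values, so it is categorical in practice.
levels = sorted(cyl.unique().tolist())

# Assumed alternative to as_discrete(): store the values as strings.
cyl_as_text = cyl.astype(str)

print(is_numeric, levels, cyl_as_text.dtype)
```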

Effect of the `span` parameter on the "wiggliness" of the LOESS smoother¶

The span is the fraction of points used to fit each local regression. Small numbers make a wigglier curve, larger numbers make a smoother curve.
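The role of the span can be illustrated with a simplified LOESS-style smoother written in NumPy. This is a sketch under stated assumptions (local straight-line fits, tricube weights, distinct x values, at least 4 points), not the algorithm Lets-Plot uses; the names `local_linear_smooth` and `roughness` are made up for this example.

```python
import numpy as np

def local_linear_smooth(x, y, span):
    """For each x[i], fit a weighted straight line to the nearest
    `span` fraction of points and evaluate it at x[i]."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    if n < 4:
        raise ValueError("need at least 4 points")
    # Number of points in each local fit (at least 4, at most n).
    k = min(n, max(4, int(np.ceil(span * n))))
    smoothed = np.empty(n)
    for i in range(n):
        dist = np.abs(x - x[i])
        nearest = np.argsort(dist)[:k]
        h = dist[nearest].max()
        # Tricube weights: 1 at x[i], falling to 0 at the farthest neighbor.
        w = (1 - (dist[nearest] / h) ** 3) ** 3
        # np.polyfit weights the residuals before squaring, hence the sqrt.
        slope, intercept = np.polyfit(x[nearest], y[nearest], 1, w=np.sqrt(w))
        smoothed[i] = slope * x[i] + intercept
    return smoothed

def roughness(curve):
    """Sum of squared second differences: larger means wigglier."""
    return float(np.sum(np.diff(curve, n=2) ** 2))

rng = np.random.default_rng(0)
x = np.linspace(-2 * np.pi, 2 * np.pi, 150)
y = np.sin(x) + rng.uniform(-0.5, 0.5, size=x.size)

wiggly = local_linear_smooth(x, y, span=0.2)
smooth = local_linear_smooth(x, y, span=0.9)
print("roughness, span=0.2:", roughness(wiggly))
print("roughness, span=0.9:", roughness(smooth))
```

With a small span each local fit sees only a narrow window, so the curve follows the noise and the sine wave closely. With a large span every fit uses almost all of the data and the result approaches a single straight line.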

In [10]:

import math
import random
import numpy as np

In [11]:

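n = 150
# linspace yields exactly n points; arange with a float step may yield n + 1.
x_range = np.linspace(-2 * math.pi, 2 * math.pi, n, endpoint=False)
# Noisy sine wave: uniform noise in [-0.5, 0.5] added to sin(x).
y_range = np.sin(x_range) + np.array([random.uniform(-.5, .5) for _ in range(n)])
df = pd.DataFrame({'x': x_range, 'y': y_range})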

In [12]:

p = ggplot(df, aes(x='x', y='y')) + geom_point(shape=21, fill='yellow', color='#8c564b')
p1 = p + geom_smooth(method='loess', size=1.5, color='#d62728') + ggtitle('default (span = 0.5)')
p2 = p + geom_smooth(method='loess', span=.2, size=1.5, color='#9467bd') + ggtitle('span = 0.2')
p3 = p + geom_smooth(method='loess', span=.7, size=1.5, color='#1f77b4') + ggtitle('span = 0.7')
p4 = p + geom_smooth(method='loess', span=1, size=1.5, color='#2ca02c') + ggtitle('span = 1')

bunch = GGBunch()
bunch.add_plot(p1, 0, 0, 400, 300)
bunch.add_plot(p2, 400, 0, 400, 300)
bunch.add_plot(p3, 0, 300, 400, 300)
bunch.add_plot(p4, 400, 300, 400, 300)
bunch.show()
