#!/usr/bin/env python
# coding: utf-8
# #### New to Plotly?
# Plotly's Python library is free and open source! [Get started](https://plotly.com/python/getting-started/) by dowloading the client and [reading the primer](https://plotly.com/python/getting-started/).
# You can set up Plotly to work in [online](https://plotly.com/python/getting-started/#initialization-for-online-plotting) or [offline](https://plotly.com/python/getting-started/#initialization-for-offline-plotting) mode, or in [jupyter notebooks](https://plotly.com/python/getting-started/#start-plotting-online).
# We also have a quick-reference [cheatsheet](https://images.plot.ly/plotly-documentation/images/python_cheat_sheet.pdf) (new!) to help you get started!
# #### Imports
# The tutorial below imports [NumPy](http://www.numpy.org/), [Pandas](https://plotly.com/pandas/intro-to-pandas-tutorial/), [SciPy](https://www.scipy.org/), and [Statsmodels](http://statsmodels.sourceforge.net/stable/).
# In[1]:
import plotly.plotly as py
import plotly.graph_objs as go
from plotly.tools import FigureFactory as FF
import numpy as np
import pandas as pd
import scipy
import statsmodels
import statsmodels.api as sm
from statsmodels.formula.api import ols
# #### One-Way ANOVA
# An `Analysis of Variance Test` or an `ANOVA` is a generalization of the t-tests to more than 2 groups. Our null hypothesis states that there are equal means in the populations from which the groups of data were sampled. More succinctly:
#
# $$
# \begin{align*}
# \mu_1 = \mu_2 = ... = \mu_n
# \end{align*}
# $$
#
# for $n$ groups of data. Our alternative hypothesis would be that any one of the equivalences in the above equation fail to be met.
# In[2]:
moore = sm.datasets.get_rdataset("Moore", "car", cache=True)
data = moore.data
data = data.rename(columns={"partner.status" :"partner_status"}) # make name pythonic
moore_lm = ols('conformity ~ C(fcategory, Sum)*C(partner_status, Sum)', data=data).fit()
table = sm.stats.anova_lm(moore_lm, typ=2) # Type 2 ANOVA DataFrame
print(table)
# In this ANOVA test, we are dealing with an `F-Statistic` and not a `p-value`. Their connection is integral as they are two ways of expressing the same thing. When we set a `significance level` at the start of our statistical tests (usually 0.05), we are saying that if our variable in question takes on the 5% ends of our distribution, then we can start to make the case that there is evidence against the null, which states that the data belongs to _this particular distribution_.
#
# The F value is the point such that the area of the curve past that point to the tail is just the p-value. Therefore:
#
# $$
# \begin{align*}
# Pr(>F) = p
# \end{align*}
# $$
#
# For more information on the choice of 0.05 for a significance level, check out [this page](http://www.investopedia.com/exam-guide/cfa-level-1/quantitative-methods/hypothesis-testing.asp).
# Let us import some data for our next analysis. This time some data on tooth growth:
# In[3]:
data = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/tooth_growth_csv')
df = data[0:10]
table = FF.create_table(df)
py.iplot(table, filename='tooth-data-sample')
# #### Two-Way ANOVA
# In a `Two-Way ANOVA`, there are two variables to consider. The question is whether our variable in question (tooth length `len`) is related to the two other variables `supp` and `dose` by the equation:
#
# $$
# \begin{align*}
# len = supp + dose + supp \times dose
# \end{align*}
# $$
# In[4]:
formula = 'len ~ C(supp) + C(dose) + C(supp):C(dose)'
model = ols(formula, data).fit()
aov_table = statsmodels.stats.anova.anova_lm(model, typ=2)
print(aov_table)
# In[1]:
from IPython.display import display, HTML
display(HTML(''))
display(HTML(''))
get_ipython().system(' pip install git+https://github.com/plotly/publisher.git --upgrade')
import publisher
publisher.publish(
'python-Anova.ipynb', 'python/anova/', 'Anova | plotly',
'Learn how to perform a one and two way ANOVA test using Python.',
title='Anova in Python | plotly',
name='Anova',
language='python',
page_type='example_index', has_thumbnail='false', display_as='statistics', order=8,
ipynb= '~notebook_demo/108')
# In[ ]: