#!/usr/bin/env python
# coding: utf-8

# ## Linear Regression Exercise

# We will use a couple of the variables we measured, soil moisture content and aggregate stability to see if there is any relationship between them. The data is located here: https://github.com/gabbymyers/516X-Project/blob/master/_data/Class%20Exercise%20Data.xlsx.    
# 
# Soil moisture content refers to the moisture in the soil. We determine this by collecting a sample from each of our 36 plots at the Northeast Iowa Research Farm. The sample is taken back to the lab and the moisture content is determined by weighing the soil before and after drying.    
# 
# Aggregate stability is a measure of how the soil aggregates (groups of soil particles) fall apart when wet. We use the SLAKES app to get the aggregate stability numbers for our samples. The SLAKES app take continuous images of soil aggregates as they are submerged in water and returns aggregate stability values on a range of 0-14. The lower the number, the more stable the aggregates are. Stable aggregates indicate the soil likely has better water infiltration, reducing the risk of runoff. Cover crops are known to increase aggreagate stability, so I would expect our cover crop treatments to have lower aggregate stability values from the SLAKES app.

# We have different 12 different treatments in triplicate on our 36 plots. I have these listed below.    
# 
# * 1C: Corn (on corn/soy rotation) with spring UAN    
# * 1S: Soy (on corn/soy rotation) with spring UAN     
# * 2C: Corn (on corn/soy rotation) with spring manure     
# * 2S: Soy (on corn/soy rotation) with spring manure     
# * 3.1: Continuous Corn     
# * 3.2: Continuous Corn + Interseeded Cover crop (30 inch row)    
# * 4.1: Continuous Corn + Perennial Groundcover    
# * 4.2: Continuous Corn + Interseeded Cover crop (60 inch row)     
# * 5C: Corn (on corn/soy rotation) + cereal rye     
# * 5S: Soy (on corn/soy rotation) + cereal rye     
# * 6C: Corn (on corn/soy rotation) + fall manure     
# * 6S: Soy (on corn/soy rotation) + fall manure    

# ### Research Question:
# 
# Is there any relationship between aggregate stability and moisture content?

# In[45]:


# imports
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
# allow plots to appear directly in the notebook
get_ipython().run_line_magic('matplotlib', 'inline')


# #### Read the soil moisture data into its own data frame.

# In[16]:


soil_moisture = pd.read_excel(___, sheet_name = ______)
soil_moisture.head()


# Get rid of unneeded columns

# In[19]:


soil_moisture = soil_moisture.iloc[____:____ ]
soil_moisture.head()


# See how many weeks of data you have for soil moisture

# In[20]:


max(_____)


# We have 12 weeks of data for the soil moisture content. 

# #### Read the aggregate stability data into its own data frame. 

# In[21]:


aggregate_stability = pd.read_excel(____, sheet_name = _____ )
aggregate_stability.head()


# We need to filter out the extra columns that are contained in the aggregate stability data frame. 

# In[22]:


aggregate_stability = aggregate_stability.iloc[___:___ ]
aggregate_stability.head()


# Make sure that the dates match the soil moisture so we have the same amount of data:

# In[23]:


max(________)


# We only have 10 weeks of data for aggregate stability, so we need to filter out the extra two weeks of soil moisture data. 10 weeks * 36 plots = 360 rows of data. 

# In[24]:


soil_moisture = soil_moisture.iloc[___:____]
max(soil_moisture['Sample Date'])


# In[25]:


soil_moisture


# I wanted to make sure that the last row of data is correct. It is because it is plot 36 on date 10. 

# #### Merging the data frames
# First we need to rename the columns so the are the same across the different data frames. 

# In[27]:


soil_moisture = soil_moisture.rename(columns={'Plot Number': 'Plot'})
soil_moisture = soil_moisture.rename(columns={'Sample Date': 'Date'})


# Now we can merge the two data frames

# In[37]:


merged_data = pd.merge(left=___, right=______, 
                       left_on=[__,___,__,__], right_on=[__,___,__,__])
merged_data.head()


# Count the number of na values in the data frame

# In[38]:


# In[39]:


merged_data.shape


# There are four na values in the aggregate stability column. Drop those rows:

# In[40]:


merged_data = merged_data.____()
merged_data.shape


# #### Explore the data

# ###### Make box plots of the data

# In[106]:


ax = sns.boxplot(x = ________, y = _________, data = merged_data)
plt.title('Moisture Content by Treatment')


# In[107]:


ax = sns.boxplot(x = __________, y = ________, data = merged_data)
plt.title('Aggregate Stability by Treatment')


# ###### How many unstable aggregate stability measurements do we have and which treatments have the most?
# 
# Unstable is defined as values greater than or equal to 7. 
# 
# Start by making a data frame of the unstable measurements:

# In[108]:


unstable_df = merged_data[____]>=___]
unstable_df.head()


# In[109]:


unstable_df = unstable_df.reset_index(drop=True)
unstable_df.head()


# In[110]:


len(____)


# There are 17 unstable measurements. 

# See which treatments the unstable measurements belong to, using "value_counts":

# In[111]:


# The treatment with the most unstable measurements is 3.2, followed by 3.1.

# ###### How are the unstable measurements distributed among the blocks?

# In[112]:


# The block with the most unstable measurements is 1. 

# ### Regression

# Plot moisture content vs. aggregate stability to see if you can see any relationship.

# In[58]:


plt.scatter()
plt.title('MC vs. AS')
plt.xlabel('Soil Moisture Content')
plt.ylabel('Aggregate Stability')


# In[61]:


X = merged_data['Moisture Content']
Y = merged_data['Aggregate Stability']
print(X.shape)
print(Y.shape)


# Use scipy.stats.linregress (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.linregress.html)

# In[67]:


res = sp.stats.___________
print(f"R-squared: {res.rvalue**2:.6f}")
print(f"Slope: {res.slope:.6f}")
print(f"Intercept: {res.intercept:.6f}")


# In[77]:


plt.plot(X, Y, 'o', label='original data')
plt.plot(X, res.intercept + res.slope*X, 'r', label='fitted line')
plt.title('MC vs. AS')
plt.xlabel('Soil Moisture Content')
plt.ylabel('Aggregate Stability')
plt.legend()
plt.show()
plt.savefig('MC_AS_Regress.jpg', bbox_inches='tight')


# #### Using the method we learned in class: 

# In[72]:


X = X[:, np.newaxis]
print(X.shape)


# In[73]:


model = LinearRegression(fit_intercept=True)
model
model.fit(____,________)


# In[74]:


model.coef_


# In[75]:


model.intercept_


# ### Equation: Y = 0.9466X + 2.57073

# ### R^2 value: 0.001122

# The regression model is not powerful at all, with a very low R^2 value. 

# In[ ]: