#!/usr/bin/env python
# coding: utf-8

# # Crop to Table: A Data Wrangling Challenge for Agricultural Shipments  
# 
# ### Description  
# You're a data analyst for a large agricultural company that produces and distributes fresh produce to supermarkets across the country. The company has recently expanded its operations and needs your help to clean and prepare its dataset of produce shipments for analysis. Your task is to wrangle the dataset to make it suitable for analysis and visualization.
# 
# ### Tasks  
# - **Data Transformation:** Convert the shipment_date column to a standard datetime format and extract the month and year of each shipment.
# - **Data Categorization:** Create a new column produce_category based on the produce_type, categorizing produce into fruits (Apples, Bananas, Oranges) and vegetables (Carrots, Lettuce).
# - **Data Enrichment:** Create a new column season based on the month of the shipment, categorizing shipments into spring (March to May), summer (June to August), fall (September to November), and winter (December to February).

# In[1]:


# import libraries
import pandas as pd
import numpy as np
import sys

print('Python version ' + sys.version)
print('Pandas version ' + pd.__version__)
print('Numpy version ' + np.__version__)


# # The Data  
# 
# The data represents a sample of shipments from farms to supermarkets across the country, with varying types of produce and shipment characteristics.  
# 
# ### Columns:  
# - **produce_type:** The type of produce (Apples, Bananas, Carrots, Oranges, Lettuce)
# - **shipment_date:** The date the shipment was sent, with inconsistent formatting
# - **farm_id:** The ID of the farm that produced the shipment
# - **supermarket_id:** The ID of the supermarket that received the shipment
# - **shipment_weight:** The weight of the shipment in pounds
# - **revenue:** The revenue generated by the shipment in dollars
# - **shipping_time:** The time it took for the shipment to arrive in days
# - **delayed:** A boolean indicating whether the shipment was delayed or lost

# In[2]:


# set the seed
np.random.seed(0)

# generate the data
data = {
    'produce_type': np.random.choice(['Apples', 'Bananas', 'Carrots', 'Oranges', 'Lettuce'], size=10000),
    'shipment_date': pd.date_range('2022-01-01', periods=10000, freq='D').astype(str) + np.random.choice(['', ' ', '  ', '\t'], size=10000),
    'farm_id': np.random.choice(['farm1', 'farm2', 'farm3', 'farm4', 'farm5'], size=10000),
    'supermarket_id': np.random.choice(['market1', 'market2', 'market3', 'market4', 'market5'], size=10000),
    'shipment_weight': np.random.uniform(100, 1000, size=10000),
    'revenue': np.random.uniform(100, 1000, size=10000),
    'shipping_time': np.random.uniform(1, 10, size=10000),
    'delayed': np.random.choice([True, False], size=10000)
}

df = pd.DataFrame(data)
df.head()


# Data types inspection:    
# 
# Looking at the output below, we can see the only column that does not have the correct data type is the shipment_date column. If you take a close look at how we created the data, we added extra spaces to similate bad data in this column.

# In[3]:


df.info()


# # Data Transformation:  
# 
# Convert the shipment_date column to a standard datetime format and extract the month and year of each shipment.  
# 
# ---
# 
# Let us use the `pd.to_datetime` method to clean up this column.
# 
# Since we have spaces added to the string, we first need to remove those using the `strip()` method before attempting to convert them to dates.

# In[4]:


df['shipment_date'] = pd.to_datetime(df['shipment_date'].str.strip())

# we now have a date object
df.info()


# Here are the steps to create a new column named "Year_Month".
# 
# - pull out the year and convert to a string
# - pull out the month and convert to a string
# - add a dash "-" in between the two strings

# In[5]:


# new column composed of the year and the month
df['Year_Month'] = df['shipment_date'].dt.year.astype(str) + '-' + df['shipment_date'].dt.month.astype(str)

df.head()


# # Data Categorization: 
# 
# Create a new column produce_category based on the produce_type, categorizing produce into fruits (Apples, Bananas, Oranges) and vegetables (Carrots, Lettuce).

# `**dict.fromkeys()` is a Python expression that creates a new dictionary from an iterable (like a list or tuple) and assigns a default value to each key.  
#   
# For this task, I am going to create a Python dictionary that will categorize the "produce_type" column.  
# 
# **Let us start by explaining how I created the dictionary:**    
# 
# - `dict.fromkeys()` is a method that creates a new dictionary from an iterable.
# - The ** operator is used to unpack the iterable and pass it as arguments to the `fromkeys()` method.
# 
# For example, if you have a list:  
# 
# ``` Python
# my_list = ['apple', 'banana', 'cherry']
# ```
# 
# Using `**dict.fromkeys()`:
# 
# ``` Python
# my_dict = **dict.fromkeys(my_list, 'fruit')
# ```
# 
# Creates a new dictionary:  
# ``` Python
# my_dict = {'apple': 'fruit', 'banana': 'fruit', 'cherry': 'fruit'}
# ```
# 
# In the context of the code below, `**dict.fromkeys(['Carrots', 'Lettuce'], 'vegetables')` creates a dictionary with each value in ['Carrots', 'Lettuce'] as a key and 'vegetables' as the corresponding value.

# In[6]:


# map the values
mapping = {**dict.fromkeys(['Apples', 'Bananas', 'Oranges'], 'fruits'), 
           **dict.fromkeys(['Carrots', 'Lettuce'], 'vegetables')}

# apply the map to the produce_type column
df['produce_category'] = df['produce_type'].map(mapping)
df.head()


# # Data Enrichment:  
# 
# Create a new column season based on the month of the shipment, categorizing shipments into spring (March to May), summer (June to August), fall (September to November), and winter (December to February).

# For this task, we will use the same strategy of mapping a dictionary to the dataframe.

# In[7]:


# the trick is to use numbers instead of dates
season_map = {**dict.fromkeys([12,1,2], 'winter'), 
              **dict.fromkeys(range(3, 6), 'spring'), 
              **dict.fromkeys(range(6, 9), 'summer'), 
              **dict.fromkeys(range(9, 12), 'fall')}

# apply map
df['season'] = df['shipment_date'].dt.month.map(season_map)
df.head()


# # Summary  
# 
# In this tutorial, you learned how to clean and transform data, create new columns based on existing data, and map values to categories using dictionaries and Pandas functions. You now have a solid understanding of data wrangling and analysis techniques, which will serve as a great foundation for your future work with data. 
# 
# ### Key Takeaways:  
# - **Data Cleaning:** Use `pd.to_datetime()` to convert date columns to a standard format.
# - **Data Transformation:** Use `dt.year` and `dt.month` to extract year and month from a datetime column.
# - **Data Categorization:** Use `dict.fromkeys()` to create a dictionary for mapping values and `map()` to apply the mapping to a column.
# - **Data Enrichment:** Use `dt.month` to extract the month and map it to a season using a dictionary.

# <p class="text-muted">This tutorial was created by <a href="https://www.hedaro.com" target="_blank"><strong>HEDARO</strong></a></p>