Crop to Table: A Data Wrangling Challenge for Agricultural Shipments¶

Description¶

You're a data analyst for a large agricultural company that produces and distributes fresh produce to supermarkets across the country. The company has recently expanded its operations and needs your help to clean and prepare its dataset of produce shipments for analysis. Your task is to wrangle the dataset to make it suitable for analysis and visualization.

Tasks¶

Data Transformation: Convert the shipment_date column to a standard datetime format and extract the month and year of each shipment.
Data Categorization: Create a new column produce_category based on the produce_type, categorizing produce into fruits (Apples, Bananas, Oranges) and vegetables (Carrots, Lettuce).
Data Enrichment: Create a new column season based on the month of the shipment, categorizing shipments into spring (March to May), summer (June to August), fall (September to November), and winter (December to February).

In [1]:

# import libraries
import pandas as pd
import numpy as np
import sys

print('Python version ' + sys.version)
print('Pandas version ' + pd.__version__)
print('Numpy version ' + np.__version__)

Python version 3.11.7 | packaged by Anaconda, Inc. | (main, Dec 15 2023, 18:05:47) [MSC v.1916 64 bit (AMD64)]
Pandas version 2.2.1
Numpy version 1.26.4

The Data¶

The data represents a sample of shipments from farms to supermarkets across the country, with varying types of produce and shipment characteristics.

Columns:¶

produce_type: The type of produce (Apples, Bananas, Carrots, Oranges, Lettuce)
shipment_date: The date the shipment was sent, with inconsistent formatting
farm_id: The ID of the farm that produced the shipment
supermarket_id: The ID of the supermarket that received the shipment
shipment_weight: The weight of the shipment in pounds
revenue: The revenue generated by the shipment in dollars
shipping_time: The time it took for the shipment to arrive in days
delayed: A boolean indicating whether the shipment was delayed or lost

In [2]:

# set the seed
np.random.seed(0)

# generate the data
data = {
    'produce_type': np.random.choice(['Apples', 'Bananas', 'Carrots', 'Oranges', 'Lettuce'], size=10000),
    'shipment_date': pd.date_range('2022-01-01', periods=10000, freq='D').astype(str) + np.random.choice(['', ' ', '  ', '\t'], size=10000),
    'farm_id': np.random.choice(['farm1', 'farm2', 'farm3', 'farm4', 'farm5'], size=10000),
    'supermarket_id': np.random.choice(['market1', 'market2', 'market3', 'market4', 'market5'], size=10000),
    'shipment_weight': np.random.uniform(100, 1000, size=10000),
    'revenue': np.random.uniform(100, 1000, size=10000),
    'shipping_time': np.random.uniform(1, 10, size=10000),
    'delayed': np.random.choice([True, False], size=10000)
}

df = pd.DataFrame(data)
df.head()

Out[2]:

	produce_type	shipment_date	farm_id	supermarket_id	shipment_weight	revenue	shipping_time	delayed
0	Lettuce	2022-01-01	farm1	market3	522.434901	253.916357	4.193670	True
1	Apples	2022-01-02	farm3	market2	129.924851	359.323176	2.133125	True
2	Oranges	2022-01-03	farm4	market4	148.890313	509.488722	6.331273	True
3	Oranges	2022-01-04	farm3	market1	260.623032	728.175679	2.599539	False
4	Oranges	2022-01-05	farm1	market3	878.212644	533.657207	5.697890	True

Data types inspection:

Looking at the output below, we can see the only column that does not have the correct data type is the shipment_date column. If you take a close look at how we created the data, we added extra spaces to similate bad data in this column.

In [3]:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   produce_type     10000 non-null  object 
 1   shipment_date    10000 non-null  object 
 2   farm_id          10000 non-null  object 
 3   supermarket_id   10000 non-null  object 
 4   shipment_weight  10000 non-null  float64
 5   revenue          10000 non-null  float64
 6   shipping_time    10000 non-null  float64
 7   delayed          10000 non-null  bool   
dtypes: bool(1), float64(3), object(4)
memory usage: 556.8+ KB

Data Transformation:¶

Convert the shipment_date column to a standard datetime format and extract the month and year of each shipment.

Let us use the pd.to_datetime method to clean up this column.

Since we have spaces added to the string, we first need to remove those using the strip() method before attempting to convert them to dates.

In [4]:

df['shipment_date'] = pd.to_datetime(df['shipment_date'].str.strip())

# we now have a date object
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   produce_type     10000 non-null  object        
 1   shipment_date    10000 non-null  datetime64[ns]
 2   farm_id          10000 non-null  object        
 3   supermarket_id   10000 non-null  object        
 4   shipment_weight  10000 non-null  float64       
 5   revenue          10000 non-null  float64       
 6   shipping_time    10000 non-null  float64       
 7   delayed          10000 non-null  bool          
dtypes: bool(1), datetime64[ns](1), float64(3), object(3)
memory usage: 556.8+ KB

Here are the steps to create a new column named "Year_Month".

pull out the year and convert to a string
pull out the month and convert to a string
add a dash "-" in between the two strings

In [5]:

# new column composed of the year and the month
df['Year_Month'] = df['shipment_date'].dt.year.astype(str) + '-' + df['shipment_date'].dt.month.astype(str)

df.head()

Out[5]:

	produce_type	shipment_date	farm_id	supermarket_id	shipment_weight	revenue	shipping_time	delayed	Year_Month
0	Lettuce	2022-01-01	farm1	market3	522.434901	253.916357	4.193670	True	2022-1
1	Apples	2022-01-02	farm3	market2	129.924851	359.323176	2.133125	True	2022-1
2	Oranges	2022-01-03	farm4	market4	148.890313	509.488722	6.331273	True	2022-1
3	Oranges	2022-01-04	farm3	market1	260.623032	728.175679	2.599539	False	2022-1
4	Oranges	2022-01-05	farm1	market3	878.212644	533.657207	5.697890	True	2022-1

Data Categorization:¶

Create a new column produce_category based on the produce_type, categorizing produce into fruits (Apples, Bananas, Oranges) and vegetables (Carrots, Lettuce).

**dict.fromkeys() is a Python expression that creates a new dictionary from an iterable (like a list or tuple) and assigns a default value to each key.

For this task, I am going to create a Python dictionary that will categorize the "produce_type" column.

Let us start by explaining how I created the dictionary:

dict.fromkeys() is a method that creates a new dictionary from an iterable.
The ** operator is used to unpack the iterable and pass it as arguments to the fromkeys() method.

For example, if you have a list:

my_list = ['apple', 'banana', 'cherry']

Using **dict.fromkeys():

my_dict = **dict.fromkeys(my_list, 'fruit')

Creates a new dictionary:

my_dict = {'apple': 'fruit', 'banana': 'fruit', 'cherry': 'fruit'}

In the context of the code below, **dict.fromkeys(['Carrots', 'Lettuce'], 'vegetables') creates a dictionary with each value in ['Carrots', 'Lettuce'] as a key and 'vegetables' as the corresponding value.

In [6]:

# map the values
mapping = {**dict.fromkeys(['Apples', 'Bananas', 'Oranges'], 'fruits'), 
           **dict.fromkeys(['Carrots', 'Lettuce'], 'vegetables')}

# apply the map to the produce_type column
df['produce_category'] = df['produce_type'].map(mapping)
df.head()

Out[6]:

	produce_type	shipment_date	farm_id	supermarket_id	shipment_weight	revenue	shipping_time	delayed	Year_Month	produce_category
0	Lettuce	2022-01-01	farm1	market3	522.434901	253.916357	4.193670	True	2022-1	vegetables
1	Apples	2022-01-02	farm3	market2	129.924851	359.323176	2.133125	True	2022-1	fruits
2	Oranges	2022-01-03	farm4	market4	148.890313	509.488722	6.331273	True	2022-1	fruits
3	Oranges	2022-01-04	farm3	market1	260.623032	728.175679	2.599539	False	2022-1	fruits
4	Oranges	2022-01-05	farm1	market3	878.212644	533.657207	5.697890	True	2022-1	fruits

Data Enrichment:¶

Create a new column season based on the month of the shipment, categorizing shipments into spring (March to May), summer (June to August), fall (September to November), and winter (December to February).

For this task, we will use the same strategy of mapping a dictionary to the dataframe.

In [7]:

# the trick is to use numbers instead of dates
season_map = {**dict.fromkeys([12,1,2], 'winter'), 
              **dict.fromkeys(range(3, 6), 'spring'), 
              **dict.fromkeys(range(6, 9), 'summer'), 
              **dict.fromkeys(range(9, 12), 'fall')}

# apply map
df['season'] = df['shipment_date'].dt.month.map(season_map)
df.head()

Out[7]:

	produce_type	shipment_date	farm_id	supermarket_id	shipment_weight	revenue	shipping_time	delayed	Year_Month	produce_category	season
0	Lettuce	2022-01-01	farm1	market3	522.434901	253.916357	4.193670	True	2022-1	vegetables	winter
1	Apples	2022-01-02	farm3	market2	129.924851	359.323176	2.133125	True	2022-1	fruits	winter
2	Oranges	2022-01-03	farm4	market4	148.890313	509.488722	6.331273	True	2022-1	fruits	winter
3	Oranges	2022-01-04	farm3	market1	260.623032	728.175679	2.599539	False	2022-1	fruits	winter
4	Oranges	2022-01-05	farm1	market3	878.212644	533.657207	5.697890	True	2022-1	fruits	winter

Summary¶

In this tutorial, you learned how to clean and transform data, create new columns based on existing data, and map values to categories using dictionaries and Pandas functions. You now have a solid understanding of data wrangling and analysis techniques, which will serve as a great foundation for your future work with data.

Key Takeaways:¶

Data Cleaning: Use pd.to_datetime() to convert date columns to a standard format.
Data Transformation: Use dt.year and dt.month to extract year and month from a datetime column.
Data Categorization: Use dict.fromkeys() to create a dictionary for mapping values and map() to apply the mapping to a column.
Data Enrichment: Use dt.month to extract the month and map it to a season using a dictionary.

This tutorial was created by HEDARO