You're a data analyst for a large agricultural company that produces and distributes fresh produce to supermarkets across the country. The company has recently expanded its operations and needs your help to clean and prepare its dataset of produce shipments for analysis. Your task is to wrangle the dataset to make it suitable for analysis and visualization.
# import libraries
import pandas as pd
import numpy as np
import sys
print('Python version ' + sys.version)
print('Pandas version ' + pd.__version__)
print('Numpy version ' + np.__version__)
Python version 3.11.7 | packaged by Anaconda, Inc. | (main, Dec 15 2023, 18:05:47) [MSC v.1916 64 bit (AMD64)] Pandas version 2.2.1 Numpy version 1.26.4
The data represents a sample of shipments from farms to supermarkets across the country, with varying types of produce and shipment characteristics.
# set the seed
np.random.seed(0)
# generate the data
data = {
'produce_type': np.random.choice(['Apples', 'Bananas', 'Carrots', 'Oranges', 'Lettuce'], size=10000),
'shipment_date': pd.date_range('2022-01-01', periods=10000, freq='D').astype(str) + np.random.choice(['', ' ', ' ', '\t'], size=10000),
'farm_id': np.random.choice(['farm1', 'farm2', 'farm3', 'farm4', 'farm5'], size=10000),
'supermarket_id': np.random.choice(['market1', 'market2', 'market3', 'market4', 'market5'], size=10000),
'shipment_weight': np.random.uniform(100, 1000, size=10000),
'revenue': np.random.uniform(100, 1000, size=10000),
'shipping_time': np.random.uniform(1, 10, size=10000),
'delayed': np.random.choice([True, False], size=10000)
}
df = pd.DataFrame(data)
df.head()
produce_type | shipment_date | farm_id | supermarket_id | shipment_weight | revenue | shipping_time | delayed | |
---|---|---|---|---|---|---|---|---|
0 | Lettuce | 2022-01-01 | farm1 | market3 | 522.434901 | 253.916357 | 4.193670 | True |
1 | Apples | 2022-01-02 | farm3 | market2 | 129.924851 | 359.323176 | 2.133125 | True |
2 | Oranges | 2022-01-03 | farm4 | market4 | 148.890313 | 509.488722 | 6.331273 | True |
3 | Oranges | 2022-01-04 | farm3 | market1 | 260.623032 | 728.175679 | 2.599539 | False |
4 | Oranges | 2022-01-05 | farm1 | market3 | 878.212644 | 533.657207 | 5.697890 | True |
Data types inspection:
Looking at the output below, we can see the only column that does not have the correct data type is the shipment_date column. If you take a close look at how we created the data, we added extra spaces to similate bad data in this column.
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10000 entries, 0 to 9999 Data columns (total 8 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 produce_type 10000 non-null object 1 shipment_date 10000 non-null object 2 farm_id 10000 non-null object 3 supermarket_id 10000 non-null object 4 shipment_weight 10000 non-null float64 5 revenue 10000 non-null float64 6 shipping_time 10000 non-null float64 7 delayed 10000 non-null bool dtypes: bool(1), float64(3), object(4) memory usage: 556.8+ KB
Convert the shipment_date column to a standard datetime format and extract the month and year of each shipment.
Let us use the pd.to_datetime
method to clean up this column.
Since we have spaces added to the string, we first need to remove those using the strip()
method before attempting to convert them to dates.
df['shipment_date'] = pd.to_datetime(df['shipment_date'].str.strip())
# we now have a date object
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10000 entries, 0 to 9999 Data columns (total 8 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 produce_type 10000 non-null object 1 shipment_date 10000 non-null datetime64[ns] 2 farm_id 10000 non-null object 3 supermarket_id 10000 non-null object 4 shipment_weight 10000 non-null float64 5 revenue 10000 non-null float64 6 shipping_time 10000 non-null float64 7 delayed 10000 non-null bool dtypes: bool(1), datetime64[ns](1), float64(3), object(3) memory usage: 556.8+ KB
Here are the steps to create a new column named "Year_Month".
# new column composed of the year and the month
df['Year_Month'] = df['shipment_date'].dt.year.astype(str) + '-' + df['shipment_date'].dt.month.astype(str)
df.head()
produce_type | shipment_date | farm_id | supermarket_id | shipment_weight | revenue | shipping_time | delayed | Year_Month | |
---|---|---|---|---|---|---|---|---|---|
0 | Lettuce | 2022-01-01 | farm1 | market3 | 522.434901 | 253.916357 | 4.193670 | True | 2022-1 |
1 | Apples | 2022-01-02 | farm3 | market2 | 129.924851 | 359.323176 | 2.133125 | True | 2022-1 |
2 | Oranges | 2022-01-03 | farm4 | market4 | 148.890313 | 509.488722 | 6.331273 | True | 2022-1 |
3 | Oranges | 2022-01-04 | farm3 | market1 | 260.623032 | 728.175679 | 2.599539 | False | 2022-1 |
4 | Oranges | 2022-01-05 | farm1 | market3 | 878.212644 | 533.657207 | 5.697890 | True | 2022-1 |
Create a new column produce_category based on the produce_type, categorizing produce into fruits (Apples, Bananas, Oranges) and vegetables (Carrots, Lettuce).
**dict.fromkeys()
is a Python expression that creates a new dictionary from an iterable (like a list or tuple) and assigns a default value to each key.
For this task, I am going to create a Python dictionary that will categorize the "produce_type" column.
Let us start by explaining how I created the dictionary:
dict.fromkeys()
is a method that creates a new dictionary from an iterable.fromkeys()
method.For example, if you have a list:
my_list = ['apple', 'banana', 'cherry']
Using **dict.fromkeys()
:
my_dict = **dict.fromkeys(my_list, 'fruit')
Creates a new dictionary:
my_dict = {'apple': 'fruit', 'banana': 'fruit', 'cherry': 'fruit'}
In the context of the code below, **dict.fromkeys(['Carrots', 'Lettuce'], 'vegetables')
creates a dictionary with each value in ['Carrots', 'Lettuce'] as a key and 'vegetables' as the corresponding value.
# map the values
mapping = {**dict.fromkeys(['Apples', 'Bananas', 'Oranges'], 'fruits'),
**dict.fromkeys(['Carrots', 'Lettuce'], 'vegetables')}
# apply the map to the produce_type column
df['produce_category'] = df['produce_type'].map(mapping)
df.head()
produce_type | shipment_date | farm_id | supermarket_id | shipment_weight | revenue | shipping_time | delayed | Year_Month | produce_category | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Lettuce | 2022-01-01 | farm1 | market3 | 522.434901 | 253.916357 | 4.193670 | True | 2022-1 | vegetables |
1 | Apples | 2022-01-02 | farm3 | market2 | 129.924851 | 359.323176 | 2.133125 | True | 2022-1 | fruits |
2 | Oranges | 2022-01-03 | farm4 | market4 | 148.890313 | 509.488722 | 6.331273 | True | 2022-1 | fruits |
3 | Oranges | 2022-01-04 | farm3 | market1 | 260.623032 | 728.175679 | 2.599539 | False | 2022-1 | fruits |
4 | Oranges | 2022-01-05 | farm1 | market3 | 878.212644 | 533.657207 | 5.697890 | True | 2022-1 | fruits |
Create a new column season based on the month of the shipment, categorizing shipments into spring (March to May), summer (June to August), fall (September to November), and winter (December to February).
For this task, we will use the same strategy of mapping a dictionary to the dataframe.
# the trick is to use numbers instead of dates
season_map = {**dict.fromkeys([12,1,2], 'winter'),
**dict.fromkeys(range(3, 6), 'spring'),
**dict.fromkeys(range(6, 9), 'summer'),
**dict.fromkeys(range(9, 12), 'fall')}
# apply map
df['season'] = df['shipment_date'].dt.month.map(season_map)
df.head()
produce_type | shipment_date | farm_id | supermarket_id | shipment_weight | revenue | shipping_time | delayed | Year_Month | produce_category | season | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Lettuce | 2022-01-01 | farm1 | market3 | 522.434901 | 253.916357 | 4.193670 | True | 2022-1 | vegetables | winter |
1 | Apples | 2022-01-02 | farm3 | market2 | 129.924851 | 359.323176 | 2.133125 | True | 2022-1 | fruits | winter |
2 | Oranges | 2022-01-03 | farm4 | market4 | 148.890313 | 509.488722 | 6.331273 | True | 2022-1 | fruits | winter |
3 | Oranges | 2022-01-04 | farm3 | market1 | 260.623032 | 728.175679 | 2.599539 | False | 2022-1 | fruits | winter |
4 | Oranges | 2022-01-05 | farm1 | market3 | 878.212644 | 533.657207 | 5.697890 | True | 2022-1 | fruits | winter |
In this tutorial, you learned how to clean and transform data, create new columns based on existing data, and map values to categories using dictionaries and Pandas functions. You now have a solid understanding of data wrangling and analysis techniques, which will serve as a great foundation for your future work with data.
pd.to_datetime()
to convert date columns to a standard format.dt.year
and dt.month
to extract year and month from a datetime column.dict.fromkeys()
to create a dictionary for mapping values and map()
to apply the mapping to a column.dt.month
to extract the month and map it to a season using a dictionary.This tutorial was created by HEDARO