In 2020, "720 to 811 million people worldwide were suffering from hunger," and even more, about 2.4 billion people, were moderately or severely food insecure (UN SDG Indicators). Our group decided to address this in line with the UN Sustainable Development Goals, particularly Goal 2: Zero Hunger, by building a regression model that lets farmers predict crop yield from satellite data and a small amount of ground-truth data. Vietnam is a lower-middle-income country with limited resources and a major exporter of rice, so we chose to predict rice crop yield in Vietnam.
For reference, among the studies we reviewed, the highest reported accuracy (R²) for crop yield prediction is 0.86 (Jeong et al., 2021).
pip install xgboost catboost
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')
# Visualization
import ipyleaflet
import matplotlib.pyplot as plt
from IPython.display import Image
import seaborn as sns
# Data Science
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from scipy.interpolate import interp1d
from datetime import datetime, timedelta
# Machine Learning
import graphviz
import xgboost as xgb
from xgboost import XGBRegressor
from catboost import CatBoostRegressor
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, VotingRegressor, AdaBoostRegressor
from sklearn.model_selection import learning_curve
from sklearn.feature_selection import RFECV
# Planetary Computer Tools
import pystac
import pystac_client
import odc
from pystac_client import Client
from pystac.extensions.eo import EOExtension as eo
from odc.stac import stac_load
import planetary_computer as pc
# Import common GIS tools (pandas, numpy and matplotlib are already imported above)
import xarray as xr
import rasterio.features
import rioxarray as rio
import xrspatial.multispectral as ms
from contextlib import redirect_stdout
import os
import sys
# Pass your API key here
pc.settings.set_subscription_key('71113eb526914deba51505c8b5463b88')
locations = pd.read_csv("Crop_Yield_Data.csv") # Ground Truth Data
weather_data = pd.read_csv('Chau Phu, Vietnam 2021-10-01 to 2022-09-01.csv')
weather_data1 = pd.read_csv('Chau Thanh, Vietnam 2021-10-01 to 2022-09-01.csv')
weather_data2 = pd.read_csv('Thoai Son 2021-10-01 to 2022-09-01.csv')
locations.head()
| | District | Latitude | Longitude | Season(SA = Summer Autumn, WS = Winter Spring) | Rice Crop Intensity(D=Double, T=Triple) | Date of Harvest | Field size (ha) | Rice Yield (kg/ha) |
|---|---|---|---|---|---|---|---|---|
| 0 | Chau_Phu | 10.510542 | 105.248554 | SA | T | 15-07-2022 | 3.40 | 5500 |
| 1 | Chau_Phu | 10.509150 | 105.265098 | SA | T | 15-07-2022 | 2.43 | 6000 |
| 2 | Chau_Phu | 10.467721 | 105.192464 | SA | D | 15-07-2022 | 1.95 | 6400 |
| 3 | Chau_Phu | 10.494453 | 105.241281 | SA | T | 15-07-2022 | 4.30 | 6000 |
| 4 | Chau_Phu | 10.535058 | 105.252744 | SA | D | 14-07-2022 | 3.30 | 6400 |
locations.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 557 entries, 0 to 556
Data columns (total 8 columns):
 #   Column                                          Non-Null Count  Dtype
---  ------                                          --------------  -----
 0   District                                        557 non-null    object
 1   Latitude                                        557 non-null    float64
 2   Longitude                                       557 non-null    float64
 3   Season(SA = Summer Autumn, WS = Winter Spring)  557 non-null    object
 4   Rice Crop Intensity(D=Double, T=Triple)         557 non-null    object
 5   Date of Harvest                                 557 non-null    object
 6   Field size (ha)                                 557 non-null    float64
 7   Rice Yield (kg/ha)                              557 non-null    int64
dtypes: float64(3), int64(1), object(4)
memory usage: 34.9+ KB
weather_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 336 entries, 0 to 335
Data columns (total 33 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   name              336 non-null    object
 1   datetime          336 non-null    object
 2   tempmax           336 non-null    float64
 3   tempmin           336 non-null    float64
 4   temp              336 non-null    float64
 5   feelslikemax      336 non-null    float64
 6   feelslikemin      336 non-null    float64
 7   feelslike         336 non-null    float64
 8   dew               336 non-null    float64
 9   humidity          336 non-null    float64
 10  precip            336 non-null    float64
 11  precipprob        336 non-null    int64
 12  precipcover       336 non-null    float64
 13  preciptype        274 non-null    object
 14  snow              235 non-null    float64
 15  snowdepth         235 non-null    float64
 16  windgust          238 non-null    float64
 17  windspeed         336 non-null    float64
 18  winddir           336 non-null    float64
 19  sealevelpressure  336 non-null    float64
 20  cloudcover        336 non-null    float64
 21  visibility        336 non-null    float64
 22  solarradiation    336 non-null    float64
 23  solarenergy       336 non-null    float64
 24  uvindex           336 non-null    int64
 25  severerisk        235 non-null    float64
 26  sunrise           336 non-null    object
 27  sunset            336 non-null    object
 28  moonphase         336 non-null    float64
 29  conditions        336 non-null    object
 30  description       336 non-null    object
 31  icon              336 non-null    object
 32  stations          336 non-null    object
dtypes: float64(22), int64(2), object(9)
memory usage: 86.8+ KB
Rice crop phenology examines growth stages and timings in rice plants. Phenology curves track development parameters like leaf area index, biomass, and vegetation indices. These curves aid in monitoring crops, predicting yields, and planning agricultural practices in Vietnam. By combining weather data with phenology curves, we can use growth stages as predictors in a data-driven ML model.
First, we define our area of interest using the centroid's latitude and longitude coordinates. We then establish the size of the surrounding bounding box (in degrees) and set a time window for a typical rice growing season.
Bounding boxes are rectangular regions outlining areas of interest in remote sensing imagery, represented by (xmin, ymin, xmax, ymax) coordinates. Landsat's 30-meter and Sentinel-1's 10-meter resolutions balance data volume, swath widths, revisit times, and applications like land coverage and agriculture.
# Sample Rice Crop Field in An Giang, Vietnam
lat_long = (10.4391, 105.3338) # Lat-Lon centroid location
box_size_deg1 = 0.10 # Surrounding box in degrees
# Calculate the Lat-Lon bounding box region
min_lon = lat_long[1]-box_size_deg1/2
min_lat = lat_long[0]-box_size_deg1/2
max_lon = lat_long[1]+box_size_deg1/2
max_lat = lat_long[0]+box_size_deg1/2
bounds = (min_lon, min_lat, max_lon, max_lat)
# Define the time window
time_window="2021-12-01/2022-04-30"
# Define the pixel resolution for the final product
# Define the scale according to our selected crs, so we will use degrees
resolution1 = 30 # meters per pixel
scale1 = resolution1 / 111320.0 # degrees per pixel for CRS:4326
Using the pystac_client, we can search the Planetary Computer's STAC catalog for items matching our query parameters. The result is the number of scenes matching our search criteria that touch our area of interest. Some of these may be partial scenes and may contain clouds.
stac = pystac_client.Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")
search = stac.search(
collections=["landsat-c2-l2"],
bbox=bounds,
datetime=time_window,
query={"platform": {"in": ["landsat-8", "landsat-9"]},},
)
items = list(search.get_all_items())
print('This is the number of scenes that touch our region:',len(items))
This is the number of scenes that touch our region: 31
xx = stac_load(
items,
bands=["red", "green", "blue", "nir08", "qa_pixel"],
crs="EPSG:4326", # Latitude-Longitude
resolution=scale1, # Degrees
chunks={"x": 2048, "y": 2048},
patch_url=pc.sign,
bbox=bounds
)
# Apply scaling and offsets for Landsat Collection-2 (reference below) to the spectral bands ONLY
# https://planetarycomputer.microsoft.com/dataset/landsat-c2-l2
xx['red'] = (xx['red']*0.0000275)-0.2
xx['green'] = (xx['green']*0.0000275)-0.2
xx['blue'] = (xx['blue']*0.0000275)-0.2
xx['nir08'] = (xx['nir08']*0.0000275)-0.2
# View the dimensions of our XARRAY and the variables
display(xx)
<xarray.Dataset>
Dimensions:      (latitude: 372, longitude: 372, time: 31)
Coordinates:
  * latitude     (latitude) float64 10.49 10.49 10.49 ... 10.39 10.39 10.39
  * longitude    (longitude) float64 105.3 105.3 105.3 ... 105.4 105.4 105.4
    spatial_ref  int32 4326
  * time         (time) datetime64[ns] 2021-12-01T03:20:47.203656 ... 2022-04...
Data variables:
    red          (time, latitude, longitude) float64 dask.array<chunksize=(1, 372, 372), meta=np.ndarray>
    green        (time, latitude, longitude) float64 dask.array<chunksize=(1, 372, 372), meta=np.ndarray>
    blue         (time, latitude, longitude) float64 dask.array<chunksize=(1, 372, 372), meta=np.ndarray>
    nir08        (time, latitude, longitude) float64 dask.array<chunksize=(1, 372, 372), meta=np.ndarray>
    qa_pixel     (time, latitude, longitude) uint16 dask.array<chunksize=(1, 372, 372), meta=np.ndarray>
Data is available for Landsat-8 from April 2013 onwards and for Landsat-9 from February 2022 onwards. Typically, our region is viewed every 8 days, with additional scenes due to overlaps. Over a 5-month example, 31 time slices touch our region, but only 7 are very clear, with several others being partially cloudy.
plot_xx = xx[["red","green","blue"]].to_array()
plot = plot_xx.plot.imshow(col='time', col_wrap=4, robust=True, vmin=0, vmax=0.3)
plot.fig.suptitle("Fig. 1. Regions Captured from Landsat", fontsize=15, fontweight='bold')
plot.fig.subplots_adjust(top=0.96)
plt.show()
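Looking at the thumbnails above, only a handful of the 31 scenes are clear. As a quick, illustrative check (a sketch added for clarity rather than part of the main pipeline; the 20% "mostly clear" threshold is an arbitrary cutoff), we can estimate a cloudy-pixel fraction per time slice directly from the qa_pixel bit flags:
# Rough per-scene cloudiness estimate from the Landsat qa_pixel band
# Bits 1-4 flag dilated cloud, cirrus, cloud, and cloud shadow
cloud_bits = (1 << 1) | (1 << 2) | (1 << 3) | (1 << 4)
qa = xx["qa_pixel"].compute()
cloudy_fraction = ((qa & cloud_bits) > 0).mean(dim=["latitude", "longitude"])
for t, frac in zip(qa.time.values, cloudy_fraction.values):
    print(f"{str(t)[:10]}  cloudy fraction = {frac:.2f}")
# Count scenes that are mostly clear (threshold chosen arbitrarily for illustration)
print("Mostly clear scenes:", int((cloudy_fraction < 0.2).sum()))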
# Select a time slice to view a simple RGB image and the cloud mask
# See the XARRAY dimensions above for the number of time slices (starts at 0)
# Slice #6 - Mostly Clear
# Slice #24 - Scattered Clouds
# Slice #7 - Very Cloudy
time_slice = 24
# Plot and RGB Real Color Image for a single date
fig, ax = plt.subplots(figsize=(7, 5))
xx.isel(time=time_slice)[["red", "green", "blue"]].to_array().plot.imshow(robust=True, ax=ax, vmin=0, vmax=0.3)
ax.set_title("Fig. 2. RGB Real Color", fontweight='bold')
ax.axis('off')
plt.show()
Cloud masking for Landsat Collection-2 Level-2 data involves using the "qa_pixel" band to classify pixels as cloud, water, or cloud shadow. Creating a mask for these categories allows us to extract clear pixels for assessing vegetation state.
# To mask the pixels and find clouds or water, it is best to use the bit values of the 16-bit qa_pixel flag
# See the website above for a nice explanation of the process
bit_flags = {
'fill': 1<<0,
'dilated_cloud': 1<<1,
'cirrus': 1<<2,
'cloud': 1<<3,
'shadow': 1<<4,
'snow': 1<<5,
'clear': 1<<6,
'water': 1<<7
}
# Create a function that will mask pixels with a given type
def get_mask(mask, flags_list):
# Create the result mask filled with zeros and the same shape as the mask
final_mask = np.zeros_like(mask)
# Loop through the flags
for flag in flags_list:
# get the mask for each flag
flag_mask = np.bitwise_and(mask, bit_flags[flag])
# add it to the final flag
final_mask = final_mask | flag_mask
return final_mask > 0
# Pick a single time slice to view a mask with clouds and water
sample_xx = xx.isel(time=time_slice)
# Find the pixels that are no data (fill), clouds, cloud shadows, or water
my_mask = get_mask(sample_xx['qa_pixel'],
['fill', 'dilated_cloud', 'cirrus',
'cloud', 'shadow', 'water'])
# Show only the mask (Yellow) with valid data in Purple
plt.figure(figsize=(7,5))
plt.imshow(my_mask)
plt.title("Fig. 3. Cloud / Shadows / Water Mask > YELLOW", fontweight='bold')
plt.axis('off')
plt.show()
# Create an RGB function that will display the mask over the background RGB image
def plot_masked_rgb(red, green, blue, mask, color_mask=(1, 0, 0), transparency=0.5, brightness=2):
# to improve our visualization, we will increase the brightness of our values
red = red / red.max() * brightness
green = green / green.max() * brightness
blue = blue / blue.max() * brightness
red = np.where(mask==True, red*transparency+color_mask[0]*(1-transparency), red)
green = np.where(mask==True, green*transparency+color_mask[1]*(1-transparency), green)
blue = np.where(mask==True, blue*transparency+color_mask[2]*(1-transparency), blue)
rgb = np.stack([red, green, blue], axis=2)
return rgb
rgb = plot_masked_rgb(sample_xx['red'], sample_xx['green'], sample_xx['blue'], my_mask, color_mask=(1, 0, 1), transparency=0.2, brightness=3)
# This is a nice image that shows the clouds and water pixels (Purple) among clear land pixels
plt.figure(figsize=(7,5))
plt.imshow(rgb)
plt.title("Fig. 4. Cloud / Shadows / Water Mask > MAGENTA", fontweight='bold')
plt.axis('off')
plt.show()
Clipping input data to the valid range for imshow with RGB data ([0..1] for floats or [0..255] for integers).
After applying cloud masking and filtering to our region, we'll calculate the Normalized Difference Vegetation Index (NDVI) for both filtered and unfiltered pixels to compare accuracy. We will eventually calculate NDVI for all the locations in our dataset.
NDVI measures vegetation "greenness" with a range of 0.0 to 1.0. Low values (0.0-0.25) indicate little vegetation, middle values (0.25-0.6) represent growing crops, and high values (0.6-1.0) signify peak vegetation. The equation is: NDVI = (NIR-Red) / (NIR+Red).
The NDVI plot shows unfiltered (BLUE) and filtered (GREEN) pixels. Gaps in filtered data signify unavailable clean pixels for NDVI calculation. Similar values suggest low cloud volume, while large differences indicate high cloud volume affecting NDVI. Filtering clouds before NDVI calculation improves phenology accuracy for agricultural plots.
# Calculate the mask for the entire xarray (all time slices)
full_mask = get_mask(xx['qa_pixel'], ['fill', 'dilated_cloud', 'cirrus', 'cloud', 'shadow', 'water'])
# Create a "clean" dataset with the mask applied
cleaned_data = xx.where(~full_mask)
# Calculate the mean of the data across the sample region and then NDVI
# Perform this calculation for the unfiltered and cloud-filtered (clean) datasets
mean_unfiltered = xx.mean(dim=['longitude','latitude']).compute()
ndvi_mean = (mean_unfiltered.nir08-mean_unfiltered.red)/(mean_unfiltered.nir08+mean_unfiltered.red)
mean_clean = cleaned_data.mean(dim=['longitude','latitude']).compute()
ndvi_mean_clean = (mean_clean.nir08-mean_clean.red)/(mean_clean.nir08+mean_clean.red)
fig = plt.figure(figsize=(7, 5))
ndvi_mean_clean.plot(color='green',marker='o',markersize=4,linewidth=2, label="Filtered = Clouds and Water Removed")
ndvi_mean.plot(color='blue',marker='o',markersize=4,linewidth=2, label="All Pixels = Clouds and Water Included")
plt.title("Fig. 5. Mean NDVI (Vegetation Index)", fontweight='bold')
plt.xlabel("Date")
plt.ylabel("NDVI")
plt.ylim(-0.1,1.0)
plt.legend(loc="upper right", markerscale=2., scatterpoints=1, fontsize=10)
plt.show()
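To make the gaps in the filtered (green) curve concrete, a short check (added here purely for illustration) counts the dates where cloud and water masking left no clean pixels to average:
# Dates where every pixel was masked out, leaving no clean NDVI value
n_gaps = int(ndvi_mean_clean.isnull().sum())
print(f"{n_gaps} of {ndvi_mean_clean.time.size} dates have no cloud-free pixels")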
# Plot an NDVI image for a single date with few clouds
# We will select image (index=6) on January 2, 2022. Notice how the water is masked out.
fig = plt.figure(figsize=(7, 5))
ndvi_image = (cleaned_data.nir08-cleaned_data.red)/(cleaned_data.nir08+cleaned_data.red)
ndvi_image.isel(time=6).plot(vmin=0.0, vmax=0.8, cmap="RdYlGn")
plt.title("Fig. 6. NDVI", fontweight='bold')
plt.axis('off')
plt.show()
box_size_deg2 = 0.0004 # Surrounding box in degrees, yields approximately 5x5 pixel region
resolution2 = 10 # meters per pixel
scale2 = resolution2 / 111320.0 # degrees per pixel for crs=4326
catalog = pystac_client.Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")
search = catalog.search(collections=["sentinel-1-rtc"], bbox=bounds, datetime=time_window)
items = list(search.get_all_items())
print('This is the number of scenes that touch our region:',len(items))
This is the number of scenes that touch our region: 27
# Load the data using Open Data Cube
data = stac_load(items,bands=["vv", "vh"], patch_url=pc.sign, bbox=bounds, crs="EPSG:4326", resolution=scale2)
# View the details of our xarray dataset
# The X and Y dimensions tell us the pixel dimensions of our bounding box
# The "time" variable is the number of scenes that touch our region
data
<xarray.Dataset>
Dimensions:      (latitude: 1114, longitude: 1114, time: 27)
Coordinates:
  * latitude     (latitude) float64 10.49 10.49 10.49 ... 10.39 10.39 10.39
  * longitude    (longitude) float64 105.3 105.3 105.3 ... 105.4 105.4 105.4
    spatial_ref  int32 4326
  * time         (time) datetime64[ns] 2021-12-04T22:46:07.919581 ... 2022-04...
Data variables:
    vv           (time, latitude, longitude) float32 0.02417 0.01493 ... 0.272
    vh           (time, latitude, longitude) float32 0.004146 ... 0.01252
Sentinel-1 SAR data uses VV and VH polarizations to capture radar backscatter from the Earth's surface, creating grayscale images that depict low backscatter as dark and high backscatter as light (Xu (2021)). This method helps identify surface features like water bodies (dark), vegetation (medium-light), and urban areas (light). The code visualizes VV and VH polarizations of Sentinel-1 SAR data as grayscale images for a specific time slice, aiding in the understanding of backscatter distribution and surface characteristics.
# Assuming that the loaded data is in 'data' variable, let's visualize the vv and vh bands
vv_band = data.vv
vh_band = data.vh
# You can select a specific time slice to visualize, e.g., the first one:
time_index = 0
vv_slice = vv_band.isel(time=time_index)
vh_slice = vh_band.isel(time=time_index)
# Plot vv polarization
plt.figure(figsize=(7, 5))
vv_slice.plot(cmap='gray', robust=True)
plt.title('Fig. 7. Sentinel-1 VV Polarization', fontweight='bold')
plt.show()
# Plot vh polarization
plt.figure(figsize=(7, 5))
vh_slice.plot(cmap='gray', robust=True)
plt.title('Fig. 8. Sentinel-1 VH Polarization', fontweight='bold')
plt.show()
After visualizing and assessing Landsat and Sentinel 1 data, we'll calculate predictor variables for our prediction model. Using spatial information (latitude, longitude) and harvest dates, we'll derive predictors from both Landsat and Sentinel 1 data, similar to NDVI calculations.
We made an initial selection of predictor variables based on the literature, notably Miao (2011), Srivastava (2012), and Wu (2015), among others.
We use filtered data, free from clouds and water bodies, to calculate the vegetation indices. The indices are computed after applying the masking algorithm at each location in our dataset. Instead of pulling data for an entire crop cycle (as we did in the sampling model above), we request data for a time window around the harvest date. Since Landsat revisits roughly every 8 days, we use a window of 8 days before and after the harvest date. This keeps enough observations for accurate cubic-spline interpolation while reducing noise, memory usage, and processing time, and it improves data reliability.
Along with NDVI (discussed above), we compute NDWI, SAVI, EVI2, and albedo from Landsat.
NDWI (Normalized Difference Water Index): The index detects the presence of water, which is useful for identifying waterlogged and flood-affected areas and for monitoring irrigation and water management. Ranging from -1.0 to 1.0, negative values indicate low water content, while increasing values signify higher water content. We use the green/near-infrared formulation, NDWI = (Green - NIR) / (Green + NIR); a related variant, (NIR - SWIR) / (NIR + SWIR), is often used to track vegetation water content.
SAVI (Soil Adjusted Vegetation Index): The index corrects for the influence of soil brightness on vegetation indices and is helpful in areas of low vegetation cover with high soil reflectance. SAVI ranges from -1.0 to 1.0, with negative values indicating water, values close to 0 for bare soil/low vegetation, up to 0.25 for sparse vegetation, around 0.6 for growing vegetation, and 0.6 to 1.0 for dense vegetation. It's calculated as ((NIR - Red) / (NIR + Red + L)) * (1 + L), where L is a soil adjustment factor between 0 and 1 (we use L = 0.5 below).
EVI 2 (Enhanced Vegetation Index 2): EVI 2, a modification of the original EVI, requires only the red and near-infrared bands. Useful for monitoring vegetation health and productivity, it's less sensitive to atmospheric conditions and soil background signals than NDVI. Ranging from -1.0 to 1.0, negative values indicate water, and increasing values represent more vegetation. It's calculated as 2.5 (NIR - Red) / (NIR + 2.4 Red + 1). In the computation below we actually use the closely related three-band EVI formulation, 2.5 (NIR - Red) / (NIR + 6 Red - 7.5 Blue + 1), which additionally uses the blue band to correct for atmospheric effects.
Albedo: Albedo measures surface reflectivity, with values ranging from 0 (completely absorbing) to 1 (completely reflecting). In croplands, albedo is important for understanding the energy balance and heat exchange between the land surface and the atmosphere, and it varies with canopy density and soil moisture, making it a useful complementary indicator of crop condition.
By combining these vegetation indices, we gained a comprehensive understanding of the health, productivity, and water management of the croplands, as well as the energy balance and heat exchange between the land surface and the atmosphere.
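Before computing these indices over our scenes, here is a tiny worked example with made-up reflectance values (purely illustrative numbers, not taken from our data) showing how the formulas behave for a dense, well-watered rice canopy. The SAVI soil factor L = 0.5 and the albedo band weights match the choices used in the loop below.
# Hypothetical surface reflectances for a dense rice canopy (illustrative only)
red, green, blue, nir, swir = 0.05, 0.07, 0.04, 0.45, 0.20

ndvi = (nir - red) / (nir + red)
ndwi = (green - nir) / (green + nir)                # green/NIR formulation used below
savi = ((nir - red) / (nir + red + 0.5)) * 1.5      # soil adjustment factor L = 0.5
evi2 = 2.5 * (nir - red) / (nir + 2.4 * red + 1)
albedo = 0.356*blue + 0.130*green + 0.373*red + 0.085*nir + 0.072*swir + 0.0018  # weights as in the loop below

print(f"NDVI={ndvi:.3f}  NDWI={ndwi:.3f}  SAVI={savi:.3f}  EVI2={evi2:.3f}  Albedo={albedo:.3f}")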
# Load the data using Open Data Cube for each location
# EVI coefficients (G, C1, C2, L) selected from the literature
G = 2.5
C1 = 6
C2 = 7.5
L = 1
latitudes = []
longitudes = []
albedo_values = []
ndvi_landsat = []
evi2_landsat = []
savi_landsat = []
ndwi_landsat = []
for index, row in locations.iterrows():
lat = row['Latitude']
lon = row['Longitude']
toh = row['Date of Harvest']
date_object = datetime.strptime(toh, "%d-%m-%Y")
trial_date = date_object.strftime("%Y-%m-%d")
time_delta = timedelta(days=8)
trial_datetime = datetime.strptime(trial_date,"%Y-%m-%d")
start_date = (trial_datetime - time_delta)
end_date = (trial_datetime + time_delta)
time_window3 = f"{start_date}/{end_date}"
#Load the data using Open Data Cube
bounds = (lon - box_size_deg1/2, lat - box_size_deg1/2, lon + box_size_deg1/2, lat + box_size_deg1/2)
#Load the data using Open Data Cube
catalog = pystac_client.Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")
search = catalog.search(collections=["landsat-c2-l2"], bbox=bounds, datetime=time_window3, query={"platform": {"in": ["landsat-8", "landsat-9"]}})
items = list(search.get_all_items())
xx = stac_load(
items,
bands=["red", "green", "blue", "nir08","swir16","qa_pixel"],
crs="EPSG:4326",
resolution=scale1,
chunks={"x": 2048, "y": 2048},
patch_url=pc.sign,
bbox=bounds
)
# Apply scaling and offsets for Landsat Collection-2 (reference below) to the spectral bands ONLY
# https://planetarycomputer.microsoft.com/dataset/landsat-c2-l2
xx['red'] = (xx['red']*0.0000275)-0.2
xx['green'] = (xx['green']*0.0000275)-0.2
xx['blue'] = (xx['blue']*0.0000275)-0.2
xx['nir08'] = (xx['nir08']*0.0000275)-0.2
xx['swir16'] = (xx['swir16']*0.0000275)-0.2
# Calculate the mask for the entire xarray (all time slices)
full_mask = get_mask(xx['qa_pixel'], ['fill', 'dilated_cloud', 'cirrus', 'cloud', 'shadow', 'water'])
# Create a "clean" dataset with the mask applied
cleaned_data = xx.where(~full_mask)
mean_clean = cleaned_data.mean(dim=['longitude','latitude']).compute()
ndvi_mean_clean = (mean_clean.nir08-mean_clean.red)/(mean_clean.nir08+mean_clean.red)
evi2_mean_clean = G * ((mean_clean.nir08 - mean_clean.red) / (mean_clean.nir08 + C1 * mean_clean.red - C2 * mean_clean.blue + L))
savi_mean_clean = ((mean_clean.nir08 - mean_clean.red) / (mean_clean.nir08 + mean_clean.red + 0.5)) * 1.5
ndwi_mean_clean = (mean_clean.green - mean_clean.nir08) / (mean_clean.green + mean_clean.nir08)
albedo = 0.356 * mean_clean.blue + 0.130 * mean_clean.green + 0.373 * mean_clean.red + 0.085 * mean_clean.nir08 + 0.072 * mean_clean.swir16 +0.0018 # constant term
# Appending the latitude, longitude, NDVI, NDWI, EVI 2, SAVI and Albedo values to their respective lists
latitudes.append(lat)
longitudes.append(lon)
albedo_values.append(albedo)
ndvi_landsat.append(ndvi_mean_clean)
evi2_landsat.append(evi2_mean_clean)
savi_landsat.append(savi_mean_clean)
ndwi_landsat.append(ndwi_mean_clean)
We need a separate for loop to derive land surface temperature, because here we use the thermal band offered by Landsat 9 (chosen for its higher accuracy and more reliable data). We tweak the code to request data from Landsat 9 only and include the thermal band we are interested in.
# Load the data using Open Data Cube for each location
lst_landsat = []
for index, row in locations.iterrows():
lat = row['Latitude']
lon = row['Longitude']
toh = row['Date of Harvest']
date_object = datetime.strptime(toh, "%d-%m-%Y")
trial_date = date_object.strftime("%Y-%m-%d")
time_delta = timedelta(days=8) # 8-day window on either side of the harvest date, as before
trial_datetime = datetime.strptime(trial_date,"%Y-%m-%d")
start_date = (trial_datetime - time_delta)
end_date = (trial_datetime + time_delta)
time_window3 = f"{start_date}/{end_date}"
#Load the data using Open Data Cube
bounds = (lon - box_size_deg1/2, lat - box_size_deg1/2, lon + box_size_deg1/2, lat + box_size_deg1/2)
try:
#Load the data using Open Data Cube
catalog = pystac_client.Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")
search = catalog.search(collections=["landsat-c2-l2"], bbox=bounds, datetime=time_window3, query={"platform": {"in": ["landsat-9"]}})
items = list(search.get_all_items())
#data = xr.open_rasterio(items, chunks={"x": 2048, "y": 2048}, crs="EPSG:4326", resolution=scale, patch_url=sign)
xx = stac_load(
items,
bands=["red", "green", "blue", "nir08","swir16","qa_pixel", "lwir11"],
crs="EPSG:4326", # Latitude-Longitude
resolution=scale1, # Degrees
chunks={"x": 2048, "y": 2048},
patch_url=pc.sign,
bbox=bounds
)
# Apply scaling and offsets for Landsat Collection-2 (reference below) to the spectral bands ONLY
# https://planetarycomputer.microsoft.com/dataset/landsat-c2-l2
xx['red'] = (xx['red']*0.0000275)-0.2
xx['green'] = (xx['green']*0.0000275)-0.2
xx['blue'] = (xx['blue']*0.0000275)-0.2
xx['nir08'] = (xx['nir08']*0.0000275)-0.2
xx['swir16'] = (xx['swir16']*0.0000275)-0.2
xx['lwir11'] = (xx['lwir11'] * 0.0003342) + 0.1
full_mask = get_mask(xx['qa_pixel'], ['fill', 'dilated_cloud', 'cirrus', 'cloud', 'shadow', 'water'])
cleaned_data = xx.where(~full_mask)
mean_clean = cleaned_data.mean(dim=['longitude','latitude']).compute()
ndvi_mean_clean = (mean_clean.nir08-mean_clean.red)/(mean_clean.nir08+mean_clean.red)
# Constants adjusted to calculated temperature brightness from Landsat 9
K1_CONSTANT_BAND_11 = 774.8853
K2_CONSTANT_BAND_11 = 1321.0789
brightness_temperature = K2_CONSTANT_BAND_11 / np.log((K1_CONSTANT_BAND_11 / xx['lwir11']) + 1)
# Emissivity values
emissivity_vegetation = 0.99
emissivity_soil = 0.97
# Estimate emissivity using the NDVI threshold method
ndvi_clean = (cleaned_data.nir08 - cleaned_data.red) / (cleaned_data.nir08 + cleaned_data.red)
emissivity = emissivity_soil + (emissivity_vegetation - emissivity_soil) * np.where(ndvi_clean > 0.5, 1, 0)
# Calculate LST in Celsius
lst_celsius = (brightness_temperature / (1 + (0.00115 * brightness_temperature / 1.4388) * np.log(emissivity))) - 273.15
cleaned_lst = lst_celsius.where(~full_mask)
mean_lst = cleaned_lst.mean(dim=['longitude', 'latitude']).compute()
lst_landsat.append(mean_lst)
except:
print(f"No data found for location {lat}, {lon}")
continue
No data found for location 10.387123, 105.119326
No data found for location 10.426536, 105.115181
No data found for location 10.419552, 105.116409
No data found for location 10.403175, 105.123726
No data found for location 10.399444, 105.118528
No data found for location 10.42324, 105.126544
No data found for location 10.429339, 105.125493
No data found for location 10.435839, 105.132981
For some locations we receive no data at all (not just missing values). To resolve this, we insert dummy xarrays (with matching spatial metadata) filled with NaN values at those positions, and later interpolate over these NaNs (explained further down).
def create_and_insert_dataarray(index, date_of_harvest, lst_landsat):
doh_dt = datetime.strptime(date_of_harvest, '%d-%m-%Y')
ta = xr.DataArray(np.full(2, np.nan), dims='time', coords={'time': [doh_dt.strftime('%Y-%m-%dT03:20:15.057547'), doh_dt.strftime('%Y-%m-%dT03:20:15.057547')]})
ta.attrs['spatial_ref'] = 4326
lst_landsat.insert(index, ta)
# Assuming the locations DataFrame and lst_landsat list are defined
indices = [201, 204, 205, 209, 211, 306, 310, 323]
for index in indices:
date_of_harvest = locations.iloc[index]['Date of Harvest']
create_and_insert_dataarray(index, date_of_harvest, lst_landsat)
Similarly, instead of pulling data for an entire crop cycle (as we did in the sampling model above), we only request data for a time window around the date of harvest. Sentinel-1 revisits roughly every 6 days, hence the 12-day window (6 days on either side of the harvest date).
# Utilising this for loamy soil as this is the best generalized soil type for Vietnam
A = 0.71
B = 1.40
sm_feature = []
# Load the data using Open Data Cube for each location
for index, row in locations.iterrows():
lon = row['Longitude']
lat = row['Latitude']
toh = row['Date of Harvest']
date_object = datetime.strptime(toh, "%d-%m-%Y")
trial_date = date_object.strftime("%Y-%m-%d")
time_delta = timedelta(days=6)
trial_datetime = datetime.strptime(trial_date,"%Y-%m-%d")
start_date = (trial_datetime - time_delta)
end_date = (trial_datetime + time_delta)
time_window3 = f"{start_date}/{end_date}"
bbox = (lon - box_size_deg2/2, lat - box_size_deg2/2, lon + box_size_deg2/2, lat + box_size_deg2/2)
catalog = pystac_client.Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")
search = catalog.search(collections=["sentinel-1-rtc"], bbox=bbox, datetime=time_window3)
items = list(search.get_all_items())
data = stac_load(items,bands=["vv", "vh"], patch_url=pc.sign, bbox=bbox, crs="EPSG:4326", resolution=scale2)
mean = data.mean(dim=['latitude','longitude']).compute()
dop = (mean.vv / (mean.vv + mean.vh))
#Calculating Soil Moisture
soil_moisture = (1 - ((10 ** (0.1 * dop)) / A) ** B)
sm_feature.append(soil_moisture)
Now we check whether our data contains any missing (NaN) values.
albedo_values[:5]
[<xarray.DataArray (time: 4)>
 array([nan, nan, nan, nan])
 Coordinates:
     spatial_ref  int32 4326
   * time         (time) datetime64[ns] 2022-07-13T03:20:15.057547 ... 2022-07...,
 <xarray.DataArray (time: 4)>
 array([nan, nan, nan, nan])
 Coordinates:
     spatial_ref  int32 4326
   * time         (time) datetime64[ns] 2022-07-13T03:20:15.057547 ... 2022-07...,
 <xarray.DataArray (time: 2)>
 array([nan, nan])
 Coordinates:
     spatial_ref  int32 4326
   * time         (time) datetime64[ns] 2022-07-13T03:20:38.965531 2022-07-21T...,
 <xarray.DataArray (time: 4)>
 array([nan, nan, nan, nan])
 Coordinates:
     spatial_ref  int32 4326
   * time         (time) datetime64[ns] 2022-07-13T03:20:15.057547 ... 2022-07...,
 <xarray.DataArray (time: 4)>
 array([nan, nan, nan, nan])
 Coordinates:
     spatial_ref  int32 4326
   * time         (time) datetime64[ns] 2022-07-13T03:20:15.057547 ... 2022-07...]
The satellite retrievals clearly contain missing values. To make predictions with this data, we need to replace the NaNs with real values. Replacing them with zeros or the mean is an option, but it introduces more bias into our data. We chose instead to interpolate the missing values with a cubic spline, which smooths the data and fills the gaps with less bias.
def interpolate_and_create_dataframe(data_list, column_name, latitudes, longitudes):
values_arrays = [data.values for data in data_list]
lengths = [len(arr) for arr in values_arrays]
values_flat = np.concatenate(values_arrays)
# Get the indices of non-NaN values
non_nan_indices = np.where(~np.isnan(values_flat))[0]
non_nan_values = values_flat[non_nan_indices]
# Create a cubic spline interpolator
spline_interpolator = interp1d(non_nan_indices, non_nan_values, kind='cubic', bounds_error=False, fill_value=(non_nan_values[0], non_nan_values[-1]))
# Replace NaN values with interpolated values
interpolated_data = np.array([spline_interpolator(i) if np.isnan(value) else value for i, value in enumerate(values_flat)])
# Split the interpolated_data back to original lengths
interpolated_arrays = np.split(interpolated_data, np.cumsum(lengths)[:-1])
data = {'Latitude': latitudes, 'Longitude': longitudes, column_name: interpolated_arrays}
return pd.DataFrame(data)
al_df = interpolate_and_create_dataframe(albedo_values, 'Albedo', latitudes, longitudes)
ndvi_df = interpolate_and_create_dataframe(ndvi_landsat, 'NDVI', latitudes, longitudes)
ndwi_df = interpolate_and_create_dataframe(ndwi_landsat, 'NDWI', latitudes, longitudes)
evi2_df = interpolate_and_create_dataframe(evi2_landsat, 'EVI 2', latitudes, longitudes)
savi_df = interpolate_and_create_dataframe(savi_landsat, 'SAVI', latitudes, longitudes)
lst_df = interpolate_and_create_dataframe(lst_landsat, 'Land Surface Temperature', latitudes, longitudes)
sm_df = interpolate_and_create_dataframe(sm_feature, 'Soil Moisture', latitudes, longitudes)
al_df.head(5)
| | Latitude | Longitude | Albedo |
|---|---|---|---|
| 0 | 10.510542 | 105.248554 | [0.09556685977230825, 0.09556685977230825, 0.0... |
| 1 | 10.509150 | 105.265098 | [0.09556685977230825, 0.09556685977230825, 0.0... |
| 2 | 10.467721 | 105.192464 | [0.09556685977230825, 0.09556685977230825] |
| 3 | 10.494453 | 105.241281 | [0.09556685977230825, 0.09556685977230825, 0.0... |
| 4 | 10.535058 | 105.252744 | [0.09556685977230825, 0.09556685977230825, 0.0... |
def count_nan_df_col(df, column_name):
count = 0
for array in df[column_name]:
count += np.isnan(array).sum()
return count
print(f"Number of NaN values in Albedo DataFrame: {count_nan_df_col(al_df, 'Albedo')}")
print(f"Number of NaN values in NDVI DataFrame: {count_nan_df_col(ndvi_df, 'NDVI')}")
print(f"Number of NaN values in NDWI DataFrame: {count_nan_df_col(ndwi_df, 'NDWI')}")
print(f"Number of NaN values in SAVI DataFrame: {count_nan_df_col(savi_df, 'SAVI')}")
print(f"Number of NaN values in EVI 2 DataFrame: {count_nan_df_col(evi2_df, 'EVI 2')}")
print(f"Number of NaN values in Land Surface Temp DataFrame: {count_nan_df_col(lst_df, 'Land Surface Temperature')}")
print(f"Number of NaN values in Soil Moisture DataFrame: {count_nan_df_col(sm_df, 'Soil Moisture')}")
Number of NaN values in Albedo DataFrame: 0
Number of NaN values in NDVI DataFrame: 0
Number of NaN values in NDWI DataFrame: 0
Number of NaN values in SAVI DataFrame: 0
Number of NaN values in EVI 2 DataFrame: 0
Number of NaN values in Land Surface Temp DataFrame: 0
Number of NaN values in Soil Moisture DataFrame: 0
We now see that there are no NaN values anymore, but each latitude-longitude combination holds a list of index values. This makes sense, since we pulled satellite data over a time window around the date of harvest. Because the window is narrow, the values should be quite similar, so we simply take the mean of these index values for each latitude-longitude combination.
def find_mean(lst):
return sum(lst) / len(lst)
al_df['Average Albedo'] = al_df['Albedo'].apply(find_mean)
evi2_df['Average EVI 2'] = evi2_df['EVI 2'].apply(find_mean)
ndvi_df['Average NDVI'] = ndvi_df['NDVI'].apply(find_mean)
savi_df['Average SAVI'] = savi_df['SAVI'].apply(find_mean)
ndwi_df['Average NDWI'] = ndwi_df['NDWI'].apply(find_mean)
lst_df['Average Land Surface Temperature'] = lst_df['Land Surface Temperature'].apply(find_mean)
sm_df['Average Soil Moisture'] = sm_df['Soil Moisture'].apply(find_mean)
al_df.head()
| | Latitude | Longitude | Albedo | Average Albedo |
|---|---|---|---|---|
| 0 | 10.510542 | 105.248554 | [0.09556685977230825, 0.09556685977230825, 0.0... | 0.095567 |
| 1 | 10.509150 | 105.265098 | [0.09556685977230825, 0.09556685977230825, 0.0... | 0.095567 |
| 2 | 10.467721 | 105.192464 | [0.09556685977230825, 0.09556685977230825] | 0.095567 |
| 3 | 10.494453 | 105.241281 | [0.09556685977230825, 0.09556685977230825, 0.0... | 0.095567 |
| 4 | 10.535058 | 105.252744 | [0.09556685977230825, 0.09556685977230825, 0.0... | 0.095567 |
feature_df = pd.concat([al_df['Average Albedo'], evi2_df['Average EVI 2'], ndvi_df['Average NDVI'], savi_df['Average SAVI'],
ndwi_df['Average NDWI'], sm_df['Average Soil Moisture'], lst_df['Average Land Surface Temperature']],axis=1)
feature_df.head()
| | Average Albedo | Average EVI 2 | Average NDVI | Average SAVI | Average NDWI | Average Soil Moisture | Average Land Surface Temperature |
|---|---|---|---|---|---|---|---|
| 0 | 0.095567 | 0.447983 | 0.635155 | 0.419686 | -0.583988 | -1.069717 | 66.657175 |
| 1 | 0.095567 | 0.447983 | 0.635155 | 0.419686 | -0.583988 | -1.105420 | 66.657175 |
| 2 | 0.095567 | 0.447983 | 0.635155 | 0.419686 | -0.583988 | -1.060978 | 66.657175 |
| 3 | 0.095567 | 0.447983 | 0.635155 | 0.419686 | -0.583988 | -1.017536 | 66.657175 |
| 4 | 0.095567 | 0.447983 | 0.635155 | 0.419686 | -0.583988 | -1.102196 | 66.657175 |
Abbas (2020) and Ray (2015) emphasize the importance of including precipitation as a predictor for rice yield variability. Therefore, we obtained precipitation data from a public website: Visual Crossing.
The code computes precipitation statistics (mean, maximum, minimum, and variance) for three districts in Vietnam based on historical weather data and various window sizes. These precipitation statistics are then combined with other indices for further analysis and modeling in predicting rice yield in the study area.
precip = weather_data['precip']
precip1 = weather_data1['precip']
precip2 = weather_data2['precip']
dates = locations['Date of Harvest']
unique_dates = dates.unique()
def sort_dates(date):
day, month, year = date.split('-')
return year, month, day
sorted_dates = sorted(unique_dates, key=sort_dates)
window_sizes = [168,170,171,172,174,175,176,177,178,182,183,184,185,186,187,188,190,191,193,194,
197,201,203,278,283,285,286,287,288,290,292,293,294,295,296,301,302,304,307,308,312]
def calculate_precip_stats(precip, window_sizes):
prec_mean = []
prec_max = []
prec_min = []
prec_var = []
for window_size in window_sizes:
prec_mean.append(precip.iloc[:window_size].mean())
prec_max.append(precip.iloc[:window_size].max())
prec_min.append(precip.iloc[:window_size].min())
prec_var.append(precip.iloc[:window_size].var())
return prec_mean, prec_max, prec_min, prec_var
# Chau Phu
prec_df_cp, prec_max_df_cp, prec_min_df_cp, prec_var_df_cp = calculate_precip_stats(precip, window_sizes)
# Chau Thanh
prec_df_ct, prec_max_df_ct, prec_min_df_ct, prec_var_df_ct = calculate_precip_stats(precip1, window_sizes)
# Thoai Son
prec_df_ts, prec_max_df_ts, prec_min_df_ts, prec_var_df_ts = calculate_precip_stats(precip2, window_sizes)
# create a list of districts and precipitation data
districts = ['Chau_Phu'] * 41 + ['Chau_Thanh'] * 41 + ['Thoai_Son'] * 41
precipitation_data = [prec_df_cp, prec_df_ct, prec_df_ts]
max_precipitation_data = [prec_max_df_cp, prec_max_df_ct, prec_max_df_ts]
min_precipitation_data = [prec_min_df_cp, prec_min_df_ct, prec_min_df_ts]
var_precipitation_data = [prec_var_df_cp, prec_var_df_ct, prec_var_df_ts]
dates_extended = sorted_dates * 3
# flatten the precipitation data list
precipitation_data = [item for sublist in precipitation_data for item in sublist]
max_precipitation_data = [item for sublist in max_precipitation_data for item in sublist]
min_precipitation_data = [item for sublist in min_precipitation_data for item in sublist]
var_precipitation_data = [item for sublist in var_precipitation_data for item in sublist]
# create a dictionary with districts and precipitation data
prec = {'District': districts, 'Date of Harvest': dates_extended, 'Precipitation': precipitation_data, 'Precipitation Variance': var_precipitation_data}
# create a DataFrame from the dictionary
df = pd.DataFrame(prec)
new_df = pd.DataFrame({'Average Albedo': feature_df['Average Albedo'], 'Average EVI 2': feature_df['Average EVI 2'],
'Average NDVI': feature_df['Average NDVI'] , 'Average SAVI': feature_df['Average SAVI'], 'Average NDWI': feature_df['Average NDWI'],
'Average Land Surface Temperature': feature_df['Average Land Surface Temperature'], 'Average Soil Moisture': feature_df['Average Soil Moisture']})
new_df['Date of Harvest'] = locations['Date of Harvest']
new_df['District'] = locations['District']
new_df = pd.merge(left=new_df, right=df, on=['District', 'Date of Harvest'], how='left')
new_df.drop(columns=['District', 'Date of Harvest'], inplace=True)
new_df.head()
| | Average Albedo | Average EVI 2 | Average NDVI | Average SAVI | Average NDWI | Average Land Surface Temperature | Average Soil Moisture | Precipitation | Precipitation Variance |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.095567 | 0.447983 | 0.635155 | 0.419686 | -0.583988 | 66.657175 | -1.069717 | 6.304514 | 177.830676 |
| 1 | 0.095567 | 0.447983 | 0.635155 | 0.419686 | -0.583988 | 66.657175 | -1.105420 | 6.304514 | 177.830676 |
| 2 | 0.095567 | 0.447983 | 0.635155 | 0.419686 | -0.583988 | 66.657175 | -1.060978 | 6.304514 | 177.830676 |
| 3 | 0.095567 | 0.447983 | 0.635155 | 0.419686 | -0.583988 | 66.657175 | -1.017536 | 6.304514 | 177.830676 |
| 4 | 0.095567 | 0.447983 | 0.635155 | 0.419686 | -0.583988 | 66.657175 | -1.102196 | 6.322997 | 178.353735 |
Let's start by predicting rice yield with a simple linear regression model. To guard against overfitting and compensate for the lack of a separate validation set, we split our dataset with a 75:25 train-test ratio instead of 80:20.
X = new_df
y = locations['Rice Yield (kg/ha)']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=123)
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)
# Predict on the test set
y_pred_linear = linear_model.predict(X_test)
r2_linear = r2_score(y_test, y_pred_linear)
mse_linear = mean_squared_error(y_test, y_pred_linear)
print(f"Linear Regression Accuracy: {r2_linear:.2f}")
print(f"Linear Regression MSE: {mse_linear**(1/2):.2f}")
Linear Regression Accuracy: 0.68
Linear Regression RMSE: 436.25
Given that 0.86 is the benchmark accuracy, a score of 0.68 seems like a good start. We can introduce some regularization into our regression model by switching to Lasso.
lasso_model = Lasso(alpha=0.5)
lasso_model.fit(X_train, y_train)
y_pred_lasso = lasso_model.predict(X_test)
r2_lasso = r2_score(y_test, y_pred_lasso)
mse_lasso = mean_squared_error(y_test, y_pred_lasso)
print(f"Lasso Regression Accuracy: {r2_lasso:.2f}")
print(f"Lasso Regression MSE: {mse_lasso**(1/2):.2f}")
Lasso Regression Accuracy: 0.67
Lasso Regression RMSE: 440.32
Neither the accuracy improved nor the RMSE decreased: the added regularization leads to slightly worse performance than plain linear regression. The coefficient shrinkage caused by Lasso is visualized below.
coeff_df = pd.DataFrame()
coeff_df['Feature'] = X_train.columns
coeff_df['Linear Coefficients'] = linear_model.coef_
coeff_df['Lasso Coefficients'] = lasso_model.coef_
coeff_df
| | Feature | Linear Coefficients | Lasso Coefficients |
|---|---|---|---|
| 0 | Average Albedo | -7207.746314 | -0.000000 |
| 1 | Average EVI 2 | -1653.822764 | -728.091388 |
| 2 | Average NDVI | 600.007497 | 446.673485 |
| 3 | Average SAVI | 2569.392649 | -0.000000 |
| 4 | Average NDWI | 1666.820463 | 245.993988 |
| 5 | Average Land Surface Temperature | -1.849313 | -3.391375 |
| 6 | Average Soil Moisture | 827.565980 | 219.788827 |
| 7 | Precipitation | -1243.472173 | -1186.544723 |
| 8 | Precipitation Variance | 18.794090 | 18.538168 |
Let's move on to more advanced machine learning regression models to better capture the underlying patterns in our features.
Before performing feature selection, we first visualize the correlations between our predictor variables to identify any underlying patterns among them.
Xy = pd.concat([X, y], axis=1)
corr_matrix = Xy.corr()
sns.heatmap(corr_matrix, annot=True, cmap='GnBu', fmt='.2f', vmin=-1, vmax=1, linewidths=0.5, cbar_kws={'shrink': 0.5})
plt.title("Fig. 9. Correlation Heatmap", fontweight='bold')
plt.show()
Recursive Feature Elimination (RFE) is an iterative technique for feature selection in machine learning. It trains a model on subsets of features, eliminating the least important ones until the desired count is reached. We're using RFE to identify the best features for the machine learning models we plan to test.
# Perform recursive feature elimination with cross-validation for a given estimator
def perform_rfe(estimator, X, y):
with open(os.devnull, 'w') as devnull:
with redirect_stdout(devnull):
selector = RFECV(estimator, cv=5)
selector.fit(X, y)
return X.columns[selector.support_]
estimators = {
"Random Forest Regressor": RandomForestRegressor(),
"Gradient Boosting Regressor": GradientBoostingRegressor(),
"XGB Regressor": XGBRegressor(),
"CatBoost Regressor": CatBoostRegressor(logging_level='Silent')
}
selected_features = {}
for name, estimator in estimators.items():
selected_features[name] = perform_rfe(estimator, X, y)
print(f"Selected features for {name}:", selected_features[name])
Selected features for Random Forest Regressor: Index(['Average Albedo', 'Average NDWI', 'Average Land Surface Temperature', 'Average Soil Moisture', 'Precipitation', 'Precipitation Variance'], dtype='object')
Selected features for Gradient Boosting Regressor: Index(['Average Albedo', 'Average NDWI', 'Average Land Surface Temperature', 'Precipitation', 'Precipitation Variance'], dtype='object')
Selected features for XGB Regressor: Index(['Precipitation', 'Precipitation Variance'], dtype='object')
Selected features for CatBoost Regressor: Index(['Average Albedo', 'Average EVI 2', 'Average NDVI', 'Average NDWI', 'Average Land Surface Temperature', 'Average Soil Moisture', 'Precipitation', 'Precipitation Variance'], dtype='object')
We'll set up machine learning models tailored to our input parameters, experimenting with Random Forest, Gradient Boosting, XGBoost, and CatBoost. These four models effectively handle non-linear and complex relationships, as observed in the correlation heatmap, and provide robust regularization to prevent overfitting. Gradient Boosting combines multiple decision trees for better performance on complex datasets, while XGBoost offers an optimized, scalable version of Gradient Boosting. CatBoost, though commonly used for categorical data, is fast, efficient, and resistant to overfitting.
def train_and_evaluate(model, model_name, X_train, y_train, X_test, y_test):
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
print(f"{model_name} Accuracy: {r2:.2f}")
print(f"{model_name} MSE: {mse**(1/2):.2f}")
print()
return model, y_pred
# Create model instances
rf_model = RandomForestRegressor(n_estimators=15, max_depth=5, random_state=42)
gb_model = GradientBoostingRegressor(n_estimators=14, learning_rate=0.21, random_state=42)
xgb_model = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=18, learning_rate=0.201, seed=42)
cb_model = CatBoostRegressor(learning_rate=0.21, depth=5, n_estimators=25, verbose=0, random_state=42)
selection1 = ['Average Albedo', 'Average NDWI', 'Average Land Surface Temperature', 'Precipitation', 'Precipitation Variance']
selection2 = ['Average NDWI', 'Average Land Surface Temperature', 'Precipitation', 'Precipitation Variance']
selection3 = ['Precipitation', 'Precipitation Variance']
selection4 = ['Average Albedo', 'Average EVI 2', 'Average NDVI', 'Average NDWI', 'Average Land Surface Temperature', 'Average Soil Moisture', 'Precipitation', 'Precipitation Variance']
X1 = new_df[selection1]
X2 = new_df[selection2]
X3 = new_df[selection3]
X4 = new_df[selection4]
y = locations['Rice Yield (kg/ha)']
X_train1, X_test1, y_train, y_test = train_test_split(X1, y, test_size=0.25, random_state=123)
X_train2, X_test2, y_train, y_test = train_test_split(X2, y, test_size=0.25, random_state=123)
X_train3, X_test3, y_train, y_test = train_test_split(X3, y, test_size=0.25, random_state=123)
X_train4, X_test4, y_train, y_test = train_test_split(X4, y, test_size=0.25, random_state=123)
# Train and evaluate models
rf_model, y_pred_rf = train_and_evaluate(rf_model, "Random Forest Regression", X_train1, y_train, X_test1, y_test)
gb_model, y_pred_gb = train_and_evaluate(gb_model, "Gradient Boosting Regression", X_train2, y_train, X_test2, y_test)
xgb_model, y_pred_xgb = train_and_evaluate(xgb_model, "XGBoost Regression", X_train3, y_train, X_test3, y_test)
cb_model, y_pred_cb = train_and_evaluate(cb_model, "CatBoost Regression", X_train4, y_train, X_test4, y_test)
Random Forest Regression Accuracy: 0.68
Random Forest Regression RMSE: 437.55

Gradient Boosting Regression Accuracy: 0.68
Gradient Boosting Regression RMSE: 434.28

XGBoost Regression Accuracy: 0.67
XGBoost Regression RMSE: 445.02

CatBoost Regression Accuracy: 0.64
CatBoost Regression RMSE: 461.87
These metrics don't appear to differ much from the accuracy and RMSE of linear and Lasso regression. However, they are likely more reliable, as these ensemble models offer greater robustness against overfitting. We can visualize the performance of the models with scatter plots to better understand their regression characteristics.
data = pd.DataFrame({'Actual': y_test,
'Model1': y_pred_rf,
'Model2': y_pred_gb,
'Model3': y_pred_xgb,
'Model4': y_pred_cb})
fig, axes = plt.subplots(2, 2, figsize=(9, 8))
sns.regplot(ax=axes[0, 0], data=data, x='Actual', y='Model1', scatter_kws={'alpha': 0.3}, color = "Black")
axes[0, 0].set_title('Fig. 10. Random Forest', fontweight='bold')
sns.regplot(ax=axes[0, 1], data=data, x='Actual', y='Model2', scatter_kws={'alpha': 0.3}, color = "Orange")
axes[0, 1].set_title('Fig. 11. Gradient Boosting', fontweight='bold')
sns.regplot(ax=axes[1, 0], data=data, x='Actual', y='Model3', scatter_kws={'alpha': 0.3}, color = "#121bc7")
axes[1, 0].set_title('Fig. 12. XGBoost', fontweight='bold')
sns.regplot(ax=axes[1, 1], data=data, x='Actual', y='Model4', scatter_kws={'alpha': 0.3}, color="Red")
axes[1, 1].set_title('Fig. 13. CatBoost', fontweight='bold')
plt.tight_layout()
plt.show()
The ML models above have been hyperparameter-tuned as far as we could take them. Out of curiosity, we explored whether further optimization of individual models or an ensemble approach could yield better results. After several iterations, we managed to improve on our Gradient Boosting model, which had already given the highest accuracy and lowest RMSE among the tested models. By wrapping it in an AdaBoost ensemble with carefully selected parameters, we achieved our best results.
adaboost_regressor = AdaBoostRegressor(base_estimator=gb_model, n_estimators=90, learning_rate=0.025, random_state=42)
adaboost_regressor.fit(X_train2, y_train)
y_pred_adaboost = adaboost_regressor.predict(X_test2)
r2_adaboost = r2_score(y_test, y_pred_adaboost)
mse_adaboost = mean_squared_error(y_test, y_pred_adaboost)
print(f"Boosting Regression Accuracy: {r2_adaboost:.2f}")
print(f"Boosting Regression MSE: {mse_adaboost**(1/2):.2f}")
Boosting Regression Accuracy: 0.69
Boosting Regression RMSE: 430.36
data2 = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred_adaboost})
plt.figure(figsize=(5, 4))
ax = sns.regplot(data=data2, x='Actual', y='Predicted', scatter_kws={'alpha': 0.3})
ax.set_title('Fig. 14. Boosting', fontweight='bold')
plt.show()
Our final model achieved an accuracy (R²) of 0.69 and an RMSE of 430.36 kg/ha. While not reaching 0.86, we consider this a modest success given the reliance on satellite data and a limited ground-truth dataset. However, the model isn't optimal. For instance, the histogram of residuals reveals an approximately normal distribution with slight skewness, but the peak isn't centered on zero. The residuals' mean is also not close to zero, indicating model bias. These observations suggest room for further improvement in our model.
data2['Residuals'] = data2['Actual'] - data2['Predicted']
plt.figure(figsize=(5, 4))
ax = sns.histplot(data=data2, x='Residuals', kde=True, color='Black')
ax.set_title('Fig. 15. Residuals Distribution', fontweight='bold')
plt.show()
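As a quick numeric companion to the histogram (a diagnostic we add here for illustration), we can print the residual mean and skewness; scipy.stats is already available as a dependency.
from scipy.stats import skew

residuals = data2['Residuals']
print(f"Residual mean: {residuals.mean():.2f} kg/ha")   # a non-zero mean indicates bias
print(f"Residual std : {residuals.std():.2f} kg/ha")
print(f"Residual skew: {skew(residuals):.2f}")          # non-zero skew indicates asymmetry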
Insufficient Validation Data: To properly assess overfitting or underfitting, we would need a validation set to evaluate model performance on unseen data. Interestingly, Linear Regression achieved a relatively high R-squared, which might itself be a sign of overfitting: without RFE and with all features included, the model may be capturing noise.
Feature Engineering Improvements: As new data and research become available, there's always potential for enhancement in feature engineering.
Exploring Alternative Regression Techniques: Although we experimented with several models, others may outperform those tested. For instance, neural networks can effectively capture complex relationships in a dataset (see the sketch below).
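As a rough sketch of the last two points (cross-validation gives more reliable estimates than a single split, and a neural network is one alternative regressor worth trying), the snippet below uses scikit-learn's cross_val_score with our boosted model and with an MLPRegressor; the hidden-layer sizes and iteration limit are arbitrary starting points, not tuned values.
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor

# 5-fold cross-validated R^2 for the boosted Gradient Boosting model
cv_scores = cross_val_score(adaboost_regressor, X2, y, cv=5, scoring='r2')
print(f"Boosted GB cross-validated R2: {cv_scores.mean():.2f} +/- {cv_scores.std():.2f}")

# A simple neural-network baseline; hyperparameters are arbitrary starting points
mlp = make_pipeline(StandardScaler(),
                    MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=2000, random_state=42))
mlp_scores = cross_val_score(mlp, X2, y, cv=5, scoring='r2')
print(f"MLP cross-validated R2: {mlp_scores.mean():.2f} +/- {mlp_scores.std():.2f}")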
This project was both challenging and intriguing to work on. Each section required extensive research, whether it was determining how to visualize the Cloud Masking algorithm, examining RGB images from time series data, or identifying the most suitable method to address missing values (Cubic Spline Interpolation). Overall, this was an excellent learning experience and a genuinely enjoyable endeavor.
In the end, we built on each member's individual work, together identifying the bottlenecks, the improvements needed, and the conclusions of the project.