pip install xgboost catboost

Requirement already satisfied: xgboost in /srv/conda/envs/notebook/lib/python3.10/site-packages (1.7.5)
Requirement already satisfied: catboost in /srv/conda/envs/notebook/lib/python3.10/site-packages (1.1.1)
Requirement already satisfied: scipy in /srv/conda/envs/notebook/lib/python3.10/site-packages (from xgboost) (1.9.1)
Requirement already satisfied: numpy in /srv/conda/envs/notebook/lib/python3.10/site-packages (from xgboost) (1.22.4)
Requirement already satisfied: matplotlib in /srv/conda/envs/notebook/lib/python3.10/site-packages (from catboost) (3.5.3)
Requirement already satisfied: plotly in /srv/conda/envs/notebook/lib/python3.10/site-packages (from catboost) (5.14.1)
Requirement already satisfied: six in /srv/conda/envs/notebook/lib/python3.10/site-packages (from catboost) (1.16.0)
Requirement already satisfied: pandas>=0.24.0 in /srv/conda/envs/notebook/lib/python3.10/site-packages (from catboost) (1.4.4)
Requirement already satisfied: graphviz in /srv/conda/envs/notebook/lib/python3.10/site-packages (from catboost) (0.20.1)
Requirement already satisfied: python-dateutil>=2.8.1 in /srv/conda/envs/notebook/lib/python3.10/site-packages (from pandas>=0.24.0->catboost) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /srv/conda/envs/notebook/lib/python3.10/site-packages (from pandas>=0.24.0->catboost) (2022.2.1)
Requirement already satisfied: cycler>=0.10 in /srv/conda/envs/notebook/lib/python3.10/site-packages (from matplotlib->catboost) (0.11.0)
Requirement already satisfied: fonttools>=4.22.0 in /srv/conda/envs/notebook/lib/python3.10/site-packages (from matplotlib->catboost) (4.37.2)
Requirement already satisfied: pyparsing>=2.2.1 in /srv/conda/envs/notebook/lib/python3.10/site-packages (from matplotlib->catboost) (3.0.9)
Requirement already satisfied: packaging>=20.0 in /srv/conda/envs/notebook/lib/python3.10/site-packages (from matplotlib->catboost) (21.3)
Requirement already satisfied: kiwisolver>=1.0.1 in /srv/conda/envs/notebook/lib/python3.10/site-packages (from matplotlib->catboost) (1.4.4)
Requirement already satisfied: pillow>=6.2.0 in /srv/conda/envs/notebook/lib/python3.10/site-packages (from matplotlib->catboost) (9.2.0)
Requirement already satisfied: tenacity>=6.2.0 in /srv/conda/envs/notebook/lib/python3.10/site-packages (from plotly->catboost) (8.0.1)
Note: you may need to restart the kernel to use updated packages.


# Supress Warnings
import warnings
warnings.filterwarnings('ignore')

# Visualization
import ipyleaflet
import matplotlib.pyplot as plt
from IPython.display import Image
import seaborn as sns

# Data Science
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from scipy.interpolate import interp1d
from datetime import datetime, timedelta

# Machine Learning
import graphviz
import xgboost as xgb
from xgboost import XGBRegressor
from catboost import CatBoostRegressor
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, VotingRegressor, AdaBoostRegressor
from sklearn.model_selection import learning_curve
from sklearn.feature_selection import RFECV

# Planetary Computer Tools
import pystac
import pystac_client
import odc
from pystac_client import Client
from pystac.extensions.eo import EOExtension as eo
from odc.stac import stac_load
import planetary_computer as pc

# Import common GIS tools
import pandas as pd
import numpy as np
import xarray as xr
import matplotlib.pyplot as plt
import rasterio.features
import rioxarray as rio
import xrspatial.multispectral as ms
from contextlib import redirect_stdout
import os
import sys

# Pass your API key here
pc.settings.set_subscription_key('71113eb526914deba51505c8b5463b88')


locations = pd.read_csv("Crop_Yield_Data.csv") # Ground Truth Data
weather_data = pd.read_csv('Chau Phu, Vietnam 2021-10-01 to 2022-09-01.csv')
weather_data1 = pd.read_csv('Chau Thanh, Vietnam 2021-10-01 to 2022-09-01.csv')
weather_data2 = pd.read_csv('Thoai Son 2021-10-01 to 2022-09-01.csv')

locations.head()


locations.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 557 entries, 0 to 556
Data columns (total 8 columns):
 #   Column                                          Non-Null Count  Dtype  
---  ------                                          --------------  -----  
 0   District                                        557 non-null    object 
 1   Latitude                                        557 non-null    float64
 2   Longitude                                       557 non-null    float64
 3   Season(SA = Summer Autumn, WS = Winter Spring)  557 non-null    object 
 4   Rice Crop Intensity(D=Double, T=Triple)         557 non-null    object 
 5   Date of Harvest                                 557 non-null    object 
 6   Field size (ha)                                 557 non-null    float64
 7   Rice Yield (kg/ha)                              557 non-null    int64  
dtypes: float64(3), int64(1), object(4)
memory usage: 34.9+ KB


weather_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 336 entries, 0 to 335
Data columns (total 33 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   name              336 non-null    object 
 1   datetime          336 non-null    object 
 2   tempmax           336 non-null    float64
 3   tempmin           336 non-null    float64
 4   temp              336 non-null    float64
 5   feelslikemax      336 non-null    float64
 6   feelslikemin      336 non-null    float64
 7   feelslike         336 non-null    float64
 8   dew               336 non-null    float64
 9   humidity          336 non-null    float64
 10  precip            336 non-null    float64
 11  precipprob        336 non-null    int64  
 12  precipcover       336 non-null    float64
 13  preciptype        274 non-null    object 
 14  snow              235 non-null    float64
 15  snowdepth         235 non-null    float64
 16  windgust          238 non-null    float64
 17  windspeed         336 non-null    float64
 18  winddir           336 non-null    float64
 19  sealevelpressure  336 non-null    float64
 20  cloudcover        336 non-null    float64
 21  visibility        336 non-null    float64
 22  solarradiation    336 non-null    float64
 23  solarenergy       336 non-null    float64
 24  uvindex           336 non-null    int64  
 25  severerisk        235 non-null    float64
 26  sunrise           336 non-null    object 
 27  sunset            336 non-null    object 
 28  moonphase         336 non-null    float64
 29  conditions        336 non-null    object 
 30  description       336 non-null    object 
 31  icon              336 non-null    object 
 32  stations          336 non-null    object 
dtypes: float64(22), int64(2), object(9)
memory usage: 86.8+ KB


# Sample Rice Crop Field in An Giang, Vietnam
lat_long = (10.4391, 105.3338) # Lat-Lon centroid location
box_size_deg1 = 0.10 # Surrounding box in degrees

# Calculate the Lat-Lon bounding box region
min_lon = lat_long[1]-box_size_deg1/2
min_lat = lat_long[0]-box_size_deg1/2
max_lon = lat_long[1]+box_size_deg1/2
max_lat = lat_long[0]+box_size_deg1/2
bounds = (min_lon, min_lat, max_lon, max_lat)

# Define the time window
time_window="2021-12-01/2022-04-30"

# Define the pixel resolution for the final product
# Define the scale according to our selected crs, so we will use degrees
resolution1 = 30  # meters per pixel 
scale1 = resolution1 / 111320.0 # degrees per pixel for CRS:4326


stac = pystac_client.Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")
search = stac.search(
    collections=["landsat-c2-l2"], 
    bbox=bounds, 
    datetime=time_window,
    query={"platform": {"in": ["landsat-8", "landsat-9"]},},
)
items = list(search.get_all_items())
print('This is the number of scenes that touch our region:',len(items))

This is the number of scenes that touch our region: 31


xx = stac_load(
    items,
    bands=["red", "green", "blue", "nir08", "qa_pixel"],
    crs="EPSG:4326", # Latitude-Longitude
    resolution=scale1, # Degrees
    chunks={"x": 2048, "y": 2048},
    patch_url=pc.sign,
    bbox=bounds
)

# Apply scaling and offsets for Landsat Collection-2 (reference below) to the spectral bands ONLY
# https://planetarycomputer.microsoft.com/dataset/landsat-c2-l2
xx['red'] = (xx['red']*0.0000275)-0.2
xx['green'] = (xx['green']*0.0000275)-0.2
xx['blue'] = (xx['blue']*0.0000275)-0.2
xx['nir08'] = (xx['nir08']*0.0000275)-0.2

# View the dimensions of our XARRAY and the variables
display(xx)

<xarray.Dataset>
Dimensions:      (latitude: 372, longitude: 372, time: 31)
Coordinates:
  * latitude     (latitude) float64 10.49 10.49 10.49 ... 10.39 10.39 10.39
  * longitude    (longitude) float64 105.3 105.3 105.3 ... 105.4 105.4 105.4
    spatial_ref  int32 4326
  * time         (time) datetime64[ns] 2021-12-01T03:20:47.203656 ... 2022-04...
Data variables:
    red          (time, latitude, longitude) float64 dask.array<chunksize=(1, 372, 372), meta=np.ndarray>
    green        (time, latitude, longitude) float64 dask.array<chunksize=(1, 372, 372), meta=np.ndarray>
    blue         (time, latitude, longitude) float64 dask.array<chunksize=(1, 372, 372), meta=np.ndarray>
    nir08        (time, latitude, longitude) float64 dask.array<chunksize=(1, 372, 372), meta=np.ndarray>
    qa_pixel     (time, latitude, longitude) uint16 dask.array<chunksize=(1, 372, 372), meta=np.ndarray>

array([10.489086, 10.488816, 10.488547, ..., 10.389642, 10.389373, 10.389103])

array([105.283911, 105.284181, 105.28445 , ..., 105.383354, 105.383624,
       105.383893])

array(4326, dtype=int32)

array(['2021-12-01T03:20:47.203656000', '2021-12-09T03:20:46.977108000',
       '2021-12-10T03:14:36.030555000', '2021-12-18T03:14:31.113564000',
       '2021-12-25T03:20:42.957971000', '2021-12-26T03:14:31.822391000',
       '2022-01-02T03:20:40.283448000', '2022-01-03T03:14:31.372292000',
       '2022-01-10T03:20:40.399282000', '2022-01-11T03:14:28.733155000',
       '2022-01-18T03:20:37.922626000', '2022-01-19T03:14:30.003400000',
       '2022-01-26T03:20:41.842180000', '2022-01-27T03:14:23.128333000',
       '2022-02-03T03:20:33.610922000', '2022-02-04T03:14:30.653162000',
       '2022-02-11T03:20:39.348336000', '2022-02-12T03:14:20.499449000',
       '2022-02-19T03:20:27.771564000', '2022-02-20T03:14:23.439714000',
       '2022-02-27T03:20:28.774876000', '2022-03-07T03:20:25.278657000',
       '2022-03-16T03:14:10.125549000', '2022-03-23T03:20:15.728245000',
       '2022-03-24T03:14:12.685366000', '2022-04-01T03:13:58.770662000',
       '2022-04-08T03:20:13.150736000', '2022-04-09T03:14:09.306474000',
       '2022-04-16T03:20:15.559889000', '2022-04-24T03:20:13.200949000',
       '2022-04-25T03:13:59.967091000'], dtype='datetime64[ns]')


plot_xx = xx[["red","green","blue"]].to_array()
plot = plot_xx.plot.imshow(col='time', col_wrap=4, robust=True, vmin=0, vmax=0.3)
plot.fig.suptitle("Fig. 1. Regions Captured from Landsat", fontsize=15, fontweight='bold')
plot.fig.subplots_adjust(top=0.96)
plt.show()


# Select a time slice to view a simple RGB image and the cloud mask
# See the XARRAY dimensions above for the number of time slices (starts at 0)

# Slice #6 - Mostly Clear
# Slice #24 - Scattered Clouds
# Slice #7 - Very Cloudy

time_slice = 24


# Plot and RGB Real Color Image for a single date
fig, ax = plt.subplots(figsize=(7, 5))
xx.isel(time=time_slice)[["red", "green", "blue"]].to_array().plot.imshow(robust=True, ax=ax, vmin=0, vmax=0.3)
ax.set_title("Fig. 2. RGB Real Color", fontweight='bold')
ax.axis('off')
plt.show()


# To mask the pixels and find clouds or water, it is best to use the bit values of the 16-bit qa_pixel flag
# See the website above for a nice explanation of the process

bit_flags = {
            'fill': 1<<0,
            'dilated_cloud': 1<<1,
            'cirrus': 1<<2, 
            'cloud': 1<<3,
            'shadow': 1<<4, 
            'snow': 1<<5, 
            'clear': 1<<6,
            'water': 1<<7
}


# Create a function that will mask pixels with a given type

def get_mask(mask, flags_list):
    
    # Create the result mask filled with zeros and the same shape as the mask
    final_mask = np.zeros_like(mask)
    
    # Loop through the flags  
    for flag in flags_list:
        
        # get the mask for each flag
        flag_mask = np.bitwise_and(mask, bit_flags[flag])
        
        # add it to the final flag
        final_mask = final_mask | flag_mask
    
    return final_mask > 0


# Pick a single time slice to view a mask with clouds and water
sample_xx = xx.isel(time=time_slice)


# Find the pixels that are no data (fill), clouds, cloud shadows, or water
my_mask = get_mask(sample_xx['qa_pixel'],
                   ['fill', 'dilated_cloud', 'cirrus', 
                    'cloud', 'shadow', 'water'])


# Show only the mask (Yellow) with valid data in Purple
plt.figure(figsize=(7,5))
plt.imshow(my_mask)
plt.title("Fig. 3. Cloud / Shadows / Water Mask > YELLOW", fontweight='bold')
plt.axis('off')
plt.show()


# Create an RGB function that will display the mask over the background RGB image

def plot_masked_rgb(red, green, blue, mask, color_mask=(1, 0, 0), transparency=0.5, brightness=2):
    
    # to improve our visualization, we will increase the brightness of our values
    red = red / red.max() * brightness
    green = green / green.max() * brightness
    blue = blue / blue.max() * brightness
    
    red = np.where(mask==True, red*transparency+color_mask[0]*(1-transparency), red)
    green = np.where(mask==True, green*transparency+color_mask[1]*(1-transparency), green)
    blue = np.where(mask==True, blue*transparency+color_mask[2]*(1-transparency), blue)
    
    rgb = np.stack([red, green, blue], axis=2)
    
    return rgb

rgb = plot_masked_rgb(sample_xx['red'], sample_xx['green'], sample_xx['blue'], my_mask, color_mask=(1, 0, 1), transparency=0.2, brightness=3)


# This is a nice image that shows the clouds and water pixels (Purple) among clear land pixels
plt.figure(figsize=(7,5))
plt.imshow(rgb)
plt.title("Fig. 4. Cloud / Shadows / Water Mask > MAGENTA", fontweight='bold')
plt.axis('off')
plt.show()

Clipping input data to the valid range for imshow with RGB data ([0..1] for floats or [0..255] for integers).


# Calculate the mask for the entire xarray (all time slices)
full_mask = get_mask(xx['qa_pixel'], ['fill', 'dilated_cloud', 'cirrus', 'cloud', 'shadow', 'water'])

# Create a "clean" dataset with the mask applied 
cleaned_data = xx.where(~full_mask)

# Calculate the mean of the data across the sample region and then NDVI
# Perform this calculation for the unfiltered and cloud-filtered (clean) datasets
mean_unfiltered = xx.mean(dim=['longitude','latitude']).compute()
ndvi_mean = (mean_unfiltered.nir08-mean_unfiltered.red)/(mean_unfiltered.nir08+mean_unfiltered.red)
mean_clean = cleaned_data.mean(dim=['longitude','latitude']).compute()
ndvi_mean_clean = (mean_clean.nir08-mean_clean.red)/(mean_clean.nir08+mean_clean.red)


fig = plt.figure(figsize=(7, 5))
ndvi_mean_clean.plot(color='green',marker='o',markersize=4,linewidth=2, label="Filtered = Clouds and Water Removed")
ndvi_mean.plot(color='blue',marker='o',markersize=4,linewidth=2, label="All Pixels = Clouds and Water Included")
plt.title("Fig. 5. Mean NDVI (Vegetation Index)", fontweight='bold')
plt.xlabel("Date")
plt.ylabel("NDVI")
plt.ylim(-0.1,1.0)
plt.legend(loc="upper right", markerscale=2., scatterpoints=1, fontsize=10)
plt.show()


# Plot an NDVI image for a single date with few clouds
# We will select image (index=6) on January 2, 2022. Notice how the water is masked out.

fig = plt.figure(figsize=(7, 5))
ndvi_image = (cleaned_data.nir08-cleaned_data.red)/(cleaned_data.nir08+cleaned_data.red)
ndvi_image.isel(time=6).plot(vmin=0.0, vmax=0.8, cmap="RdYlGn")
plt.title("Fig. 6. NDVI", fontweight='bold')
plt.axis('off')
plt.show()


box_size_deg2 = 0.0004 # Surrounding box in degrees, yields approximately 5x5 pixel region
resolution2 = 10  # meters per pixel 
scale2 = resolution2 / 111320.0 # degrees per pixel for crs=4326 

catalog = pystac_client.Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")
search = catalog.search(collections=["sentinel-1-rtc"], bbox=bounds, datetime=time_window)
items = list(search.get_all_items())
print('This is the number of scenes that touch our region:',len(items))

This is the number of scenes that touch our region: 27


# Load the data using Open Data Cube
data = stac_load(items,bands=["vv", "vh"], patch_url=pc.sign, bbox=bounds, crs="EPSG:4326", resolution=scale2)

# View the details of our xarray dataset
# The X and Y dimensions tell us the pixel dimensions of our bounding box
# The "time" variable is the number of scenes that touch our region
data

<xarray.Dataset>
Dimensions:      (latitude: 1114, longitude: 1114, time: 27)
Coordinates:
  * latitude     (latitude) float64 10.49 10.49 10.49 ... 10.39 10.39 10.39
  * longitude    (longitude) float64 105.3 105.3 105.3 ... 105.4 105.4 105.4
    spatial_ref  int32 4326
  * time         (time) datetime64[ns] 2021-12-04T22:46:07.919581 ... 2022-04...
Data variables:
    vv           (time, latitude, longitude) float32 0.02417 0.01493 ... 0.272
    vh           (time, latitude, longitude) float32 0.004146 ... 0.01252

array([10.489086, 10.488996, 10.488906, ..., 10.389283, 10.389193, 10.389103])

array([105.283821, 105.283911, 105.284001, ..., 105.383624, 105.383714,
       105.383803])

array(4326, dtype=int32)

array(['2021-12-04T22:46:07.919581000', '2021-12-05T11:11:54.609425000',
       '2021-12-10T22:45:35.316746000', '2021-12-16T22:46:07.511215000',
       '2021-12-17T11:11:54.196354000', '2021-12-22T22:45:34.433936000',
       '2021-12-28T22:46:06.761481000', '2021-12-29T11:11:53.438580000',
       '2022-01-09T22:46:06.347730000', '2022-01-10T11:11:53.057355000',
       '2022-01-21T22:46:05.657153000', '2022-01-22T11:11:52.377922000',
       '2022-02-02T22:46:04.929540000', '2022-02-03T11:11:51.699083000',
       '2022-02-14T22:46:05.071585000', '2022-02-15T11:11:51.767747000',
       '2022-02-26T22:46:04.969244000', '2022-03-10T22:46:04.925705000',
       '2022-03-11T11:11:51.691272000', '2022-03-22T22:46:05.287764000',
       '2022-03-23T11:11:52.064532000', '2022-04-03T22:46:05.477503000',
       '2022-04-04T11:11:52.234391000', '2022-04-15T22:46:05.644861000',
       '2022-04-16T11:11:52.470632000', '2022-04-27T22:46:06.296981000',
       '2022-04-28T11:11:53.135539000'], dtype='datetime64[ns]')

array([[[0.02417101, 0.01493058, 0.01866966, ..., 0.1351117 ,
         0.13312186, 0.07470179],
        [0.01556157, 0.01662732, 0.01762288, ..., 0.09577081,
         0.10823271, 0.08096705],
        [0.01048668, 0.0134217 , 0.0117636 , ..., 0.10771083,
         0.08537243, 0.08583598],
        ...,
        [0.07889048, 0.05197918, 0.09618551, ..., 0.01098593,
         0.01214612, 0.01412993],
        [0.06246216, 0.04127437, 0.07787105, ..., 0.01052941,
         0.00976794, 0.00828947],
        [0.12281648, 0.0785541 , 0.07129239, ..., 0.00410046,
         0.00807947, 0.00576253]],

       [[0.01406382, 0.0143489 , 0.01710286, ..., 0.09546202,
         0.05939918, 0.07858594],
        [0.02011441, 0.02592376, 0.02726987, ..., 0.19192752,
         0.06553745, 0.07993296],
        [0.01974167, 0.02631255, 0.02614574, ..., 0.16457804,
         0.04583229, 0.03341462],
...
        [0.2212939 , 0.16180752, 0.12709288, ..., 0.29225406,
         0.24925222, 0.26113272],
        [0.31790692, 0.19440225, 0.08606878, ..., 0.31218913,
         0.51343554, 0.48167843],
        [0.43326014, 0.32252944, 0.16411823, ..., 0.45604324,
         0.58543694, 0.48651794]],

       [[0.16879988, 0.22080727, 0.45312023, ..., 0.06394066,
         0.0531206 , 0.06490773],
        [0.17604157, 0.25631052, 0.43659902, ..., 0.1027702 ,
         0.07629566, 0.09150371],
        [0.14171255, 0.26319695, 0.4560728 , ..., 0.11853301,
         0.11126872, 0.2265447 ],
        ...,
        [0.15140437, 0.2989662 , 0.37947145, ..., 0.1524738 ,
         0.20052004, 0.17682123],
        [0.17726156, 0.2191192 , 0.28943285, ..., 0.19817811,
         0.2196678 , 0.21833059],
        [0.24807045, 0.21953791, 0.3113074 , ..., 0.37869424,
         0.3726446 , 0.2720224 ]]], dtype=float32)

array([[[0.00414605, 0.00348855, 0.00265869, ..., 0.07254129,
         0.1134388 , 0.12465546],
        [0.00494762, 0.00205696, 0.00172492, ..., 0.06132611,
         0.07009073, 0.06592007],
        [0.00537293, 0.00438383, 0.00350324, ..., 0.04859182,
         0.03122397, 0.02754842],
        ...,
        [0.01847825, 0.02475439, 0.02264003, ..., 0.00810464,
         0.00908654, 0.00665346],
        [0.01511999, 0.03052613, 0.03599842, ..., 0.00781537,
         0.00840591, 0.00574743],
        [0.01947351, 0.02331621, 0.03448671, ..., 0.00533757,
         0.00448371, 0.00601815]],

       [[0.00823141, 0.00515232, 0.00490454, ..., 0.04391994,
         0.0363352 , 0.0267037 ],
        [0.00449162, 0.00264303, 0.00315097, ..., 0.04883291,
         0.04464468, 0.04391601],
        [0.00405031, 0.00365794, 0.00366866, ..., 0.05824593,
         0.06907108, 0.07108819],
...
        [0.01494024, 0.01399606, 0.03577208, ..., 0.01380739,
         0.01006532, 0.01352517],
        [0.01282636, 0.01618785, 0.0435005 , ..., 0.03398733,
         0.03114486, 0.03133415],
        [0.03378278, 0.01576377, 0.02611222, ..., 0.04463165,
         0.03879599, 0.0292155 ]],

       [[0.02383078, 0.01609869, 0.02356882, ..., 0.03312046,
         0.01375357, 0.0185879 ],
        [0.02813808, 0.02384822, 0.03154745, ..., 0.02769672,
         0.01905811, 0.0353126 ],
        [0.02312637, 0.02367629, 0.02685913, ..., 0.02781266,
         0.03327138, 0.03130198],
        ...,
        [0.03032953, 0.04036504, 0.05917789, ..., 0.01958033,
         0.02263096, 0.01713172],
        [0.05123728, 0.07310684, 0.10459596, ..., 0.02186244,
         0.02186274, 0.01526174],
        [0.06402529, 0.09814875, 0.14598268, ..., 0.01328868,
         0.01499531, 0.01251877]]], dtype=float32)


# Assuming that the loaded data is in 'data' variable, let's visualize the vv and vh bands
vv_band = data.vv
vh_band = data.vh

# You can select a specific time slice to visualize, e.g., the first one:
time_index = 0
vv_slice = vv_band.isel(time=time_index)
vh_slice = vh_band.isel(time=time_index)

# Plot vv polarization
plt.figure(figsize=(7, 5))
vv_slice.plot(cmap='gray', robust=True)
plt.title('Fig. 7. Sentinel-1 VV Polarization', fontweight='bold')
plt.show()

# Plot vh polarization
plt.figure(figsize=(7, 5))
vh_slice.plot(cmap='gray', robust=True)
plt.title('Fig. 8. Sentinel-1 VH Polarization', fontweight='bold')
plt.show()


# Load the data using Open Data Cube for each location
#EVI 2 Coeeficients selected from current literature
G = 2.5
C1 = 6
C2 = 7.5
L = 1

latitudes = []
longitudes = []
albedo_values = []
ndvi_landsat = []
evi2_landsat = []
savi_landsat = []
ndwi_landsat = []

for index, row in locations.iterrows():
    lat = row['Latitude']
    lon = row['Longitude']
    toh = row['Date of Harvest']
    date_object = datetime.strptime(toh, "%d-%m-%Y")
    
    trial_date = date_object.strftime("%Y-%m-%d")
    time_delta = timedelta(days=8)     
    trial_datetime = datetime.strptime(trial_date,"%Y-%m-%d")
    start_date = (trial_datetime - time_delta)
    end_date = (trial_datetime + time_delta)
    time_window3 = f"{start_date}/{end_date}"

    #Load the data using Open Data Cube
    bounds = (lon - box_size_deg1/2, lat - box_size_deg1/2, lon + box_size_deg1/2, lat + box_size_deg1/2)

    #Load the data using Open Data Cube
    catalog = pystac_client.Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")
    search = catalog.search(collections=["landsat-c2-l2"], bbox=bounds, datetime=time_window3, query={"platform": {"in": ["landsat-8", "landsat-9"]}})
    items = list(search.get_all_items())
    
    xx = stac_load(
    items,
    bands=["red", "green", "blue", "nir08","swir16","qa_pixel"],
    crs="EPSG:4326", 
    resolution=scale1,
    chunks={"x": 2048, "y": 2048},
    patch_url=pc.sign,
    bbox=bounds
)
    
    # Apply scaling and offsets for Landsat Collection-2 (reference below) to the spectral bands ONLY
    # https://planetarycomputer.microsoft.com/dataset/landsat-c2-l2
    xx['red'] = (xx['red']*0.0000275)-0.2
    xx['green'] = (xx['green']*0.0000275)-0.2
    xx['blue'] = (xx['blue']*0.0000275)-0.2
    xx['nir08'] = (xx['nir08']*0.0000275)-0.2
    xx['swir16'] = (xx['swir16']*0.0000275)-0.2
 

    # Calculate the mask for the entire xarray (all time slices)
    full_mask = get_mask(xx['qa_pixel'], ['fill', 'dilated_cloud', 'cirrus', 'cloud', 'shadow', 'water'])

    # Create a "clean" dataset with the mask applied 
    cleaned_data = xx.where(~full_mask)
   
    mean_clean = cleaned_data.mean(dim=['longitude','latitude']).compute()
    ndvi_mean_clean = (mean_clean.nir08-mean_clean.red)/(mean_clean.nir08+mean_clean.red)
    evi2_mean_clean = G * ((mean_clean.nir08 - mean_clean.red) / (mean_clean.nir08 + C1 * mean_clean.red - C2 * mean_clean.blue + L))
    savi_mean_clean = ((mean_clean.nir08 - mean_clean.red) / (mean_clean.nir08 + mean_clean.red + 0.5)) * 1.5
    ndwi_mean_clean = (mean_clean.green - mean_clean.nir08) / (mean_clean.green + mean_clean.nir08)
    albedo = 0.356 * mean_clean.blue + 0.130 * mean_clean.green + 0.373 * mean_clean.red + 0.085 * mean_clean.nir08 + 0.072 * mean_clean.swir16 +0.0018 # constant term
   
	# Appending the latitude, longitude, NDVI, NDWI, EVI 2, SAVI and Albedo values to their respective lists
    latitudes.append(lat)
    longitudes.append(lon)
    albedo_values.append(albedo)
    ndvi_landsat.append(ndvi_mean_clean)
    evi2_landsat.append(evi2_mean_clean)
    savi_landsat.append(savi_mean_clean)
    ndwi_landsat.append(ndwi_mean_clean)


# Load the data using Open Data Cube for each location
lst_landsat = []

for index, row in locations.iterrows():
    lat = row['Latitude']
    lon = row['Longitude']
    toh = row['Date of Harvest']
    date_object = datetime.strptime(toh, "%d-%m-%Y")
    
    trial_date = date_object.strftime("%Y-%m-%d")
    time_delta = timedelta(days=8)     # Taking for 8 originally with 8 up and 8 down
    trial_datetime = datetime.strptime(trial_date,"%Y-%m-%d")
    start_date = (trial_datetime - time_delta)
    end_date = (trial_datetime + time_delta)
    time_window3 = f"{start_date}/{end_date}"

    #Load the data using Open Data Cube
    bounds = (lon - box_size_deg1/2, lat - box_size_deg1/2, lon + box_size_deg1/2, lat + box_size_deg1/2)

    try:
        #Load the data using Open Data Cube
        catalog = pystac_client.Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")
        search = catalog.search(collections=["landsat-c2-l2"], bbox=bounds, datetime=time_window3, query={"platform": {"in": ["landsat-9"]}})
        items = list(search.get_all_items())

        #data = xr.open_rasterio(items, chunks={"x": 2048, "y": 2048}, crs="EPSG:4326", resolution=scale, patch_url=sign)
        xx = stac_load(
        items,
        bands=["red", "green", "blue", "nir08","swir16","qa_pixel", "lwir11"],
        crs="EPSG:4326", # Latitude-Longitude
        resolution=scale1, # Degrees
        chunks={"x": 2048, "y": 2048},
        patch_url=pc.sign,
        bbox=bounds
    )

        # Apply scaling and offsets for Landsat Collection-2 (reference below) to the spectral bands ONLY
        # https://planetarycomputer.microsoft.com/dataset/landsat-c2-l2
        xx['red'] = (xx['red']*0.0000275)-0.2
        xx['green'] = (xx['green']*0.0000275)-0.2
        xx['blue'] = (xx['blue']*0.0000275)-0.2
        xx['nir08'] = (xx['nir08']*0.0000275)-0.2
        xx['swir16'] = (xx['swir16']*0.0000275)-0.2
        xx['lwir11'] = (xx['lwir11'] * 0.0003342) + 0.1

        full_mask = get_mask(xx['qa_pixel'], ['fill', 'dilated_cloud', 'cirrus', 'cloud', 'shadow', 'water'])
        cleaned_data = xx.where(~full_mask)
        mean_clean = cleaned_data.mean(dim=['longitude','latitude']).compute()
        ndvi_mean_clean = (mean_clean.nir08-mean_clean.red)/(mean_clean.nir08+mean_clean.red)

        # Constants adjusted to calculated temperature brightness from Landsat 9
        K1_CONSTANT_BAND_11 = 774.8853    
        K2_CONSTANT_BAND_11 = 1321.0789
        brightness_temperature = K2_CONSTANT_BAND_11 / np.log((K1_CONSTANT_BAND_11 / xx['lwir11']) + 1)

        # Emissivity values
        emissivity_vegetation = 0.99
        emissivity_soil = 0.97

        # Estimate emissivity using the NDVI threshold method
        ndvi_clean = (cleaned_data.nir08 - cleaned_data.red) / (cleaned_data.nir08 + cleaned_data.red)
        emissivity = emissivity_soil + (emissivity_vegetation - emissivity_soil) * np.where(ndvi_clean > 0.5, 1, 0)

        # Calculate LST in Celsius
        lst_celsius = (brightness_temperature / (1 + (0.00115 * brightness_temperature / 1.4388) * np.log(emissivity))) - 273.15
        cleaned_lst = lst_celsius.where(~full_mask)
        mean_lst = cleaned_lst.mean(dim=['longitude', 'latitude']).compute()
        lst_landsat.append(mean_lst)
    
    except:
        print(f"No data found for location {lat}, {lon}")
        continue

No data found for location 10.387123, 105.119326
No data found for location 10.426536, 105.115181
No data found for location 10.419552, 105.116409
No data found for location 10.403175, 105.123726
No data found for location 10.399444, 105.118528
No data found for location 10.42324, 105.126544
No data found for location 10.429339, 105.125493
No data found for location 10.435839, 105.132981


def create_and_insert_dataarray(index, date_of_harvest, lst_landsat):
    doh_dt = datetime.strptime(date_of_harvest, '%d-%m-%Y')
    ta = xr.DataArray(np.full(2, np.nan), dims='time', coords={'time': [doh_dt.strftime('%Y-%m-%dT03:20:15.057547'), doh_dt.strftime('%Y-%m-%dT03:20:15.057547')]})
    ta.attrs['spatial_ref'] = 4326
    lst_landsat.insert(index, ta)

# Assuming the locations DataFrame and lst_landsat list are defined

indices = [201, 204, 205, 209, 211, 306, 310, 323]

for index in indices:
    date_of_harvest = locations.iloc[index]['Date of Harvest']
    create_and_insert_dataarray(index, date_of_harvest, lst_landsat)


# Utilising this for loamy soil as this is the best generalized soil type for Vietnam
A = 0.71
B = 1.40
sm_feature = []

# Load the data using Open Data Cube for each location
for index, row in locations.iterrows():
    lon = row['Longitude']
    lat = row['Latitude']
    toh = row['Date of Harvest']
    date_object = datetime.strptime(toh, "%d-%m-%Y")
    
    trial_date = date_object.strftime("%Y-%m-%d")
    time_delta = timedelta(days=6)
    trial_datetime = datetime.strptime(trial_date,"%Y-%m-%d")
    start_date = (trial_datetime - time_delta)
    end_date = (trial_datetime + time_delta)
    time_window3 = f"{start_date}/{end_date}"
    
    bbox = (lon - box_size_deg2/2, lat - box_size_deg2/2, lon + box_size_deg2/2, lat + box_size_deg2/2)
    catalog = pystac_client.Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")
    search = catalog.search(collections=["sentinel-1-rtc"], bbox=bbox, datetime=time_window3)
    items = list(search.get_all_items())
    data = stac_load(items,bands=["vv", "vh"], patch_url=pc.sign, bbox=bbox, crs="EPSG:4326", resolution=scale2)
    mean = data.mean(dim=['latitude','longitude']).compute()
    dop = (mean.vv / (mean.vv + mean.vh))
    
    #Calculating Soil Moisture
    soil_moisture = (1 - ((10 ** (0.1 * dop)) / A) ** B)
    
    sm_feature.append(soil_moisture)


albedo_values[:5]

[<xarray.DataArray (time: 4)>
 array([nan, nan, nan, nan])
 Coordinates:
     spatial_ref  int32 4326
   * time         (time) datetime64[ns] 2022-07-13T03:20:15.057547 ... 2022-07...,
 <xarray.DataArray (time: 4)>
 array([nan, nan, nan, nan])
 Coordinates:
     spatial_ref  int32 4326
   * time         (time) datetime64[ns] 2022-07-13T03:20:15.057547 ... 2022-07...,
 <xarray.DataArray (time: 2)>
 array([nan, nan])
 Coordinates:
     spatial_ref  int32 4326
   * time         (time) datetime64[ns] 2022-07-13T03:20:38.965531 2022-07-21T...,
 <xarray.DataArray (time: 4)>
 array([nan, nan, nan, nan])
 Coordinates:
     spatial_ref  int32 4326
   * time         (time) datetime64[ns] 2022-07-13T03:20:15.057547 ... 2022-07...,
 <xarray.DataArray (time: 4)>
 array([nan, nan, nan, nan])
 Coordinates:
     spatial_ref  int32 4326
   * time         (time) datetime64[ns] 2022-07-13T03:20:15.057547 ... 2022-07...]


def interpolate_and_create_dataframe(data_list, column_name, latitudes, longitudes):
    values_arrays = [data.values for data in data_list]
    lengths = [len(arr) for arr in values_arrays]
    values_flat = np.concatenate(values_arrays)

    # Get the indices of non-NaN values
    non_nan_indices = np.where(~np.isnan(values_flat))[0]
    non_nan_values = values_flat[non_nan_indices]

    # Create a cubic spline interpolator
    spline_interpolator = interp1d(non_nan_indices, non_nan_values, kind='cubic', bounds_error=False, fill_value=(non_nan_values[0], non_nan_values[-1]))

    # Replace NaN values with interpolated values
    interpolated_data = np.array([spline_interpolator(i) if np.isnan(value) else value for i, value in enumerate(values_flat)])

    # Split the interpolated_data back to original lengths
    interpolated_arrays = np.split(interpolated_data, np.cumsum(lengths)[:-1])

    data = {'Latitude': latitudes, 'Longitude': longitudes, column_name: interpolated_arrays}
    return pd.DataFrame(data)


al_df = interpolate_and_create_dataframe(albedo_values, 'Albedo', latitudes, longitudes)
ndvi_df = interpolate_and_create_dataframe(ndvi_landsat, 'NDVI', latitudes, longitudes)
ndwi_df = interpolate_and_create_dataframe(ndwi_landsat, 'NDWI', latitudes, longitudes)
evi2_df = interpolate_and_create_dataframe(evi2_landsat, 'EVI 2', latitudes, longitudes)
savi_df = interpolate_and_create_dataframe(savi_landsat, 'SAVI', latitudes, longitudes)
lst_df = interpolate_and_create_dataframe(lst_landsat, 'Land Surface Temperature', latitudes, longitudes)
sm_df = interpolate_and_create_dataframe(sm_feature, 'Soil Moisture', latitudes, longitudes)

al_df.head(5)


def count_nan_df_col(df, column_name):
    count = 0
    for array in df[column_name]:
        count += np.isnan(array).sum()
    return count

print(f"Number of NaN values in Albedo DataFrame: {count_nan_df_col(al_df, 'Albedo')}")
print(f"Number of NaN values in NDVI DataFrame: {count_nan_df_col(ndvi_df, 'NDVI')}")
print(f"Number of NaN values in NDWI DataFrame: {count_nan_df_col(ndwi_df, 'NDWI')}")
print(f"Number of NaN values in SAVI DataFrame: {count_nan_df_col(savi_df, 'SAVI')}")
print(f"Number of NaN values in EVI 2 DataFrame: {count_nan_df_col(evi2_df, 'EVI 2')}")
print(f"Number of NaN values in Land Surface Temp DataFrame: {count_nan_df_col(lst_df, 'Land Surface Temperature')}")
print(f"Number of NaN values in Soil Moisture DataFrame: {count_nan_df_col(sm_df, 'Soil Moisture')}")

Number of NaN values in Albedo DataFrame: 0
Number of NaN values in NDVI DataFrame: 0
Number of NaN values in NDWI DataFrame: 0
Number of NaN values in SAVI DataFrame: 0
Number of NaN values in EVI 2 DataFrame: 0
Number of NaN values in Land Surface Temp DataFrame: 0
Number of NaN values in Soil Moisture DataFrame: 0


def find_mean(lst):
    return sum(lst) / len(lst)

al_df['Average Albedo'] = al_df['Albedo'].apply(find_mean)
evi2_df['Average EVI 2'] = evi2_df['EVI 2'].apply(find_mean)
ndvi_df['Average NDVI'] = ndvi_df['NDVI'].apply(find_mean)
savi_df['Average SAVI'] = savi_df['SAVI'].apply(find_mean)
ndwi_df['Average NDWI'] = ndwi_df['NDWI'].apply(find_mean)
lst_df['Average Land Surface Temperature'] = lst_df['Land Surface Temperature'].apply(find_mean)
sm_df['Average Soil Moisture'] = sm_df['Soil Moisture'].apply(find_mean)

al_df.head()


feature_df = pd.concat([al_df['Average Albedo'], evi2_df['Average EVI 2'], ndvi_df['Average NDVI'], savi_df['Average SAVI'], 
                        ndwi_df['Average NDWI'], sm_df['Average Soil Moisture'], lst_df['Average Land Surface Temperature']],axis=1)

feature_df.head()


precip = weather_data['precip']
precip1 = weather_data1['precip']
precip2 = weather_data2['precip']
dates = locations['Date of Harvest']
unique_dates = dates.unique()

def sort_dates(date):
    day, month, year = date.split('-')
    return year, month, day

sorted_dates = sorted(unique_dates, key=sort_dates)

window_sizes = [168,170,171,172,174,175,176,177,178,182,183,184,185,186,187,188,190,191,193,194,
                197,201,203,278,283,285,286,287,288,290,292,293,294,295,296,301,302,304,307,308,312]

def calculate_precip_stats(precip, window_sizes):
    prec_mean = []
    prec_max = []
    prec_min = []
    prec_var = []

    for window_size in window_sizes:
        prec_mean.append(precip.iloc[:window_size].mean())
        prec_max.append(precip.iloc[:window_size].max())
        prec_min.append(precip.iloc[:window_size].min())
        prec_var.append(precip.iloc[:window_size].var())

    return prec_mean, prec_max, prec_min, prec_var

# Chau Phu
prec_df_cp, prec_max_df_cp, prec_min_df_cp, prec_var_df_cp = calculate_precip_stats(precip, window_sizes)

# Chau Thanh
prec_df_ct, prec_max_df_ct, prec_min_df_ct, prec_var_df_ct = calculate_precip_stats(precip1, window_sizes)

# Thoai Son
prec_df_ts, prec_max_df_ts, prec_min_df_ts, prec_var_df_ts = calculate_precip_stats(precip2, window_sizes)


# create a list of districts and precipitation data
districts = ['Chau_Phu'] * 41 + ['Chau_Thanh'] * 41 + ['Thoai_Son'] * 41
precipitation_data = [prec_df_cp, prec_df_ct, prec_df_ts]
max_precipitation_data = [prec_max_df_cp, prec_max_df_ct, prec_max_df_ts]
min_precipitation_data = [prec_min_df_cp, prec_min_df_ct, prec_min_df_ts]
var_precipitation_data = [prec_var_df_cp, prec_var_df_ct, prec_var_df_ts]
dates_extended = sorted_dates * 3

# flatten the precipitation data list
precipitation_data = [item for sublist in precipitation_data for item in sublist]
max_precipitation_data = [item for sublist in max_precipitation_data for item in sublist]
min_precipitation_data = [item for sublist in min_precipitation_data for item in sublist]
var_precipitation_data = [item for sublist in var_precipitation_data for item in sublist]

# create a dictionary with districts and precipitation data
prec = {'District': districts, 'Date of Harvest': dates_extended, 'Precipitation': precipitation_data, 'Precipitation Variance': var_precipitation_data}

# create a DataFrame from the dictionary
df = pd.DataFrame(prec)

new_df = pd.DataFrame({'Average Albedo': feature_df['Average Albedo'], 'Average EVI 2': feature_df['Average EVI 2'], 
                       'Average NDVI': feature_df['Average NDVI'] , 'Average SAVI': feature_df['Average SAVI'], 'Average NDWI': feature_df['Average NDWI'],
                      'Average Land Surface Temperature': feature_df['Average Land Surface Temperature'], 'Average Soil Moisture': feature_df['Average Soil Moisture']})
new_df['Date of Harvest'] = locations['Date of Harvest']
new_df['District'] = locations['District']

new_df = pd.merge(left=new_df, right=df, on=['District', 'Date of Harvest'], how='left')
new_df.drop(columns=['District', 'Date of Harvest'], inplace=True)
new_df.head()


X = new_df
y = locations['Rice Yield (kg/ha)']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=123)


linear_model = LinearRegression()
linear_model.fit(X_train, y_train)

# Predict on the test set
y_pred_linear = linear_model.predict(X_test)
r2_linear = r2_score(y_test, y_pred_linear)
mse_linear = mean_squared_error(y_test, y_pred_linear)

print(f"Linear Regression Accuracy: {r2_linear:.2f}")
print(f"Linear Regression MSE: {mse_linear**(1/2):.2f}")

Linear Regression Accuracy: 0.68
Linear Regression MSE: 436.25


lasso_model = Lasso(alpha=0.5)
lasso_model.fit(X_train, y_train)

y_pred_lasso = lasso_model.predict(X_test)
r2_lasso = r2_score(y_test, y_pred_lasso)
mse_lasso = mean_squared_error(y_test, y_pred_lasso)

print(f"Lasso Regression Accuracy: {r2_lasso:.2f}")
print(f"Lasso Regression MSE: {mse_lasso**(1/2):.2f}")

Lasso Regression Accuracy: 0.67
Lasso Regression MSE: 440.32


coeff_df = pd.DataFrame()
coeff_df['Feature'] = X_train.columns
coeff_df['Linear Coefficients'] = linear_model.coef_
coeff_df['Lasso Coefficients'] = lasso_model.coef_

coeff_df


Xy = pd.concat([X, y], axis=1)
corr_matrix = Xy.corr()

sns.heatmap(corr_matrix, annot=True, cmap='GnBu', fmt='.2f', vmin=-1, vmax=1, linewidths=0.5, cbar_kws={'shrink': 0.5})
plt.title("Fig. 9. Correlation Heatmap", fontweight='bold')
plt.show()


# Separate the features and target
def perform_rfe(estimator, X, y):
    with open(os.devnull, 'w') as devnull:
        with redirect_stdout(devnull):
            selector = RFECV(estimator, cv=5)
            selector.fit(X, y)
    return X.columns[selector.support_]

estimators = {
    "Random Forest Regressor": RandomForestRegressor(),
    "Gradient Boosting Regressor": GradientBoostingRegressor(),
    "XGB Regressor": XGBRegressor(),
    "CatBoost Regressor": CatBoostRegressor(logging_level='Silent')
}

selected_features = {}

for name, estimator in estimators.items():
    selected_features[name] = perform_rfe(estimator, X, y)
    print(f"Selected features for {name}:", selected_features[name])

Selected features for Random Forest Regressor: Index(['Average Albedo', 'Average NDWI', 'Average Land Surface Temperature',
       'Average Soil Moisture', 'Precipitation', 'Precipitation Variance'],
      dtype='object')
Selected features for Gradient Boosting Regressor: Index(['Average Albedo', 'Average NDWI', 'Average Land Surface Temperature',
       'Precipitation', 'Precipitation Variance'],
      dtype='object')
Selected features for XGB Regressor: Index(['Precipitation', 'Precipitation Variance'], dtype='object')
Selected features for CatBoost Regressor: Index(['Average Albedo', 'Average EVI 2', 'Average NDVI', 'Average NDWI',
       'Average Land Surface Temperature', 'Average Soil Moisture',
       'Precipitation', 'Precipitation Variance'],
      dtype='object')


def train_and_evaluate(model, model_name, X_train, y_train, X_test, y_test):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    r2 = r2_score(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)

    print(f"{model_name} Accuracy: {r2:.2f}")
    print(f"{model_name} MSE: {mse**(1/2):.2f}")
    print()
    return model, y_pred

# Create model instances
rf_model = RandomForestRegressor(n_estimators=15, max_depth=5, random_state=42)
gb_model = GradientBoostingRegressor(n_estimators=14, learning_rate=0.21, random_state=42)
xgb_model = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=18, learning_rate=0.201, seed=42)
cb_model = CatBoostRegressor(learning_rate=0.21, depth=5, n_estimators=25, verbose=0, random_state=42)

selection1 = ['Average Albedo', 'Average NDWI', 'Average Land Surface Temperature', 'Precipitation', 'Precipitation Variance']
selection2 = ['Average NDWI', 'Average Land Surface Temperature', 'Precipitation', 'Precipitation Variance']
selection3 = ['Precipitation', 'Precipitation Variance']
selection4 = ['Average Albedo', 'Average EVI 2', 'Average NDVI', 'Average NDWI', 'Average Land Surface Temperature', 'Average Soil Moisture', 'Precipitation', 'Precipitation Variance']

X1 = new_df[selection1]
X2 = new_df[selection2]
X3 = new_df[selection3]
X4 = new_df[selection4]
y = locations['Rice Yield (kg/ha)']

X_train1, X_test1, y_train, y_test = train_test_split(X1, y, test_size=0.25, random_state=123)
X_train2, X_test2, y_train, y_test = train_test_split(X2, y, test_size=0.25, random_state=123)
X_train3, X_test3, y_train, y_test = train_test_split(X3, y, test_size=0.25, random_state=123)
X_train4, X_test4, y_train, y_test = train_test_split(X4, y, test_size=0.25, random_state=123)

# Train and evaluate models
rf_model, y_pred_rf = train_and_evaluate(rf_model, "Random Forest Regression", X_train1, y_train, X_test1, y_test)
gb_model, y_pred_gb = train_and_evaluate(gb_model, "Gradient Boosting Regression", X_train2, y_train, X_test2, y_test)
xgb_model, y_pred_xgb = train_and_evaluate(xgb_model, "XGBoost Regression", X_train3, y_train, X_test3, y_test)
cb_model, y_pred_cb = train_and_evaluate(cb_model, "CatBoost Regression", X_train4, y_train, X_test4, y_test)

Random Forest Regression Accuracy: 0.68
Random Forest Regression MSE: 437.55

Gradient Boosting Regression Accuracy: 0.68
Gradient Boosting Regression MSE: 434.28

XGBoost Regression Accuracy: 0.67
XGBoost Regression MSE: 445.02

CatBoost Regression Accuracy: 0.64
CatBoost Regression MSE: 461.87


data = pd.DataFrame({'Actual': y_test,
                     'Model1': y_pred_rf,
                     'Model2': y_pred_gb,
                     'Model3': y_pred_xgb,
                     'Model4': y_pred_cb})

fig, axes = plt.subplots(2, 2, figsize=(9, 8))

sns.regplot(ax=axes[0, 0], data=data, x='Actual', y='Model1', scatter_kws={'alpha': 0.3}, color = "Black")
axes[0, 0].set_title('Fig. 10. Random Forest', fontweight='bold')

sns.regplot(ax=axes[0, 1], data=data, x='Actual', y='Model2', scatter_kws={'alpha': 0.3}, color = "Orange")
axes[0, 1].set_title('Fig. 11. Gradient Boosting', fontweight='bold')

sns.regplot(ax=axes[1, 0], data=data, x='Actual', y='Model3', scatter_kws={'alpha': 0.3}, color = "#121bc7")
axes[1, 0].set_title('Fig. 12. XGBoost', fontweight='bold')

sns.regplot(ax=axes[1, 1], data=data, x='Actual', y='Model4', scatter_kws={'alpha': 0.3}, color="Red")
axes[1, 1].set_title('Fig. 13. CatBoost', fontweight='bold')

plt.tight_layout()
plt.show()


adaboost_regressor = AdaBoostRegressor(base_estimator=gb_model, n_estimators=90, learning_rate=0.025, random_state=42)
adaboost_regressor.fit(X_train2, y_train)

y_pred_adaboost = adaboost_regressor.predict(X_test2)
r2_adaboost = r2_score(y_test, y_pred_adaboost)
mse_adaboost = mean_squared_error(y_test, y_pred_adaboost)

print(f"Boosting Regression Accuracy: {r2_adaboost:.2f}")
print(f"Boosting Regression MSE: {mse_adaboost**(1/2):.2f}")

Boosting Regression Accuracy: 0.69
Boosting Regression MSE: 430.36


data2 = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred_adaboost})
plt.figure(figsize=(5, 4))
ax = sns.regplot(data=data2, x='Actual', y='Predicted', scatter_kws={'alpha': 0.3})
ax.set_title('Fig. 14. Boosting', fontweight='bold')
plt.show()


data2['Residuals'] = data2['Actual'] - data2['Predicted']
plt.figure(figsize=(5, 4))
ax = sns.histplot(data=data2, x='Residuals', kde=True, color='Black')
ax.set_title('Fig. 15. Residuals Distribution', fontweight='bold')
plt.show()

	Latitude	Longitude	Albedo
0	10.510542	105.248554	[0.09556685977230825, 0.09556685977230825, 0.0...
1	10.509150	105.265098	[0.09556685977230825, 0.09556685977230825, 0.0...
2	10.467721	105.192464	[0.09556685977230825, 0.09556685977230825]
3	10.494453	105.241281	[0.09556685977230825, 0.09556685977230825, 0.0...
4	10.535058	105.252744	[0.09556685977230825, 0.09556685977230825, 0.0...

	Latitude	Longitude	Albedo	Average Albedo
0	10.510542	105.248554	[0.09556685977230825, 0.09556685977230825, 0.0...	0.095567
1	10.509150	105.265098	[0.09556685977230825, 0.09556685977230825, 0.0...	0.095567
2	10.467721	105.192464	[0.09556685977230825, 0.09556685977230825]	0.095567
3	10.494453	105.241281	[0.09556685977230825, 0.09556685977230825, 0.0...	0.095567
4	10.535058	105.252744	[0.09556685977230825, 0.09556685977230825, 0.0...	0.095567

	Average Albedo	Average EVI 2	Average NDVI	Average SAVI	Average NDWI	Average Soil Moisture	Average Land Surface Temperature
0	0.095567	0.447983	0.635155	0.419686	-0.583988	-1.069717	66.657175
1	0.095567	0.447983	0.635155	0.419686	-0.583988	-1.105420	66.657175
2	0.095567	0.447983	0.635155	0.419686	-0.583988	-1.060978	66.657175
3	0.095567	0.447983	0.635155	0.419686	-0.583988	-1.017536	66.657175
4	0.095567	0.447983	0.635155	0.419686	-0.583988	-1.102196	66.657175

	Average Albedo	Average EVI 2	Average NDVI	Average SAVI	Average NDWI	Average Land Surface Temperature	Average Soil Moisture	Precipitation	Precipitation Variance
0	0.095567	0.447983	0.635155	0.419686	-0.583988	66.657175	-1.069717	6.304514	177.830676
1	0.095567	0.447983	0.635155	0.419686	-0.583988	66.657175	-1.105420	6.304514	177.830676
2	0.095567	0.447983	0.635155	0.419686	-0.583988	66.657175	-1.060978	6.304514	177.830676
3	0.095567	0.447983	0.635155	0.419686	-0.583988	66.657175	-1.017536	6.304514	177.830676
4	0.095567	0.447983	0.635155	0.419686	-0.583988	66.657175	-1.102196	6.322997	178.353735

	Feature	Linear Coefficients	Lasso Coefficients
0	Average Albedo	-7207.746314	-0.000000
1	Average EVI 2	-1653.822764	-728.091388
2	Average NDVI	600.007497	446.673485
3	Average SAVI	2569.392649	-0.000000
4	Average NDWI	1666.820463	245.993988
5	Average Land Surface Temperature	-1.849313	-3.391375
6	Average Soil Moisture	827.565980	219.788827
7	Precipitation	-1243.472173	-1186.544723
8	Precipitation Variance	18.794090	18.538168

ECON 323: Predicting Rice Crop Yield In Vietnam¶

Authors: Divyadarshan Punjabi; Nirvaan Rohira; and Yasin Zahir¶

Outline¶

Introduction and Current Literature ¶

Setting up the import statements and the themes ¶

Loading the Dataset ¶

Feature Engineering ¶

Lets visualize the data coming from Landsat and Sentinel 1 satellites for a sample location ¶

For Landsat¶

View RGB (real color) images from the time series¶

Applying Cloud Filtering and Masking¶

For Sentinel 1¶

Let us calculate our Features (Predictor Variables) ¶

From Landsat ¶

From Sentinel 1 ¶

Now let's start building the model ... ¶

Linear Regression ¶

Lasso Regression ¶

Feature Extraction ¶

ML Regression Models ¶

Boosting ¶

Bottlenecks and Methods of Improvement ¶

Conclusion ¶

References ¶

Contributions ¶

	District	Latitude	Longitude	Season(SA = Summer Autumn, WS = Winter Spring)	Rice Crop Intensity(D=Double, T=Triple)	Date of Harvest	Field size (ha)	Rice Yield (kg/ha)
0	Chau_Phu	10.510542	105.248554	SA	T	15-07-2022	3.40	5500
1	Chau_Phu	10.509150	105.265098	SA	T	15-07-2022	2.43	6000
2	Chau_Phu	10.467721	105.192464	SA	D	15-07-2022	1.95	6400
3	Chau_Phu	10.494453	105.241281	SA	T	15-07-2022	4.30	6000
4	Chau_Phu	10.535058	105.252744	SA	D	14-07-2022	3.30	6400