#!/usr/bin/env python
# coding: utf-8
# In[1]:
get_ipython().run_line_magic('matplotlib', 'inline')
import pandas as pd
import numpy as np
import re
from requests import get
import json
import os
from collections import defaultdict
from sklearn.preprocessing import MinMaxScaler
#Plotting
import matplotlib.pyplot as plt
import matplotlib.animation as animation
from IPython.display import HTML, IFrame
import seaborn as sns
import plotly.express as px
import folium # conda install -c conda-forge folium
from folium import plugins
import geopandas as gpd
# Our helpers
from neural import prepare_future, predict_future # Wrapper to use RNNs
from helpfunc import *
from plots import *
# For retrieving prices of food
import nltk
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords
import picos as pic # Library used to implement the convex optimization problem
import plotly.graph_objects as go
import plotly.offline as py
get_ipython().run_line_magic('load_ext', 'autoreload')
get_ipython().run_line_magic('autoreload', '2')
# In[2]:
# Central registry of every file path used by the notebook: raw inputs,
# pre-computed pickles, and output targets for generated visualizations.
path_dict = {
    # Raw input data
    'food_balance_africa': 'data/raw/FoodBalanceSheets_E_Africa_1.csv',
    'geoworld_json': 'data/raw/world-countries.json',
    'ages_calories_demand': 'data/raw/calories_demand.xlsx',
    'african_countries_list': 'data/raw/african_countries.txt',
    'population_age_male': 'data/raw/POPULATION_BY_AGE_MALE.xlsx',
    'population_age_female': 'data/raw/POPULATION_BY_AGE_FEMALE.xlsx',
    'food_balance_europe': 'data/raw/FoodBalanceSheets_E_Europe_1.csv',
    'european_countries_list': 'data/raw/european_countries.txt',
    # Pre-computed RNN predictions (see predict_future)
    'africa_supply_rnn': 'data/processed/africa_cal.pkl',
    'europe_supply_rnn': 'data/processed/europe_cal.pkl',
    # Visualization outputs
    'african_supply_map': 'visualization/africa_supply_map',
    'african_demand_anim': 'visualization/african_cal_diff_animation.html',
    'african_estimation_kcal': 'visualization/africa_est_kcal',
    'african_kcal_need': 'visualization/african_kcal_need',
    'european_supply_map': 'visualization/european_supply_map',
    'african_pop_growth': 'visualization/african_pop_growth.html',
    'european_pop_growth': 'visualization/european_pop_growth.html',
    'european_estimation_kcal': 'visualization/europe_est_kcal',
    'european_demand_anim': 'visualization/european_cal_diff_animation.html',
    'european_kcal_surplus': 'visualization/european_kcal_surplus',
    'world_kcal_surplus': 'visualization/world/world_kcal_surplus',
}
# In[3]:
plt.rcParams["figure.figsize"] = (15,8) #set the default figure size for all matplotlib plots
# # Determination of African countries with food deficit
# ## 1) How much human food resources are available in African countries?
# ### 1.1) Preprocessing
# To answer this important question, we will need to import data from the **FAO Dataset**. More specifically, we will focus on the section **Food Balance Sheet** with respect to African countries only.
# In[4]:
# latin-1 encoding: the FAO CSV contains accented country names (e.g. Côte d'Ivoire)
food_balance_africa = pd.read_csv(path_dict['food_balance_africa'],encoding='latin-1');
# Firstly, we will **remove** all the columns with title **"Y----F"** as they contain information about how the data was obtained (Calculated, Regression, Aggregate, FAO Estimation). In this context we will consider that FAO is a *highly renowned Agency* and hence we can assume these values are truthful without loss of generality. Furthermore we thought that it would be very handy to have numbers as columns representing years instead of **"Y----"**. We proceed on removing the letter **Y**. The helper function `clean_Fs_and_years` does this cleaning on the dataframe.
# In[5]:
food_balance_africa = clean_Fs_and_years(food_balance_africa)
# Secondly, we replace all the **NAN** values with **0** as the Item was not available.
# In[6]:
food_balance_africa = food_balance_africa.fillna(0);
# The third step to complete **the cleaning** of food_balance_africa consists of adapting names of countries in order to have consistency along our different dataframes. Since some countries changed their name over the years we will rename them. In particular, **Swaziland** to **Eswatini** and **South Africa** to **Southern Africa**. The function `replace_names_of_countries` takes the dataframe and the country names which should be replaced.
# In[7]:
food_balance_africa = replace_names_of_countries(food_balance_africa, [('Swaziland', 'Eswatini'), ('South Africa', 'Southern Africa')])
# Our Dataframe looks like this:
# In[8]:
food_balance_africa.head()
# Analysing our DataFrame *food_balance_africa* we can see that it's already well structured since it contains many key - value couples such as **Item Code - Item** and **Element Code - Element** . More specifically, we will take advantage of this structure to filter out only rows characterized by **Grand total** as an **Item** and **Food supply (kcal/capita/day)** as an **Element**. The corresponding key-values are **(Item Code, 2901) and (Element Code, 664)**.
# A reference to the documentation in the [FAO Website](http://www.fao.org/faostat/en/#data/FBS) explains the legend for Element Code and Element Item extensively.
#
# In order to keep our original Dataframe *food_balance_africa* as a reference we create a new Dataframe *food_supply_africa* in which we just keep **countries** and **food supplies** for every **year**.
# In[9]:
food_supply_africa = obtain_supply(food_balance_africa)
# We can now group by **Area** and see the supplies derived from each item available in countries for that particular year.
# In[10]:
food_supply_africa = food_supply_africa.set_index("Area")
food_supply_africa.head()
# In order to check for anomalies in our data, we would like to analyze the **timeline**. We therefore transpose the dataframe and plot the timeline of how food supply in different countries evolved. Legend was suppressed as it is too large.
# In[11]:
# After the transpose: rows = years, columns = countries.
food_supply_africa = food_supply_africa.transpose();
# In[12]:
#converting the year index from string to int so it sorts and plots numerically
food_supply_africa.index = food_supply_africa.index.astype(int)
# In[13]:
timeline_supply(food_supply_africa, "African")
# This analysis shows that there are two inconsistencies. We therefore check for countries containing values equal to zero.
# In[14]:
food_supply_africa.columns.values[(food_supply_africa == 0).any()]
# We notice that **Sudan** and **Ethiopia** appear twice as "Sudan" and "Sudan (former)" and "Ethiopia" and "Ethiopia PDR" respectively. This is due to the fact that South Sudan gained independence in 2011 (reference to https://en.wikipedia.org/wiki/South_Sudan), and the foundation of the Federal Democratic Republic of Ethiopia (reference to https://en.wikipedia.org/wiki/Ethiopia) in 1991. From then on, Ethiopia PDR was listed as Ethiopia. With food supply being consistently constant even after division, the newly introduced country "Sudan" is assumed to further on have accounted for both countries. For this reason, we will consider them to be one single country.
# Consequently, the two countries' data is merged into one continuous set each. The function `merge_countries` takes care of this, by substituting each key in the dictionary (the second argument) with its value(s).
# In[15]:
food_supply_africa = merge_countries(food_supply_africa, {'Sudan (former)': ['Sudan'], 'Ethiopia PDR': ['Ethiopia']})
# Let's plot the newly generated data:
# In[16]:
timeline_supply(food_supply_africa, "African", " - Cleaned dataset")
# Next, we want to add more columns representing future years until 2020 to prepare cells for extrapolation to make predictions about possible scenarios. `prepare_future` is used for this.
# In[17]:
# Adding columns for the new years (2014-2020), initially empty
food_supply_africa = prepare_future(food_supply_africa, 2014, 2020)
# ### 1.2) Extrapolation
# First of all, we want to simulate data until 2020 to match the population data. Furthermore, we also want to be able to make predictions for individual countries to assess if they might run into food shortages in the near future.
# The prediction for the new years are done by using a "*Recurrent Neural Network (RNN)*" and a window of size 10. Basically what we will do here is using all the past history of each country (windowed in blocks of 10 years each) to run a neural network and try to predict the future behaviour (up to 2020). During our tests we found that the neural networks are able to produce good estimations.
# As we don't want precise data, the **_estimations_** achieved by using ML are in this case more than acceptable for our purpose.
# *Credits*: We don't know much about RNN, so the network used here is adapted from the *Time series forecasting tutorial* on **Tensorflow**, available [here](https://www.tensorflow.org/tutorials/structured_data/time_series)
# *Note*: we already ran the networks and saved the results on Colab, running them each time requires more than an hour on Colab. For this reason, we just use the pickle here instead of running the network. This is achieved by the function `predict_future`.
# In[18]:
# Loads the pre-computed RNN predictions from the pickle rather than retraining.
food_supply_africa = predict_future(food_supply_africa, path_dict['africa_supply_rnn'])
# Plotting the results:
# In[19]:
timeline_supply(food_supply_africa, "African", "- Predicted dataset")
# When looking at the predicted period (after 2013), a linear approximation can be observed with some countries being forecasted to oscillate. This can be explained by the fact that the neural network method recognised a recurring pattern in the data and extrapolates this behaviour. Among all the tested methods, this one is expected to reproduce the most realistic data, as no odd developments can be observed.
# ### 1.3) Visualizing the data interactively
# In[20]:
#Geographic coordinates (country polygons) for map visualization
geojson_world = gpd.read_file(path_dict['geoworld_json'])
type(geojson_world)
# We observe `geojson_world` is a **GeoDataFrame**. Let's see how the data are sorted:
# In[21]:
geojson_world.head()
# We have to make sure that `geojson_world` has data for every country under our analysis. In this case, we need to check for every country in Africa and more specifically every country taken into account in `food_supply_africa`. Since the **name** in `geojson_world` is not a good index to match and we don't have any comparable **id** in `food_supply_africa` we have to do this job by hand, which is feasible since the size is small enough. As a result, we filter `geojson_world` to `geojson_africa`. Let's display this new GeoDataFrame.
# ISO-3166 alpha-3 codes matching the `id` field of the geojson, positionally
# aligned with `african_country_names` (codes[i] <-> names[i]).
# NOTE(fix): the code paired with 'Guinea' was previously "GNQ", which is
# Equatorial Guinea — a country removed later because FAO has no data for it.
# Guinea's ISO code is "GIN"; with "GNQ" Guinea's data would be drawn on
# Equatorial Guinea's geometry.
african_country_codes = ["DZA", "AGO", "BEN", "BWA", "BFA", "CMR", "CAF", "TCD", "COD", "CIV",
                         "DJI", "EGY", "SWZ", "ETH", "GAB", "GMB", "GHA", "GIN", "GNB", "KEN",
                         "LSO", "LBR", "MDG", "MWI", "MLI", "MRT", "MAR", "MOZ", "NAM", "NER",
                         "NGA", "RWA", "SEN", "SLE", "ZAF", "SDN", "TGO", "TUN", "UGA", "TZA",
                         "ZMB", "ZWE"]
# Country names as spelled in the FAO data (after the renames applied above,
# e.g. 'Southern Africa').
african_country_names = ['Algeria', 'Angola', 'Benin', 'Botswana', 'Burkina Faso',
                         'Cameroon', 'Central African Republic', 'Chad', 'Congo',
                         "Côte d'Ivoire", 'Djibouti', 'Egypt', 'Eswatini', 'Ethiopia', 'Gabon',
                         'Gambia', 'Ghana', 'Guinea', 'Guinea-Bissau', 'Kenya', 'Lesotho',
                         'Liberia', 'Madagascar', 'Malawi', 'Mali', 'Mauritania', 'Morocco',
                         'Mozambique', 'Namibia', 'Niger', 'Nigeria', 'Rwanda', 'Senegal',
                         'Sierra Leone', 'Southern Africa', 'Sudan', 'Togo',
                         'Tunisia', 'Uganda', 'United Republic of Tanzania', 'Zambia',
                         'Zimbabwe']
# Key-value table used by plot_map / save_map_data to translate between
# geojson ids and FAO country names.
african_country_kv = pd.DataFrame({'codes': african_country_codes,
                                   'names': african_country_names
                                   })
# Keep only the African geometries we have codes for.
geojson_africa = geojson_world[geojson_world.id.isin(african_country_codes)]
geojson_africa.head()
# The ordered list **african_country_codes** contains all the countries available to be plotted as geometry is available. We found out manually that **Cabo Verde, Sao Tome and Principe and Mauritius** are not in this list. For the sake of simplicity, we will remove these three countries as they don't affect our analysis.
# In[23]:
food_supply_africa = food_supply_africa.drop(columns=["Cabo Verde","Mauritius","Sao Tome and Principe"])
# Now we can move to plot the Food supply for each country. All of this is done in the `plot_map` function. This function plots a world map centered in the zone of interest with interactive values while scrolling over the individual countries.
# As we are particularly interested in the situation in *2020*, we'll plot our prediction of supply for this year in the notebook map. It could also be interesting to look at the evolution of the situation over the last 50 years, so we plot a map for each decade from 1970 to 2020.
# **_Note_: the map visualized here is just for 2020 and may not load. If this happens, [click here](https://manuleo.github.io/mADAm_files/africa_supply_map2020.html)**
#
# The maps for the previous decades are available here:
# - [1970](https://manuleo.github.io/mADAm_files/africa_supply_map1970.html)
# - [1980](https://manuleo.github.io/mADAm_files/africa_supply_map1980.html)
# - [1990](https://manuleo.github.io/mADAm_files/africa_supply_map1990.html)
# - [2000](https://manuleo.github.io/mADAm_files/africa_supply_map2000.html)
# - [2010](https://manuleo.github.io/mADAm_files/africa_supply_map2010.html)
# In[24]:
# Used to generate the per-decade maps, don't run it.
# NOTE(fix): the body of this `for` loop had lost its indentation in the
# notebook export (a SyntaxError); restored here.
legend_name = "Food supply (kcal/person/day)"
for year in range(1970, 2030, 10):
    africa_supply_map = plot_map(food_supply_africa.T, path_dict['geoworld_json'],
                                 african_country_kv, year, "Blues", legend_name, legend_name,
                                 path_dict['african_supply_map'] + str(year) + ".html")
africa_supply_map  # display the last generated (2020) map inline
# In[25]:
# Printing 2020 map
IFrame(src='https://manuleo.github.io/mADAm_files/africa_supply_map2020.html', width = 800, height=600)
# In[26]:
# Export per-year geojson + colour ticks consumed by the data-story site.
save_map_data(geojson_africa, african_country_kv, food_supply_africa, "docs/json/africa_supply/africa_supply_{}.geojson", "docs/json/africa_supply/africa_supply_ticks.json")
# By analyzing the trend over the decades, we discover that the situation in the African region **partially improved** over the years, especially if we compare 2020 with 1970.
# There are some countries, especially in North Africa, that improved their supplies by 1000 kcal. In contrast, the situation in the central-southern region is almost the same as 50 years ago. Another important point is that there is a bit of fluctuation in the values.
# ## 2) What is the ideal amount of kcal each African country needs?
# In this first part, we compute **kcal demand** for males and females for every age group. Secondly, we will conduct an extensive analysis on **African demographics**. Finally, we will be able to combine kcal demand with African population data into a unique dataframe that will be the answer of our initial question: **What is the kcal demand of a regular person in order to be healthy?**
# ### 2.1) How many kilocalories does a regular person need daily?
# First of all, we load the calories demand datasets scraped from the webpage [Calories](https://health.gov/dietaryguidelines/2015/guidelines/appendix-2/). This information will be matched with the population datasets to receive the total caloric demand in each country, each year.
# In[27]:
# Sheet 0 = male caloric needs per age group and activity level
male_calory_demand = pd.read_excel(path_dict['ages_calories_demand'], header = None, sheet_name = 0, names = ['age', 'sedentary', 'moderate', 'active'])
# In[28]:
# Sheet 1 = female caloric needs per age group and activity level
female_calory_demand = pd.read_excel(path_dict['ages_calories_demand'], header = None, sheet_name = 1, names = ['age', 'sedentary', 'moderate', 'active'])
# In order to better work with the information we have collected, we will make some simplifications on the data. Mainly, we will:
# - take the **active lifestyle** column in the calories demands database. According to the [World Health Organization](https://www.afro.who.int/health-topics/physical-activity), regular physical activity helps to maintain a healthy body and reduces the risk of disease.
# - group the ages into ranges that match the ranges provided in the World Population Database
# In[29]:
male_calories = male_calory_demand.drop(columns=['sedentary', 'moderate'])
female_calories = female_calory_demand.drop(columns=['sedentary', 'moderate'])
male_calories.rename(columns={'active':'input kcal'}, inplace=True)
female_calories.rename(columns={'active':'input kcal'}, inplace=True)
# We have now obtained a caloric demand for simpler calculations in the future and stored it in the two previous datasets.
# Now, we need a way to match the age groups in this dataframe to the ones in the population database we obtained. As such, let's analyse how ages are represented in our caloric demand dataframes.
# In[30]:
male_calories['age'].unique()
# We can see that there are ranges of ages with different sizes (which makes sense, because different age groups have different caloric needs). The function `explode_age` returns the dataframe with one row per individual age.
# We apply the function to our two dataframes:
# In[31]:
male_calories = explode_age(male_calories)
female_calories = explode_age(female_calories)
# In[32]:
male_calories['age'].unique()
# Ages are now unique in each dataframe ( `male_calories` and `female_calories` ) and there's a caloric input value for each of them.
# The last step to allow the match with the population database is to build the **same age groups** we have in that set. The `compress_ages` function takes care of the differences between datasets by grouping the **ages** into the same **ranges** as in the population dataset (and calculating the average needs).
# We can lastly apply the function to the dataframes:
# In[33]:
male_calories = compress_ages(male_calories)
female_calories = compress_ages(female_calories)
# We also use the age group as new index and rename the columns:
# In[34]:
male_calories.index.name = 'age_group'
male_calories = male_calories.rename(columns={0: 'input kcal'})
female_calories.index.name = 'age_group'
female_calories = female_calories.rename(columns={0: 'input kcal'})
# Let's have a look at the result we have achieved and collected in our matchable dataframes `male_calories` and `female_calories`. The unit here is **kcal/person/day**.
# In[35]:
male_calories.head()
# In[36]:
# Side-by-side horizontal bar charts: males (left, default color) and
# females (right, pink), with a shared x-axis range for comparability.
fig, axes = plt.subplots(nrows=1, ncols=2)
axes[0].barh(male_calories.index, male_calories["input kcal"])
axes[1].barh(female_calories.index, female_calories["input kcal"],color="#ffc0cb")
axes[0].set_xlabel("Amount of kcal needed for group age in males [kcal/persona/day]")
fig.suptitle("Estimation of energy needed to live an active lifestyle for males and females divided in group ages")
axes[1].set_xlabel("Amount of kcal needed for group age in females [kcal/persona/day]")
axes[1].set_xlim(0, 3300)
axes[0].set_xlim(0, 3300)
axes[0].set_ylabel("age groups")
axes[1].set_ylabel("age groups");
# As we can see, in general, males need more calories than females. For both genders, the maximum demand occurs at the age of 15-30.
# In our data story, we present a pyramid plot to show the same results. The code used is presented below.
# In[ ]:
# Population-pyramid-style plot: male needs on the positive x axis, female
# needs mirrored on the negative side (tick labels hide the sign).
fig = go.Figure()
fig.add_trace(
    go.Bar(
        name="Male",
        orientation="h",
        marker_color="#AED6FF",
        y=male_calories.index,
        x=male_calories['input kcal'],
        # NOTE(fix): this hovertemplate was an unterminated string literal
        # (the "<br>" line-break tag was lost when the notebook was exported).
        hovertemplate="Caloric need: %{x:.2f} kcal/day<br>Age group: %{y}"))
fig.add_trace(
    go.Bar(
        name="Female",
        orientation="h",
        marker_color="#FFDCFD",
        y=female_calories.index,
        x=-female_calories['input kcal'],    # negated to mirror the bars
        text=female_calories['input kcal'],  # keep positive value for the hover text
        # NOTE(fix): same unterminated-string repair as the Male trace above.
        hovertemplate="Caloric need: %{text:.2f} kcal/day<br>Age group: %{y}"))
fig.update_layout(
    barmode='overlay',
    xaxis=dict(autorange=False, range=[-3500, 3500], ticks="",
               tickvals=[-3000, -2000, -1000, 0, 1000, 2000, 3000],
               # absolute values so the mirrored (negative) side reads as positive
               ticktext=[3000, 2000, 1000, 0, 1000, 2000, 3000],
               title="Caloric need [kcal/day]",
               gridcolor="#eee"),
    yaxis=dict(title="Age group"),
    paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)',
    title="Caloric need per age and gender"
)
py.plot(fig, filename="docs/_includes/gender.html", include_plotlyjs=False)
fig.show()
# ### 2.2) How many people live in Africa?
# #### 2.2.1) Preprocessing
# In this second part of our analysis, we load the list of **African countries**. Secondly, we load the **World Population Database** (United Nation) and therefore we obtain two dataframes: one for males and the other one for females.
# In[37]:
# NOTE(fix): the body of this `with` block had lost its indentation in the
# notebook export (a SyntaxError); restored here.
with open(path_dict['african_countries_list'], 'r') as af_c:
    af_countries = [line.rstrip() for line in af_c]  # one country name per line
# We need to check if the FAO Database contains data regarding **every country** in Africa. We will check the intersection with the list **af_countries**.
# In[38]:
# Countries listed in af_countries but absent from the FAO supply data.
af_to_remove = list(set(af_countries) - set(food_supply_africa.columns.values))
print("List of countries for which no data is available: " + str(af_to_remove))
# As expected, many countries were not present in the FAO Database. The countries to remove are the following: **Eritrea, Burundi, Comoros, Democratic Republic of the Congo, Equatorial Guinea, Libya, Seychelles, Western Sahara, South Sudan, and Somalia**. Furthermore, **Mayotte** and **Réunion** are French territory islands so they will be removed as well.
# In[39]:
af_countries = [i for i in af_countries if not i in af_to_remove]
# Recompute to confirm the difference is now empty.
af_to_remove = list(set(af_countries)- set(food_supply_africa.columns.values))
print("List of countries for which no data is available: "+ str(af_to_remove))
# Now we can proceed to load the **population data** and clean it
# In[40]:
#loading the UN population datasets (one sheet per gender estimate)
pop_male = pd.read_excel(path_dict['population_age_male'], sheet_name="ESTIMATES")
pop_female = pd.read_excel(path_dict['population_age_female'], sheet_name="ESTIMATES")
# The function `clean_pop_df` takes care of all the required cleaning. What it does is removing all the unnecessary columns and retaining just the rows of the needed countries. Then it multiplies the value by 1000 (Since the population data is per 1000 people, all entries are multiplied by the same factor (1000) to return to the real value).
# In[41]:
pop_male_africa = clean_pop_df(pop_male, af_countries)
pop_female_africa = clean_pop_df(pop_female, af_countries)
# **World Population Database** (United Nation) is now loaded and cleaned. This preprocessing is necessary in order to sort things out for the next more complex steps.
# Let's have a look at the final version of male population data grouped by age:
# In[42]:
pop_male_africa.head()
# #### 2.2.2) Interpolating the data on African population
# In this context, the population dataframe for males **pop_male** and for females **pop_female** contains measurements of population censuses, for years from 1950 to 2020 with a frequency of **5 years**. Next, data is interpolated in order to obtain values for intermediate years.
#
# For our interpolation method to work, we need to know if the population evolution in these intervals of 5 years is linear. In order to do so, we need to visualize the growth of the population for a group of countries (plotting all of them would be confusing). So we select **3 countries** randomly and plot a simple animation of the growth over time.
# *Note*: the code below is used to generate the HTML animation. If it doesn't load **click on [this link](https://manuleo.github.io/mADAm_files/african_pop_growth.html)**
# In[43]:
IFrame(src='https://manuleo.github.io/mADAm_files/african_pop_growth.html', width = 1200, height=600)
# In[44]:
# # Code kept for reference: generates the population-growth animation above.
# # Selecting countries
# countryrand = []
# n_countries = 3
# for i in range(0, n_countries):
#     countryrand.append(random.choice(pop_male_africa.country.drop_duplicates().values))
# fig = plt.figure()
# animator = animation.FuncAnimation(fig, timeline_country_gender, frames=(range(1950, 2025, 5)),\
#                                    fargs = (pop_male_africa, pop_female_africa, "30-34", countryrand), repeat=False)
# #HTML(animator.to_jshtml())
# with open(path_dict['african_pop_growth'], "w") as f:
#     print(animator.to_html5_video(), file=f)
# The animation shows an almost linear growth for the 3 countries considered (with their respective scale), so we can continue with the interpolation.
# Now we can apply our function `interpolate_years`, a simple linear interpolation, in order to obtain a frequency of **1 year**.
# In[45]:
pop_male_africa = interpolate_years(pop_male_africa, 1950, 2020)
pop_female_africa = interpolate_years(pop_female_africa, 1950, 2020)
# Let's see how the new dataframes for male and female population look:
# In[46]:
pop_male_africa.head()
# In[47]:
pop_female_africa.head()
# #### 2.2.3) Computing the total African population
# Lastly, we will compute the total population per year. This new dataframe **pop_tot** will be useful for the next section of our analysis. `obtain_total_pop` does just this, combining the population in *pop_male_africa* and *pop_female_africa*.
# In[48]:
pop_tot_africa = obtain_total_pop(pop_male_africa, pop_female_africa)
# For the next analysis we will need to match this data with the `food_balance_africa`. We proceed to give to our population data the same shape as the other datasets by using the function `reshape_pop_dataframe`.
# In[49]:
pop_tot_africa = reshape_pop_dataframe(pop_tot_africa)
# In[50]:
pop_tot_africa.head()
# ### 2.3) Estimation of ideal human food demand in Africa
# Now we multiply each column of the population data for each matching `age_group` in the calories table (which is squeezed to enable multiplication, similar to transposing rows/columns of the dataset). All of this is done in `get_calories_need`
# We obtain two datasets: `male_cal_need_africa` and `female_cal_need_africa` reporting total calories needed for **each country in each year per age group per gender**.
# The unit here is **kcal/day**.
# In[51]:
male_cal_need_africa = get_calories_need(pop_male_africa, male_calories)
male_cal_need_africa.head()
# In[52]:
#total calories needed by the female population
female_cal_need_africa = get_calories_need(pop_female_africa, female_calories)
female_cal_need_africa.head()
# Once we have the calories needed for both genders, we can aggregate the total caloric need of african countries into `total_cal_need_africa`. The function `obtain_total_cal_need` does just this, returning a dataframe with the calories needed for all the population of a country in a year, in the unit **kcal/year**.
# In[53]:
total_cal_need_africa = obtain_total_cal_need(male_cal_need_africa, female_cal_need_africa)
# Let's take a look at the total calories dataframe **total_cal**:
# In[54]:
total_cal_need_africa.sort_values(by="Calories", ascending=False).head()
# For the sake of consistency, we will now reshape our dataframe according to the same schema seen for `food_supply_africa`.
# In[55]:
total_cal_need_africa = reshape_calories_df(total_cal_need_africa)
# Drawing a sample of the final shaped dataframe total calories `total_cal_need_africa`:
# In[56]:
total_cal_need_africa.head()
# Let's go on with an interactive visualization of the data in order to understand the trend over countries. The following map is based on data for **2020**.
# However, it's important also to understand the possible changes over time, so we plot the same map over the range from 1970 to 2020 (step of 10 years).
# **If the map for 2020 doesn't show click [here](https://manuleo.github.io/mADAm_files/africa_est_kcal2020.html)**
# Link to the other maps:
# - [1970](https://manuleo.github.io/mADAm_files/africa_est_kcal1970.html)
# - [1980](https://manuleo.github.io/mADAm_files/africa_est_kcal1980.html)
# - [1990](https://manuleo.github.io/mADAm_files/africa_est_kcal1990.html)
# - [2000](https://manuleo.github.io/mADAm_files/africa_est_kcal2000.html)
# - [2010](https://manuleo.github.io/mADAm_files/africa_est_kcal2010.html)
# In[57]:
# Code used just for generating the map
# NOTE(fix): the body of this `for` loop had lost its indentation in the
# notebook export (a SyntaxError); restored here, with the constant legend
# string hoisted out of the loop (matching the supply-map loop above).
legend_name = "Estimation of kcal/year [10^11 kcal/year]"
for year in range(1970, 2030, 10):
    africa_kcal_est_map = plot_map(total_cal_need_africa.divide(10**11), path_dict['geoworld_json'],
                                   african_country_kv, year, "Greens", legend_name, legend_name,
                                   path_dict['african_estimation_kcal'] + str(year) + ".html")
africa_kcal_est_map  # display the last generated (2020) map inline
# In[58]:
IFrame(src='https://manuleo.github.io/mADAm_files/africa_est_kcal2020.html', width = 800, height=600)
# In[59]:
# Export per-year geojson + colour ticks consumed by the data-story site.
save_map_data(geojson_africa, african_country_kv, total_cal_need_africa.divide(10**11).T, "docs/json/africa_need/africa_need_{}.geojson", "docs/json/africa_need/africa_need_ticks.json")
# By looking at the **scale change** over the years, it's possible to note how the needs of the African population have constantly grown (connected to the increase in population too). This is not reflected in the African food supply, which doesn't increase at the same rate as the population.
# ## 3) Which countries are in food deficit?
# Next, an interesting comparison is introduced between the two dataframes we have obtained in the first two parts of our analysis. More specifically, the analysis will take into account the total population dataframe `pop_tot_africa` and the `food_supply_africa`. With regard to the FAO Dataframe of food supply, we will need to transform the unit in **kcal/year** in order to compare results appropriately.
# The function `obtain_difference` takes into account our dataframes to compute which countries have enough caloric food supply to actually meet their needs.
# In[60]:
caloric_difference_africa = obtain_difference(pop_tot_africa, food_supply_africa, total_cal_need_africa)
caloric_difference_africa.head()
# ### 3.1) Visualizing the data
# Let's start by doing a simple barplot of the deficit per persona/year in each country. As our main point of interest is the present, we will start with a graph showing next year's situation:
# In[61]:
# Horizontal bars sorted by 2020 value; green = surplus, red = deficit.
caloric_difference_africa_sorted = caloric_difference_africa[2020].sort_values()
p = caloric_difference_africa_sorted.plot(kind='barh', color=(caloric_difference_africa_sorted > 0).map({True: 'g', False: 'red'}),alpha=0.75, rot=0);
p.set_xlabel("deficit/surplus in the african countries [kcal/persona/day]")
p.set_ylabel("African countries")
plt.title('Estimation of net food availability in African countries in 2020' );
# This plot already suggests that only by smart redistribution, Africa could sustain its own food demand. However, having the capabilities and know-how to efficiently set up a food aid operation is harder than it seems. European countries on the other hand have a lot more experience in this field, and they are expected to have an even higher amount of excess food, making it easier to provide for this whole operation. Therefore, they will be considered to provide the difference in African countries.
# For a better visualization, it is convenient to see the **evolution of the kcal demand** for all the years of interest. By using `draw_demand_bar`, we will plot the information for every year combined with an animation to move back and forth in time.
# **_Note_: the code below is used to generate the HTML animation. If the animation doesn't work inside the notebook, click on [this link](https://manuleo.github.io/mADAm_files/african_cal_diff_animation.html)**
# In[62]:
# # # Code kept for reference: generates the yearly demand animation above.
# fig = plt.figure()
# animator = animation.FuncAnimation(fig, draw_demand_bar, frames=range(1961, 2021),\
#                                    fargs=(caloric_difference_africa, ),
#                                    repeat=False);
# #HTML(animator.to_jshtml())
# with open(path_dict['african_demand_anim'], "w") as f:
#     print(animator.to_html5_video(), file=f)
# In[63]:
IFrame(src='https://manuleo.github.io/mADAm_files/african_cal_diff_animation.html', width = 1100, height=600)
# As the animation shows, the African situation **considerably improved** over the last 60 years, starting from a share of **75%** of starving countries in 1961 to "just" **21%** in 2020.
# During the years the situation improved unevenly. The improvement was modest up to the nineties, then changed dramatically from 2000 to now.
# For our data story we plotted the same animation in a slider manner, by using the function below for Africa's and Europe's surplus situation.
# In[ ]:
def get_plotly_progress(dataframe, zone, title, width=800, height=800, mini=-1100, maxi=1500, size=20):
    """Build and publish an interactive horizontal bar chart with a year slider (1961-2020).

    Parameters
    ----------
    dataframe : pandas.DataFrame
        Per-country caloric difference, one column per year (keys 1961..2020).
    zone : str
        Suffix used to name the exported HTML file (e.g. Africa or Europe).
    title : str
        Figure title.
    width, height : int
        Figure size in pixels.
    mini, maxi : int
        Fixed x-axis range (kcal/person/day) so all years share one scale.
    size : int
        NOTE(review): currently unused -- the bar label font size is hard-coded
        to 100 below; confirm whether textfont=dict(size=size) was intended.

    Side effects: writes an HTML snippet under docs/_includes/ and shows the figure.
    """
    fig = go.Figure()
    # One hidden trace per year; the slider toggles which single trace is visible.
    for year in range(1961, 2021):
        # colors_and_order (project helper) returns sorted values plus a color per bar.
        x, y, colors = colors_and_order(dataframe[year])
        fig.add_trace(
            go.Bar(visible=False,
                   width=0.85,
                   orientation="h",
                   x=x,
                   y=y,
                   text=y,
                   hovertemplate="%{x} kcal",
                   textposition='outside',
                   textfont=dict(size=100),
                   marker_color=colors,
                   name=str(year)))
    # Only the last trace (2020) starts visible, matching active=59 on the slider.
    fig.data[-1].visible=True
    # One slider step per trace: each step hides every trace, then re-shows trace i.
    steps = []
    for i in range(len(fig.data)):
        step = dict(
            method="restyle",
            args=["visible", [False] * len(fig.data)],
            label=str(i+1961),
        )
        step["args"][1][i] = True # Toggle i'th trace to "visible"
        steps.append(step)
    sliders = [dict(
        active=59,
        currentvalue={"prefix": "Year: "},
        pad={"t": 50},
        steps=steps,
    )]
    fig.update_layout(
        title=dict(text=title, y=0.94, x=0.03, font=dict(size=15)),
        sliders=sliders,
        xaxis=dict(autorange=False, range=[mini, maxi], title="caloric surplus (kcal/person/day)", showline=True),
        yaxis=dict(autorange=True, showgrid=False, ticks='', showticklabels=False),
        margin=dict(t=30, l=5, b=10, r=5),
        autosize=False,
        width=width,
        height=height,
        # Transparent backgrounds so the chart blends into the data-story page.
        paper_bgcolor='rgba(0,0,0,0)',
        plot_bgcolor='rgba(0,0,0,0)'
    )
    # Export a standalone snippet for the site, then render inline in the notebook.
    py.plot(fig, filename='docs/_includes/cal_diff_{}.html'.format(zone), include_plotlyjs=False)
    fig.show()
# Let's now move on to a more understandable map visualization. As usual, we are interested in **2020** for our endpoint, but also on the changing over the years.
# **_Note_: if it doesn't show, the 2020 map can be found [here](https://manuleo.github.io/mADAm_files/african_kcal_need2020.html)**
# Links to the other years:
# - [1970](https://manuleo.github.io/mADAm_files/african_kcal_need1970.html)
# - [1980](https://manuleo.github.io/mADAm_files/african_kcal_need1980.html)
# - [1990](https://manuleo.github.io/mADAm_files/african_kcal_need1990.html)
# - [2000](https://manuleo.github.io/mADAm_files/african_kcal_need2000.html)
# - [2010](https://manuleo.github.io/mADAm_files/african_kcal_need2010.html)
# In[64]:
# Choropleths of the African caloric deficit, one map per decade (1970..2020).
legend_name = "Estimation of kcal/persona/day deficit"
for year in range(1970,2030,10):
    # Custom bins: below 0 is a deficit; 200 and 450 split the surplus range.
    bins = [min(caloric_difference_africa[year]), 0, 200, 450, max(caloric_difference_africa[year])]
    african_kcal_need_map = plot_map(caloric_difference_africa, path_dict['geoworld_json'], \
        african_country_kv, year, "RdYlGn", legend_name, legend_name, path_dict['african_kcal_need'] + str(year) + ".html", bins)
# Display the last map generated (2020).
african_kcal_need_map
# In[65]:
IFrame(src='https://manuleo.github.io/mADAm_files/african_kcal_need2020.html', width = 800, height=600)
#
#
# We found an analysis here about the GDP (Gross Domestic Product) in African countries, which we think can be a good way to evaluate the correctness of our analysis. We show to the left the heat map of Africa in terms of each countries GDP per capita.
#
# Observing the evolution over the last 50 years allows us to better understand and comment on the animation above. As we can see from the 1970 map, almost all of Africa lived in starving conditions, with the lowest peak in Guinea-Bissau (a huge deficit of 700 kcal/persona/day needed).
#
# Individual case studies prove the data's accuracy. For example, the Ethiopian famine in the 80's is impressively reproduced and manifests itself as Ethiopia drops to the lowest rank. In general, a very positive development can be observed, as more and more countries manage to reach a surplus regime every single decade, with the highest increase occurring just recently in the last 10 years.
#
# Ever since 2001, more than half of the examined countries show a net surplus. As of 2019, all of the countries in red were determined to be either war-riddled or politically fragile. Exceptions to the rule are Namibia and Eswatini, which both boast a relatively high GDP per capita (ranked 10th and 11th for the African continent, respectively). Thus, the only explanation would be inequality amongst the population or insufficient distribution of available resources.
#
#
# # Determination of European countries with food surplus
# ## 1) How much human food resources are available for European countries?
# ### 1.1) Preprocessing
# To answer this important question, we will need to import data from the **FAO Dataset**. More specifically, we will focus on the section **Food Balance Sheet** with respect to European countries only.
# In[66]:
# Load the FAO Food Balance Sheet for Europe (source file is latin-1 encoded).
food_balance_europe = pd.read_csv(path_dict['food_balance_europe'],encoding='latin-1', low_memory=False);
# European countries will be analysed following the same strategy we used for African countries in order to be consistent also in the way by which we assess whether countries are in deficit or have a surplus. To start off, we will:
# - **remove** all the columns with title **"Y----F"**.
# - **replace** all the **NAN** values with **0** as Item was not available.
# In[67]:
food_balance_europe = clean_Fs_and_years(food_balance_europe)
# Missing entries mean the item was not available for that country/year -> zero supply.
food_balance_europe = food_balance_europe.fillna(0);
# The third step to complete **the cleaning** of food_balance_europe consists of adapting names of countries in order to have consistency across our different dataframes.
#
# The easiest of these changes that we observe in our dataframe is **The former Yugoslav Republic of Macedonia** should become **North Macedonia**.
# In[68]:
# Harmonize country names across our dataframes (project helper).
food_balance_europe = replace_names_of_countries(food_balance_europe, [("The former Yugoslav Republic of Macedonia", "North Macedonia")])
# Our Dataframe looks like this:
# In[69]:
food_balance_europe.head()
# Given the European countries analysis, and since the structure of this dataset is equivalent to that one, we can again obtain the pairs **(Item Code, 2901) and (Element Code, 664)** for our Europe analysis.
# In[70]:
# Keep only the total food supply rows (Item Code 2901 / Element Code 664).
food_supply_europe = obtain_supply(food_balance_europe)
# We can now group by **Area** and see the supplies derived from each item available in countries for each particular year.
# In[71]:
food_supply_europe = food_supply_europe.set_index("Area")
food_supply_europe.head()
# In order to check for anomalies in our data, we would like to analyze the **timeline**. We therefore transpose the dataframe and plot the timeline of how food supply in different countries evolved. Legend was suppressed as it is too large.
# In[72]:
# After the transpose: rows = years, columns = countries.
food_supply_europe = food_supply_europe.transpose();
# In[73]:
#converting the year from string to int
food_supply_europe.index = food_supply_europe.index.astype(int)
# In[74]:
timeline_supply(food_supply_europe, "European")
# We can observe a similar situation in this graph as we did in the African countries analysis. As such, we can assume that some countries may have changed names or maybe some other situations occured, which caused countries data to stop being recorded/collected.
#
# Let's then see which countries have zeros in the values for food supplies.
# In[75]:
# Countries with at least one zero year (data gaps, typically dissolved unions).
food_supply_europe.columns.values[(food_supply_europe == 0).any()]
# Respecting the chronology, we observe that there were some countries for which data stopped being collected, as well as others for which data started being collected after the initial collection year of 1950. After some research on these countries, we found out that:
# - **USSR** (Union of Soviet Socialist Republics) was a union of a lot of countries, some of which were not even located in Europe. The Union was dissolved in 26 December 1991, and so the countries **Belarus**, **Ukraine**, **Estonia**, **Republic of Moldova**, **Russian Federation**, **Latvia**, **Lithuania** are the European countries that became independent in Europe.
# - **Yugoslav SFR** was made up of the countries that became **Bosnia and Herzegovina**, **Croatia**, **Serbia and Montenegro**, **North Macedonia**, **Slovenia**, all of which obtained their independence between 25 June 1991 - 27 April 1992
# - **Czechoslovakia** was dissolved on 1 January 1993, after a period called The Velvet Revolution (because of the peaceful ways), and split into the two countries **Czechia** and **Slovakia**
# - **Serbia and Montenegro** separated themselves in 2006, breaking up the last union still recognized as a successor of Yougoslavia
# - **Belgium-Luxembourg** (2003) does not have a main reason to exist, as there is no major event happening in 2003 that would justify the mixing of the two countries. Our understanding lead us to believe this was probably just a grouping created in FAO when collecting the data from the two countries.
#
# Let's then take care of all these cases..
# In[76]:
# Map each dissolved political union to its successor states. Insertion order is
# preserved deliberately: 'Serbia and Montenegro' is first produced as a successor
# of Yugoslav SFR and only afterwards split into Serbia and Montenegro.
countries_to_merge = dict([
    ('USSR', ['Belarus', 'Ukraine', 'Estonia', 'Republic of Moldova', 'Russian Federation', 'Latvia', 'Lithuania']),
    ('Yugoslav SFR', ['Bosnia and Herzegovina', 'Croatia', 'Serbia and Montenegro', 'North Macedonia', 'Slovenia']),
    ('Czechoslovakia', ['Czechia', 'Slovakia']),
    ('Serbia and Montenegro', ['Serbia', 'Montenegro']),
    ('Belgium-Luxembourg', ['Luxembourg', 'Belgium']),
])
# Attribute each dissolved union's figures to its successor states (project helper).
food_supply_europe = merge_countries(food_supply_europe, countries_to_merge)
# In[77]:
timeline_supply(food_supply_europe, "European", " - Cleaned dataset")
# We observe a drop in the *food supply* from most countries that came from the former USSR, which we find to be interesting. While it might not be the case for all of these countries, some of them are very poor and as such, when separated from the world power that was the USSR, their ability to provide and produce for themselves may have dropped significantly, which is coherent with what we observe. As such, we continue our analysis with these values.
# ### 1.2) Interpolation
# Next, we want to add more columns representing future years until 2020 to prepare cells for extrapolation to make predictions about possible scenarios.
# In[78]:
# Add empty columns for the years after 2014 up to 2020, to be filled by the model.
food_supply_europe = prepare_future(food_supply_europe, 2014, 2020)
# In[79]:
# Fill the future years with RNN predictions (pre-trained weights loaded from disk).
food_supply_europe = predict_future(food_supply_europe, path_dict['europe_supply_rnn'])
# In[80]:
timeline_supply(food_supply_europe, "European", " - Predicted dataset")
# Similar to the analysis of African countries, the same conclusions can be drawn about using the neural network method.
# Visualizing our results with an interactive heatmap:
# In[81]:
# ISO 3166-1 alpha-3 codes, in the same (alphabetical) order as the countries in
# food_supply_europe, used to match countries against the world GeoJSON ids.
# "MALTAMISSING" is a deliberate sentinel: Malta is absent from the GeoJSON (see below).
# Fix: the list previously contained "SWK", which is not a valid alpha-3 code --
# Slovakia is "SVK", so Slovakia was silently dropped by the isin() filter below.
european_country_codes = ["ALB","AUT","BLR","BEL","BIH","BGR","HRV","CZE","DNK","EST","FIN","FRA","DEU","GRC","HUN","ISL","IRL",
                          "ITA","LVA","LTU","LUX","MALTAMISSING","MNE","NLD","NOR","POL","PRT","MDA","ROU","RUS","SRB","SVK",
                          "SVN","ESP","SWE","CHE","MKD","UKR","GBR"]
european_country_names = food_supply_europe.T.index.values
# Pair codes with names positionally -- assumes the codes list above follows the
# same (alphabetical) order as the dataframe's countries; TODO confirm on update.
european_country_kv = pd.DataFrame({'codes': european_country_codes,
                                    'names': european_country_names
                                    })
# Keep only the European features of the world GeoJSON.
geojson_europe = geojson_world[geojson_world.id.isin(european_country_codes)]
geojson_europe.head()
# We notice, after our manual analysis on the countries that **Malta** is not displayed in the json. Hence, we won't consider it for our analysis
# In[82]:
food_supply_europe = food_supply_europe.drop(columns=["Malta"])
# Moving into the plot of the European countries to see their actual and past supply.
# **_Note_: the map for 2020 (in case it doesn't load) is available [here](https://manuleo.github.io/mADAm_files/european_supply_map2020.html)**
# Links to other years:
# - [1970](https://manuleo.github.io/mADAm_files/european_supply_map1970.html)
# - [1980](https://manuleo.github.io/mADAm_files/european_supply_map1980.html)
# - [1990](https://manuleo.github.io/mADAm_files/european_supply_map1990.html)
# - [2000](https://manuleo.github.io/mADAm_files/european_supply_map2000.html)
# - [2010](https://manuleo.github.io/mADAm_files/european_supply_map2010.html)
# In[83]:
# Plotting
# Choropleths of the European food supply, one map per decade (1970..2020).
legend_name = "Food supply in Europe (kcal/person/day)"
for year in range(1970, 2030, 10):
    europe_supply_map = plot_map(food_supply_europe.T, path_dict['geoworld_json'], \
        european_country_kv, year, "Blues", legend_name, legend_name, path_dict['european_supply_map'] + str(year) + ".html", bins=9)
# Display the last map generated (2020).
europe_supply_map
# In[84]:
IFrame(src='https://manuleo.github.io/mADAm_files/european_supply_map2020.html', width = 1000, height=600)
# In[85]:
# Export GeoJSON + legend-tick files consumed by the data-story website.
save_map_data(geojson_europe, european_country_kv, food_supply_europe, "docs/json/europe_supply/europe_supply_{}.geojson", "docs/json/europe_supply/europe_supply_ticks.json", bins=6)
# Analyzing the data shows a trend **scaled up but mostly similar** to the African case. The available food supply in Europe over the last 50 years remained mostly the same, with an available supply between 2500 and 3800 kcal/persona/day.
# Interesting to note the drop the ex USSR had in 2000 after the division, drop already emerged during the previous cleaning of the dataset.
# ## 2) What is the ideal amount of kcal each European country needs?
# Once again, we do a very similar analysis on European countries as we did for the African ones.
# ### 2.1) How many people live in Europe?
# #### 2.1.1) Preprocessing
# In this second part of our analysis, we load the list of **European countries**. Secondly, we load the **World Population Database** (United Nation) and therefore we obtain two dataframes: one for males and the other one for females.
# In[86]:
# Load the reference list of European countries, one name per line.
with open (path_dict["european_countries_list"],'r') as eu_c:
    eu_countries = [line.rstrip() for line in eu_c] #loading list
# We need to check if the FAO Database contains data regarding **every country** in Europe. We will check the intersection with the list `eu_countries`.
# In[87]:
# Countries present in our reference list but absent from the FAO supply data.
eu_to_remove = list(set(eu_countries) - set(food_supply_europe.columns.values))
print("List of countries for which no data is available: " + str(eu_to_remove))
# Because there are less countries in Europe, and also because most European countries are part of ONU, we expected most countries to be present in both the FAO database and the population database. These **Channel Islands** are a small set of islands in the English Channel, and because they are so small, we can safely remove them from our scope of interest.
# In[88]:
# Drop the unmatched countries, then re-check: the second pass must print an empty list.
# (Idiom fix: `i not in` rather than `not i in`.)
eu_countries = [i for i in eu_countries if i not in eu_to_remove]
eu_to_remove = list(set(eu_countries) - set(food_supply_europe.columns.values))
print("List of countries for which no data is available: " + str(eu_to_remove))
# We now obtain the population for each gender in all european countries.
# In[89]:
# Restrict the UN population tables to European countries and tidy age groups (project helper).
pop_male_europe = clean_pop_df(pop_male, eu_countries)
pop_female_europe = clean_pop_df(pop_female, eu_countries)
# Let's have a look at the final version of male population data grouped by age:
# In[90]:
pop_male_europe.head()
# #### 2.1.2) Interpolation
# Similarly to the analysis performed on Africa, we once again interpolate the years with a frequency of **5 years** in *pop_male_europe*, to have a frequency of 1 year. Before doing so, we check again if the evolution in these 5 years intervals occurs in a linear manner.
# **_Note_: as usual, if the animation is not visible in the notebook, click [here](https://manuleo.github.io/mADAm_files/european_pop_growth.html)**
# In[91]:
IFrame(src='https://manuleo.github.io/mADAm_files/european_pop_growth.html', width = 1200, height=600)
# In[92]:
# # Selecting countries
# countryrand = []
# n_countries = 2
# for i in range(0, n_countries):
# countryrand.append(random.choice(pop_male_europe.country.drop_duplicates().values))
# fig = plt.figure()
# animator = animation.FuncAnimation(fig, timeline_country_gender, frames=(range(1950, 2025, 5)),\
# fargs = (pop_male_europe, pop_female_europe, "30-34", countryrand), repeat=False)
# #HTML(animator.to_jshtml())
# with open(path_dict['european_pop_growth'], "w") as f:
# print(animator.to_html5_video(), file=f)
# Even if the growth is not so linear as a whole as before, the animation clearly shows that it is linear inside each group of 5 year. As our interpolation works by interpolating on each of these groups, we can proceed with it
# In[93]:
# Linearly interpolate the 5-year UN population steps down to yearly values.
pop_male_europe = interpolate_years(pop_male_europe, 1950, 2020)
pop_female_europe = interpolate_years(pop_female_europe, 1950, 2020)
# Let's see how the new dataframes for males and females population look like:
# In[94]:
pop_male_europe.head()
# In[95]:
pop_female_europe.head()
# #### 2.1.3) Computing total European population
# Lastly, we will compute the total population per year. This new dataframe **pop_tot** will be useful for the next section of our analysis.
# In[96]:
# Sum male and female populations into a single per-country, per-year total.
pop_tot_europe = obtain_total_pop(pop_male_europe, pop_female_europe)
# For the next analysis we will need to match this data with the `food_balance_europe`. We proceed to give to our population data the same shape as the other dataset
# In[97]:
# Reshape to the same layout as the food-balance data (countries x years).
pop_tot_europe = reshape_pop_dataframe(pop_tot_europe)
# In[98]:
pop_tot_europe.head()
# ## 3) Estimation of ideal human food demand in Europe
# Now we multiply each column of the population data for each matching `age_group` in the calories table (that here we squeeze to allow the multiplication, similar to a transpose rows/columns of the dataset).
# We obtain two datasets: `male_cal_need_europe` and `female_cal_need_europe` reporting total calories needed for **each country in each year per age group per gender**.
# The unit here is **kcal/day**.
# In[99]:
#total calories male
# Multiply population by per-age-group caloric requirements (kcal/day).
male_cal_need_europe = get_calories_need(pop_male_europe, male_calories)
male_cal_need_europe.head()
# In[100]:
#total calories female
female_cal_need_europe = get_calories_need(pop_female_europe, female_calories)
female_cal_need_europe.head()
# Once we have the calories needed for both gender, we can add them together easily to achieve total calories needed for **each country in each year**, and we collect them in the dataframe `total_cal_need_europe`. The unit is **kcal/year**. All of this is done by the `obtain_total_cal_need` function
# In[101]:
# Combine genders into one total need per country and year (kcal/year).
total_cal_need_europe = obtain_total_cal_need(male_cal_need_europe, female_cal_need_europe)
# Let's take a look at total calories dataframe **total_cal**:
# In[102]:
total_cal_need_europe.sort_values(by="Calories", ascending=False).head()
# For the sake of consistency, we will now reshape our dataframe `total_cal_need_europe` according to the same schema seen for `food_supply_europe`.
# In[103]:
total_cal_need_europe = reshape_calories_df(total_cal_need_europe)
# Drawing a sample of the final shaped dataframe total calories `total_cal`:
# In[104]:
total_cal_need_europe.head()
# We can proceed to plot the results inside a map visualization. The most interesting one is as usual the **2020**, but also analyzing past years is crucial to understanding how the trend changed.
# **_Note_: the link to 2020 in case it won't show up is available [here](https://manuleo.github.io/mADAm_files/europe_est_kcal2020.html)**
# Links for other years:
# - [1970](https://manuleo.github.io/mADAm_files/europe_est_kcal1970.html)
# - [1980](https://manuleo.github.io/mADAm_files/europe_est_kcal1980.html)
# - [1990](https://manuleo.github.io/mADAm_files/europe_est_kcal1990.html)
# - [2000](https://manuleo.github.io/mADAm_files/europe_est_kcal2000.html)
# - [2010](https://manuleo.github.io/mADAm_files/europe_est_kcal2010.html)
# In[105]:
# Choropleths of the estimated European caloric need (scaled to 10^11 kcal/year),
# one map per decade (1970..2020).
for year in range(1970,2030,10):
    legend_name = "Estimation of kcal/year [10^11 kcal/year]"
    europe_kcal_est_map = plot_map(total_cal_need_europe.divide(10**11), path_dict['geoworld_json'], \
        european_country_kv, year, "YlGn", legend_name, legend_name, path_dict['european_estimation_kcal'] + str(year) + ".html", bins=9)
# Display the last map generated (2020).
europe_kcal_est_map
# In[106]:
IFrame(src='https://manuleo.github.io/mADAm_files/europe_est_kcal2020.html', width = 1000, height=600)
# In[107]:
# Export GeoJSON + legend-tick files consumed by the data-story website.
save_map_data(geojson_europe, european_country_kv, total_cal_need_europe.divide(10**11).T, "docs/json/europe_need/europe_need_{}.geojson", "docs/json/europe_need/europe_need_ticks.json", bins=9)
# As we can see, the increasing in population is reflected in increasing of needed kcal in Europe.
# ## 4) Which countries have more than they need?
# In this Europe evaluation, we flip the scope of our analysis completely. While in the case of Africa, we want to know which countries need most help (as in, are not producing enough to sustainably survive), for Europe we want to find out which countries are producing more food internally than what they need. The point of this analysis is to find out who could help the African countries in need, by giving away some of their production, while still keeping at least a minimum to be healthy.
#
# As with the African analysis, this analysis will take into account the total population dataframe `pop_tot` and the `food_supply_europe`. With regards to the FAO Dataframe of food supply, we will need to transform the unit into **kcal/year** in order to compare results appropriately.
# In[108]:
# Per-country, per-year surplus/deficit: available supply minus ideal caloric need.
caloric_difference_europe = obtain_difference(pop_tot_europe, food_supply_europe, total_cal_need_europe)
# In[109]:
caloric_difference_europe.head()
# In[110]:
# Countries whose 2020 supply falls below their need (result: none in Europe).
caloric_difference_europe[caloric_difference_europe[2020].values < 0].index
# When running the exact same analysis on European countries as we did in African ones, it's interesting to observe that in Europe, no country at all is producing less than what they actually need. As such, every single European country's supply is actually greater than its demand, and should in theory be able to solve the hunger issue in Africa.
# ### 4.1) Visualizing the data
# We can now proceed to the same visualization we did before, this time with the scope of **visualizing the 2020 situation and evaluating if there has been a deficit of kcal over the past years**
# In[111]:
# Horizontal bar chart of the 2020 European surplus/deficit, sorted ascending:
# green bars for a surplus, red for a deficit (none expected in Europe).
caloric_difference_europe_sorted = caloric_difference_europe[2020].sort_values()
bar_colors = ['g' if value > 0 else 'red' for value in caloric_difference_europe_sorted]
ax = caloric_difference_europe_sorted.plot(kind='barh', color=bar_colors, alpha=0.75, rot=0)
ax.set_xlabel("deficit/surplus in the european countries [kcal/persona/day]")
ax.set_ylabel("European countries")
plt.title('Estimation of net food availability in European countries in 2020');
# As already noticed before, European countries boast a very high surplus of food supplied. We can now use the animation we used before to assess the situation over the past years and see if Europe had deficits:
# In[112]:
# fig = plt.figure()
# animator = animation.FuncAnimation(fig, draw_demand_bar, frames=range(1961, 2021),\
# fargs=(caloric_difference_europe, ),
# repeat=False);
# #HTML(animator.to_jshtml())
# with open(path_dict['european_demand_anim'], "w") as f:
# print(animator.to_html5_video(), file=f)
# In[113]:
IFrame(src='https://manuleo.github.io/mADAm_files/european_cal_diff_animation.html', width = 1200, height=600)
# As we can see, **Europe has experienced high affluence periods** since 1961. Over the past 50 years, only in three years some countries presented a really small deficit in calories:
# - Albania in 1961 and 1963
# - Moldova, Macedonia and Croatia directly after their independence (this situation already emerged during the food supply analysis)
# Finally, we draw a map of the European situation. Year **2020** is as usual the most important one, especially for Europe, as we are interested in their actual surplus for the rest of our analysis and because there has been basically no significant change over the last 50 years.
# **_Note_: map for 2020 is available [here](https://manuleo.github.io/mADAm_files/european_kcal_surplus2020.html)**
# Links for other years:
# - [1970](https://manuleo.github.io/mADAm_files/european_kcal_surplus1970.html)
# - [1980](https://manuleo.github.io/mADAm_files/european_kcal_surplus1980.html)
# - [1990](https://manuleo.github.io/mADAm_files/european_kcal_surplus1990.html)
# - [2000](https://manuleo.github.io/mADAm_files/european_kcal_surplus2000.html)
# - [2010](https://manuleo.github.io/mADAm_files/european_kcal_surplus2010.html)
# In[114]:
# Choropleths of the European caloric surplus, one map per decade (1970..2020).
legend_name = "Estimation of kcal/persona/day surplus"
for year in range(1970,2030,10):
    europe_kcal_surplus_map = plot_map(caloric_difference_europe, path_dict['geoworld_json'], \
        european_country_kv, year, "Greens", legend_name, legend_name, path_dict['european_kcal_surplus'] + str(year) + ".html", bins=8)
# Display the last map generated (2020).
europe_kcal_surplus_map
# In[115]:
IFrame(src='https://manuleo.github.io/mADAm_files/european_kcal_surplus2020.html', width = 1000, height=600)
# As a consequence of the welfare in which Europe has been living since the post-war period, the available surplus hasn't changed a lot over the years. However, a little increase is notable from 1970 to now.
# With the information we gathered so far, we can proceed to a **comparison between Africa and Europe**, to then move on to an analysis on which European countries could potentially help African countries in need.
# # Which European countries can help Africa?
# We have easily noticed that in Europe, every country has more food than what they need to healthily survive. We will now plot a map that shows this difference more noticeably.
# In[116]:
# Join the European and African code/name tables into one lookup, sorted by name.
# Fix: DataFrame.append was deprecated and removed in pandas 2.0 -- use pd.concat
# (already used below for the caloric-difference frames). concat returns a new
# frame, so the previous intermediate .copy() is no longer needed.
afr_eu_country_kv = pd.concat([european_country_kv, african_country_kv])
afr_eu_country_kv = afr_eu_country_kv.sort_values(by='names')
# Restrict the world GeoJSON to the countries we actually plot.
geojson_afr_eu = geojson_world[geojson_world.id.isin(afr_eu_country_kv.codes)]
geojson_afr_eu.head()
# In[117]:
# Stack the African and European per-country caloric differences into a single
# frame, ordered alphabetically by country name.
caloric_difference_world = pd.concat([caloric_difference_africa, caloric_difference_europe]).sort_index()
# Below we plot the surplus/deficit of calories in Africa and Europe in the year of **2020**, for the comparison mentioned before.
# In[118]:
# Combined Africa + Europe surplus map for 2020.
legend_name = "Estimation of kcal/persona/day surplus"
# Bins: deficit below 0, then 300/600/900 kcal/person/day surplus thresholds.
bins = [min(caloric_difference_world[2020]), 0, 300, 600, 900, max(caloric_difference_world[2020])]
# Fix: the original passed the loop variable `year` left over from a previous cell
# (it only happened to equal 2020). Pass 2020 explicitly so the plotted year always
# matches the "2020" in the output file name.
afr_eu_kcal_surplus_map = plot_map(caloric_difference_world, path_dict['geoworld_json'], \
    afr_eu_country_kv, 2020, "RdYlGn", legend_name, legend_name, path_dict['world_kcal_surplus'] + str(2020) + ".html", bins)
afr_eu_kcal_surplus_map
# In[119]:
IFrame(src='https://manuleo.github.io/mADAm_files/world/world_kcal_surplus2020.html', width = 1100, height=800)
# In[120]:
# Export per-year GeoJSON + legend-tick files for the data-story site (from 2020 on).
save_map_data(geojson_afr_eu, afr_eu_country_kv, caloric_difference_world.T, "docs/json/cal_world/cal_world_{}.geojson", "docs/json/cal_world/cal_world_ticks.json",\
              bins=[0, 300, 600, 900], year_start = 2020)
# This map illustrates the previously stated assumption that Europe is in a much better food situation than African countries in general. This is obviously observed by checking that there are no countries in Europe with a color "lower" than yellow-ish green, while in Africa we observe multiple countries colored in orange and even red! These african countries which don't have enough food to feed some of their citizens are the ones we consider to be in most dire need of help, and thus we will try to come up with opportunities of how to help them in the future.
# Another interesting thing to observe is that there seems to be a line dividing the hunger in the world in North and South. By looking at the plots generated by the above cell (located in *visualization/world*), we notice that this line has been going more and more south over the decades, which means that the hunger situation is slowly being solved as years go by. The main reason we think that such a line exists and it's going south is that the countries in the north of Africa are developing more rapidly than their subsaharan counterparts, and are subsequently able to provide better quality of life for their citizens.
# Furthermore, comparing the countries in Europe which have the highest surplus to the African countries with the highest deficit would be of interest as well. We consider the countries in need of help as those that do not have a surplus of at least **300** kcal/person/day (orange and red in the map). This selection makes sense because the values we calculate are averaged over the whole population, and if a country has a low surplus, it may mean that a big slice of the population are still starving.
# In[121]:
# The countries with deficit in calories
# African countries below the 300 kcal/person/day comfort threshold (worst first after sort).
african_countries_to_help = caloric_difference_africa[caloric_difference_africa[2020].values < 300][2020].sort_values(ascending = False)
number_of_need_of_help = len(african_countries_to_help.index)
# In[122]:
# The countries in Europe with highest surplus in calories
# Take the same number of European countries from the top of the 2020 surplus ranking.
caloric_difference_highest_europe = caloric_difference_europe[2020].sort_values(ascending=False).head(number_of_need_of_help).sort_values(ascending=True)
# In[123]:
# Side-by-side bar charts: African countries in need (left) vs. highest European
# per-person surpluses (right).
grid = plt.GridSpec(1,2)
# negative african countries
plt.subplot(grid[0, 0]);
# Orange = low surplus (0..300 kcal/person/day), red = actual deficit (< 0).
p = african_countries_to_help.plot(kind='barh', color=(african_countries_to_help > 0).map({True: 'orange', False: 'red'}),alpha=0.75, rot=0);
p.set_xlabel("deficit in the african countries [kcal/persona/day]")
p.set_ylabel("African countries")
plt.title('Lowest surplus in African Countries in 2020' )
# highest surplus europe countries
plt.subplot(grid[0, 1]);
# Dark green highlights surpluses above 900 kcal/person/day.
p1 = caloric_difference_highest_europe.plot(kind='barh', color=(caloric_difference_highest_europe > 900).map({True: 'g', False: 'lightgreen'}),alpha=0.75, rot=0);
p1.set_xlabel("surplus in the european countries [kcal/persona/day]")
p1.set_ylabel("European countries")
plt.title('Highest surplus in European countries in 2020' )
plt.tight_layout()
plt.show()
# By looking at the difference between the country most in need **Zambia** and the country with the highest surplus **Belgium**, we see that Belgium alone has way more extra food than the lacking countries need individually. As such, we think that an interesting analysis to be made is which European countries could solve African hunger on their own. Below we analyse how much Africa needs in total, and how much Belgium alone "can" provide.
# In[124]:
# Total per-person deficit across all African countries below 0 kcal/person/day.
need_in_africa = abs(african_countries_to_help[african_countries_to_help.values < 0].sum())
print("African countries with a deficit need {0:.2f} extra kcal/person/day in total to solve hunger.".format(need_in_africa))
# In[125]:
print("Belgium has {0:.2f} kcal/person/day over their basic needs.".format(caloric_difference_highest_europe['Belgium']))
# These numbers need to be multiplied by the countries' population to allow for direct comparison! We can say however that each person in Belgium could help one person/day in each of the most starving countries in Africa, while still maintaining some extra kcals (around 300 kcal). We proceed to do the analysis on the total caloric need for the starving countries we picked out (less than 0 kcal/person/day).
# In[126]:
# the african countries with negative surplus
deficit_africa = pd.DataFrame(african_countries_to_help[african_countries_to_help.values < 0])
# european countries with highest surplus (per person)
caloric_surplus_europe_high = pd.DataFrame(caloric_difference_europe[2020].sort_values(ascending=False).head(len(deficit_africa.index)).sort_values(ascending=True))
# In[127]:
# Merge per-person deficit with 2020 population. Both columns share the name 2020,
# so pandas suffixes them: '2020_x' (kcal/person/day) and '2020_y' (population).
total_need_africa = deficit_africa.merge(pop_tot_africa[2020], left_index=True, right_index=True)
# kcal/person/day * people * 365 days -> total kcal/year.
total_need_africa['Total deficit (kcal/year)'] = total_need_africa['2020_x'] * total_need_africa['2020_y']*365
total_need_africa = total_need_africa.drop(columns=['2020_x', '2020_y']).sort_values(by='Total deficit (kcal/year)', ascending=False)
total_need_africa
# In[128]:
# Same computation for the top European surpluses, in total kcal/year.
total_extra_europe = caloric_surplus_europe_high.merge(pop_tot_europe[2020], left_index=True, right_index=True)
total_extra_europe['Total surplus (kcal/year)'] = total_extra_europe['2020_x'] * total_extra_europe['2020_y']*365
total_extra_europe = total_extra_europe.drop(columns=['2020_x', '2020_y']).sort_values(by='Total surplus (kcal/year)')
total_extra_europe
# In[129]:
# Side-by-side bar charts of TOTAL (population-weighted) kcal/year: African
# deficits on the left, European surpluses on the right.
grid = plt.GridSpec(1,2);
ax0 = plt.subplot(grid[0, 0]);
total_need_africa.plot(kind='barh', color='r',alpha=0.75, rot=0, ax = ax0 );
plt.xlabel("Deficit in the african countries [kcal/year]")
plt.ylabel("African countries")
plt.title('Highest total deficits in African countries in 2020' )
ax1 = plt.subplot(grid[0, 1]);
total_extra_europe.plot(kind='barh', color='g',alpha=0.75, rot=0, ax = ax1);
plt.xlabel("Surplus in the european countries [kcal/year]")
plt.ylabel("European countries")
plt.title('Highest total surplus in European countries in 2020' )
plt.tight_layout()
plt.show()
# We now see that **France** is actually the one with highest total surplus (of the ones with highest surplus per person). This is due to the fact that France has a much higher population than Belgium does. Following now is the repetition of the previous analysis if France alone could solve the sum of African deficits.
# In[130]:
# Total yearly caloric need of the African deficit countries (absolute value).
total_deficit_sum = total_need_africa['Total deficit (kcal/year)'].sum()
need_in_africa = abs(total_deficit_sum)
print(f"African countries with a deficit need {need_in_africa:.2E} extra kcal/year in total to solve hunger.")
# In[131]:
# France's yearly caloric surplus beyond its population's basic needs.
surplus_france = total_extra_europe.loc['France', 'Total surplus (kcal/year)']
print(f"France has {surplus_france:.2E} kcal/year over their basic needs.")
# From this analysis, we see that the total surplus in France is **over double** of the sum of deficits in African countries. This result shows that the European countries with a high surplus per person are indeed very capable of feeding a large percentage of starving people in Africa.
# The final interesting analysis we want to look into is the actual total surplus in each European country. From that, we choose the countries that have the highest of total surplus (multiply the surplus per person by the country population) and then we choose the ones with the highest values.
# In[132]:
# Total yearly surplus for ALL European countries (not just the top per-person
# ones): per-person surplus * population * 365 -> kcal/year, then keep the top
# len(deficit_africa) countries; the final ascending sort is for barh plotting.
total_europe = pd.DataFrame(caloric_difference_europe[2020]).merge(pop_tot_europe[2020], left_index=True, right_index=True)
total_europe['Total surplus (kcal/year)'] = total_europe['2020_x'] * total_europe['2020_y']*365
total_europe = total_europe.drop(columns=['2020_x', '2020_y']).sort_values(by='Total surplus (kcal/year)', ascending=False).head(len(deficit_africa.index)).sort_values(by='Total surplus (kcal/year)')
total_europe
# In[133]:
# Horizontal bar chart of the European countries with the highest TOTAL surplus.
axis = total_europe.plot(kind='barh', color='g', alpha=0.75, rot=0)
axis.set_xlabel("Surplus in the european countries [kcal/year]")
axis.set_ylabel("European countries")
axis.set_title('Highest total surplus in European countries in 2020')
plt.show()
# Because the **Russian Federation** has such a high population, we see that, even though its surplus/person is not very high compared to other European countries, the total surplus in the full country will actually overrule the other countries' surplus. For the sake of consistency, we print below the actual total surplus of the **Russian Federation** and compare it to the total deficit in Africa.
# In[134]:
# Re-print the African total need for side-by-side comparison with Russia.
print(f"African countries with a deficit need {need_in_africa:.2E} extra kcal/year in total to solve hunger.")
# In[135]:
# Russia's total yearly surplus, looked up by country name.
total_surplus_russia = total_europe.loc['Russian Federation', 'Total surplus (kcal/year)']
print(f"The Russian Federation has {total_surplus_russia:.2E} kcal/year over their basic needs.")
# From this shift in first seeing the total country's surplus, we arrive to the conclusion that the **Russian Federation** alone has a whopping **5 times** bigger surplus than what is needed to solve hunger in the African countries with a deficit.
# With this analysis, we're able to see that Europe is very likely able to solve the hunger issue in Africa by giving up a **small fraction** of the extra calories that are not currently being put to use.
# ### A summing up to now
#
# The results we came up with are satisfactory as they **accurately reproduce what was expected**. They also allowed for interesting analysis. We decided to conduct an analysis on Europe too as it's interesting to see how much spare food would be available for redistribution, as European countries are more likely to be able to provide help. Furthermore, it was decided to restrict further analysis to 2020 as the interest is in the contemporary situation. **Nine African countries** were found to show an overall deficit in supply, meaning unlike other African countries, they are not capable of solving their issues themselves.
# ## Next steps until Milestone 3
#
# 1. **Split the nutrition supply into macronutrients**. In order to do so we will use the FAO food supply dataset, this time by choosing the protein and fat supply (the carbohydrates will be derived by using _total_carbs (computed based on kcal) - (fat + protein)_). Then we will compute the needed amounts of these nutrients with the use of a simple approximation on the already computed kcal (55% carbs / 25% proteins / 20% fat) and we will compute the difference between the two. In this sense, the analysis will be **fairly similar** to what we already did.
# 2. After achieving this information, we can compute what are the **best final products that can represent African needs** over all the three macronutrients (and the total kcals needed). In order to do so, we will produce a ranking of products that best match each of the three drivers. Once we have this, we can sample a bunch of products and decide to use them to produce a _"simple diet"_ for African countries.
# 3. We then will decide **how Europe should send products to Africa**. As a first approximation, we will aggregate African countries and the same will be done for European ones. In this sense, the Europe as a whole won't be any problem to send products to Africa. We will need to consider the amount of food supply in the whole europe using another time the FAO dataset.
# 4. Having decided which products we should send from Europe and attested that we can send them, we'll find **which countries should receive which food items and in what quantities** (considering their surplus and deficit).
# - At this point, we will decide if proceeding with the simple approximation up to the end or not. Our plan now start to move from seeing the European countries as a whole (first approximation) to a second, more in-depth step were we will decide **how to redistribute the aid sending** among all the considered European countries. This step will be done by considering the total supply available in each country in absolute value and then scale it taking into account the population. This will allow to define **a share** for the products.
# 5. **Final visualization steps and data story**
# - Improve the **visualization methods** already used and come up with new ones. As example, we would introduce maps changing over time for the fist part of the analysis instead of sticking them to a single year. Also, the animations will be developed in Javascript, allowing the user to enjoy the visualization experience with **real-time manipulations of the data**.
# - In parallel to the visualization improvement, we will develop a data story that **presents the issue we want to solve**, and propose **our way to solve it**. The data story we'll propose will probably follow the main points of the analysis we have done so far about Africa vs Europe supply, and from this baseline we'll move to the "simple diet" introduction.
# # Optimizing the distribution of products from Europe to Africa
# The idea of optimizing the distribution of products from Europe to Africa presupposes the knowledge of the countries that will help and the countries that will be helped. In this context, we have chosen that helper European countries will be **Italy, France, Germany, United Kingdom, Spain**. This decision is justified by different reasons:
# - The total surplus of Europe is 2 orders of magnitude higher than the deficit in Africa. We will take this results and adapt it to the scope of our analysis. More specifically, this simplification will allow to stress the fact that **the problem of African starvation can come to an end by the hand of just few countries and not necessarily the whole world**.
# - During the analysis of food prices we were not able to scrape costs of every food item in every european country. The decision is then to narrow it down.
# - The choice will be tested with a GWP analysis on the richest European countries that should be involved in the scope of helping Africa.
# As regards African countries, our plan is to primarily help countries showing a deficit predicted by our initial model. In addition, we will extend the scope to other African countries that show a weak surplus. As such, we set the threshold of **300kcal/person/day** for the food availability. If a country is below this upper bound it will be helped to reach it. For the sake of completeness, we want to let the reader know that our model is accurate in many ways but still doesn't take into account different phenomena that can alter our values, such as **wealth distribution**, **civil wars**, **climate disasters**, etc.
# ## 1) How much African countries really need to solve the hunger problem?
# Let's have a look at the countries we need to help:
# In[136]:
# Per-person daily gap to the 300 kcal/person/day availability target,
# for every country that will receive help.
af_real_deficit = (300 - african_countries_to_help).abs()
af_real_deficit = af_real_deficit.to_frame()
af_real_deficit.tail()
# Plotting the countries vs the amount of kilocalories they need to receive in order to reach a desired availability of **300kcal/person/day**:
# In[137]:
# Bar chart of the per-person daily kcal gap per African country (darker = larger).
ax = sns.barplot(y="index", x=2020, data=af_real_deficit.reset_index(), palette="Reds")
ax.set_xlabel("Amount of kilocalories needed [kcal/persona/day]")
ax.set_ylabel("African countries")
ax.set_title("Calculated amount of kcal to reach a threshold of 300kcal/persona/day in every african countries [kcal/persona/day]");
# In the plot above we see how the darker red coloured countries like **Zambia**, **Madagascar** and **Chad** are most in need. The situation here is flipped so the reader must pay attention to the context of this plot to fully understand the meaning.
# Now we will scale the problem to its real dimensions,*i.e.* we will multiply these values for the population of each country respecetively and for 365 day/year. The results keep the structure but the unit of values will be **kcal/year**.
# In[138]:
# Scale the per-person daily gap to a yearly country total:
# kcal/person/day * population * 365 -> kcal/year
# ('2020_x' = per-person gap, '2020_y' = population, from the index merge).
af_real_deficit_tot = af_real_deficit.merge(pop_tot_africa[2020], left_index=True, right_index=True)
af_real_deficit_tot['Total deficit (kcal/year)'] = af_real_deficit_tot['2020_x'] * af_real_deficit_tot['2020_y']*365
af_real_deficit_tot = af_real_deficit_tot.drop(columns=['2020_x', '2020_y']).sort_values(by='Total deficit (kcal/year)', ascending=False)
af_real_deficit_tot.head()
# ## 2) GDP Analysis on European countries
# Since we have set the number of countries that will be "the helper", now we want to understand how much they should contribute individually to the cause. In particular, the scope is to define an index by which we will weight the contribution of each European country. According to this [website](https://www.thebalance.com/gdp-per-capita-formula-u-s-compared-to-highest-and-lowest-3305848), "the **GDP per capita is a measure of a country's economic output** that accounts for its number of people. It is a good measurement of a country's standard of living". This looks like exactly what we are looking for. Starting from the GDP per capita, it will be relatively easy to compute an **effective weight for how prosperous a country is**. The goal here is to **determine a measure for how much countries should give up of their surplus** and the idea is to model the optimization problem in order to **penalize countries with highest GDP that should indeed contribute the most**.
# We will show initally all the countries but then narrow it down as mentioned before.
# We load the dataset from the FAO on **Macro Statistics and Key Indicators**:
# In[139]:
# Load the FAO "Macro Statistics and Key Indicators" dataset for Europe (GDP source).
gdp_europe = pd.read_csv("data/raw/Macro-Statistics_Key_Indicators_E_Europe.csv")
gdp_europe.head(2)
# We are interested in the **Gross Domestic Product in US Dollars**. We now filter down for Element Code and Item Code. In particular **Item Code** equal to **22008** corresponds to "Gross Domestic Product" and **Element Code** equal to **6110** corresponds to "Value US$"
# In[140]:
#filtering element code and item code:
# Item Code 22008 = "Gross Domestic Product"; Element Code 6110 = "Value US$"
gdp_europe = gdp_europe[(gdp_europe["Element Code"]==6110) & (gdp_europe["Item Code"]==22008)]
# A short note on the dataset: the most recent observation is for 2017 and in our analysis many times we predicted the data up to 2020 if not available. In this case, we decide not to do it since we are not interested in the exact value of GDP, but rather in an index. Therefore, we accept the error as it will be negligible.
# We clean now a little bit our dataframe in order to delete information we don't need:
# In[141]:
# Reduce the GDP frame to one value per country (latest available year: 2017).
#Delete all the columns but Area and Y2017, set the index as Area (name of the country)
gdp_europe = gdp_europe[["Area","Y2017"]].set_index("Area")
#Rename column
gdp_europe = gdp_europe.rename(columns={"Y2017": "GDP [millions US$]"})
#Drop Malta as it's not in our scope
gdp_europe = gdp_europe.drop("Malta")
#Filter down to the countries in our main list european_country_kv
gdp_europe = gdp_europe[gdp_europe.index.isin(european_country_kv.names)]
gdp_europe.head()
# Let's visualize quickly this result:
# In[142]:
# Horizontal bar chart of total GDP per country, log-scaled because only the
# order of magnitude matters here, not the exact value.
gdp_sorted = gdp_europe.sort_values(by="GDP [millions US$]", ascending=True).reset_index()
p = sns.barplot(y="Area", x="GDP [millions US$]", data=gdp_sorted, palette="Greens")
p.set(xscale="log")
plt.xlabel("Log of GWP [millions US$]")
plt.ylabel("European countries")
plt.title("GWP for European countries [millions US$]");
# As we can see **Germany, United Kingdom, France, Italy and Spain are the countries with highest GDP in Europe**. This justifies perfectly our design choice of narrowing down the helper countries to these five. Since we are not taking into account of population, Russian Federation shows up in the list of richest countries. Let's analyse now what happens if we divide by population. We expect that Switzerland and Luxembourg will be at the top of the list while Russian Federation will lose positions.
# In[143]:
#create dataframe gdp_capita
gdp_capita = pd.DataFrame(index=gdp_europe.index)
#retrieve population data
pop_tot_2020 = pop_tot_europe[[2020]]
#dividing gdp by population
# NOTE(review): the division uses .values, so rows are matched by POSITION,
# not by country name — assumes gdp_europe and pop_tot_2020 share the same
# country ordering; verify upstream. The *1000 factor presumably converts
# (millions US$) / (population in thousands) to kUS$/capita — TODO confirm
# the population dataset's unit.
gdp_capita["GDP kUS$/capita"] = gdp_europe.values/pop_tot_2020.values*1000
gdp_capita.head()
# Plotting the results:
# In[144]:
# Horizontal bar chart of GDP per capita, sorted ascending.
capita_sorted = gdp_capita.sort_values(by="GDP kUS$/capita", ascending=True).reset_index()
p = sns.barplot(y="Area", x="GDP kUS$/capita", data=capita_sorted, palette="Greens")
plt.xlabel("GWP per capita [kUS$/capita]")
plt.ylabel("European countries")
plt.title("GWP per capita for European countries [kUS$/capita]");
# Our expectations are met. Switzerland and Luxembourg are the richest countries when considering the well-being of every person. From now on we will consider **the GWP per capita as the weight indicating to what extent countries should contribute to the African cause**.
# ## 3) Modelling the optmization problem
# In this chapter we will introduce **a method to redistribute food among African countries** that are in need. The idea behind the implementation is to **minimize the amount of food that has to be delivered** by our five best european contries. While minimizing, we will have to respect the demand constraint as well as the constraint on food availability of each European country. It is very important to model our problem in a way that richest countries with an higher food availability will have to give up more and viceversa. It is now way more clear where the analysis on GWP will come in handy.
# The objective function to be minimized is a **quadratic non-negative weighted sum of food [kcal/year]**. More spefically, the weights we designed take into account both of GWP of a country and the food availability [kcal/year]. The modelling choice is justified by the fact that a rich country with a large surplus should contribute more than a relatively poor country with less possibilities.
# The problem we want to model is a **Quadratic Program with Linear Constraints** a.k.a **QP**. The choice of a quadratic objective function is due the fact that we will need to evenely distribute the resources and this means that weights will have to increase quadratically with the amount of food given away. It makes sense to say that the more a country gives away of its surplus, the less the same country should contribute further if other countries didn't countribute at all yet.
# In particular:
# - $Y$ is a matrix in $R^{mxn}$ in which $m$ is the number of European countries and $n$ is the number of African countries. Each entry $y_{ij}$ of the matrix $Y\in{R^{mxn}}$ is the amount [kcal/year] of kilocalories that the European country $i\in{1,...,m}$ will have to send to the African country $j\in{1,...,n}$
# - The weights will be in the interval $[-1,1]$. Initally we normalize the GWP per capita of every european country. Afterwards, we will calculate the gwp_inverse = (1-gwp_normalized). The final weight is computed by multiplying the food surplus for gwp_inverse. The number obtained will be the measure by which the logic "The richest gives more" is respected. We will name the weights $w\in{R^{mx1}}$.
# - The objective function is $\sum_{i=1}^{m}\sum_{j=1}^{n}{w_jy_{ij}^2 + w_jy_{ij}}$
# - The constraints can be considered a restriction on the value that our decision variable will assume. By restricting the feasibile set we will impose the following limits:
# - non-negativity of food [kcal/year] sent
# - supply and demand must be met according to European country surplus and African countries deficit.
# - even distribution of resources
#
# The first step is then to normalize the GWP in order to scale them in the interval $[-1,1]$:
# In[145]:
# Min-max scale GDP per capita into [0, 1]; these scaled values become the
# weights of the optimization problem.
scaler = MinMaxScaler()
scaler.fit(gdp_capita.values.reshape(-1,1)) #Min Max of sklearn used
gdp_europe_capita_scaled = pd.DataFrame(index=gdp_capita.index) #create new dataframe
# np.round replaces np.round_ (deprecated alias, removed in NumPy 2.0);
# round to 6 decimals for clean numbers
gdp_europe_capita_scaled["GDP/capita scaled"] = np.round(scaler.transform(gdp_capita.values.reshape(-1,1)),6)
gdp_europe_capita_scaled.head()
# In[146]:
# Interactive bar chart (plotly express) of the scaled GDP/capita per country.
fig = px.bar(gdp_europe_capita_scaled.reset_index(), x='Area', y='GDP/capita scaled')
fig.show()
# We narrow down to the countries shown below in `best_countries`:
# In[147]:
# The five donor ("helper") countries selected for the optimization.
best_countries = ["Italy","France","Spain","United Kingdom","Germany"]
# Let's see the scaled GDPs of our best countries:
# In[148]:
gdp_europe_capita_scaled.loc[best_countries]
# In the code below, we present how the bubble plot used for helper country selection was generated.
# In[ ]:
# Bubble plot used to select the donor countries: x = GDP per capita,
# y = per-person caloric surplus, bubble size/color = total yearly surplus.
maximum = max(total_cal_need_europe[2020])
# bubble sizes as a percentage of the largest total surplus
sizes = [x/maximum*100 for x in total_cal_need_europe[2020]]
fig = go.Figure()
fig.add_trace(go.Scatter(
    # fixed: the column created earlier is "GDP kUS$/capita",
    # not "GDP 100kUS$/capita" (KeyError otherwise)
    x=gdp_capita["GDP kUS$/capita"],
    y=caloric_difference_europe[2020],
    mode="markers",
    marker=dict(size=sizes,
                color=total_cal_need_europe[2020],
                colorbar=dict(title="Total caloric surplus [kcal/year]")
                ),
    text=total_cal_need_europe.index,
    name="",
    # fixed: hovertemplate must be a single string (the bare multi-line literal
    # was a syntax error); plotly uses <br> for line breaks in hover text
    hovertemplate="%{text}<br>"
                  "GDP per capita: %{x:.2f}e+3 US$<br>"
                  "Surplus per capita: %{y:.2f} kcal/person/day<br>"
                  "Total surplus: %{marker.color:.2e} kcal/year"))
fig.update_layout(
    title = dict(text="Viability of European donor nations", font=dict(size=20)),
    plot_bgcolor='rgb(256,256,256)',
    xaxis=dict(type="log", gridcolor="#eee", title="GDP per capita [thousands US$]"),
    yaxis=dict(gridcolor="#eee", title="Caloric surplus [kcal/person/day]"),
)
# write the figure to the website assets directory
py.plot(fig, filename='docs/_includes/gdp_surplus.html', include_plotlyjs=False)
# In[287]:
IFrame(src='https://manuleo.github.io/mADAm_files/gdp_surplus.html', width = 1000, height=600)
# **Note that the design choice of selecting the 5 best European countries is to a large extent justified**. We can observe that the countries represented in the top right hand side of the plot are the ones with the higher Surplus and GDP. Moreover, we see that Italy, UK, Spain, Germany and France are actually the countries that have the most in terms of total surplus (considering the population). Russia will not be considered because its position on the plot suggests that even though its total surplus is high, its GDP is relatively low compared to other countries.
# Create auxiliary dataframes to store the coefficients:
# - `europe_plus_val` and `europe_plus_index` will store respectively the european surplus and the corresponding names of the countries
# - `africa_deficit_val` and `africa_deficit_indx` will store respectively the african defict and the corresponding names of the countries
# - `gdp_europe_val` and `gdp_europe_index` will store respectively the gwp normalized of european countries and the corresponding names of the countries
# In[150]:
# Auxiliary arrays for the QP solver: numeric values and matching country labels.
europe_plus_val = total_europe.loc[best_countries].astype(float).values  # European surpluses [kcal/year]
europe_plus_index = total_europe.loc[best_countries].index  # donor country names
africa_deficit_val= np.abs(af_real_deficit_tot.values).astype(float)  # African deficits [kcal/year], non-negative
africa_deficit_index = af_real_deficit_tot.index  # recipient country names
gdp_europe_val = gdp_europe_capita_scaled.loc[best_countries].astype(float).values  # normalized GDP/capita
gdp_europe_index = gdp_europe_capita_scaled.loc[best_countries].index  # donor country names (same order)
# According to the [documentation](https://picos-api.gitlab.io/picos/tutorial.html) of the library **PICOS** we adopted to solve the problem, **QUADRATIC PROGRAMS** can't be implemented yet. As a consequence, we will model the problem with a matlab function **QPSolverEuropeAfrica** that is located on the repository. It is possibile to run matlab functions on jupyter notebooks by importing **matlab.engine**. The function is then called and executed by a temporary matlab kernel created in the jupyter session.
# In Matlab, it is possibile to use YALMIP which is a very useful translator of the problem to different solvers. We have an academic free license of GUROBI which is a quite famous solver and we will go for it.
# In[151]:
# Solve the QP in MATLAB (PICOS cannot model quadratic programs yet):
# cast inputs, start a MATLAB engine, run QPSolverEuropeAfrica, shut it down.
import matlab.engine
euc = europe_plus_val; #european countries surplus matrix size 1xm
afc = africa_deficit_val; #african countries deficit matrix size 1xn
gwp = 1-gdp_europe_val; # inverted weight: 1 - normalized GDP/capita ("the richest gives more")
#Casting to matlab variables
euc = matlab.double(euc.tolist())
afc = matlab.double(afc.tolist())
gwp = matlab.double(gwp.tolist())
#Start matlab engine
eng = matlab.engine.start_matlab()
#Initializing QPSolverEuropeAfrica that will store the outputs of the matlab function
QPSolverEuropeAfrica = eng.QPSolverEuropeAfrica(euc,afc,gwp,nargout=1)
#Quitting the matlab engine
eng.quit()
# In[152]:
#PICKLE
# NOTE(review): this cell loads a previously pickled result BEFORE the cell
# below rebuilds food_opt_distribution_df from the fresh MATLAB output —
# presumably the pickle is a fallback for when MATLAB is unavailable;
# confirm the intended cell order.
#food_opt_distribution_df.to_pickle('data/processed/food_opt_distribution_df.pkl') #create pickle
food_opt_distribution_df = pd.read_pickle('data/processed/food_opt_distribution_df.pkl')
# The optimization was successful. We will store them in the dataframe `food_opt_distribution_df`. Let's have a look at them:
# In[153]:
# Matrix of kcal/year each donor (rows) sends to each recipient (columns),
# rounded to 4 decimals.
food_opt_distribution = np.round(np.asarray(QPSolverEuropeAfrica),4)
food_opt_distribution_df= pd.DataFrame(data=food_opt_distribution,index = europe_plus_index, columns=africa_deficit_index)
food_opt_distribution_df.head()
# Let's compute now the total amount that every European country has to give up to save Africa:
# In[154]:
# Total kcal/year each donor country gives up = row sums of the
# optimal-distribution matrix.
totals_per_donor = food_opt_distribution.sum(axis=1)
eu_countries_giveup = pd.DataFrame({'Total giveup [kcal/year]': totals_per_donor}, index=best_countries)
eu_countries_giveup.head()
# # Finding the right ingredients to provide
# ## 1) A first look inside the possible food
# The following dataframe we obtained has nutritional information about the most varied food products there are. The information we are most interested in for our analysis. per product, are:
# * **Food Group** - A generalized group in which the product is inserted
# * **Food Name** - The name of the product itself
# * **Protein (g)** - The amount of grams of proteins in a 100g serving
# * **Carbohydrates (g)** - The amount of grams of carbohydrates in a 100g serving
# * **Fat (g)** - The amount of grams of fat in a 100g serving
# In[155]:
#load dataset USDA-Food (first sheet): nutrition facts per 100 g serving
usda_foods = pd.read_excel("data/raw/USDA-Food.xlsx", sheet_name=0)
# keep only the columns relevant for the macronutrient analysis
food_properties = pd.DataFrame(usda_foods[['Food Group', 'Food Name', 'Protein (g)', 'Carbohydrates (g)', 'Fat (g)']])
food_properties.head()
# In[156]:
#checking size of the database (number of products)
food_properties.index.size
# There are a lot of possible product inside this database, but many of them could be unsuitable for our purpose. As for the first task, we check the available _'**Food Group'**_ and remove the ones we won't need.
# We plot them for a clear understanding.
# In[157]:
# Horizontal bar chart of how many products each Food Group contains.
ax = food_properties["Food Group"].value_counts().plot(kind="barh")
ax.set_title("Distribution of Food Group inside the USDA database")
ax.set_xlabel("Counts")
ax.set_ylabel("Food Group");
# As expected, **not all most viable products are necessarily healthy. Also, other are not suitable for our purpose** (like, as example *"Baby Food"*). Therefore, we will no longer consider the following items:
# - Sweets
# - Beverages
# - Snacks
# - Baked Products
# - Spices and Herbs
# - Nut and Seed Products
# - Dairy and Egg Products
# - Breakfast Cereals
# - Baby Foods
# - Soups, Sauces, and Gravies
# - Spices and Herbs
# - Nut and Seed Products
# - Fats and Oils
# In[158]:
not_suit = ["Sweets","Snacks","Beverages", "Baked Products","Spices and Herbs", "Nut and Seed Products", \
"Dairy and Egg Products", "Breakfast Cereals", "Baby Foods", "Soups, Sauces, and Gravies", \
"Spices and Herbs", "Nut and Seed Products", "Fats and Oils"]
food_properties = food_properties[~food_properties["Food Group"].isin(not_suit)].reset_index(drop=True)
food_properties.head()
# New check on dimension:
# In[159]:
# Number of products left after dropping the unsuitable food groups.
food_properties.index.size
# By a first look to the head of the dataset we also understand that the `USDA-Food` contains many processed products. As we are interested in *raw* products for our analysis, we filter only this one:
# In[160]:
# Keep only raw (unprocessed) products, identified by "raw" in the food name.
food_properties = food_properties[food_properties["Food Name"].str.contains("raw")]
food_properties.index.size
# We have a set of **1283** products from which we have to choose the possible diet.
# We keep this dataset as it is for now and move to the next part.
# ## 2) Consider the producer prices
# An important feature in our next choice of products will be **the cost of producing the products** inside the European countries. For this reason, we introduce now **two new datasets** and we use them to filter more the `USDA-Food` ones.
# The new datasets are:
# - [EU_prices_for_representative](https://ec.europa.eu/info/food-farming-fisheries/farming/facts-and-figures/markets/prices/price-monitoring-sector/eu-prices-selected-representative-products_en) by the European Commission
# - [FAO Producer prices](http://www.fao.org/faostat/en/#data/PP) (in particular we use the data for cereal and vegetables prices)
#
# We use the two of them because just the one from FAO is not complete enough for our purpose.
# *Note*: FAO data are from 2014 to 2018, while EU_prices will go from 2014 to 2019. We assume these prices to be ok even if they are from a different period because there are no available data online. We simply decide to pick the most recent price for each available product.
# In[161]:
# loading the new datasets
eu_prices = pd.read_csv("data/raw/europe_food_prices.csv", usecols=["Category", "Product desc", "Unit", "Country", "Period", "MP Market Price"])\
.rename(columns={"Product desc": "Product", "MP Market Price":"Price"})
eu_prices.head(1)
# In order to have reasonable prices, we consider only the last five years, from 2014 to october 2019:
# In[162]:
eu_prices = eu_prices[eu_prices.Period.between(201400, 201910)]
# Check the availability inside this dataset:
# In[163]:
# List the distinct products available in the EU dataset to check coverage.
eu_prices.Product.unique()
# There is a good number of representatives from all the categories we have in the `USDA-Food`, but we miss an important source of carbohydrates: rice. Also, there is no meat present in the database.
# For this reason, we go to scrape these information from FAO:
# In[164]:
# FAO producer prices for cereals and meat, 2014-2018 (values in USD/ton)
fao_prices = pd.read_csv("data/raw/fao_cereal_meat_prices_201418.csv", usecols=["Area", "Item", "Value", "Year"])
fao_prices
# *Note*: FAO data are in USD/ton
#
# At this point we need to restrict our product analysis. In the **_Which European countries can help Africa?_** section we defined the countries with the highest surplus in Europe, that are able to help Africa only by giving away a small fraction of their surplus.
# From the previous classification we remove now the *Russian Federation*, because it has a huge surplus and would be able to solve the African problem (and we want a fair share between the richest European countries), together with *Poland, Ukraine* and *Romania*, because they are smaller nations.
# At this point we have build the set of **top 5 countries that will help Africa: France, Italy, United Kingdom, Germany** and **Spain**.
# In[165]:
# defining countries
best_countries = ["France", "Italy", "United Kingdom", "Germany", "Spain"]
best_countries_code = ["FR", "IT", "UK", "DE", "ES"] # codes for the eu_prices
# filtering our datasets
eu_prices = eu_prices[eu_prices.Country.isin(best_countries_code)]
fao_prices = fao_prices[fao_prices.Area.isin(best_countries)]
# Considering only the most recent prices in the FAO cereal dataset
# In[166]:
# For each (country, product) keep only the most recent price observation:
# sort by year descending so groupby().first() picks the latest.
fao_prices = fao_prices.sort_values("Year", ascending=False).groupby(["Area", "Item"])\
.first().reset_index().drop(columns="Year")
# Let's take a look to the products in the FAO **cereal** dataset now
# In[167]:
# Coverage check: number of donor countries with a price for each product.
fao_prices.Item.value_counts().plot(kind="barh")
plt.title("Distribution of products inside the FAO cereal and meat dataset")
plt.xlabel("Number of countries covered")
plt.ylabel("Products");
# Two fundamental problems emerge here: we don't have the price for rice and meat (**cattle**) in all the countries of interest, **we need to impute them**
# We start with the rice.
# In order to do so, what we do now is try to find a correlation between the product we have in **all the countries** (*Wheat* and *Oats*) and the rice.
# In[168]:
# Pair each country's rice price with its other cereal prices (wheat, oats)
# to derive per-country price ratios (rice price / cereal price).
rice_df = fao_prices[fao_prices.Item == "Rice, paddy"]
rice_df = rice_df.merge(fao_prices, on="Area")
cereal_mask = rice_df.Item_y.str.contains("Wheat|Oats")
rice_df = rice_df[cereal_mask]
rice_df["Coeff"] = rice_df.Value_x / rice_df.Value_y
rice_df.sort_values("Item_y")
# In[169]:
# Average rice/cereal price ratio per reference cereal.
oats_rows = rice_df[rice_df["Item_y"] == "Oats"]
wheat_rows = rice_df[rice_df["Item_y"] == "Wheat"]
mean_coeff_oats = oats_rows.Coeff.mean()
mean_coeff_wheat = wheat_rows.Coeff.mean()
print("The mean coefficient for oats is", mean_coeff_oats)
print("The mean coefficient for wheat is", mean_coeff_wheat)
# As the wheat seems to be the product with less fluctuation, we take this one for our imputation.
# In[170]:
# Impute the missing rice prices for Germany and the UK from their wheat
# prices, using the mean rice/wheat ratio computed above.
wheat_germany = fao_prices[(fao_prices.Area == "Germany") & (fao_prices.Item=="Wheat")].Value.values[0]
row_germ = pd.DataFrame(np.array([["Germany", "Rice, paddy", '%.1f'%(mean_coeff_wheat*wheat_germany)]]), columns=["Area", "Item", "Value"])
wheat_uk = fao_prices[(fao_prices.Area == "United Kingdom") & (fao_prices.Item=="Wheat")].Value.values[0]
row_uk = pd.DataFrame(np.array([["United Kingdom", "Rice, paddy", '%.1f'%(mean_coeff_wheat*wheat_uk)]]), columns=["Area", "Item", "Value"])
# pd.concat replaces DataFrame.append (deprecated since pandas 1.4, removed in 2.0)
fao_prices = pd.concat([fao_prices, row_germ, row_uk]).reset_index(drop=True)
# In[171]:
# Sanity check: show barley prices (untouched by the rice imputation).
fao_prices[fao_prices.Item=="Barley"]
# For the meat case, we first look to the countries for which we have data (considering the most general case: cattle, that we consider to be beef)
# In[172]:
# Cattle ("beef") price coverage — Italy is missing and must be imputed below.
fao_prices[fao_prices.Item=="Meat, cattle"]
# We need to impute the price for Italy:
# - For Italy we can look [here](https://www.bordbia.ie/farmers-growers/farmers/prices-markets/eu-world-cattle-prices/?country=Italy). Taking a price average we have a price of 3.42€/kg
#
# We need to multiply this price by **1.10** (actual exchange EUR/USD) and 1000 (to consider tons)
# In[173]:
# Impute Italy's cattle price: 3.42 EUR/kg * 1.10 (EUR->USD) * 1000 (kg->ton).
meat_italy = pd.DataFrame(np.array([["Italy", "Meat, cattle", '%.1f'%(3.42*1.10*1000)]]), columns=["Area", "Item", "Value"])
# pd.concat replaces DataFrame.append (deprecated since pandas 1.4, removed in 2.0)
fao_prices = pd.concat([fao_prices, meat_italy]).reset_index(drop=True)
# For pork, we already have all the data so no need to impute.
# Reprint the plot to see the final product in the FAO dataset:
# In[174]:
# Re-plot the per-product country coverage after the imputations.
ax = fao_prices.Item.value_counts().plot(kind="barh")
ax.set_title("Distribution of products inside the FAO cereal and meat dataset")
ax.set_xlabel("Number of countries covered")
ax.set_ylabel("Products");
# We need to report all in a single unit in `eu_prices`
# In[175]:
eu_prices.Unit.unique()
# As we are working with large amounts of data and FAO has data in USD/ton, we convert everything to USD/ton: multiply by 10 (from per-100kg to per-ton) and by **1.10** (EUR/USD exchange rate)
# In[176]:
# Rows priced per 100 kg: x10 converts to per-ton, x1.10 converts EUR to USD.
eu_prices.loc[eu_prices['Unit'].str.contains('100'), 'Price'] = eu_prices.loc[eu_prices['Unit'].str.contains('100'), 'Price']\
.apply(lambda x: x*1.10*10)
# Unit column no longer needed once all prices share the same unit.
eu_prices = eu_prices.drop(columns="Unit")
# We now assess, as we did for the `fao_prices`, if the prices we have are correctly represented among all our countries.
# First step we do to this end is to keep only the most recent data for each product inside each country.
# In[177]:
eu_prices = eu_prices.sort_values("Period", ascending=False).groupby(["Country", "Category", "Product"])\
.first().reset_index().drop(columns="Period")
# In[178]:
plt.figure(figsize=(20, 15))
eu_prices.Product.value_counts().plot(kind="barh")
plt.title("Distribution of products inside the eu_countries dataset")
plt.xlabel("Number of countries covered")
plt.ylabel("Products");
# In[179]:
# Collapse product variants to a single canonical name per product family.
eu_prices.Product = eu_prices.Product.apply(lambda x: "Apples" if "Apples" in x else x)
eu_prices.Product = eu_prices.Product.apply(lambda x: "Tomatoes" if "Tomatoes" in x else x)
eu_prices.Product = eu_prices.Product.apply(lambda x: "Bread" if "Bread" in x else x)
# In[180]:
# Average the now-duplicated rows per (country, category, product), drop the
# originals, then re-add the aggregated rows.
apples_eu = eu_prices[eu_prices.Product=="Apples"].groupby(["Country", "Category", "Product"]).agg("mean").reset_index()
eu_prices = eu_prices[~(eu_prices["Product"]=="Apples")].reset_index(drop=True)
tomatoes_eu = eu_prices[eu_prices.Product=="Tomatoes"].groupby(["Country", "Category", "Product"]).agg("mean").reset_index()
eu_prices = eu_prices[~(eu_prices["Product"]=="Tomatoes")].reset_index(drop=True)
bread_eu = eu_prices[eu_prices.Product=="Bread"].groupby(["Country", "Category", "Product"]).agg("mean").reset_index()
eu_prices = eu_prices[~(eu_prices["Product"]=="Bread")].reset_index(drop=True)
# DataFrame.append was removed in pandas 2.0 -- use a single pd.concat instead
eu_prices = pd.concat([eu_prices, apples_eu, tomatoes_eu, bread_eu])
# In[181]:
plt.figure(figsize=(20, 15))
eu_prices.Product.value_counts().plot(kind="barh")
plt.title("Distribution of products inside the eu_countries dataset")
plt.xlabel("Number of countries covered")
plt.ylabel("Products");
# **Merging didn't solve the situation.**
# As a consequence, we decide to **drop all the products with not at least 4 counts** and put our *imputing efforts* just on them.
# In[182]:
# keep products present in at least len(best_countries) - 1 countries
counts_eu = eu_prices.Product.value_counts()
possible_products = counts_eu[counts_eu >= len(best_countries) - 1].index
eu_prices = eu_prices[eu_prices.Product.isin(possible_products)]
# In[183]:
# replot to zoom on our interested products
eu_prices.Product.value_counts().plot(kind="barh")
plt.title("Distribution of products inside the eu_countries dataset")
plt.xlabel("Number of countries covered")
plt.ylabel("Products");
# Printing the counts to understand which countries to impute. A country with all products should count **13 products**
# In[184]:
eu_prices.Country.value_counts()
# Only Spain has all the needed products. We need to impute all the others. We therefore proceed to print all the missing points in the other countries:
# In[185]:
# set difference between the kept products and the ones each country covers
need_it = set(possible_products) - set(eu_prices[eu_prices.Country=="IT"].Product.values)
need_de = set(possible_products) - set(eu_prices[eu_prices.Country=="DE"].Product.values)
need_fr = set(possible_products) - set(eu_prices[eu_prices.Country=="FR"].Product.values)
need_uk = set(possible_products) - set(eu_prices[eu_prices.Country=="UK"].Product.values)
# printing
print("Missing product in Italy:", need_it)
print("Missing product in Germany:", need_de)
print("Missing product in France:", need_fr)
print("Missing product in United Kingdom:", need_uk)
# #### Italy
# We need to impute the **Malting Barley**. The actual italian price can be found [here](https://www.clal.it/en/index.php?section=conf_cereali#orzo), where the price (on the day of writing) is 189 EUR/ton. We multiply by 1.10 to have the price in USD.
# In[186]:
barley_italy = pd.DataFrame(np.array([["IT", "Vegetal Products", "Malting Barley", '%.2f'%(189*1.10)]]), columns=["Country", "Category", "Product", "Price"])
# DataFrame.append was removed in pandas 2.0 -- use pd.concat instead
eu_prices = pd.concat([eu_prices, barley_italy]).reset_index(drop=True)
# #### Germany
# We need to impute the **Onions**. We can easily expect that onions won't be part of our diet, so we can drop them for all of the countries.
# In[187]:
eu_prices = eu_prices[~(eu_prices.Product=="Onions")]
# #### France
# We need to impute **Lettuce** and **Butter**. For the second one, we can drop the aliment because it won't be part of the diet for sure. For the Lettuces, we can look at the FAO data directly from the [website](http://www.fao.org/faostat/en/#data/PP): 1296.3 in 2018
# In[188]:
eu_prices = eu_prices[~(eu_prices.Product=="Butter")]
# NOTE(review): the markdown above quotes 1296.3 but the code uses 1269.3 --
# possible digit transposition; confirm the correct FAO 2018 value.
lettuce_france = pd.DataFrame(np.array([["FR", "Vegetable Products", "Lettuces", '%.2f'%(1269.3)]]), columns=["Country", "Category", "Product", "Price"])
eu_prices = pd.concat([eu_prices, lettuce_france]).reset_index(drop=True)
# In[189]:
eu_prices[(eu_prices.Product=="Feed Maize")]
# #### United Kingdom
# We need to impute **Apples**, **Cherries** and **Feed Maize**. For Apples and cherries we can use FAO data: 1343.4 and 4574.9 respectively. The price of Maize can be found [here](https://ahdb.org.uk/cereals-oilseeds-markets), 195\$ per ton.
# In[190]:
apples_uk = pd.DataFrame(np.array([["UK", "Fruit Products", "Apples", '%.2f'%(1343.4)]]), columns=["Country", "Category", "Product", "Price"])
cherries_uk = pd.DataFrame(np.array([["UK", "Fruit Products", "Cherries", '%.2f'%(4574.9)]]), columns=["Country", "Category", "Product", "Price"])
maize_uk = pd.DataFrame(np.array([["UK", "Vegetal Products", "Feed Maize", '%.2f'%(195)]]), columns=["Country", "Category", "Product", "Price"])
eu_prices = pd.concat([eu_prices, apples_uk, cherries_uk, maize_uk]).reset_index(drop=True)
# Check if everything went right:
# In[191]:
# replot to zoom on our interested products
eu_prices.Product.value_counts().plot(kind="barh")
plt.title("Distribution of products inside the eu_countries dataset")
plt.xlabel("Number of countries covered")
plt.ylabel("Products");
# As we can see, the **Vegetables** category is not well represented by our actual dataset. For this reason, we consider **FAO data for vegetables** in order to fill the gaps:
# In[192]:
fao_vegetables_prc = pd.read_csv("data/raw/fao_vegetables_prices_201418.csv", usecols=["Area", "Item", "Value", "Year"])
fao_vegetables_prc = fao_vegetables_prc[fao_vegetables_prc.Area.isin(best_countries)]
# Keep only the most recent year for each (country, product) pair.
fao_vegetables_prc = fao_vegetables_prc.sort_values("Year", ascending=False).groupby(["Area", "Item"])\
.first().reset_index().drop(columns="Year")
# plotting
fao_vegetables_prc.Item.value_counts().plot(kind="barh")
plt.title("Distribution of products inside the FAO vegetables dataset")
plt.xlabel("Number of countries covered")
plt.ylabel("Products");
# As we did before, we take only the products present in at least **4 countries** for the next analysis:
# In[193]:
counts_fao_veg = fao_vegetables_prc.Item.value_counts()
possible_products_veg = counts_fao_veg[counts_fao_veg >= len(best_countries) - 1].index
fao_vegetables_prc = fao_vegetables_prc[fao_vegetables_prc.Item.isin(possible_products_veg)]
# In[194]:
# replot to zoom on our interested products
fao_vegetables_prc.Item.value_counts().plot(kind="barh")
plt.title("Distribution of products inside the FAO vegetables dataset")
plt.xlabel("Number of countries covered")
plt.ylabel("Products");
# We can also drop **Carrots** and **Lettuce** because they are already present in the precedent dataset:
# In[195]:
fao_vegetables_prc = fao_vegetables_prc[~(fao_vegetables_prc.Item.isin(["Carrots and turnips", "Lettuce and chicory"]))]
possible_products_veg = possible_products_veg[~(possible_products_veg.isin(["Carrots and turnips", "Lettuce and chicory"]))]
# In[196]:
# replot to zoom on our interested products
fao_vegetables_prc.Item.value_counts().plot(kind="barh")
plt.title("Distribution of products inside the FAO vegetables dataset")
plt.xlabel("Number of countries covered")
plt.ylabel("Products");
# Let's see how these products are represented:
# In[197]:
fao_vegetables_prc.Area.value_counts()
# The only country in which products are missing is **Italy**. We need to impute manually the prices of these products:
# In[198]:
# set difference between the products
need_it = set(possible_products_veg) - set(fao_vegetables_prc[fao_vegetables_prc.Area=="Italy"].Item.values)
# printing
print("Missing product in Italy:", need_it)
# ### Imputing Italy
# The price for the **cucumbers** and the **cauliflowers** can be found in the `eu_prices` dataset
# In[199]:
# Reload the raw European prices file (the working copy was filtered above).
eu_prices2 = pd.read_csv("data/raw/europe_food_prices.csv", usecols=["Category", "Product desc", "Unit", "Country", "Period", "MP Market Price"])\
.rename(columns={"Product desc": "Product", "MP Market Price":"Price"})
eu_prices2 = eu_prices2[eu_prices2.Country.isin(best_countries_code)]
# Restrict to 2014-2019 and keep the latest record per (country, product).
eu_prices2 = eu_prices2[eu_prices2.Period.between(201400, 201910)]
eu_prices2 = eu_prices2.sort_values("Period", ascending=False).groupby(["Country", "Category", "Product"])\
.first().reset_index().drop(columns="Period")
# In[200]:
eu_prices2[eu_prices2.Product=="Cucumbers"]
# In[201]:
eu_prices2[eu_prices2.Product=="Cauliflowers"]
# In[202]:
#prices are in Eur/100 kg, we need usd/tonnes: x1.10 (EUR->USD), x10 (100kg->ton)
cucumbers = pd.DataFrame(np.array([["Italy", "Cucumbers and gherkins", 68.00*1.10*10]]), columns=["Area", "Item", "Value"])
cauliflowers = pd.DataFrame(np.array([["Italy", "Cauliflowers and broccoli", 62.71*1.10*10]]), columns=["Area", "Item", "Value"])
# DataFrame.append was removed in pandas 2.0 -- use pd.concat instead
fao_vegetables_prc = pd.concat([fao_vegetables_prc, cucumbers, cauliflowers]).reset_index(drop=True)
# For the **peas** and the **beans** of 2017 in Italy can be found [here](http://www.codima.info/trunk/nor_file_107_decreto-ministeriale-n.-0031908-del-29-dicembre-2016-prezzi_parte_1-.pdf) (page in Italian).
# The price is **120.33€/100kg** for the peas and **175.56€/100kg** for the beans.
# In[203]:
#prices are in Eur/100 kg, we need usd/tonnes
peas = pd.DataFrame(np.array([["Italy", "Peas, green", 120.33*1.10*10]]), columns=["Area", "Item", "Value"])
beans = pd.DataFrame(np.array([["Italy", "Beans, green", 175.56*1.10*10]]), columns=["Area", "Item", "Value"])
fao_vegetables_prc = pd.concat([fao_vegetables_prc, peas, beans]).reset_index(drop=True)
# Unfortunately, we didn't find any reliable price for **Cabbages and mushrooms**, so we drop them:
# In[204]:
fao_vegetables_prc = fao_vegetables_prc[~(fao_vegetables_prc.Item.isin(["Cabbages and other brassicas", "Mushrooms and truffles"]))]
# Final check on these products
# In[205]:
fao_vegetables_prc.Item.value_counts().plot(kind="barh")
plt.title("Distribution of products inside the FAO vegetable dataset")
plt.xlabel("Number of countries covered")
plt.ylabel("Products");
# We also now rename the codes in `eu_prices` to proper names of nations before merging.
# In[206]:
dict_country = dict(zip(best_countries_code, best_countries))
eu_prices = eu_prices.replace({"Country":dict_country})
# Final step is combining the three dataframe together. We modify the needed products from the FAO datasets and adapt it to `eu_prices`
# In[207]:
# keep only FAO items covered by every country
counts_fao = fao_prices.Item.value_counts()
fao_products = counts_fao[counts_fao == len(best_countries)].index
fao_prices = fao_prices[fao_prices.Item.isin(fao_products)]
# In[208]:
# dropping the no longer needed category column from eu
eu_prices.drop(columns="Category", inplace=True)
# In[209]:
# Align column names of both FAO frames to the eu_prices schema.
fao_prices = fao_prices.replace({"Item":{"Rice, paddy": "Rice", "Meat, cattle": "Meat"}})
fao_prices = fao_prices[["Area", "Item", "Value"]]
fao_prices = fao_prices.rename(columns={"Item": "Product", "Area":"Country", "Value":"Price"})
fao_vegetables_prc = fao_vegetables_prc.replace({"Item":{"Cauliflowers and broccoli": "Cauliflowers", "Peas, green": "Peas", "Beans, green": "Beans", "Cucumbers and gherkins": "Cucumbers"}})
fao_vegetables_prc = fao_vegetables_prc[["Area", "Item", "Value"]]
fao_vegetables_prc = fao_vegetables_prc.rename(columns={"Item": "Product", "Area":"Country", "Value":"Price"})
# DataFrame.append was removed in pandas 2.0 -- concatenate the three sources
prices = pd.concat([eu_prices, fao_prices, fao_vegetables_prc]).sort_values(by="Product").reset_index(drop=True)
prices.head()
# In[210]:
print("The final products present in the dataset are: {}".format(prices.Product.unique()))
# We have finally build our **final prices dataset!**
# ## 3) Filtering the food dataset
# Now that we know the set of products between we can choose, it's time to filter the `USDA_food` accordingly
# In[211]:
# defining the set of foods between we can choose
foods = prices.Product.unique()
# In[212]:
# Rows whose name contains any of our product names (regex alternation).
food_properties[food_properties["Food Name"].str.contains('|'.join(foods), case=False)]
# 269 rows, but there are multiple matches. Let's look to the elements that don't have a match:
# In[213]:
# find the set of total foods and set to lower
total_foods = food_properties["Food Name"].unique()
total_foods = [t.lower() for t in total_foods]
# consider also foods to lower to check the match
foods = [f.lower()for f in foods]
# In[214]:
# find set of not matching
not_matching = [f for f in foods if not any(f in t for t in total_foods)]
len(not_matching)
# In[215]:
not_matching
# We can try to divide the multiple words, also remove duplicates after that
# In[216]:
# Tokenize every product name: strip punctuation, split on whitespace, dedupe.
foods = list({token
              for name in foods
              for token in name.replace(",", "").replace("(", "").replace(")", "").split()})
# Second step is removing all generated stopwords and add singulars
# In[217]:
stop_words = stopwords.words('english')
stop_words += ["feed", "raw"]
foods = [token for token in foods if token not in stop_words]
lem = WordNetLemmatizer()
# Add the lemmatized (singular) form of every token, dedupe, then filter
# stopwords once more (lemmatization can reintroduce them).
foods = list(set(foods + [lem.lemmatize(token) for token in foods]))
foods = [token for token in foods if token not in stop_words]
# In[218]:
# printing the set of food
foods
# It's possible to note that _"meat"_ is too generic for our purpose, as we considered cow or beef in analyzing the prices. For this reason, we remove meat and add "cow" and "beef"
# In[219]:
foods = [token for token in foods if token != "meat"] + ["beef", "cow"]
# New matching check:
# In[220]:
food_properties[food_properties["Food Name"].str.contains('|'.join(foods), case=False)]
# Check of our residue groups:
# In[222]:
poss_diet["Food Group"].unique()
# Fish products seems a bit strange for what we analyzed so far, check on it:
# In[223]:
poss_diet[poss_diet["Food Group"] == "Finfish and Shellfish Products"]
# Finfish and Shellfish Products is not really represented by our prices database, so we drop the category
# In[224]:
poss_diet = poss_diet[~poss_diet["Food Group"].isin(["Finfish and Shellfish Products"])].reset_index(drop=True)
# Let's now adjust the units. The grams of proteins in 100g serve **[grams of proteins/100g]** times the kilocalories in 1 gram of proteins **[kcal/g of proteins]** will result in the kilocalories obtained from proteins for 100g serve **[kcal/100g]**. The same argument can be extended to carbohydrates and fat.
# According to the [National Agriculture Library](https://www.nal.usda.gov/fnic/how-many-calories-are-one-gram-fat-carbohydrate-or-protein):
# - 1 gram of protein corresponds to **4 kcal**
# - 1 gram of carbohydrates corresponds to **4 kcal**
# - 1 gram of fat corresponds to **9 kcal**
#
# According to what has been said above, we will multiply the values and obtain a new dataframe `diet_kcal` in which values have units of **[kcal/100g]**
# In[225]:
#multiply values with vector of kcal/g, merging to keep Food Group / Food Name
kcal_g = np.array([4,4,9])
# NOTE: relies on the macro columns appearing as (protein, carbs, fat) from
# column index 2 onward.
diet_kcal = poss_diet[["Food Group", "Food Name"]].merge(poss_diet[poss_diet.columns[2:]].multiply(kcal_g),left_index=True, right_index=True)
diet_kcal.rename(columns={'Protein (g)':'Protein (kcal/100g)',
'Carbohydrates (g)':'Carbohydrates (kcal/100g)',
'Fat (g)':'Fat (kcal/100g)'},
inplace=True)
diet_kcal.head()
# The next point we are going to do now is trying to give a rank to each product taking into account the necessity that each person, in their diet, should have:
# * 55% from proteins
# * 25% from carbohydrates
# * 20% from fats.
# (Data adapted from [here](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1479724/))
# In order to do so, we apply a greedy ranking based on how well a particular food represents these shares.
# In[226]:
def rank_food(food):
    """Score how far a food's macro profile is from the target diet shares.

    Targets: 55% protein, 25% carbohydrates, 20% fat. For each macronutrient
    we compute the absolute deviation between its kcal density and the target
    share of the row's total (scaled by the kcal-per-gram divisor used in the
    notebook), divided by 100; the score is the mean of the three deviations.
    Lower is better. Rows with no macronutrient information return -1.
    """
    macros = (food['Protein (kcal/100g)'],
              food['Carbohydrates (kcal/100g)'],
              food['Fat (kcal/100g)'])
    # A row with no macro information cannot be ranked.
    if macros[0] == 0 and macros[1] == 0 and macros[2] == 0:
        return -1
    total = macros[0] + macros[1] + macros[2]
    # (target share, kcal-per-gram divisor) for protein, carbs, fat
    targets = ((0.55, 4), (0.25, 4), (0.20, 9))
    deviations = [abs(total * share / div - m) / 100
                  for (share, div), m in zip(targets, macros)]
    return sum(deviations) / len(deviations)
# In[227]:
# Score every food; lower rank = closer to the target macro shares.
diet_kcal['rank'] = diet_kcal.apply(rank_food, axis=1)
# In the next step we take the best **3** out of each group (because there may be inconsistencies in the results due to the filtering process).
# From these, we then choose our **representative product for each group**. We will use this to compute the final diet in the next and final step.
# In[228]:
diet_kcal.groupby(["Food Group"]).apply(lambda x: x.sort_values(['rank'])\
.reset_index(drop=True).groupby("Food Group").head(3)).reset_index(drop=True)
# Let's further analyze these results.
# We can start from the category that is less represented (2 results instead of 3): **Lamb, Veal, and Game Products**
# In[229]:
diet_kcal[diet_kcal["Food Group"] == "Lamb, Veal, and Game Products"]
# As we can see, this category doesn't really have a representation. The residue results are here just as consequence of the filtering process (using `str.contains` caused this).
# **For this reason, we drop it**, then we increase the range of possible products from 3 to **5** and reprint.
# In[230]:
diet_kcal = diet_kcal[~diet_kcal["Food Group"].isin(["Lamb, Veal, and Game Products"])].reset_index(drop=True)
# In[231]:
diet_kcal.groupby(["Food Group"]).apply(lambda x: x.sort_values(['rank'])\
.reset_index(drop=True).groupby("Food Group").head(5)).reset_index(drop=True)
# Now the shape of our final diet is **way more clear**:
# - From the _**Beef Products**_ category we have a great variety of meats. We can take **the median over the first 10 rows** to have our representative beef nutrients.
# - From the _**Cereal Grains and Pasta**_ group, a good product would be one of **rice, barley** or **oat**. We keep all of them for now. Further analysis will be in the next section.
# - From the _**Fruits and Fruit Juices**_ category, we take the **Strawberries** and **Apples** (not rose-apples as shown here but apples, as there are some in the dataframe). The other products are probably shown due to some inconsistence with the food items presented before.
# - From **_Poultry Products_**, we take **Chicken, roasting, light meat, meat only, raw** as it's the most simple one.
# - For the _**Legumes and Legume Products**_ and _**Vegetables and Vegetable Products**_, we need further analysis (because these two categories can be very useful for our analysis):
# - From _**Legumes and Legume Products**_ we go for sure for **Beans**, but the yogurt there presented is an inconsistency. Therefore, we need further explorations.
# - From _**Vegetables and Vegetable Products**_ we could take **Lettuce, green leaf, raw** and **cucumber**. We should analyze if there's the possibility to take other products.
#
# We can now build the **final dataframe** with nutrient for each product.
# In[232]:
prod_diet_final = pd.DataFrame(columns=["Product", "Protein (kcal/100g)", "Carbohydrates (kcal/100g)","Fat (kcal/100g)"])
# #### Beef Products
# In[233]:
# taking the median over the first 10 beef products as the representative
best_beef = diet_kcal[diet_kcal["Food Group"]=="Beef Products"].sort_values(by="rank").head(10)
beef_repr = [best_beef["Protein (kcal/100g)"].median(),\
best_beef["Carbohydrates (kcal/100g)"].median(),\
best_beef["Fat (kcal/100g)"].median()]
beef_list = ["Beef Meat"] + beef_repr
beef = pd.DataFrame(np.array([beef_list]), columns=["Product", "Protein (kcal/100g)", "Carbohydrates (kcal/100g)","Fat (kcal/100g)"])
# DataFrame.append was removed in pandas 2.0 -- use pd.concat instead
prod_diet_final = pd.concat([prod_diet_final, beef])
# #### Cereal Grains and Pasta
# In[234]:
# taking the top 3 in cereals (rice, barley, oat)
best_cereal = diet_kcal[diet_kcal["Food Group"]=="Cereal Grains and Pasta"].sort_values(by="rank").head(3)
best_cereal = best_cereal.drop(columns=["Food Group", "rank"]).rename(columns={"Food Name":"Product"})
best_cereal = best_cereal.replace({"Product":{"Wild rice, raw": "Rice", "Barley, pearled, raw": "Barley", "Oat bran, raw": "Oats"}})
# DataFrame.append was removed in pandas 2.0 -- use pd.concat throughout
prod_diet_final = pd.concat([prod_diet_final, best_cereal])
# #### Fruits and Fruit Juices
# By further analysis, we discovered that it's possible to add also **cherries** to the possible diet. So we keep those too.
# In[235]:
# taking apples, cherries and strawberries
fruits = diet_kcal[(diet_kcal["Food Name"]=="Apples, raw, with skin") | (diet_kcal["Food Name"]=="Strawberries, raw") | (diet_kcal["Food Name"]=="Cherries, sour, red, raw")]
fruits = fruits.drop(columns=["Food Group", "rank"]).rename(columns={"Food Name":"Product"})
fruits = fruits.replace({"Product":{"Apples, raw, with skin": "Apples", "Cherries, sour, red, raw": "Cherries", "Strawberries, raw": "Strawberries"}})
prod_diet_final = pd.concat([prod_diet_final, fruits])
# #### Poultry Products
# In[236]:
# just take Chicken, roasting, light meat, meat only, raw
poultry = diet_kcal[diet_kcal["Food Name"]=="Chicken, roasting, light meat, meat only, raw"]
poultry = poultry.drop(columns=["Food Group", "rank"]).rename(columns={"Food Name":"Product"})
poultry = poultry.replace({"Product":{"Chicken, roasting, light meat, meat only, raw": "Chicken"}})
prod_diet_final = pd.concat([prod_diet_final, poultry])
# #### Legumes and Legume Products
# In[237]:
check_leg = diet_kcal[diet_kcal["Food Group"]=="Legumes and Legume Products"]
check_leg[check_leg["Food Name"].str.contains("bean", case=False)].head(5)
# In[238]:
check_leg[check_leg["Food Name"].str.contains("pea", case=False)].head(5)
# As we can see, we have both the product we mentioned before. As **only 2 types of legumes** were chosen, because their **nutriments provision is usually high**, we decide to keep both of them for our next analysis.
# In[239]:
# Taking one for beans and one for peas
beans = diet_kcal[diet_kcal["Food Name"]=="Beans, adzuki, mature seeds, raw"]
beans = beans.drop(columns=["Food Group", "rank"]).rename(columns={"Food Name":"Product"})
beans = beans.replace({"Product":{"Beans, adzuki, mature seeds, raw": "Beans"}})
peas = diet_kcal[diet_kcal["Food Name"]=="Peas, green, split, mature seeds, raw"]
peas = peas.drop(columns=["Food Group", "rank"]).rename(columns={"Food Name":"Product"})
peas = peas.replace({"Product":{"Peas, green, split, mature seeds, raw": "Peas"}})
prod_diet_final = pd.concat([prod_diet_final, beans, peas])
# #### Vegetables and Vegetable Products
# In[240]:
diet_kcal[diet_kcal["Food Group"]=="Vegetables and Vegetable Products"].sort_values(by="rank").head(10)
# As we can see, the list of most viable **10 vegetables** includes **Lettuce, cucumber and tomatoes** (different kinds for the latter).
# We decide to take the three (we search for the red one in the case of tomatoes, as it is the most common).
# In[241]:
# take Lettuce, green leaf, raw and Cucumber
lettuce = diet_kcal[diet_kcal["Food Name"]=="Lettuce, green leaf, raw"]
lettuce = lettuce.drop(columns=["Food Group", "rank"]).rename(columns={"Food Name":"Product"})
lettuce = lettuce.replace({"Product":{"Lettuce, green leaf, raw": "Lettuces"}})
cucumber = diet_kcal[diet_kcal["Food Name"]=="Cucumber, peeled, raw"]
cucumber = cucumber.drop(columns=["Food Group", "rank"]).rename(columns={"Food Name":"Product"})
cucumber = cucumber.replace({"Product":{"Cucumber, peeled, raw": "Cucumbers"}})
# DataFrame.append was removed in pandas 2.0 -- use pd.concat instead
prod_diet_final = pd.concat([prod_diet_final, lettuce, cucumber])
# In[242]:
# look at all tomato entries to find the red one
tomatoes = diet_kcal[diet_kcal["Food Name"].str.contains("tomatoe", case=False)]
tomatoes
# Found the name for the red tomato. Add it to the dataset:
# In[243]:
tomatoes = diet_kcal[diet_kcal["Food Name"]=="Tomatoes, red, ripe, raw, year round average"]
tomatoes = tomatoes.drop(columns=["Food Group", "rank"]).rename(columns={"Food Name":"Product"})
tomatoes = tomatoes.replace({"Product":{"Tomatoes, red, ripe, raw, year round average": "Tomatoes"}})
prod_diet_final = pd.concat([prod_diet_final, tomatoes]).reset_index(drop=True).set_index("Product")
# We can finally look at **our final diet dataframe** `prod_diet_final`. This dataframe will be used in the next section to build an optimal diet **based on the production prices**
# In[244]:
prod_diet_final
# In[245]:
# making sure all the values are float (imputed rows were built from string
# arrays); a single astype covers all three macronutrient columns, which are
# the only columns of this frame (Product is the index)
prod_diet_final = prod_diet_final.astype(float)
# Saving the result into a pickle:
# In[246]:
prod_diet_final.to_pickle("data/processed/prod_diet_final.pkl")
# In[247]:
prod_diet_final = pd.read_pickle("data/processed/prod_diet_final.pkl")
# Plotting the results:
# In[248]:
# Creating an exploded (long-format) version of prod_diet_final
plot_diet_exploxed = prod_diet_final.copy()
# NOTE(review): the original called .reset_index() here without assigning the
# result -- a no-op, removed. 'Product' must stay in the index for the
# unstack/reset_index below to work.
# Unstack result, converting to dataframe and reset index
plot_diet_exploxed = plot_diet_exploxed.unstack().to_frame().reset_index(level=['Product',0])
# renaming columns (level_0 holds the macronutrient name, 0 the value)
plot_diet_exploxed = plot_diet_exploxed.rename(columns={"level_0":"Macronutrient",0:"Value"})
# showing results
plot_diet_exploxed.head()
# In[249]:
#Initialize bar plot
bp = sns.barplot(x='Product', y="Value", hue='Macronutrient', data=plot_diet_exploxed)
#Set log axis in order to show more values since we have both big numbers and very small numbers
bp.set(yscale="log")
plt.ylabel('Log of caloric density [kcal/100g]')
plt.xlabel('Food items')
plt.title('Caloric density of different food items with respect to their share of macronutrients');
# In[250]:
# Stacked version of the same data, one bar per product.
prod_diet_final.plot(kind='bar',stacked=True)
plt.ylabel('Caloric density [kcal/100g]')
plt.xlabel('Food items')
plt.title('Stacked barplot of caloric density of different food items with respect to their share of macronutrients');
# The code which generates the 3D plot used for the macronutrients representation is as follows.
# In[ ]:
# One colour value per food item (13 products in prod_diet_final).
t = np.linspace(0, 20, 13)
fig = go.Figure(data=[go.Scatter3d(
    name="",
    text=prod_diet_final.index.values,
    x=prod_diet_final["Protein (kcal/100g)"],
    y=prod_diet_final["Carbohydrates (kcal/100g)"],
    z=prod_diet_final["Fat (kcal/100g)"],
    # NOTE(review): this template was broken across raw source lines in the
    # exported script (a SyntaxError); the <br> separators were presumably
    # lost in the export -- confirm against the original notebook.
    hovertemplate=("Item: %{text}<br>"
                   "Protein: %{x:.2f} kcal/100g<br>"
                   "Carbohydrates: %{y:.2f} kcal/100g<br>"
                   "Fat: %{z:.2f} kcal/100g"),
    mode="markers",
    marker=dict(
        size=8,
        color=t, # set color to an array/list of desired values
        colorscale='Viridis', # choose a colorscale
        opacity=0.8
    )
)])
camera = dict(
    eye=dict(x=-1.4, y=-1.4, z=0.05),
    center=dict(x=0, y=0, z=-0.2)
)
# stray trailing double quote removed from the title string
name = 'Shares of macronutrients in each food item'
# tight layout
fig.update_layout(
    margin=dict(l=0, r=0, b=0, t=0),
    title=name,
    scene = dict(
        camera=camera,
        xaxis = dict(title='Protein [kcal/100g]'),
        yaxis = dict(title='Carbohydrates [kcal/100g]'),
        zaxis = dict(title='Fat [kcal/100g]'),
        dragmode = 'turntable'),
    paper_bgcolor = "rgba(0,0,0,0)"
)
py.plot(fig, filename='docs/_includes/3d_macros.html', include_plotlyjs=False)
# In[291]:
IFrame(src="https://manuleo.github.io/mADAm_files/3d_macros.html", width=800, height=600)
# As we can see from the 3D spatial representation of the shares of macronutrients, **chicken** and **meat beef** occupy the region in which **proteins** and **fat** density is high. On the other hand, we also observe that **oats** has very high carbs caloric density, while it doesn't really have a stong protein and fat footprint. In addition, we also observe that the most of **vegetables** and **fruits** have quite low caloric density. The products listed will be the **basis of the perfect diet** that we will compute.
# ## 4) Final filter on the prices to retain only the needed ones
# We need now to prepare the prices matrix for the next section. In order to do so, we filter the product we have in the `prices` dataframe accordingly with what we have delineated in the last part.
# Theoretically, we should have **65 elements** in the final dataframe (5 countries, 13 products)
# In[252]:
# a conversion in float for all the prices just to be sure
prices.Price = prices.Price.astype(float)
# In[253]:
prices_final = prices[prices.Product.isin(prod_diet_final.index.values)]
prices_final.index.size
# We miss **2** products. We must investigate over those:
# In[254]:
print("Set of missing product in prices:", set(prod_diet_final.index.values) - set(prices_final.Product.unique()))
# Probably these products have another name in the previous dataframe (as example, _"Beef Meat"_ was set as _"Meat"_ by us).
# We look into these problems and manually solve them
# In[255]:
# check for Meat
prices[prices.Product.str.contains("Meat")]
# As expected, we have _"Meat"_ instead of _"Beef Meat"_. Fixing it:
# In[256]:
prices = prices.replace({"Product":{"Meat": "Beef Meat"}})
# In[257]:
# check for Barley
prices[prices.Product.str.contains("Barley")]
# For this case, we have _"Feed Barley"_ and _"Malting Barley"_ from which choose. As we don't know from which kind of barley is derived the one we are considering in the `USDA_food` we decide to take an **average** between the two prices
# In[258]:
barleys = prices[prices.Product.str.contains("Barley")]
# average the two barley kinds per country; select Price explicitly so the
# non-numeric Product column does not break the aggregation
barleys = barleys.groupby("Country", as_index=False)["Price"].mean()
barleys["Product"] = "Barley"
barleys = barleys[["Country", "Product", "Price"]]
# DataFrame.append was removed in pandas 2.0 -- use pd.concat instead
prices = pd.concat([prices, barleys])
# After the final fix, we can now print the **`prices_final` dataframe** for the prices for our interesting product for all the countries
# In[259]:
prices_final = prices[prices.Product.isin(prod_diet_final.index.values)]
prices_final.head()
# We need to elaborate on this dataframe. For our subsequent section, we need a **matrix with countries as index and product as columns**. We therefore work on the last dataframe a bit and produce a final one named `final_prices`
# In[260]:
# reshaping with pivot
final_prices = pd.pivot_table(prices_final, index="Country", columns="Product", aggfunc="first")
#cleaning
final_prices.index = final_prices.index.to_flat_index()
# pivot_table produces ('Price', product) column pairs -- keep the product level
final_prices.columns = final_prices.columns.get_level_values(1)
final_prices.columns.name = None
# final print
final_prices
# *Final footnote*: the values here reported are **USD/tonnes**
# Loading the result into a pickle
# In[261]:
final_prices.to_pickle('data/processed/final_prices.pkl')
# In[262]:
final_prices = pd.read_pickle('data/processed/final_prices.pkl')
# *Final footnote*: the values here reported are **USD/tonnes**
# # Composing the optimal DIET for every helper country
# The composition of the diet is one of the **key steps** of our analysis. We decide to compute the amount of product by minimizing the actual costs of shipments that every European country has to meet. In order to model the problem we need to define an objective function which in this case will be the **non-negative weighted sum of products' cost**.
# The problem we want to model is a **Linear Program**. The library we will use to solve is **PICOS** and the solver will be **GUROBI**.
# In particular:
# - $Y$ is a matrix in $R^{mxn}$ in which $m$ is the number of products in our diet and $n$ is the number of African countries. Each entry $y_{ij}$ of the matrix $Y\in{R^{mxn}}$ is the amount [tonne] of food product $i\in{1,...,m}$ sent by the country $j\in{1,...,n}$
# - The weights are the unit cost of each product [USD/tonne]. We will name the costs $c\in{R^{mx1}}$.
# - The objective function is $\sum_{i=1}^{m}\sum_{j=1}^{n}{c_jy_{ij}}$
# - The constraints can be considered a restriction on the value that our decision variable will assume. By restricting the feasibile set we will impose the following limits:
# - non-negativity
# - supply and demand must be met according to the correct shares of proteins,carbs and fats
# - even distribution of resources
# The strategy is to take advantage of the results we obtained from the previous optimization problem (`food_opt_distribution_df`) and use them as a starting point for the computation of the **ideal diet** for every country.
# - The term **ideal** is used because the output of the problem will be the cheapest solution to meet African food demand considering the limitations of our model and the approximation used so far.
# - The term **diet** is used to indicate the set of products that will be sent from every European country to the African ones based on the optimal distribution computed before. Since retrieving the prices for every product in different country, the list of food items available to choose is restricted (`final_prices`).
#
# A last note is that the problem will be solved 5 times, one for each of `best_countries`. We will have 5 outputs to analyse. Let's do it.
# In[263]:
# In[263]:
# Kilocalories that Italy gives up, one entry per recipient African country
# (taken from the first optimization's output).
italy_giveup_val = food_opt_distribution_df.loc["Italy"].values
italy_giveup_index = food_opt_distribution_df.loc["Italy"].index
# In[264]:
# Italian unit prices [USD/tonne] as a column vector (m x 1) for the LP objective.
italy_prices_val = final_prices.loc["Italy"].values.reshape(-1,1)
italy_prices_index = final_prices.loc["Italy"].index
final_prices
# In[265]:
# Sort products so rows of the price vector and the nutrient matrix line up.
prod_diet_final = prod_diet_final.sort_values(by="Product")
prod_diet_index = prod_diet_final.index
# Dividing by 10**-6 multiplies by 1e6 — presumably rescaling the caloric
# densities to per-tonne units. TODO(review): confirm the unit conversion.
prod_diet_val = prod_diet_final.to_numpy()/10**-6
prod_diet_final
# First off, we initialize the problem `prob`. Secondly, we add the decision variable $Y\in{R^{mxn}}$:
# In[266]:
# In[266]:
prob = pic.Problem() # initialize the convex (linear) optimization problem
# Decision matrix Y (m products x n countries): tonnes of product i shipped to country j.
Y = prob.add_variable('Y', (italy_prices_val.size,italy_giveup_val.size)) #definition of decision matrix of variables nxm
# The go on setting the objective function `obj` and the nature of the problem `minimization`:
# In[267]:
# Objective: total shipment cost = sum over all entries of (price_i * Y_ij).
obj = pic.sum(italy_prices_val.T * Y) #define obj function
prob.set_objective("min", obj) #set objective function
# With regards to the constrains:
# - Non-negativity is required because **physically speaking it doesn't make sense** to ship a negative quantity of food. In addition, from a mathematical point of view we have to set a lower bound for the objective function, otherwise the optimization will push the optimal value to $-\infty$.
# - The second constraint is modeled to make sure that **every African country receives the right shares of its demand in terms of macronutrients**.
# - The third constraint will ensure that the problem's resolution yields reliable results. To guarantee this, one important assumption has to be taken: **For every country, the amount of each product should cover a share between 0.01% and 35.5% of the kilocalorie demand**. In this way, we will have an even distribution of products for every single country.
# In[268]:
# Initialize the list of constraint references.
constraints = []
# Non-negativity: shipped quantities cannot be negative (also bounds the LP from below).
constraints.append(prob.add_constraint(Y>=0))
# Macronutrient constraints: each recipient country i must receive exactly the
# prescribed share of its caloric demand from proteins, carbs and fats.
# Shares of proteins, carbs and fat — fixed parameters, not decision variables.
shares = np.array([0.55,0.25,0.2])
# NOTE(review): the exported source had lost the loop-body indentation
# (a syntax error); the nesting below restores the intended structure.
for i in range(0,italy_giveup_val.size):
    for j in range(0,shares.size):
        constraints.append(prob.add_constraint(Y[:,i].T*prod_diet_val[:,j].reshape(-1,1)==shares[j]*italy_giveup_val[i]))
# Even-distribution bounds: every product must cover at most 35.5% and at least
# 0.01% of each country's final kilocalorie demand.
for i in range(0,italy_giveup_val.size):
    for j in range(0,italy_prices_val.size):
        constraints.append(prob.add_constraint(pic.sum(Y[j,i]*prod_diet_val[j,:].reshape(1,-1))<=0.355*italy_giveup_val[i]))
        constraints.append(prob.add_constraint(pic.sum(Y[j,i]*prod_diet_val[j,:].reshape(1,-1))>=0.0001*italy_giveup_val[i]))
# Let's have a look at the problem:
# In[269]:
#print(prob)
# In[270]:
#Solving problem with gurobi solver (License available for free academic use)
# verbose=0 suppresses the solver log.
solution = prob.solve(verbose=0, solver = 'gurobi')
# Optimal decision matrix: tonnes of each product shipped to each country.
Y_opt_ITdiet = Y.value
print('The solution of the problem is:')
print(Y_opt_ITdiet);
# Converting the solution into a more agile dataframe:
# In[271]:
# Rows = products, columns = recipient African countries.
result_ITdiet = np.array(Y_opt_ITdiet)
result_ITdiet_df= pd.DataFrame(data=result_ITdiet, index = prod_diet_index)
result_ITdiet_df.columns = italy_giveup_index
result_ITdiet_df
# Let's now have a look at the results yielded by the optimization considering that Italy is the donor country. The products that are being most used are:
# - **Beef Meat** with a price of **3762\$/ton** and a caloric protein density of **70.24 kcal/100g** and caloric fat density **28.98 kcal/100g**
# - **Chicken** with a price of **2244.77\$/ton** and a caloric protein density of **88.80 kcal/100g** and **14.67 kcal/100g** caloric fat density
# - **Cucumbers** with a price of **748.0\$/ton** and a caloric carbs density of **8.64 kcal/100g** caloric proteins density **2.36 kcal/100g**
# - **Oats** with a price of **272.3\$/ton** and a caloric carbs density of **264.88 kcal/100g**
# - **Tomatoes** with a price of **1372.4\$/ton** and a caloric carbs density of **15.56 kcal/100g**
#
# We derive from this values that the optimization is actually choosing the products that **costs the least** while **contributing the most to the kilocalories demand**. Moreover, the constraint of shares of macronutrients is respected so our diets are close to what in reality one would need biologically.
# In conclusion, **the results we obtained are quite surprising**. We have obtained a tangible number of tons of food items that every European country should send in order to meet the demand in Africa.
# If the reader want to have an overview on:
# - **the food item**, he can look at the rows of `result_ITdiet_df` and obtain **the distribution of for every food items** over the 25 African countries in respect to the donor country (In this example Italy).
# - **the diet**, the reader can take the columns to see a **full diet composed of different food items with the respective quantities**. This full diet can be provided by each European country for which we solved the problem, and in this context it will be shipped from Italy with respect to what the prices and availability in Italy are.
#
# The problem shown above was an example on how we will solve the problem for just one country. For the sake of simplicity, a function `LPSolverDietEurope` will be created and run on the 5 countries. The function will output 5 dataframes with the respective solutions. For more details, have a look at *LPSolverDietEurope.py* in the repository.
# In[272]:
import LPSolverDietEurope as lp
# Solve the diet LP once per donor country; each solution is a
# product-by-recipient dataframe of tonnes to ship (see LPSolverDietEurope.py).
diet_dict_eu_countries = {name: lp.LPSolverDietEurope(name) for name in best_countries}
# Persist each solution to its own pickle.
# NOTE(review): the exported source had lost this loop body's indentation
# (a syntax error); restored below.
for countries in best_countries:
    diet_dict_eu_countries[countries].to_pickle("data/processed/" + countries + "_opt_diet.pkl")
# In[273]:
#Loading pickle of solutions
import pandas as pd
Final_Diet_It = pd.read_pickle("data/processed/Italy_opt_diet.pkl")
Final_Diet_Ge = pd.read_pickle("data/processed/Germany_opt_diet.pkl")
Final_Diet_Fr = pd.read_pickle("data/processed/France_opt_diet.pkl")
Final_Diet_Sp = pd.read_pickle("data/processed/Spain_opt_diet.pkl")
Final_Diet_UK = pd.read_pickle("data/processed/United Kingdom_opt_diet.pkl")
# We have our five final dataframes containing the tons of products for every country that can be considered as **OUR FINAL DIETS**.
# In[295]:
Final_Diet_It.head()
# In[275]:
Final_Diet_Ge.head()
# In[276]:
Final_Diet_Fr.head()
# In[277]:
Final_Diet_Sp.head()
# In[278]:
Final_Diet_UK.head()
# The code for the chord plot is present in [this notebook](https://nbviewer.jupyter.org/github/manuleo/mADAm-2019/blob/master/Chord.ipynb). (Chord.ipynb from our repository). We didn't put it here for scalability reasons.
# The below cells contain the code for the sunburst plot generation.
# In[3]:
def build_hierarchical_dataframe(df, levels, value_column, color_columns=None, prod=None):
    """
    Build a hierarchy of levels for Sunburst or Treemap charts.

    Levels are given starting from the bottom to the top of the hierarchy,
    ie the last level corresponds to the root.

    Parameters:
        df: dataframe with one row per leaf; must contain every column in
            `levels`, `value_column`, and (at the leaf level) a 'Product'
            column used for labels.
        levels: list of column names, bottom-up (last one is the root's child).
        value_column: numeric column to aggregate with sum.
        color_columns: list whose first element is the column used for node
            color (assumed non-None by the body, as in the original).
        prod: ids whose log values are summed into the root's log_val
            (assumed non-None by the body, as in the original).

    Returns:
        DataFrame with columns ['id', 'label', 'parent', 'value', 'log_val',
        'color'], one row per node plus a final 'Total' root row.
    """
    # NOTE(review): the exported source had lost this function's indentation
    # and misspelled the groupby-sum keyword as `numerical_only`. We aggregate
    # with an explicit spec instead, which also keeps the 'Product' column
    # (needed for leaf labels) that a plain numeric-only sum would drop.
    trees = []
    for i, level in enumerate(levels):
        agg_spec = {value_column: 'sum'}
        if color_columns is not None:
            agg_spec[color_columns[0]] = 'sum'
        if i == 0:
            agg_spec['Product'] = 'first'  # leaf groups are single rows
        dfg = df.groupby(levels[i:]).agg(agg_spec).reset_index()
        df_tree = pd.DataFrame({
            'id': dfg[level],
            # Leaves are labeled by product name; inner nodes by their own id.
            'label': dfg['Product'] if i == 0 else dfg[level],
            # Parent is the next level up, or the synthetic 'Total' root.
            'parent': dfg[levels[i + 1]] if i < len(levels) - 1 else 'Total',
            'value': dfg[value_column],
            # +2 keeps the log positive for zero/small values.
            'log_val': np.log(dfg[value_column] + 2),
            'color': dfg[color_columns[0]],
        })
        trees.append(df_tree)
    df_all_trees = pd.concat(trees, ignore_index=True)  # DataFrame.append removed in pandas 2.0
    total = pd.DataFrame([{
        'id': 'Total',
        'label': 'Total',
        'parent': '',
        'value': trees[-1]['value'].sum(),
        'color': trees[-1]['color'].sum(),
        # Root log value: sum of the log values of the ids listed in `prod`
        # (kept as in the original; ids not in `prod` contribute nothing).
        'log_val': df_all_trees.loc[df_all_trees.id.isin(prod), 'log_val'].sum(),
    }])
    return pd.concat([df_all_trees, total], ignore_index=True)
# In[4]:
def make_sunburst(dataset, path, EU=False):
    """
    Build a Plotly Sunburst of the diet in `dataset` and save it to `path`.

    Parameters:
        dataset: dataframe with products as the index (the index is assumed to
            be named 'Product' — the stacked frame is accessed via `.Product`;
            TODO(review) confirm) and recipient countries as columns; values
            are tonnes.
        path: output HTML file path for plotly.offline.plot.
        EU: if True use the "YlGnBu" color scale, otherwise "RdBu".
    """
    df_val = dataset.copy()
    prod = dataset.index.values
    df_val = df_val.apply(pd.to_numeric)
    # Long format: one row per (Product, country) pair.
    df_val = df_val.stack().to_frame().reset_index()
    df_val = df_val.rename(columns={"level_1": "af_countries", 0: "val"})
    levels = ['id', 'af_countries']  # levels used for the hierarchical chart
    color_columns = ['val']
    value_column = 'val'
    # Unique leaf ids: product name + row number.
    id_n = df_val.index.values
    df_val['id'] = df_val.Product + id_n.astype(str)
    df_val = df_val[['Product', 'id', 'af_countries', 'val']]
    df_all_trees = build_hierarchical_dataframe(df_val, levels, value_column, color_columns, prod)
    # Shorten long country names for display.
    df_all_trees = df_all_trees.replace({'United Republic of Tanzania': 'Tanzania', 'Central African Republic': 'Centr. Afr. Rep.'})
    if EU:
        color_scale = "YlGnBu"
    else:
        color_scale = "RdBu"
    fig = go.Figure(go.Sunburst(
        ids=df_all_trees.id,
        labels=df_all_trees.label,
        parents=df_all_trees.parent,
        maxdepth=2,
        values=df_all_trees.log_val,  # log scale keeps small slices visible
        customdata=df_all_trees.value,  # real tonnage shown in the hover
        # NOTE(review): the exported source had a raw newline inside this
        # literal (a syntax error); restored as Plotly's "<br>" line break —
        # confirm against the original notebook.
        hovertemplate='%{label}<br>Need: %{customdata:.3f} tons',
        name='',
        marker=dict(colorscale=color_scale, line=dict(color="#777", width=0.5)),
        # branchvalues='total'
    ))
    fig.update_layout(margin=dict(t=0, l=0, r=0, b=20), autosize=False, height=500, width=500, paper_bgcolor="rgba(0,0,0,0)", plot_bgcolor="rgba(0,0,0,0)")
    py.plot(fig, filename=path, include_plotlyjs=False)
# The corresponding Chord Plot and SunBurst Plot are available at the following links:
# - **Italy**:
# - [Chord](https://manuleo.github.io/mADAm_files/chord_it.html)
# - [SunBurst](https://manuleo.github.io/mADAm_files/sunburst_it.html)
# - **Germany**:
# - [Chord](https://manuleo.github.io/mADAm_files/chord_ge.html)
# - [SunBurst](https://manuleo.github.io/mADAm_files/sunburst_ge.html)
# - **France**:
# - [Chord](https://manuleo.github.io/mADAm_files/chord_fr.html)
# - [SunBurst](https://manuleo.github.io/mADAm_files/sunburst_fr.html)
# - **Spain**:
# - [Chord](https://manuleo.github.io/mADAm_files/chord_es.html)
# - [SunBurst](https://manuleo.github.io/mADAm_files/sunburst_sp.html)
# - **United Kingdom**:
# - [Chord](https://manuleo.github.io/mADAm_files/chord_uk.html)
# - [SunBurst](https://manuleo.github.io/mADAm_files/sunburst_uk.html)
# The final important step of our analysis is to **compute a more general diet** that takes into account all the countries contributing to the cause. The first optimization problem gave us the kilocalories that every European country has to give up to save every African country. Considering this, the second optimization problem could have been carried out considering just the total calories given jointly by all of Europe, but **increasing the granularity enabled us to determine appropriate individual diets**. In this final part we will have a look at the **final global diet** that comprises the effort of all European countries. The result is simply obtained by summing up the 5 dataframes generated.
# In[280]:
# Global diet: element-wise sum of the five per-donor diet dataframes
# (same product index and country columns in each).
FINAL_Diet_df = Final_Diet_It + Final_Diet_UK + Final_Diet_Ge + Final_Diet_Sp + Final_Diet_Fr
FINAL_Diet_df.head()
# Let's have a look at them and then visualize the data with an interactive **Chord plot** and a **SunBurst plot**.
# In[297]:
# Embed the pre-generated chord plot (hosted on GitHub Pages).
IFrame(src="https://manuleo.github.io/mADAm_files/chord.html", width=800, height=600)
# This chord plot provides a general overview about how food could be redistributed with minimal expenditures. Germany and France were found to contribute the most, with each providing slightly more than 100,000 tons. This is mainly due to their higher GDP. On the other end, Ethiopia claims the highest share of food aid of all examined countries.
# In[300]:
# Embed the pre-generated sunburst plot for the combined European diet.
IFrame(src="https://manuleo.github.io/mADAm_files/sunburst_EU.html", width=800, height=600)
# The SunBurst plot shows basically the final result of our whole analysis. It is used to wrap up the global final diet that the analysis propose to every African country. It is important to understand that when we talk about a diet we mean a list of food items in which we indicate the amount of food [tons] that has to be shipped.
#
# More specifically, we can analyse the Ethiopian case, in which the diet is composed of:
# - **meat beef** (\~10000 tons)
# - **chicken** (\~8000 tons)
# - **oats**(\~2700 tons)
# - **tomatoes** (\~1400 tons)
#
#
# The other products are still considered, although to a negligible extent, as they have either high prices or low caloric density. In total, **barley, cucumbers, lettuces, strawberries, cherries, apples, beans, peas and rice** jointly contribute ~150 tons.
# Notice that the diet will be proportional with respect to the food deficit of each African country. This is important because we can apply the previous analysis to all the other African countries and take it as a guide to understand the results in a different context.