#!/usr/bin/env python
# coding: utf-8

# Table of Contents:
#
# * [1. Introduction](#Introduction)
#
#     * [1.1 Background](#Background)
#     * [1.2 Problem Statement](#PS)
#     * [1.3 Objective](#Objective)
#     * [1.4 Data Dictionary](#Dict)
#
# * [2. Python Libraries](#PL)
#
#     * [2.1 Import Libraries & Ignore Warnings](#IL)
#
# * [3. Data Preprocessing (1)](#DP)
#
#     * [3.1 Data Reading](#DR)
#     * [3.2 Data Inspection and Analysis](#DIA)
#     * [3.3 Feature Transformation](#FT)
#
# * [4. Exploratory Data Analysis (EDA)](#EDA)
#
#     * [4.1 Univariate Analysis](#UA)
#     * [4.2 Multivariate Analysis](#BA)
#
# * [5. Strategy](#S)
#
#     * [5.1 Feature Selection](#FS)
#     * [5.2 Outlier Treatment](#OT)
#
# * [6. Data Preprocessing (2)](#DP)
#
#     * [6.1 Feature Scaling](#FS)
#     * [6.2 Hopkins Test](#HT)
#
# * [7. K-Means Clustering Model](#KC)
#
#     * [7.1 The Elbow Method](#EM)
#     * [7.2 K-Means Algorithm](#KA)
#     * [7.3 Visualization of Clustering Result](#VCR)
#
# * [8. Findings & Conclusion](#F&C)
#
#     * [8.1 Results](#RS)
#     * [8.2 Recommendations](#RM)

# ### 1. Introduction

# ### 1.1 Background
#
# HELP International is an international humanitarian NGO that is committed to fighting poverty and providing the people of backward countries with basic amenities and relief during the time of disasters and natural calamities. It runs a lot of operational projects from time to time along with advocacy drives to raise awareness as well as for funding purposes.

# ### 1.2 Problem Statement
#
# After the recent funding programmes, they have been able to raise around $10 million. Now the CEO of the NGO needs to decide how to use this money strategically and effectively. The significant issues that come while making this decision are mostly related to choosing the countries that are in the direst need of aid.

# ### 1.3 Objective
#
# To categorize/segment countries using socio-economic and health factors to identify which countries need financial assistance the most.
#
# ### 1.4 Data Dictionary
#
# * country : Name of the country
# * child_mort : Deaths of children under 5 years of age per 1000 live births
# * exports : Exports of goods and services per capita, given as a percentage of the GDP per capita
# * health : Total health spending per capita, given as a percentage of the GDP per capita
# * imports : Imports of goods and services per capita, given as a percentage of the GDP per capita
# * income : Net income per person
# * inflation : The measurement of the annual growth rate of the Total GDP
# * life_expec : The average number of years a newborn child would live if the current mortality patterns were to remain the same
# * total_fer : The number of children that would be born to each woman if the current age-fertility rates remain the same
# * gdpp : The GDP per capita, calculated as the Total GDP divided by the total population
#
# [Back to TOC](#TOC)

# ### 2. Python Libraries

# ### 2.1 Import Libraries and Ignore Warnings

# In[1]:

# Notebook/shell command rather than Python; run it once to install kneed:
# pip install kneed

# In[2]:

# DATA ANALYSIS AND VISUALIZATION LIBRARIES
import pandas as pd
import numpy as np
from random import sample
from numpy.random import uniform
from math import isnan
import seaborn as sns
import matplotlib.pyplot as plt

# MACHINE LEARNING LIBRARIES
import sklearn
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.neighbors import NearestNeighbors
from kneed import KneeLocator

import warnings
warnings.filterwarnings('ignore')

# [Back to TOC](#TOC)

# ### 3. Data Preprocessing
#
# ### 3.1 Data Reading

# In[3]:

# IMPORTING EXCEL FILES
country_df = pd.read_excel("/Users/griotinsights/Desktop/DATASETS/HELP INTERNATIONAL/Country-data.xls")

# In[4]:

country_df.head(10)

# ### 3.2 Data Inspection

# In[5]:

country_df.shape

# In[6]:

country_df.isnull().sum()

# In[7]:

country_df.duplicated().sum()

# In[8]:

country_df.info()

# Observation: The data has no missing values and no duplicates.
#
# ### 3.3 Feature Transformation
#
# According to the data dictionary, imports, exports and health are represented as percentages of GDP per capita. Using these percentages directly can skew the analysis. For example, Australia and Afghanistan appear to spend similar amounts on health (8.73% and 7.58% respectively), which is misleading because their GDP per capita figures are far apart. Hence the need to convert the percentages into actual per capita values.

# In[9]:

# Convert each percentage of GDP per capita into an absolute per capita value
country_df['exports'] = (country_df['exports'] / 100) * country_df['gdpp']
country_df['imports'] = (country_df['imports'] / 100) * country_df['gdpp']
country_df['health'] = (country_df['health'] / 100) * country_df['gdpp']

country_df.head()

# [Back to TOC](#TOC)

# ### 4. Exploratory Data Analysis
#
# Exploratory data analysis is the process of performing initial investigations to understand the data: discovering trends, spotting anomalies and checking assumptions using statistical summaries and data visualizations.

# ### 4.1 Univariate Analysis
#
# In Univariate Analysis, only one variable is analyzed at a time. This analysis is used to describe the data and find patterns that exist within it.

# In[10]:

country_df.describe()

# Observation: From the table above, we can already tell that "child_mort", "income" and "gdpp" are greatly skewed and have asymmetrical distributions, because their mean and median are very far apart.
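# The mean-versus-median check behind this observation can be sketched on hypothetical toy values (not taken from the dataset). The column names `symmetric` and `right_skewed` are illustrative assumptions; `DataFrame.skew()` adds a numeric measure to the comparison.

```python
import pandas as pd

# Hypothetical toy values, NOT taken from the HELP International dataset
toy = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5, 6, 7],
    "right_skewed": [1, 1, 2, 2, 3, 4, 50],
})

# Collect mean, median and sample skewness for every column
summary = pd.DataFrame({
    "mean": toy.mean(),
    "median": toy.median(),
    "skew": toy.skew(),
})

# A mean far above the median points to a long right tail
summary["mean_minus_median"] = summary["mean"] - summary["median"]
print(summary)
```

# The same summary could be built for the real data from `country_df.select_dtypes("number")`, which would leave out the `country` column.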
# In[11]:

country_df.columns

# In[12]:

features = ['child_mort', 'exports', 'health', 'imports', 'income',
            'inflation', 'life_expec', 'total_fer', 'gdpp']

# In[13]:

# Distribution of each feature on a 3x3 grid
plt.figure(figsize=(12, 12))
for idx, feature in enumerate(features):
    ax = plt.subplot(3, 3, idx + 1)
    # histplot with a KDE overlay stands in for the deprecated sns.distplot
    sns.histplot(country_df[feature], kde=True, stat="density")
    plt.xticks(rotation=0)
plt.tight_layout()

# In[14]:

# Box plot of each feature on a 3x3 grid to expose outliers
plt.figure(figsize=(12, 12))
for idx, feature in enumerate(features):
    ax = plt.subplot(3, 3, idx + 1)
    sns.boxplot(x=country_df[feature])
    plt.xticks(rotation=0)
plt.tight_layout()

# OBSERVATIONS FROM UNIVARIATE ANALYSIS:
#
# * Child Mortality, Exports, Imports, Income, Inflation and GDP per capita are all highly skewed to the right (positively skewed) and have several outliers.
#
# * Health and Total Fertility are also positively skewed but with only one outlier each.
#
# * Life Expectancy is negatively skewed (left-skewed) with a few outliers.
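# The points that a box plot draws beyond its whiskers are, with seaborn's default settings, the values lying more than 1.5 × IQR outside the quartiles. A rough sketch of counting them per feature follows; the helper name `count_iqr_outliers` and the toy values are hypothetical and only illustrate the rule.

```python
import pandas as pd


def count_iqr_outliers(series: pd.Series, whisker: float = 1.5) -> int:
    """Count values lying more than `whisker` * IQR outside the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
    return int(((series < lower) | (series > upper)).sum())


# Hypothetical toy values, NOT taken from the HELP International dataset
toy = pd.DataFrame({
    "income": [600, 900, 1200, 1500, 1800, 2100, 90000],  # one high outlier
    "life_expec": [32, 70, 71, 72, 73, 74, 75],           # one low outlier
})

# Apply the helper column by column
outlier_counts = toy.apply(count_iqr_outliers)
print(outlier_counts)
```

# Applied to `country_df[features]`, the same call would give a per-feature outlier count to set beside the box plots.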
#
# ### 4.2 Multivariate Analysis
#
# In Multivariate Analysis, two or more variables are analyzed together. This analysis deals with causes and relationships, and is done to find out how the variables relate to one another.

# In[15]:

# pairplot creates its own figure, so no separate plt.figure call is needed
sns.pairplot(country_df, corner=True)
plt.show()

# In[16]:

# Heatmap to determine the correlation between the features.
# numeric_only=True leaves out the non-numeric 'country' column.
plt.figure(figsize=(14, 9))
sns.heatmap(country_df.corr(numeric_only=True), annot=True, cmap="YlGnBu")

# In[17]:

# To find the degree of the relationship amongst the variables, a correlation function is used
country_df.corr(numeric_only=True)

# OBSERVATIONS FROM MULTIVARIATE ANALYSIS:
#
# * Health expenditure is only weakly correlated with the other health factors (Child Mortality, Total Fertility and Life Expectancy).
#
# * Child Mortality correlates with the other health factors, Total Fertility and Life Expectancy.
#
# * Inflation is only weakly correlated with the other socio-economic factors.
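# A correlation matrix can be hard to scan, so one option is to flatten it into a list of feature pairs ranked by the strength of their correlation. The sketch below assumes hypothetical toy values and a hypothetical helper name, `ranked_correlations`; it is not part of the original analysis.

```python
import numpy as np
import pandas as pd


def ranked_correlations(df: pd.DataFrame) -> pd.Series:
    """Return each pair of numeric columns once, strongest correlation first."""
    corr = df.select_dtypes("number").corr()
    # Keep only the upper triangle so every pair appears once, without the diagonal
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False)


# Hypothetical toy values, NOT taken from the HELP International dataset
toy = pd.DataFrame({
    "country": list("ABCDEF"),
    "child_mort": [90, 60, 40, 20, 10, 5],
    "life_expec": [55, 62, 68, 74, 79, 82],
    "inflation": [4, 12, 3, 9, 2, 7],
})

ranked = ranked_correlations(toy)
print(ranked)
```

# Calling `ranked_correlations(country_df)` would list the strongest and weakest relationships in the heatmap as plain numbers.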
#
# [Back to TOC](#TOC)

# ### 5. Strategy
#
# The above analysis has provided some insight into the relationships between the variables. The next steps involve selecting the right features for the clustering analysis and then treating the outliers in these features.

# ### 5.1 Feature Selection
#
# Based on the objective, we are to select countries in need based on socio-economic and health factors. Therefore, we need to know which features/variables fall under these factors.
#
# | Health Factors| Socio-Economic Factors |
# | :-------------| ---------------------: |
# | Child Mortality | Exports |
# | Health | Imports |
# | Total Fertility | Income |
# | Life Expectancy | Inflation |
# | | GDP Per Capita |
#
# Approach
#
# 1. Health Factor of choice is Child Mortality:
#
# The weak correlation between healthcare expenditure and the other health factors (Child Mortality, Total Fertility, and Life Expectancy) suggests that the healthcare funds may not be reaching the areas that have the greatest impact on child mortality and the other health factors.
#
# Child mortality, which is influenced by various determinants of health and by environmental factors such as sanitation, poverty and nutrition, is a better indicator for financial assistance. By targeting child mortality, HELP International can address both healthcare needs and broader social determinants of health, aligning with the goal of fighting poverty. Moreover, child mortality correlates with other health factors such as total fertility and life expectancy, making it a comprehensive measure to improve population health.
#
# 2. Socio-Economic Factor of Choice is Income:
#
# Inflation has a weak correlation with the other socio-economic factors because its influence is indirect and mediated by various economic and social factors such as monetary policy decisions, fiscal measures, and supply-demand dynamics within an economy. Income, on the other hand, provides a more comprehensive understanding of individuals' economic well-being by considering non-trade-related factors and domestic economic impact. It captures income disparities, poverty rates, and the ability to access essential services, making it a more holistic indicator for financial assistance targeting socio-economic conditions.
#
# ### 5.2 Outlier Treatment
#
# From the box plots in the univariate analysis, both child mortality and income have outliers and are skewed to the right. In treating these outliers, we refrain from the deletion method since this would exclude countries that are in the direst need of aid. Hence, there is no outlier treatment for Child Mortality.
#
# For Income, we are more focused on countries with low income per person, so we adopt the winsorization method (percentile capping) to treat the high values. We cap at the 96th percentile, which means that values greater than the value at the 96th percentile are replaced by that value.

# In[18]:

# Capping income at the 96th percentile
max_income = country_df['income'].quantile(0.96)
print('Total number of rows getting capped for income : ', len(country_df[country_df['income'] > max_income]))

# In[19]:

# .loc assigns on the original frame and avoids chained assignment
country_df.loc[country_df['income'] > max_income, 'income'] = max_income
country_df.income.max()

# In[20]:

# Checking for outliers after capping
sns.boxplot(data=country_df, x="income")
plt.show()

# [Back to TOC](#TOC)

# ### 6. Data Preprocessing (2)
#
# Now that the features have been selected and the outliers treated, we can go ahead and prepare and test the data for clustering.

# ### 6.1 Feature Scaling
#
# The selected features are measured on different scales, hence the need to adjust the values and put them on a common scale.
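# As a minimal sketch of what putting features on a common scale means here, `StandardScaler` subtracts each column's mean and divides by its population standard deviation, so every column ends up with mean 0 and standard deviation 1. The toy values and the country labels A to D below are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy values, NOT taken from the HELP International dataset
toy = pd.DataFrame(
    {
        "child_mort": [5.0, 20.0, 60.0, 115.0],
        "income": [40000.0, 12000.0, 3000.0, 1000.0],
    },
    index=["A", "B", "C", "D"],
)

# Scale with scikit-learn, keeping the column names and the row index
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(toy), columns=toy.columns, index=toy.index)

# The same z-score computed by hand: (x - mean) / population standard deviation
manual = (toy - toy.mean()) / toy.std(ddof=0)

print(scaled.round(3))
print(np.allclose(scaled.values, manual.values))
```

# Passing `index=` when rebuilding the DataFrame is a design choice of this sketch: it keeps the country labels attached to the scaled rows, which makes later joins back to the unscaled data less error-prone.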
# In[24]:

df = country_df[['country', 'child_mort', 'income']]

# In[28]:

cluster_df = country_df[['country', 'child_mort', 'income']].set_index('country')
features = cluster_df.columns
cluster_df

# In[29]:

# Scale the features in the new dataframe
scale = StandardScaler()  # initialize the scaler
cluster_df_scaled = pd.DataFrame(scale.fit_transform(cluster_df))  # fit the scaler and transform the data
cluster_df_scaled.columns = features
cluster_df_scaled.head()

# ### 6.2 Hopkins Test
#
# The Hopkins test is a way to measure the clusterability of a dataset, to ascertain whether it contains any meaningful clusters.
#
# NB: A uniformly distributed dataset can still be clustered, but the resulting clusters are not meaningful, hence the necessity of the Hopkins test.

# In[30]:

def hopkins(X):
    d = X.shape[1]  # number of columns (features)
    n = X.shape[0]  # number of rows (observations)
    m = int(0.1 * n)  # size of the randomly sampled dataset
    nbrs = NearestNeighbors(n_neighbors=1).fit(X.values)

    rand_X = sample(range(0, n, 1), m)

    ujd = []
    wjd = []
    for j in range(0, m):
        # Draw a point uniformly from the box stretched between the minimum and maximum of
        # each feature, query two neighbors and record the second distance
        u_dist, _ = nbrs.kneighbors(uniform(np.amin(X, axis=0), np.amax(X, axis=0), d).reshape(1, -1), 2, return_distance=True)
        ujd.append(u_dist[0][1])
        # Take a randomly sampled real observation and record the distance to its nearest
        # other observation (the first neighbor is the observation itself)
        w_dist, _ = nbrs.kneighbors(X.iloc[rand_X[j]].values.reshape(1, -1), 2, return_distance=True)
        wjd.append(w_dist[0][1])

    H = sum(ujd) / (sum(ujd) + sum(wjd))
    if isnan(H):
        print(ujd, wjd)
        H = 0

    return H

# In[32]:

hopkins(cluster_df_scaled)

# RESULTS:
#
# The Hopkins statistic of 0.91 is close to 1, which indicates that the data set has a high clustering tendency (a value near 0.5 would indicate uniformly random data). Because the test samples points at random without a fixed seed, the exact value varies slightly between runs.
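# To illustrate how the statistic separates clusterable data from uniform noise, here is a hedged, seeded sketch on synthetic data. The function name `hopkins_statistic`, its parameters and the synthetic datasets are assumptions made for illustration. This variant uses the distance to the nearest data point for each uniform point, so its values are not directly comparable with the figure reported above.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def hopkins_statistic(X, sample_frac=0.1, seed=0):
    """Hopkins statistic H = sum(u) / (sum(u) + sum(w)) for a 2-D array X."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    m = max(1, int(sample_frac * n))
    nbrs = NearestNeighbors(n_neighbors=2).fit(X)

    # u: distance from each uniform random point (inside the bounding box
    # of the data) to its nearest real observation
    uniform_points = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    u = nbrs.kneighbors(uniform_points, n_neighbors=1)[0][:, 0]

    # w: distance from each sampled real observation to its nearest OTHER
    # observation (column 0 is the observation itself, at distance 0)
    idx = rng.choice(n, size=m, replace=False)
    w = nbrs.kneighbors(X[idx], n_neighbors=2)[0][:, 1]

    return u.sum() / (u.sum() + w.sum())


# Synthetic data: uniform noise versus two tight, well-separated blobs
rng = np.random.default_rng(42)
uniform_data = rng.uniform(0, 1, size=(300, 2))
clustered_data = np.vstack([
    rng.normal(loc=(0, 0), scale=0.05, size=(150, 2)),
    rng.normal(loc=(1, 1), scale=0.05, size=(150, 2)),
])

h_uniform = hopkins_statistic(uniform_data)
h_clustered = hopkins_statistic(clustered_data)
print(f"uniform: {h_uniform:.2f}, clustered: {h_clustered:.2f}")
```

# Fixing the seed makes the result reproducible, which can help when the statistic is quoted in a report.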
#
# [Back to TOC](#TOC)

# ### 7. The K-Means Clustering Model
#
# K-Means Clustering is an unsupervised machine learning algorithm that groups an unlabeled dataset into different clusters.

# ### 7.1 The Elbow Method
#
# The elbow method is used to decide on the number of clusters: the within-cluster sum of squares (WCSS) is plotted against the number of clusters, and the point where the curve bends is chosen.

# In[33]:

inertia_scores = []  # list to hold the WCSS for each number of clusters
for i in range(1, 11):  # for each number of clusters from 1 to 10
    kmeans = KMeans(n_clusters=i, random_state=42)  # initialize the algorithm
    kmeans.fit(cluster_df_scaled[['child_mort', 'income']])  # fit the scaled data
    inertia_scores.append(kmeans.inertia_)  # store the calculated WCSS

plt.plot(range(1, 11), inertia_scores, marker='o')
plt.title('The Elbow Method')
plt.ylabel("WCSS")
plt.xlabel('Number of Clusters')
plt.show()

# In[34]:

# Locate the elbow of the convex, decreasing WCSS curve
k = KneeLocator(range(1, 11), inertia_scores, curve='convex', direction='decreasing')
k.elbow

# ### 7.2 The K-Means Algorithm

# In[35]:

# With the chosen number of clusters, we can run the KMeans algorithm
kmeans = KMeans(n_clusters=3, random_state=50)  # initialize the algorithm with the chosen number of clusters
kmeans.fit(cluster_df_scaled)
kmeans.labels_

# In[36]:

# Assign the clustering result to each country in the data frame
cluster_df['Cluster Label'] = kmeans.labels_
cluster_df

# In[37]:

cluster_df['Cluster Label'].value_counts(ascending=True)

# ### 7.3 Visualization of Clustering Results

# In[38]:

# Visualize to better understand the clustering result
plt.figure(figsize=(10, 8))
sns.scatterplot(data=cluster_df, x='child_mort', y='income', hue='Cluster Label', palette="Dark2")
plt.show()

# On a development scale, the clusters can be classified into:
#
# | Color | Cluster Label | Development scale |
# | :-------------| --------------------- |-------------------------: |
# | Green | 0 | Developed Countries |
# | Orange | 1 | Under Developed Countries |
# | Purple | 2 | Developing Countries |
#
# [Back to TOC](#TOC)

# ### 8. Findings & Conclusions
#
# ### 8.1 Results
#
# According to the clusters above, the countries most in need of financial assistance are those in Cluster 1 (Under-developed Countries).

# In[50]:

# .copy() gives an independent frame so that a column can be added to it later
Underdeveloped = cluster_df[cluster_df['Cluster Label'] == 1].copy()
Underdeveloped.head(5)

# ### 8.2 Recommendations
#
# Since Cluster 1 has 41 countries, we will narrow the selection down to the top 10 countries that need financial assistance the most, that is, the countries that combine the highest child mortality rates with the lowest income levels.

# In[84]:

# Rank countries in cluster 1 based on child mortality and income levels combined
Underdeveloped['rank'] = (
    Underdeveloped['income'].rank(ascending=True)
    + Underdeveloped['child_mort'].rank(ascending=False)
).rank()
Underdeveloped["rank"].nsmallest(10)

# In[76]:

plt.figure(figsize=(18, 6))
sns.barplot(x="country", y='child_mort', palette="rocket", data=Underdeveloped.reset_index().nlargest(20, 'child_mort'))
plt.title("TOP 20 COUNTRIES WITH THE HIGHEST CHILD MORTALITY RATES")
plt.xticks(rotation=45, horizontalalignment="right")
plt.show()

# In[77]:

plt.figure(figsize=(18, 6))
sns.barplot(x='country', y='income', palette="rocket", data=Underdeveloped.reset_index().nsmallest(20, 'income'))
plt.title("TOP 20 COUNTRIES WITH THE LOWEST INCOME LEVELS")
plt.xticks(rotation=45, horizontalalignment="right")
plt.show()

# THE TOP 10 COUNTRIES MOST IN NEED OF FINANCIAL ASSISTANCE ARE:
#
# 1. CENTRAL AFRICAN REPUBLIC
# 2. CONGO DEM. REPUBLIC
# 3. NIGER
# 4. SIERRA LEONE
# 5. HAITI
# 6. BURUNDI
# 7. GUINEA
# 8. MOZAMBIQUE
# 9. GUINEA-BISSAU
# 10. BURKINA FASO
#
# [Back to TOC](#TOC)
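# The combined ranking used in the recommendations can be sketched on hypothetical toy values: each country gets one rank for low income and one for high child mortality, and the sum of the two is ranked again. The country labels A to D and the figures below are made up for illustration; `method="min"` is an assumption of this sketch for handling ties.

```python
import pandas as pd

# Hypothetical toy values, NOT taken from the HELP International dataset
toy = pd.DataFrame(
    {
        "child_mort": [150.0, 90.0, 120.0, 60.0],
        "income": [900.0, 700.0, 1500.0, 3000.0],
    },
    index=["A", "B", "C", "D"],
)

ranked = toy.copy()
income_rank = ranked["income"].rank(ascending=True)          # 1 = lowest income
mortality_rank = ranked["child_mort"].rank(ascending=False)  # 1 = highest child mortality

# A smaller combined rank means a greater need on both measures together
ranked["rank"] = (income_rank + mortality_rank).rank(method="min")

priority = ranked.sort_values("rank")
print(priority)
```

# Summing the two ranks weights income and child mortality equally; a different weighting would be a separate modelling decision.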