#!/usr/bin/env python
# coding: utf-8

# # Project: 2019 Happiness Report with Animated Bubble Plot
#
# --by Lu Tang

# ## Table of Contents
#
# - Introduction
# - Data Cleaning
# - Data Visualization

# ## Introduction
#
# **About the project**:
#
# This project visualizes how happiness has changed over the last ten years at the country level. We will find where the happy countries are and how their happiness has changed over time. At the end of the notebook, an animated bubble plot displays these changes.
#
# **Dataset**:
#
# This data set contains happiness data on 150 countries, collected from the [World Happiness Report](https://worldhappiness.report/) and from [Gapminder](https://www.gapminder.org/).
#
# I researched the data sources, combined the different datasets, and cleaned the result before the final visualization.
#
# #### Brief introduction to the World Happiness Report and the indicators it uses
#
# The World Happiness Report is a landmark survey of the state of global happiness. The first report was published in 2012, the second in 2013, the third in 2015, and the fourth in the 2016 Update. The World Happiness Report 2017, which ranks 155 countries by their happiness levels, was released at the United Nations at an event celebrating International Day of Happiness on March 20th. The report continues to gain global recognition as governments, organizations and civil society increasingly use happiness indicators to inform their policy-making decisions. Leading experts across fields – economics, psychology, survey analysis, national statistics, health, public policy and more – describe how measurements of well-being can be used effectively to assess the progress of nations. The reports review the state of happiness in the world today and show how the new science of happiness explains personal and national variations in happiness.
#
# The happiness scores and rankings use data from the Gallup World Poll. The scores are based on answers to the main life evaluation question asked in the poll. This question, known as the Cantril ladder, asks respondents to think of a ladder with the best possible life for them being a 10 and the worst possible life being a 0, and to rate their own current lives on that scale. The scores are from nationally representative samples for the years 2013-2016 and use the Gallup weights to make the estimates representative. The columns following the happiness score estimate the extent to which each of six factors – economic production, social support, life expectancy, freedom, absence of corruption, and generosity – contributes to making life evaluations higher in each country than they are in Dystopia, a hypothetical country that has values equal to the world’s lowest national averages for each of the six factors. They have no impact on the total score reported for each country, but they do explain why some countries rank higher than others.
#
# **Inspiration**
#
# What countries or regions rank the highest in overall happiness and in each of the six factors contributing to happiness? How did country ranks or scores change between the 2015 and 2016 reports, as well as between the 2016 and 2017 reports? Did any country experience a significant increase or decrease in happiness?
#
# **What is Dystopia?**
#
# Dystopia is an imaginary country that has the world’s least-happy people. The purpose in establishing Dystopia is to have a benchmark against which all countries can be favorably compared (no country performs more poorly than Dystopia) in terms of each of the six key variables, thus allowing each sub-bar to be of positive width. The lowest scores observed for the six key variables, therefore, characterize Dystopia. Since life would be very unpleasant in a country with the world’s lowest incomes, lowest life expectancy, lowest generosity, most corruption, least freedom and least social support, it is referred to as “Dystopia,” in contrast to Utopia.
#
# **What are the residuals?**
#
# The residuals, or unexplained components, differ for each country, reflecting the extent to which the six variables either over- or under-explain average 2014-2016 life evaluations. These residuals have an average value of approximately zero over the whole set of countries. Figure 2.2 shows the average residual for each country when the equation in Table 2.1 is applied to average 2014-2016 data for the six variables in that country. We combine these residuals with the estimate for life evaluations in Dystopia so that the combined bar will always have positive values. As can be seen in Figure 2.2, although some life evaluation residuals are quite large, occasionally exceeding one point on the scale from 0 to 10, they are always much smaller than the calculated value in Dystopia, where the average life is rated at 1.85 on the 0 to 10 scale.
#
# **What do the columns succeeding the Happiness Score (like Family, Generosity, etc.) describe?**
#
# The columns GDP per Capita, Family, Life Expectancy, Freedom, Generosity, and Trust (Government Corruption) describe the extent to which each factor contributes to the happiness evaluation in each country. The Dystopia Residual metric is the Dystopia happiness score (1.85) plus the residual (unexplained) value for each country, as described in the previous answer.
#
# If you add all these factors up, you get the happiness score, so it might be unreliable to use them in a model that predicts happiness scores.

# ## Data Cleaning
#
# Producing the final bubble plot requires a long data cleaning process. This Jupyter Notebook shows every step and method used in that process; readers who are more interested in the final result can skip this part and go directly to the last part, Data Visualization.
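# Before the step-by-step cells, here is a minimal sketch of the main pandas operations the cleaning relies on: merging a region lookup, reshaping year columns into rows, joining on country and year, and keeping only countries observed in every year. The data frames and all values below are made up for illustration, and the sketch uses a left merge for brevity, whereas the notebook itself uses an outer merge for regions.

```python
import pandas as pd

# Hypothetical toy data; names and values are illustrative assumptions only.
scores = pd.DataFrame({
    "Country": ["A", "A", "B", "B", "C"],
    "Year": [2009, 2010, 2009, 2010, 2010],
    "Life Ladder": [5.0, 5.2, 6.1, 6.0, 4.4],
})
regions = pd.DataFrame({
    "Country": ["A", "B"],
    "Region": ["Asia", "Western Europe"],
})
population_wide = pd.DataFrame({
    "country": ["A", "B", "C"],
    "2009": [100, 200, 300],
    "2010": [110, 210, 310],
})

# 1. Attach a region to every row. Countries missing from the lookup
#    keep their rows and get NaN in 'Region'.
with_region = scores.merge(regions, on="Country", how="left")

# 2. Reshape population from one column per year to one row per country-year.
population_long = population_wide.melt(
    id_vars=["country"], var_name="Year", value_name="Population"
).rename(columns={"country": "Country"})
# The year labels were column names (strings), so convert them to integers
# before joining against the integer 'Year' column of the scores.
population_long["Year"] = population_long["Year"].astype(int)

# 3. Join scores and population on both keys.
merged = with_region.merge(population_long, on=["Country", "Year"])

# 4. Keep only countries that appear in every year of the data.
n_years = merged["Year"].nunique()
years_per_country = merged.groupby("Country")["Year"].nunique()
complete = years_per_country[years_per_country == n_years].index
bubble = merged[merged["Country"].isin(complete)]

print(bubble)
```

# In this toy example country C is dropped in step 4 because it has no 2009 row. The same idea, applied to the years 2009 to 2018, is what reduces the real data to the countries shown in the final plot.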
# In[1]:

# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
get_ipython().run_line_magic('matplotlib', 'inline')
import seaborn as sns

# to make better charts
sns.set_style('whitegrid')
plt.figure(figsize = (10,6))
sns.despine(left=True, bottom=True)

# to avoid warnings
import warnings
warnings.filterwarnings('ignore')

# to avoid truncated output
pd.options.display.max_columns = 150


# In[2]:

happy_2019 = pd.read_excel('Chapter2OnlineData.xls')
happy_2019.head()


# In[3]:

happy_2019['Country name'].nunique()


# In[4]:

happy_2019.rename({'Country name':'Country'}, axis=1, inplace=True)


# In[5]:

happy_2019.columns


# In[6]:

# columns that are not needed for the bubble plot
drop_columns=['Positive affect', 'Negative affect',
       'Confidence in national government', 'Democratic Quality',
       'Delivery Quality', 'Standard deviation of ladder by country-year',
       'Standard deviation/Mean of ladder by country-year',
       'GINI index (World Bank estimate)',
       'GINI index (World Bank estimate), average 2000-16',
       'gini of household income reported in Gallup, by wp5-year',
       'Most people can be trusted, Gallup',
       'Most people can be trusted, WVS round 1981-1984',
       'Most people can be trusted, WVS round 1989-1993',
       'Most people can be trusted, WVS round 1994-1998',
       'Most people can be trusted, WVS round 1999-2004',
       'Most people can be trusted, WVS round 2005-2009',
       'Most people can be trusted, WVS round 2010-2014']


# In[7]:

happy_2019_clean = happy_2019.drop(columns = drop_columns)
happy_2019_clean.head()


# In[8]:

happy_2019_clean.groupby('Year')['Country'].count()


# In[9]:

happy_2019_clean['Country'].nunique()


# In[10]:

happy_2019_clean.head()


# ### Add Region

# In[11]:

table_2015 = pd.read_excel('2015-2017.xlsx', sheet_name=1)
table_2015.head(1)


# In[12]:

# copy the two columns so that later assignments do not act on a view of table_2015
Country_Region = table_2015[['Country','Region']].copy()
print(Country_Region['Country'].nunique())
Country_Region.head()


# In[13]:

Country_Region['Region'].value_counts()


# In[14]:

# collapse every region whose name contains 'Asia' into a single 'Asia' region
Country_Region.loc[Country_Region['Region'].str.contains('Asia'), 'Region']='Asia'


# In[15]:

Country_Region['Region'].value_counts()


# In[16]:

# group Australia and New Zealand with North America
Country_Region['Region'] = Country_Region['Region'].replace({'Australia and New Zealand':'North America'})


# In[17]:

Country_Region['Region'].value_counts()


# In[18]:

Country_Region.head()


# In[19]:

happy_2019_clean.shape


# In[20]:

# outer merge keeps countries that are missing from either table
happy_2019_clean_region = happy_2019_clean.merge(Country_Region, on='Country',how='outer')
print(happy_2019_clean_region.shape)
happy_2019_clean_region.head()


# In[21]:

happy_2019_clean_region.isna().sum()


# In[22]:

happy_2019_clean_region.to_csv('happy_2019_clean_region.csv',index=False)


# In[23]:

# Note: this reads an .xlsx file, whereas the cell above wrote a .csv;
# the file is assumed to have been re-saved as Excel outside this notebook.
happy_report=pd.read_excel('happy_2019_clean_region.xlsx')
happy_report.isna().sum()


# In[24]:

happy_report['Country'].nunique()


# In[25]:

happy_report.groupby('Year')['Country'].count()


# #### Add Population

# In[26]:

population = pd.read_csv('population_total.csv')
population.head(1)


# In[27]:

# keep the country name and the 2005-2018 columns;
# merging on the row index adds a helper column named 'key_0'
population_country = population.loc[:,['country']]
population_year = population.loc[:,'2005':'2018']
population_new = population_country.merge(population_year, on = population_year.index)
population_new.head(1)


# In[28]:

population_new = population_new.drop('key_0',axis = 1)


# In[29]:

population_new.head()


# In[30]:

# reshape from one column per year to one row per country-year
population_new_converted = pd.melt(population_new, id_vars=['country'],
                                   value_vars =['2005','2006','2007','2008','2009','2010','2011','2012','2013',
                                                '2014','2015','2016','2017','2018'])
population_new_converted.head()


# In[31]:

population_new_converted.rename({'country':'Country','variable':'Year', 'value':'Population'},
                                axis=1, inplace=True)
population_new_converted.head()


# In[32]:

population_new = population_new_converted.sort_values(by = ['Country','Year'])
population_new.head(20)


# In[33]:

population_new.to_csv('population_new.csv')


# In[34]:

# reading the file back parses the Year values as integers
population = pd.read_csv('population_new.csv')
population.head()


# In[35]:

# drop the saved index column
population.drop('Unnamed: 0',axis=1, inplace=True)


# In[36]:

population.head()


# In[37]:

happy_report.head()


# In[38]:

happy_report_population=happy_report.merge(population, on=['Country','Year'])
happy_report_population.head()


# In[39]:

happy_report_population.to_csv('happy_report.csv',index=False)


# In[40]:

happy_report_population.head()


# #### Make bubble df

# In[41]:

happy_report_population.groupby('Year')['Country'].count()


# In[42]:

df = happy_report_population


# In[43]:

happy2009 = df[df['Year']==2009]
happy2010 = df[df['Year']==2010]
happy2011 = df[df['Year']==2011]
happy2012 = df[df['Year']==2012]
happy2013 = df[df['Year']==2013]
happy2014 = df[df['Year']==2014]
happy2015 = df[df['Year']==2015]
happy2016 = df[df['Year']==2016]
happy2017 = df[df['Year']==2017]
happy2018 = df[df['Year']==2018]


# In[44]:

# Find the countries that have data in every year from 2009 to 2018.
# Only the 'Country' column is merged: merging the full frames repeatedly
# produces clashing '_x'/'_y' column names, which newer pandas versions may reject.
df = happy2009[['Country']]
df.shape[0]


# In[45]:

df = df.merge(happy2010[['Country']],on='Country')
df.shape[0]


# In[46]:

df = df.merge(happy2011[['Country']],on='Country')
df.shape[0]


# In[47]:

df = df.merge(happy2012[['Country']],on='Country')
df.shape[0]


# In[48]:

df = df.merge(happy2013[['Country']],on='Country')
df.shape[0]


# In[49]:

df = df.merge(happy2014[['Country']],on='Country')
df.shape[0]


# In[50]:

df = df.merge(happy2015[['Country']],on='Country')
df.shape[0]


# In[51]:

df = df.merge(happy2016[['Country']],on='Country')
df.shape[0]


# In[52]:

df = df.merge(happy2017[['Country']],on='Country')
df.shape[0]


# In[53]:

df = df.merge(happy2018[['Country']],on='Country')
df.shape[0]


# In[54]:

df.head()


# In[55]:

country_81=df[['Country']]
country_81.head()


# In[56]:

happy_report_population.head()


# In[57]:

happy_bubble=happy_report_population.copy()


# In[58]:

# keep only the countries with complete data
happy_bubble = happy_bubble.merge(country_81, on='Country')
happy_bubble.head()


# In[59]:

happy_bubble['Year'].value_counts()


# In[60]:

# drop the years before 2009
happy_bubble = happy_bubble.drop(index=happy_bubble[happy_bubble['Year']==2005].index)


# In[61]:

happy_bubble = happy_bubble.drop(index=happy_bubble[happy_bubble['Year']==2006].index)


# In[62]:

happy_bubble = happy_bubble.drop(index=happy_bubble[happy_bubble['Year']==2007].index)


# In[63]:

happy_bubble = happy_bubble.drop(index=happy_bubble[happy_bubble['Year']==2008].index)


# In[64]:

happy_bubble['Year'].value_counts()


# In[65]:

happy_bubble.head()


# ## Data Visualization
# As shown in the cleaning process, I added a region to classify each country, so that regions can be compared clearly (note that Australia and New Zealand are included in 'North America' because they have relatively small populations and are more similar to it in culture and economics). I also added each country's population, so we can see where the majority of people's happiness levels lie (the two large blue bubbles are China and India).
#
# - After cleaning, 81 countries with clean and complete data remain and are included in the plot.
# - The year range is 2009 to 2018.
# - Each bubble represents a country, and the size of the bubble reflects that country's population.
# - The y-axis is the total happiness score for each country.
# - The x-axis is the log of GDP per capita.
# - Colors represent the different regions.
#
# The chart is fully interactive.

# In[66]:

from plotly.offline import init_notebook_mode, iplot
init_notebook_mode()
from bubbly.bubbly import bubbleplot

figure = bubbleplot(dataset = happy_bubble, x_column='Log GDP per capita', y_column='Life Ladder',
    bubble_column='Country', color_column='Region', time_column='Year', size_column='Population',
    x_title="Log GDP per Capita", y_title="Happiness Score for each country", title='Happy Report',
    scale_bubble=3, height=650)

iplot(figure, config={'scrollZoom': True})


# ### Conclusion:
#
# - In general, the higher the GDP per capita, the higher the happiness score, but this does not hold for every country: at the same GDP per capita, different countries have different happiness scores.
# - North America and Western Europe (the purple and brown bubbles at the top right) have higher GDP per capita and higher happiness scores. This is not surprising, but comparing 2009 with 2018, the happiness score in North America is decreasing despite very high GDP per capita.
# - Many countries in Latin America and Caribbean (the orange bubbles) have very high happiness scores. Costa Rica is the happiest, and its happiness and economic scores are very stable. Other countries in the region have slightly lower happiness scores than ten years ago, although they are still generally happier than many countries in Asia.
# - Some countries, such as Israel and the United Arab Emirates (light purple), are in a similar position to North America and Western Europe, and their happiness scores have also decreased slightly.
# - Sub-Saharan African countries (the red bubbles) fare worst in general, with lower GDP per capita and lower happiness scores, with the exception of South Africa. However, we are happy to see these countries improve over time.
# - Many Asian countries (the blue bubbles) also moved toward the upper right. In some countries, such as India, the happiness score dropped despite the increase in GDP per capita, while countries such as China are becoming both richer and happier.
# - Lastly, looking closely, in 2009 the countries in each region are more concentrated, and for the world as a whole there is a clearer positive relation between GDP per capita and happiness score. Over time, although GDP per capita is generally higher, the bubbles spread further apart. This means that within the same region, the differences in happiness score are becoming larger. Factors other than GDP per capita also explain the overall happiness score.
# - Our data has 6 factors that explain the happiness score, but this analysis shows how much GDP per capita can contribute to people's happiness in different countries. For example, in China and many Sub-Saharan African countries, the happiness score rises as GDP rises, but India does not follow this pattern. Other problems may be at play in India, which suggests gathering more data on India to see why people there are becoming less happy.
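#
# The observations above are read off the animation by eye. As a rough way to check them numerically, the sketch below computes three summaries: the yearly correlation between log GDP per capita and happiness score, the spread of scores within each region per year, and each country's change in score between two years. It is only a sketch under assumptions: the function names are hypothetical, and the toy data uses the notebook's column names with made-up values. On the real data you would pass `happy_bubble` instead of `toy`.

```python
import pandas as pd

# Hypothetical long-format data with the notebook's column names; values are made up.
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D"] * 2,
    "Region": ["Asia", "Asia", "Western Europe", "Western Europe"] * 2,
    "Year": [2009] * 4 + [2018] * 4,
    "Log GDP per capita": [8.0, 8.5, 10.5, 10.7, 8.6, 9.3, 10.6, 10.8],
    "Life Ladder": [4.5, 4.8, 7.0, 7.2, 4.0, 5.4, 6.9, 7.3],
})


def gdp_happiness_corr(df):
    """Pearson correlation between log GDP and happiness score, per year."""
    cols = ["Log GDP per capita", "Life Ladder"]
    return df.groupby("Year")[cols].apply(
        lambda g: g["Log GDP per capita"].corr(g["Life Ladder"])
    )


def regional_spread(df):
    """Standard deviation of the happiness score within each region, per year."""
    return df.groupby(["Region", "Year"])["Life Ladder"].std().unstack("Year")


def score_change(df, start, end):
    """Change in happiness score per country between two years, smallest first."""
    wide = df.pivot(index="Country", columns="Year", values="Life Ladder")
    return (wide[end] - wide[start]).sort_values()


corr = gdp_happiness_corr(toy)
spread = regional_spread(toy)
change = score_change(toy, 2009, 2018)

print(corr)    # one correlation per year
print(spread)  # rows: regions, columns: years
print(change)  # negative values mean the score dropped
```

# A falling yearly correlation or a rising within-region standard deviation would support the point that the bubbles spread out over time, and the per-country change makes cases like "richer but less happy" easy to list.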