#!/usr/bin/env python # coding: utf-8 # # Analyzing CIA Factbook Data Using SQL # # In this project, we will be analyzing the [CIA Factbook Data](https://www.cia.gov/library/publications/the-world-factbook/) that contains information about each world's state/territory and in particular: # * Population as of 2015. # * Population growth - the annual growth, in percentage. # * Area - the total land and water area. # # We will figure out: # * Which country has the least population? # * Which country has the largest population growth? # * Average population, area and density of a country. # * Which country has the maximum density? # * Which country would have added the most people in 2016? The least? # * Which countries have a higher death rate than birth rate? # * What countries have the highest population/area ratio? # # ## Get to Know the Data # # Before we jump to analysis, let's study the data we have. # In[1]: get_ipython().run_cell_magic('capture', '', '%load_ext sql\n%sql sqlite:///factbook.db\n') # In[2]: get_ipython().run_cell_magic('sql', '', "SELECT *\n FROM sqlite_master\n WHERE type='table';\n") # We see that this database contains two tables: `sqlite_sequence` and `facts`. For our project, we are only interested in the second table that contains facts about countries. Let's have look at this table. # In[3]: get_ipython().run_cell_magic('sql', '', 'SELECT *\n FROM facts\n LIMIT 5;\n') # The first columns are pretty self-explainable (or already explained, like `population_growth`). The columns `birth_rate`, `death_rate` and `migration_rate` are to be considered as per 1000 persons, and are calculated as an average rate during the year. # # The areas (land and water) are measured in square kilometers. # # ## Summary Statistics # Let's calculate some summary statics such as: # * Minimum population. # * Maximum population. # * Minimum population growth. # * Maximum population growth. # # ### Minimum and Maximum Population # In[4]: get_ipython().run_cell_magic('sql', '', 'SELECT MIN(population), MAX(population),\n MIN(population_growth), MAX(population_growth)\n FROM facts;\n') # As far as we can see, the minimum population is 0? What is the territory where nobody lives? It's probably the Antarctic or the Arctic. The maximum population is the whole's world population. We also have a country with 0 population growth (it probably has no population at all?). # # Let's investigate the "countries" with 0 and 7,2 billion populations. # In[5]: get_ipython().run_cell_magic('sql', '', 'SELECT name, MAX(population) max_population\n FROM facts;\n') # In[6]: get_ipython().run_cell_magic('sql', '', 'SELECT name, MIN(population) min_population\n FROM facts;\n') # It's clear now that the database has two entries: *World* which has the largest population (the world's population) and *Antarctica*, a continent with no permanent inhabitants. In the Antractica there only permanent and summer-only [research bases](https://en.wikipedia.org/wiki/Research_stations_in_Antarctica). # # Let's exclude the whole world and the Antarctica from summary statistics. # In[7]: get_ipython().run_cell_magic('sql', '', 'SELECT MIN(population), MAX(population),\n MIN(population_growth), MAX(population_growth)\n FROM facts\n WHERE name NOT IN ("World", "Antarctica");\n') # It's obvious that the country with the maximum population is [China](https://en.wikipedia.org/wiki/China), but what is the country with only 48 persons living in there? Is it Vatican? # In[8]: get_ipython().run_cell_magic('sql', '', 'SELECT name, MIN(population) AS min_population\n FROM facts\n WHERE name NOT IN ("World", "Antarctica");\n') # It's not Vatican! [Pitcairn Islands](https://en.wikipedia.org/wiki/Pitcairn_Islands) are a group of four volcanic islands the least populous jurisdiction territory in the world. They are a part of British Overseas Territory in Pacific Ocean. When you are tired of the life you may want to move there for a couple of months (not an advertisement). # # # ### Maximum Population Growth # # Let's now see what is the county with the maximum population growth (since the minimum population growth is 0, it fair to think there are many countries with such value, in other words they have no population growth). # In[9]: get_ipython().run_cell_magic('sql', '', 'SELECT name, MAX(population_growth) population_growth\n FROM facts;\n') # [South Sudan](https://en.wikipedia.org/wiki/South_Sudan) is the most recent sovereign state and one of the poorest states in the world. It's no surprise it has the highest population growth rate in the world. # # ### Average Area and Population # # Let's calculate the average number for `area` and `population` excluding the Antarctica and the world. It's worth noting that the world, Antarctica and the oceans (i.e. Pacific, Indian) have no values for the area, but there is the area of the European Union that we have to exclude from our calculation (all its countries are already present in the table). # In[10]: get_ipython().run_cell_magic('sql', '', 'SELECT AVG(area) avg_country_area\n FROM facts\n WHERE name != "European Union";\n') # It's 539 893 square kilometers, an area comparable to [Metropolitan France](https://en.wikipedia.org/wiki/Metropolitan_France)! Let's figure out the number of states that have an area bigger than their average. # In[11]: get_ipython().run_cell_magic('sql', '', 'SELECT COUNT(*)\n FROM facts\n WHERE area > (SELECT AVG(area) FROM facts\n WHERE name NOT IN ("European Union")\n );\n') # There are 48 countries out of 195 countries that are bigger than the average country area. Let's write a query to figure out which countries have areas above average. # In[12]: get_ipython().run_cell_magic('sql', '', 'SELECT name, area\n FROM facts\n WHERE area > (SELECT AVG(area) FROM facts\n WHERE name NOT IN ("European Union")\n )\n ORDER BY area DESC;\n') # Let's do the same statistics for `population`. This time though we have to exclude the world and Antarctica since they have values in the `population` column. We additionally have to exclude the European Union. # In[13]: get_ipython().run_cell_magic('sql', '', 'SELECT AVG(population) avg_world_population\n FROM facts\n WHERE name NOT IN ("World", "Antarctica", "European Union");\n') # There are more than 30 million people on average living in each country. It's around the population in Mozambique. Let's then count the number of countries that have more people than the average population. # In[14]: get_ipython().run_cell_magic('sql', '', 'SELECT COUNT(*)\n FROM facts\n WHERE population > (SELECT AVG(population) FROM facts \n WHERE name NOT IN ("World", "European Union", "Antarctica"));\n') # There are 45 (43 without the world and European Union) countries that have a population bigger than the average world's population! Let's figure out what this countries are and order them by population (from the highest to the lowest). # In[15]: get_ipython().run_cell_magic('sql', '', 'WITH more_avg_pop AS \n (\n SELECT name, population\n FROM facts\n WHERE population > (SELECT AVG(population) FROM facts\n WHERE name NOT IN ("World", "European Union", "Antractica")\n )\n )\n\nSELECT name, population\n FROM more_avg_pop\n WHERE name NOT IN ("World", "European Union")\n ORDER BY population DESC;\n') # It would be also interesting to figure out which proportion of the world's population the first three countries (China, India, USA) hold. # In[16]: get_ipython().run_cell_magic('sql', '', 'WITH top_three_pop_proportion AS\n (\n SELECT CAST(Population AS FLOAT) / (SELECT MAX(Population) FROM facts) proportion\n FROM facts\n WHERE name NOT IN ("World", "European Union")\n ORDER BY 1 DESC\n LIMIT 3\n )\n \nSELECT ROUND(SUM(proportion) * 100, 2) "proportion (%)"\n FROM top_three_pop_proportion\n') # Just three countries account for more than 40 percent of the world's population: China, India and USA with China and India having more than one billion people each. # ### Population Densities # # Now let's calculate population densities. To do so we will have to divide the country's population by its land area (hopefully most people don't live in rivers or lakes). Let's also order them in descending order to find out the countries with the highest population density. # In[17]: get_ipython().run_cell_magic('sql', '', 'SELECT name, population, area, \n ROUND(CAST(population AS FLOAT) / area_land, 2) population_density\n FROM facts\n ORDER BY population_density DESC\n LIMIT 10;\n') # [Macau](https://en.wikipedia.org/wiki/Macau) is a special administrative region in China, and is considered the most densely populated region in the world. # # If we are following this way why not calculating the average population density and figure out how many (and what) countries have densitites more than its average? The oceans, the EU, the world, have no values for `area_land` and/or for `population` so we do not need to explicitly exclude them from calculation. On the other hand, the Antarctica has both values, so it's necessary to exclude it. # In[18]: get_ipython().run_cell_magic('sql', '', 'SELECT ROUND(AVG(CAST(population AS FLOAT) / area_land), 2) avg_pop_density\n FROM facts\n WHERE name NOT IN ("Antarctica")\n') # It's around 432 pop./km$^2$, the [density of the Netherlands](https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population_density), a state in Europe, a continent with one of the most highest densitites in the world. # # Let's now figure out how many countries (and what) have densities more than the average value. # In[19]: get_ipython().run_cell_magic('sql', '', 'SELECT COUNT(*) more_avg_dens\n FROM facts\n WHERE CAST(population AS FLOAT) / area_land > \n (\n SELECT AVG(CAST(population AS FLOAT) / area_land) \n FROM facts\n WHERE name NOT IN ("Antarctica") \n )\n') # In[20]: get_ipython().run_cell_magic('sql', '', 'SELECT name, ROUND(CAST(population AS FLOAT) / area_land, 2) population_density\n FROM facts\n WHERE CAST(population AS FLOAT) / area_land > \n (\n SELECT AVG(CAST(population AS FLOAT) / area_land) \n FROM facts\n WHERE name NOT IN ("Antarctica") \n )\n ORDER BY population_density DESC;\n') # There are 26 countries which density is higher that the average density. It's interesting to note that the calculated value of 500 pop./km$^2$ for the Netherlands is much higher that the Wikipedia value. It's also interesting to note that the Netherlands in the only European continental state (not counting small states like Monaco) with a density higher that the average. # ### Most and Least Added People # It's also interesting to know which countries will add the most people to their population next year. We can calculate this by multiplication of the population by the population growth rate / 100 + 1 (because it's in percentage) - current population. If we order this table by the new column in ascending order we can figure out what countries would have added the least people. # In[21]: get_ipython().run_cell_magic('sql', '', 'CREATE VIEW add_people AS\n SELECT name, population, population_growth, \n ROUND(population * (population_growth / 100 + 1) - population, 0) AS added_people\n FROM facts;\n') # In[22]: get_ipython().run_cell_magic('sql', '', 'SELECT *\n FROM add_people\n ORDER BY added_people DESC\n LIMIT 11;\n') # In[23]: get_ipython().run_cell_magic('sql', '', 'SELECT *\n FROM add_people\n WHERE added_people IS NOT NULL\n ORDER BY added_people\n LIMIT 10;\n') # It's interesting to notice the USA in this least as the only developed country (it's also arguably China) in this list that would have added people in the year 2016. It's also curious to figure out that China with its policy 'One family - one child' has more than two times smaller number of new people than India that has not such policy. # # As for the countries with the least added people we notice [Svalbard](https://en.wikipedia.org/wiki/Svalbard), a Norwegian archipelago in the Arctic Ocean and [Christmas Island](https://en.wikipedia.org/wiki/Christmas_Island), an Australian external territory in the Indian Ocean, both adding just 1 and 17 persons, respectively. # ### Birth and Death Rates # Let's figure out how many and which countries have a higher death rate than the birth rate. This value is called "Rate of Natural population increase" and is calculated as "birth rate - death rate". # In[24]: get_ipython().run_cell_magic('sql', '', 'SELECT name, ROUND(birth_rate - death_rate, 2) AS natural_pop_increase\n FROM facts\n WHERE natural_pop_increase IS NOT NULL\n ORDER BY natural_pop_increase\n LIMIT 10;\n') # Bulgaria, Serbia and Latvia are three countries with the most negative natural population increase. # ### Highest Water-to-Land Ratio # Let's now compare the water-to-land ratios to figure out what countries are the most waterful (hoorah, we've just invented a new word)! # In[25]: get_ipython().run_cell_magic('sql', '', 'SELECT name, area_land, area_water, ROUND(CAST(area_water AS FLOAT) / area_land, 2) AS water_to_land\n FROM facts\n ORDER BY Water_to_land DESC\n LIMIT 10;\n') # The [British Indian Ocean Territory](https://en.wikipedia.org/wiki/British_Indian_Ocean_Territory) stands out in the list having the water-to-land ratio more than 200 times bigger than the second place (Virgin Islands). It's a group of attols with more than 1000 individual islands with a total area of 54 400 square kilometers so it's no surprise it has such a big ratio. # # ## Conclusions # # The main objective of this project though was to understand how the basic SQL code works and what information could be extracted using SQL. We analyzed the CIA Factbook Data and answered some summary questions about the database and discovered some interesting facts about the world's countries: # * [Pitcairn Islands](https://en.wikipedia.org/wiki/Pitcairn_Islands) are a part of the British Overseas Territories and it's the least populous national jurisdiction in the world! # * [Macau](https://en.wikipedia.org/wiki/Macau), located in China, is a territory with the highest population density in the world. # * [British Indian Ocean Territory](https://en.wikipedia.org/wiki/British_Indian_Ocean_Territory) is a territory in the Indian Ocean with the highest water-to-land ration (more than 900 times).