In this project, we'll work with data from the CIA World Factbook, a compendium of statistics about all of the countries on Earth. The Factbook contains demographic information like the following:
We will use SQLite in Jupyter Notebook to analyze data from this database. The questions we're looking to answer are:
Before we can do that, we need to connect to our database.
#Connect to the DB #%%capture -- gives error in DQ site. Submitted bug report 3/14/22. #Project appears to work without it %reload_ext sql %sql sqlite:///factbook.db #'Connected: [email protected]' indicates success
%%sql SELECT * FROM sqlite_master WHERE type='table';
|table||sqlite_sequence||sqlite_sequence||3||CREATE TABLE sqlite_sequence(name,seq)|
|table||facts||facts||47||CREATE TABLE "facts" ("id" INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, "code" varchar(255) NOT NULL, "name" varchar(255) NOT NULL, "area" integer, "area_land" integer, "area_water" integer, "population" integer, "population_growth" float, "birth_rate" float, "death_rate" float, "migration_rate" float)|
We have a fairly simple database. The SQLite internal table sqlite_sequence is used to store information about SQLite: autoincrement columns. This means we only have one data table, called "facts". Let's take a look at the first five rows of the facts table.
%%sql SELECT * FROM facts LIMIT 5
There are a total of 11 columns. Some of the ones we're most interested are:
With this knowledge, we can now begin our analysis.
Our first analysis step will be to find and calculate some summary statiscics, namely The minimum and maximum populations and minimum and maximum population growth.
%%sql SELECT MIN(population) AS minimum_pop, MAX(population) AS max_pop, MIN(population_growth) AS min_pop_growth, MAX(population_growth) AS max_pop_growth FROM facts
We have a country with zero population and one with a populaton of 7.25 Billion (the approximate population of the world). Let's take a look at the countries with the highest and lowest populations
%%sql SELECT name, population FROM facts ORDER BY population DESC LIMIT 5
%%sql SELECT name, population FROM facts ORDER BY population LIMIT 21 --Need 21 rows to get the first row with any population
|Ashmore and Cartier Islands||None|
|Coral Sea Islands||None|
|Heard Island and McDonald Islands||None|
|French Southern and Antarctic Lands||None|
|British Indian Ocean Territory||None|
|South Georgia and South Sandwich Islands||None|
|United States Pacific Island Wildlife Refuges||None|
The data include several rows with "None" for population. These include oceans, which make sense and several island chains. According to the CIA Factbook pages for several of these islands, they are all uninhabited or only house temporary research personnel.
There is also one row with zero population, which is Antartica. the CIA Factbook page for Antartica shows a similar status to the islands:
53 countries have signed the 1959 Antarctic Treaty; 30 of those operate through their National Antarctic Program a number of seasonal-only (summer) and year-round research stations on the continent and its nearby islands south of 60 degrees south latitude (the region covered by the Antarctic Treaty); the population engaging in and supporting science or managing and protecting the Antarctic region varies from approximately 4,400 in summer to 1,100 in winter; in addition, approximately 1,000 personnel, including ship's crew and scientists doing onboard research, are present in the waters of the treaty region.
We also have a row for the entire world, which explains the population over 7.25 billion. We can exclude all of these rows to find the actual countries with the highest and lowest populations.
%%sql SELECT MIN(population) AS minimum_pop, MAX(population) AS max_pop, MIN(population_growth) AS min_pop_growth, MAX(population_growth) AS max_pop_growth FROM facts WHERE population != (SELECT MAX(population) FROM facts) AND population != (SELECT MIN(population) --Null values excluded by default FROM facts)
These numbers look much better, but we still have at least one country with zero growth. Well take a look at which countries those are.
%%sql SELECT * FROM facts WHERE population_growth = (SELECT MIN(population_growth) FROM facts)
|190||vt||Holy See (Vatican City)||0||0||0||842||0.0||None||None||None|
|200||ck||Cocos (Keeling) Islands||14||14||0||596||0.0||None||None||None|
We have 2 small islands with very small populations. It makes sense that the populations would not fluctuate much. The third "country" is Vatican city. As it's controlled by the Catholic Church, they can maintain the population so it would not necessarily be prone to normal population influences. The final country is Greenland. According to the Factbook, the birth rate is approximately 13.79 per 1000 people. This is balanced by the death rate of 8.96 per 1000 and immigration rate of -5.07 (14.03 total) for a net of -.24 people per 1000. It seems unlikely that the population is exactly the same from year to year, but accounting for polling methods, statistical errors and rounding, the number is about right.
We now turn our analysis to the average population and area of the countries of the world. We don't want to include Antartica or the whole world in this either, so we'll exclude them again.
%%sql SELECT ROUND(AVG(population),2) AS avg_pop, ROUND(AVG(area_land),2) AS avg_area_land FROM facts WHERE population != (SELECT --exclude row with world population MAX(population) FROM facts) AND population != (SELECT --Exclude Zero population (Antartica) and Null by default MIN(population) FROM facts)
The average population is approximately 32,377,000 and the average land area is 551,000
Our final analysis step will be to find the most densely populated countries in the world. To do this, we'll look for countries with populations higher than the average and land areas lower than the average.
%%sql SELECT name, population, area_land, population/area_land AS population_density FROM facts WHERE population > (SELECT AVG(population) FROM facts) AND area_land < (SELECT AVG(area_land) FROM facts) ORDER BY population_density DESC
The drawback to this method is that it might exclude countries where the population is lower than average but has an extremely small land area or the land area is above average and the population is still very high. Excluding the filters, we get:
%%sql SELECT name, population, area_land, population/area_land AS population_density FROM facts ORDER BY population_density DESC LIMIT 20
Now Bangladesh, which was our highest density country is actually 11th!
In this analysis, we reviewed some basic statistics about the countries of the world based on the CIA World Factbook. Due to some quirks of the data such as "countries" with zero population and an entry for the entire world, we had to mainipulate our queries a bit, but in the end we were able to answer all of the questions about the countries having the highest and lowest populations and population densities.