In this guided project, we will be working with data from the CIA World Factbook, a compendium of statistics about every countries on Earth. The site provides demographic information for each country.
This dataset is a SQLite Database named factbook.db that can be dowloaded here.
Note: the data used is not current.
The goal of this project is to learn basic statistics using SQL (more precisely, SQLite).
Connecting to database
First of all we need to connect to the database:
import warnings warnings.filterwarnings("ignore")
%reload_ext sql %sql sqlite:///factbook.db
Overview of the Data
Now let's get the information about the database structure:
%%sql select * from sqlite_master where type='table';
|table||sqlite_sequence||sqlite_sequence||3||CREATE TABLE sqlite_sequence(name,seq)|
|table||facts||facts||47||CREATE TABLE "facts" ("id" INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, "code" varchar(255) NOT NULL, "name" varchar(255) NOT NULL, "area" integer, "area_land" integer, "area_water" integer, "population" integer, "population_growth" float, "birth_rate" float, "death_rate" float, "migration_rate" float)|
The database actually contain only one table named 'facts'. Let's take a look at its structure and content (5 lines are enough):
%%sql select * from facts limit 5;
Here are the descriptions for the columns:
%%sql select count(*) from facts ;
The table contains 261 rows including NULL and duplicates.
Section 1. Summary Statistics
The goal of the current calculation is to get following:
%%sql select min(population) as 'Minimum population', max(population) as 'Maximum population', min(population_growth) as 'Minimum population growth, %', max(population_growth) as 'Maximum population growth, %' from facts;
|Minimum population||Maximum population||Minimum population growth, %||Maximum population growth, %|
Section 2. Exploring Outliers
The result above looks incorrect, the analysis showed that in some countries the population is zero, and in some it is close to the world population.
Let's find which country has zero population:
%%sql select name as Country, population as Population from facts where population == 0;
That's right, there is no permanent population in Antarctica.
Now let's find which country has the maximun population:
%%sql select name as Country, population as Population from facts where population == (select max(population) from facts);
Thus, the database contains the value of the total population of the world . This means that we need to exclude this row and recalculate our statistics:
%%sql select min(population), max(population), min(population_growth), max(population_growth) from facts where name <> 'World' --exclude the "World" row and name <> 'Antarctica' --exclude the 'Antarctica' row ;
As we can see, the population range is from 48 people to over 1.3 billion, and the population growth is from zero to 4 people per 1,000 people. Are we curious enough to know what these countries are? Well, I am:
%%sql select name as Country, population as Total_population, population*100/(select population from facts where name = 'World') as '%_from_World_population' from facts where population = (select min(population) from facts where name != 'Antarctica') or population = (select max(population) from facts where name != 'World') ;
%%sql select name as Country, population_growth as 'Growth_population_%' from facts where population_growth = (select min(population_growth) from facts) or population_growth = (select max(population_growth) from facts) ;
|Holy See (Vatican City)||0.0|
|Cocos (Keeling) Islands||0.0|
Above we see the following:
Zero as the growth value does not mean that population is constant. It may be too small to be shown in the data. Let's test this hipothesis:
%%sql select name as Country, population as Population_Total, population_growth as 'Growth_population_%', birth_rate as Births_per_1000, death_rate as Deaths_per_1000 from facts where population_growth = (select min(population_growth) from facts) ;
|Holy See (Vatican City)||842||0.0||None||None|
|Cocos (Keeling) Islands||596||0.0||None||None|
We see that in database three countries with the minimum population growth do not have records about births and deaths. What about Greenland?
%%sql select name as Country, (birth_rate - death_rate)*100/population as Growth_Rate_Full from facts where name = 'Greenland' ;
Thus, the Greenland's population is growthing, but too low to shown the value of the growth in the current data.
By the way, we can see that some countries do not have data in the "population_growth" column. Let's look at these rows:
%%sql select name as Country, population_growth as 'Growth_%', birth_rate as Births_per_1000, death_rate as Deaths_per_1000, area as Total_Area_km2, area_land as Land_km2 from facts where population_growth is null order by name ;
|Ashmore and Cartier Islands||None||None||None||5||5|
|British Indian Ocean Territory||None||None||None||54400||60|
|Coral Sea Islands||None||None||None||3||3|
|French Southern and Antarctic Lands||None||None||None||None||None|
|Heard Island and McDonald Islands||None||None||None||412||412|
|South Georgia and South Sandwich Islands||None||None||None||3903||3903|
|United States Pacific Island Wildlife Refuges||None||None||None||None||None|
%%sql select count(name) from facts where population_growth is null ;
We can see that in 25 lines there is no data about growth, births or deaths. Thus, we cannot fill missing data by calculation. Between these lines there are rows with data about Antarctica and all five oceans.
%%sql select count(name) from facts where population_growth is null and area < 300 ;
The 11 countries for which data are missing are smaller than 300 km2 and are small islands or a group of small islands. There are also a couple of group islands with a fairly large total area. Separately lies Kosovo, about which we do not have the necessary data.
Section 3. Exploring Average population and Area
Now let's calculate the average value for population and area:
%%sql select round(avg(population), 2) as 'Average_Population', round(avg(area), 2) as 'Average_Area' from facts where name <> 'World';
And one more calculation: which countries meet both of the following criteria:
%%sql select name as Country, population as Total_population, area as Total_Area from facts where population > (select avg(population) from facts where name <> 'World') and area < (select avg(area) from facts where name <> 'World');
%%sql select count(name) as "Amount_of_countries" from facts where name in (select name from facts where population > (select avg(population) from facts where name <> 'World') and area < (select avg(area) from facts where name <> 'World') );
As we see, 14 countries meet both criteria.
Section 4. Finding Densely Populated Countries
In this section, we will analyze which countries have an above average population at a time when the land (without water area) is below the average:
%%sql select name as Country, population as Total_population, population/area_land as People_per_km2_of_land from facts where population > (select avg(population) from facts where name<>'World') and area < (select avg(area_land) from facts where name<>'World') order by People_per_km2_of_land desc limit 10;
Above we can see that the most populous country is Bangladesh- over 1200 people per square kilometer of the land. Which country is the least populuos?
%%sql select name as Country, population as Total_population, population/area_land as People_per_km2_of_land from facts where population > (select avg(population) from facts where name<>'World') and area < (select avg(area_land) from facts where name<>'World') order by People_per_km2_of_land limit 5;
It's Morocco - only 74 people per square kilometer of the land.
Section 5. Some More Analysis
Q1. Which country has the most people? Which country has the highest growth rate?
%%sql select name as 'Country', population as 'Total_Population', population_growth as 'Growth_Rate_%' from facts where population== (select max(population) from facts where name !='World') or population_growth == (select max(population_growth) from facts where name !='World') ;
So, China is the most populous, but its growth rate is low. South Sudan's growth rate is the highest, but the total population is below average.
*Q2. Which country has the highest ratio of water to land? Which country has less water than land?
%%sql select name as 'Country', area_water/area_land as 'Water/Land_Ratio', area_water-area_land as 'Water/Land_Difference' from facts where area_water/area_land == (select max(area_water/area_land) from facts where name !='World' ) or area_water-area_land == (select min(area_water-area_land) from facts where name !='World') ;
|British Indian Ocean Territory||905||54280|
We have get that British Indian Ocean Territory has 905 times more water than land, and Russian lands are much larger than water area.
Q3. Which countries will add the most people to their populations next year?
%%sql select name as Country, cast((birth_rate - death_rate)*population/1000 as int) as Population_Growth_Next_Year from facts where name <> 'World' order by Population_Growth_Next_Year desc limit 10 ;
|Congo, Democratic Republic of the||1969297|
Above we see that the world can expect that next year India will add over 15 millions people to its population.
Q4. Which countries have a higher death rate than birth rate?
%%sql select name as 'country', death_rate, birth_rate, round(death_rate - birth_rate, 2) as diff from facts where name <> 'World' order by diff desc limit 10 ;
The above calculation shows which populations are declining. If you notice, all these countries are in Europe. What about the opposite situation?
%%sql select name as 'country', birth_rate, death_rate, round(birth_rate - death_rate, 2) as diff from facts where name <> 'World' order by diff desc limit 10 ;
Here we get the countries whose population is growing most intensively. And they are all in Africa.
Q5. Which countries have the highest population/area ratio?
%%sql select name as Country, cast(population/area as int) as People_per_km2_Area, population as Population, area as Total_Area_km2 from facts where name <> 'World' order by People_per_km2_Area desc limit 10 ;
The calculation shows that there are countries with extremely high population densities that did not get in result of our analysis above because their total population is below average.
In this analysis, we reviewed some of the key statistics about the countries of the world based on the CIA World Fact Book. As we have seen, a small database with a small amount of data and fairly simple queries can provide detailed analysis and deliver interesting information that allows us to better and deeper understand our world.