In this guided project, we will be working with data from the CIA World Factbook, a compendium of statistics about every countries on Earth. The site provides demographic information for each country.
This dataset is a SQLite Database named factbook.db that can be dowloaded here.
Note: the data used is not current.
The goal of this project is to learn basic statistics using SQL (more precisely, SQLite).
Connecting to database
First of all we need to connect to the database:
%reload_ext sql
%sql sqlite:///factbook.db
/dataquest/system/env/python3/lib/python3.4/site-packages/IPython/config.py:13: ShimWarning: The `IPython.config` package has been deprecated. You should import from traitlets.config instead. "You should import from traitlets.config instead.", ShimWarning) /dataquest/system/env/python3/lib/python3.4/site-packages/IPython/utils/traitlets.py:5: UserWarning: IPython.utils.traitlets has moved to a top-level traitlets package. warn("IPython.utils.traitlets has moved to a top-level traitlets package.")
'Connected: None@factbook.db'
Overview of the Data
Now let's get the information about the database structure:
%%sql
select *
from sqlite_master
where type='table';
Done.
type | name | tbl_name | rootpage | sql |
---|---|---|---|---|
table | sqlite_sequence | sqlite_sequence | 3 | CREATE TABLE sqlite_sequence(name,seq) |
table | facts | facts | 47 | CREATE TABLE "facts" ("id" INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, "code" varchar(255) NOT NULL, "name" varchar(255) NOT NULL, "area" integer, "area_land" integer, "area_water" integer, "population" integer, "population_growth" float, "birth_rate" float, "death_rate" float, "migration_rate" float) |
The database actually contain only one table named 'facts'. Let's take a look at its structure and content (5 lines are enough):
%%sql
select *
from facts
limit 5;
Done.
id | code | name | area | area_land | area_water | population | population_growth | birth_rate | death_rate | migration_rate |
---|---|---|---|---|---|---|---|---|---|---|
1 | af | Afghanistan | 652230 | 652230 | 0 | 32564342 | 2.32 | 38.57 | 13.89 | 1.51 |
2 | al | Albania | 28748 | 27398 | 1350 | 3029278 | 0.3 | 12.92 | 6.58 | 3.3 |
3 | ag | Algeria | 2381741 | 2381741 | 0 | 39542166 | 1.84 | 23.67 | 4.31 | 0.92 |
4 | an | Andorra | 468 | 468 | 0 | 85580 | 0.12 | 8.13 | 6.96 | 0.0 |
5 | ao | Angola | 1246700 | 1246700 | 0 | 19625353 | 2.78 | 38.78 | 11.49 | 0.46 |
Here are the descriptions for the columns:
%%sql
select count(*)
from facts
;
Done.
count(*) |
---|
261 |
The table contains 261 rows including NULL and duplicates.
Section 1. Summary Statistics
The goal of the current calculation is to get following:
%%sql
select
min(population) as 'Minimum population',
max(population) as 'Maximum population',
min(population_growth) as 'Minimum population growth, %',
max(population_growth) as 'Maximum population growth, %'
from facts;
Done.
Minimum population | Maximum population | Minimum population growth, % | Maximum population growth, % |
---|---|---|---|
0 | 7256490011 | 0.0 | 4.02 |
Section 2. Exploring Outliers
The result above looks incorrect, the analysis showed that in some countries the population is zero, and in some it is close to the world population.
Let's find which country has zero population:
%%sql
select name as Country,
population as Population
from facts
where population == 0;
Done.
Country | Population |
---|---|
Antarctica | 0 |
That's right, there is no permanent population in Antarctica.
Now let's find which country has the maximun population:
%%sql
select name as Country,
population as Population
from facts
where population == (select max(population) from facts);
Done.
Country | Population |
---|---|
World | 7256490011 |
Thus, the database contains the value of the total population of the world . This means that we need to exclude this row and recalculate our statistics:
%%sql
select min(population), max(population),
min(population_growth), max(population_growth)
from facts
where name <> 'World' --exclude the "World" row
and name <> 'Antarctica' --exclude the 'Antarctica' row
;
Done.
min(population) | max(population) | min(population_growth) | max(population_growth) |
---|---|---|---|
48 | 1367485388 | 0.0 | 4.02 |
As we can see, the population range is from 48 people to over 1.3 billion, and the population growth is from zero to 4 people per 1,000 people. Are we curious enough to know what these countries are? Well, I am:
%%sql
select name as Country,
population as Total_population,
population*100/(select population from facts
where name = 'World') as '%_from_World_population'
from facts
where
population = (select min(population)
from facts
where name != 'Antarctica')
or population = (select max(population)
from facts
where name != 'World')
;
Done.
Country | Total_population | %_from_World_population |
---|---|---|
China | 1367485388 | 18 |
Pitcairn Islands | 48 | 0 |
%%sql
select name as Country,
population_growth as 'Growth_population_%'
from facts
where
population_growth = (select min(population_growth)
from facts)
or population_growth = (select max(population_growth)
from facts)
;
Done.
Country | Growth_population_% |
---|---|
South Sudan | 4.02 |
Holy See (Vatican City) | 0.0 |
Cocos (Keeling) Islands | 0.0 |
Greenland | 0.0 |
Pitcairn Islands | 0.0 |
Above we see the following:
Zero as the growth value does not mean that population is constant. It may be too small to be shown in the data. Let's test this hipothesis:
%%sql
select name as Country,
population as Population_Total,
population_growth as 'Growth_population_%',
birth_rate as Births_per_1000,
death_rate as Deaths_per_1000
from facts
where
population_growth = (select min(population_growth)
from facts)
;
Done.
Country | Population_Total | Growth_population_% | Births_per_1000 | Deaths_per_1000 |
---|---|---|---|---|
Holy See (Vatican City) | 842 | 0.0 | None | None |
Cocos (Keeling) Islands | 596 | 0.0 | None | None |
Greenland | 57733 | 0.0 | 14.48 | 8.49 |
Pitcairn Islands | 48 | 0.0 | None | None |
We see that in database three countries with the minimum population growth do not have records about births and deaths. What about Greenland?
%%sql
select name as Country,
(birth_rate - death_rate)*100/population as Growth_Rate_Full
from facts
where name = 'Greenland'
;
Done.
Country | Growth_Rate_Full |
---|---|
Greenland | 0.010375348587462975 |
Thus, the Greenland's population is growthing, but too low to shown the value of the growth in the current data.
By the way, we can see that some countries do not have data in the "population_growth" column. Let's look at these rows:
%%sql
select name as Country,
population_growth as 'Growth_%',
birth_rate as Births_per_1000,
death_rate as Deaths_per_1000,
area as Total_Area_km2,
area_land as Land_km2
from facts
where
population_growth is null
order by name
;
Done.
Country | Growth_% | Births_per_1000 | Deaths_per_1000 | Total_Area_km2 | Land_km2 |
---|---|---|---|---|---|
Akrotiri | None | None | None | 123 | None |
Antarctica | None | None | None | None | 280000 |
Arctic Ocean | None | None | None | None | None |
Ashmore and Cartier Islands | None | None | None | 5 | 5 |
Atlantic Ocean | None | None | None | None | None |
Bouvet Island | None | None | None | 49 | 49 |
British Indian Ocean Territory | None | None | None | 54400 | 60 |
Clipperton Island | None | None | None | 6 | 6 |
Coral Sea Islands | None | None | None | 3 | 3 |
Dhekelia | None | None | None | 130 | None |
French Southern and Antarctic Lands | None | None | None | None | None |
Heard Island and McDonald Islands | None | None | None | 412 | 412 |
Indian Ocean | None | None | None | None | None |
Jan Mayen | None | None | None | 377 | 377 |
Kosovo | None | None | None | 10887 | 10887 |
Navassa Island | None | None | None | 5 | 5 |
Pacific Ocean | None | None | None | None | None |
Paracel Islands | None | None | None | 7 | 7 |
Saint Barthelemy | None | None | None | None | None |
Saint Martin | None | None | None | 54 | 54 |
South Georgia and South Sandwich Islands | None | None | None | 3903 | 3903 |
Southern Ocean | None | None | None | None | None |
Spratly Islands | None | None | None | 5 | 5 |
United States Pacific Island Wildlife Refuges | None | None | None | None | None |
Wake Island | None | None | None | 6 | 6 |
%%sql
select count(name)
from facts
where population_growth is null
;
Done.
count(name) |
---|
25 |
We can see that in 25 lines there is no data about growth, births or deaths. Thus, we cannot fill missing data by calculation. Between these lines there are rows with data about Antarctica and all five oceans.
%%sql
select count(name)
from facts
where population_growth is null
and area < 300
;
Done.
count(name) |
---|
11 |
The 11 countries for which data are missing are smaller than 300 km2 and are small islands or a group of small islands. There are also a couple of group islands with a fairly large total area. Separately lies Kosovo, about which we do not have the necessary data.
Section 3. Exploring Average population and Area
Now let's calculate the average value for population and area:
%%sql
select
round(avg(population), 2) as 'Average_Population',
round(avg(area), 2) as 'Average_Area'
from facts
where name <> 'World';
Done.
Average_Population | Average_Area |
---|---|
32242666.57 | 555093.55 |
And one more calculation: which countries meet both of the following criteria:
%%sql
select name as Country,
population as Total_population,
area as Total_Area
from facts
where population > (select avg(population)
from facts
where name <> 'World')
and area < (select avg(area)
from facts
where name <> 'World');
Done.
Country | Total_population | Total_Area |
---|---|---|
Bangladesh | 168957745 | 148460 |
Germany | 80854408 | 357022 |
Iraq | 37056169 | 438317 |
Italy | 61855120 | 301340 |
Japan | 126919659 | 377915 |
Korea, South | 49115196 | 99720 |
Morocco | 33322699 | 446550 |
Philippines | 100998376 | 300000 |
Poland | 38562189 | 312685 |
Spain | 48146134 | 505370 |
Thailand | 67976405 | 513120 |
Uganda | 37101745 | 241038 |
United Kingdom | 64088222 | 243610 |
Vietnam | 94348835 | 331210 |
%%sql
select count(name) as "Amount_of_countries"
from facts
where name in
(select name
from facts
where population > (select avg(population)
from facts
where name <> 'World')
and area < (select avg(area)
from facts
where name <> 'World')
);
Done.
Amount_of_countries |
---|
14 |
As we see, 14 countries meet both criteria.
Section 4. Finding Densely Populated Countries
In this section, we will analyze which countries have an above average population at a time when the land (without water area) is below the average:
%%sql
select name as Country,
population as Total_population,
population/area_land as People_per_km2_of_land
from facts
where population > (select avg(population) from facts where name<>'World')
and area < (select avg(area_land) from facts where name<>'World')
order by People_per_km2_of_land desc
limit 10;
Done.
Country | Total_population | People_per_km2_of_land |
---|---|---|
Bangladesh | 168957745 | 1297 |
Korea, South | 49115196 | 506 |
Japan | 126919659 | 348 |
Philippines | 100998376 | 338 |
Vietnam | 94348835 | 304 |
United Kingdom | 64088222 | 264 |
Germany | 80854408 | 231 |
Italy | 61855120 | 210 |
Uganda | 37101745 | 188 |
Thailand | 67976405 | 133 |
Above we can see that the most populous country is Bangladesh- over 1200 people per square kilometer of the land. Which country is the least populuos?
%%sql
select name as Country,
population as Total_population,
population/area_land as People_per_km2_of_land
from facts
where population > (select avg(population) from facts where name<>'World')
and area < (select avg(area_land) from facts where name<>'World')
order by People_per_km2_of_land
limit 5;
Done.
Country | Total_population | People_per_km2_of_land |
---|---|---|
Morocco | 33322699 | 74 |
Iraq | 37056169 | 84 |
Spain | 48146134 | 96 |
Poland | 38562189 | 126 |
Thailand | 67976405 | 133 |
It's Morocco - only 74 people per square kilometer of the land.
Section 5. Some More Analysis
Q1. Which country has the most people? Which country has the highest growth rate?
%%sql
select name as 'Country',
population as 'Total_Population',
population_growth as 'Growth_Rate_%'
from facts
where population== (select max(population)
from facts
where name !='World')
or population_growth == (select max(population_growth)
from facts
where name !='World')
;
Done.
Country | Total_Population | Growth_Rate_% |
---|---|---|
China | 1367485388 | 0.45 |
South Sudan | 12042910 | 4.02 |
So, China is the most populous, but its growth rate is low. South Sudan's growth rate is the highest, but the total population is below average.
*Q2. Which country has the highest ratio of water to land? Which country has less water than land?
%%sql
select name as 'Country',
area_water/area_land as 'Water/Land_Ratio',
area_water-area_land as 'Water/Land_Difference'
from facts
where area_water/area_land == (select max(area_water/area_land)
from facts
where name !='World' )
or area_water-area_land == (select min(area_water-area_land)
from facts
where name !='World')
;
Done.
Country | Water/Land_Ratio | Water/Land_Difference |
---|---|---|
Russia | 0 | -15657242 |
British Indian Ocean Territory | 905 | 54280 |
We have get that British Indian Ocean Territory has 905 times more water than land, and Russian lands are much larger than water area.
Q3. Which countries will add the most people to their populations next year?
%%sql
select name as Country,
cast((birth_rate - death_rate)*population/1000 as int) as Population_Growth_Next_Year
from facts
where name <> 'World'
order by Population_Growth_Next_Year desc
limit 10
;
Done.
Country | Population_Growth_Next_Year |
---|---|
India | 15308236 |
China | 6782727 |
Nigeria | 4491845 |
Pakistan | 3203291 |
Ethiopia | 2892466 |
Indonesia | 2649534 |
Bangladesh | 2623913 |
Congo, Democratic Republic of the | 1969297 |
Philippines | 1834130 |
Mexico | 1645881 |
Above we see that the world can expect that next year India will add over 15 millions people to its population.
Q4. Which countries have a higher death rate than birth rate?
%%sql
select name as 'country',
death_rate, birth_rate,
round(death_rate - birth_rate, 2) as diff
from facts
where name <> 'World'
order by diff desc
limit 10
;
Done.
country | death_rate | birth_rate | diff |
---|---|---|---|
Bulgaria | 14.44 | 8.92 | 5.52 |
Serbia | 13.66 | 9.08 | 4.58 |
Latvia | 14.31 | 10.0 | 4.31 |
Lithuania | 14.27 | 10.1 | 4.17 |
Ukraine | 14.46 | 10.72 | 3.74 |
Hungary | 12.73 | 9.16 | 3.57 |
Germany | 11.42 | 8.47 | 2.95 |
Slovenia | 11.37 | 8.42 | 2.95 |
Romania | 11.9 | 9.14 | 2.76 |
Croatia | 12.18 | 9.45 | 2.73 |
The above calculation shows which populations are declining. If you notice, all these countries are in Europe. What about the opposite situation?
%%sql
select name as 'country',
birth_rate, death_rate,
round(birth_rate - death_rate, 2) as diff
from facts
where name <> 'World'
order by diff desc
limit 10
;
Done.
country | birth_rate | death_rate | diff |
---|---|---|---|
Malawi | 41.56 | 8.41 | 33.15 |
Uganda | 43.79 | 10.69 | 33.1 |
Niger | 45.45 | 12.42 | 33.03 |
Burundi | 42.01 | 9.27 | 32.74 |
Mali | 44.99 | 12.89 | 32.1 |
Burkina Faso | 42.03 | 11.72 | 30.31 |
Zambia | 42.13 | 12.67 | 29.46 |
Ethiopia | 37.27 | 8.19 | 29.08 |
South Sudan | 36.91 | 8.18 | 28.73 |
Tanzania | 36.39 | 8.0 | 28.39 |
Here we get the countries whose population is growing most intensively. And they are all in Africa.
Q5. Which countries have the highest population/area ratio?
%%sql
select name as Country,
cast(population/area as int) as People_per_km2_Area,
population as Population, area as Total_Area_km2
from facts
where name <> 'World'
order by People_per_km2_Area desc
limit 10
;
Done.
Country | People_per_km2_Area | Population | Total_Area_km2 |
---|---|---|---|
Macau | 21168 | 592731 | 28 |
Monaco | 15267 | 30535 | 2 |
Singapore | 8141 | 5674472 | 697 |
Hong Kong | 6445 | 7141106 | 1108 |
Gaza Strip | 5191 | 1869055 | 360 |
Gibraltar | 4876 | 29258 | 6 |
Bahrain | 1771 | 1346613 | 760 |
Maldives | 1319 | 393253 | 298 |
Malta | 1310 | 413965 | 316 |
Bermuda | 1299 | 70196 | 54 |
The calculation shows that there are countries with extremely high population densities that did not get in result of our analysis above because their total population is below average.
In this analysis, we reviewed some of the key statistics about the countries of the world based on the CIA World Fact Book. As we have seen, a small database with a small amount of data and fairly simple queries can provide detailed analysis and deliver interesting information that allows us to better and deeper understand our world.