In this project, we will be analyzing the CIA Factbook Data that contains information about each world's state/territory and in particular:
We will figure out:
Before we jump to analysis, let's study the data we have.
%%capture
%load_ext sql
%sql sqlite:///factbook.db
%%sql
SELECT *
FROM sqlite_master
WHERE type='table';
* sqlite:///factbook.db Done.
type | name | tbl_name | rootpage | sql |
---|---|---|---|---|
table | sqlite_sequence | sqlite_sequence | 3 | CREATE TABLE sqlite_sequence(name,seq) |
table | facts | facts | 47 | CREATE TABLE "facts" ("id" INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, "code" varchar(255) NOT NULL, "name" varchar(255) NOT NULL, "area" integer, "area_land" integer, "area_water" integer, "population" integer, "population_growth" float, "birth_rate" float, "death_rate" float, "migration_rate" float) |
We see that this database contains two tables: sqlite_sequence
and facts
. For our project, we are only interested in the second table that contains facts about countries. Let's have look at this table.
%%sql
SELECT *
FROM facts
LIMIT 5;
* sqlite:///factbook.db Done.
id | code | name | area | area_land | area_water | population | population_growth | birth_rate | death_rate | migration_rate |
---|---|---|---|---|---|---|---|---|---|---|
1 | af | Afghanistan | 652230 | 652230 | 0 | 32564342 | 2.32 | 38.57 | 13.89 | 1.51 |
2 | al | Albania | 28748 | 27398 | 1350 | 3029278 | 0.3 | 12.92 | 6.58 | 3.3 |
3 | ag | Algeria | 2381741 | 2381741 | 0 | 39542166 | 1.84 | 23.67 | 4.31 | 0.92 |
4 | an | Andorra | 468 | 468 | 0 | 85580 | 0.12 | 8.13 | 6.96 | 0.0 |
5 | ao | Angola | 1246700 | 1246700 | 0 | 19625353 | 2.78 | 38.78 | 11.49 | 0.46 |
The first columns are pretty self-explainable (or already explained, like population_growth
). The columns birth_rate
, death_rate
and migration_rate
are to be considered as per 1000 persons, and are calculated as an average rate during the year.
The areas (land and water) are measured in square kilometers.
Let's calculate some summary statics such as:
%%sql
SELECT MIN(population), MAX(population),
MIN(population_growth), MAX(population_growth)
FROM facts;
* sqlite:///factbook.db Done.
MIN(population) | MAX(population) | MIN(population_growth) | MAX(population_growth) |
---|---|---|---|
0 | 7256490011 | 0.0 | 4.02 |
As far as we can see, the minimum population is 0? What is the territory where nobody lives? It's probably the Antarctic or the Arctic. The maximum population is the whole's world population. We also have a country with 0 population growth (it probably has no population at all?).
Let's investigate the "countries" with 0 and 7,2 billion populations.
%%sql
SELECT name, MAX(population) max_population
FROM facts;
* sqlite:///factbook.db Done.
name | max_population |
---|---|
World | 7256490011 |
%%sql
SELECT name, MIN(population) min_population
FROM facts;
* sqlite:///factbook.db Done.
name | min_population |
---|---|
Antarctica | 0 |
It's clear now that the database has two entries: World which has the largest population (the world's population) and Antarctica, a continent with no permanent inhabitants. In the Antractica there only permanent and summer-only research bases.
Let's exclude the whole world and the Antarctica from summary statistics.
%%sql
SELECT MIN(population), MAX(population),
MIN(population_growth), MAX(population_growth)
FROM facts
WHERE name NOT IN ("World", "Antarctica");
* sqlite:///factbook.db Done.
MIN(population) | MAX(population) | MIN(population_growth) | MAX(population_growth) |
---|---|---|---|
48 | 1367485388 | 0.0 | 4.02 |
It's obvious that the country with the maximum population is China, but what is the country with only 48 persons living in there? Is it Vatican?
%%sql
SELECT name, MIN(population) AS min_population
FROM facts
WHERE name NOT IN ("World", "Antarctica");
* sqlite:///factbook.db Done.
name | min_population |
---|---|
Pitcairn Islands | 48 |
It's not Vatican! Pitcairn Islands are a group of four volcanic islands the least populous jurisdiction territory in the world. They are a part of British Overseas Territory in Pacific Ocean. When you are tired of the life you may want to move there for a couple of months (not an advertisement).
Let's now see what is the county with the maximum population growth (since the minimum population growth is 0, it fair to think there are many countries with such value, in other words they have no population growth).
%%sql
SELECT name, MAX(population_growth) population_growth
FROM facts;
* sqlite:///factbook.db Done.
name | population_growth |
---|---|
South Sudan | 4.02 |
South Sudan is the most recent sovereign state and one of the poorest states in the world. It's no surprise it has the highest population growth rate in the world.
Let's calculate the average number for area
and population
excluding the Antarctica and the world. It's worth noting that the world, Antarctica and the oceans (i.e. Pacific, Indian) have no values for the area, but there is the area of the European Union that we have to exclude from our calculation (all its countries are already present in the table).
%%sql
SELECT AVG(area) avg_country_area
FROM facts
WHERE name != "European Union";
* sqlite:///factbook.db Done.
avg_country_area |
---|
539893.1895161291 |
It's 539 893 square kilometers, an area comparable to Metropolitan France! Let's figure out the number of states that have an area bigger than their average.
%%sql
SELECT COUNT(*)
FROM facts
WHERE area > (SELECT AVG(area) FROM facts
WHERE name NOT IN ("European Union")
);
* sqlite:///factbook.db Done.
COUNT(*) |
---|
48 |
There are 48 countries out of 195 countries that are bigger than the average country area. Let's write a query to figure out which countries have areas above average.
%%sql
SELECT name, area
FROM facts
WHERE area > (SELECT AVG(area) FROM facts
WHERE name NOT IN ("European Union")
)
ORDER BY area DESC;
* sqlite:///factbook.db Done.
name | area |
---|---|
Russia | 17098242 |
Canada | 9984670 |
United States | 9826675 |
China | 9596960 |
Brazil | 8515770 |
Australia | 7741220 |
European Union | 4324782 |
India | 3287263 |
Argentina | 2780400 |
Kazakhstan | 2724900 |
Algeria | 2381741 |
Congo, Democratic Republic of the | 2344858 |
Greenland | 2166086 |
Saudi Arabia | 2149690 |
Mexico | 1964375 |
Indonesia | 1904569 |
Sudan | 1861484 |
Libya | 1759540 |
Iran | 1648195 |
Mongolia | 1564116 |
Peru | 1285216 |
Angola | 1246700 |
Mali | 1240192 |
South Africa | 1219090 |
Colombia | 1138910 |
Ethiopia | 1104300 |
Bolivia | 1098581 |
Mauritania | 1030700 |
Egypt | 1001450 |
Tanzania | 947300 |
Nigeria | 923768 |
Venezuela | 912050 |
Namibia | 824292 |
Mozambique | 799380 |
Pakistan | 796095 |
Turkey | 783562 |
Chile | 756102 |
Zambia | 752618 |
Burma | 676578 |
Afghanistan | 652230 |
South Sudan | 644329 |
France | 643801 |
Somalia | 637657 |
Central African Republic | 622984 |
Ukraine | 603550 |
Madagascar | 587041 |
Botswana | 581730 |
Kenya | 580367 |
Let's do the same statistics for population
. This time though we have to exclude the world and Antarctica since they have values in the population
column. We additionally have to exclude the European Union.
%%sql
SELECT AVG(population) avg_world_population
FROM facts
WHERE name NOT IN ("World", "Antarctica", "European Union");
* sqlite:///factbook.db Done.
avg_world_population |
---|
30362063.589958157 |
There are more than 30 million people on average living in each country. It's around the population in Mozambique. Let's then count the number of countries that have more people than the average population.
%%sql
SELECT COUNT(*)
FROM facts
WHERE population > (SELECT AVG(population) FROM facts
WHERE name NOT IN ("World", "European Union", "Antarctica"));
* sqlite:///factbook.db Done.
COUNT(*) |
---|
45 |
There are 45 (43 without the world and European Union) countries that have a population bigger than the average world's population! Let's figure out what this countries are and order them by population (from the highest to the lowest).
%%sql
WITH more_avg_pop AS
(
SELECT name, population
FROM facts
WHERE population > (SELECT AVG(population) FROM facts
WHERE name NOT IN ("World", "European Union", "Antractica")
)
)
SELECT name, population
FROM more_avg_pop
WHERE name NOT IN ("World", "European Union")
ORDER BY population DESC;
* sqlite:///factbook.db Done.
name | population |
---|---|
China | 1367485388 |
India | 1251695584 |
United States | 321368864 |
Indonesia | 255993674 |
Brazil | 204259812 |
Pakistan | 199085847 |
Nigeria | 181562056 |
Bangladesh | 168957745 |
Russia | 142423773 |
Japan | 126919659 |
Mexico | 121736809 |
Philippines | 100998376 |
Ethiopia | 99465819 |
Vietnam | 94348835 |
Egypt | 88487396 |
Iran | 81824270 |
Germany | 80854408 |
Turkey | 79414269 |
Congo, Democratic Republic of the | 79375136 |
Thailand | 67976405 |
France | 66553766 |
United Kingdom | 64088222 |
Italy | 61855120 |
Burma | 56320206 |
South Africa | 53675563 |
Tanzania | 51045882 |
Korea, South | 49115196 |
Spain | 48146134 |
Colombia | 46736728 |
Kenya | 45925301 |
Ukraine | 44429471 |
Argentina | 43431886 |
Algeria | 39542166 |
Poland | 38562189 |
Uganda | 37101745 |
Iraq | 37056169 |
Sudan | 36108853 |
Canada | 35099836 |
Morocco | 33322699 |
Afghanistan | 32564342 |
Nepal | 31551305 |
Malaysia | 30513848 |
Peru | 30444999 |
It would be also interesting to figure out which proportion of the world's population the first three countries (China, India, USA) hold.
%%sql
WITH top_three_pop_proportion AS
(
SELECT CAST(Population AS FLOAT) / (SELECT MAX(Population) FROM facts) proportion
FROM facts
WHERE name NOT IN ("World", "European Union")
ORDER BY 1 DESC
LIMIT 3
)
SELECT ROUND(SUM(proportion) * 100, 2) "proportion (%)"
FROM top_three_pop_proportion
* sqlite:///factbook.db Done.
proportion (%) |
---|
40.52 |
Just three countries account for more than 40 percent of the world's population: China, India and USA with China and India having more than one billion people each.
Now let's calculate population densities. To do so we will have to divide the country's population by its land area (hopefully most people don't live in rivers or lakes). Let's also order them in descending order to find out the countries with the highest population density.
%%sql
SELECT name, population, area,
ROUND(CAST(population AS FLOAT) / area_land, 2) population_density
FROM facts
ORDER BY population_density DESC
LIMIT 10;
* sqlite:///factbook.db Done.
name | population | area | population_density |
---|---|---|---|
Macau | 592731 | 28 | 21168.96 |
Monaco | 30535 | 2 | 15267.5 |
Singapore | 5674472 | 697 | 8259.78 |
Hong Kong | 7141106 | 1108 | 6655.27 |
Gaza Strip | 1869055 | 360 | 5191.82 |
Gibraltar | 29258 | 6 | 4876.33 |
Bahrain | 1346613 | 760 | 1771.86 |
Maldives | 393253 | 298 | 1319.64 |
Malta | 413965 | 316 | 1310.02 |
Bermuda | 70196 | 54 | 1299.93 |
Macau is a special administrative region in China, and is considered the most densely populated region in the world.
If we are following this way why not calculating the average population density and figure out how many (and what) countries have densitites more than its average? The oceans, the EU, the world, have no values for area_land
and/or for population
so we do not need to explicitly exclude them from calculation. On the other hand, the Antarctica has both values, so it's necessary to exclude it.
%%sql
SELECT ROUND(AVG(CAST(population AS FLOAT) / area_land), 2) avg_pop_density
FROM facts
WHERE name NOT IN ("Antarctica")
* sqlite:///factbook.db Done.
avg_pop_density |
---|
431.91 |
It's around 432 pop./km$^2$, the density of the Netherlands, a state in Europe, a continent with one of the most highest densitites in the world.
Let's now figure out how many countries (and what) have densities more than the average value.
%%sql
SELECT COUNT(*) more_avg_dens
FROM facts
WHERE CAST(population AS FLOAT) / area_land >
(
SELECT AVG(CAST(population AS FLOAT) / area_land)
FROM facts
WHERE name NOT IN ("Antarctica")
)
* sqlite:///factbook.db Done.
more_avg_dens |
---|
26 |
%%sql
SELECT name, ROUND(CAST(population AS FLOAT) / area_land, 2) population_density
FROM facts
WHERE CAST(population AS FLOAT) / area_land >
(
SELECT AVG(CAST(population AS FLOAT) / area_land)
FROM facts
WHERE name NOT IN ("Antarctica")
)
ORDER BY population_density DESC;
* sqlite:///factbook.db Done.
name | population_density |
---|---|
Macau | 21168.96 |
Monaco | 15267.5 |
Singapore | 8259.78 |
Hong Kong | 6655.27 |
Gaza Strip | 5191.82 |
Gibraltar | 4876.33 |
Bahrain | 1771.86 |
Maldives | 1319.64 |
Malta | 1310.02 |
Bermuda | 1299.93 |
Bangladesh | 1297.98 |
Sint Maarten | 1167.32 |
Guernsey | 847.18 |
Jersey | 838.74 |
Taiwan | 725.83 |
Barbados | 675.82 |
Mauritius | 660.01 |
Aruba | 623.12 |
Lebanon | 604.57 |
Saint Martin | 588.04 |
San Marino | 541.31 |
Rwanda | 513.29 |
Korea, South | 506.76 |
Netherlands | 500.04 |
West Bank | 493.86 |
Nauru | 454.29 |
There are 26 countries which density is higher that the average density. It's interesting to note that the calculated value of 500 pop./km$^2$ for the Netherlands is much higher that the Wikipedia value. It's also interesting to note that the Netherlands in the only European continental state (not counting small states like Monaco) with a density higher that the average.
It's also interesting to know which countries will add the most people to their population next year. We can calculate this by multiplication of the population by the population growth rate / 100 + 1 (because it's in percentage) - current population. If we order this table by the new column in ascending order we can figure out what countries would have added the least people.
%%sql
CREATE VIEW add_people AS
SELECT name, population, population_growth,
ROUND(population * (population_growth / 100 + 1) - population, 0) AS added_people
FROM facts;
* sqlite:///factbook.db Done.
[]
%%sql
SELECT *
FROM add_people
ORDER BY added_people DESC
LIMIT 11;
* sqlite:///factbook.db Done.
name | population | population_growth | added_people |
---|---|---|---|
World | 7256490011 | 1.08 | 78370092.0 |
India | 1251695584 | 1.22 | 15270686.0 |
China | 1367485388 | 0.45 | 6153684.0 |
Nigeria | 181562056 | 2.45 | 4448270.0 |
Pakistan | 199085847 | 1.46 | 2906653.0 |
Ethiopia | 99465819 | 2.89 | 2874562.0 |
Bangladesh | 168957745 | 1.6 | 2703324.0 |
United States | 321368864 | 0.78 | 2506677.0 |
Indonesia | 255993674 | 0.92 | 2355142.0 |
Congo, Democratic Republic of the | 79375136 | 2.45 | 1944691.0 |
Philippines | 100998376 | 1.61 | 1626074.0 |
%%sql
SELECT *
FROM add_people
WHERE added_people IS NOT NULL
ORDER BY added_people
LIMIT 10;
* sqlite:///factbook.db Done.
name | population | population_growth | added_people |
---|---|---|---|
Holy See (Vatican City) | 842 | 0.0 | 0.0 |
Cocos (Keeling) Islands | 596 | 0.0 | 0.0 |
Norfolk Island | 2210 | 0.01 | 0.0 |
Greenland | 57733 | 0.0 | 0.0 |
Niue | 1190 | 0.03 | 0.0 |
Tokelau | 1337 | 0.01 | 0.0 |
Falkland Islands (Islas Malvinas) | 3361 | 0.01 | 0.0 |
Pitcairn Islands | 48 | 0.0 | 0.0 |
Svalbard | 1872 | 0.03 | 1.0 |
Christmas Island | 1530 | 1.11 | 17.0 |
It's interesting to notice the USA in this least as the only developed country (it's also arguably China) in this list that would have added people in the year 2016. It's also curious to figure out that China with its policy 'One family - one child' has more than two times smaller number of new people than India that has not such policy.
As for the countries with the least added people we notice Svalbard, a Norwegian archipelago in the Arctic Ocean and Christmas Island, an Australian external territory in the Indian Ocean, both adding just 1 and 17 persons, respectively.
Let's figure out how many and which countries have a higher death rate than the birth rate. This value is called "Rate of Natural population increase" and is calculated as "birth rate - death rate".
%%sql
SELECT name, ROUND(birth_rate - death_rate, 2) AS natural_pop_increase
FROM facts
WHERE natural_pop_increase IS NOT NULL
ORDER BY natural_pop_increase
LIMIT 10;
* sqlite:///factbook.db Done.
name | natural_pop_increase |
---|---|
Bulgaria | -5.52 |
Serbia | -4.58 |
Latvia | -4.31 |
Lithuania | -4.17 |
Ukraine | -3.74 |
Hungary | -3.57 |
Germany | -2.95 |
Slovenia | -2.95 |
Romania | -2.76 |
Croatia | -2.73 |
Bulgaria, Serbia and Latvia are three countries with the most negative natural population increase.
Let's now compare the water-to-land ratios to figure out what countries are the most waterful (hoorah, we've just invented a new word)!
%%sql
SELECT name, area_land, area_water, ROUND(CAST(area_water AS FLOAT) / area_land, 2) AS water_to_land
FROM facts
ORDER BY Water_to_land DESC
LIMIT 10;
* sqlite:///factbook.db Done.
name | area_land | area_water | water_to_land |
---|---|---|---|
British Indian Ocean Territory | 60 | 54340 | 905.67 |
Virgin Islands | 346 | 1564 | 4.52 |
Puerto Rico | 8870 | 4921 | 0.55 |
Bahamas, The | 10010 | 3870 | 0.39 |
Guinea-Bissau | 28120 | 8005 | 0.28 |
Malawi | 94080 | 24404 | 0.26 |
Netherlands | 33893 | 7650 | 0.23 |
Uganda | 197100 | 43938 | 0.22 |
Eritrea | 101000 | 16600 | 0.16 |
Liberia | 96320 | 15049 | 0.16 |
The British Indian Ocean Territory stands out in the list having the water-to-land ratio more than 200 times bigger than the second place (Virgin Islands). It's a group of attols with more than 1000 individual islands with a total area of 54 400 square kilometers so it's no surprise it has such a big ratio.
The main objective of this project though was to understand how the basic SQL code works and what information could be extracted using SQL. We analyzed the CIA Factbook Data and answered some summary questions about the database and discovered some interesting facts about the world's countries: