Analyzing CIA Factbook Data Using SQL

In this project, we will be analyzing the CIA Factbook Data that contains information about each world's state/territory and in particular:

  • Population as of 2015.
  • Population growth - the annual growth, in percentage.
  • Area - the total land and water area.

We will figure out:

  • Which country has the least population?
  • Which country has the largest population growth?
  • Average population, area and density of a country.
  • Which country has the maximum density?
  • Which country would have added the most people in 2016? The least?
  • Which countries have a higher death rate than birth rate?
  • What countries have the highest population/area ratio?

Get to Know the Data

Before we jump to analysis, let's study the data we have.

In [1]:
%%capture
%load_ext sql
%sql sqlite:///factbook.db
In [2]:
%%sql
SELECT *
  FROM sqlite_master
 WHERE type='table';
 * sqlite:///factbook.db
Done.
Out[2]:
type name tbl_name rootpage sql
table sqlite_sequence sqlite_sequence 3 CREATE TABLE sqlite_sequence(name,seq)
table facts facts 47 CREATE TABLE "facts" ("id" INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, "code" varchar(255) NOT NULL, "name" varchar(255) NOT NULL, "area" integer, "area_land" integer, "area_water" integer, "population" integer, "population_growth" float, "birth_rate" float, "death_rate" float, "migration_rate" float)

We see that this database contains two tables: sqlite_sequence and facts. For our project, we are only interested in the second table that contains facts about countries. Let's have look at this table.

In [3]:
%%sql
SELECT *
  FROM facts
 LIMIT 5;
 * sqlite:///factbook.db
Done.
Out[3]:
id code name area area_land area_water population population_growth birth_rate death_rate migration_rate
1 af Afghanistan 652230 652230 0 32564342 2.32 38.57 13.89 1.51
2 al Albania 28748 27398 1350 3029278 0.3 12.92 6.58 3.3
3 ag Algeria 2381741 2381741 0 39542166 1.84 23.67 4.31 0.92
4 an Andorra 468 468 0 85580 0.12 8.13 6.96 0.0
5 ao Angola 1246700 1246700 0 19625353 2.78 38.78 11.49 0.46

The first columns are pretty self-explainable (or already explained, like population_growth). The columns birth_rate, death_rate and migration_rate are to be considered as per 1000 persons, and are calculated as an average rate during the year.

The areas (land and water) are measured in square kilometers.

Summary Statistics

Let's calculate some summary statics such as:

  • Minimum population.
  • Maximum population.
  • Minimum population growth.
  • Maximum population growth.

Minimum and Maximum Population

In [4]:
%%sql
SELECT MIN(population), MAX(population),
       MIN(population_growth), MAX(population_growth)
  FROM facts;
 * sqlite:///factbook.db
Done.
Out[4]:
MIN(population) MAX(population) MIN(population_growth) MAX(population_growth)
0 7256490011 0.0 4.02

As far as we can see, the minimum population is 0? What is the territory where nobody lives? It's probably the Antarctic or the Arctic. The maximum population is the whole's world population. We also have a country with 0 population growth (it probably has no population at all?).

Let's investigate the "countries" with 0 and 7,2 billion populations.

In [5]:
%%sql
SELECT name, MAX(population) max_population
  FROM facts;
 * sqlite:///factbook.db
Done.
Out[5]:
name max_population
World 7256490011
In [6]:
%%sql
SELECT name, MIN(population) min_population
  FROM facts;
 * sqlite:///factbook.db
Done.
Out[6]:
name min_population
Antarctica 0

It's clear now that the database has two entries: World which has the largest population (the world's population) and Antarctica, a continent with no permanent inhabitants. In the Antractica there only permanent and summer-only research bases.

Let's exclude the whole world and the Antarctica from summary statistics.

In [7]:
%%sql
SELECT MIN(population), MAX(population),
       MIN(population_growth), MAX(population_growth)
  FROM facts
 WHERE name NOT IN ("World", "Antarctica");
 * sqlite:///factbook.db
Done.
Out[7]:
MIN(population) MAX(population) MIN(population_growth) MAX(population_growth)
48 1367485388 0.0 4.02

It's obvious that the country with the maximum population is China, but what is the country with only 48 persons living in there? Is it Vatican?

In [8]:
%%sql
SELECT name, MIN(population) AS min_population
  FROM facts
 WHERE name NOT IN ("World", "Antarctica");
 * sqlite:///factbook.db
Done.
Out[8]:
name min_population
Pitcairn Islands 48

It's not Vatican! Pitcairn Islands are a group of four volcanic islands the least populous jurisdiction territory in the world. They are a part of British Overseas Territory in Pacific Ocean. When you are tired of the life you may want to move there for a couple of months (not an advertisement).

Maximum Population Growth

Let's now see what is the county with the maximum population growth (since the minimum population growth is 0, it fair to think there are many countries with such value, in other words they have no population growth).

In [9]:
%%sql 
SELECT name, MAX(population_growth) population_growth
  FROM facts;
 * sqlite:///factbook.db
Done.
Out[9]:
name population_growth
South Sudan 4.02

South Sudan is the most recent sovereign state and one of the poorest states in the world. It's no surprise it has the highest population growth rate in the world.

Average Area and Population

Let's calculate the average number for area and population excluding the Antarctica and the world. It's worth noting that the world, Antarctica and the oceans (i.e. Pacific, Indian) have no values for the area, but there is the area of the European Union that we have to exclude from our calculation (all its countries are already present in the table).

In [10]:
%%sql
SELECT AVG(area) avg_country_area
  FROM facts
 WHERE name != "European Union";
 * sqlite:///factbook.db
Done.
Out[10]:
avg_country_area
539893.1895161291

It's 539 893 square kilometers, an area comparable to Metropolitan France! Let's figure out the number of states that have an area bigger than their average.

In [11]:
%%sql
SELECT COUNT(*)
  FROM facts
 WHERE area > (SELECT AVG(area) FROM facts
                WHERE name NOT IN ("European Union")
              );
 * sqlite:///factbook.db
Done.
Out[11]:
COUNT(*)
48

There are 48 countries out of 195 countries that are bigger than the average country area. Let's write a query to figure out which countries have areas above average.

In [12]:
%%sql
SELECT name, area
  FROM facts
 WHERE area > (SELECT AVG(area) FROM facts
                WHERE name NOT IN ("European Union")
              )
 ORDER BY area DESC;
 * sqlite:///factbook.db
Done.
Out[12]:
name area
Russia 17098242
Canada 9984670
United States 9826675
China 9596960
Brazil 8515770
Australia 7741220
European Union 4324782
India 3287263
Argentina 2780400
Kazakhstan 2724900
Algeria 2381741
Congo, Democratic Republic of the 2344858
Greenland 2166086
Saudi Arabia 2149690
Mexico 1964375
Indonesia 1904569
Sudan 1861484
Libya 1759540
Iran 1648195
Mongolia 1564116
Peru 1285216
Angola 1246700
Mali 1240192
South Africa 1219090
Colombia 1138910
Ethiopia 1104300
Bolivia 1098581
Mauritania 1030700
Egypt 1001450
Tanzania 947300
Nigeria 923768
Venezuela 912050
Namibia 824292
Mozambique 799380
Pakistan 796095
Turkey 783562
Chile 756102
Zambia 752618
Burma 676578
Afghanistan 652230
South Sudan 644329
France 643801
Somalia 637657
Central African Republic 622984
Ukraine 603550
Madagascar 587041
Botswana 581730
Kenya 580367

Let's do the same statistics for population. This time though we have to exclude the world and Antarctica since they have values in the population column. We additionally have to exclude the European Union.

In [13]:
%%sql 
SELECT AVG(population) avg_world_population
  FROM facts
 WHERE name NOT IN ("World", "Antarctica", "European Union");
 * sqlite:///factbook.db
Done.
Out[13]:
avg_world_population
30362063.589958157

There are more than 30 million people on average living in each country. It's around the population in Mozambique. Let's then count the number of countries that have more people than the average population.

In [14]:
%%sql
SELECT COUNT(*)
  FROM facts
 WHERE population > (SELECT AVG(population) FROM facts 
                     WHERE name NOT IN ("World", "European Union", "Antarctica"));
 * sqlite:///factbook.db
Done.
Out[14]:
COUNT(*)
45

There are 45 (43 without the world and European Union) countries that have a population bigger than the average world's population! Let's figure out what this countries are and order them by population (from the highest to the lowest).

In [15]:
%%sql
WITH more_avg_pop AS 
    (
    SELECT name, population
      FROM facts
      WHERE population > (SELECT AVG(population) FROM facts
                           WHERE name NOT IN ("World", "European Union", "Antractica")
                         )
    )

SELECT name, population
  FROM more_avg_pop
 WHERE name NOT IN ("World", "European Union")
 ORDER BY population DESC;
 * sqlite:///factbook.db
Done.
Out[15]:
name population
China 1367485388
India 1251695584
United States 321368864
Indonesia 255993674
Brazil 204259812
Pakistan 199085847
Nigeria 181562056
Bangladesh 168957745
Russia 142423773
Japan 126919659
Mexico 121736809
Philippines 100998376
Ethiopia 99465819
Vietnam 94348835
Egypt 88487396
Iran 81824270
Germany 80854408
Turkey 79414269
Congo, Democratic Republic of the 79375136
Thailand 67976405
France 66553766
United Kingdom 64088222
Italy 61855120
Burma 56320206
South Africa 53675563
Tanzania 51045882
Korea, South 49115196
Spain 48146134
Colombia 46736728
Kenya 45925301
Ukraine 44429471
Argentina 43431886
Algeria 39542166
Poland 38562189
Uganda 37101745
Iraq 37056169
Sudan 36108853
Canada 35099836
Morocco 33322699
Afghanistan 32564342
Nepal 31551305
Malaysia 30513848
Peru 30444999

It would be also interesting to figure out which proportion of the world's population the first three countries (China, India, USA) hold.

In [16]:
%%sql
WITH top_three_pop_proportion AS
    (
    SELECT CAST(Population AS FLOAT) / (SELECT MAX(Population) FROM facts) proportion
      FROM facts
     WHERE name NOT IN ("World", "European Union")
     ORDER BY 1 DESC
     LIMIT 3
    )
    
SELECT ROUND(SUM(proportion) * 100, 2) "proportion (%)"
  FROM top_three_pop_proportion
 * sqlite:///factbook.db
Done.
Out[16]:
proportion (%)
40.52

Just three countries account for more than 40 percent of the world's population: China, India and USA with China and India having more than one billion people each.

Population Densities

Now let's calculate population densities. To do so we will have to divide the country's population by its land area (hopefully most people don't live in rivers or lakes). Let's also order them in descending order to find out the countries with the highest population density.

In [17]:
%%sql
SELECT name, population, area, 
       ROUND(CAST(population AS FLOAT) / area_land, 2) population_density
  FROM facts
 ORDER BY population_density DESC
 LIMIT 10;
 * sqlite:///factbook.db
Done.
Out[17]:
name population area population_density
Macau 592731 28 21168.96
Monaco 30535 2 15267.5
Singapore 5674472 697 8259.78
Hong Kong 7141106 1108 6655.27
Gaza Strip 1869055 360 5191.82
Gibraltar 29258 6 4876.33
Bahrain 1346613 760 1771.86
Maldives 393253 298 1319.64
Malta 413965 316 1310.02
Bermuda 70196 54 1299.93

Macau is a special administrative region in China, and is considered the most densely populated region in the world.

If we are following this way why not calculating the average population density and figure out how many (and what) countries have densitites more than its average? The oceans, the EU, the world, have no values for area_land and/or for population so we do not need to explicitly exclude them from calculation. On the other hand, the Antarctica has both values, so it's necessary to exclude it.

In [18]:
%%sql
SELECT ROUND(AVG(CAST(population AS FLOAT) / area_land), 2) avg_pop_density
  FROM facts
 WHERE name NOT IN ("Antarctica")
 * sqlite:///factbook.db
Done.
Out[18]:
avg_pop_density
431.91

It's around 432 pop./km$^2$, the density of the Netherlands, a state in Europe, a continent with one of the most highest densitites in the world.

Let's now figure out how many countries (and what) have densities more than the average value.

In [19]:
%%sql
SELECT COUNT(*) more_avg_dens
  FROM facts
 WHERE CAST(population AS FLOAT) / area_land > 
                                        (
                                        SELECT AVG(CAST(population AS FLOAT) / area_land) 
                                          FROM facts
                                         WHERE name NOT IN ("Antarctica")                         
                                        )
 * sqlite:///factbook.db
Done.
Out[19]:
more_avg_dens
26
In [20]:
%%sql
SELECT name, ROUND(CAST(population AS FLOAT) / area_land, 2) population_density
  FROM facts
 WHERE CAST(population AS FLOAT) / area_land > 
                                        (
                                        SELECT AVG(CAST(population AS FLOAT) / area_land) 
                                          FROM facts
                                         WHERE name NOT IN ("Antarctica")                         
                                        )
 ORDER BY population_density DESC;
 * sqlite:///factbook.db
Done.
Out[20]:
name population_density
Macau 21168.96
Monaco 15267.5
Singapore 8259.78
Hong Kong 6655.27
Gaza Strip 5191.82
Gibraltar 4876.33
Bahrain 1771.86
Maldives 1319.64
Malta 1310.02
Bermuda 1299.93
Bangladesh 1297.98
Sint Maarten 1167.32
Guernsey 847.18
Jersey 838.74
Taiwan 725.83
Barbados 675.82
Mauritius 660.01
Aruba 623.12
Lebanon 604.57
Saint Martin 588.04
San Marino 541.31
Rwanda 513.29
Korea, South 506.76
Netherlands 500.04
West Bank 493.86
Nauru 454.29

There are 26 countries which density is higher that the average density. It's interesting to note that the calculated value of 500 pop./km$^2$ for the Netherlands is much higher that the Wikipedia value. It's also interesting to note that the Netherlands in the only European continental state (not counting small states like Monaco) with a density higher that the average.

Most and Least Added People

It's also interesting to know which countries will add the most people to their population next year. We can calculate this by multiplication of the population by the population growth rate / 100 + 1 (because it's in percentage) - current population. If we order this table by the new column in ascending order we can figure out what countries would have added the least people.

In [21]:
%%sql
CREATE VIEW add_people AS
    SELECT name, population, population_growth, 
           ROUND(population * (population_growth / 100 + 1) - population, 0) AS added_people
     FROM facts;
 * sqlite:///factbook.db
Done.
Out[21]:
[]
In [22]:
%%sql
SELECT *
  FROM add_people
 ORDER BY added_people DESC
 LIMIT 11;
 * sqlite:///factbook.db
Done.
Out[22]:
name population population_growth added_people
World 7256490011 1.08 78370092.0
India 1251695584 1.22 15270686.0
China 1367485388 0.45 6153684.0
Nigeria 181562056 2.45 4448270.0
Pakistan 199085847 1.46 2906653.0
Ethiopia 99465819 2.89 2874562.0
Bangladesh 168957745 1.6 2703324.0
United States 321368864 0.78 2506677.0
Indonesia 255993674 0.92 2355142.0
Congo, Democratic Republic of the 79375136 2.45 1944691.0
Philippines 100998376 1.61 1626074.0
In [23]:
%%sql
SELECT *
  FROM add_people
 WHERE added_people IS NOT NULL
 ORDER BY added_people
 LIMIT 10;
 * sqlite:///factbook.db
Done.
Out[23]:
name population population_growth added_people
Holy See (Vatican City) 842 0.0 0.0
Cocos (Keeling) Islands 596 0.0 0.0
Norfolk Island 2210 0.01 0.0
Greenland 57733 0.0 0.0
Niue 1190 0.03 0.0
Tokelau 1337 0.01 0.0
Falkland Islands (Islas Malvinas) 3361 0.01 0.0
Pitcairn Islands 48 0.0 0.0
Svalbard 1872 0.03 1.0
Christmas Island 1530 1.11 17.0

It's interesting to notice the USA in this least as the only developed country (it's also arguably China) in this list that would have added people in the year 2016. It's also curious to figure out that China with its policy 'One family - one child' has more than two times smaller number of new people than India that has not such policy.

As for the countries with the least added people we notice Svalbard, a Norwegian archipelago in the Arctic Ocean and Christmas Island, an Australian external territory in the Indian Ocean, both adding just 1 and 17 persons, respectively.

Birth and Death Rates

Let's figure out how many and which countries have a higher death rate than the birth rate. This value is called "Rate of Natural population increase" and is calculated as "birth rate - death rate".

In [24]:
%%sql 
SELECT name, ROUND(birth_rate - death_rate, 2) AS natural_pop_increase
  FROM facts
 WHERE natural_pop_increase IS NOT NULL
 ORDER BY natural_pop_increase
 LIMIT 10;
 * sqlite:///factbook.db
Done.
Out[24]:
name natural_pop_increase
Bulgaria -5.52
Serbia -4.58
Latvia -4.31
Lithuania -4.17
Ukraine -3.74
Hungary -3.57
Germany -2.95
Slovenia -2.95
Romania -2.76
Croatia -2.73

Bulgaria, Serbia and Latvia are three countries with the most negative natural population increase.

Highest Water-to-Land Ratio

Let's now compare the water-to-land ratios to figure out what countries are the most waterful (hoorah, we've just invented a new word)!

In [25]:
%%sql
SELECT name, area_land, area_water, ROUND(CAST(area_water AS FLOAT) / area_land, 2) AS water_to_land
  FROM facts
 ORDER BY Water_to_land DESC
 LIMIT 10;
 * sqlite:///factbook.db
Done.
Out[25]:
name area_land area_water water_to_land
British Indian Ocean Territory 60 54340 905.67
Virgin Islands 346 1564 4.52
Puerto Rico 8870 4921 0.55
Bahamas, The 10010 3870 0.39
Guinea-Bissau 28120 8005 0.28
Malawi 94080 24404 0.26
Netherlands 33893 7650 0.23
Uganda 197100 43938 0.22
Eritrea 101000 16600 0.16
Liberia 96320 15049 0.16

The British Indian Ocean Territory stands out in the list having the water-to-land ratio more than 200 times bigger than the second place (Virgin Islands). It's a group of attols with more than 1000 individual islands with a total area of 54 400 square kilometers so it's no surprise it has such a big ratio.

Conclusions

The main objective of this project though was to understand how the basic SQL code works and what information could be extracted using SQL. We analyzed the CIA Factbook Data and answered some summary questions about the database and discovered some interesting facts about the world's countries:

  • Pitcairn Islands are a part of the British Overseas Territories and it's the least populous national jurisdiction in the world!
  • Macau, located in China, is a territory with the highest population density in the world.
  • British Indian Ocean Territory is a territory in the Indian Ocean with the highest water-to-land ration (more than 900 times).