Analyzing CIA Factbook Data Using SQL

The World Factbook provides basic intelligence on the history, people, government, economy, energy, geography, communications, transportation, military, terrorism, and transnational issues for 267 world entities.

Tasks we will complete in this excercise:

  1. Connect jupyter notebook to the CIA factbook database
  2. Explore database
  3. Explore indivdual tables in the database
  4. Generate statistics from the "facts" table

Connecting juyter notebook to the CIA factbook database file i.e. factbook.db

Make sure that factbook.db is in the same folder as jupyter notebook

In [1]:
%%capture
%load_ext sql
%sql sqlite:///factbook.db

Exploring the factbook.db database file

In [2]:
%%sql

SELECT *
  FROM sqlite_master
 WHERE type='table';
 * sqlite:///factbook.db
Done.
Out[2]:
type name tbl_name rootpage sql
table sqlite_sequence sqlite_sequence 3 CREATE TABLE sqlite_sequence(name,seq)
table facts facts 47 CREATE TABLE "facts" ("id" INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, "code" varchar(255) NOT NULL, "name" varchar(255) NOT NULL, "area" integer, "area_land" integer, "area_water" integer, "population" integer, "population_growth" float, "birth_rate" float, "death_rate" float, "migration_rate" float)

There are two different tables in the factbook.db:

  1. sqlite_sequence
  2. facts

Countries related data is stored in "facts" table, so we will explore that table further.

Exploring facts table:

In [3]:
%%sql

SELECT * 
    FROM facts
    LIMIT 5;
 * sqlite:///factbook.db
Done.
Out[3]:
id code name area area_land area_water population population_growth birth_rate death_rate migration_rate
1 af Afghanistan 652230 652230 0 32564342 2.32 38.57 13.89 1.51
2 al Albania 28748 27398 1350 3029278 0.3 12.92 6.58 3.3
3 ag Algeria 2381741 2381741 0 39542166 1.84 23.67 4.31 0.92
4 an Andorra 468 468 0 85580 0.12 8.13 6.96 0.0
5 ao Angola 1246700 1246700 0 19625353 2.78 38.78 11.49 0.46

Here are the descriptions for some of the columns:

  • name — the name of the country.
  • area— the country's total area (both land and water).
  • area_land — the country's land area in square kilometers.
  • area_water — the country's waterarea in square kilometers.
  • population — the country's population.
  • population_growth — the country's population growth as a percentage.
  • birth_rate — the country's birth rate, or the number of births per year per 1,000 people.
  • death_rate — the country's death rate, or the number of death per year per 1,000 people.

Generate Statistics

Min,Max of "population" and "population growth"

In [4]:
%%sql

SELECT MIN(population), MAX(population), MIN(population_growth), MAX(population_growth)
    FROM facts;
 * sqlite:///factbook.db
Done.
Out[4]:
MIN(population) MAX(population) MIN(population_growth) MAX(population_growth)
0 7256490011 0.0 4.02

From this search we can see that minimum population of a country is 0 and maximum is 7.2 billion. Both of these populations seem unrealistic. Let's find out to which country they refer to.

Minimum Population:

In [5]:
%%sql

SELECT *
    FROM facts
   WHERE population == (SELECT MIN(population) FROM facts)
 * sqlite:///factbook.db
Done.
Out[5]:
id code name area area_land area_water population population_growth birth_rate death_rate migration_rate
250 ay Antarctica None 280000 None 0 None None None None

Maximum Population:

In [6]:
%%sql

SELECT *
    FROM facts
   WHERE population == (SELECT MAX(population) FROM facts)
 * sqlite:///factbook.db
Done.
Out[6]:
id code name area area_land area_water population population_growth birth_rate death_rate migration_rate
261 xx World None None None 7256490011 1.08 18.6 7.8 None

So, minimum population of 0 belongs to Antarctica and 7.2 billion is the world population. Going forward, we will exclude world population from our calculation for accurate result.

Recalculating Min/Max statistics after dropping world population figures:

In [7]:
%%sql

SELECT MIN(population), MAX(population), MIN(population_growth), MAX(population_growth)
    FROM facts
   WHERE population <> (SELECT MAX(population) FROM facts)
 * sqlite:///factbook.db
Done.
Out[7]:
MIN(population) MAX(population) MIN(population_growth) MAX(population_growth)
0 1367485388 0.0 4.02

Let's find all countries meeting both of the following criteria:

  1. The population is above average.
  2. The area is below average.
In [8]:
%%sql

SELECT *
    FROM facts
   WHERE population > (SELECT AVG(population) FROM facts WHERE population <> (SELECT MAX(population) FROM facts))
   AND area < (SELECT AVG(area) FROM facts WHERE population <> (SELECT MAX(population) FROM facts))
 * sqlite:///factbook.db
Done.
Out[8]:
id code name area area_land area_water population population_growth birth_rate death_rate migration_rate
14 bg Bangladesh 148460 130170 18290 168957745 1.6 21.14 5.61 0.46
65 gm Germany 357022 348672 8350 80854408 0.17 8.47 11.42 1.24
80 iz Iraq 438317 437367 950 37056169 2.93 31.45 3.77 1.62
83 it Italy 301340 294140 7200 61855120 0.27 8.74 10.19 4.1
85 ja Japan 377915 364485 13430 126919659 0.16 7.93 9.51 0.0
88 ke Kenya 580367 569140 11227 45925301 1.93 26.4 6.89 0.22
91 ks Korea, South 99720 96920 2800 49115196 0.14 8.19 6.75 0.0
120 mo Morocco 446550 446300 250 33322699 1.0 18.2 4.81 3.36
138 rp Philippines 300000 298170 1830 100998376 1.61 24.27 6.11 2.09
139 pl Poland 312685 304255 8430 38562189 0.09 9.74 10.19 0.46
163 sp Spain 505370 498980 6390 48146134 0.89 9.64 9.04 8.31
173 th Thailand 513120 510890 2230 67976405 0.34 11.19 7.8 0.0
182 ug Uganda 241038 197100 43938 37101745 3.24 43.79 10.69 0.74
185 uk United Kingdom 243610 241930 1680 64088222 0.54 12.17 9.35 2.54
192 vm Vietnam 331210 310070 21140 94348835 0.97 15.96 5.93 0.3

Which country has the most people? Which country has the highest growth rate?

In [9]:
%%sql

SELECT name, MAX(population)
    FROM facts
   WHERE population <> (SELECT MAX(population) FROM facts)
 * sqlite:///factbook.db
Done.
Out[9]:
name MAX(population)
China 1367485388
In [10]:
%%sql

SELECT name, MAX(population_growth)
    FROM facts
   WHERE population <> (SELECT MAX(population) FROM facts)
 * sqlite:///factbook.db
Done.
Out[10]:
name MAX(population_growth)
South Sudan 4.02

Which countries have the highest ratios of water to land? Which countries have more water than land?

In [11]:
%%sql

SELECT name, MAX(area_water/area_land)
    FROM facts
   WHERE population <> (SELECT MAX(population) FROM facts)
 * sqlite:///factbook.db
Done.
Out[11]:
name MAX(area_water/area_land)
Virgin Islands 4
In [12]:
%%sql

SELECT name, (area_water-area_land) AS more_water
    FROM facts
   WHERE more_water > 0
 * sqlite:///factbook.db
Done.
Out[12]:
name more_water
British Indian Ocean Territory 54280
Virgin Islands 1218

Which countries will add the most people to their populations next year?

In [18]:
%%sql

SELECT name, (population * population_growth) AS new_people
    FROM facts
   WHERE population <> (SELECT MAX(population) FROM facts)
   ORDER BY new_people DESC
   LIMIT 5;
 * sqlite:///factbook.db
Done.
Out[18]:
name new_people
India 1527068612.48
China 615368424.6
Nigeria 444827037.20000005
Pakistan 290665336.62
Ethiopia 287456216.91

Which countries have a higher death rate than birth rate?

In [14]:
%%sql

SELECT name, ROUND(birth_rate-death_rate,1) AS more_birth
    FROM facts
   WHERE more_birth > 0
   ORDER BY more_birth DESC
   LIMIT 5;
 * sqlite:///factbook.db
Done.
Out[14]:
name more_birth
Malawi 33.2
Uganda 33.1
Niger 33.0
Burundi 32.7
Mali 32.1

Which countries have the highest population/area ratio, and how does it compare to list we found in the previous screen?

In [17]:
%%sql

SELECT name, (CAST(population AS Float)/CAST(area AS Float)) AS pop_area_ratio
    FROM facts
   ORDER BY pop_area_ratio
   LIMIT 5;
 * sqlite:///factbook.db
Done.
Out[17]:
name pop_area_ratio
Chad None
Niger None
Holy See (Vatican City) None
Ashmore and Cartier Islands None
Coral Sea Islands None

This wasn't the result we were expecting here. So let's explore the dataset for these countries and see what information is present in the population and area columns for these countries

In [19]:
%%sql

SELECT name, population, area
    FROM facts
   WHERE name IN ('Chad', 'Niger', 'Holy See (Vatican City)', 'Ashmore and Cartier Islands', 'Coral Sea Islands')
 * sqlite:///factbook.db
Done.
Out[19]:
name population area
Chad 11631456 None
Niger 18045729 None
Holy See (Vatican City) 842 0
Ashmore and Cartier Islands None 5
Coral Sea Islands None 3

We can see that population and area columns have "None" or "0" values for these countries. So, when calculating the population/area ratio we will have to ignore these countries.

In [23]:
%%sql

SELECT name, ROUND(CAST(population AS Float)/CAST(area AS Float),2) AS pop_area_ratio
    FROM facts
   WHERE population NOT IN ('None', '0') AND area NOT IN ('None', '0')
   ORDER BY pop_area_ratio DESC
   LIMIT 5;
 * sqlite:///factbook.db
Done.
Out[23]:
name pop_area_ratio
Macau 21168.96
Monaco 15267.5
Singapore 8141.28
Hong Kong 6445.04
Gaza Strip 5191.82

Let's calculate the pop_area_ratio for the countries, which have population above average and area below average.

In [25]:
%%sql

SELECT name, ROUND(CAST(population AS Float)/CAST(area AS Float),2) AS pop_area_ratio
    FROM facts
   WHERE population NOT IN ('None', '0') AND area NOT IN ('None', '0') AND population > (SELECT AVG(population) FROM facts WHERE population <> (SELECT MAX(population) FROM facts))
   AND area < (SELECT AVG(area) FROM facts WHERE population <> (SELECT MAX(population) FROM facts))
   ORDER BY pop_area_ratio DESC;
 * sqlite:///factbook.db
Done.
Out[25]:
name pop_area_ratio
Bangladesh 1138.07
Korea, South 492.53
Philippines 336.66
Japan 335.84
Vietnam 284.86
United Kingdom 263.08
Germany 226.47
Italy 205.27
Uganda 153.92
Thailand 132.48
Poland 123.33
Spain 95.27
Iraq 84.54
Kenya 79.13
Morocco 74.62
In [ ]: