Analyzing CIA Factbook Data Using SQL

In this project we will use SQL to analyze some of the data in the CIA World Factbook. The purpose is to show how data stored in a database can be analyzed with SQL.

The CIA World Factbook provides various data points about 267 world entities. Below are some examples of these data points:

  • Demographics
  • Enonomy
  • Transportation
  • Energy
  • Communications
  • Government
  • Military
  • Geography
  • History

The database file in this project contains a selection of these data points. We will use SQL to answer some basic questions about the data. Most of the time the answers will be observed in cell output immediately following code cells and will not be spelled out into separate sentences to avoid wordiness and duplication.

Connect to Database Stored Locally

In [1]:
%%capture
%load_ext sql
%sql sqlite:///factbook.db

List Tables Currently In The Database

In [2]:
%%sql

SELECT *
  FROM sqlite_master
 WHERE type = 'table';
 * sqlite:///factbook.db
Done.
Out[2]:
type name tbl_name rootpage sql
table sqlite_sequence sqlite_sequence 3 CREATE TABLE sqlite_sequence(name,seq)
table facts facts 47 CREATE TABLE "facts" ("id" INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, "code" varchar(255) NOT NULL, "name" varchar(255) NOT NULL, "area" integer, "area_land" integer, "area_water" integer, "population" integer, "population_growth" float, "birth_rate" float, "death_rate" float, "migration_rate" float)

Inspect the 'facts' Table

In [3]:
%%sql

SELECT *
  FROM facts
 LIMIT 5;
 * sqlite:///factbook.db
Done.
Out[3]:
id code name area area_land area_water population population_growth birth_rate death_rate migration_rate
1 af Afghanistan 652230 652230 0 32564342 2.32 38.57 13.89 1.51
2 al Albania 28748 27398 1350 3029278 0.3 12.92 6.58 3.3
3 ag Algeria 2381741 2381741 0 39542166 1.84 23.67 4.31 0.92
4 an Andorra 468 468 0 85580 0.12 8.13 6.96 0.0
5 ao Angola 1246700 1246700 0 19625353 2.78 38.78 11.49 0.46

This preview looks clean. The column headers are styled properly per database conventions.

The columns present in the database are:

  • id - unique identifier for each entry in the database
  • code - country code
  • name - country name
  • area - total land and water area in square kilometers
  • area_land - total land area in square kilometers
  • area_water - total water area in square kilometers
  • population - total population
  • population_growth- annual population growth as a percentage
  • birth_rate - number of people born per 1,000 people per year
  • death_rate - number of deaths per 1,000 people per year
  • migration_rate - migration rate. The DataQuest project instructions do not have a better description for this column. See more below.

We need to attempt to understand the migration_rate field more. Searching the CIA World Factbook site does not yield useful results because the database used for this project seems to have been selected by someone else who did not include the explanation for this field, or it was omitted by the author of this dataquest project. CIA has references to "net migration", which can be a negative or a potitive value. In this case, migration_rate does not appear to be the same as "net migration" because there are no negative values here (see cell below).

In [4]:
%%sql

SELECT name, migration_rate
  FROM facts
 WHERE migration_rate < 0;
 * sqlite:///factbook.db
Done.
Out[4]:
name migration_rate

Calculate Summary Statistics For Select Fields

In [5]:
%%sql

SELECT MIN(population),
       MAX(population),
       MIN(population_growth),
       MAX(population_growth)
  FROM facts;
 * sqlite:///factbook.db
Done.
Out[5]:
MIN(population) MAX(population) MIN(population_growth) MAX(population_growth)
0 7256490011 0.0 4.02

Minimum population is 0, which is either an error or is an entity that is not an inhabited area. Maximum population is over 7 billion, which appears to be the total population of the world. Minimum and maximum population growth seem plausible.

View Country Names With Minimum and Maximum Population

In [6]:
%%sql

SELECT name, population
  FROM facts
 WHERE population == (SELECT MIN(population)
                        FROM facts);
 * sqlite:///factbook.db
Done.
Out[6]:
name population
Antarctica 0
In [7]:
%%sql

SELECT name, population
  FROM facts
 WHERE population == (SELECT MAX(population)
                        FROM facts);
 * sqlite:///factbook.db
Done.
Out[7]:
name population
World 7256490011

This query confirms our suspicion that not all entities in the database are countries. Some are continents or the entire world.

Calculate Summary Statistics For Select Fields With Filtering

In [8]:
%%sql

SELECT MIN(population),
       MAX(population),
       MIN(population_growth),
       MAX(population_growth)
  FROM facts
 WHERE population <> (SELECT MAX(population) /* filter out maximum population */
                        FROM facts); 
 * sqlite:///factbook.db
Done.
Out[8]:
MIN(population) MAX(population) MIN(population_growth) MAX(population_growth)
0 1367485388 0.0 4.02
In [9]:
%%sql

SELECT ROUND(AVG(population), 0), ROUND(AVG(area), 0)
  FROM facts
 WHERE population <> (SELECT MAX(population) /* filter out maximum population */
                        FROM facts);
 * sqlite:///factbook.db
Done.
Out[9]:
ROUND(AVG(population), 0) ROUND(AVG(area), 0)
32242667.0 582950.0

Answer Other Questions About Data

See list of countries with above average population and below average area
In [10]:
%%sql

SELECT name
  FROM facts
 WHERE population > (SELECT AVG(population)
                       FROM facts)
   AND area       < (SELECT AVG(area)
                       FROM facts);
           
 * sqlite:///factbook.db
Done.
Out[10]:
name
Bangladesh
Germany
Japan
Philippines
Thailand
United Kingdom
Vietnam
List 5 top countries by population (descending)
In [11]:
%%sql

SELECT name, population, population_growth
  FROM facts
 WHERE name NOT IN ('World', 'European Union')
 ORDER BY population DESC
 LIMIT 5;
 * sqlite:///factbook.db
Done.
Out[11]:
name population population_growth
China 1367485388 0.45
India 1251695584 1.22
United States 321368864 0.78
Indonesia 255993674 0.92
Brazil 204259812 0.77
List 5 top countries by population growth rate (descending)
In [12]:
%%sql

SELECT name, population_growth
  FROM facts
 WHERE name NOT IN ('World', 'European Union')
 ORDER BY population_growth DESC
 LIMIT 5;
 * sqlite:///factbook.db
Done.
Out[12]:
name population_growth
South Sudan 4.02
Malawi 3.32
Burundi 3.28
Niger 3.25
Uganda 3.24
List top 5 countries with the highest ratio of water to land
In [13]:
%%sql

SELECT name, 
       ROUND(CAST(area_water AS Float) / CAST(area_land AS Float), 2) AS water_to_land_ratio
  FROM facts
 WHERE name <> 'British Indian Ocean Territory'
 ORDER BY water_to_land_ratio DESC
 LIMIT 5;       
 * sqlite:///factbook.db
Done.
Out[13]:
name water_to_land_ratio
Virgin Islands 4.52
Puerto Rico 0.55
Bahamas, The 0.39
Guinea-Bissau 0.28
Malawi 0.26
List all countries with more water than land
In [14]:
%%sql

SELECT name, area_land, area_water
  FROM facts
 WHERE area_land < area_water
   AND name <> 'British Indian Ocean Territory';
 * sqlite:///factbook.db
Done.
Out[14]:
name area_land area_water
Virgin Islands 346 1564
List 5 countries that will add most people to their population next year
In [15]:
%%sql

SELECT name, CAST(ROUND(population * population_growth / 100, 0) AS Int) AS population_added
  FROM facts
 WHERE name <> 'World'
 ORDER BY population_added DESC
 LIMIT 5;
 * sqlite:///factbook.db
Done.
Out[15]:
name population_added
India 15270686
China 6153684
Nigeria 4448270
Pakistan 2906653
Ethiopia 2874562

It's worth noting that India will add more than 15 million people as compared to China's 6 million.

Quantify countries with higher death rate than birth rate, then list them
In [16]:
%%sql

SELECT COUNT(*) AS number_of_countries_with_death_rate_higher_than_birth_rate
  FROM facts
 WHERE birth_rate - death_rate < 0;
 * sqlite:///factbook.db
Done.
Out[16]:
number_of_countries_with_death_rate_higher_than_birth_rate
24
In [17]:
%%sql

SELECT name, birth_rate, death_rate, ROUND(birth_rate - death_rate, 2) AS difference
  FROM facts
 WHERE difference < 0
 ORDER BY difference;
 * sqlite:///factbook.db
Done.
Out[17]:
name birth_rate death_rate difference
Bulgaria 8.92 14.44 -5.52
Serbia 9.08 13.66 -4.58
Latvia 10.0 14.31 -4.31
Lithuania 10.1 14.27 -4.17
Ukraine 10.72 14.46 -3.74
Hungary 9.16 12.73 -3.57
Germany 8.47 11.42 -2.95
Slovenia 8.42 11.37 -2.95
Romania 9.14 11.9 -2.76
Croatia 9.45 12.18 -2.73
Belarus 10.7 13.36 -2.66
Monaco 6.65 9.24 -2.59
Greece 8.66 11.09 -2.43
Saint Pierre and Miquelon 7.42 9.72 -2.3
Russia 11.6 13.69 -2.09
Estonia 10.51 12.4 -1.89
Portugal 9.27 11.02 -1.75
Japan 7.93 9.51 -1.58
Italy 8.74 10.19 -1.45
Bosnia and Herzegovina 8.87 9.75 -0.88
Czech Republic 9.63 10.34 -0.71
Moldova 12.0 12.59 -0.59
Poland 9.74 10.19 -0.45
Austria 9.41 9.42 -0.01
List 5 counties with the highest population density
In [18]:
%%sql

SELECT name, population, area_land, population / area_land AS population_density
  FROM facts
 ORDER BY population_density DESC
 LIMIT 5;
 * sqlite:///factbook.db
Done.
Out[18]:
name population area_land population_density
Macau 592731 28 21168
Monaco 30535 2 15267
Singapore 5674472 687 8259
Hong Kong 7141106 1073 6655
Gaza Strip 1869055 360 5191

These results are different from above where we listed countries with above average population and below average areas. This is because the highest density countries tend to be very small in land area and never above average in total population.

Conclusion

This project demonstrated simple SQL usage to answer basic questions about a dataset containing mostly numeric data.