In this project we will use SQL to analyze some of the data in the CIA World Factbook. The purpose is to show how data stored in a database can be analyzed with SQL.
The CIA World Factbook provides various data points about 267 world entities. Below are some examples of these data points:
The database file in this project contains a selection of these data points. We will use SQL to answer some basic questions about the data. Most of the time the answers will be observed in cell output immediately following code cells and will not be spelled out into separate sentences to avoid wordiness and duplication.
%%capture
%load_ext sql
%sql sqlite:///factbook.db
%%sql
SELECT *
FROM sqlite_master
WHERE type = 'table';
* sqlite:///factbook.db Done.
type | name | tbl_name | rootpage | sql |
---|---|---|---|---|
table | sqlite_sequence | sqlite_sequence | 3 | CREATE TABLE sqlite_sequence(name,seq) |
table | facts | facts | 47 | CREATE TABLE "facts" ("id" INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, "code" varchar(255) NOT NULL, "name" varchar(255) NOT NULL, "area" integer, "area_land" integer, "area_water" integer, "population" integer, "population_growth" float, "birth_rate" float, "death_rate" float, "migration_rate" float) |
%%sql
SELECT *
FROM facts
LIMIT 5;
* sqlite:///factbook.db Done.
id | code | name | area | area_land | area_water | population | population_growth | birth_rate | death_rate | migration_rate |
---|---|---|---|---|---|---|---|---|---|---|
1 | af | Afghanistan | 652230 | 652230 | 0 | 32564342 | 2.32 | 38.57 | 13.89 | 1.51 |
2 | al | Albania | 28748 | 27398 | 1350 | 3029278 | 0.3 | 12.92 | 6.58 | 3.3 |
3 | ag | Algeria | 2381741 | 2381741 | 0 | 39542166 | 1.84 | 23.67 | 4.31 | 0.92 |
4 | an | Andorra | 468 | 468 | 0 | 85580 | 0.12 | 8.13 | 6.96 | 0.0 |
5 | ao | Angola | 1246700 | 1246700 | 0 | 19625353 | 2.78 | 38.78 | 11.49 | 0.46 |
This preview looks clean. The column headers are styled properly per database conventions.
The columns present in the database are:
id
- unique identifier for each entry in the databasecode
- country codename
- country namearea
- total land and water area in square kilometersarea_land
- total land area in square kilometersarea_water
- total water area in square kilometerspopulation
- total populationpopulation_growth
- annual population growth as a percentagebirth_rate
- number of people born per 1,000 people per yeardeath_rate
- number of deaths per 1,000 people per yearmigration_rate
- migration rate. The DataQuest project instructions do not have a better description for this column. See more below.We need to attempt to understand the migration_rate
field more. Searching the CIA World Factbook site does not yield useful results because the database used for this project seems to have been selected by someone else who did not include the explanation for this field, or it was omitted by the author of this dataquest project. CIA has references to "net migration", which can be a negative or a potitive value. In this case, migration_rate
does not appear to be the same as "net migration" because there are no negative values here (see cell below).
%%sql
SELECT name, migration_rate
FROM facts
WHERE migration_rate < 0;
* sqlite:///factbook.db Done.
name | migration_rate |
---|
%%sql
SELECT MIN(population),
MAX(population),
MIN(population_growth),
MAX(population_growth)
FROM facts;
* sqlite:///factbook.db Done.
MIN(population) | MAX(population) | MIN(population_growth) | MAX(population_growth) |
---|---|---|---|
0 | 7256490011 | 0.0 | 4.02 |
Minimum population is 0, which is either an error or is an entity that is not an inhabited area. Maximum population is over 7 billion, which appears to be the total population of the world. Minimum and maximum population growth seem plausible.
%%sql
SELECT name, population
FROM facts
WHERE population == (SELECT MIN(population)
FROM facts);
* sqlite:///factbook.db Done.
name | population |
---|---|
Antarctica | 0 |
%%sql
SELECT name, population
FROM facts
WHERE population == (SELECT MAX(population)
FROM facts);
* sqlite:///factbook.db Done.
name | population |
---|---|
World | 7256490011 |
This query confirms our suspicion that not all entities in the database are countries. Some are continents or the entire world.
%%sql
SELECT MIN(population),
MAX(population),
MIN(population_growth),
MAX(population_growth)
FROM facts
WHERE population <> (SELECT MAX(population) /* filter out maximum population */
FROM facts);
* sqlite:///factbook.db Done.
MIN(population) | MAX(population) | MIN(population_growth) | MAX(population_growth) |
---|---|---|---|
0 | 1367485388 | 0.0 | 4.02 |
%%sql
SELECT ROUND(AVG(population), 0), ROUND(AVG(area), 0)
FROM facts
WHERE population <> (SELECT MAX(population) /* filter out maximum population */
FROM facts);
* sqlite:///factbook.db Done.
ROUND(AVG(population), 0) | ROUND(AVG(area), 0) |
---|---|
32242667.0 | 582950.0 |
%%sql
SELECT name
FROM facts
WHERE population > (SELECT AVG(population)
FROM facts)
AND area < (SELECT AVG(area)
FROM facts);
* sqlite:///factbook.db Done.
name |
---|
Bangladesh |
Germany |
Japan |
Philippines |
Thailand |
United Kingdom |
Vietnam |
%%sql
SELECT name, population, population_growth
FROM facts
WHERE name NOT IN ('World', 'European Union')
ORDER BY population DESC
LIMIT 5;
* sqlite:///factbook.db Done.
name | population | population_growth |
---|---|---|
China | 1367485388 | 0.45 |
India | 1251695584 | 1.22 |
United States | 321368864 | 0.78 |
Indonesia | 255993674 | 0.92 |
Brazil | 204259812 | 0.77 |
%%sql
SELECT name, population_growth
FROM facts
WHERE name NOT IN ('World', 'European Union')
ORDER BY population_growth DESC
LIMIT 5;
* sqlite:///factbook.db Done.
name | population_growth |
---|---|
South Sudan | 4.02 |
Malawi | 3.32 |
Burundi | 3.28 |
Niger | 3.25 |
Uganda | 3.24 |
%%sql
SELECT name,
ROUND(CAST(area_water AS Float) / CAST(area_land AS Float), 2) AS water_to_land_ratio
FROM facts
WHERE name <> 'British Indian Ocean Territory'
ORDER BY water_to_land_ratio DESC
LIMIT 5;
* sqlite:///factbook.db Done.
name | water_to_land_ratio |
---|---|
Virgin Islands | 4.52 |
Puerto Rico | 0.55 |
Bahamas, The | 0.39 |
Guinea-Bissau | 0.28 |
Malawi | 0.26 |
%%sql
SELECT name, area_land, area_water
FROM facts
WHERE area_land < area_water
AND name <> 'British Indian Ocean Territory';
* sqlite:///factbook.db Done.
name | area_land | area_water |
---|---|---|
Virgin Islands | 346 | 1564 |
%%sql
SELECT name, CAST(ROUND(population * population_growth / 100, 0) AS Int) AS population_added
FROM facts
WHERE name <> 'World'
ORDER BY population_added DESC
LIMIT 5;
* sqlite:///factbook.db Done.
name | population_added |
---|---|
India | 15270686 |
China | 6153684 |
Nigeria | 4448270 |
Pakistan | 2906653 |
Ethiopia | 2874562 |
It's worth noting that India will add more than 15 million people as compared to China's 6 million.
%%sql
SELECT COUNT(*) AS number_of_countries_with_death_rate_higher_than_birth_rate
FROM facts
WHERE birth_rate - death_rate < 0;
* sqlite:///factbook.db Done.
number_of_countries_with_death_rate_higher_than_birth_rate |
---|
24 |
%%sql
SELECT name, birth_rate, death_rate, ROUND(birth_rate - death_rate, 2) AS difference
FROM facts
WHERE difference < 0
ORDER BY difference;
* sqlite:///factbook.db Done.
name | birth_rate | death_rate | difference |
---|---|---|---|
Bulgaria | 8.92 | 14.44 | -5.52 |
Serbia | 9.08 | 13.66 | -4.58 |
Latvia | 10.0 | 14.31 | -4.31 |
Lithuania | 10.1 | 14.27 | -4.17 |
Ukraine | 10.72 | 14.46 | -3.74 |
Hungary | 9.16 | 12.73 | -3.57 |
Germany | 8.47 | 11.42 | -2.95 |
Slovenia | 8.42 | 11.37 | -2.95 |
Romania | 9.14 | 11.9 | -2.76 |
Croatia | 9.45 | 12.18 | -2.73 |
Belarus | 10.7 | 13.36 | -2.66 |
Monaco | 6.65 | 9.24 | -2.59 |
Greece | 8.66 | 11.09 | -2.43 |
Saint Pierre and Miquelon | 7.42 | 9.72 | -2.3 |
Russia | 11.6 | 13.69 | -2.09 |
Estonia | 10.51 | 12.4 | -1.89 |
Portugal | 9.27 | 11.02 | -1.75 |
Japan | 7.93 | 9.51 | -1.58 |
Italy | 8.74 | 10.19 | -1.45 |
Bosnia and Herzegovina | 8.87 | 9.75 | -0.88 |
Czech Republic | 9.63 | 10.34 | -0.71 |
Moldova | 12.0 | 12.59 | -0.59 |
Poland | 9.74 | 10.19 | -0.45 |
Austria | 9.41 | 9.42 | -0.01 |
%%sql
SELECT name, population, area_land, population / area_land AS population_density
FROM facts
ORDER BY population_density DESC
LIMIT 5;
* sqlite:///factbook.db Done.
name | population | area_land | population_density |
---|---|---|---|
Macau | 592731 | 28 | 21168 |
Monaco | 30535 | 2 | 15267 |
Singapore | 5674472 | 687 | 8259 |
Hong Kong | 7141106 | 1073 | 6655 |
Gaza Strip | 1869055 | 360 | 5191 |
These results are different from above where we listed countries with above average population and below average areas. This is because the highest density countries tend to be very small in land area and never above average in total population.
This project demonstrated simple SQL usage to answer basic questions about a dataset containing mostly numeric data.