In this project, we'll work with data from the CIA World Factbook, a compendium of statistics about all of the countries on Earth. The Factbook contains demographic information like the following:
We will use SQLite in Jupyter Notebook to analyze data from this database. The questions we're looking to answer are:
Before we can do that, we need to connect to our database.
#Connect to the DB
#%%capture -- gives error in DQ site. Submitted bug report 3/14/22.
#Project appears to work without it
%reload_ext sql
%sql sqlite:///factbook.db
#'Connected: None@factbook.db' indicates success
'Connected: None@factbook.db'
Our first step is to get an overview of the data
%%sql
SELECT *
FROM sqlite_master
WHERE type='table';
Done.
type | name | tbl_name | rootpage | sql |
---|---|---|---|---|
table | sqlite_sequence | sqlite_sequence | 3 | CREATE TABLE sqlite_sequence(name,seq) |
table | facts | facts | 47 | CREATE TABLE "facts" ("id" INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, "code" varchar(255) NOT NULL, "name" varchar(255) NOT NULL, "area" integer, "area_land" integer, "area_water" integer, "population" integer, "population_growth" float, "birth_rate" float, "death_rate" float, "migration_rate" float) |
We have a fairly simple database. The SQLite internal table sqlite_sequence is used to store information about SQLite: autoincrement columns. This means we only have one data table, called "facts". Let's take a look at the first five rows of the facts table.
%%sql
SELECT *
FROM facts
LIMIT 5
Done.
id | code | name | area | area_land | area_water | population | population_growth | birth_rate | death_rate | migration_rate |
---|---|---|---|---|---|---|---|---|---|---|
1 | af | Afghanistan | 652230 | 652230 | 0 | 32564342 | 2.32 | 38.57 | 13.89 | 1.51 |
2 | al | Albania | 28748 | 27398 | 1350 | 3029278 | 0.3 | 12.92 | 6.58 | 3.3 |
3 | ag | Algeria | 2381741 | 2381741 | 0 | 39542166 | 1.84 | 23.67 | 4.31 | 0.92 |
4 | an | Andorra | 468 | 468 | 0 | 85580 | 0.12 | 8.13 | 6.96 | 0.0 |
5 | ao | Angola | 1246700 | 1246700 | 0 | 19625353 | 2.78 | 38.78 | 11.49 | 0.46 |
There are a total of 11 columns. Some of the ones we're most interested are:
With this knowledge, we can now begin our analysis.
Our first analysis step will be to find and calculate some summary statiscics, namely The minimum and maximum populations and minimum and maximum population growth.
%%sql
SELECT MIN(population) AS minimum_pop,
MAX(population) AS max_pop,
MIN(population_growth) AS min_pop_growth,
MAX(population_growth) AS max_pop_growth
FROM facts
Done.
minimum_pop | max_pop | min_pop_growth | max_pop_growth |
---|---|---|---|
0 | 7256490011 | 0.0 | 4.02 |
We have a country with zero population and one with a populaton of 7.25 Billion (the approximate population of the world). Let's take a look at the countries with the highest and lowest populations
%%sql
SELECT name, population
FROM facts
ORDER BY population
DESC
LIMIT 5
Done.
name | population |
---|---|
World | 7256490011 |
China | 1367485388 |
India | 1251695584 |
European Union | 513949445 |
United States | 321368864 |
%%sql
SELECT name, population
FROM facts
ORDER BY population
LIMIT 21 --Need 21 rows to get the first row with any population
Done.
name | population |
---|---|
Ashmore and Cartier Islands | None |
Coral Sea Islands | None |
Heard Island and McDonald Islands | None |
Clipperton Island | None |
French Southern and Antarctic Lands | None |
Bouvet Island | None |
Jan Mayen | None |
British Indian Ocean Territory | None |
South Georgia and South Sandwich Islands | None |
Navassa Island | None |
Wake Island | None |
United States Pacific Island Wildlife Refuges | None |
Paracel Islands | None |
Spratly Islands | None |
Arctic Ocean | None |
Atlantic Ocean | None |
Indian Ocean | None |
Pacific Ocean | None |
Southern Ocean | None |
Antarctica | 0 |
Pitcairn Islands | 48 |
The data include several rows with "None" for population. These include oceans, which make sense and several island chains. According to the CIA Factbook pages for several of these islands, they are all uninhabited or only house temporary research personnel.
There is also one row with zero population, which is Antartica. the CIA Factbook page for Antartica shows a similar status to the islands:
53 countries have signed the 1959 Antarctic Treaty; 30 of those operate through their National Antarctic Program a number of seasonal-only (summer) and year-round research stations on the continent and its nearby islands south of 60 degrees south latitude (the region covered by the Antarctic Treaty); the population engaging in and supporting science or managing and protecting the Antarctic region varies from approximately 4,400 in summer to 1,100 in winter; in addition, approximately 1,000 personnel, including ship's crew and scientists doing onboard research, are present in the waters of the treaty region.
We also have a row for the entire world, which explains the population over 7.25 billion. We can exclude all of these rows to find the actual countries with the highest and lowest populations.
%%sql
SELECT MIN(population) AS minimum_pop,
MAX(population) AS max_pop,
MIN(population_growth) AS min_pop_growth,
MAX(population_growth) AS max_pop_growth
FROM facts
WHERE population != (SELECT
MAX(population)
FROM facts)
AND
population != (SELECT
MIN(population) --Null values excluded by default
FROM facts)
Done.
minimum_pop | max_pop | min_pop_growth | max_pop_growth |
---|---|---|---|
48 | 1367485388 | 0.0 | 4.02 |
These numbers look much better, but we still have at least one country with zero growth. Well take a look at which countries those are.
%%sql
SELECT *
FROM facts
WHERE population_growth = (SELECT
MIN(population_growth)
FROM facts)
Done.
id | code | name | area | area_land | area_water | population | population_growth | birth_rate | death_rate | migration_rate |
---|---|---|---|---|---|---|---|---|---|---|
190 | vt | Holy See (Vatican City) | 0 | 0 | 0 | 842 | 0.0 | None | None | None |
200 | ck | Cocos (Keeling) Islands | 14 | 14 | 0 | 596 | 0.0 | None | None | None |
207 | gl | Greenland | 2166086 | 2166086 | None | 57733 | 0.0 | 14.48 | 8.49 | 5.98 |
238 | pc | Pitcairn Islands | 47 | 47 | 0 | 48 | 0.0 | None | None | None |
We have 2 small islands with very small populations. It makes sense that the populations would not fluctuate much. The third "country" is Vatican city. As it's controlled by the Catholic Church, they can maintain the population so it would not necessarily be prone to normal population influences. The final country is Greenland. According to the Factbook, the birth rate is approximately 13.79 per 1000 people. This is balanced by the death rate of 8.96 per 1000 and immigration rate of -5.07 (14.03 total) for a net of -.24 people per 1000. It seems unlikely that the population is exactly the same from year to year, but accounting for polling methods, statistical errors and rounding, the number is about right.
We now turn our analysis to the average population and area of the countries of the world. We don't want to include Antartica or the whole world in this either, so we'll exclude them again.
%%sql
SELECT ROUND(AVG(population),2) AS avg_pop,
ROUND(AVG(area_land),2) AS avg_area_land
FROM facts
WHERE population != (SELECT --exclude row with world population
MAX(population)
FROM facts)
AND
population != (SELECT --Exclude Zero population (Antartica) and Null by default
MIN(population)
FROM facts)
Done.
avg_pop | avg_area_land |
---|---|
32377011.01 | 550643.76 |
The average population is approximately 32,377,000 and the average land area is 551,000
Our final analysis step will be to find the most densely populated countries in the world. To do this, we'll look for countries with populations higher than the average and land areas lower than the average.
%%sql
SELECT name, population, area_land, population/area_land AS population_density
FROM facts
WHERE population > (SELECT AVG(population)
FROM facts)
AND
area_land < (SELECT AVG(area_land)
FROM facts)
ORDER BY population_density
DESC
Done.
name | population | area_land | population_density |
---|---|---|---|
Bangladesh | 168957745 | 130170 | 1297 |
Japan | 126919659 | 364485 | 348 |
Philippines | 100998376 | 298170 | 338 |
Vietnam | 94348835 | 310070 | 304 |
United Kingdom | 64088222 | 241930 | 264 |
Germany | 80854408 | 348672 | 231 |
Thailand | 67976405 | 510890 | 133 |
The drawback to this method is that it might exclude countries where the population is lower than average but has an extremely small land area or the land area is above average and the population is still very high. Excluding the filters, we get:
%%sql
SELECT name, population, area_land, population/area_land AS population_density
FROM facts
ORDER BY population_density
DESC
LIMIT 20
Done.
name | population | area_land | population_density |
---|---|---|---|
Macau | 592731 | 28 | 21168 |
Monaco | 30535 | 2 | 15267 |
Singapore | 5674472 | 687 | 8259 |
Hong Kong | 7141106 | 1073 | 6655 |
Gaza Strip | 1869055 | 360 | 5191 |
Gibraltar | 29258 | 6 | 4876 |
Bahrain | 1346613 | 760 | 1771 |
Maldives | 393253 | 298 | 1319 |
Malta | 413965 | 316 | 1310 |
Bermuda | 70196 | 54 | 1299 |
Bangladesh | 168957745 | 130170 | 1297 |
Sint Maarten | 39689 | 34 | 1167 |
Guernsey | 66080 | 78 | 847 |
Jersey | 97294 | 116 | 838 |
Taiwan | 23415126 | 32260 | 725 |
Barbados | 290604 | 430 | 675 |
Mauritius | 1339827 | 2030 | 660 |
Aruba | 112162 | 180 | 623 |
Lebanon | 6184701 | 10230 | 604 |
Saint Martin | 31754 | 54 | 588 |
Now Bangladesh, which was our highest density country is actually 11th!
In this analysis, we reviewed some basic statistics about the countries of the world based on the CIA World Factbook. Due to some quirks of the data such as "countries" with zero population and an entry for the entire world, we had to mainipulate our queries a bit, but in the end we were able to answer all of the questions about the countries having the highest and lowest populations and population densities.