Analyzing CIA Factbook Data Using SQL

Introduction

In this project, we'll work with data from the CIA World Factbook, a compendium of statistics about all of the countries on Earth. The Factbook contains demographic information like the following:

  • population — the global population.
  • population_growth — the annual population growth rate, as a percentage.
  • area_land — the total land area in kilometers squared.

We will use SQLite in Jupyter Notebook to analyze data from this database. The questions we're looking to answer are:

  1. What are the summarry statistics (Minimum and maximum populations and minimum and maximum population growth.
  2. What is the average population? Average Area?
  3. What countries have above average populations and which have below average areas? Which countries have the highest population density

Before we can do that, we need to connect to our database.

In [4]:
#Connect to the DB
#%%capture -- gives error in DQ site. Submitted bug report 3/14/22. 
               #Project appears to work without it
%reload_ext sql
%sql sqlite:///factbook.db
        #'Connected: [email protected]' indicates success
Out[4]:
'Connected: [email protected]'

Overview of the Data

Our first step is to get an overview of the data

In [5]:
%%sql
SELECT *
FROM sqlite_master
WHERE type='table';
Done.
Out[5]:
type name tbl_name rootpage sql
table sqlite_sequence sqlite_sequence 3 CREATE TABLE sqlite_sequence(name,seq)
table facts facts 47 CREATE TABLE "facts" ("id" INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, "code" varchar(255) NOT NULL, "name" varchar(255) NOT NULL, "area" integer, "area_land" integer, "area_water" integer, "population" integer, "population_growth" float, "birth_rate" float, "death_rate" float, "migration_rate" float)

We have a fairly simple database. The SQLite internal table sqlite_sequence is used to store information about SQLite: autoincrement columns. This means we only have one data table, called "facts". Let's take a look at the first five rows of the facts table.

In [6]:
%%sql
SELECT * 
FROM facts
LIMIT 5
Done.
Out[6]:
id code name area area_land area_water population population_growth birth_rate death_rate migration_rate
1 af Afghanistan 652230 652230 0 32564342 2.32 38.57 13.89 1.51
2 al Albania 28748 27398 1350 3029278 0.3 12.92 6.58 3.3
3 ag Algeria 2381741 2381741 0 39542166 1.84 23.67 4.31 0.92
4 an Andorra 468 468 0 85580 0.12 8.13 6.96 0.0
5 ao Angola 1246700 1246700 0 19625353 2.78 38.78 11.49 0.46

There are a total of 11 columns. Some of the ones we're most interested are:

  • name — the name of the country.
  • area— the country's total area (both land and water).
  • area_land — the country's land area in square kilometers.
  • area_water — the country's waterarea in square kilometers.
  • population — the country's population.
  • population_growth— the country's population growth as a percentage.
  • birth_rate — the country's birth rate, or the number of births per year per 1,000 people.
  • death_rate — the country's death rate, or the number of death per year per 1,000 people.

With this knowledge, we can now begin our analysis.

Summary Statistics

Our first analysis step will be to find and calculate some summary statiscics, namely The minimum and maximum populations and minimum and maximum population growth.

In [7]:
%%sql
SELECT MIN(population) AS minimum_pop, 
    MAX(population) AS max_pop, 
    MIN(population_growth) AS min_pop_growth,
    MAX(population_growth) AS max_pop_growth
FROM facts
Done.
Out[7]:
minimum_pop max_pop min_pop_growth max_pop_growth
0 7256490011 0.0 4.02

We have a country with zero population and one with a populaton of 7.25 Billion (the approximate population of the world). Let's take a look at the countries with the highest and lowest populations

In [8]:
%%sql
SELECT name, population
FROM facts
ORDER BY population
DESC
LIMIT 5
Done.
Out[8]:
name population
World 7256490011
China 1367485388
India 1251695584
European Union 513949445
United States 321368864
In [9]:
%%sql
SELECT name, population
FROM facts
ORDER BY population
LIMIT 21  --Need 21 rows to get the first row with any population
Done.
Out[9]:
name population
Ashmore and Cartier Islands None
Coral Sea Islands None
Heard Island and McDonald Islands None
Clipperton Island None
French Southern and Antarctic Lands None
Bouvet Island None
Jan Mayen None
British Indian Ocean Territory None
South Georgia and South Sandwich Islands None
Navassa Island None
Wake Island None
United States Pacific Island Wildlife Refuges None
Paracel Islands None
Spratly Islands None
Arctic Ocean None
Atlantic Ocean None
Indian Ocean None
Pacific Ocean None
Southern Ocean None
Antarctica 0
Pitcairn Islands 48

The data include several rows with "None" for population. These include oceans, which make sense and several island chains. According to the CIA Factbook pages for several of these islands, they are all uninhabited or only house temporary research personnel.

There is also one row with zero population, which is Antartica. the CIA Factbook page for Antartica shows a similar status to the islands:

53 countries have signed the 1959 Antarctic Treaty; 30 of those operate through their National Antarctic Program a number of seasonal-only (summer) and year-round research stations on the continent and its nearby islands south of 60 degrees south latitude (the region covered by the Antarctic Treaty); the population engaging in and supporting science or managing and protecting the Antarctic region varies from approximately 4,400 in summer to 1,100 in winter; in addition, approximately 1,000 personnel, including ship's crew and scientists doing onboard research, are present in the waters of the treaty region.

We also have a row for the entire world, which explains the population over 7.25 billion. We can exclude all of these rows to find the actual countries with the highest and lowest populations.

In [15]:
%%sql
SELECT MIN(population) AS minimum_pop, 
    MAX(population) AS max_pop, 
    MIN(population_growth) AS min_pop_growth,
    MAX(population_growth) AS max_pop_growth
FROM facts
    WHERE population != (SELECT 
                    MAX(population) 
                    FROM facts)
                AND 
    population != (SELECT 
                    MIN(population) --Null values excluded by default
                    FROM facts)
Done.
Out[15]:
minimum_pop max_pop min_pop_growth max_pop_growth
48 1367485388 0.0 4.02

These numbers look much better, but we still have at least one country with zero growth. Well take a look at which countries those are.

In [11]:
%%sql
SELECT * 
FROM facts 
WHERE population_growth = (SELECT 
                    MIN(population_growth) 
                    FROM facts)
Done.
Out[11]:
id code name area area_land area_water population population_growth birth_rate death_rate migration_rate
190 vt Holy See (Vatican City) 0 0 0 842 0.0 None None None
200 ck Cocos (Keeling) Islands 14 14 0 596 0.0 None None None
207 gl Greenland 2166086 2166086 None 57733 0.0 14.48 8.49 5.98
238 pc Pitcairn Islands 47 47 0 48 0.0 None None None

We have 2 small islands with very small populations. It makes sense that the populations would not fluctuate much. The third "country" is Vatican city. As it's controlled by the Catholic Church, they can maintain the population so it would not necessarily be prone to normal population influences. The final country is Greenland. According to the Factbook, the birth rate is approximately 13.79 per 1000 people. This is balanced by the death rate of 8.96 per 1000 and immigration rate of -5.07 (14.03 total) for a net of -.24 people per 1000. It seems unlikely that the population is exactly the same from year to year, but accounting for polling methods, statistical errors and rounding, the number is about right.

Exploring Average Population and Area

We now turn our analysis to the average population and area of the countries of the world. We don't want to include Antartica or the whole world in this either, so we'll exclude them again.

In [12]:
%%sql
SELECT ROUND(AVG(population),2) AS avg_pop, 
        ROUND(AVG(area_land),2) AS avg_area_land
FROM facts
    WHERE population != (SELECT --exclude row with world population
                    MAX(population) 
                    FROM facts)
                AND 
    population != (SELECT      --Exclude Zero population (Antartica) and Null by default
                    MIN(population) 
                    FROM facts)
Done.
Out[12]:
avg_pop avg_area_land
32377011.01 550643.76

The average population is approximately 32,377,000 and the average land area is 551,000

Finding Densely Populated Countries

Our final analysis step will be to find the most densely populated countries in the world. To do this, we'll look for countries with populations higher than the average and land areas lower than the average.

In [13]:
%%sql
SELECT name, population, area_land, population/area_land AS population_density
FROM facts
WHERE population > (SELECT AVG(population)
                   FROM facts)
    AND
    area_land < (SELECT AVG(area_land)
                FROM facts)
ORDER BY population_density
DESC
Done.
Out[13]:
name population area_land population_density
Bangladesh 168957745 130170 1297
Japan 126919659 364485 348
Philippines 100998376 298170 338
Vietnam 94348835 310070 304
United Kingdom 64088222 241930 264
Germany 80854408 348672 231
Thailand 67976405 510890 133

The drawback to this method is that it might exclude countries where the population is lower than average but has an extremely small land area or the land area is above average and the population is still very high. Excluding the filters, we get:

In [14]:
%%sql
SELECT name, population, area_land, population/area_land AS population_density
FROM facts
ORDER BY population_density
DESC
LIMIT 20
Done.
Out[14]:
name population area_land population_density
Macau 592731 28 21168
Monaco 30535 2 15267
Singapore 5674472 687 8259
Hong Kong 7141106 1073 6655
Gaza Strip 1869055 360 5191
Gibraltar 29258 6 4876
Bahrain 1346613 760 1771
Maldives 393253 298 1319
Malta 413965 316 1310
Bermuda 70196 54 1299
Bangladesh 168957745 130170 1297
Sint Maarten 39689 34 1167
Guernsey 66080 78 847
Jersey 97294 116 838
Taiwan 23415126 32260 725
Barbados 290604 430 675
Mauritius 1339827 2030 660
Aruba 112162 180 623
Lebanon 6184701 10230 604
Saint Martin 31754 54 588

Now Bangladesh, which was our highest density country is actually 11th!

Conclusion

In this analysis, we reviewed some basic statistics about the countries of the world based on the CIA World Factbook. Due to some quirks of the data such as "countries" with zero population and an entry for the entire world, we had to mainipulate our queries a bit, but in the end we were able to answer all of the questions about the countries having the highest and lowest populations and population densities.

In [ ]: