DB: Analysis on The CIA World Factbook

This project works with data from the CIA World Factbook, a compendium of statistics about all of the countries on Earth. The Factbook contains demographic information below. To name a few are:

  • name - The name of the country.
  • area- The country's total area (both land and water).
  • area_land - The country's land area in square kilometers.
  • area_water - The country's waterarea in square kilometers.
  • population - The population as of 2015.
  • population_growth - The annual population growth rate, as a percentage.
  • birth_rate - The country's birth rate, or the number of births a year per 1,000 people.
  • death_rate - The country's death rate, or the number of deaths a year per 1,000 people.

Aim of the Project

The aim of this project is to use SQL in Jupyter Notebook to explore and analyze data from this database. To work on this project locally on your computer, you can download the SQLite factbook.db database by clicking the Download button on the right.

Connecting Jupter Notebook to the database file.

In [1]:
%%capture
%load_ext sql
%sql sqlite:///factbook.db
Out[1]:
'Connected: [email protected]'

The code above connects the external database file to the jupyter notebook console.

Obtaining information about the table.

In order to extract useful information such as table Name, the code below helps with that.

In [2]:
%%sql
SELECT *
    FROM sqlite_master
    WHERE type='table';
Done.
Out[2]:
type name tbl_name rootpage sql
table sqlite_sequence sqlite_sequence 3 CREATE TABLE sqlite_sequence(name,seq)
table facts facts 47 CREATE TABLE "facts" ("id" INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, "code" varchar(255) NOT NULL, "name" varchar(255) NOT NULL, "area" integer, "area_land" integer, "area_water" integer, "population" integer, "population_growth" float, "birth_rate" float, "death_rate" float, "migration_rate" float)

Based on the information gotten above, the database contains a table called facts.

Exploring the table.

To explore the contents of the table, such as the various columns and rows contained in the facts table, the code below can be used to achieve this, displaying only the first 5 rows of the table.

In [3]:
%%sql
SELECT * 
    FROM facts
    LIMIT 5;
Done.
Out[3]:
id code name area area_land area_water population population_growth birth_rate death_rate migration_rate
1 af Afghanistan 652230 652230 0 32564342 2.32 38.57 13.89 1.51
2 al Albania 28748 27398 1350 3029278 0.3 12.92 6.58 3.3
3 ag Algeria 2381741 2381741 0 39542166 1.84 23.67 4.31 0.92
4 an Andorra 468 468 0 85580 0.12 8.13 6.96 0.0
5 ao Angola 1246700 1246700 0 19625353 2.78 38.78 11.49 0.46

Performing Summary Statistics

To begin analysis, first off I start by checking for outliers in the columns population and population_growth.

This is achieved by carrying out some statistical operations (such as calculating the MAX and MIN values) on the respective columns.

In [4]:
%%sql
SELECT MAX(population), MIN(population),
        MAX(population_growth), MIN(population_growth)
    FROM facts;
Done.
Out[4]:
MAX(population) MIN(population) MAX(population_growth) MIN(population_growth)
7256490011 0 4.02 0.0

Looking at the data gotten from the summary statistics performed, it points out that there's a country with a population of more than 7billion people which is the highest population amongst all the countries and also not very reasonable. Also, another with a record 0, which has the least population which quite frankly doesn't make any sense at all.

Moving further, I'll try exploring the data contained in the table to see if they're trends or meaning can be obtained from the data.

To do so I'll display the top 5 countries with the Maximum and Minimun number of people (population).

Determining the Top 5 and Least 5 countries with the Highest and Least Population

In [5]:
%%sql
SELECT id, name, population
    FROM facts
    ORDER BY population DESC
    LIMIT 5;
Done.
Out[5]:
id name population
261 World 7256490011
37 China 1367485388
77 India 1251695584
197 European Union 513949445
186 United States 321368864
In [6]:
%%sql
SELECT id, name, population
    FROM facts
    WHERE population >= 0
    ORDER BY population ASC
    LIMIT 5;
Done.
Out[6]:
id name population
250 Antarctica 0
238 Pitcairn Islands 48
200 Cocos (Keeling) Islands 596
190 Holy See (Vatican City) 842
220 Niue 1190

Looking at the first table which displays the top 5 countries with the highest population, there is an obvious outlier which is given for the country name World and since it's known to the best my knowledge that World isn't a country, it is possible to assume that this value could've been gotten as a result of the summation of all the population values of the other countries. I'll exclude it in the next summary statistics taken, to see what the table looks like.

Also, the country name European Union is not a valid country name, therefore I'll also get exclude it.

Moving further to the second table displayed, its observed that a country named Antarctica (which is a continent and not a country) has a 0 count as its population. This can be used to validate the fact that there is a country in the table with a value 0.

Also, the so much minimal values for countries in the remaining rows, can be further investigated. For example, the Pitcairn Islands is known to be an abandoned country comprising of a group of volcanic islands in the South Pacific Ocean, east of French Polynesia. This validates the value gotten for the population of that country.

NB: Other countries can be looked into to determine the validity of their values.

Top 5 most populated countries excluding the countries with incorrect entries.

Excluding the 3 countries identified previously, here's the top 5 countries with the highest population.

In [7]:
%%sql
SELECT id, name, population
    FROM facts
    WHERE name NOT IN ('World', 'European Union', 'Antarctica')
    ORDER BY population DESC
    LIMIT 5;
Done.
Out[7]:
id name population
37 China 1367485388
77 India 1251695584
186 United States 321368864
78 Indonesia 255993674
24 Brazil 204259812

The most populated countries in the world based on the collected data are China with more than 1.3 billion, India following closely with more than 1.2 billion and U.S following reluctantly with about 321 billion people.

In [8]:
%%sql
SELECT MIN(population), MAX(population),
        MIN(population_growth), MAX(population_growth)
    FROM facts
    WHERE name NOT IN ('World', 'European Union', 'Antarctica');
Done.
Out[8]:
MIN(population) MAX(population) MIN(population_growth) MAX(population_growth)
48 1367485388 0.0 4.02

Above shows what the new summary statistics look like.

Calculating the average population and area

Exluding the the 3 identified incorrectly named countries World, European Union and Antarctica, I'll then find the average of the remaining countries.

In [9]:
%%sql
SELECT AVG(population) AS 'avg_population', AVG(area) AS 'avg_area'
    FROM facts
    WHERE name NOT IN ('World', 'European Union', 'Antarctica');
Done.
Out[9]:
avg_population avg_area
30362063.589958157 539893.1895161291

The average population and area of all the countries excluding World, European Union and Antarctica is about 30 million and 530,000 square kilometers respectively.

Moving foward, it'll a good idea to explore how many countries have an area greater than the average area, which would be shown below.

NB: Excluding the the country names World, European Union and Antarctica.

In [10]:
%%sql
SELECT COUNT(*)
    FROM facts
    WHERE area > (SELECT AVG(area)
                     FROM facts)
    AND name NOT IN ('World', 'European Union', 'Antarctica');
Done.
Out[10]:
COUNT(*)
47

Now it's visible that they're 47 of 195 countries which have an average greater than the estimated average of all the countries.

I'll display a few of the corresponding countries below.

In [22]:
%%sql
SELECT id, name, area
    FROM facts
    WHERE area > (SELECT AVG(area)
                     FROM facts)
    ORDER BY area DESC
    LIMIT 5;
Done.
Out[22]:
id name area
143 Russia 17098242
32 Canada 9984670
186 United States 9826675
37 China 9596960
24 Brazil 8515770

Determining overpopulated countries

Given the values gotten, it's possible to identify the countries which are overpopulated by:

  • Identifying countries above the average value for population.
  • Identifying countries below the average value for area.
In [12]:
%%sql
SELECT id, name, population, area
    FROM facts
    WHERE population > (SELECT AVG(population)
                           FROM facts)
    AND area < (SELECT AVG(area)
                   FROM facts)
    ORDER BY id;
Done.
Out[12]:
id name population area
14 Bangladesh 168957745 148460
65 Germany 80854408 357022
85 Japan 126919659 377915
138 Philippines 100998376 300000
173 Thailand 67976405 513120
185 United Kingdom 64088222 243610
192 Vietnam 94348835 331210

The table above shows some of the countries suffering overpopulation, with the total number of residents overseeding the average population and area of its respective countries.

Determining the population Density of each country.

Now let's calculate population densities. To do so we will have to divide the country's population by its land area (hopefully eliminating the possiblity that there are people who live of water).

In [26]:
%%sql
SELECT id, name, population, area,
        ROUND(CAST(population AS FLOAT)/area_land, 2) AS population_density
    FROM facts
    ORDER BY population_density DESC
    LIMIT 5;
Done.
Out[26]:
id name population area population_density
205 Macau 592731 28 21168.96
117 Monaco 30535 2 15267.5
156 Singapore 5674472 697 8259.78
204 Hong Kong 7141106 1108 6655.27
251 Gaza Strip 1869055 360 5191.82

Macau is a special administrative region in China, and is considered the most densely populated region in the world.

If we are following this way why not calculate the average population density and figure out how many (and what) countries have densitites more than its average? The oceans, the EU, the world, have no values for area_land and/or for population so we do not need to explicitly exclude them from calculation. On the other hand, the Antarctica has both values, so it's necessary to exclude it.

In [28]:
%%sql
SELECT ROUND(AVG(CAST(population AS FLOAT) / area_land), 2) AS avg_pop_density
    FROM facts
    WHERE name NOT IN ('Antarctica');
Done.
Out[28]:
avg_pop_density
431.91

It's around 432 pop./km2, the density of the Netherlands, a state in Europe, a continent with one of the most highest densitites in the world.

Let's now figure out how many countries (and what) have densities more than the average value.

In [ ]:
%%sql
SELECT COUNT(*) AS greater_avg_dens
    FROM facts
    WHERE 

Further Analysis

The following questions can further help in extracting meaningful information from the data contained in the table.

  • What country has the most people? What country has the highest growth rate?
  • Which countries have the highest ratios of water to land? Which countries have more water than land?
  • Which countries will add the most people to their population next year?
  • Which countries have a higher death rate than birth rate?
  • What countries have the highest population/area ratio and how does it compare to list we found in the previous screen?

Country with the highest population and growth rate

In [13]:
%%sql
SELECT id, name, MAX(population) AS 'maximun_population'
    FROM facts
    WHERE name != 'World';
Done.
Out[13]:
id name maximun_population
37 China 1367485388
In [14]:
%%sql
SELECT id, name, MAX(population_growth) AS 'highest_population_growth'
    FROM facts
    WHERE name != 'World';
Done.
Out[14]:
id name highest_population_growth
162 South Sudan 4.02

Stated above, are the countries with the highest population and population growth namely China and South Sudan.

Countries with the highest water to land ratio

In [15]:
%%sql
SELECT id, name, CAST(area_water AS FLOAT)/area_land AS ratio_water_to_land
    FROM facts
    ORDER BY ratio_water_to_land DESC
    LIMIT 5;
Done.
Out[15]:
id name ratio_water_to_land
228 British Indian Ocean Territory 905.6666666666666
247 Virgin Islands 4.520231213872832
246 Puerto Rico 0.5547914317925592
12 Bahamas, The 0.3866133866133866
71 Guinea-Bissau 0.2846728307254623

The diagram above shows British Indian Ocean Territory with the highest water to land ratio.

NB: All the countries listed above have islands which means they've bodies of water around them. This goes to validate the ratio_water_to_land column displayed.

Countries with higher water mass than land

In [16]:
%%sql
SELECT id, name, area_water, area_land
    FROM facts
    WHERE area_water > area_land;
Done.
Out[16]:
id name area_water area_land
228 British Indian Ocean Territory 54340 60
247 Virgin Islands 1564 346

The output shows only two countries with more water mass than land mass namely British Indian Ocean Territory and Virgin Islands.

Countries with prospects of adding more population

To determine the countries with the prospects of adding more population, it is necessary to determine which countries are lessly populated and what their growth rate is, otherwise we might end up displaying result of countries that may turn out to be overpopulated.

This can be done by:

  • Identifying countries below the average value for population.
  • Identifying countries below the average value for area.
  • Identifying the growth rate for each of those countries.
In [17]:
%%sql
SELECT id, name, population, population_growth, area
    FROM facts
    WHERE population < (SELECT AVG(population)
                           FROM facts)
    AND area < (SELECT AVG(population)
                   FROM facts)
    ORDER BY population_growth DESC
    LIMIT 5;
Done.
Out[17]:
id name population population_growth area
162 South Sudan 12042910 4.02 644329
106 Malawi 17964697 3.32 118484
29 Burundi 10742276 3.28 27830
182 Uganda 37101745 3.24 241038
141 Qatar 2194817 3.07 11586

Above shows the top 5 countries which tend to experience a high number of population increase.

Countries with a higher death rate than birth rate

In [18]:
%%sql
SELECT id, name, birth_rate, death_rate, 
        ROUND(death_rate - birth_rate, 3) AS death_birth_difference
    FROM facts
    WHERE death_rate > birth_rate
    ORDER BY death_birth_difference
    LIMIT 10;
Done.
Out[18]:
id name birth_rate death_rate death_birth_difference
10 Austria 9.41 9.42 0.01
139 Poland 9.74 10.19 0.45
116 Moldova 12.0 12.59 0.59
47 Czech Republic 9.63 10.34 0.71
22 Bosnia and Herzegovina 8.87 9.75 0.88
83 Italy 8.74 10.19 1.45
85 Japan 7.93 9.51 1.58
140 Portugal 9.27 11.02 1.75
57 Estonia 10.51 12.4 1.89
143 Russia 11.6 13.69 2.09
In [19]:
%%sql
SELECT id, name, population, area, 
        (SELECT *
            FROM facts
            WHERE population =)
        population/area AS ratio_population_to_area
    FROM facts
    WHERE population = 'None' OR area = 'None'
    ORDER BY ratio_population_to_area;
(sqlite3.OperationalError) near ")": syntax error
[SQL: SELECT id, name, population, area, 
        (SELECT *
            FROM facts
            WHERE population =)
        population/area AS ratio_population_to_area
    FROM facts
    WHERE population = 'None' OR area = 'None'
    ORDER BY ratio_population_to_area;]
(Background on this error at: http://sqlalche.me/e/e3q8)
In [20]:
%%sql
SELECT id, name, population, area 
    FROM facts
    WHERE (population != 'None' OR area != 'None')
    AND 
(sqlite3.OperationalError) near "AND": syntax error
[SQL: SELECT id, name, population, area 
    FROM facts
    WHERE (population != 'None' OR area != 'None')
    AND]
(Background on this error at: http://sqlalche.me/e/e3q8)
In [ ]: