Analyzing CIA Factbook Data Using SQL

Introduction

In this guided project, we will be working with data from the CIA World Factbook, a compendium of statistics about every countries on Earth. The site provides demographic information for each country.

This dataset is a SQLite Database named factbook.db that can be dowloaded here.

Note: the data used is not current.

The goal of this project is to learn basic statistics using SQL (more precisely, SQLite).

Analyzing the Data

Connecting to database

First of all we need to connect to the database:

In [1]:
import warnings
warnings.filterwarnings("ignore")
In [2]:
%reload_ext sql
%sql sqlite:///factbook.db    
Out[2]:
'Connected: [email protected]'

Overview of the Data

Now let's get the information about the database structure:

In [3]:
%%sql
select  *
 from sqlite_master
where type='table';
Done.
Out[3]:
type name tbl_name rootpage sql
table sqlite_sequence sqlite_sequence 3 CREATE TABLE sqlite_sequence(name,seq)
table facts facts 47 CREATE TABLE "facts" ("id" INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, "code" varchar(255) NOT NULL, "name" varchar(255) NOT NULL, "area" integer, "area_land" integer, "area_water" integer, "population" integer, "population_growth" float, "birth_rate" float, "death_rate" float, "migration_rate" float)

The database actually contain only one table named 'facts'. Let's take a look at its structure and content (5 lines are enough):

In [4]:
%%sql
select *
from facts
limit 5;
Done.
Out[4]:
id code name area area_land area_water population population_growth birth_rate death_rate migration_rate
1 af Afghanistan 652230 652230 0 32564342 2.32 38.57 13.89 1.51
2 al Albania 28748 27398 1350 3029278 0.3 12.92 6.58 3.3
3 ag Algeria 2381741 2381741 0 39542166 1.84 23.67 4.31 0.92
4 an Andorra 468 468 0 85580 0.12 8.13 6.96 0.0
5 ao Angola 1246700 1246700 0 19625353 2.78 38.78 11.49 0.46

Here are the descriptions for the columns:

  • name — the name of the country.
  • area— the country's total area (both land and water).
  • area_land — the country's land area in square kilometers.
  • area_water — the country's waterarea in square kilometers.
  • population — the country's population.
  • population_growth— the country's population growth as a percentage.
  • birth_rate — the country's birth rate, or the number of births per year per 1,000 people.
  • death_rate — the country's death rate, or the number of death per year per 1,000 people
  • migration_rate - the country's migration rate.
In [5]:
%%sql
select count(*)
from facts
;
Done.
Out[5]:
count(*)
261

The table contains 261 rows including NULL and duplicates.

Section 1. Summary Statistics

The goal of the current calculation is to get following:

  • Minimum population.
  • Maximum population.
  • Minimum population growth.
  • Maximum population growth.
In [6]:
%%sql
select 
min(population) as 'Minimum population',
max(population) as 'Maximum population',
min(population_growth) as 'Minimum population growth, %',
max(population_growth) as 'Maximum population growth, %'
from facts;
Done.
Out[6]:
Minimum population Maximum population Minimum population growth, % Maximum population growth, %
0 7256490011 0.0 4.02

Section 2. Exploring Outliers

The result above looks incorrect, the analysis showed that in some countries the population is zero, and in some it is close to the world population.

Let's find which country has zero population:

In [7]:
%%sql
select name as Country,
            population as Population
from facts
where population == 0;
Done.
Out[7]:
Country Population
Antarctica 0

That's right, there is no permanent population in Antarctica.

Now let's find which country has the maximun population:

In [8]:
%%sql
select name as Country, 
           population as Population
from facts
where population == (select max(population) from facts);
Done.
Out[8]:
Country Population
World 7256490011

Thus, the database contains the value of the total population of the world . This means that we need to exclude this row and recalculate our statistics:

In [9]:
%%sql
select min(population), max(population),
min(population_growth), max(population_growth)
from facts
where name <> 'World' --exclude the "World" row
and name <> 'Antarctica' --exclude the 'Antarctica' row
; 
Done.
Out[9]:
min(population) max(population) min(population_growth) max(population_growth)
48 1367485388 0.0 4.02

As we can see, the population range is from 48 people to over 1.3 billion, and the population growth is from zero to 4 people per 1,000 people. Are we curious enough to know what these countries are? Well, I am:

In [10]:
%%sql
select name as Country,
            population as Total_population,
            population*100/(select population from facts
                           where name = 'World') as '%_from_World_population'
from facts
where 
          population = (select min(population)
                       from facts
                       where name != 'Antarctica')
       or population = (select max(population)
                       from facts
                       where name != 'World')
        ;
Done.
Out[10]:
Country Total_population %_from_World_population
China 1367485388 18
Pitcairn Islands 48 0
In [11]:
%%sql
select name as Country,
            population_growth as 'Growth_population_%'
from facts
where 
          population_growth = (select min(population_growth)
                              from facts)
       or population_growth = (select max(population_growth)
                                         from facts)
        ;
Done.
Out[11]:
Country Growth_population_%
South Sudan 4.02
Holy See (Vatican City) 0.0
Cocos (Keeling) Islands 0.0
Greenland 0.0
Pitcairn Islands 0.0

Above we see the following:

  • China is the most populous country, but it is home to only 18% of the world's population;
  • Pitcairn Island is the least populated country and population growth is shown as zero;
  • three more countries show zero population growth;
  • the population of South Sudan is growing faster than in any other country.

Zero as the growth value does not mean that population is constant. It may be too small to be shown in the data. Let's test this hipothesis:

In [12]:
%%sql
select name as Country,
            population as Population_Total,
            population_growth as 'Growth_population_%',
            birth_rate as Births_per_1000,
            death_rate as Deaths_per_1000
from facts
where 
          population_growth = (select min(population_growth)
                              from facts)
;
Done.
Out[12]:
Country Population_Total Growth_population_% Births_per_1000 Deaths_per_1000
Holy See (Vatican City) 842 0.0 None None
Cocos (Keeling) Islands 596 0.0 None None
Greenland 57733 0.0 14.48 8.49
Pitcairn Islands 48 0.0 None None

We see that in database three countries with the minimum population growth do not have records about births and deaths. What about Greenland?

In [13]:
%%sql
select name as Country,
            (birth_rate - death_rate)*100/population as Growth_Rate_Full
    from facts
    where name = 'Greenland'
    ;
Done.
Out[13]:
Country Growth_Rate_Full
Greenland 0.010375348587462975

Thus, the Greenland's population is growthing, but too low to shown the value of the growth in the current data.

By the way, we can see that some countries do not have data in the "population_growth" column. Let's look at these rows:

In [14]:
%%sql
select name as Country,
            population_growth as 'Growth_%',
            birth_rate as Births_per_1000,
            death_rate as Deaths_per_1000,
            area as Total_Area_km2,
            area_land as Land_km2
from facts
where 
         population_growth is null
        order by name
        ;
Done.
Out[14]:
Country Growth_% Births_per_1000 Deaths_per_1000 Total_Area_km2 Land_km2
Akrotiri None None None 123 None
Antarctica None None None None 280000
Arctic Ocean None None None None None
Ashmore and Cartier Islands None None None 5 5
Atlantic Ocean None None None None None
Bouvet Island None None None 49 49
British Indian Ocean Territory None None None 54400 60
Clipperton Island None None None 6 6
Coral Sea Islands None None None 3 3
Dhekelia None None None 130 None
French Southern and Antarctic Lands None None None None None
Heard Island and McDonald Islands None None None 412 412
Indian Ocean None None None None None
Jan Mayen None None None 377 377
Kosovo None None None 10887 10887
Navassa Island None None None 5 5
Pacific Ocean None None None None None
Paracel Islands None None None 7 7
Saint Barthelemy None None None None None
Saint Martin None None None 54 54
South Georgia and South Sandwich Islands None None None 3903 3903
Southern Ocean None None None None None
Spratly Islands None None None 5 5
United States Pacific Island Wildlife Refuges None None None None None
Wake Island None None None 6 6
In [15]:
%%sql
select count(name)
    from facts
    where population_growth is null
    ;
Done.
Out[15]:
count(name)
25

We can see that in 25 lines there is no data about growth, births or deaths. Thus, we cannot fill missing data by calculation. Between these lines there are rows with data about Antarctica and all five oceans.

In [16]:
%%sql
select count(name)
    from facts
    where population_growth is null
    and area < 300
    ;
Done.
Out[16]:
count(name)
11

The 11 countries for which data are missing are smaller than 300 km2 and are small islands or a group of small islands. There are also a couple of group islands with a fairly large total area. Separately lies Kosovo, about which we do not have the necessary data.

Section 3. Exploring Average population and Area

Now let's calculate the average value for population and area:

In [17]:
%%sql
select 
    round(avg(population), 2) as 'Average_Population',
    round(avg(area), 2)  as 'Average_Area'
from facts
where name <> 'World';
Done.
Out[17]:
Average_Population Average_Area
32242666.57 555093.55

And one more calculation: which countries meet both of the following criteria:

  • The population is above average.
  • The area is below average.
In [18]:
%%sql
select name as Country,
            population as Total_population,
            area as Total_Area
from facts
where population > (select  avg(population) 
                                    from facts
                                    where name <> 'World') 
            and  area < (select avg(area)  
                                 from facts
                                 where name <> 'World');
Done.
Out[18]:
Country Total_population Total_Area
Bangladesh 168957745 148460
Germany 80854408 357022
Iraq 37056169 438317
Italy 61855120 301340
Japan 126919659 377915
Korea, South 49115196 99720
Morocco 33322699 446550
Philippines 100998376 300000
Poland 38562189 312685
Spain 48146134 505370
Thailand 67976405 513120
Uganda 37101745 241038
United Kingdom 64088222 243610
Vietnam 94348835 331210
In [19]:
%%sql
select count(name) as "Amount_of_countries"
from facts
where name in
(select name       
from facts
where population > (select  avg(population) 
                                    from facts
                                    where name <> 'World') 
            and  area < (select avg(area)  
                                 from facts
                                 where name <> 'World')
);
Done.
Out[19]:
Amount_of_countries
14

As we see, 14 countries meet both criteria.

Section 4. Finding Densely Populated Countries

In this section, we will analyze which countries have an above average population at a time when the land (without water area) is below the average:

In [20]:
%%sql
select name as Country,
population as Total_population,
population/area_land as People_per_km2_of_land
from facts
where population > (select avg(population) from facts where name<>'World')
and area < (select avg(area_land) from facts where name<>'World')
order by  People_per_km2_of_land desc
limit 10;
Done.
Out[20]:
Country Total_population People_per_km2_of_land
Bangladesh 168957745 1297
Korea, South 49115196 506
Japan 126919659 348
Philippines 100998376 338
Vietnam 94348835 304
United Kingdom 64088222 264
Germany 80854408 231
Italy 61855120 210
Uganda 37101745 188
Thailand 67976405 133

Above we can see that the most populous country is Bangladesh- over 1200 people per square kilometer of the land. Which country is the least populuos?

In [21]:
%%sql
select name as Country,
population as Total_population,
population/area_land as People_per_km2_of_land
from facts
where population > (select avg(population) from facts where name<>'World')
and area < (select avg(area_land) from facts where name<>'World')
order by  People_per_km2_of_land
limit 5;
Done.
Out[21]:
Country Total_population People_per_km2_of_land
Morocco 33322699 74
Iraq 37056169 84
Spain 48146134 96
Poland 38562189 126
Thailand 67976405 133

It's Morocco - only 74 people per square kilometer of the land.

Section 5. Some More Analysis

Q1. Which country has the most people? Which country has the highest growth rate?

In [22]:
%%sql
select name as 'Country',
           population as 'Total_Population',
           population_growth as 'Growth_Rate_%'
        from facts
        where population== (select max(population)
                            from facts
                            where name !='World')
        or population_growth == (select max(population_growth)
                                 from facts
                                 where name !='World')
               ;
Done.
Out[22]:
Country Total_Population Growth_Rate_%
China 1367485388 0.45
South Sudan 12042910 4.02

So, China is the most populous, but its growth rate is low. South Sudan's growth rate is the highest, but the total population is below average.

*Q2. Which country has the highest ratio of water to land? Which country has less water than land?

In [23]:
%%sql
select name as 'Country',
           area_water/area_land as 'Water/Land_Ratio',
           area_water-area_land as 'Water/Land_Difference'
        from facts
    where area_water/area_land == (select max(area_water/area_land)
                                   from facts
                                   where name !='World' )
    or  area_water-area_land == (select min(area_water-area_land)
                                   from facts
                                   where name !='World')
         ;
Done.
Out[23]:
Country Water/Land_Ratio Water/Land_Difference
Russia 0 -15657242
British Indian Ocean Territory 905 54280

We have get that British Indian Ocean Territory has 905 times more water than land, and Russian lands are much larger than water area.

Q3. Which countries will add the most people to their populations next year?

In [24]:
%%sql
select name as Country,
                       cast((birth_rate - death_rate)*population/1000 as int) as Population_Growth_Next_Year
        from facts
        where name <> 'World'
        order by  Population_Growth_Next_Year desc
        limit 10
         ;
Done.
Out[24]:
Country Population_Growth_Next_Year
India 15308236
China 6782727
Nigeria 4491845
Pakistan 3203291
Ethiopia 2892466
Indonesia 2649534
Bangladesh 2623913
Congo, Democratic Republic of the 1969297
Philippines 1834130
Mexico 1645881

Above we see that the world can expect that next year India will add over 15 millions people to its population.

Q4. Which countries have a higher death rate than birth rate?

In [25]:
%%sql
select name as 'country',
        death_rate, birth_rate,
        round(death_rate - birth_rate, 2) as diff
        from facts
where name <> 'World'
order by diff desc
limit 10
         ;
Done.
Out[25]:
country death_rate birth_rate diff
Bulgaria 14.44 8.92 5.52
Serbia 13.66 9.08 4.58
Latvia 14.31 10.0 4.31
Lithuania 14.27 10.1 4.17
Ukraine 14.46 10.72 3.74
Hungary 12.73 9.16 3.57
Germany 11.42 8.47 2.95
Slovenia 11.37 8.42 2.95
Romania 11.9 9.14 2.76
Croatia 12.18 9.45 2.73

The above calculation shows which populations are declining. If you notice, all these countries are in Europe. What about the opposite situation?

In [26]:
%%sql
select name as 'country',
        birth_rate,  death_rate,
        round(birth_rate - death_rate, 2) as diff
        from facts
where name <> 'World'
order by diff desc
limit 10
         ;
Done.
Out[26]:
country birth_rate death_rate diff
Malawi 41.56 8.41 33.15
Uganda 43.79 10.69 33.1
Niger 45.45 12.42 33.03
Burundi 42.01 9.27 32.74
Mali 44.99 12.89 32.1
Burkina Faso 42.03 11.72 30.31
Zambia 42.13 12.67 29.46
Ethiopia 37.27 8.19 29.08
South Sudan 36.91 8.18 28.73
Tanzania 36.39 8.0 28.39

Here we get the countries whose population is growing most intensively. And they are all in Africa.

Q5. Which countries have the highest population/area ratio?

In [27]:
%%sql
select name as Country,
        cast(population/area as int) as People_per_km2_Area,
        population as Population, area as Total_Area_km2
        from facts
where name <> 'World'
order by  People_per_km2_Area desc
limit 10
         ;
Done.
Out[27]:
Country People_per_km2_Area Population Total_Area_km2
Macau 21168 592731 28
Monaco 15267 30535 2
Singapore 8141 5674472 697
Hong Kong 6445 7141106 1108
Gaza Strip 5191 1869055 360
Gibraltar 4876 29258 6
Bahrain 1771 1346613 760
Maldives 1319 393253 298
Malta 1310 413965 316
Bermuda 1299 70196 54

The calculation shows that there are countries with extremely high population densities that did not get in result of our analysis above because their total population is below average.

Conclusion

In this analysis, we reviewed some of the key statistics about the countries of the world based on the CIA World Fact Book. As we have seen, a small database with a small amount of data and fairly simple queries can provide detailed analysis and deliver interesting information that allows us to better and deeper understand our world.