Analyzing CIA Factbook Data Using SQL

This project entails working with CIA World Factbook data statistics which can be gotten from this link https://www.cia.gov/library/publications/the-world-factbook/ and it contains the statistics of all the countries on Earth

The Factbook contains demographic information like:

population - The population as of 2015.

population_growth - The annual population growth rate, as a percentage.

area - The total land and water area.

In this project, i will use the basics of SQL to explore and analyze the data. link to get the database https://dsserver-prod-resources-1.s3.amazonaws.com/257/factbook.db

In [22]:
%%capture
%load_ext sql
%sql sqlite:///factbook.db  
    
#connecting the jupyter notebook to the database file
Out[22]:
'Connected: [email protected]'
In [32]:
%%sql

SELECT * FROM sqlite_master WHERE type='table';
Done.
Out[32]:
type name tbl_name rootpage sql
table sqlite_sequence sqlite_sequence 3 CREATE TABLE sqlite_sequence(name,seq)
table facts facts 47 CREATE TABLE "facts" ("id" INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, "code" varchar(255) NOT NULL, "name" varchar(255) NOT NULL, "area" integer, "area_land" integer, "area_water" integer, "population" integer, "population_growth" float, "birth_rate" float, "death_rate" float, "migration_rate" float)

We have seen the contents of our file and the data types above.

In [30]:
%%sql
SELECT * FROM facts limit 5;
Done.
Out[30]:
id code name area area_land area_water population population_growth birth_rate death_rate migration_rate
1 af Afghanistan 652230 652230 0 32564342 2.32 38.57 13.89 1.51
2 al Albania 28748 27398 1350 3029278 0.3 12.92 6.58 3.3
3 ag Algeria 2381741 2381741 0 39542166 1.84 23.67 4.31 0.92
4 an Andorra 468 468 0 85580 0.12 8.13 6.96 0.0
5 ao Angola 1246700 1246700 0 19625353 2.78 38.78 11.49 0.46

what i did above was to easily have acess the content of my database.

We are getting close to how to approach our data

lets start calculating some summary statistics of our file

Summary Statistics of our data

In [38]:
%%sql

select 
    min(population) min_population, 
    max(population) max_population, 
    min(population_growth) min_population_growth, 
    max(population_growth) max_population_growth
    
from facts
Done.
Out[38]:
min_population max_population min_population_growth max_population_growth
0 7256490011 0.0 4.02

our maximum population has 7256490011 value which is approximately 7.2 billion people.

while our minimum population recorded 0 value, could it be the antarctica region or zone with no one living there??

we have minimum population growth as 0, while maximum population growth was 4.02

Now we are interested in gettin the countries with the minimum and maximum population. let us see further

Getting the countries with minimum and maximum value

In [43]:
%%sql

select *
from facts
where population == (select min(population) from facts)
Done.
Out[43]:
id code name area area_land area_water population population_growth birth_rate death_rate migration_rate
250 ay Antarctica None 280000 None 0 None None None None

Yea, our guesses where right and correct. The region with the least population is Antarctica. Now let us move on to the countries with the highest population. (i used nested subqueries in this context in nesting the calculation of the minimum population inside the major query

In [45]:
%%sql

select *
from facts
where population == (select max(population) from facts)
Done.
Out[45]:
id code name area area_land area_water population population_growth birth_rate death_rate migration_rate
261 xx World None None None 7256490011 1.08 18.6 7.8 None

Now to our surprise, we are seeing the name of the country as world right?? yes, the only way to explain this is that the dataset had world as the row measuring the total population of the world which is about 7.2 billion people

Performing Arithematic operations in sql

In [46]:
%%sql

select avg(population), avg(area)
from facts
Done.
Out[46]:
avg(population) avg(area)
62094928.32231405 555093.546184739

We can now see that the average population of the world is 62094928.32231405 which is about 62 million people, and the average area is 555093.546184739 which is about 555 thousand km^2 sq/area

Finding the densely populated countries

In [47]:
%%sql

select * 
from facts

where 
        population > (select avg(population) from facts) and 
        area < (select avg(area) from facts)
Done.
Out[47]:
id code name area area_land area_water population population_growth birth_rate death_rate migration_rate
14 bg Bangladesh 148460 130170 18290 168957745 1.6 21.14 5.61 0.46
65 gm Germany 357022 348672 8350 80854408 0.17 8.47 11.42 1.24
85 ja Japan 377915 364485 13430 126919659 0.16 7.93 9.51 0.0
138 rp Philippines 300000 298170 1830 100998376 1.61 24.27 6.11 2.09
173 th Thailand 513120 510890 2230 67976405 0.34 11.19 7.8 0.0
185 uk United Kingdom 243610 241930 1680 64088222 0.54 12.17 9.35 2.54
192 vm Vietnam 331210 310070 21140 94348835 0.97 15.96 5.93 0.3

Good, seems little, it is a way of going through the basics of sql, which covers some syntax(not all) mostly used in its operation such as select, min, max, from, where etc. we are done with the basics.

In [ ]: