In this Project we're working with data from the CIA World Factbook, a compendium of statistics about all of the countries on Earth. You can download the SQLite database, factbook.db, from this GitHub repo.
The Factbook contains demographic information like:
name - The name of the country.
area - The total land and water area.
area_land - The country's land area in square kilometers.
area_water - The country's waterarea in square kilometers.
population - The population as of 2015.
population_growth - The annual population growth rate, as a percentage.
birth_rate - The country's birth rate, or the number of births a year per 1,000 people.
death_rate - The country's death rate, or the number of death a year per 1,000 people.
Here, we'll explore the Python SQLite workflow to explore, analyze, and visualize data from this database
import sqlite3
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
conn = sqlite3.connect('factbook.db')
query = "SELECT * FROM sqlite_master WHERE type='table';"
pd.read_sql_query(query, conn)
type | name | tbl_name | rootpage | sql | |
---|---|---|---|---|---|
0 | table | facts | facts | 2 | CREATE TABLE "facts" ("id" INTEGER PRIMARY KEY... |
1 | table | sqlite_sequence | sqlite_sequence | 3 | CREATE TABLE sqlite_sequence(name,seq) |
query1 = "SELECT * FROM facts LIMIT 5"
pd.read_sql_query(query1, conn)
id | code | name | area | area_land | area_water | population | population_growth | birth_rate | death_rate | migration_rate | created_at | updated_at | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | af | Afghanistan | 652230 | 652230 | 0 | 32564342 | 2.32 | 38.57 | 13.89 | 1.51 | 2015-11-01 13:19:49.461734 | 2015-11-01 13:19:49.461734 |
1 | 2 | al | Albania | 28748 | 27398 | 1350 | 3029278 | 0.30 | 12.92 | 6.58 | 3.30 | 2015-11-01 13:19:54.431082 | 2015-11-01 13:19:54.431082 |
2 | 3 | ag | Algeria | 2381741 | 2381741 | 0 | 39542166 | 1.84 | 23.67 | 4.31 | 0.92 | 2015-11-01 13:19:59.961286 | 2015-11-01 13:19:59.961286 |
3 | 4 | an | Andorra | 468 | 468 | 0 | 85580 | 0.12 | 8.13 | 6.96 | 0.00 | 2015-11-01 13:20:03.659945 | 2015-11-01 13:20:03.659945 |
4 | 5 | ao | Angola | 1246700 | 1246700 | 0 | 19625353 | 2.78 | 38.78 | 11.49 | 0.46 | 2015-11-01 13:20:08.625072 | 2015-11-01 13:20:08.625072 |
We will write a single query that returns the:
- minimum population
- maximum population
- minimum population growth
- maximum population growth
query2 = '''SELECT MIN(population) 'min_pop',
MAX(population) 'max_pop',
MIN(population_growth) 'min_pop_growth',
MAX(population_growth) 'max_pop_growth'
FROM facts'''
pd.read_sql_query(query2, conn)
min_pop | max_pop | min_pop_growth | max_pop_growth | |
---|---|---|---|---|
0 | 0 | 7256490011 | 0.0 | 4.02 |
query3 = '''SELECT * FROM facts WHERE population == 0 or population == 7256490011'''
pd.read_sql_query(query3, conn)
id | code | name | area | area_land | area_water | population | population_growth | birth_rate | death_rate | migration_rate | created_at | updated_at | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 250 | ay | Antarctica | None | 280000.0 | None | 0 | NaN | NaN | NaN | None | 2015-11-01 13:38:44.885746 | 2015-11-01 13:38:44.885746 |
1 | 261 | xx | World | None | NaN | None | 7256490011 | 1.08 | 18.6 | 7.8 | None | 2015-11-01 13:39:09.910721 | 2015-11-01 13:39:09.910721 |
Here, we can observe that country listed with zero population is Antartica, since there is nobody living there.
And, the country listed with 7.2 Billion people is actually the world itself where whole world's population gets counted in just a single country.
Using just the non-outlier rows, we will generate a 2 by 2 grid of histograms for the following columns:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
fig = plt.figure(figsize =(10, 5))
ax = fig.add_subplot(1, 1, 1)
query4 = '''SELECT population, population_growth,
birth_rate, death_rate
FROM facts
WHERE population != 0 and population != 7256490011;'''
pd.read_sql_query(query4, conn).hist(ax = ax)
plt.show()
*Which countries have the highest population density?*
query5 = '''SELECT name, cast(population as float)/cast(area_land as float) 'population_density'
FROM facts ORDER BY population_density desc'''
pd.read_sql_query(query5, conn)
name | population_density | |
---|---|---|
0 | Macau | 21168.964286 |
1 | Monaco | 15267.500000 |
2 | Singapore | 8259.784571 |
3 | Hong Kong | 6655.271202 |
4 | Gaza Strip | 5191.819444 |
5 | Gibraltar | 4876.333333 |
6 | Bahrain | 1771.859211 |
7 | Maldives | 1319.640940 |
8 | Malta | 1310.015823 |
9 | Bermuda | 1299.925926 |
10 | Bangladesh | 1297.977606 |
11 | Sint Maarten | 1167.323529 |
12 | Guernsey | 847.179487 |
13 | Jersey | 838.741379 |
14 | Taiwan | 725.825356 |
15 | Barbados | 675.823256 |
16 | Mauritius | 660.013300 |
17 | Aruba | 623.122222 |
18 | Lebanon | 604.565103 |
19 | Saint Martin | 588.037037 |
20 | San Marino | 541.311475 |
21 | Rwanda | 513.285755 |
22 | Korea, South | 506.760173 |
23 | Netherlands | 500.041424 |
24 | West Bank | 493.859220 |
25 | Nauru | 454.285714 |
26 | India | 420.993721 |
27 | Burundi | 418.312928 |
28 | Tuvalu | 418.038462 |
29 | Puerto Rico | 405.677227 |
... | ... | ... |
231 | Greenland | 0.026653 |
232 | Antarctica | 0.000000 |
233 | Ethiopia | NaN |
234 | South Sudan | NaN |
235 | Sudan | NaN |
236 | Holy See (Vatican City) | NaN |
237 | European Union | NaN |
238 | Ashmore and Cartier Islands | NaN |
239 | Coral Sea Islands | NaN |
240 | Heard Island and McDonald Islands | NaN |
241 | Clipperton Island | NaN |
242 | French Southern and Antarctic Lands | NaN |
243 | Saint Barthelemy | NaN |
244 | Bouvet Island | NaN |
245 | Jan Mayen | NaN |
246 | Akrotiri | NaN |
247 | British Indian Ocean Territory | NaN |
248 | Dhekelia | NaN |
249 | South Georgia and South Sandwich Islands | NaN |
250 | Navassa Island | NaN |
251 | Wake Island | NaN |
252 | United States Pacific Island Wildlife Refuges | NaN |
253 | Paracel Islands | NaN |
254 | Spratly Islands | NaN |
255 | Arctic Ocean | NaN |
256 | Atlantic Ocean | NaN |
257 | Indian Ocean | NaN |
258 | Pacific Ocean | NaN |
259 | Southern Ocean | NaN |
260 | World | NaN |
261 rows × 2 columns
*Histogram of population densities.*
fig1 = plt.figure(figsize=(10, 5))
ax1 = fig1.add_subplot(1, 1, 1)
pd.read_sql_query(query5, conn).hist(ax = ax1)
plt.show()
*Which countries have the highest ratios of water to land?*
query6 = '''SELECT name, cast(area_water as float)/cast(area_land as float)
'water_land_ratio' FROM facts ORDER BY water_land_ratio desc'''
pd.read_sql_query(query6, conn)
name | water_land_ratio | |
---|---|---|
0 | British Indian Ocean Territory | 905.666667 |
1 | Virgin Islands | 4.520231 |
2 | Puerto Rico | 0.554791 |
3 | Bahamas, The | 0.386613 |
4 | Guinea-Bissau | 0.284673 |
5 | Malawi | 0.259396 |
6 | Netherlands | 0.225710 |
7 | Uganda | 0.222922 |
8 | Eritrea | 0.164356 |
9 | Liberia | 0.156240 |
10 | Bangladesh | 0.140509 |
11 | Gambia, The | 0.116601 |
12 | Taiwan | 0.115313 |
13 | Finland | 0.112996 |
14 | India | 0.105634 |
15 | Canada | 0.098000 |
16 | Sweden | 0.097384 |
17 | Colombia | 0.096476 |
18 | Brunei | 0.094967 |
19 | Guyana | 0.092050 |
20 | French Polynesia | 0.088842 |
21 | Nicaragua | 0.086507 |
22 | Burundi | 0.083723 |
23 | Iran | 0.076130 |
24 | United States | 0.072551 |
25 | Tanzania | 0.069429 |
26 | Vietnam | 0.068178 |
27 | Rwanda | 0.067699 |
28 | Estonia | 0.067000 |
29 | Norway | 0.064151 |
... | ... | ... |
231 | Turks and Caicos Islands | 0.000000 |
232 | American Samoa | 0.000000 |
233 | Guam | 0.000000 |
234 | Navassa Island | 0.000000 |
235 | Northern Mariana Islands | 0.000000 |
236 | Wake Island | 0.000000 |
237 | Gaza Strip | 0.000000 |
238 | Paracel Islands | 0.000000 |
239 | Spratly Islands | 0.000000 |
240 | Western Sahara | 0.000000 |
241 | Ethiopia | NaN |
242 | New Zealand | NaN |
243 | South Sudan | NaN |
244 | Sudan | NaN |
245 | Holy See (Vatican City) | NaN |
246 | European Union | NaN |
247 | Greenland | NaN |
248 | French Southern and Antarctic Lands | NaN |
249 | Saint Barthelemy | NaN |
250 | Saint Martin | NaN |
251 | Akrotiri | NaN |
252 | Dhekelia | NaN |
253 | United States Pacific Island Wildlife Refuges | NaN |
254 | Antarctica | NaN |
255 | Arctic Ocean | NaN |
256 | Atlantic Ocean | NaN |
257 | Indian Ocean | NaN |
258 | Pacific Ocean | NaN |
259 | Southern Ocean | NaN |
260 | World | NaN |
261 rows × 2 columns
*Which countries have more water than land?*
query7 = '''SELECT name, cast(area_water as float) - cast(area_land as float)
'diff' FROM facts
WHERE diff > 0
ORDER BY diff desc'''
pd.read_sql_query(query7, conn)
name | diff | |
---|---|---|
0 | British Indian Ocean Territory | 54280.0 |
1 | Virgin Islands | 1218.0 |
*British Indian Ocean Territory and Virgin Islands are the only two countries with more water area than land.*