#!/usr/bin/env python # coding: utf-8 # # Don't Copy The Seal # # or # # A Quick Look at the CIA World Factbook # # The CIA World Factbook is an almanac-style document that "provides basic intelligence on the history, people, government, economy, energy, geography, environment, communications, transportation, military, terrorism, and transnational issues for 266 world entities" (1). It was first made available publicly in 1971 and has been online since 1994 (2). For it's purpose, it is a very thorough document. There are some notable entries like [the World](https://en.wikipedia.org/wiki/Earth), five oceans, and some [miscellaneous locations](https://en.wikipedia.org/wiki/Akrotiri_(prehistoric_city)). Perhaps unsurprisingly, because of the nature of the agency that writes it, there is some controversy over entries for disputed locations. Factual errors, such as [population discrepancies](https://www.npr.org/sections/publiceditor/2010/06/02/127349281/cia-gets-numbers-wrong-on-jewish-settlers) have also been found (3). # # This notebook will use SQLite to take a quick look at a dataset for the entire document. # # One last note, while the source material and information are in the public domain and downloadable, don't make a copy of the official seal of the CIA without permission! That is prohibited by U.S. federal law—specifically, the Central Intelligence Agency Act of 1949 [(50 U.S.C. § 403m)](https://www.govinfo.gov/app/details/USCODE-2011-title50/USCODE-2011-title50-chap15-subchapI-sec403-4). ;) # # ## Let's get started! # In[1]: get_ipython().run_cell_magic('capture', '', '%load_ext sql\n%sql sqlite:///factbook.db\n') # In[2]: get_ipython().run_cell_magic('sql', '', "SELECT *\n FROM sqlite_master\n WHERE type='table';\n") # In[3]: get_ipython().run_cell_magic('sql', '', 'SELECT *\n FROM facts\n LIMIT 5;\n') # In[4]: get_ipython().run_cell_magic('sql', '', '-- row count\nSELECT COUNT (*)\n FROM facts;\n') # **File Info** # The file opens fine. There are 261 rows and 11 columns with pretty straightforward column titles. It would be helpful to know the units for area and how the rates are calculated. # ## Demographics # Just some basics. # In[5]: get_ipython().run_cell_magic('sql', '', 'SELECT MIN(population), MAX(population), MIN(population_growth), MAX(population_growth)\n FROM facts;\n') # In[6]: get_ipython().run_cell_magic('sql', '', 'SELECT *\n FROM facts\n WHERE population = (SELECT MIN(population)\n FROM facts);\n') # In[7]: get_ipython().run_cell_magic('sql', '', 'SELECT *\n FROM facts\n WHERE population < 100\n LIMIT 10;\n') # In[8]: get_ipython().run_cell_magic('sql', '', "SELECT *\n FROM facts\n WHERE population = 'None'\n LIMIT 50;\n") # In[9]: get_ipython().run_cell_magic('sql', '', 'SELECT *\n FROM facts\n ORDER BY population\n LIMIT 45;\n') # In[10]: get_ipython().run_cell_magic('sql', '', 'SELECT *\n FROM facts\n WHERE population = (SELECT MAX(population)\n FROM facts);\n') # In[11]: get_ipython().run_cell_magic('sql', '', 'SELECT *\n FROM facts\n ORDER BY population DESC\n LIMIT 5;\n') # **Outliers** # The only location with a `population` of 0 is [Antarctica](https://en.wikipedia.org/wiki/Antarctica). There are a number of places with a value of `None` in this column, *but I can't seem to capture that value in a query*. The only row with a true outlier on the high end is `World`. # # ## Population growth # Minimums and maximums. # In[12]: get_ipython().run_cell_magic('sql', '', "SELECT MIN(population), MAX(population), MIN(population_growth), MAX(population_growth)\n FROM facts\n WHERE population != 0 AND name != 'World';\n") # In[13]: get_ipython().run_cell_magic('sql', '', 'SELECT *\n FROM facts\n WHERE population < 100;\n') # In[14]: get_ipython().run_cell_magic('sql', '', 'SELECT *\n FROM facts\n WHERE population > 1000000000;\n') # In[15]: get_ipython().run_cell_magic('sql', '', 'SELECT *\n FROM facts\n WHERE population_growth = (SELECT MIN(population_growth)\n FROM facts);\n') # In[16]: get_ipython().run_cell_magic('sql', '', 'SELECT *\n FROM facts\n WHERE population_growth = (SELECT MAX(population_growth)\n FROM facts);\n') # In[17]: get_ipython().run_cell_magic('sql', '', 'SELECT *\n FROM facts\n WHERE population_growth > (SELECT AVG(population_growth)\n FROM facts)\n ORDER BY population_growth DESC;\n') # **Observations** # The lowest population, other than zero or none, is 48 in the [Pitcairn Islands](https://en.wikipedia.org/wiki/Pitcairn_Islands). The highest population is 1,367,485,388 in [China](https://en.wikipedia.org/wiki/China). [India](https://en.wikipedia.org/wiki/India) also has a population over one billion. The locations with no population growth are [Vatican City](https://en.wikipedia.org/wiki/Vatican_City), [Cocos Island](https://en.wikipedia.org/wiki/Cocos_Island), [Greenland](https://en.wikipedia.org/wiki/Greenland), and the Pitcairn Islands. [South Sudan](https://en.wikipedia.org/wiki/South_Sudan) has the highest population growth. Notably, it is the only place where this value is more than 4. A quick look at other locations with high population growth shows that some places, like [Qatar](https://en.wikipedia.org/wiki/Qatar), increase population primarily through migration rate. *This may be worth further comparison.* # # ## Population density # Looking at population density two ways. First, using average population and average area (excluding the World and Antarctica) to determine a ratio, then looking at locations that have an above average ratio. Second, selecting locations that meet the double criteria of having above average population and above average area (i.e. skipping the ratio). # In[18]: get_ipython().run_cell_magic('sql', '', "SELECT *, ROUND(CAST(population AS float)/area, 2) AS 'density ratio'\n FROM facts\n WHERE (population/area) > (SELECT AVG(population)/AVG(area)\n FROM facts\n WHERE name != 'World' AND name != 'Antarctica')\n ORDER BY (population/area) DESC;\n") # In[19]: get_ipython().run_cell_magic('sql', '', "SELECT COUNT(*)\n FROM facts\n WHERE (population/area) > (SELECT AVG(population)/AVG(area)\n FROM facts\n WHERE name != 'World' AND name != 'Antarctica');\n") # In[20]: get_ipython().run_cell_magic('sql', '', "SELECT *, ROUND(CAST(population AS float)/area, 2) AS 'density ratio'\n FROM facts\n WHERE (population/area) <= (SELECT AVG(population)/AVG(area)\n FROM facts\n WHERE name != 'World' AND name != 'Antarctica')\n ORDER BY population/area ASC;\n") # In[21]: get_ipython().run_cell_magic('sql', '', "SELECT COUNT(*)\n FROM facts\n WHERE (population/area) <= (SELECT AVG(population)/AVG(area)\n FROM facts\n WHERE name != 'World' AND name != 'Antarctica');\n") # In[22]: get_ipython().run_cell_magic('sql', '', "SELECT *, ROUND(CAST(population AS float)/area, 2) AS 'density ratio'\n FROM facts\n WHERE population > (SELECT AVG(population)\n FROM facts\n WHERE name != 'World' AND name != 'Antarctica')\n AND area < (SELECT AVG(area)\n FROM facts\n WHERE name != 'World' AND name != 'Antarctica')\n ORDER BY population/area DESC;\n") # In[23]: get_ipython().run_cell_magic('sql', '', "SELECT COUNT(*)\n FROM facts\n WHERE (population > (SELECT AVG(population)\n FROM facts\n WHERE name != 'World' AND name != 'Antarctica'))\n AND (area < (SELECT AVG(area)\n FROM facts\n WHERE name != 'World' AND name != 'Antarctica'));\n") # In[24]: get_ipython().run_cell_magic('sql', '', "SELECT *, ROUND(CAST(population AS float)/area, 2) AS 'density ratio'\n FROM facts\n WHERE population <= (SELECT AVG(population)\n FROM facts\n WHERE name != 'World' AND name != 'Antarctica')\n AND area >= (SELECT AVG(area)\n FROM facts\n WHERE name != 'World' AND name != 'Antarctica')\n ORDER BY population/area ASC;\n") # In[25]: get_ipython().run_cell_magic('sql', '', "SELECT COUNT(*)\n FROM facts\n WHERE population <= (SELECT AVG(population)\n FROM facts\n WHERE name != 'World' AND name != 'Antarctica')\n AND area >= (SELECT AVG(area)\n FROM facts\n WHERE name != 'World' AND name != 'Antarctica');\n") # **Observations** # The two methods produce fairly different results, with some overlap. The largest difference is the capture size of the two groups, which subsequently gives different countries. # * The first method, comparing a country specific density ratio to average density ratio, yields 146 countries above average and 90 countries below or equal to average for a total of 236. # * Method two, using a double criteria match, yields 14 countries with high population and low area and 21 countries with low population and high area. This gives a total of 35 countries, a much smaller selection. # # Taking a closer look at the whole dataset shows 20 countries that list population as zero or none, leaving only five countries out of the group produced by the first method. These most likely have an area value of `none`, like the entry for [Saint Bathelemy](https://en.wikipedia.org/wiki/Saint_Barth%C3%A9lemy). Which incidentally looks like an awesome place to go visit! Here is a table with the results. The upper and lower bounds for each subgroup are provided. # | Method 1 | | || Method 2 | | | # | --- | --- | --- || --- | --- | --- | # | City | population : area | ratio || City | population : area | ratio | # | [Macau](https://en.wikipedia.org/wiki/Macau) | 592,731 : 28 | 21,168.96 || [Bangladesh](https://en.wikipedia.org/wiki/Bangladesh) | 168,957,745 : 148,460 | 1138.07 | # | [Norfolk Island](https://en.wikipedia.org/wiki/Norfolk_Island) | 2,210 : 36 | 61.38 || [Morocco](https://en.wikipedia.org/wiki/Morocco) | 33,322,699 : 446,300 | 74.66 | # | [Greenland](https://en.wikipedia.org/wiki/Greenland) | 57,733 : 2,166,086 | 0.03 || [Greenland](https://en.wikipedia.org/wiki/Greenland) | 57,733 : 2,166,086 | 0.03 | # | [Tajikistan](https://en.wikipedia.org/wiki/Tajikistan) | 8,191,958 : 144,100 | 56.85 || [Madagascar](https://en.wikipedia.org/wiki/Madagascar) | 2,381,268 : 587,041 | 4.06 | # ## Conclusions # # This is just an quick overview of the information contained in this dataset. Overall it seems like an interesting product that could be combined with other data to yield results with increased depth. That being said, the context of the source needs to be taken into account. # ## Citations # # # 1) Central Intelligence Agency. (n.d.). The World Factbook. Central Intelligence Agency. Retrieved August 30, 2022, from https://www.cia.gov/the-world-factbook/ # 2) Wikimedia Foundation. (2022, August 10). The World Factbook. Wikipedia. Retrieved August 30, 2022, from https://en.wikipedia.org/wiki/The_World_Factbook # 3) Shepard, A. C. (2010, June 2). CIA gets numbers wrong on Jewish settlers. NPR. Retrieved August 30, 2022, from https://www.npr.org/sections/publiceditor/2010/06/02/127349281/cia-gets-numbers-wrong-on-jewish-settlers