Benford's Law

Purpose

This notebook takes an iterable of values (assumed to be numbers) and plots the frequency of their leading digits. According to Benford's Law (also called the first-digit law), a "natural dataset" should show the following distribution of leading digits:

d P(d)
1 30.1%
2 17.6%
3 12.5%
4 9.7%
5 7.9%
6 6.7%
7 5.8%
8 5.1%
9 4.6%
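
The percentages in the table follow from the formula P(d) = log10(1 + 1/d). As a quick check, the short sketch below recomputes them; the name `benford` is only illustrative and is not used elsewhere in this notebook.

```python
import math

# Expected share of each leading digit d under Benford's Law:
# P(d) = log10(1 + 1/d), for d = 1..9.
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

for d, p in benford.items():
    print(f"{d}  {p:.1%}")
```

The nine probabilities sum to 1, because the product of the terms (1 + 1/d) for d = 1..9 telescopes to 10.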

Application

In data science, deviations from this pattern are used to flag possible fraud, primarily for tax purposes. The same check can also be used to detect deepfakes or altered images.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline
In [2]:
world = pd.read_csv('world_population_data.csv')
world.head()
Out[2]:
Country Population_2020 Yearly_Change Net_Change Density Land_Area Migrants Fert_Rate Med_Age Urban_Pop World_Share
0 China 1,439,323,776 0.39% 5,540,090 153 9,388,211 -348,399 1.7 38 61% 18.47%
1 India 1,380,004,385 0.99% 13,586,631 464 2,973,190 -532,687 2.2 28 35% 17.70%
2 United States 331,002,651 0.59% 1,937,734 36 9,147,420 954,806 1.8 38 83% 4.25%
3 Indonesia 273,523,615 1.07% 2,898,047 151 1,811,570 -98,955 2.3 30 56% 3.51%
4 Pakistan 220,892,340 2.00% 4,327,022 287 770,880 -233,379 3.6 23 35% 2.83%
In [3]:
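def digit_widget(values):
    """Plot the relative frequency of the leading digits (1-9) in `values`."""
    leading_digits = []
    for value in values:
        # The first character in 1-9 is the leading digit. This skips signs,
        # currency symbols and leading zeros, and ignores entries with no
        # nonzero digit at all (such as 'nan' or '0').
        for char in str(value):
            if char in '123456789':
                leading_digits.append(int(char))
                break
    # Count each digit explicitly so a digit that never occurs still gets a bar.
    counts = [leading_digits.count(d) for d in range(1, 10)]
    total = max(len(leading_digits), 1)  # avoid dividing by zero on empty input
    fig, ax = plt.subplots()
    ax.bar(range(1, 10), [count / total for count in counts])
    ax.set_xticks(range(1, 10))
    ax.set_yticks([0.10, 0.20, 0.30])
    plt.show()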
In [4]:
digit_widget(world['Population_2020'])
In [5]:
digit_widget(world['Migrants'])
In [6]:
digit_widget(world['Net_Change'])
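
The plots can also be checked numerically by placing the observed leading-digit shares next to the expected ones. The following is a minimal sketch under stated assumptions: `benford_table` is a hypothetical helper, not part of the notebook above, and the sample is just the five population figures shown in the `world.head()` output, which is far too few rows to judge conformity and only illustrates the mechanics.

```python
import math
import pandas as pd

def benford_table(values):
    """Return observed vs. expected leading-digit shares for `values` (hypothetical helper)."""
    # Take the first character in 1-9 from each value's text form;
    # entries with no such character (for example 'nan') are dropped.
    text = pd.Series(values).astype(str)
    digits = text.str.extract(r"([1-9])")[0].dropna().astype(int)
    # Share of each digit, with 0.0 for digits that never occur.
    observed = digits.value_counts(normalize=True).reindex(range(1, 10), fill_value=0.0)
    # Benford's Law: P(d) = log10(1 + 1/d).
    expected = pd.Series({d: math.log10(1 + 1 / d) for d in range(1, 10)})
    return pd.DataFrame({"observed": observed, "expected": expected})

# The five populations shown in the world.head() output.
sample = ["1,439,323,776", "1,380,004,385", "331,002,651", "273,523,615", "220,892,340"]
table = benford_table(sample)
print(table.round(3))
```

With the full dataset loaded, the same call could be made on a whole column, for example `benford_table(world['Population_2020'])`.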