#!/usr/bin/env python
# coding: utf-8

# # Coincidence, probability and randomness
#
# By [Allison Parrish](http://www.decontextualize.com/)
#
# This is a [Jupyter Notebook](https://jupyter-notebook.readthedocs.io/en/stable/) with my notes on randomness, probability, and the sensation of "coincidence." The notebook has written notes and Python code that you can execute. (You don't need to be able to read or write Python code to follow along with the notebook.)
#
# These notes are still a work in progress! I originally wrote these notes to go along with an in-class presentation, and they're not quite "freestanding" at this point.

# ## Randomness and coincidence in design
#
# From [Does your iPod play favorites?](https://www.newsweek.com/does-your-ipod-play-favorites-116739):
#
# ![ipod shuffle](https://upload.wikimedia.org/wikipedia/commons/1/15/Ipod-shuffle-usb-connector.jpg)
#
# > Last spring it dawned on Apple CEO Steve Jobs that the heart of his hit iPod digital music player was the "shuffle." [...] [Jobs] used the idea as the design principle of the new low-cost iPod Shuffle. Its ad slogan celebrates the serendipity music lovers embrace when their songs are reordered by chance--"Life is random." [...] But just about everyone who has an iPod has wondered how random the iPod shuffle function really is. [...] More than a year ago, I outlined these concerns to Jobs; he dialed up **an engineer who insisted that shuffle played no favorites.** Since then, however, millions of new Podders have started shuffling, and the question has been discussed in newspapers, blogs and countless conversations. It's taking on Oliver Stone-like conspiracy buzz. [...] Apple execs profess amusement. "It's part of the magic of shuffle," says Greg Joswiak, the VP for iPod products. Still, I asked him last week to double-check with the engineers. They flatly assured him that **"Random is random,"** and the algorithm that does the shuffling has been tested and reverified.
#
# Compare with e.g. [Divination: Shufflemancy](https://wolfofantimonyoccultism.com/2017/07/11/shufflemancy/):
#
# > Shufflemancy is a type of divination that works by receiving messages in the forms of songs in a playlist. In this form of divination the practitioner will ask a question, or for guidance, and then activate shuffle in a playlist of some type. This playlist will then begin to play a song and from the song the practitioner will get a message....
#
# From [Sid Meier and Rob Pardo on Probability and Player Psychology](https://www.shacknews.com/article/62807/sid-meier-and-rob-pardo):
#
# ![Civilization Revolution](https://upload.wikimedia.org/wikipedia/en/3/32/Civilization_Revolution_Game_Cover.png)
#
# > When designing the combat system in Civilization: Revolution, Sid Meier found himself up against some interesting design problems. [...] In Civ Rev, the strength of units were displayed up front to players before battle to show the odds of victory. For example, an attacking unit might be rated at 1.5 with the defending unit at 0.5. This is a 3-to-1 situation. [...] [T]he testers **expected to win this battle every time despite there being a 25% chance of losing** each time. [...] When the player was presented with 3-to-1 or 4-to-1 odds, they expect to win. With 2-to-1 odds, the player will accept losing some of the time, but expect to win at 20-to-10, which is just a larger expression of 2-to-1. When the numbers get larger, the perceived advantage grows.
# > To adjust for this, Sid actually changed the math again so that the **outcomes of previous battles are taken into account.** He found that if a player lost too many 2-to-1 battles in a row, they would get frustrated. Instead of risking a player shutting the game down, Sid changed the math.
#
# ![World of Warcraft](https://upload.wikimedia.org/wikipedia/en/9/91/WoW_Box_Art1.jpg)
#
# > [Blizzard Designer] Rob Pardo discussed ... the perception of bonuses versus punishments. [...] [P]layers would get into cold streaks where they went a long time without getting a quest item drop. **Instead of correctly attributing this to randomness, they would blame the math or random number generator.** To get around this, Rob and his team actually changed the code to increase the drop rate after each kill until it hits 100% and then reset it for the next one.
#
# > It's interesting that players don't want to accept math when it doesn't work out for them, but are more than happy to accept it when it rewards them. Purists might dislike Sid Meier changing the math to appease players, but the game became more fun. Blizzard opts to package systems as a reward over penalties and takes a little randomness out of the equation.
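# The Shacknews article doesn't give Blizzard's actual numbers or code, so the cell below is just a minimal sketch of the general technique Rob Pardo describes (bumping an item's drop chance after every miss, and resetting it once the item drops), with made-up rates:

# In[ ]:

import random

def kills_until_drop(base_rate=0.1, step=0.1):
    """Simulate kills until a quest item drops, raising the drop
    chance by `step` after every miss. (Rates here are made up.)"""
    rate = base_rate
    kills = 0
    while True:
        kills += 1
        if random.random() < min(rate, 1.0):
            return kills  # the item dropped; the rate would reset here
        rate += step

# with these made-up numbers, the drop chance hits 100% by the tenth
# kill, so no cold streak can last longer than ten kills
print(max(kills_until_drop() for trial in range(10000)))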
# ## Demonstrating principles of probability with Tarot
#
# Below, I use Tarot in order to demonstrate some elementary principles of probability—and use elementary principles of probability in order to talk about Tarot. I'm especially concerned with the idea of *coincidence* in Tarot. Often in my own Tarot practice, I see patterns emerge—certain cards that come up over and over, for example, or certain suits that seem to be overrepresented—and attribute meaning to those patterns. My hope is that in understanding how likely these patterns are, I can be better at assigning them meaning.
#
# > Terminology note: In the text below, I refer to a sequence of cards drawn from a Tarot deck as a "spread." This is just a convenience—spreads are, of course, more complicated than that, and their meaning derives not just from the sequence of cards but also from the cards' spatial relationships to one another, etc.

# ### Loading the data

# In[1]:

from tqdm.notebook import tqdm
import json

# Here I'm loading in [a JSON file with names and interpretations for the 78 cards in a typical Tarot deck](https://github.com/dariusk/corpora/blob/master/data/divination/tarot_interpretations.json).

# In[2]:

cards_raw = json.load(open("tarot_interpretations.json"))['tarot_interpretations']

# The code below makes a list of every card in the deck. Each item in the list is a dictionary that contains the name, rank, and suit of the card in question. If you're not familiar with Python programming, you can think of this as essentially a spreadsheet with 78 rows (one for each card) and three columns (name, rank, suit). Most of the code in the rest of the notebook is going to be selecting rows from this spreadsheet at random.

# In[3]:

cards = [{k: item[k] for k in ['name', 'rank', 'suit']} for item in cards_raw]

# In[4]:

cards

# The `len()` function tells us how many cards are in the list:

# In[5]:

len(cards)

# Seventy-eight, as you'd expect for a Tarot deck. We're going to use this number a lot, so I'm going to assign its value to a variable `n` that I can reuse below. Whenever you see `n` in the cells below, it's referring back to the number of cards in the deck.

# In[6]:

n = len(cards)

# ### Picking one card at random
#
# Python lets us pick things at random from a list in a few ways. All of them require the `random` module, which we can make available in the notebook like so:

# In[7]:

import random

# To pick one card, use `random.choice()`:

# In[8]:

random.choice(cards)

# Think of a card in your mind. The probability that the card in a single-card spread will match the card you're thinking of is $\frac{1}{n}$, where `n` is the number of cards in the deck. Here's how to calculate this in Python:

# In[9]:

1 / n

# In other words, there's a slightly better than 1% chance that the card you're thinking of will be on the top of the deck. (Assuming that the deck has been shuffled fairly, etc.)
#
# To make sure this math is right, I like to run "simulations" in Python. In a simulation, I perform the task whose probability I want to calculate over and over, incrementing a counter whenever the event in question happens. Afterwards, I can divide the number of successful trials by the total number of trials in order to get the frequency of successes. The code below does this for our simple one-card spread:

# In[10]:

trials = 100000
matches = 0
# do the following 100k times...
for i in tqdm(range(trials)):
    # pick a card
    drawn = random.choice(cards)
    # increment matches if name matches the card we're
    # looking for
    if drawn['name'] == 'The Tower':
        matches += 1
print(matches / trials)

# When you run this cell, you should see a number that looks more or less similar to the number that we calculated above (~0.0128). As you increase the number of trials, the resulting frequency should theoretically converge on the calculated probability. Try increasing the `trials` variable and see for yourself.
#
# > Python syntax notes: Comments begin with `#`. Code blocks are indicated not with curly braces (`{` and `}`) as in JavaScript, C, etc., but with indentation.

# The chance of drawing a card that belongs to a particular category can be calculated in a similar way. If `c` is the number of cards in that category, the probability that a single card you draw will belong to that category is $\frac{c}{n}$.
#
# To demonstrate, the code below makes a new list of cards that contains *only* the major arcana:

# In[11]:

major_arcana = [item for item in cards if item['suit'] == 'major']

# There are this many major arcana:

# In[12]:

len(major_arcana)

# A value I'll assign to the variable `n_major`:

# In[13]:

n_major = len(major_arcana)

# Performing the calculation below, we can see that any given one-card draw will be a major arcana card about 28% of the time:

# In[14]:

n_major / n
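# A quick simulation, in the same spirit as the one above, to double-check this figure (this cell is my addition):

# In[ ]:

trials = 100000
major_count = 0
for i in tqdm(range(trials)):
    # count how often a single random draw is a major arcana card
    if random.choice(cards)['suit'] == 'major':
        major_count += 1
print(major_count / trials)  # should be close to 22/78, about 0.28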
# ### Picking multiple cards
#
# Python has a function `random.sample()` that lets us pick items at random from a list *without replacement*. "Without replacement" here means that once an item has been picked, it won't be picked again in the same sampling process, so we won't see any duplicate cards. This is a good way of simulating a simple Tarot spread:

# In[15]:

random.sample(cards, 3)

# (Change the number after the comma to pick a different number of cards.)
#
# Now think of *two* Tarot cards. The chance that a two-card spread matches both of them, in order, is calculated as the chance of drawing one of the cards from the full deck (i.e., $\frac{1}{n}$), multiplied by the chance of drawing the other card from a deck that is missing one card (because, in this scenario, the first card is no longer in the deck)—i.e., $\frac{1}{n-1}$. Multiplying across gives us $\frac{1}{n(n-1)}$, which can be written like so in Python:

# In[16]:

1 / (n * (n - 1))

# This tells us that the cards in a two-card draw will match the cards you're thinking of only about once in every six thousand draws. The following cell simulates this scenario, finding how many two-card spreads are The Star followed by the Hierophant:

# In[17]:

trials = 100000
target = ['The Star', 'The Pope/Hierophant']
matches = 0
for i in tqdm(range(trials)):
    if [item['name'] for item in random.sample(cards, 2)] == target:
        matches += 1
print(matches / trials)

# For three cards, the formula would be $\frac{1}{n(n-1)(n-2)}$; for four, $\frac{1}{n(n-1)(n-2)(n-3)}$; and so forth. There's a more general formulation for arbitrary numbers of cards in the spread below.

# ### Permutations
#
# A Tarot spread is an example of a *permutation without repetition*, meaning a sequence of items drawn at random from a discrete set, where the order in which the items are drawn is important. Python comes with a function called `permutations` that produces all of the possible permutations for a given list of things.

# In[18]:

from itertools import permutations

# To make this demonstration a bit cleaner, I'm going to make a list of all of the card names:

# In[19]:

card_names = [item['name'] for item in cards]

# There is exactly one permutation of zero cards (i.e., no cards at all):

# In[20]:

list(permutations(card_names, 0))

# And there are *n* permutations of one card (i.e., every card is its own single-card spread):

# In[21]:

print(list(permutations(card_names, 1)))

# But there are 6006 permutations of *two* cards, of which I show just the first 100 below:

# In[22]:

print(list(permutations(card_names, 2))[:100])

# The number 6006 results from multiplying our `n` by `n - 1`:

# In[23]:

n * (n - 1)

# Here's a chunk of the permutations of three cards:

# In[24]:

print(list(permutations(card_names, 3))[:100])

# Using the `permutations` function, we can find the total number of permutations for spreads of any size:

# In[25]:

for i in range(4):
    print(i, "cards ->", len(list(permutations(cards, i))), "permutations")

# Actually, "any" size isn't right! The `permutations()` function gives an exhaustive list of all possible permutations, and with 78 cards in the deck, the number of permutations in a spread of five cards is so large that it hangs the CPU on my computer just to generate it. If we're just interested in the number of permutations, we don't have to list them. Instead, we can calculate them with the following formula:
#
# $$\frac{n!}{(n-r)!}$$
#
# The `!` is the *factorial* operator, which evaluates to a number multiplied by every positive integer smaller than it (e.g., 4! = 4 × 3 × 2 × 1 = 24). The following code implements a function `npr()` that takes the number of items in the deck (`n`) and the number to sample at once (`r`), and returns the total possible number of permutations:

# In[26]:

from math import factorial

def npr(n, r):
    return factorial(n) // factorial(n - r)

# Using this, we can see that the number of possible permutations of a Tarot deck increases dramatically as you draw more and more cards:

# In[27]:

for i in range(10):
    print(i, "cards ->", npr(n, i), "permutations")
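# An aside I've added: if you're running Python 3.8 or later, the standard library's `math.perm()` computes this same count, which makes for a handy cross-check of our hand-rolled `npr()`:

# In[ ]:

from math import perm

# both of these should report 456456 three-card permutations
print(perm(n, 3), npr(n, 3))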
# One divided by the number of permutations gives us the probability of any one particular spread. For example, the probability of any one particular three-card spread:

# In[28]:

1 / npr(n, 3)

# Or without scientific notation:

# In[29]:

print(f'{1/npr(n,3):.10f}')

# This is a really surprising result to me! Three-card spreads to me have such an even texture that it *feels* like they tell me the same thing over and over. But each particular three-card spread has less than a one-in-400,000 chance of turning up on any given draw.

# Here's an empirical trial of the formula above. This will take a while, since we have to run several million simulations if we want to have a chance of finding a matching spread:

# In[32]:

trials = 2000000
target = ['The Star', 'Temperance', 'The Devil']
matches = 0
for i in tqdm(range(trials)):
    if [c['name'] for c in random.sample(cards, 3)] == target:
        matches += 1
print(matches, "match(es);", matches/trials, "frequency")

# There are this many ten-card spreads:

# In[33]:

npr(n, 10)

# That's 4.5 quintillion! In other words, there is *this* probability of drawing any particular sequence of ten Tarot cards:

# In[34]:

print(f'{1/npr(n,10):.25f}')

# Meaning that any ten-card spread is almost certainly unique.

# ### Combinations
#
# There are some patterns in Tarot spreads that aren't about the order of the cards. We might, for example, observe that there are an unusual number of cards of a particular suit in a spread, or that major arcana have been showing up with unusual frequency in daily readings over the course of a week. But how do we know if these frequencies are actually unusual?
#
# Let's start with the following question: How likely is it that *every card* in a three-card spread belongs to the suit of cups? The probability that a single card belongs to cups is the number of cards in the suit of cups divided by the number of cards total ($\frac{14}{78}$):

# In[35]:

14 / n

# (Meaning that about 18% of cards you draw will be cups.)
#
# The probability of drawing *two* cards that are cups is $\frac{14}{78} \times \frac{13}{77}$ (because after having drawn the first card, there are only 13 cups and 77 cards left in the deck):

# In[36]:

(14 / n) * (13 / (n - 1))

# Following the same pattern, the chance of all cards being cups in a three-card spread is $\frac{14}{78} \times \frac{13}{77} \times \frac{12}{76}$:

# In[37]:

(14 / n) * (13 / (n - 1)) * (12 / (n - 2))

# ... or around five spreads out of every thousand you draw. The following cell tests this empirically:

# In[38]:

trials = 100000
all_cup_count = 0
for i in tqdm(range(trials)):
    # get just the suit for three random cards
    c_suits = [c['suit'] for c in random.sample(cards, 3)]
    # if all are cups, up the count
    if c_suits == ['cups', 'cups', 'cups']:
        all_cup_count += 1
print(all_cup_count / trials)

# If you do a three-card spread every day, you're likely to get (on average) 1.7 all-cups spreads per year:

# In[39]:

(all_cup_count / trials) * 365
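# The arithmetic above generalizes neatly. Here's a small helper I've added (the name `prob_all_one_suit` is mine, not from the original notes) for the probability that every card in a k-card spread comes from one particular suit:

# In[ ]:

def prob_all_one_suit(suit_size, k, deck_size):
    """Probability that all k cards drawn (without replacement)
    belong to one particular suit of `suit_size` cards."""
    p = 1.0
    for i in range(k):
        p *= (suit_size - i) / (deck_size - i)
    return p

# the same ~0.0048 figure we calculated for three cups above
print(prob_all_one_suit(14, 3, n))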
# Another way to state this problem is this: there are a certain number of different ways that three cards can be drawn from a Tarot deck. In how many of those are all of the cards cups? In this case, we don't care about the *order* of the cards, so we're talking about *combinations* (rather than permutations). The formula for the number of combinations of a particular size is:
#
# $$\binom{n}{k} = \frac{n!}{k!(n-k)!}$$

# The following cell has an implementation of this function in Python, `ncr()` (short for "given *n* items, choose *r*"):

# In[40]:

import operator as op
from functools import reduce

def ncr(n, r):
    r = min(r, n - r)
    numer = reduce(op.mul, range(n, n - r, -1), 1)
    denom = reduce(op.mul, range(1, r + 1), 1)
    return numer // denom

# The function tells us how many combinations of two cards there are:

# In[41]:

ncr(n, 2)

# This is, notably, exactly half of the number of permutations:

# In[42]:

npr(n, 2)

# ... which makes sense. If you aren't worrying about the order of the cards, then every pair of cards accounts for exactly two permutations (one for each ordering), so there are half as many combinations.
#
# The number of three-card combinations:

# In[43]:

ncr(n, 3)

# Each suit in Tarot has 14 cards. The `ncr()` function lets us calculate how many combinations of three cards from a group of fourteen there are:

# In[44]:

ncr(14, 3)

# Dividing the latter by the former gives us the same number that we calculated above for the probability of getting a spread with three cups:

# In[45]:

ncr(14, 3) / ncr(n, 3)

# The probability of dealing a five-card spread consisting entirely of major arcana (of which there are 22):

# In[46]:

ncr(22, 5) / ncr(n, 5)

# An empirical trial of same:

# In[47]:

trials = 100000
matches = 0
for i in tqdm(range(trials)):
    c_suits = [c['suit'] for c in random.sample(cards, 5)]
    if c_suits.count('major') == 5:
        matches += 1
print(matches / trials)

# ### Any card in a spread
#
# There are a certain number of three-card spreads:

# In[48]:

ncr(n, 3)

# If the deck had one fewer card, there would be fewer possible spreads:

# In[49]:

ncr(n - 1, 3)

# So, the probability of any one card being absent from a three-card spread can be calculated by dividing the latter by the former:

# In[50]:

ncr(n - 1, 3) / ncr(n, 3)

# This means that there is a 96% chance that any card you're thinking of will *not* show up in a three-card spread. Subtract this from one to get the probability that the opposite occurs, i.e., the probability that any given card *will* show up in a three-card spread:

# In[51]:

1 - (ncr(n - 1, 3) / ncr(n, 3))

# Let's run a simulation to make sure this is the case. In the following cell, we deal three cards and then check to see if any of them are The Star:

# In[52]:

trials = 100000
star_count = 0
for i in tqdm(range(trials)):
    c_names = [c['name'] for c in random.sample(cards, 3)]
    if 'The Star' in c_names:
        star_count += 1
print(star_count / trials)

# Looks about right! The function below wraps up this expression to make it easy to calculate the chance that a given card will be in a spread of a particular size. (Incidentally, this expression simplifies to $\frac{r}{n}$: a specific card is equally likely to end up anywhere in the shuffled deck, and an r-card spread exposes r of the deck's n positions.)

# In[53]:

def chance_in_spread(n, r):
    return 1 - (ncr(n - 1, r) / ncr(n, r))

# Note that as the size of the spread goes up, so do the chances of any given card showing up in the spread:

# In[54]:

chance_in_spread(n, 10)

# Notably, this means that while there is a relatively low chance for the same card to occur in two subsequent three-card spreads:

# In[55]:

chance_in_spread(n, 3) * chance_in_spread(n, 3)

# ... there's about a 1.6% chance that the same card will show up in two 10-card spreads in a row:

# In[57]:

chance_in_spread(n, 10) * chance_in_spread(n, 10)

# Empirical test of same:

# In[58]:

trials = 100000
tower_count = 0
for i in tqdm(range(trials)):
    c_names1 = [c['name'] for c in random.sample(cards, 10)]
    c_names2 = [c['name'] for c in random.sample(cards, 10)]
    if 'The Tower' in c_names1 and 'The Tower' in c_names2:
        tower_count += 1
print(tower_count / trials)
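# One more added aside: just as `math.perm()` matched `npr()`, Python 3.8 and later also ships `math.comb()`, which should agree with our hand-rolled `ncr()`:

# In[ ]:

from math import comb

# both of these should report 76076 three-card combinations
print(comb(n, 3), ncr(n, 3))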
# ## Random distributions
#
# Often when we think of random numbers, we think of rolling a die. Each side has the same probability of coming up. If you rolled a six-sided die a million times, for example, you'd expect each side to come up roughly the same number of times ($\frac{1000000}{6}$, or about 166667). The following cell performs this experiment, then counts up which sides of the die came up most frequently:

# In[59]:

from collections import Counter

rolls = [random.randrange(6)+1 for i in range(1000000)]
Counter(rolls).most_common()

# As you can see, even with a million rolls, the numbers don't quite match our estimate, but they're pretty close.
#
# It turns out that this is just one kind of randomness. Other phenomena in the world also produce random outcomes, but they don't always look like the outcomes of rolling dice.
#
# The following cell has some code to help display graphs, which I'm going to do fairly frequently in the rest of the notebook.

# In[60]:

get_ipython().run_line_magic('matplotlib', 'inline')
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')
plt.rcParams["figure.figsize"] = (10, 4)

# ### Uniform random numbers
#
# The kind of randomness produced by rolling dice is called *uniform randomness*. It's called uniform because no outcome is more likely than any other. The following function produces ten uniformly distributed random numbers between 0 and 1:

# In[61]:

np.random.uniform(size=(10,))

# As you generate more and more uniformly distributed numbers, the distribution of those numbers tends to flatten out. The following cell shows a series of [histograms](https://en.wikipedia.org/wiki/Histogram) with more and more uniform random numbers being generated:

# In[63]:

for i in range(1, 7):
    plt.figure(figsize=(6,2))
    plt.title("with 10^%d samples" % i)
    plt.hist(np.random.uniform(size=(10**i,)), bins=20)
    plt.show()

# By the time we've generated $10^6$ samples, the histogram looks flat.
#
# Notably, the *sum* of two dice does *not* follow a uniform distribution. The code in the following cell rolls two dice ten thousand times, then counts up the most common numbers that the values of the dice add up to:

# In[64]:

rolls = [sum([random.randrange(6)+1 for j in range(2)]) for i in range(10000)]
Counter(rolls).most_common()

# As you can see, 6, 7 and 8 are more likely than the other numbers. The following cell uses histograms to show the distribution of the sums of two uniformly distributed random numbers, drawing progressively more samples:

# In[65]:

for i in range(1, 6):
    plt.figure(figsize=(6,2))
    plt.title("with 10^%d samples" % i)
    plt.hist(np.random.uniform(size=(10**i,)) + np.random.uniform(size=(10**i,)), bins=20)
    plt.show()

# This distribution begins to resemble a pyramid, with the most likely values in the center of the range. This should be intuitive to anyone who has played a board game that uses the combined values of two dice as a gameplay mechanic. In games like this, you'll often find that dice rolls of 2 or 12 have special rules attached to them (because they're so rare).
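# To make the pyramid concrete, here's a short added cell that enumerates all thirty-six equally likely pairs of dice and counts the ways each sum can occur: 7 can happen six ways, while 2 and 12 can each happen only one way.

# In[ ]:

from collections import Counter

# enumerate all 36 equally likely (die one, die two) pairs
ways = Counter(a + b for a in range(1, 7) for b in range(1, 7))
for total in sorted(ways):
    print(total, "->", ways[total], "way(s) out of 36")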
# ### Normal distribution
#
# We don't have to stop at summing the rolls of two dice. The code in the following cell sums up the values of an arbitrary number of dice with an arbitrary number of sides, and then tallies the most common outcomes for the given number of trials:

# In[66]:

def get_rolls(sides, dice, trials):
    rolls = [sum([random.randrange(sides)+1 for j in range(dice)])
             for i in range(trials)]
    return Counter(rolls).most_common()

# So, for example, the most common sums of three six-sided dice over a few thousand rolls:

# In[67]:

get_rolls(6, 3, 10000)

# Again, you'll see clustering around values halfway between the minimum and maximum possible sums. The cell below defines a function that plots these values onto a histogram:

# In[68]:

def plot_int_counts(counts):
    nums = np.zeros(shape=(max([item[0] for item in counts])+1,))
    for k, v in counts:
        nums[k] = v
    plt.bar(range(nums.shape[0]), nums)
    plt.show()

# When you plot this histogram, you'll notice that the shape is no longer like a pyramid—instead it seems to have a bit of a curve at the top:

# In[69]:

plot_int_counts(get_rolls(6, 3, 100000))

# Plotting with even more dice gives us an even curvier curve:

# In[70]:

plot_int_counts(get_rolls(6, 6, 100000))

# The sums of uniform random numbers approximate another distribution of random numbers, called the *normal* (or Gaussian) distribution. Random numbers with a normal distribution cluster around a particular value (the *mean*, or center, of the distribution), and that cluster has a particular spread (the *standard deviation* of the distribution).
#
# There's a function in Python to generate normally distributed random numbers:

# In[71]:

np.random.normal(0, 1, size=24)

# The first parameter is the mean and the second is the standard deviation. You can change these to move and widen the cluster:

# In[72]:

np.random.normal(6, 6, size=24)

# ### Graphing different kinds of random
#
# Plotting pairs of uniform random numbers as (x, y) points fills the square more or less evenly:

# In[73]:

num_points = 10000
plt.figure(figsize=(5,5))
plt.scatter(np.random.uniform(size=(num_points,)), np.random.uniform(size=(num_points,)))
plt.show()

# Pairs of normally distributed random numbers, on the other hand, cluster around the center:

# In[74]:

num_points = 10000
plt.figure(figsize=(5,5))
plt.scatter(np.random.normal(size=(num_points,)), np.random.normal(size=(num_points,)))
plt.show()

# ### Normal and uniform data
#
# Shapes like these show up in real-world data, too. Here are the lengths of the names in a JSON file of first names:

# In[75]:

name_data = json.load(open("firstNames.json"))['firstNames']

# In[76]:

name_lengths = [len(name) for name in name_data]

# In[77]:

plot_int_counts(Counter(name_lengths).most_common())

# And the lengths of every word in the text of *Frankenstein*:

# In[78]:

words = open("frankenstein.txt").read().split()

# In[79]:

word_lengths = [len(word) for word in words]

# In[80]:

plot_int_counts(Counter(word_lengths).most_common())

# Counting how often each distinct word *occurs*, however, produces a very different shape. The cells below count the words and index the 250 most common by rank:

# In[81]:

word_counts = Counter(words)

# In[82]:

word_counts_indexed = [(i, count) for i, (word, count) in enumerate(word_counts.most_common(250))]

# ### Power law distribution
#
# Word frequencies don't cluster around a central value at all: a handful of words are overwhelmingly common, and the counts fall off in a long tail. Distributions with this shape are often described as following a *power law*:

# In[83]:

plot_int_counts(word_counts_indexed)
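# A quick added check: plotted on log-log axes, a power law comes out as a (roughly) straight line, quite unlike the humps above:

# In[ ]:

counts = [count for i, count in word_counts_indexed]
plt.figure(figsize=(5, 4))
plt.loglog(range(1, len(counts) + 1), counts)
plt.xlabel("rank of word")
plt.ylabel("number of occurrences")
plt.show()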
# The following cells use data gathered from in-class activities: the lengths of participants' full names, and the last digits of their dates of birth. (Name lengths should cluster like the normal-ish distributions above; last digits of birth dates should look roughly uniform.)

# In[84]:

len("AllisonParrish")

# In[85]:

name_lengths = [14, 14, 20, 12, 12, 11, 15, 12, 14, 10, 7, 11, 14, 10, 7]

# In[86]:

plot_int_counts(Counter(name_lengths).most_common())

# In[87]:

digits = [8, 1, 8, 7, 3, 8, 4, 9, 4, 5, 0, 7, 4, 8, 8]

# In[88]:

plot_int_counts(Counter(digits).most_common())

# ## Hot hand
#
# From Gilovich, Thomas, et al. ["The Hot Hand in Basketball: On the Misperception of Random Sequences."](https://www.sciencedirect.com/science/article/pii/0010028585900106) *Cognitive Psychology*, vol. 17, no. 3, 1985, p. 311:
#
# > [T]he tendency to perceive a sequence as streak shooting decreases with the probability of alternation. [...] The sequences selected as best examples of chance shooting had probabilities of alternation of 0.7 and 0.8 rather than 0.5. Furthermore, the sequence with the probability of alternation of 0.5 (the proper example of chance shooting) was classified as chance shooting only by 32% of subjects, whereas 62% identified it as an example of streak shooting.
#
# > Evidently, **people tend to perceive chance shooting as streak shooting,** and they expect sequences exemplifying chance shooting to contain many more alternations than would actually be produced by a random (chance) process. Thus, people "see" a positive serial correlation in independent sequences, and they fail to detect a negative serial correlation in alternating sequences. Hence, people not only perceive random sequences as positively correlated, they also perceive negatively correlated sequences as random.
#
# [Interactive demo here](https://alpha.editor.p5js.org/allison.parrish/sketches/B1eFfCqal)
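# To get a feel for what the paper is describing, here's a minimal sketch (my addition, not Gilovich et al.'s materials) that generates hit/miss sequences with a given probability of alternation. Compare how "streaky" the 0.5 sequences look next to the 0.7 and 0.8 ones:

# In[ ]:

def shot_sequence(p_alternate, length=21):
    """Generate a hit (X) / miss (O) sequence in which each shot
    differs from the previous one with probability p_alternate."""
    seq = [random.choice("XO")]
    for i in range(length - 1):
        if random.random() < p_alternate:
            seq.append("O" if seq[-1] == "X" else "X")
        else:
            seq.append(seq[-1])
    return "".join(seq)

for p in (0.4, 0.5, 0.7, 0.8):
    print(p, shot_sequence(p))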