Generating Markov Chains¶

A Markov chain is a mathematical system that transitions from one state to another within a finite set of states. It is characterized by the memoryless property, which means that the next state depends only on the current state and not on the sequence of events that preceded it.

https://simple.wikipedia.org/wiki/Markov_chain

Generating Text with Markov Chains¶

One fun application of Markov chains is to generate text that sounds like it was written by a particular author. If we have lots of text written by someone with a distinct voice and vocabulary, we can use it to calculate probabilities that one word will follow the next, then we can use those probabilities to "autocomple" next words.

Here's how it works:

States: The words in the text are treated as states.

Transitions: The transitions between states are the occurrences of words following each other in the text. Each word has a probability of being followed by other words, based on how often that transition occurs in the corpus.

Transition Probabilities: By analyzing the text, we calculate the probabilities of each word following another word. This is done by counting how many times each word follows another and then converting these counts into probabilities.

Text Generation: Starting with an initial word (state), the Markov chain uses the transition probabilities to randomly choose the next word.

This process is repeated to generate a sequence of words, creating a new piece of text that mimics the style and structure of the original corpus.

Example:

Here's a very simple example using simple python that you should be able to read and recognize.

import random

# Step 1: Initialize a list of words
word_list = [
    "the", "cat", "in", "the", "hat", "the", "cat", "on", "the", "mat",
    "the", "dog", "and", "the", "cat", "the", "cat", "and", "the", "hat"
]

# Step 2: Calculate the transition probabilities
transitions = {}

# Populate the transitions dictionary with counts
for i in range(len(word_list) - 1):
    current_word = word_list[i]
    next_word = word_list[i + 1]
    if current_word not in transitions:
        transitions[current_word] = {}
    if next_word not in transitions[current_word]:
        transitions[current_word][next_word] = 0
    transitions[current_word][next_word] += 1

# Convert counts to probabilities
for current_word, next_words in transitions.items():
    total = sum(next_words.values())
    for next_word in next_words:
        transitions[current_word][next_word] /= total

# Step 3: Implement the function to generate the next word
def next_word(current_word):
    if current_word not in transitions:
        return random.choice(list(transitions.keys()))
    probabilities = transitions[current_word]
    next_words = list(probabilities.keys())
    probabilities = list(prbabilities.values())
    return random.choices(next_words, probabilities)[0]

# Step 4: Generate a sequence of words
def generate_sequence(start_word, length):
    sequence = [start_word]
    current_word = start_word
    for _ in range(length - 1):
        current_word = next_word(current_word)
        sequence.append(current_word)
    return sequence

# Example usage
start_word = "the"
length = 20
sequence = generate_sequence(start_word, length)
print("Generated sequence:", sequence)

Assignment¶

Copy and paste the above code into a code block and test it out. Read it and try and understand what's happening.
Replace the word_list with one you'll create on your own.

Load the songs.csv dataset from the shared folder.
```
from google.colab import drive
import pandas as pd

drive.mount('/content/gdrive')
df = pd.read_csv('/content/gdrive/My Drive/datasets/songs.csv')
```
This dataset is a collection of songs and lyrics by different artists.

Filter the dataset down to a single artist and extract all of the text into a list of words.

Here's an example of filtering the dataset to only taylor swift.
```
taylor_df = df[df["Artist"] == "Taylor Swift"]
taylor_df.head()
```
And here's an example of extracting her lyrics into one long word_list.
```
taylor_words = taylor_df["Lyrics"].str.cat(sep=" ").split()
```
Use the above code, dataset, and your own modifications to generate a song written by the artist of your choice from this dataset.

In [2]:

import random

# Step 1: Initialize a list of words
word_list = [
    "the", "cat", "in", "the", "hat", "the", "cat", "on", "the", "mat",
    "the", "dog", "and", "the", "cat", "the", "cat", "and", "the", "hat"
]

# Step 2: Calculate the transition probabilities
transitions = {}

# Populate the transitions dictionary with counts
for i in range(len(word_list) - 1):
    current_word = word_list[i]
    next_word = word_list[i + 1]
    if current_word not in transitions:
        transitions[current_word] = {}
    if next_word not in transitions[current_word]:
        transitions[current_word][next_word] = 0
    transitions[current_word][next_word] += 1

# Convert counts to probabilities
for current_word, next_words in transitions.items():
    total = sum(next_words.values())
    for next_word in next_words:
        transitions[current_word][next_word] /= total

# Step 3: Implement the function to generate the next word
def next_word(current_word):
    if current_word not in transitions:
        return random.choice(list(transitions.keys()))
    probabilities = transitions[current_word]
    next_words = list(probabilities.keys())
    probabilities = list(probabilities.values())
    return random.choices(next_words, probabilities)[0]

# Step 4: Generate a sequence of words
def generate_sequence(start_word, length):
    sequence = [start_word]
    current_word = start_word
    for _ in range(length - 1):
        current_word = next_word(current_word)
        sequence.append(current_word)
    return sequence

# Example usage
start_word = "the"
length = 20
sequence = generate_sequence(start_word, length)
print("Generated sequence:", sequence)

Generated sequence: ['the', 'hat', 'the', 'dog', 'and', 'the', 'mat', 'the', 'dog', 'and', 'the', 'cat', 'the', 'cat', 'the', 'hat', 'the', 'cat', 'on', 'the']