This notebook shows a typical example of data loading and preprocessing necessary for NLP. In this case we are loading a corpus downloaded from the Hip-Hop Lyrics webpage www.ohhla.com. Our primary goal is to provide a dataset loading function for the language modelling chapter in this book.
We provide the corpus in the data directory. As this notebook lives in a sub-directory itself, we access it via ../data. Before preprocessing all files and providing generic loaders, it is useful to inspect the format of the files using a specific example file, and to develop the loading process in this context. Here we look at ../data/ohhla/train/www.ohhla.com/anonymous/j_live/allabove/satisfy.jlv.txt.html.
with open('../data/ohhla/train/www.ohhla.com/anonymous/j_live/allabove/satisfy.jlv.txt.html', 'r') as f:
    # we use read().splitlines() instead of readlines() to skip newline characters
    lines = f.read().splitlines()
lines
['<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">', '<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">', '', '<head>', ' <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />', ' <meta http-equiv="charset" content="ISO-8859-1" />', ' <meta http-equiv="content-language" content="English" />', ' <meta http-equiv="vw96.object type" content="Document" />', ' <meta name="resource-type" content="document" />', ' <meta name="distribution" content="Global" />', ' <meta name="rating" content="General" />', ' <meta name="robots" content="all" />', ' <meta name="revist-after" content="2 days" />', ' <link rel="shortcut icon" href="../../../favicon.ico" />', ' <title>The Original Hip-Hop (Rap) Lyrics Archive</title>', '', ' <!-- link rel="stylesheet" type="text/css" href="http://ohhla.com/files/main.css" / -->', '\t <!-- BEGIN SITE ANALYTICS //-->', ' <script type="text/javascript">', " if (typeof siteanalytics == 'undefined') { siteanalytics = {}; };", " siteanalytics.gn_tracking_defaults = {google: '',comscore: {},quantcast:''};", ' siteanalytics.website_id = 180;', " siteanalytics.cdn_hostname = 'cdn.siteanalytics.evolvemediametrics.com';", ' </script>', ' <script type="text/javascript" src=\'http://cdn.siteanalytics.evolvemediametrics.com/js/siteanalytics.js\'></script>', ' <!-- END SITE ANALYTICS //-->', '', '</head>', '', '<body>', '', '<a href="javascript: history.go(-1)">Back to the previous page</a>', '', '<div>', '</div>', '', '<div style="width: 720px; text-align: center; color: #ff0000; font-weight: bold; font-size: 1em;">', '', ' <!-- AddThis Button BEGIN -->', ' <div class="addthis_toolbox addthis_default_style" style="margin: auto 0 auto 0; padding-left: 185px;">', ' <a class="addthis_button_facebook_like" fb:like:layout="button_count"></a>', ' <a class="addthis_button_tweet"></a>', ' <a class="addthis_button_google_plusone" g:plusone:size="medium"></a>', ' <a class="addthis_counter 
addthis_pill_style"></a>', ' </div>', ' <script type="text/javascript" src="http://s7.addthis.com/js/250/addthis_widget.js#pubid=ra-4e8ea9f77f69af2f"></script>', ' <!-- AddThis Button END -->', '', '</div>', '', '<br />', '', '<div style="float: left; min-width: 560px;">', '<pre>', 'Artist: J-Live', 'Album: All of the Above', 'Song: Satisfied', 'Typed by: Burnout678@aol.com', '', 'Hey yo', 'Lights, camera, tragedy, comedy, romance', 'You better dance from your fighting stance', "Or you'll never have a fighting chance", 'In the rat race', "Where the referee's son started way in advance", "But still you livin' the American Dream", "Silk PJ's, sheets and down pillows", 'Who the fuck would wanna wake up?', 'You got it good like hot sex after the break up', "Your four car garage it's just more space to take up", 'You even bought your mom a new whip scrap the jalopy', 'Thousand dollar habit, million dollar hobby', 'You a success story everybody wanna copy', 'But few work for it, most get jerked for it', "If you think that you could ignore it, you're ig-norant", 'A fat wallet still never made a man free', 'They say to eat good, yo, you gotta swallow your pride', "But dead that game plan, I'm not satisfied", '', '[Chorus]', 'The poor get worked, the rich get richer', 'The world gets worse, do you get the picture?', 'The poor gets dead, the rich get depressed', 'The ugly get mad, the pretty get stressed', 'The ugly get violent, the pretty get gone', 'The old get stiff, the young get stepped on', 'Whoever told you that it was all good lied', 'So throw your fists up if you not satisfied', '', '{*Singing*}', 'Are you satisfied?', "I'm not satisfied", '', "Hey yo, the air's still stale", "The anthrax got my Ole Earth wearin' a mask and gloves to get a meal ", 'I know a older guy that lost twelve close peeps on 9-1-1', "While you kickin' up punchlines and puns", 'Man fuck that shit, this is serious biz', "By the time Bush is done, you won't know what time it is", "If it's war 
time or jail time, time for promises", 'And time to figure out where the enemy is', 'The same devils that you used to love to hate', 'They got you so gassed and shook now, you scared to debate', 'The same ones that traded books for guns', 'Smuggled drugs for funds', "And had fun lettin' off forty-one", "But now it's all about NYPD caps ", 'And Pentagon bumper stickers', 'But yo, you still a nigga', "It ain't right them cops and them firemen died", "The shit is real tragic, but it damn sure ain't magic", "It won't make the brutality disappear", "It won't pull equality from behind your ear", "It won't make a difference in a two-party country", 'If the president cheats, to win another four years', "Now don't get me wrong, there's no place I'd rather be", "The grass ain't greener on the other genocide", "But tell Huey Freeman don't forget to cut the lawn", 'And uproot the weeds', "Cuz I'm not satisfied", '', '[Chorus]', '', '{*Singing*}', 'All this genocide', 'Is not justified', 'Are you satisfied?', "I'm not satisfied", '', 'Yo, poison pushers making paper off of pipe dreams', 'They turned hip-hop to a get-rich-quick scheme', "The rich minorities control the gov'ment", 'But they would have you believe we on the same team', 'So where you stand, huh?', 'What do you stand for?', "Sit your ass down if you don't know the answer", 'Serious as cancer, this jam demands your undivided attention', 'Even on the dance floor', 'Grab the bull by the horns, the bucks by the antlers', "Get yours, what're you sweatin' the next man for?", 'Get down, feel good to this, let it ride', "But until we all free, I'll never be satisfied", '', '[Chorus] - Repeat 2x', '', '{*Singing with talking in background*}', 'Are you satisfied? 
', '(whoever told you that it was all good lied)', "I'm not satisfied ", '(Throw your fists up if you not satisfied)', 'Are you satisfied?', '(Whoever told you that it was all good lied)', "I'm not satisfied ", '(So throw your fists up)', '(So throw your fists up)', '(Throw your fists up)</pre>', '</div>', '', '<div style="float: left;">', '</div>', '', '</body></html>']
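The comment in the cell above contrasts read().splitlines() with readlines(). The difference can be seen on a small in-memory example; the sample text here is made up for illustration and is not read from the corpus.

```python
import io

# A tiny in-memory stand-in for a lyrics file (hypothetical content).
sample = io.StringIO("Artist: J-Live\nSong: Satisfied\n\nHey yo\n")

# readlines() keeps the trailing newline on every line.
with_newlines = sample.readlines()

# read().splitlines() drops the line terminators.
sample.seek(0)
without_newlines = sample.read().splitlines()

print(with_newlines)
print(without_newlines)
```

Dropping the terminators at this point means later steps can join the lines with any separator they like.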
We would first like to remove everything outside of the <pre> tag, and then remove the meta information (the artist, album, song and transcriber lines).
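def find_lyrics(lines):
    filtered = []
    in_pre = False
    for line in lines:
        if '<pre>' in line:
            # opening tag: start collecting, keeping any text that shares the line
            in_pre = True
            filtered.append(line.replace("<pre>",""))
        elif '</pre>' in line:
            # closing tag: keep the text before it, then stop collecting
            in_pre = False
            filtered.append(line.replace("</pre>",""))
        elif in_pre:
            filtered.append(line)
    # skip the emptied <pre> line, the four meta lines and the blank line after them
    return filtered[6:]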
lyrics = find_lyrics(lines)
lyrics[:10]
['Hey yo', 'Lights, camera, tragedy, comedy, romance', 'You better dance from your fighting stance', "Or you'll never have a fighting chance", 'In the rat race', "Where the referee's son started way in advance", "But still you livin' the American Dream", "Silk PJ's, sheets and down pillows", 'Who the fuck would wanna wake up?', 'You got it good like hot sex after the break up']
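The fixed slice filtered[6:] in find_lyrics assumes that every file has exactly the same header layout as this example. A more defensive variant might skip header lines by their prefix instead. The sketch below is only an illustration: the helper name strip_header and the list of prefixes are assumptions based on the single file inspected above, not something verified across the corpus.

```python
# Hypothetical header fields, taken from the one example file shown above.
HEADER_PREFIXES = ('Artist:', 'Album:', 'Song:', 'Typed by:')

def strip_header(lines):
    """Drop leading blank and metadata lines; keep everything from the first lyric line on."""
    for i, line in enumerate(lines):
        stripped = line.strip()
        if stripped and not stripped.startswith(HEADER_PREFIXES):
            return lines[i:]
    return []  # nothing but header and blank lines

example = ['', 'Artist: J-Live', 'Album: All of the Above', 'Song: Satisfied',
           'Typed by: (name)', '', 'Hey yo', 'In the rat race']
body = strip_header(example)
print(body)
```

Only the leading block is inspected, so a lyric line further down that happens to start with "Song:" would be kept.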
Finally, we would like to convert the list of lines into a single string, as this will be easier for our language models to process. We will also mark lyrical "bars" (lines) using a BAR tag, so that the rhythmic structure of the song is still captured.
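One way to check that such a tagging scheme loses no information is to invert it. The following is a minimal sketch under the assumption that no lyric line itself contains the literal text [BAR] or [/BAR]; the helper names tag_bars and untag_bars are hypothetical and are not used elsewhere in this notebook.

```python
import re

def tag_bars(lines):
    """Wrap every line in [BAR]...[/BAR] and concatenate the results."""
    return '[BAR]' + '[/BAR][BAR]'.join(lines) + '[/BAR]'

# Non-greedy match so that each bar is captured separately.
BAR_PATTERN = re.compile(r'\[BAR\](.*?)\[/BAR\]')

def untag_bars(tagged):
    """Recover the list of lines from a tagged string."""
    return BAR_PATTERN.findall(tagged)

bars = ['Hey yo', '', 'In the rat race']
tagged = tag_bars(bars)
recovered = untag_bars(tagged)
print(tagged)
print(recovered)
```

Empty lines survive the round trip as empty bars, so stanza breaks remain visible in the tagged string.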
string = '[BAR]' + '[/BAR][BAR]'.join(lyrics) + '[/BAR]'
string[:500]
"[BAR]Hey yo[/BAR][BAR]Lights, camera, tragedy, comedy, romance[/BAR][BAR]You better dance from your fighting stance[/BAR][BAR]Or you'll never have a fighting chance[/BAR][BAR]In the rat race[/BAR][BAR]Where the referee's son started way in advance[/BAR][BAR]But still you livin' the American Dream[/BAR][BAR]Silk PJ's, sheets and down pillows[/BAR][BAR]Who the fuck would wanna wake up?[/BAR][BAR]You got it good like hot sex after the break up[/BAR][BAR]Your four car garage it's just more space to "
We are now ready to provide a loading function.
def load_song(file_name):
    def load_raw(encoding):
        with open(file_name, 'r', encoding=encoding) as f:
            # we use read().splitlines() instead of readlines() to skip newline characters
            lines = f.read().splitlines()
            # some files are pure txt files for which we don't need to extract the lyrics
            lyrics = find_lyrics(lines) if file_name.endswith('html') else lines[5:]
            string = '[BAR]' + '[/BAR][BAR]'.join(lyrics) + '[/BAR]'
            return string
    try:
        return load_raw('utf-8')
    except UnicodeDecodeError:
        try:
            return load_raw('cp1252')
        except UnicodeDecodeError:
            print("Could not load " + file_name)
            return ""
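The nested try/except in load_song tries UTF-8 first and falls back to cp1252. The same idea can be written as a loop over candidate encodings, shown here on raw bytes rather than files. The function name decode_with_fallback and the extra latin-1 fallback are assumptions added for illustration; latin-1 maps every byte to a character, so it never fails, although the result may not be the text the author intended.

```python
def decode_with_fallback(raw, encodings=('utf-8', 'cp1252', 'latin-1')):
    """Return (text, encoding) for the first encoding that decodes raw without error."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError('none of the given encodings could decode the input')

utf8_bytes = 'caf\u00e9'.encode('utf-8')  # valid UTF-8
cp1252_bytes = b'caf\xe9'                 # invalid as UTF-8, valid in cp1252
odd_bytes = b'caf\x81'                    # 0x81 is undefined in cp1252

for raw in (utf8_bytes, cp1252_bytes, odd_bytes):
    print(decode_with_fallback(raw))
```

The order matters: UTF-8 is strict enough that a successful decode is strong evidence, whereas the single-byte encodings accept almost any input and therefore belong at the end.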
song = load_song('../data/ohhla/train/www.ohhla.com/anonymous/j_live/allabove/satisfy.jlv.txt.html')
song[:500]
"[BAR]Hey yo[/BAR][BAR]Lights, camera, tragedy, comedy, romance[/BAR][BAR]You better dance from your fighting stance[/BAR][BAR]Or you'll never have a fighting chance[/BAR][BAR]In the rat race[/BAR][BAR]Where the referee's son started way in advance[/BAR][BAR]But still you livin' the American Dream[/BAR][BAR]Silk PJ's, sheets and down pillows[/BAR][BAR]Who the fuck would wanna wake up?[/BAR][BAR]You got it good like hot sex after the break up[/BAR][BAR]Your four car garage it's just more space to "
Now we want to load several files from an album directory.
from os import listdir
from os.path import isfile, join
def load_album(path):
    # we filter out directories, and files that don't look like song files in OHHLA.
    onlyfiles = [join(path, f) for f in listdir(path) if isfile(join(path, f)) and 'txt' in f]
    lyrics = [load_song(f) for f in onlyfiles]
    return lyrics
songs = load_album('../data/ohhla/train/www.ohhla.com/anonymous/j_live/SPTA/')
[len(s) for s in songs]
[2555, 2779, 3283]
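The filter 'txt' in f accepts any file name that contains those three letters, and listdir returns entries in arbitrary order. If a stricter and reproducible listing is wanted, something like the following sketch could be used. It assumes that song files end in .txt or .txt.html, which matches the files seen so far but is not checked against the whole corpus; the helper name song_files is hypothetical, and the demo runs on a throwaway temporary directory rather than the real data.

```python
import tempfile
from pathlib import Path

def song_files(album_dir):
    """Return sorted paths of regular files ending in .txt or .txt.html."""
    return sorted(p for p in Path(album_dir).iterdir()
                  if p.is_file() and p.name.endswith(('.txt', '.txt.html')))

# Build a small fake album directory to exercise the filter.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / 'one.jlv.txt').write_text('a')
    (root / 'two.jlv.txt.html').write_text('b')
    (root / 'txt_notes.md').write_text('c')  # contains 'txt' but is not a song file
    (root / 'subdir').mkdir()
    found = [p.name for p in song_files(root)]

print(found)
```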
We will also make it easy to load several albums. Then, for a few artists, we provide shortcuts to the album directories we care about.
def load_albums(album_paths):
    return [song
            for path in album_paths
            for song in load_album(path)]
top_dir = '../data/ohhla/train/www.ohhla.com/anonymous/'
# top_dir already ends in '/', so the album paths are appended without a leading slash
j_live = [
    top_dir + 'j_live/allabove/',
    top_dir + 'j_live/bestpart/'
]
len(load_albums(j_live))
29
It will be useful to convert a list of documents into a flat list of tokens. Based on the approach shown in the tokenisation chapter, we can do this as follows:
import re
# raw string, so that the backslashes reach the regex engine unchanged
token = re.compile(r"\[BAR\]|\[/BAR\]|[\w-]+|'m|'t|'ll|'ve|'d|'s|'")
def words(docs, bars=True):
    return [word
            for doc in docs
            for word in token.findall(doc)
            if bars or word not in ["[BAR]", "[/BAR]"]]
song_words = words(songs)
song_words[:20]
['[BAR]', 'J-Live', '[/BAR]', '[BAR]', 'Well', 'if', 'isn', "'t", 'the', 'outbreak', 'monkey', 'for', 'that', 'latest', 'epidemic', 'of', 'The', 'Vapors', '[/BAR]', '[BAR]']
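To see how the tokeniser treats contractions and bar tags, it helps to run the same pattern on a short hand-written string. The sketch below copies the regular expression from above into a standalone snippet; the names TOKEN and tokenize and the sample document are made up for illustration.

```python
import re

# Same alternatives as the pattern above: bar tags, words (with hyphens),
# common clitics, and a bare apostrophe as the last resort.
TOKEN = re.compile(r"\[BAR\]|\[/BAR\]|[\w-]+|'m|'t|'ll|'ve|'d|'s|'")

def tokenize(doc, bars=True):
    """Tokenise one document, optionally dropping the bar tags."""
    return [tok for tok in TOKEN.findall(doc)
            if bars or tok not in ('[BAR]', '[/BAR]')]

doc = "[BAR]I'm not satisfied[/BAR][BAR]It won't pull equality[/BAR]"
with_bars = tokenize(doc)
without_bars = tokenize(doc, bars=False)
print(with_bars)
print(without_bars)
```

Characters that match none of the alternatives, such as commas and question marks, are skipped by findall and so never appear as tokens.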
Finally, we provide a function that can load all songs within a top-level directory, descending recursively into its sub-directories.
def load_all_songs(path):
    only_files = [join(path, f) for f in listdir(path) if isfile(join(path, f)) and 'txt' in f]
    only_paths = [join(path, f) for f in listdir(path) if not isfile(join(path, f))]
    lyrics = [load_song(f) for f in only_files]
    sub_songs = [song for sub_path in only_paths for song in load_all_songs(sub_path)]
    return lyrics + sub_songs
len(load_all_songs("../data/ohhla/train/www.ohhla.com/anonymous/j_live/"))
50
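The recursion in load_all_songs can also be expressed with os.walk from the standard library, which visits a directory tree top-down. The sketch below only collects file paths and leaves the loading itself out. The helper name find_song_files is hypothetical, the 'txt' filter mirrors the one used above, and the demo builds its own temporary directory tree instead of touching the corpus.

```python
import os
import tempfile

def find_song_files(top):
    """Collect paths of all files under top whose names contain 'txt'."""
    paths = []
    for dirpath, dirnames, filenames in os.walk(top):
        dirnames.sort()  # sorting in place makes the traversal order stable
        for name in sorted(filenames):
            if 'txt' in name:
                paths.append(os.path.join(dirpath, name))
    return paths

with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, 'artist', 'album_a'))
    os.makedirs(os.path.join(tmp, 'artist', 'album_b'))
    for rel in ('artist/album_a/s1.txt', 'artist/album_b/s2.txt.html', 'artist/readme.md'):
        with open(os.path.join(tmp, *rel.split('/')), 'w') as f:
            f.write('x')
    found = [os.path.relpath(p, tmp).replace(os.sep, '/')
             for p in find_song_files(tmp)]

print(found)
```

Like the recursive version, this lists a directory's own files before those of its sub-directories.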