Extra Credit Lab: More Python!


The exercises this week deal with some features that we haven't otherwise covered. There are links to specific parts of the Python for Everyone book in each question, but here's a more general reference:

A Note on Boilerplate

One of the major benefits of tools like Jupyter is that they make it easy to strip away the "boilerplate" that comes with programming languages; some of these exercises rely on a little more of that boilerplate than in weeks past. If you get confused or something stops working, email your TA or come to office hours.

Exercise 1

sections to reference: opening files

This one definitely falls in the category of "pointless exercises". You're going to make a program that prompts for a filename. If the file exists, quit the program! Don't print anything out, don't do anything with the file, don't do anything. Just quit. If the file doesn't exist, complain, and prompt the user for a filename again. Do this until the user gives you a valid filename or pours water on their computer in frustration. (Just kidding. Don't pour water on your computer.)

You're going to want to put the filename prompt inside a loop, so you can ask the user for a filename as many times as you want. But what kind of loop? And how will you get out of the loop once you've got a valid file?
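If you get stuck, here's a sketch of one way to structure it. The function name and the exact messages are my own choices, not requirements:

```python
def get_valid_filename():
    # Keep asking forever; we only escape by returning.
    while True:
        name = input("Enter a filename: ")
        try:
            with open(name):
                pass          # the file opened, so it exists
            return name       # returning "quits" the loop
        except FileNotFoundError:
            print("No file named", name, "-- try again!")
```

The `while True` loop runs until something breaks it; here, the `return` is the escape hatch, and it only runs if `open()` didn't raise an error.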


In [ ]:

Exercise 2

sections to reference: opening files

Write a program that asks the user for a file and simply prints the entire contents of the file out. If the user enters a file that doesn't exist, give the user an error and prompt for a filename again.

Hint: You can do the printing in one line of code.
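As a nudge in the right direction, here's a sketch of just the printing part (the prompting loop can work like Exercise 1; `print_file` is my own name for it):

```python
def print_file(name):
    with open(name) as fh:
        print(fh.read())   # read() grabs the whole file at once
```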

In [ ]:

Exercise 3

sections to reference: parsing strings

This exercise deals with a text file, spam1.txt. You can open that file in a new tab so you can get a sense for what it looks like and what you're going to be dealing with.

You're going to be writing a program that deals with a list of spam confidence values. The file, spam1.txt, looks something like this:

X-DSPAM-Confidence: 0.8475
X-DSPAM-Confidence: 0.9288
X-DSPAM-Confidence: 0.0129
X-DSPAM-Confidence: 0.5102
X-DSPAM-Confidence: 0.8912
...

There are 500 spam confidence values in the file spam1.txt. I want you to tell me two things about the values:

  1. The average of all of the spam values in the file, and
  2. How many emails have a spam value of over 95% (so, >0.95).

You'll need to read the file line by line and process each line so you can get the float.

After you're done with that program, make a Markdown cell below your code and tell me, in a few sentences, what the average spam confidence means. By way of explanation, the spam value is generated by an algorithm (individual email providers keep their algorithms closely guarded, but they're all pretty much based on Bayesian math, if you're interested). That algorithm rates any given email from 0 to 1 on how confident it is that the email is spam: 1 means the algorithm thinks the email definitely is spam, and 0 means it thinks there's no chance it's spam.

So, once you have your average value, what does that mean for any given email in the set? What are the chances it'll be spam or not?
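One possible shape for the program, sketched as a function (the name `spam_stats` is mine; you don't have to structure yours this way):

```python
def spam_stats(filename):
    total = 0.0   # running sum of all confidence values
    count = 0     # how many values we've seen
    high = 0      # how many values are over 0.95
    for line in open(filename):
        # Each line looks like "X-DSPAM-Confidence: 0.8475";
        # the number is everything after the colon.
        value = float(line.split(":")[1])
        total += value
        count += 1
        if value > 0.95:
            high += 1
    return total / count, high
```

Note that `float()` happily ignores the leading space left over from the split.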

In [ ]:

Exercise 4

sections to reference: searching through a file

This exercise is a continuation of the previous one. You've got a new text file to process, called spam2.txt. It has a bunch of spam confidence values, but it's also got some other information, too, like the sender, message subject, and the date and time that it's been sent. A sample record from the file looks like this:

From [email protected] Fri Jan 4 18:10:48 2008
Return-Path:
X-Sieve: CMU Sieve 2.3
Message-ID: <[email protected]>
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
Date: Fri 4 Jan 2008 18:08:57 -0500
To: [email protected]
Subject: [sakai] svn commit: r39771 - in bspace/site-manage/sakai_2-4-x/site-manage-tool/tool/src: bundle java/org/sakaiproject/site/tool
X-Content-Type-Outer-Envelope: text/plain; charset=UTF-8
X-Content-Type-Message-Body: text/plain; charset=UTF-8
Content-Type: text/plain; charset=UTF-8
X-DSPAM-Result: Innocent
X-DSPAM-Processed: Fri Jan 4 18:10:48 2008
X-DSPAM-Confidence: 0.6178
X-DSPAM-Probability: 0.0000

There is a lot of information here, and this, if you can believe it, is a stripped-down version of what the email server records look like.

spam2.txt contains a bunch of email records. You're going to write a program, similar to Exercise 3, wherein you grab the spam confidence from each email. Like before, you're going to take the average of these values. When you encounter an email with a spam confidence greater than 95%, though, you're going to print out the email address that sent the email, the email address that received it, the date and time that it was sent, and the spam confidence value.

Note: Don't simply print out the entire lines of the file that have that information; get the specific items that you want and format them nicely.

In [ ]:

Exercise 5

sections to reference: list methods

Below are the lines of a program I wrote to add some things to a list of my favorite foods, sort the list, and then print them out in alphabetical order. The problem is that, when I was uploading this program to the supercomputer, my internet stopped working and my program's lines got all mixed up.

Your task is this: rearrange the lines below so the program sorts and then prints the elements in my list of foods. When you've got the order all sorted out, put it in the code cell below. And thanks for your help!

for food in my_fave_foods:
print(food)
my_fave_foods.sort()
my_fave_foods.append("asiago cheese")
my_fave_foods = ["serrano peppers", "bananas", "apple pie", "veal", "smoked brisket"]
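For reference, a correctly ordered version would look something like this (including the sort and print steps the description calls for):

```python
# Build the list first, then modify it, then sort, then print.
my_fave_foods = ["serrano peppers", "bananas", "apple pie", "veal", "smoked brisket"]
my_fave_foods.append("asiago cheese")
my_fave_foods.sort()              # sorts the list in place, alphabetically
for food in my_fave_foods:
    print(food)
```

Remember that the loop body has to be indented under the `for` line.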
In [ ]:

Exercise 6

sections to reference: lists

So, now that you're an expert in sorting lists of food, let's get a little bit more literary. richard3.txt contains the text of Richard's opening soliloquy from Shakespeare's play Richard III.

You can watch a performance of it here, if you're interested.

You're going to write a program that reads in the soliloquy from the file line by line. Then, using the split function, split each line into a list of words. Create a list to contain all of your words. Going word by word, check if the word is already in the list. If not, add it.

Once you've gone through the whole file, sort the list of words in alphabetical order, and print it out.
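The steps above can be sketched like so; wrapping them in a function named `unique_words` is my own choice:

```python
def unique_words(filename):
    words = []
    for line in open(filename):
        for word in line.split():     # split() breaks the line on whitespace
            if word not in words:     # only keep the first occurrence
                words.append(word)
    words.sort()
    return words
```

One wrinkle: Python sorts capital letters before lowercase ones, so "Now" will come before "and" in your output. That's fine for this exercise.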

In [ ]:

Exercise 7

sections to reference: dictionaries and files

In the previous exercise, you just discarded words that were duplicates, but what if you wanted to count them?

You're going to read in richard3.txt again, but this time use a dictionary. Iterate through the lines in the file and, for each word, if this is its first occurrence, add the word as a key to the dictionary with the number 1 as its value. If the word already exists in the dictionary, add one to its value.

Then, print out your word-frequency dictionary.
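The counting loop described above might be sketched like this (the function name is mine, and printing the result is left to the caller):

```python
def word_counts(filename):
    counts = {}
    for line in open(filename):
        for word in line.split():
            if word not in counts:
                counts[word] = 1      # first time we've seen this word
            else:
                counts[word] += 1     # seen it before, bump the count
    return counts
```

You could then show the result with `print(word_counts("richard3.txt"))`.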

In [ ]: