Written by Makzan. Last updated at 2021 March.
In this lecture, we will explore reading and writing plain text files.
We will learn:
We can further improve the storage by using CSV file format.
import csv
with open("quotes.csv") as file_obj:
csv_reader = csv.reader(file_obj)
for line in csv_reader:
print(line)
['I want to put a ding in the universe.', 'Steve Jobs'] ['Life is 10% what happens to you and 90% how you react to it.', 'Charles R. Swindoll'] ["Family is not an important thing. It's everything.", 'Michael J. Fox'] ["Nothing is impossible,the word itself says 'I'm possible'!", 'Audrey Hepburn'] ['There are two ways of spreading light: to be the candle or the mirror that reflects it.', 'Edith Wharton'] ["Try to be a rainbow in someone's cloud.", 'Maya Angelou'] ['Be brave enough to live life creatively. The creative place where no one else has ever been.', 'Alan Alda'] ['The secret of getting ahead is getting started.', 'Mark Twain']
When using Pandas, we can read the CSV file into DataFrame
.
import pandas as pd
pd.read_csv("quotes.csv")
I want to put a ding in the universe. | Steve Jobs | |
---|---|---|
0 | Life is 10% what happens to you and 90% how yo... | Charles R. Swindoll |
1 | Family is not an important thing. It's everyth... | Michael J. Fox |
2 | Nothing is impossible,the word itself says 'I'... | Audrey Hepburn |
3 | There are two ways of spreading light: to be t... | Edith Wharton |
4 | Try to be a rainbow in someone's cloud. | Maya Angelou |
5 | Be brave enough to live life creatively. The c... | Alan Alda |
6 | The secret of getting ahead is getting started. | Mark Twain |
If you encounter error saying that pandas not found. You may need to install pandas via pip install pandas
. If you’re using Anaconda, it already comes Pandas built-in.
I have prepared a CSV file named [ABCD]2020-07-23.csv
. It is the CSV file from MPay every day listing all the transactions.
import pandas as pd
df = pd.read_csv('[ABCD]2020-07-23.csv')
Let’s inspect the result of read_csv. We call it data frame.
df
S/N | Merchant No. | Terminal No. | Transaction Channel | Transaction Type | Account No | Transaction ID | Channel Transaction ID | Merchant Transaction ID | Transaction Amount | Merchant Coupon Deduction | MacauPass Coupon Deduction | Preferential Amount Merchants | Preferential Amount Macaupass | Actual Transaction Amount | Offsetting Amount | Settlement Amount | Transaction Time | Settlement Date | Remark | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | ABCD | 123 | MacauPass | Pay(Off Line) | 6045543888 | 62 | NaN | NaN | -22.0 | 0.0 | 0.0 | 0.0 | 0.0 | -22.0 | 0.0 | -22.0 | 2020-07-22 11:06:58 | 2020-07-23 | NaN |
1 | 2 | ABCD | 123 | MacauPass | Pay(Off Line) | 6032888888 | 64 | NaN | NaN | -16.0 | 0.0 | 0.0 | 0.0 | 0.0 | -16.0 | 0.0 | -16.0 | 2020-07-22 13:25:02 | 2020-07-23 | NaN |
2 | 3 | ABCD | 123 | MPay | Pay(Off Line) | 66****00 | 202007228888888F1D014100000065 | 2020072288888880080962 | NaN | -20.0 | 0.0 | 0.0 | 0.0 | 0.0 | -20.0 | 0.0 | -20.0 | 2020-07-22 13:34:03 | 2020-07-23 | NaN |
3 | 4 | ABCD | 123 | MPay | Pay(Off Line) | 65****00 | 202007228888888F1D014100000066 | 2020072288888880097014 | NaN | -24.0 | 0.0 | 0.0 | 0.0 | 0.0 | -24.0 | 0.0 | -24.0 | 2020-07-22 14:20:07 | 2020-07-23 | NaN |
4 | 5 | ABCD | 123 | MPay | Pay(Off Line) | 66****20 | 202007228888888F1D014100000067 | 2020072288888880099085 | NaN | -21.6 | 0.0 | 0.0 | 0.0 | 0.0 | -21.6 | 0.0 | -21.6 | 2020-07-22 14:27:18 | 2020-07-23 | NaN |
5 | 6 | ABCD | 123 | MPay | Pay(Off Line) | 66****60 | 202007228888888F1D014100000068 | 2020072288888880127523 | NaN | -27.0 | 0.0 | 0.0 | 0.0 | 0.0 | -27.0 | 0.0 | -27.0 | 2020-07-22 16:35:43 | 2020-07-23 | NaN |
6 | 7 | ABCD | 123 | MacauPass | Pay(Off Line) | 6043633088 | 69 | NaN | NaN | -115.0 | 0.0 | 0.0 | 0.0 | 0.0 | -115.0 | 0.0 | -115.0 | 2020-07-22 17:07:06 | 2020-07-23 | NaN |
7 | 8 | ABCD | 123 | MPay | Pay(Off Line) | 66****50 | 202007228888888F1D014100000070 | 2020072288888880146235 | NaN | -27.0 | 0.0 | 0.0 | 0.0 | 0.0 | -27.0 | 0.0 | -27.0 | 2020-07-22 17:52:08 | 2020-07-23 | NaN |
8 | 9 | ABCD | 123 | MPay | Pay(Off Line) | 66****80 | 202007228888888F1D014100000071 | 2020072288888880148711 | NaN | -49.5 | 0.0 | 0.0 | 0.0 | 0.0 | -49.5 | 0.0 | -49.5 | 2020-07-22 18:00:53 | 2020-07-23 | NaN |
9 | 10 | ABCD | 123 | MacauPass | Pay(Off Line) | 6023249388 | 72 | NaN | NaN | -44.0 | 0.0 | 0.0 | 0.0 | 0.0 | -44.0 | 0.0 | -44.0 | 2020-07-22 18:10:49 | 2020-07-23 | NaN |
10 | 11 | ABCD | 123 | MPay | Pay(Off Line) | 66****80 | 202007228888888F1D014100000073 | 2020072288888880161439 | NaN | -44.0 | 0.0 | 0.0 | 0.0 | 0.0 | -44.0 | 0.0 | -44.0 | 2020-07-22 18:40:18 | 2020-07-23 | NaN |
11 | 12 | ABCD | 123 | MPay | Pay(Off Line) | 66****70 | 202007228888888F1D014100000074 | 2020072288888880162919 | NaN | -28.8 | 0.0 | 0.0 | 0.0 | 0.0 | -28.8 | 0.0 | -28.8 | 2020-07-22 18:44:43 | 2020-07-23 | NaN |
12 | 13 | ABCD | 123 | MPay | Pay(Off Line) | 62****80 | 202007228888888F1D014100000075 | 2020072288888880164109 | NaN | -72.0 | 0.0 | 3.0 | 0.0 | 0.0 | -69.0 | 0.0 | -72.0 | 2020-07-22 18:48:24 | 2020-07-23 | CouponNo(0300850035833016) |
What are the transaction amount?
df["Transaction Amount"]
0 -22.0 1 -16.0 2 -20.0 3 -24.0 4 -21.6 5 -27.0 6 -115.0 7 -27.0 8 -49.5 9 -44.0 10 -44.0 11 -28.8 12 -72.0 Name: Transaction Amount, dtype: float64
What is the sum of the day?
df["Transaction Amount"].sum()
-510.90000000000003
👆🏻🤔 If you are wondering why the sum is not -510.90. Please refer to the follownig documentation:
If you really need to output the decimal numbers in particular format, you can use the format
function.
format(df["Transaction Amount"].sum(), '.2f')
'-510.90'
How many of them use a physica MacauPass card?
mask = df["Transaction Channel"]=="MacauPass"
df[mask]
S/N | Merchant No. | Terminal No. | Transaction Channel | Transaction Type | Account No | Transaction ID | Channel Transaction ID | Merchant Transaction ID | Transaction Amount | Merchant Coupon Deduction | MacauPass Coupon Deduction | Preferential Amount Merchants | Preferential Amount Macaupass | Actual Transaction Amount | Offsetting Amount | Settlement Amount | Transaction Time | Settlement Date | Remark | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | ABCD | 123 | MacauPass | Pay(Off Line) | 6045543888 | 62 | NaN | NaN | -22.0 | 0.0 | 0.0 | 0.0 | 0.0 | -22.0 | 0.0 | -22.0 | 2020-07-22 11:06:58 | 2020-07-23 | NaN |
1 | 2 | ABCD | 123 | MacauPass | Pay(Off Line) | 6032888888 | 64 | NaN | NaN | -16.0 | 0.0 | 0.0 | 0.0 | 0.0 | -16.0 | 0.0 | -16.0 | 2020-07-22 13:25:02 | 2020-07-23 | NaN |
6 | 7 | ABCD | 123 | MacauPass | Pay(Off Line) | 6043633088 | 69 | NaN | NaN | -115.0 | 0.0 | 0.0 | 0.0 | 0.0 | -115.0 | 0.0 | -115.0 | 2020-07-22 17:07:06 | 2020-07-23 | NaN |
9 | 10 | ABCD | 123 | MacauPass | Pay(Off Line) | 6023249388 | 72 | NaN | NaN | -44.0 | 0.0 | 0.0 | 0.0 | 0.0 | -44.0 | 0.0 | -44.0 | 2020-07-22 18:10:49 | 2020-07-23 | NaN |
len(df[mask])
4
df[mask]["Transaction Amount"].sum()
-197.0
import datetime
content = input("What do you want to say to Mr. Diary? ")
if len(content) > 0:
with open('diary.txt', "a") as file_obj:
today = datetime.date.today().isoformat()
file_obj.write(today + ": " + content + "\n")
with open('diary.txt', "r") as file_obj:
lines = file_obj.readlines()
for line in lines[-3:]:
print(line.rstrip())
What do you want to say to Mr. Diary? Hello python. 2020-06-11: Hello 2020-06-11: Hello 2020-08-10: Hello python.
We can use python-docx
module to write content to DOCX.
First, we need to install the module by calling pip install python-docx
once in terminal or in Jupyter.
pip install python-docx
Requirement already satisfied: python-docx in /private/var/mobile/Containers/Data/Application/8F879E4C-8D79-4BDE-8FB3-2E6B2D2BD6C6/Library/lib/python3.7/site-packages (0.8.10) Requirement already satisfied: lxml>=2.3.2 in /private/var/mobile/Containers/Data/Application/8F879E4C-8D79-4BDE-8FB3-2E6B2D2BD6C6/Library/lib/python3.7/site-packages/lxml-4.4.2-py3.7-macosx-10.9-x86_64.egg (from python-docx) (4.4.2) Note: you may need to restart the kernel to use updated packages.
👆🏻🤔 If you’re wondering how the above line works. It is a command executed in command line prompt. But Jupyter is smart enough to parse the pip install
command and execute it right inside the notebook.
import datetime
import docx
import os
content = input("What do you want to say to Mr. Diary? ")
if len(content) > 0:
with open('diary.txt', "a") as file_obj:
today = str(datetime.date.today())
file_obj.write(today + ": " + content + "\n")
if os.path.isfile("diary.docx"):
doc = docx.Document("diary.docx")
else:
doc = docx.Document()
doc.add_paragraph(content)
doc.save("diary.docx")
print(f"{content} is written to diary.docx")
What do you want to say to Mr. Diary? Hello Hello is written to diary.docx
Given that we have a DOCX file named Sample Document.docx
. We can read all the paragrahs in the DOCX file.
import docx
doc = docx.Document("Sample Document.docx")
for paragraph in doc.paragraphs:
print(paragraph.text)
Sample Document This is a sample paragraph. This is the second paragraph. Here is the result Summary This is the summary of the sample report document.
We can also read the tables and the content.
doc.tables
[<docx.table.Table at 0x1385d0f60>]
The following code read the data row by row into 3 lists: dates
, morning_visitors
, evening_visitors
.
table = doc.tables[0]
dates = []
morning_visitors = []
evening_visitors = []
for row in table.rows[1:]:
dates.append(row.cells[0].text)
morning_visitors.append(int(row.cells[1].text))
evening_visitors.append(int(row.cells[2].text))
dates
['2020-06-01', '2020-06-02', '2020-06-03', '2020-06-04', '2020-06-05', '2020-06-06', '2020-06-07']
morning_visitors
[23, 25, 24, 26, 25, 24, 23]
sum(morning_visitors)
170
evening_visitors
[17, 16, 16, 15, 16, 17, 18]
sum(morning_visitors) + sum(evening_visitors)
285
The following section about zipping and data frame is a preview of data processing. We will go through detail usage of pandas and data frame in lecture 8.
Have 3 separated lists for the same purpose of data makes it hard to maintain. Usually we want to have a tabular data to hold all related data into rows and columns.
zipped_list = list(zip(dates, morning_visitors, evening_visitors))
zipped_list
[('2020-06-01', 23, 17), ('2020-06-02', 25, 16), ('2020-06-03', 24, 16), ('2020-06-04', 26, 15), ('2020-06-05', 25, 16), ('2020-06-06', 24, 17), ('2020-06-07', 23, 18)]
zipped_list[3]
('2020-06-04', 26, 15)
We will learn pandas and DataFrame in lecture 8. But let’s have a glimpse on how we can process multi-columns of data in handy way by using it.
import pandas as pd
df = pd.DataFrame(zipped_list, columns=['Date', 'Morning Visitors', 'Evening Visitors'])
df
Date | Morning Visitors | Evening Visitors | |
---|---|---|---|
0 | 2020-06-01 | 23 | 17 |
1 | 2020-06-02 | 25 | 16 |
2 | 2020-06-03 | 24 | 16 |
3 | 2020-06-04 | 26 | 15 |
4 | 2020-06-05 | 25 | 16 |
5 | 2020-06-06 | 24 | 17 |
6 | 2020-06-07 | 23 | 18 |
What’re the Benefits of using data frame? We can perform column-based calcuations to all data at once.
df['Total'] = df['Morning Visitors'] + df['Evening Visitors']
df
Date | Morning Visitors | Evening Visitors | Total | |
---|---|---|---|---|
0 | 2020-06-01 | 23 | 17 | 40 |
1 | 2020-06-02 | 25 | 16 | 41 |
2 | 2020-06-03 | 24 | 16 | 40 |
3 | 2020-06-04 | 26 | 15 | 41 |
4 | 2020-06-05 | 25 | 16 | 41 |
5 | 2020-06-06 | 24 | 17 | 41 |
6 | 2020-06-07 | 23 | 18 | 41 |
In this section, we learned to read and write DOCX file. Then, we process the tablular data read from the DOCX file into nested list and even pandas DataFrame.
Furthermore, you may read more examples about reading excel document on AutomateTheBoringStuff.com. And examples about reading PDF/DOCX document.
We will also use pandas to read Excel data into DataFrame in part 3.