Lecture 4—File Reading and Writing¶

Written by Makzan. Last updated at 2021 March.

In this lecture, we will explore reading and writing plain text files.

We will learn:

Read CSV file
Using Pandas to read CSV file
Reading MPay CSV file
Write DOCX file
Reading paragraphs in DOCX file
Reading table of data in DOCX file

Reading CSV file¶

We can further improve the storage by using CSV file format.

In [3]:

import csv

with open("quotes.csv") as file_obj:
    csv_reader = csv.reader(file_obj)
    for line in csv_reader:
        print(line)

['I want to put a ding in the universe.', 'Steve Jobs']
['Life is 10% what happens to you and 90% how you react to it.', 'Charles R. Swindoll']
["Family is not an important thing. It's everything.", 'Michael J. Fox']
["Nothing is impossible,the word itself says 'I'm possible'!", 'Audrey Hepburn']
['There are two ways of spreading light: to be the candle or the mirror that reflects it.', 'Edith Wharton']
["Try to be a rainbow in someone's cloud.", 'Maya Angelou']
['Be brave enough to live life creatively. The creative place where no one else has ever been.', 'Alan Alda']
['The secret of getting ahead is getting started.', 'Mark Twain']

Using Pandas to read CSV file¶

When using Pandas, we can read the CSV file into DataFrame.

In [4]:

import pandas as pd

pd.read_csv("quotes.csv")

Out[4]:

	I want to put a ding in the universe.	Steve Jobs
0	Life is 10% what happens to you and 90% how yo...	Charles R. Swindoll
1	Family is not an important thing. It's everyth...	Michael J. Fox
2	Nothing is impossible,the word itself says 'I'...	Audrey Hepburn
3	There are two ways of spreading light: to be t...	Edith Wharton
4	Try to be a rainbow in someone's cloud.	Maya Angelou
5	Be brave enough to live life creatively. The c...	Alan Alda
6	The secret of getting ahead is getting started.	Mark Twain

If you encounter error saying that pandas not found. You may need to install pandas via pip install pandas. If you’re using Anaconda, it already comes Pandas built-in.

Another examle of reading MPay daily CSV file via Pandas¶

I have prepared a CSV file named [ABCD]2020-07-23.csv. It is the CSV file from MPay every day listing all the transactions.

In [4]:

import pandas as pd

df = pd.read_csv('[ABCD]2020-07-23.csv')

Let’s inspect the result of read_csv. We call it data frame.

In [5]:

df

Out[5]:

	S/N	Merchant No.	Terminal No.	Transaction Channel	Transaction Type	Account No	Transaction ID	Channel Transaction ID	Merchant Transaction ID	Transaction Amount	MacauPass Coupon Deduction	Actual Transaction Amount	Settlement Amount	Transaction Time	Settlement Date	Remark
0	1	ABCD	123	MacauPass	Pay(Off Line)	6045543888	62	NaN	NaN	-22.0	0.0	-22.0	-22.0	2020-07-22 11:06:58	2020-07-23	NaN
1	2	ABCD	123	MacauPass	Pay(Off Line)	6032888888	64	NaN	NaN	-16.0	0.0	-16.0	-16.0	2020-07-22 13:25:02	2020-07-23	NaN
2	3	ABCD	123	MPay	Pay(Off Line)	66****00	202007228888888F1D014100000065	2020072288888880080962	NaN	-20.0	0.0	-20.0	-20.0	2020-07-22 13:34:03	2020-07-23	NaN
3	4	ABCD	123	MPay	Pay(Off Line)	65****00	202007228888888F1D014100000066	2020072288888880097014	NaN	-24.0	0.0	-24.0	-24.0	2020-07-22 14:20:07	2020-07-23	NaN
4	5	ABCD	123	MPay	Pay(Off Line)	66****20	202007228888888F1D014100000067	2020072288888880099085	NaN	-21.6	0.0	-21.6	-21.6	2020-07-22 14:27:18	2020-07-23	NaN
5	6	ABCD	123	MPay	Pay(Off Line)	66****60	202007228888888F1D014100000068	2020072288888880127523	NaN	-27.0	0.0	-27.0	-27.0	2020-07-22 16:35:43	2020-07-23	NaN
6	7	ABCD	123	MacauPass	Pay(Off Line)	6043633088	69	NaN	NaN	-115.0	0.0	-115.0	-115.0	2020-07-22 17:07:06	2020-07-23	NaN
7	8	ABCD	123	MPay	Pay(Off Line)	66****50	202007228888888F1D014100000070	2020072288888880146235	NaN	-27.0	0.0	-27.0	-27.0	2020-07-22 17:52:08	2020-07-23	NaN
8	9	ABCD	123	MPay	Pay(Off Line)	66****80	202007228888888F1D014100000071	2020072288888880148711	NaN	-49.5	0.0	-49.5	-49.5	2020-07-22 18:00:53	2020-07-23	NaN
9	10	ABCD	123	MacauPass	Pay(Off Line)	6023249388	72	NaN	NaN	-44.0	0.0	-44.0	-44.0	2020-07-22 18:10:49	2020-07-23	NaN
10	11	ABCD	123	MPay	Pay(Off Line)	66****80	202007228888888F1D014100000073	2020072288888880161439	NaN	-44.0	0.0	-44.0	-44.0	2020-07-22 18:40:18	2020-07-23	NaN
11	12	ABCD	123	MPay	Pay(Off Line)	66****70	202007228888888F1D014100000074	2020072288888880162919	NaN	-28.8	0.0	-28.8	-28.8	2020-07-22 18:44:43	2020-07-23	NaN
12	13	ABCD	123	MPay	Pay(Off Line)	62****80	202007228888888F1D014100000075	2020072288888880164109	NaN	-72.0	3.0	-69.0	-72.0	2020-07-22 18:48:24	2020-07-23	CouponNo(0300850035833016)

What are the transaction amount?

In [6]:

df["Transaction Amount"]

Out[6]:

0     -22.0
1     -16.0
2     -20.0
3     -24.0
4     -21.6
5     -27.0
6    -115.0
7     -27.0
8     -49.5
9     -44.0
10    -44.0
11    -28.8
12    -72.0
Name: Transaction Amount, dtype: float64

What is the sum of the day?

In [7]:

df["Transaction Amount"].sum()

Out[7]:

-510.90000000000003

👆🏻🤔 If you are wondering why the sum is not -510.90. Please refer to the follownig documentation:

https://docs.python.org/3/tutorial/floatingpoint.html

If you really need to output the decimal numbers in particular format, you can use the format function.

In [21]:

format(df["Transaction Amount"].sum(), '.2f')

Out[21]:

'-510.90'

How many of them use a physica MacauPass card?

In [10]:

mask = df["Transaction Channel"]=="MacauPass"
df[mask]

Out[10]:

	S/N	Merchant No.	Terminal No.	Transaction Channel	Transaction Type	Account No	Transaction ID	Channel Transaction ID	Merchant Transaction ID	Transaction Amount	Actual Transaction Amount	Settlement Amount	Transaction Time	Settlement Date	Remark
0	1	ABCD	123	MacauPass	Pay(Off Line)	6045543888	62	NaN	NaN	-22.0	-22.0	-22.0	2020-07-22 11:06:58	2020-07-23	NaN
1	2	ABCD	123	MacauPass	Pay(Off Line)	6032888888	64	NaN	NaN	-16.0	-16.0	-16.0	2020-07-22 13:25:02	2020-07-23	NaN
6	7	ABCD	123	MacauPass	Pay(Off Line)	6043633088	69	NaN	NaN	-115.0	-115.0	-115.0	2020-07-22 17:07:06	2020-07-23	NaN
9	10	ABCD	123	MacauPass	Pay(Off Line)	6023249388	72	NaN	NaN	-44.0	-44.0	-44.0	2020-07-22 18:10:49	2020-07-23	NaN

In [13]:

len(df[mask])

Out[13]:

In [12]:

df[mask]["Transaction Amount"].sum()

Out[12]:

-197.0

Example of writing plain text file: Diary logging¶

In [5]:

import datetime

content = input("What do you want to say to Mr. Diary? ")
if len(content) > 0:
    with open('diary.txt', "a") as file_obj:
        today = datetime.date.today().isoformat()
        file_obj.write(today + ": " + content + "\n")

with open('diary.txt', "r") as file_obj:
    lines = file_obj.readlines()
    for line in lines[-3:]:
        print(line.rstrip())

What do you want to say to Mr. Diary? Hello python.
2020-06-11: Hello
2020-06-11: Hello
2020-08-10: Hello python.

Writing DOCX¶

We can use python-docx module to write content to DOCX.

First, we need to install the module by calling pip install python-docx once in terminal or in Jupyter.

In [6]:

pip install python-docx

Requirement already satisfied: python-docx in /private/var/mobile/Containers/Data/Application/8F879E4C-8D79-4BDE-8FB3-2E6B2D2BD6C6/Library/lib/python3.7/site-packages (0.8.10)
Requirement already satisfied: lxml>=2.3.2 in /private/var/mobile/Containers/Data/Application/8F879E4C-8D79-4BDE-8FB3-2E6B2D2BD6C6/Library/lib/python3.7/site-packages/lxml-4.4.2-py3.7-macosx-10.9-x86_64.egg (from python-docx) (4.4.2)
Note: you may need to restart the kernel to use updated packages.

👆🏻🤔 If you’re wondering how the above line works. It is a command executed in command line prompt. But Jupyter is smart enough to parse the pip install command and execute it right inside the notebook.

In [7]:

import datetime
import docx
import os

content = input("What do you want to say to Mr. Diary? ")
if len(content) > 0:
    with open('diary.txt', "a") as file_obj:
        today = str(datetime.date.today())
        file_obj.write(today + ": " + content + "\n")

if os.path.isfile("diary.docx"):
    doc = docx.Document("diary.docx")
else:
    doc = docx.Document()
doc.add_paragraph(content)
doc.save("diary.docx")
    
print(f"{content} is written to diary.docx")

What do you want to say to Mr. Diary? Hello
Hello is written to diary.docx

Reading DOCX file¶

Given that we have a DOCX file named Sample Document.docx. We can read all the paragrahs in the DOCX file.

In [8]:

import docx

doc = docx.Document("Sample Document.docx")

In [9]:

for paragraph in doc.paragraphs:
    print(paragraph.text)

Sample Document

This is a sample paragraph.

This is the second paragraph.

Here is the result

Summary

This is the summary of the sample report document.

Reading tables in DOCX file¶

We can also read the tables and the content.

In [10]:

doc.tables

Out[10]:

[<docx.table.Table at 0x1385d0f60>]

The following code read the data row by row into 3 lists: dates, morning_visitors, evening_visitors.

In [11]:

table = doc.tables[0]

dates = []
morning_visitors = []
evening_visitors = []

for row in table.rows[1:]:
    dates.append(row.cells[0].text)
    morning_visitors.append(int(row.cells[1].text))
    evening_visitors.append(int(row.cells[2].text))

dates

Out[11]:

['2020-06-01',
 '2020-06-02',
 '2020-06-03',
 '2020-06-04',
 '2020-06-05',
 '2020-06-06',
 '2020-06-07']

In [12]:

morning_visitors

Out[12]:

[23, 25, 24, 26, 25, 24, 23]

In [13]:

sum(morning_visitors)

Out[13]:

In [14]:

evening_visitors

Out[14]:

[17, 16, 16, 15, 16, 17, 18]

In [15]:

sum(morning_visitors) + sum(evening_visitors)

Out[15]:

Optional: Zipping multiple-list into a multi-columns list¶

The following section about zipping and data frame is a preview of data processing. We will go through detail usage of pandas and data frame in lecture 8.

Have 3 separated lists for the same purpose of data makes it hard to maintain. Usually we want to have a tabular data to hold all related data into rows and columns.

In [16]:

zipped_list = list(zip(dates, morning_visitors, evening_visitors))

zipped_list

Out[16]:

[('2020-06-01', 23, 17),
 ('2020-06-02', 25, 16),
 ('2020-06-03', 24, 16),
 ('2020-06-04', 26, 15),
 ('2020-06-05', 25, 16),
 ('2020-06-06', 24, 17),
 ('2020-06-07', 23, 18)]

In [17]:

zipped_list[3]

Out[17]:

('2020-06-04', 26, 15)

Optional: Converting the multi-columns list into pandas DataFrame¶

We will learn pandas and DataFrame in lecture 8. But let’s have a glimpse on how we can process multi-columns of data in handy way by using it.

In [18]:

import pandas as pd

df = pd.DataFrame(zipped_list, columns=['Date', 'Morning Visitors', 'Evening Visitors'])

In [19]:

df

Out[19]:

	Date	Morning Visitors	Evening Visitors
0	2020-06-01	23	17
1	2020-06-02	25	16
2	2020-06-03	24	16
3	2020-06-04	26	15
4	2020-06-05	25	16
5	2020-06-06	24	17
6	2020-06-07	23	18

What’re the Benefits of using data frame? We can perform column-based calcuations to all data at once.

In [20]:

df['Total'] = df['Morning Visitors'] + df['Evening Visitors']

In [21]:

df

Out[21]:

	Date	Morning Visitors	Evening Visitors	Total
0	2020-06-01	23	17	40
1	2020-06-02	25	16	41
2	2020-06-03	24	16	40
3	2020-06-04	26	15	41
4	2020-06-05	25	16	41
5	2020-06-06	24	17	41
6	2020-06-07	23	18	41

Summary¶

In this section, we learned to read and write DOCX file. Then, we process the tablular data read from the DOCX file into nested list and even pandas DataFrame.

Furthermore, you may read more examples about reading excel document on AutomateTheBoringStuff.com. And examples about reading PDF/DOCX document.

We will also use pandas to read Excel data into DataFrame in part 3.