import pandas as pd
import numpy as np
!type examples\ex1.csv
a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
df = pd.read_csv(r'examples\ex1.csv')
df
 | a | b | c | d | message |
---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | hello |
1 | 5 | 6 | 7 | 8 | world |
2 | 9 | 10 | 11 | 12 | foo |
# read_table with an explicit delimiter gives the same result
pd.read_table('examples/ex1.csv', sep=',')
 | a | b | c | d | message |
---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | hello |
1 | 5 | 6 | 7 | 8 | world |
2 | 9 | 10 | 11 | 12 | foo |
A file will not always have a header row. Consider this file:
!type examples\ex2.csv
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
# let pandas assign default column names
pd.read_csv(r'examples/ex2.csv', header=None)
 | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | hello |
1 | 5 | 6 | 7 | 8 | world |
2 | 9 | 10 | 11 | 12 | foo |
# or specify the names yourself
pd.read_csv(r'examples/ex2.csv', names=['a', 'b', 'c', 'd', 'message'])
 | a | b | c | d | message |
---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | hello |
1 | 5 | 6 | 7 | 8 | world |
2 | 9 | 10 | 11 | 12 | foo |
names=['a', 'b', 'c', 'd', 'message']
# make the 'message' column the index of the returned DataFrame
pd.read_csv(r'examples/ex2.csv', names=names, index_col='message')
 | a | b | c | d |
---|---|---|---|---|
message | | | | |
hello | 1 | 2 | 3 | 4 |
world | 5 | 6 | 7 | 8 |
foo | 9 | 10 | 11 | 12 |
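index_col also accepts column positions; a quick sketch of the same call using the integer position instead of the name:
pd.read_csv(r'examples/ex2.csv', names=names, index_col=4)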
To form a hierarchical index from multiple columns, pass a list of column numbers or names:
!type examples\csv_mindex.csv
key1,key2,value1,value2
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16
parsed = pd.read_csv(r'examples/csv_mindex.csv', index_col=['key1', 'key2'])
parsed
 | | value1 | value2 |
---|---|---|---|
key1 | key2 | | |
one | a | 1 | 2 |
 | b | 3 | 4 |
 | c | 5 | 6 |
 | d | 7 | 8 |
two | a | 9 | 10 |
 | b | 11 | 12 |
 | c | 13 | 14 |
 | d | 15 | 16 |
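With the hierarchical index in place, partial indexing selects whole groups of rows; for example, selecting by the outer level:
parsed.loc['one']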
In some cases, a table might not have a fixed delimiter and instead uses whitespace or some other pattern to separate fields. In those cases you can pass a regular expression as the delimiter:
list(open(r'examples/ex3.txt'))
[' A B C\n', 'aaa -0.264438 -1.026059 -0.619500\n', 'bbb 0.927272 0.302904 -0.032399\n', 'ccc -0.264273 -0.386314 -0.217601\n', 'ddd -0.871858 -0.348382 1.100491\n']
result = pd.read_table(r'examples/ex3.txt', sep=r'\s+')
result
 | A | B | C |
---|---|---|---|
aaa | -0.264438 | -1.026059 | -0.619500 |
bbb | 0.927272 | 0.302904 | -0.032399 |
ccc | -0.264273 | -0.386314 | -0.217601 |
ddd | -0.871858 | -0.348382 | 1.100491 |
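Because there is one fewer column name than the number of data columns here, read_table infers that the first column should be the DataFrame's index:
result.index
Index(['aaa', 'bbb', 'ccc', 'ddd'], dtype='object')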
!type examples\ex4.csv
# hey!
a,b,c,d,message
# just wanted to make things more difficult for you
# who reads CSV files with computers, anyway?
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
pd.read_csv(r'examples/ex4.csv', skiprows=[0, 2, 3])
 | a | b | c | d | message |
---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | hello |
1 | 5 | 6 | 7 | 8 | world |
2 | 9 | 10 | 11 | 12 | foo |
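Since the unwanted lines here all begin with '#', the comment option is an alternative that doesn't require knowing their positions in the file:
pd.read_csv(r'examples/ex4.csv', comment='#')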
Handling missing values is an important and frequently nuanced part of the file parsing process. Missing data is usually either not present (empty string) or marked by some sentinel value. By default, pandas uses a set of commonly occurring sentinels, such as NA and NULL.
!type examples\ex5.csv
something,a,b,c,d,message
one,1,2,3,4,NA
two,5,6,,8,world
three,9,10,11,12,foo
result = pd.read_csv(r'examples/ex5.csv')
result
 | something | a | b | c | d | message |
---|---|---|---|---|---|---|
0 | one | 1 | 2 | 3.0 | 4 | NaN |
1 | two | 5 | 6 | NaN | 8 | world |
2 | three | 9 | 10 | 11.0 | 12 | foo |
pd.isnull(result)
 | something | a | b | c | d | message |
---|---|---|---|---|---|---|
0 | False | False | False | False | False | True |
1 | False | False | False | True | False | False |
2 | False | False | False | False | False | False |
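Chaining sum onto the boolean frame gives a quick per-column tally of missing values:
result.isnull().sum()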
The na_values option can take either a list or set of strings to consider as missing values:
result = pd.read_csv(r'examples/ex5.csv', na_values=['NULL'])
result
 | something | a | b | c | d | message |
---|---|---|---|---|---|---|
0 | one | 1 | 2 | 3.0 | 4 | NaN |
1 | two | 5 | 6 | NaN | 8 | world |
2 | three | 9 | 10 | 11.0 | 12 | foo |
# Different NA sentinels can be specified for each column in a dict
sentinels = {'message': ['foo', 'NA'], 'something': ['two']}
pd.read_csv(r'examples/ex5.csv', na_values=sentinels)
 | something | a | b | c | d | message |
---|---|---|---|---|---|---|
0 | one | 1 | 2 | 3.0 | 4 | NaN |
1 | NaN | 5 | 6 | NaN | 8 | world |
2 | three | 9 | 10 | 11.0 | 12 | NaN |
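If the built-in sentinel list gets in the way, keep_default_na=False disables it so that only the strings passed via na_values are treated as missing; a sketch:
pd.read_csv(r'examples/ex5.csv', keep_default_na=False)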
When processing very large files, or figuring out the right set of arguments to correctly process a large file, you may want to read only a small piece of the file or iterate through it in smaller chunks. Before we look at a large file, we make the pandas display settings more compact:
pd.options.display.max_rows = 10
result = pd.read_csv(r'examples/ex6.csv')
result
 | one | two | three | four | key |
---|---|---|---|---|---|
0 | 0.467976 | -0.038649 | -0.295344 | -1.824726 | L |
1 | -0.358893 | 1.404453 | 0.704965 | -0.200638 | B |
2 | -0.501840 | 0.659254 | -0.421691 | -0.057688 | G |
3 | 0.204886 | 1.074134 | 1.388361 | -0.982404 | R |
4 | 0.354628 | -0.133116 | 0.283763 | -0.837063 | Q |
... | ... | ... | ... | ... | ... |
9995 | 2.311896 | -0.417070 | -1.409599 | -0.515821 | L |
9996 | -0.479893 | -0.650419 | 0.745152 | -0.646038 | E |
9997 | 0.523331 | 0.787112 | 0.486066 | 1.093156 | K |
9998 | -0.362559 | 0.598894 | -1.843201 | 0.887292 | G |
9999 | -0.096376 | -1.012999 | -0.657431 | -0.573315 | 0 |
10000 rows × 5 columns
# nrows - read a small number of rows
pd.read_csv(r'examples/ex6.csv', nrows=5)
 | one | two | three | four | key |
---|---|---|---|---|---|
0 | 0.467976 | -0.038649 | -0.295344 | -1.824726 | L |
1 | -0.358893 | 1.404453 | 0.704965 | -0.200638 | B |
2 | -0.501840 | 0.659254 | -0.421691 | -0.057688 | G |
3 | 0.204886 | 1.074134 | 1.388361 | -0.982404 | R |
4 | 0.354628 | -0.133116 | 0.283763 | -0.837063 | Q |
To read a file in pieces, specify a chunksize as a number of rows.
The TextFileReader object returned by read_csv allows you to iterate over the file in pieces of the given chunksize.
chunker = pd.read_csv(r'examples/ex6.csv', chunksize=1000)
chunker
<pandas.io.parsers.TextFileReader at 0x23380d2b708>
chunker = pd.read_csv(r'examples/ex6.csv', chunksize=1000)
tot = pd.Series([], dtype='float64')  # explicit dtype avoids a warning on an empty Series
# aggregate the value counts of the 'key' column across the chunks
for piece in chunker:
    tot = tot.add(piece['key'].value_counts(), fill_value=0)
tot = tot.sort_values(ascending=False)
tot[:10]
E    368.0
X    364.0
L    346.0
O    343.0
Q    340.0
M    338.0
J    337.0
F    335.0
K    334.0
H    330.0
dtype: float64
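TextFileReader also has a get_chunk method, which reads an arbitrary number of rows on demand; a quick sketch:
chunker = pd.read_csv(r'examples/ex6.csv', chunksize=1000)
piece = chunker.get_chunk(10)  # DataFrame with the next 10 rows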
data = pd.read_csv(r'examples/ex5.csv')
data
 | something | a | b | c | d | message |
---|---|---|---|---|---|---|
0 | one | 1 | 2 | 3.0 | 4 | NaN |
1 | two | 5 | 6 | NaN | 8 | world |
2 | three | 9 | 10 | 11.0 | 12 | foo |
Using DataFrame's to_csv method, we can write the data out to a comma-separated file:
data.to_csv(r'examples/out.csv')
!type examples\out.csv
,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo
import sys
data.to_csv(sys.stdout, sep='|')
|something|a|b|c|d|message
0|one|1|2|3.0|4|
1|two|5|6||8|world
2|three|9|10|11.0|12|foo
Missing values appear as empty strings in the output. You might want to denote them by some other sentinel value
data.to_csv(sys.stdout, na_rep='NULL')
,something,a,b,c,d,message
0,one,1,2,3.0,4,NULL
1,two,5,6,NULL,8,world
2,three,9,10,11.0,12,foo
# With no other options specified, both the row and column labels
# are written. Both of these can be disabled
data.to_csv(sys.stdout, index=False, header=False)
one,1,2,3.0,4,
two,5,6,,8,world
three,9,10,11.0,12,foo
# write only a subset of the columns, and in an order of your choosing
data.to_csv(sys.stdout, index=False, columns=['a', 'b', 'c'])
a,b,c
1,2,3.0
5,6,
9,10,11.0
# Series also has a to_csv method
dates = pd.date_range('1/1/2020', periods=10)
ts = pd.Series(np.arange(10), index=dates)
ts.to_csv(r'examples/tseries.csv', header=False)
!type examples\tseries.csv
2020-01-01,0
2020-01-02,1
2020-01-03,2
2020-01-04,3
2020-01-05,4
2020-01-06,5
2020-01-07,6
2020-01-08,7
2020-01-09,8
2020-01-10,9
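To read the series back, read_csv needs a little help, since no header row was written; a sketch where the column names 'date' and 'value' are made up for illustration:
pd.read_csv(r'examples/tseries.csv', header=None, names=['date', 'value'], index_col='date', parse_dates=True)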
It’s possible to load most forms of tabular data from disk using functions like pandas.read_table. In some cases, however, some manual processing may be necessary. It’s not uncommon to receive a file with one or more malformed lines that trip up read_table
!type examples\ex7.csv
"a","b","c"
"1","2","3"
"1","2","3"
For any file with a single-character delimiter, you can use Python’s built-in csv module. To use it, pass any open file or file-like object to csv.reader
import csv
f = open(r'examples/ex7.csv')
reader = csv.reader(f)
for line in reader:
print(line)
['a', 'b', 'c']
['1', '2', '3']
['1', '2', '3']
with open(r'examples/ex7.csv') as f:
lines = list(csv.reader(f))
header, values = lines[0], lines[1:]
print(header)
print(values)
['a', 'b', 'c']
[['1', '2', '3'], ['1', '2', '3']]
# zip(*values) transposes the list of rows into columns
for i, j in zip(*values):
    print(i, j)
1 1
2 2
3 3
# build a dict of data columns keyed by header
data_dict = {h: v for h, v in zip(header, zip(*values))}
data_dict
{'a': ('1', '1'), 'b': ('2', '2'), 'c': ('3', '3')}
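The csv module handles writing too; a minimal sketch that writes the parsed rows back out (the output path examples/mydata.csv is made up for illustration):
with open(r'examples/mydata.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(values)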
JSON (JavaScript Object Notation) has become one of the standard formats for sending data by HTTP request. It is a much more free-form data format than a tabular text format like CSV. Here is an example:
obj = '''
{"name": "Wes",
"places_lived": ["United States", "Spain", "Germany"],
"pet": null,
"siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]},
{"name": "Katie", "age": 38,
"pets": ["Sixes", "Stache", "Cisco"]}]
}
'''
import json
result = json.loads(obj)
result
{'name': 'Wes', 'places_lived': ['United States', 'Spain', 'Germany'], 'pet': None, 'siblings': [{'name': 'Scott', 'age': 30, 'pets': ['Zeus', 'Zuko']}, {'name': 'Katie', 'age': 38, 'pets': ['Sixes', 'Stache', 'Cisco']}]}
# json.dumps, on the other hand, converts a Python object back to JSON:
asjson = json.dumps(result)
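For human-readable output, json.dumps accepts an indent argument:
print(json.dumps(result, indent=2))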
siblings = pd.DataFrame(result['siblings'], columns=['name', 'age', 'pets'])
siblings
 | name | age | pets |
---|---|---|---|
0 | Scott | 30 | [Zeus, Zuko] |
1 | Katie | 38 | [Sixes, Stache, Cisco] |
pandas.read_json can automatically convert JSON datasets in specific arrangements into a Series or DataFrame.
!type examples\example.json
[{"a": 1, "b": 2, "c": 3}, {"a": 4, "b": 5, "c": 6}, {"a": 7, "b": 8, "c": 9}]
data = pd.read_json(r'examples/example.json')
data
 | a | b | c |
---|---|---|---|
0 | 1 | 2 | 3 |
1 | 4 | 5 | 6 |
2 | 7 | 8 | 9 |
# pandas -> json
print(data.to_json())
print(data.to_json(orient='records'))
{"a":{"0":1,"1":4,"2":7},"b":{"0":2,"1":5,"2":8},"c":{"0":3,"1":6,"2":9}}
[{"a":1,"b":2,"c":3},{"a":4,"b":5,"c":6},{"a":7,"b":8,"c":9}]
# **ToDo:**
# 1.5 - XML and HTML: Web Scraping
# 2 - Binary Data Formats
# 2.1 - Using HDF5 Format
# 2.2 - Reading Microsoft Excel Files
# 3 - Interacting with Web APIs
# 4 - Interacting with Databases