import pandas as pd
import numpy as np
!type examples\ex1.csv
a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
df = pd.read_csv(r'examples\ex1.csv')
df
 | a | b | c | d | message |
---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | hello |
1 | 5 | 6 | 7 | 8 | world |
2 | 9 | 10 | 11 | 12 | foo |
# read_table with an explicit delimiter gives the same result
pd.read_table('examples/ex1.csv', sep=',')
 | a | b | c | d | message |
---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | hello |
1 | 5 | 6 | 7 | 8 | world |
2 | 9 | 10 | 11 | 12 | foo |
A file will not always have a header row. Consider this file:
!type examples\ex2.csv
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
# let pandas assign default column names
pd.read_csv(r'examples/ex2.csv', header=None)
 | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | hello |
1 | 5 | 6 | 7 | 8 | world |
2 | 9 | 10 | 11 | 12 | foo |
# or specify the names yourself
pd.read_csv(r'examples/ex2.csv', names=['a', 'b', 'c', 'd', 'message'])
 | a | b | c | d | message |
---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | hello |
1 | 5 | 6 | 7 | 8 | world |
2 | 9 | 10 | 11 | 12 | foo |
names=['a', 'b', 'c', 'd', 'message']
# make the 'message' column the index of the returned DataFrame
pd.read_csv(r'examples/ex2.csv', names=names, index_col='message')
 | a | b | c | d |
---|---|---|---|---|
message | | | | |
hello | 1 | 2 | 3 | 4 |
world | 5 | 6 | 7 | 8 |
foo | 9 | 10 | 11 | 12 |
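index_col also accepts column positions; a quick sketch of the same call using the integer position instead of the name:
pd.read_csv(r'examples/ex2.csv', names=names, index_col=4)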
To form a hierarchical index from multiple columns, pass a list of column numbers or names:
!type examples\csv_mindex.csv
key1,key2,value1,value2
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16
parsed = pd.read_csv(r'examples/csv_mindex.csv', index_col=['key1', 'key2'])
parsed
 | | value1 | value2 |
---|---|---|---|
key1 | key2 | | |
one | a | 1 | 2 |
 | b | 3 | 4 |
 | c | 5 | 6 |
 | d | 7 | 8 |
two | a | 9 | 10 |
 | b | 11 | 12 |
 | c | 13 | 14 |
 | d | 15 | 16 |
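With the hierarchical index in place, partial indexing selects whole groups of rows; for example, selecting by the outer level:
parsed.loc['one']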
In some cases, a table might not have a fixed delimiter and instead uses whitespace or some other pattern to separate fields. In those cases you can pass a regular expression as the delimiter:
list(open(r'examples/ex3.txt'))
[' A B C\n', 'aaa -0.264438 -1.026059 -0.619500\n', 'bbb 0.927272 0.302904 -0.032399\n', 'ccc -0.264273 -0.386314 -0.217601\n', 'ddd -0.871858 -0.348382 1.100491\n']
result = pd.read_table(r'examples/ex3.txt', sep=r'\s+')
result
 | A | B | C |
---|---|---|---|
aaa | -0.264438 | -1.026059 | -0.619500 |
bbb | 0.927272 | 0.302904 | -0.032399 |
ccc | -0.264273 | -0.386314 | -0.217601 |
ddd | -0.871858 | -0.348382 | 1.100491 |
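Because there is one fewer column name than the number of data columns here, read_table infers that the first column should be the DataFrame's index:
result.index
Index(['aaa', 'bbb', 'ccc', 'ddd'], dtype='object')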
!type examples\ex4.csv
# hey!
a,b,c,d,message
# just wanted to make things more difficult for you
# who reads CSV files with computers, anyway?
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
pd.read_csv(r'examples/ex4.csv', skiprows=[0, 2, 3])
 | a | b | c | d | message |
---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | hello |
1 | 5 | 6 | 7 | 8 | world |
2 | 9 | 10 | 11 | 12 | foo |
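Since the unwanted lines here all begin with '#', the comment option is an alternative that doesn't require knowing their positions in the file:
pd.read_csv(r'examples/ex4.csv', comment='#')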
Handling missing values is an important and frequently nuanced part of the file parsing process. Missing data is usually either not present (empty string) or marked by some sentinel value. By default, pandas uses a set of commonly occurring sentinels, such as NA and NULL.
!type examples\ex5.csv
something,a,b,c,d,message
one,1,2,3,4,NA
two,5,6,,8,world
three,9,10,11,12,foo
result = pd.read_csv(r'examples/ex5.csv')
result
 | something | a | b | c | d | message |
---|---|---|---|---|---|---|
0 | one | 1 | 2 | 3.0 | 4 | NaN |
1 | two | 5 | 6 | NaN | 8 | world |
2 | three | 9 | 10 | 11.0 | 12 | foo |
pd.isnull(result)
 | something | a | b | c | d | message |
---|---|---|---|---|---|---|
0 | False | False | False | False | False | True |
1 | False | False | False | True | False | False |
2 | False | False | False | False | False | False |
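Chaining sum onto the boolean frame gives a quick per-column tally of missing values:
result.isnull().sum()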
The na_values option can take either a list or set of strings to consider as missing values:
result = pd.read_csv(r'examples/ex5.csv', na_values=['NULL'])
result
 | something | a | b | c | d | message |
---|---|---|---|---|---|---|
0 | one | 1 | 2 | 3.0 | 4 | NaN |
1 | two | 5 | 6 | NaN | 8 | world |
2 | three | 9 | 10 | 11.0 | 12 | foo |
# Different NA sentinels can be specified for each column in a dict
sentinels = {'message': ['foo', 'NA'], 'something': ['two']}
pd.read_csv(r'examples/ex5.csv', na_values=sentinels)
 | something | a | b | c | d | message |
---|---|---|---|---|---|---|
0 | one | 1 | 2 | 3.0 | 4 | NaN |
1 | NaN | 5 | 6 | NaN | 8 | world |
2 | three | 9 | 10 | 11.0 | 12 | NaN |
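If the built-in sentinel list gets in the way, keep_default_na=False disables it so that only the strings passed via na_values are treated as missing; a sketch:
pd.read_csv(r'examples/ex5.csv', keep_default_na=False)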
When processing very large files, or figuring out the right set of arguments to correctly process a large file, you may want to read only a small piece of the file or iterate through it in smaller chunks. Before we look at a large file, we make the pandas display settings more compact:
pd.options.display.max_rows = 10
result = pd.read_csv(r'examples/ex6.csv')
result
 | one | two | three | four | key |
---|---|---|---|---|---|
0 | 0.467976 | -0.038649 | -0.295344 | -1.824726 | L |
1 | -0.358893 | 1.404453 | 0.704965 | -0.200638 | B |
2 | -0.501840 | 0.659254 | -0.421691 | -0.057688 | G |
3 | 0.204886 | 1.074134 | 1.388361 | -0.982404 | R |
4 | 0.354628 | -0.133116 | 0.283763 | -0.837063 | Q |
... | ... | ... | ... | ... | ... |
9995 | 2.311896 | -0.417070 | -1.409599 | -0.515821 | L |
9996 | -0.479893 | -0.650419 | 0.745152 | -0.646038 | E |
9997 | 0.523331 | 0.787112 | 0.486066 | 1.093156 | K |
9998 | -0.362559 | 0.598894 | -1.843201 | 0.887292 | G |
9999 | -0.096376 | -1.012999 | -0.657431 | -0.573315 | 0 |
10000 rows × 5 columns
# nrows - read a small number of rows
pd.read_csv(r'examples/ex6.csv', nrows=5)
 | one | two | three | four | key |
---|---|---|---|---|---|
0 | 0.467976 | -0.038649 | -0.295344 | -1.824726 | L |
1 | -0.358893 | 1.404453 | 0.704965 | -0.200638 | B |
2 | -0.501840 | 0.659254 | -0.421691 | -0.057688 | G |
3 | 0.204886 | 1.074134 | 1.388361 | -0.982404 | R |
4 | 0.354628 | -0.133116 | 0.283763 | -0.837063 | Q |
To read a file in pieces, specify a chunksize as a number of rows.
The TextFileReader object returned by read_csv allows you to iterate over the file in pieces of the given chunksize.
chunker = pd.read_csv(r'examples/ex6.csv', chunksize=1000)
chunker
<pandas.io.parsers.TextFileReader at 0x23380d2b708>
chunker = pd.read_csv(r'examples/ex6.csv', chunksize=1000)
tot = pd.Series([], dtype='float64')  # explicit dtype avoids a warning on an empty Series
# aggregate the value counts of the 'key' column across the chunks
for piece in chunker:
    tot = tot.add(piece['key'].value_counts(), fill_value=0)
tot = tot.sort_values(ascending=False)
tot[:10]
E    368.0
X    364.0
L    346.0
O    343.0
Q    340.0
M    338.0
J    337.0
F    335.0
K    334.0
H    330.0
dtype: float64
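TextFileReader also has a get_chunk method, which reads an arbitrary number of rows on demand; a quick sketch:
chunker = pd.read_csv(r'examples/ex6.csv', chunksize=1000)
piece = chunker.get_chunk(10)  # DataFrame with the next 10 rows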
data = pd.read_csv(r'examples/ex5.csv')
data
 | something | a | b | c | d | message |
---|---|---|---|---|---|---|
0 | one | 1 | 2 | 3.0 | 4 | NaN |
1 | two | 5 | 6 | NaN | 8 | world |
2 | three | 9 | 10 | 11.0 | 12 | foo |
Using DataFrame's to_csv method, we can write the data out to a comma-separated file:
data.to_csv(r'examples/out.csv')
!type examples\out.csv
,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo
import sys
data.to_csv(sys.stdout, sep='|')
|something|a|b|c|d|message
0|one|1|2|3.0|4|
1|two|5|6||8|world
2|three|9|10|11.0|12|foo
Missing values appear as empty strings in the output. You might want to denote them by some other sentinel value
data.to_csv(sys.stdout, na_rep='NULL')
,something,a,b,c,d,message
0,one,1,2,3.0,4,NULL
1,two,5,6,NULL,8,world
2,three,9,10,11.0,12,foo
# With no other options specified, both the row and column labels
# are written. Both of these can be disabled
data.to_csv(sys.stdout, index=False, header=False)
one,1,2,3.0,4,
two,5,6,,8,world
three,9,10,11.0,12,foo
# write only a subset of the columns, and in an order of your choosing
data.to_csv(sys.stdout, index=False, columns=['a', 'b', 'c'])
a,b,c
1,2,3.0
5,6,
9,10,11.0
# Series also has a to_csv method
dates = pd.date_range('1/1/2020', periods=10)
ts = pd.Series(np.arange(10), index=dates)
ts.to_csv(r'examples/tseries.csv', header=False)
!type examples\tseries.csv
2020-01-01,0
2020-01-02,1
2020-01-03,2
2020-01-04,3
2020-01-05,4
2020-01-06,5
2020-01-07,6
2020-01-08,7
2020-01-09,8
2020-01-10,9
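To read the series back, read_csv needs a little help, since no header row was written; a sketch where the column names 'date' and 'value' are made up for illustration:
pd.read_csv(r'examples/tseries.csv', header=None, names=['date', 'value'], index_col='date', parse_dates=True)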
It’s possible to load most forms of tabular data from disk using functions like pandas.read_table. In some cases, however, some manual processing may be necessary. It’s not uncommon to receive a file with one or more malformed lines that trip up read_table
!type examples\ex7.csv
"a","b","c"
"1","2","3"
"1","2","3"
For any file with a single-character delimiter, you can use Python’s built-in csv module. To use it, pass any open file or file-like object to csv.reader
import csv
f = open(r'examples/ex7.csv')
reader = csv.reader(f)
for line in reader:
print(line)
['a', 'b', 'c']
['1', '2', '3']
['1', '2', '3']
with open(r'examples/ex7.csv') as f:
lines = list(csv.reader(f))
header, values = lines[0], lines[1:]
print(header)
print(values)
['a', 'b', 'c']
[['1', '2', '3'], ['1', '2', '3']]
# zip(*values) transposes the list of rows into columns
for i, j in zip(*values):
    print(i, j)
1 1
2 2
3 3
# build a dict of data columns keyed by header
data_dict = {h: v for h, v in zip(header, zip(*values))}
data_dict
{'a': ('1', '1'), 'b': ('2', '2'), 'c': ('3', '3')}
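The csv module handles writing too; a minimal sketch that writes the parsed rows back out (the output path examples/mydata.csv is made up for illustration):
with open(r'examples/mydata.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(values)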
JSON (JavaScript Object Notation) has become one of the standard formats for sending data by HTTP request. It is a much more free-form data format than a tabular text format like CSV. Here is an example:
obj = '''
{"name": "Wes",
"places_lived": ["United States", "Spain", "Germany"],
"pet": null,
"siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]},
{"name": "Katie", "age": 38,
"pets": ["Sixes", "Stache", "Cisco"]}]
}
'''
import json
result = json.loads(obj)
result
{'name': 'Wes', 'places_lived': ['United States', 'Spain', 'Germany'], 'pet': None, 'siblings': [{'name': 'Scott', 'age': 30, 'pets': ['Zeus', 'Zuko']}, {'name': 'Katie', 'age': 38, 'pets': ['Sixes', 'Stache', 'Cisco']}]}
# json.dumps, on the other hand, converts a Python object back to JSON:
asjson = json.dumps(result)
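For human-readable output, json.dumps accepts an indent argument:
print(json.dumps(result, indent=2))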
siblings = pd.DataFrame(result['siblings'], columns=['name', 'age', 'pets'])
siblings
 | name | age | pets |
---|---|---|---|
0 | Scott | 30 | [Zeus, Zuko] |
1 | Katie | 38 | [Sixes, Stache, Cisco] |
pandas.read_json can automatically convert JSON datasets in specific arrangements into a Series or DataFrame.
!type examples\example.json
[{"a": 1, "b": 2, "c": 3}, {"a": 4, "b": 5, "c": 6}, {"a": 7, "b": 8, "c": 9}]
data = pd.read_json(r'examples/example.json')
data
 | a | b | c |
---|---|---|---|
0 | 1 | 2 | 3 |
1 | 4 | 5 | 6 |
2 | 7 | 8 | 9 |
# pandas -> json
print(data.to_json())
print(data.to_json(orient='records'))
{"a":{"0":1,"1":4,"2":7},"b":{"0":2,"1":5,"2":8},"c":{"0":3,"1":6,"2":9}}
[{"a":1,"b":2,"c":3},{"a":4,"b":5,"c":6},{"a":7,"b":8,"c":9}]
# **ToDo:**
# 1.5 - XML and HTML: Web Scraping
# 2 - Binary Data Formats
# 2.1 - Using HDF5 Format
# 2.2 - Reading Microsoft Excel Files
# 3 - Interacting with Web APIs
# 4 - Interacting with Databases