(file_handling)=
# File handling

Suppose we have a text file like the one below, from which we would like to extract the temperature and density data:
```
# Density of water at different temperatures, at 1 atm pressure
# Column 1: temperature in Celsius degrees
# Column 2: density in kg/m^3
0.0 999.8425
4.0 999.9750
15.0 999.1026
20.0 998.2071
25.0 997.0479
37.0 993.3316
50.0 988.04
100.0 958.3665
# Source: Wikipedia (keyword Density)
```
(python_file_handling)=
## Plain Python
It is usually best to read and write files using the `with` statement, which keeps the syntax tidy and does a lot of things automatically for us, such as closing the file once we are done with it.
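For comparison, the minimal sketch below opens the same data file with and without the `with` statement; the only assumption is that `density_water.dat` sits in the working directory.

```python
# without "with": we must remember to close the file ourselves
file = open('density_water.dat', 'r')
contents = file.readlines()
file.close()

# with "with": the file is closed automatically when the block ends,
# even if an error occurs inside it
with open('density_water.dat', 'r') as file:
    contents = file.readlines()
```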
Our file contains 3 lines at the beginning and 1 line at the end which we do not want to extract. Additionally, the temperature and density values are separated by whitespace. Here is an example of how we could extract the data into two arrays:
```python
import numpy as np

with open('density_water.dat', 'r') as file:  # r for read
    # skip extra lines by taking a slice of the file
    lines = file.readlines()[3:-1]

# initialise 2 empty arrays to store our data
temp, density = np.zeros(len(lines)), np.zeros(len(lines))

for i in range(len(lines)):
    values = lines[i].split()
    temp[i] = float(values[0])
    density[i] = float(values[1])

print(temp)
print(density)
```
```
[  0.   4.  15.  20.  25.  37.  50. 100.]
[999.8425 999.975  999.1026 998.2071 997.0479 993.3316 988.04   958.3665]
```
The `lines` variable in the above example is a list containing the lines with our numbers, but each line is stored as a string - the values are not recognised as numbers at this point. Note that the temperature and density arrays are initialised with a specific length, rather than as empty lists that we then append numbers to. It is good practice to initialise arrays like this when we know exactly how many elements the final array will have. The `for` loop cycles through all elements of our list of lines. Each i-th element is a string, which `split()` turns into a new list with one element per whitespace-separated value in the line - in our case two. Finally, we convert each value from a string to a float and assign it to the corresponding element of the temperature and density arrays.
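To make the splitting and conversion steps concrete, here is a small illustrative snippet; the sample string is simply made up to match the format of one line of the file.

```python
line = '15.0 999.1026'          # one line of the file, still a string
values = line.split()           # split on whitespace -> ['15.0', '999.1026']
temperature = float(values[0])  # 15.0
density = float(values[1])      # 999.1026
```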
The example below shows how we could write the data back to a text file, in this case using comma-separated values.
```python
with open('output.txt', 'w') as file:  # w for write
    file.write('# Temperature and density data\n')
    for i in range(len(temp)):
        file.write(f'{temp[i]},{density[i]}\n')
```
(numpy_file_handling)=
## NumPy

NumPy's `genfromtxt` (generate from text) function provides full control over the file we are trying to read. It builds a NumPy array from the data. Some of its parameters include:
- `dtype` - sets a data type; if not set, the data type is determined automatically for each column
- `comments` - skips every line starting with the string set here; `#` by default
- `skip_header` - the number of lines to skip at the beginning of the file
- `skip_footer` - the number of lines to skip at the end of the file
- `delimiter` - the string used to separate values; any whitespace by default

For our file we do not have to change anything, since comment lines are marked with a `#` at the beginning and the values are separated by whitespace; a call with these parameters written out explicitly is sketched after the example below.
```python
import numpy as np

data = np.genfromtxt('density_water.dat')
print(data)
```
```
[[  0.     999.8425]
 [  4.     999.975 ]
 [ 15.     999.1026]
 [ 20.     998.2071]
 [ 25.     997.0479]
 [ 37.     993.3316]
 [ 50.     988.04  ]
 [100.     958.3665]]
```
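For illustration, the call below spells out settings equivalent to the defaults for our file; the keyword values are assumptions matching the file layout described above, not a change in behaviour.

```python
import numpy as np

# equivalent call with the parameters written out explicitly (illustrative only)
data = np.genfromtxt('density_water.dat',
                     dtype=float,     # every column holds floating-point numbers
                     comments='#',    # skip lines starting with '#'
                     delimiter=None,  # None means split on any whitespace
                     skip_header=0,   # comment lines are already handled by comments='#'
                     skip_footer=0)

print(data.shape)  # (8, 2): 8 rows, 2 columns
```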
We can save our array to a file in multiple ways; two examples are shown below. If we plan on using the array in another Python script, it is usually best to save it as a `.npy` file: we can later load the `.npy` file and reconstruct the original array. The reader is encouraged to read about pickling in Python, which allows almost any Python object, not only NumPy arrays, to be saved in a similar way.
```python
np.savetxt('data.txt', data)  # save data array to a text file
np.save('data.npy', data)     # save data array to a .npy file

A = np.load('data.npy')       # load the array back from the .npy file
print(A)
```
```
[[  0.     999.8425]
 [  4.     999.975 ]
 [ 15.     999.1026]
 [ 20.     998.2071]
 [ 25.     997.0479]
 [ 37.     993.3316]
 [ 50.     988.04  ]
 [100.     958.3665]]
```
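Since pickling was mentioned above, here is a minimal sketch of saving and loading the same array with the standard `pickle` module; the file name `data.pkl` is just an illustrative choice.

```python
import pickle

# write the array to a pickle file
with open('data.pkl', 'wb') as f:  # 'wb' = write in binary mode
    pickle.dump(data, f)

# read it back
with open('data.pkl', 'rb') as f:  # 'rb' = read in binary mode
    B = pickle.load(f)

print(B)
```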
(pandas_file_handling)=
## Pandas

Despite its name, pandas' `read_csv` can read many different types of files, including our `.dat` file. The `DataFrame` is the primary data structure in pandas, and this function returns one. It is a very powerful object with many capabilities; our own tutorial can be found in {ref}`Introduction to Pandas <pandas_intro>`.
As in the NumPy example, we specify that our comment lines begin with a `#` and that our delimiter is whitespace. Furthermore, we can give names (a header) to the columns of the `DataFrame` by passing our list `col_names` via the `names` parameter. By default, `read_csv()` takes the column names from the header line of the file; our file does not have one, so we set `header=None`.
```python
from pandas import read_csv

col_names = ['temperature', 'density']
df = read_csv('density_water.dat', comment='#', delim_whitespace=True,
              names=col_names, header=None)
print(df)

df.to_csv('data.csv', index=False)  # write the DataFrame to a comma-separated file
```
```
   temperature   density
0          0.0  999.8425
1          4.0  999.9750
2         15.0  999.1026
3         20.0  998.2071
4         25.0  997.0479
5         37.0  993.3316
6         50.0  988.0400
7        100.0  958.3665
```
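To connect this back to the NumPy examples, the snippet below is a small sketch of pulling the columns out of the `DataFrame` as NumPy arrays; the column names are the ones defined in `col_names` above.

```python
# extract each column of the DataFrame as a NumPy array
temp = df['temperature'].to_numpy()
density = df['density'].to_numpy()

print(temp)     # same values as the arrays we built by hand earlier
print(density)
```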
(file_handling_exercises)=
## Exercises

**Exercise 1:** Read the file of countries and their average temperatures and calculate the mean temperature of all countries whose names end in "stan". The values in the file are `\t` (tab) separated. The file can be found at `'Data\\TempData.txt'`.

HINT: Structure the data as a list of tuples.
````{admonition} Solution
:class: dropdown

```python
def isStan(country):
    if len(country) < 4:
        return False
    else:
        return country[-4:] == "stan"  # True or False

with open('Data\\TempData.txt', 'r') as file:
    lines = file.readlines()

nameTemp = []
res = 0
stan_countries = 0

for i in range(len(lines)):
    values = lines[i].split('\t')
    nameTemp.append((values[0].strip(), float(values[1])))
    if isStan(nameTemp[i][0]):  # if True
        res += nameTemp[i][1]
        stan_countries += 1

print(res / stan_countries)
```
````
**Exercise 2:** Using NumPy's `genfromtxt`, read the same file and print the names of all countries with an average temperature above 27 degrees.

HINT: Set `dtype=['U20', float]` and remember about delimiters.
````{admonition} Solution
:class: dropdown

```python
import numpy as np

data = np.genfromtxt('Data\\TempData.txt', delimiter='\t', dtype=['U20', float])

for i in data:
    if i[1] > 27:
        print(i[0])
```
````
**Exercise 3:** Count how many countries there are on each continent, counting only the rows that have a valid `Three_Letter_Country_Code` value. Use the pandas library. The file can be found at `Data\\CountryContinent.csv`.

HINT: Use the `for index, row in df.iterrows():` structure of the `for` loop. The final result should look something like this: `[['Asia', 58], ['Europe', 57], ['Antarctica', 5], ['Africa', 58], ['Oceania', 27], ['North America', 43], ['South America', 14]]`
````{admonition} Solution
:class: dropdown

```python
from pandas import read_csv

df = read_csv('Data\\CountryContinent.csv')

continents = df['Continent_Name'].unique()  # list of continent names from the file
res = [[continent, 0] for continent in continents]  # initial list, not counted yet

for index, row in df.iterrows():
    if row["Three_Letter_Country_Code"] != "nan":
        for j in range(len(res)):
            if row["Continent_Name"] == res[j][0]:  # find correct continent
                res[j][1] += 1  # increase country count by 1

print(res)
```
````