In plain Python, we often collect measurements in dictionaries whose values are lists (vectors) of numbers. While these basic structures are handy for collecting data, they are suboptimal for further data processing. For that we introduce pandas DataFrames, which are more convenient in the next steps. In Python, scientists often call tables "DataFrames".
import pandas as pd
Assume we did some image processing and have the results available in a dictionary that contains lists of numbers:
measurements = {
"labels": [1, 2, 3],
"area": [45, 23, 68],
"minor_axis": [2, 4, 4],
"major_axis": [3, 4, 5],
}
This data structure can be nicely visualized using a DataFrame:
df = pd.DataFrame(measurements)
df
| | labels | area | minor_axis | major_axis |
---|---|---|---|---|
0 | 1 | 45 | 2 | 3 |
1 | 2 | 23 | 4 | 4 |
2 | 3 | 68 | 4 | 5 |
Using these DataFrames, data modification is straightforward. For example, one can append a new column and compute its values from existing columns:
df["aspect_ratio"] = df["major_axis"] / df["minor_axis"]
df
| | labels | area | minor_axis | major_axis | aspect_ratio |
---|---|---|---|---|---|
0 | 1 | 45 | 2 | 3 | 1.50 |
1 | 2 | 23 | 4 | 4 | 1.00 |
2 | 3 | 68 | 4 | 5 | 1.25 |
We can also save this table to disk to continue working with it later.
df.to_csv("../../data/short_table.csv")
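By default, `to_csv` also writes the row index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back. The following minimal sketch illustrates this with a shortened version of the table above. It writes to a temporary folder (an assumption made here so that no existing file is overwritten) and compares the default behaviour with `index=False`.

```python
import os
import tempfile

import pandas as pd

# Shortened version of the table from above
df = pd.DataFrame({"labels": [1, 2, 3], "area": [45, 23, 68]})

# Temporary folder used for illustration only
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "short_table.csv")

    # Default: the row index is written as an extra, unnamed first column
    df.to_csv(path)
    with_index = pd.read_csv(path)

    # index=False: only the actual columns are written
    df.to_csv(path, index=False)
    without_index = pd.read_csv(path)

print(list(with_index.columns))
print(list(without_index.columns))
```

Whether to keep the index depends on the use case: if the row numbers carry no meaning, leaving them out keeps the file free of an extra column.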
Sometimes, we are confronted with data in the form of lists of lists. To make pandas understand that form of data correctly, we also need to provide the headers in the same order as the lists:
header = ['labels', 'area', 'minor_axis', 'major_axis']
data = [
[1, 2, 3],
[45, 23, 68],
[2, 4, 4],
[3, 4, 5],
]
# convert the data and header lists into a pandas data frame;
# the header entries become the row labels (index)
data_frame = pd.DataFrame(data, index=header)
# show it
data_frame
| | 0 | 1 | 2 |
---|---|---|---|
labels | 1 | 2 | 3 |
area | 45 | 23 | 68 |
minor_axis | 2 | 4 | 4 |
major_axis | 3 | 4 | 5 |
As you can see, this table is rotated. We can bring it into the usual form like this:
# rotate/flip it
data_frame = data_frame.transpose()
# show it
data_frame
| | labels | area | minor_axis | major_axis |
---|---|---|---|---|
0 | 1 | 45 | 2 | 3 |
1 | 2 | 23 | 4 | 4 |
2 | 3 | 68 | 4 | 5 |
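As an alternative to transposing, the headers and lists can be paired up first, so that the table is built column by column. This is only a sketch of one possible approach, reusing the `header` and `data` values from above:

```python
import pandas as pd

header = ['labels', 'area', 'minor_axis', 'major_axis']
data = [
    [1, 2, 3],
    [45, 23, 68],
    [2, 4, 4],
    [3, 4, 5],
]

# Pair each header with its list of values and build the table column-wise
data_frame = pd.DataFrame(dict(zip(header, data)))
print(data_frame)
```

This relies on `header` and `data` having the same length and order; `zip` silently stops at the shorter of the two.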
Tables can also be read from CSV files.
df_csv = pd.read_csv('../../data/blobs_statistics.csv')
df_csv
| | Unnamed: 0 | area | mean_intensity | minor_axis_length | major_axis_length | eccentricity | extent | feret_diameter_max | equivalent_diameter_area | bbox-0 | bbox-1 | bbox-2 | bbox-3 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 422 | 192.379147 | 16.488550 | 34.566789 | 0.878900 | 0.586111 | 35.227830 | 23.179885 | 0 | 11 | 30 | 35 |
1 | 1 | 182 | 180.131868 | 11.736074 | 20.802697 | 0.825665 | 0.787879 | 21.377558 | 15.222667 | 0 | 53 | 11 | 74 |
2 | 2 | 661 | 205.216339 | 28.409502 | 30.208433 | 0.339934 | 0.874339 | 32.756679 | 29.010538 | 0 | 95 | 28 | 122 |
3 | 3 | 437 | 216.585812 | 23.143996 | 24.606130 | 0.339576 | 0.826087 | 26.925824 | 23.588253 | 0 | 144 | 23 | 167 |
4 | 4 | 476 | 212.302521 | 19.852882 | 31.075106 | 0.769317 | 0.863884 | 31.384710 | 24.618327 | 0 | 237 | 29 | 256 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
56 | 56 | 211 | 185.061611 | 14.522762 | 18.489138 | 0.618893 | 0.781481 | 18.973666 | 16.390654 | 232 | 39 | 250 | 54 |
57 | 57 | 78 | 185.230769 | 6.028638 | 17.579799 | 0.939361 | 0.722222 | 18.027756 | 9.965575 | 248 | 170 | 254 | 188 |
58 | 58 | 86 | 183.720930 | 5.426871 | 21.261427 | 0.966876 | 0.781818 | 22.000000 | 10.464158 | 249 | 117 | 254 | 139 |
59 | 59 | 51 | 190.431373 | 5.032414 | 13.742079 | 0.930534 | 0.728571 | 14.035669 | 8.058239 | 249 | 228 | 254 | 242 |
60 | 60 | 46 | 175.304348 | 3.803982 | 15.948714 | 0.971139 | 0.766667 | 15.033296 | 7.653040 | 250 | 67 | 254 | 82 |
61 rows × 13 columns
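The `Unnamed: 0` column typically appears when a table was saved together with its row index. A minimal sketch of how to read such a file so that the first column becomes the index again is shown below. To stay self-contained, it reads a small excerpt (the first three rows and two columns of the table above) from a string instead of the actual file, and it assumes that the first header field in the file is empty.

```python
import io

import pandas as pd

# Excerpt of the table above, written the way to_csv stores an index
# (assumption: the first header field is empty)
csv_text = """,area,mean_intensity
0,422,192.379147
1,182,180.131868
2,661,205.216339
"""

# index_col=0 uses the first column as row index instead of as data
df_excerpt = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(df_excerpt)
```

The same `index_col=0` argument can be passed when reading from a file path.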
Typically, we don't need all the information in these tables, so it makes sense to reduce the table. To do so, we print out the column names first.
df_csv.keys()
Index(['Unnamed: 0', 'area', 'mean_intensity', 'minor_axis_length', 'major_axis_length', 'eccentricity', 'extent', 'feret_diameter_max', 'equivalent_diameter_area', 'bbox-0', 'bbox-1', 'bbox-2', 'bbox-3'], dtype='object')
We can then copy and paste the column names we're interested in and create a new data frame.
df_analysis = df_csv[['area', 'mean_intensity']]
df_analysis
| | area | mean_intensity |
---|---|---|
0 | 422 | 192.379147 |
1 | 182 | 180.131868 |
2 | 661 | 205.216339 |
3 | 437 | 216.585812 |
4 | 476 | 212.302521 |
... | ... | ... |
56 | 211 | 185.061611 |
57 | 78 | 185.230769 |
58 | 86 | 183.720930 |
59 | 51 | 190.431373 |
60 | 46 | 175.304348 |
61 rows × 2 columns
You can then access columns and add new columns. Because df_analysis was selected from df_csv, we first make an explicit copy; otherwise pandas may raise a SettingWithCopyWarning when a new column is assigned.
# make the selection an independent table before modifying it
df_analysis = df_analysis.copy()
df_analysis['total_intensity'] = df_analysis['area'] * df_analysis['mean_intensity']
df_analysis
| | area | mean_intensity | total_intensity |
---|---|---|---|
0 | 422 | 192.379147 | 81184.0 |
1 | 182 | 180.131868 | 32784.0 |
2 | 661 | 205.216339 | 135648.0 |
3 | 437 | 216.585812 | 94648.0 |
4 | 476 | 212.302521 | 101056.0 |
... | ... | ... | ... |
56 | 211 | 185.061611 | 39048.0 |
57 | 78 | 185.230769 | 14448.0 |
58 | 86 | 183.720930 | 15800.0 |
59 | 51 | 190.431373 | 9712.0 |
60 | 46 | 175.304348 | 8064.0 |
61 rows × 3 columns
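Another way to add a computed column is `assign`, which returns a new table and leaves the selected one untouched. The sketch below uses a small stand-in table, `df_small`, built from the first two rows shown above, instead of the loaded CSV file. The names `df_small`, `selection` and `with_total` are chosen here for illustration only.

```python
import pandas as pd

# Stand-in for the loaded table (first two rows from above)
df_small = pd.DataFrame({
    "area": [422, 182],
    "mean_intensity": [192.379147, 180.131868],
})

selection = df_small[["area", "mean_intensity"]]

# assign() returns a new DataFrame with the extra column;
# `selection` itself is not modified
with_total = selection.assign(
    total_intensity=selection["area"] * selection["mean_intensity"]
)
print(with_total)
```

Since nothing is written into the selection, this variant does not depend on whether the selection is a copy of the original table.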
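For the loaded CSV file, create a table that only contains the following columns. Note that aspect_ratio is not part of the file and has to be computed from the two axis lengths:
- minor_axis_length
- major_axis_length
- aspect_ratio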
df_shape = pd.read_csv('../../data/blobs_statistics.csv')