Introduction to working with DataFrames

In basic Python, we often use dictionaries that hold our measurements as lists (vectors). While these basic structures are handy for collecting data, they are suboptimal for further data processing. For that, we introduce pandas DataFrames, which are more convenient in the next steps. In Python, scientists often call tables "DataFrames".

In [1]:

import pandas as pd

Creating DataFrames from a dictionary of lists

Assume we did some image processing and have some results available in a dictionary that contains lists of numbers:

In [2]:

measurements = {
    "labels":     [1, 2, 3],
    "area":       [45, 23, 68],
    "minor_axis": [2, 4, 4],
    "major_axis": [3, 4, 5],
}

This data structure can be nicely visualized using a DataFrame:

In [3]:

df = pd.DataFrame(measurements)
df

Out[3]:

	labels	area	minor_axis	major_axis
0	1	45	2	3
1	2	23	4	4
2	3	68	4	5

Using these DataFrames, data modification is straightforward. For example, one can append a new column and compute its values from existing columns:

In [4]:

df["aspect_ratio"] = df["major_axis"] / df["minor_axis"]
df

Out[4]:

	labels	area	minor_axis	major_axis	aspect_ratio
0	1	45	2	3	1.50
1	2	23	4	4	1.00
2	3	68	4	5	1.25

Saving data frames

We can also save this table to disk so that we can continue working with it later.

In [5]:

df.to_csv("../../data/short_table.csv")
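By default, to_csv also writes the row index, which shows up as an extra unnamed column when the file is read back. The following is a minimal sketch of how to skip it, assuming the index carries no information worth keeping. The table is a shortened version of the one above, and the temporary folder is only an illustrative location, not the path used in this notebook.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"labels": [1, 2, 3], "area": [45, 23, 68]})

# write to a temporary folder (illustrative location only)
path = os.path.join(tempfile.mkdtemp(), "short_table.csv")

# index=False skips the row index, so no extra column appears on reload
df.to_csv(path, index=False)

# read the file back and list its columns
reloaded = pd.read_csv(path)
print(list(reloaded.columns))
```

If the row index is meaningful, for example because it holds label numbers, keeping the default behaviour may be the better choice.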

Creating DataFrames from lists of lists

Sometimes, we are confronted with data in the form of lists of lists. To make pandas understand that form of data correctly, we also need to provide the headers in the same order as the lists.

In [6]:

header = ['labels', 'area', 'minor_axis', 'major_axis']

data = [
    [1, 2, 3],
    [45, 23, 68],
    [2, 4, 4],
    [3, 4, 5],
]
          
# convert the data and header lists into a pandas DataFrame;
# the second argument sets the row index, so each list becomes a row
data_frame = pd.DataFrame(data, index=header)

# show it
data_frame

Out[6]:

	0	1	2
labels	1	2	3
area	45	23	68
minor_axis	2	4	4
major_axis	3	4	5

As you can see, this table is rotated. We can bring it into the usual form like this:

In [7]:

# rotate/flip it
data_frame = data_frame.transpose()

# show it
data_frame

Out[7]:

	labels	area	minor_axis	major_axis
0	1	45	2	3
1	2	23	4	4
2	3	68	4	5
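As an alternative to transposing, the headers and the lists can be paired up into a dictionary first, which leads back to the dictionary-of-lists form from the beginning of this notebook. This is a small sketch using Python's built-in zip function, not necessarily the preferred way in every situation:

```python
import pandas as pd

header = ['labels', 'area', 'minor_axis', 'major_axis']

data = [
    [1, 2, 3],
    [45, 23, 68],
    [2, 4, 4],
    [3, 4, 5],
]

# pair each header with its list of values, then build the table
data_frame = pd.DataFrame(dict(zip(header, data)))

# show it
data_frame
```

Here each list ends up as a column right away, so no transpose step is needed.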

Loading data frames

Tables can also be read from CSV files.

In [8]:

df_csv = pd.read_csv('../../data/blobs_statistics.csv')
df_csv

Out[8]:

	Unnamed: 0	area	mean_intensity	minor_axis_length	major_axis_length	eccentricity	extent	feret_diameter_max	equivalent_diameter_area	bbox-0	bbox-1	bbox-2	bbox-3
0	0	422	192.379147	16.488550	34.566789	0.878900	0.586111	35.227830	23.179885	0	11	30	35
1	1	182	180.131868	11.736074	20.802697	0.825665	0.787879	21.377558	15.222667	0	53	11	74
2	2	661	205.216339	28.409502	30.208433	0.339934	0.874339	32.756679	29.010538	0	95	28	122
3	3	437	216.585812	23.143996	24.606130	0.339576	0.826087	26.925824	23.588253	0	144	23	167
4	4	476	212.302521	19.852882	31.075106	0.769317	0.863884	31.384710	24.618327	0	237	29	256
...	...	...	...	...	...	...	...	...	...	...	...	...	...
56	56	211	185.061611	14.522762	18.489138	0.618893	0.781481	18.973666	16.390654	232	39	250	54
57	57	78	185.230769	6.028638	17.579799	0.939361	0.722222	18.027756	9.965575	248	170	254	188
58	58	86	183.720930	5.426871	21.261427	0.966876	0.781818	22.000000	10.464158	249	117	254	139
59	59	51	190.431373	5.032414	13.742079	0.930534	0.728571	14.035669	8.058239	249	228	254	242
60	60	46	175.304348	3.803982	15.948714	0.971139	0.766667	15.033296	7.653040	250	67	254	82

61 rows × 13 columns
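The column named "Unnamed: 0" most likely stems from a row index that was written when the file was saved; this is an assumption, as the notebook does not show how the file was created. If so, read_csv can be told to use the first column as the row index. The sketch below uses a short text stand-in for the file, containing the first two rows and two columns of the table above:

```python
import io

import pandas as pd

# stand-in for the CSV file: two rows and two columns of the table above
csv_text = """,area,mean_intensity
0,422,192.379147
1,182,180.131868
"""

# index_col=0 uses the first column as row index instead of as data
df_small = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(df_small.columns))
```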

Typically, we don't need all the information in these tables, so it makes sense to reduce the table. For that, we print out the column names first.

In [9]:

df_csv.keys()

Out[9]:

Index(['Unnamed: 0', 'area', 'mean_intensity', 'minor_axis_length',
       'major_axis_length', 'eccentricity', 'extent', 'feret_diameter_max',
       'equivalent_diameter_area', 'bbox-0', 'bbox-1', 'bbox-2', 'bbox-3'],
      dtype='object')

We can then copy and paste the column names we're interested in and create a new data frame.

In [10]:

df_analysis = df_csv[['area', 'mean_intensity']]
df_analysis

Out[10]:

	area	mean_intensity
0	422	192.379147
1	182	180.131868
2	661	205.216339
3	437	216.585812
4	476	212.302521
...	...	...
56	211	185.061611
57	78	185.230769
58	86	183.720930
59	51	190.431373
60	46	175.304348

61 rows × 2 columns
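When only a few columns should be removed, listing the columns to discard can be shorter than listing the ones to keep. This is a minimal sketch using the drop method on a small stand-in table, with values taken from the first two rows of the loaded file:

```python
import pandas as pd

# small stand-in for the loaded table
df_csv = pd.DataFrame({
    "area": [422, 182],
    "mean_intensity": [192.379147, 180.131868],
    "bbox-0": [0, 0],
    "bbox-1": [11, 53],
})

# remove the listed columns and keep everything else;
# drop returns a new table and leaves df_csv unchanged
df_reduced = df_csv.drop(columns=['bbox-0', 'bbox-1'])
print(list(df_reduced.columns))
```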

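You can then access columns and add new columns. In the following cell, pandas emits a SettingWithCopyWarning because df_analysis was selected from df_csv; the new column is computed nonetheless.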

In [18]:

df_analysis['total_intensity'] = df_analysis['area'] * df_analysis['mean_intensity']
df_analysis

C:\Users\rober\AppData\Local\Temp/ipykernel_20588/206920941.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_analysis['total_intensity'] = df_analysis['area'] * df_analysis['mean_intensity']

Out[18]:

	area	mean_intensity	total_intensity
0	422	192.379147	81184.0
1	182	180.131868	32784.0
2	661	205.216339	135648.0
3	437	216.585812	94648.0
4	476	212.302521	101056.0
...	...	...	...
56	211	185.061611	39048.0
57	78	185.230769	14448.0
58	86	183.720930	15800.0
59	51	190.431373	9712.0
60	46	175.304348	8064.0

61 rows × 3 columns
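The SettingWithCopyWarning shown with this cell appears because df_analysis was selected from df_csv, and pandas cannot tell whether the new column is meant to end up in the original table as well. One common way to avoid it is to make an explicit copy when selecting the columns. This is a minimal sketch with a small stand-in table, using values from the first rows shown above:

```python
import pandas as pd

# small stand-in for the loaded table
df_csv = pd.DataFrame({
    "area": [422, 182, 661],
    "mean_intensity": [192.379147, 180.131868, 205.216339],
    "extent": [0.586111, 0.787879, 0.874339],
})

# .copy() makes df_analysis an independent table
df_analysis = df_csv[['area', 'mean_intensity']].copy()

# adding a column now only affects df_analysis
df_analysis['total_intensity'] = df_analysis['area'] * df_analysis['mean_intensity']
df_analysis
```

The original table df_csv stays untouched, which is usually what is intended when working on a reduced table.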

Exercise

For the loaded CSV file, create a table that only contains these columns:

minor_axis_length
major_axis_length
aspect_ratio

In [ ]:

df_shape = pd.read_csv('../../data/blobs_statistics.csv')