#!/usr/bin/env python
# coding: utf-8

# # Multiparametric MRI - supervised (KNN) tissue classification
#
# BMED360-2021 `04-mri-knn-tissue-classification.ipynb`
#
# Open In Colab

# ## Learning objectives and context
#
# - In this session you will learn to predict predefined tissue types in a given multispectral MR image using machine learning (**supervised classification**)
#
# - The supervised classification model we will use is a simple **K-nearest neighbor** classification model, denoted $f$ (described below)
#
# - To perform such pixel-wise tissue classification we will make use of the training data (i.e. the `training mask`) we obtained during the **labelling of data** (see figure below of color-coded tissue samples)
#
# - In the previous data labeling step (not part of these notebooks) we defined six different classes (tissue types), denoted $\mathbf y$, and the corresponding four-channel multispectral MRI data, denoted $\mathbf X$.
#
# - The notebook is thus a practical machine learning example of the formalism: $\mathbf y \approx f(\mathbf X, \theta)$
#
# - You will also learn to navigate and appreciate the distinction between **image space** (pixel locations, spatial neighborhoods) and **feature vector space** (signal intensity value combinations, and similarity of pixel-based and tissue-based "signatures").

# ### For using Colab
# **--> the following libraries must be `pip installed` (i.e. uncomment the following pip and wget commands):**

# In[1]:

#!pip install gdown

# In[2]:

#!pip install pydicom

# In[3]:

#!pip install simpleitk

# In[4]:

#!pip install nilearn

# In[5]:

#!wget -O utils.py https://raw.githubusercontent.com/computational-medicine/BMED360-2021/master/Lab2-ML-tissue-classification/utils.py

# In[6]:

import gdown
import shutil
import os
import sys

# In[7]:

# Download zip-file if ./assets does not exist (as when running in Colab)
if not os.path.isdir('./assets'):
    ## Download assets.zip from Google Drive
    # https://drive.google.com/file/d/1gl7rbwkmSS6hVwEXUZuAPwEe28OwCH31/view?usp=sharing
    file_id = '1gl7rbwkmSS6hVwEXUZuAPwEe28OwCH31'
    url = 'https://drive.google.com/uc?id=%s' % file_id
    output = './assets.zip'
    gdown.download(url, output, quiet=False)
    ## Unzip the assets file into `./assets`
    shutil.unpack_archive(output, '.')
    ## Delete the `assets.zip` file
    os.remove(output)
else:
    print('./assets exists already!')

# In[8]:

# Make a results directory if it does not exist
r_pth = './results'
if not os.path.isdir(r_pth):
    try:
        os.mkdir(r_pth)
    except OSError:
        print("Creation of the directory %s failed" % r_pth)
    else:
        print("Successfully created the directory %s " % r_pth)
else:
    print("Directory %s already exists!" % r_pth)

# In[9]:

# HOME = os.path.expanduser('~')  # To make a path to the local home directory

# # Theory: K-nearest neighbor classification (KNN)

# Short description and explanation of [KNN](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm)-classification:
#
# In pattern recognition, the $K$-nearest neighbors algorithm (K-NN) is a non-parametric method used for **classification** and **regression**. In both cases, the input consists of the $K$ closest training samples in the feature space. The output depends on whether K-NN is used for classification or regression:
#
# - In **K-NN classification**, the output is a **class membership**.
#   An object is classified by a [plurality vote](https://en.wikipedia.org/wiki/Plurality_voting) of its neighbors, with the object being assigned to the class most common among its $K$ nearest neighbors ($K$ is a positive integer, typically small and odd, e.g. $K \in \{1,3,5,7,9,11\}$, say).
#
# - If $K = 1$, then the object is simply assigned to the class of that single nearest neighbor, i.e. **nearest neighbor classification**.
#
# - In **K-NN regression**, the output is the property value for the object. This value is the _average of the values of the $K$ nearest neighbors_.

# In[10]:

from IPython.display import Image
Image(filename='./assets/knn_illustration.png', width=500)

# **_Instance of a K-NN classification_**
#
# - The test sample (the filled green circle), with unknown class belonging, shall be classified to either the class marked as blue squares or the class marked as red triangles, i.e. a **two-class problem** using K-NN.
#
# - If $K = 3$ (circle drawn with a black continuous line) the unknown observation will be assigned the **red triangle class**, since there are two red-labeled samples and only one blue-labeled sample among the three closest labeled neighbors (using the Euclidean distance metric).
#
# - If, on the other hand, $K = 5$ (circle drawn with a black dashed line) the previously unseen (i.e. not labeled) sample will be assigned the **blue square class** (three blue squares vs. two red triangles among the five closest labeled samples), i.e. "majority voting".
#
# - K-NN is a _lazy learner_ because it doesn't learn a discriminative function from the training data but _"memorizes"_ the training dataset instead. In contrast, the logistic regression algorithm, for example, learns its model weights (parameters, $\theta$) during training.
#
# - For a more elaborate explanation, see the Wikipedia entry for the [k-nearest neighbors algorithm](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm).
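# To make the voting mechanics concrete, here is a minimal from-scratch sketch of K-NN
# majority voting (a small addition; the rest of the notebook relies on scikit-learn instead,
# and the toy arrays `toy_X` and `toy_y` below are made up purely for illustration):

# In[ ]:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, K=3):
    """Classify one sample x by majority vote among its K nearest training samples."""
    dists = np.linalg.norm(X_train - x, axis=1)  # Euclidean distances to all training samples
    nearest = np.argsort(dists)[:K]              # indices of the K closest samples
    votes = Counter(y_train[nearest])            # count class labels among the neighbors
    return votes.most_common(1)[0][0]            # the winning (most common) class

toy_X = np.array([[1.0, 1.0], [1.2, 0.8], [4.0, 4.2], [3.8, 4.0], [4.1, 3.9]])
toy_y = np.array(['blue', 'blue', 'red', 'red', 'red'])
print(knn_predict(toy_X, toy_y, np.array([3.5, 3.5]), K=3))  # -> 'red'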
# # Four-channel multispectral image and associated training masks

# We will be using a four-channel multispectral image (slice 60 from a multispectral 3D recording),
# reported in Lundervold et al. Volume distribution of cerebrospinal fluid using multispectral MR imaging.
# _Medical Image Analysis_ 2000;4:123-136. https://www.ncbi.nlm.nih.gov/pubmed/10972326, [[PDF](https://drive.google.com/file/d/17Ut1ScHU4cX5x_EHwQnNwH_q3Lgcq5WA/view)]
# and a manually delineated training mask (cf. "labeling of data") consisting of 6 tissue types (color coded in [R,G,B] space) as follows:
#
# - AIR (air/compact bone) in Magenta [255,0,255]
# - GM (gray matter) in Red [255,0,0]
# - WM (white matter) in Cyan [0,255,255]
# - CSF (cerebrospinal fluid) in Blue [0,0,255]
# - MUS (muscle) in Green [0,255,0]
# - FAT (fat) in Yellow [255,255,0]
#
# and a manually delineated `head ROI mask` for spatial restriction of the supervised pixel classification.
#
# **Check out:**
# *Multispectral MRI analysis started with the seminal work of [Vannier et al.](https://www.ncbi.nlm.nih.gov/pubmed/3964938), Radiology 1985 - inspired by the research (on remote sensing) at NASA!*

# In[11]:

from IPython.display import Image
Image(filename='./assets/multispectral_tissue_classification_pptx.png', width=600)

# # Set up our Python environment

# In[12]:

get_ipython().run_line_magic('matplotlib', 'inline')
# This is to be able to display figures and graphs within the notebook browser

import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import seaborn as sns
import numpy as np
import pandas as pd
import nibabel as nib
import imageio
import pydicom as dicom
from sklearn.cluster import KMeans
from nilearn.masking import apply_mask
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

# In[13]:

import sklearn
print('scikit-learn:', sklearn.__version__)

# # MRI pulse sequences (features) and tissue types (classes)

# **Define the channel names (MRI pulse sequence acronyms) and class labels (tissue types) that are used**

# In[14]:

mydata = './assets'
chn_names = ['FLASH', 'DESS', 'FISP', 'PSIF']
class_names = ['air', 'gm', 'wm', 'csf', 'mus', 'fat']

# ## Construct a multispectral image

# **Use `pydicom` for loading a series of 2D DICOM files**

# In[15]:

dcms = []  # for making a sequence of channel images
for fn in chn_names:
    pth = '%s/%s_060.dcm' % (mydata, fn.lower())
    dcms.append(dicom.dcmread(pth))
print(f'{len(dcms)} DICOM images loaded')
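# As an aside - the Colab setup above pip-installs SimpleITK, which can serve as an
# alternative DICOM reader. A minimal sketch (assuming the same file layout as above;
# illustrative only, and not used further in this notebook):

# In[ ]:

import SimpleITK as sitk

itk_img = sitk.ReadImage('%s/flash_060.dcm' % mydata)  # read a single 2D DICOM slice
itk_arr = sitk.GetArrayFromImage(itk_img)              # numpy array, shape (1, rows, cols)
print(itk_arr.shape, itk_img.GetSpacing())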
# **Let's load the data (pixel values) from the DICOM structures into a 3D numpy matrix and prepare for a NIFTI header**

# In[16]:

# Get the shape of a single 2D image, e.g. from the first image in the list
rows, cols = dcms[0].pixel_array.shape
# Get the number of DICOM files (images in the whole volume)
nchn = len(dcms)
# Get slice thickness
thick = round(float(dcms[0].SliceThickness), 2)
# Get slice pixel spacing
pix_spacing = tuple([float(item) for item in dcms[0].PixelSpacing])
# Get the pixel type of an image
pixelType = dcms[0].pixel_array.dtype
# Create an empty 3D matrix with exact shape (rows, cols, channels) and pixel type
# (possible pixel types: int or float; 8, 16, 32 or 64 bits)
img_data = np.zeros((rows, cols, nchn), dtype=pixelType)
# Fill in the 3D array with the data (pixel values) from the loaded DICOM structures;
# reverse rows in the 2D pixel array to keep the correct orientation (trial and error)
for i, s in enumerate(dcms):
    img_data[:, :, i] = s.pixel_array[::-1,:]  # np.transpose(s.pixel_array[::-1,:])

# In[17]:

print(f'shape:{img_data.shape}; pixel type:{pixelType} {type(img_data)}; \
pixel spacing:{pix_spacing}; slice thickness: {thick} mm')

# **Here we construct a new empty NIFTI header**

# In[18]:

hdr = nib.Nifti1Header()

# In[19]:

# Zooms as a tuple of length 4: in-plane pixel spacing, slice thickness, and 1.0 for the channel axis
# (note: assigning to `hdr.set_zooms` would shadow the nibabel method, so we keep the zooms in a plain tuple)
zooms = pix_spacing + (thick, 1.0)
zooms

# In[20]:

img = nib.Nifti1Image(img_data, affine=np.eye(4))

# In[21]:

img.affine[0,0] = zooms[0]
img.affine[1,1] = zooms[1]
img.affine[2,2] = zooms[2]
img.affine

# In[22]:

img = nib.Nifti1Image(img_data, img.affine)

# In[23]:

img.header.set_data_dtype(pixelType)

# In[24]:

print(img.header)

# In[25]:

nib.save(img, './results/04_multispectral_060.nii.gz')

# # Load a multispectral MR image, training masks and ROI mask
#
# Using `nibabel` to read the multispectral image, and `imageio` to read the training mask and brain ROI mask

# In[26]:

fn_multispectral = './results/04_multispectral_060.nii.gz'
img = nib.load(fn_multispectral)

# In[27]:

fn_tmsk = './assets/flash_060_training_mask_6cla.png'
tm = imageio.imread(fn_tmsk)
tmsk = np.array(tm[::-1,:,:], dtype=np.uint8)  # tm[::-1,:,:] to reverse the up-side-down training mask
tmsk_img = nib.Nifti1Image(tmsk, img.affine)
nib.save(tmsk_img, './results/04_flash_060_training_mask_6cla.nii.gz')

# In[28]:

fn_roimsk = './assets/flash_060_brain_mask.png'
rm = imageio.imread(fn_roimsk)
roimsk = np.array(rm[::-1,:,:], dtype=np.uint8)  # rm[::-1,:,:] to reverse the up-side-down ROI mask
roimsk_img = nib.Nifti1Image(roimsk, img.affine)
nib.save(roimsk_img, './results/04_flash_060_brain_mask.nii.gz')

# In[29]:

img.header.get_data_dtype()

# **Display the six-class training mask and the ROI brain mask**

# In[30]:

fig, axes = plt.subplots(1,2, figsize=(14,7))
ax = axes.ravel()
ax[0].imshow(tmsk, origin='lower')
ax[0].axis('on')
ax[0].set_title("Six-class training mask", size=16)
ax[0].set_xticks([]), ax[0].set_yticks([])
ax[1].imshow(roimsk, origin='lower')
ax[1].axis('on')
ax[1].set_title("ROI brain mask", size=16)
ax[1].set_xticks([]), ax[1].set_yticks([])
plt.show()

# ## Show properties of the multispectral NIFTI image

# In[31]:

print('**Multispectral image info:**')
print('shape of image = ', img.header.get_data_shape())
print('units = ', img.header.get_xyzt_units())
print('voxel size = ', img.header.get_zooms())
print('dtype = %s' % img.header.get_data_dtype())
if img.header.get_data_dtype() == 'int16':
    # Collapse any singular (z-)dimension
    dat = np.int16(img.get_fdata().squeeze())
print('min = %.1f' % dat.min())
print('max = %.1f' % dat.max())
print('number of channels =', img.shape[-1])
print('shape of 2D+spectral img_data = ', dat.shape)
print('dtype of 2D+spectral img_data = ', dat.dtype)
print('img affine:\n', img.affine)
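# A quick sanity check on the geometry we just constructed (a small addition): the affine
# maps voxel indices (row, col, channel) to scanner-space mm coordinates, so two corners
# one voxel apart should differ by the voxel size. A minimal sketch, assuming `img` from above:

# In[ ]:

from nibabel.affines import apply_affine

print(apply_affine(img.affine, [[0, 0, 0], [1, 1, 1]]))  # two voxel corners, one voxel apart
print('voxel size from header:', img.header.get_zooms())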
# #### and the corresponding training mask

# In[32]:

print('**Training mask info:**')
print('shape = ', tmsk_img.header.get_data_shape())
print('voxel size = ', tmsk_img.header.get_zooms())
print('dtype tmsk = %s' % tmsk_img.header.get_data_dtype())
if tmsk_img.header.get_data_dtype() == 'uint8':
    # Collapse any singular (z-)dimension
    tmsk_data = np.uint8(tmsk_img.get_fdata().squeeze())
print('min mask value = %.0f' % tmsk_data.min())
print('max mask value = %.0f' % tmsk_data.max())
print('shape of 2D tmsk_data = ', tmsk_data[:,:,0].shape)
print('dtype of 2D tmsk_data = ', tmsk_data.dtype)
print('tmsk affine:', tmsk_img.affine)

# #### and the ROI (head mask)

# In[33]:

print('**Brain ROI mask info:**')
print('shape = ', roimsk_img.header.get_data_shape())
print('voxel size = ', roimsk_img.header.get_zooms())
print('dtype roimsk = %s' % roimsk_img.header.get_data_dtype())
if roimsk_img.header.get_data_dtype() == 'uint8':
    # Collapse any singular (z-)dimension
    roimsk_data = np.uint8(roimsk_img.get_fdata().squeeze())
print('min mask value = %.0f' % roimsk_data.min())
print('max mask value = %.0f' % roimsk_data.max())
print('shape of 2D roimsk_data = ', roimsk_data[:,:,0].shape)
print('dtype of 2D roimsk_data = ', roimsk_data.dtype)
print('roimsk affine:\n', roimsk_img.affine)

# # Display the multispectral image as separate channel images

# In[34]:

fig, axes = plt.subplots(2,2, figsize=(14,14))
ax = axes.ravel()
for k, ch in enumerate(chn_names):
    ax[k].imshow(dat[:, :, k], cmap='gray', origin='lower')
    ax[k].set_title(ch)
    ax[k].set(xlabel="")
    ax[k].axis('off')
plt.suptitle("The multispectral MRI slice (data, $\mathbf{X} \subset \mathbf{R}^4$)", size=16)
plt.tight_layout()
plt.show()

# # Show the training mask with color-coded class (tissue type) labels
#
# (according to a pre-defined color lookup table, or LUT)

# We use a **dictionary** for the color LUT:

# In[35]:

import matplotlib

col_code = {
    'BCK': [255,255,255],  # White (background)
    'AIR': [255,0,255],    # Magenta
    'GM':  [255,0,0],      # Red
    'WM':  [0,255,255],    # Cyan
    'CSF': [0,0,255],      # Blue
    'MUS': [0,255,0],      # Green
    'FAT': [255,255,0],    # Yellow
    'ERR': [0,0,0]         # Black (error in labeling, should be BCK)
}
cla_names = list(col_code.keys())
ncla = len(cla_names)
colors = np.array(list(col_code.values()))/255  # scale to interval 0-1
mycmap = matplotlib.colors.ListedColormap(colors)
cla_cmap = matplotlib.cm.get_cmap(mycmap, ncla)  # ncla discrete colors

# In[36]:

tmask = np.zeros(tmsk_data[:,:,0].shape, dtype=np.uint8)
R = tmsk_data[:,:,0].squeeze()
G = tmsk_data[:,:,1].squeeze()
B = tmsk_data[:,:,2].squeeze()
for idx, (key, val) in enumerate(col_code.items()):
    mm = np.where((R == val[0]) & (G == val[1]) & (B == val[2]))
    tmask[mm[0],mm[1]] = (idx+1)  # mm[0]: rows, mm[1]: cols
    print(f'{idx}-{key}: {len(mm[0])} pixels (RGB={val})')
print(f'\ntmask.dtype: {tmask.dtype}')

# In[37]:

u, c = np.unique(tmask, return_counts=True)
print(f'Unique values: {list(u)} \nNumber of occurrences: {list(c)}')

# Construct a list of class names and the corresponding class numbers, class-mask values, and class cardinalities

# In[38]:

cla_names_num = []
for i in range(ncla):
    lbl = cla_names[i] + ' [%d]: val=%d, card=%d' % (i, u[i], c[i])  # avoid shadowing the builtin `str`
    cla_names_num.append(lbl)
cla_names_num[:]

# ### Make a figure showing the color-coded labeled pixels (training mask)

# In[39]:

fig, ax = plt.subplots(figsize=(10,12))
cmsk = ax.imshow(tmsk_data[:, :])
clim = cmsk.properties()['clim']
cax = ax.imshow(tmsk_data[:, :], cmap=cla_cmap, origin='lower', clim=clim)
ax.set_title('Color-coded training mask\n')
ax.axis('on')
cbar = fig.colorbar(cax, shrink=0.4, label='Tissue classes [0-%d] and their cardinality' % (ncla-1))
tick_locs = np.linspace(ncla*2+clim[0]+0.5, ((255-2*ncla)/255)*clim[1]-0.5, ncla)
cbar.set_ticks(tick_locs)
cbar.ax.set_yticklabels(cla_names_num)
plt.tight_layout()
plt.show()
fig.savefig('./results/04_training_mask_1_6_color_coded.png', transparent=False, dpi=300, bbox_inches="tight")
#fig.savefig('%s/prj/MMIV-DLN-AI-2021/results/training_mask_1_6_color_coded.png' % (HOME),
#            transparent=False, dpi=300, bbox_inches="tight")

# #### Correcting the training mask (cf. black pixels on the boundary of the CSF mask being mapped to unlabeled background, BCK)

# In[40]:

c_tmsk_data = tmsk_data.copy()
mm = np.where((R == 0) & (G == 0) & (B == 0))  # Black is an error in labeling; map to BCK = white (or possibly to CSF)
c_tmsk_data[mm[0],mm[1], 0] = 255  # mm[0]: rows, mm[1]: cols
c_tmsk_data[mm[0],mm[1], 1] = 255
c_tmsk_data[mm[0],mm[1], 2] = 255

# **Remove the last element from the `col_code` dictionary**

# In[41]:

c_col_code = col_code.copy()  # Copy, so popping an item will not affect the original col_code
err_col = c_col_code.popitem()
c_col_code

# In[42]:

c_tmask = np.zeros(c_tmsk_data[:,:,0].shape, dtype=np.uint8)
R = c_tmsk_data[:,:,0].squeeze()
G = c_tmsk_data[:,:,1].squeeze()
B = c_tmsk_data[:,:,2].squeeze()
for idx, (key, val) in enumerate(c_col_code.items()):
    mm = np.where((R == val[0]) & (G == val[1]) & (B == val[2]))
    c_tmask[mm[0],mm[1]] = (idx+1)  # mm[0]: rows, mm[1]: cols
    print(f'{idx}-{key}: {len(mm[0])} pixels (RGB={val})')

# **Update `c_cla_cmap` accordingly**

# In[43]:

c_cla_names = list(c_col_code.keys())
c_ncla = len(c_cla_names)
c_colors = np.array(list(c_col_code.values()))/255  # scale to interval 0-1
c_mycmap = matplotlib.colors.ListedColormap(c_colors)
c_cla_cmap = matplotlib.cm.get_cmap(c_mycmap, c_ncla)  # c_ncla discrete colors

# In[44]:

u, c = np.unique(c_tmask, return_counts=True)
print(f'Unique values: {list(u)} \nNumber of occurrences: {list(c)}')

# In[45]:

c_cla_names_num = []
for i in range(c_ncla):
    lbl = c_cla_names[i] + ' [%d]: val=%d, card=%d' % (i, u[i], c[i])
    c_cla_names_num.append(lbl)
c_cla_names_num[:]

# In[46]:

fig, ax = plt.subplots(figsize=(10,12))
cmsk = ax.imshow(c_tmsk_data[:, :])
clim = cmsk.properties()['clim']
cax = ax.imshow(c_tmsk_data[:, :], cmap=c_cla_cmap, origin='lower', clim=clim)
ax.set_title('Color-coded corrected training mask\n', size=16)
ax.axis('on')
ax.set_yticklabels("")
ax.set_xticklabels("")
cbar = fig.colorbar(cax, shrink=0.4, label='Tissue classes [0-%d] and their cardinality' % (c_ncla-1))
tick_locs = np.linspace(c_ncla*2+clim[0]+0.5, ((255-2*c_ncla)/255)*clim[1]-0.5, c_ncla)
cbar.set_ticks(tick_locs)
cbar.ax.set_yticklabels(c_cla_names_num)
plt.tight_layout()
plt.show()
fig.savefig('./results/04_corrected_training_mask_1_6_color_coded.png', transparent=False, dpi=300, bbox_inches="tight")

# # We make a data frame (FVB) from the multispectral MRI data and the training mask
#
# FVB = **Feature Vector Base**; the training mask is here a **multiclass training mask**

# - Using `np.where()` to find the locations of those pixels containing labelled tissue
#
# - Thereafter extract the corresponding data (feature vectors) from the multispectral image, which is spatially aligned with the training mask:
# In[47]:

# Find pixel locations corresponding to AIR (class 1), GM (class 2), ..., FAT (class 6) and get the MRI data from dat
frames = []  # Collect one data frame per class
for cn, cla in enumerate(class_names):
    ind = np.where(c_tmask == cn+2)  # Indices (row, col) for the given class; labeled classes start at mask value 2 (0 = unlabeled, 1 = BCK)
    df = pd.DataFrame(np.asarray(dat[ind[0][:],ind[1][:],:]), columns=chn_names)
    df.insert(len(df.columns), 'Class', class_names[cn].upper())  # Last entry is the class name
    frames.append(df)
FVB = pd.concat(frames, ignore_index=True)  # Concatenate the per-class frames (DataFrame.append is removed in recent pandas)
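# A quick sanity check (a small addition, assuming the cell above has run): the per-class
# row counts in the FVB should match the class cardinalities printed for the corrected
# training mask earlier.

# In[ ]:

print(FVB['Class'].value_counts())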
# **Check some training data:** $(x_1^i, x_2^i, x_3^i, x_4^i, y^i)$ for $i=0,1,\ldots,4$ and for $i=N-4, N-3,\ldots,N$, where $N$ is the total number of labelled pixels

# In[48]:

FVB.head()

# In[49]:

FVB.tail()

# **Class-specific summary statistics** from the FVB across the different features (channels, or pulse sequences)

# In[50]:

FVB.groupby('Class').describe(percentiles=[0.5]).round(3).T

# # Save the dataframe FVB from the labeled NIFTI data as a .csv file

# In[51]:

FVB.to_csv('./results/04_multispectral_mri_training_data_from_manual_png_mask.csv', index=False)

# # Supervised classification with the KNN algorithm
#
# **See also https://machinelearningmastery.com/tutorial-to-implement-k-nearest-neighbors-in-python-from-scratch**

# # Read the training data (`X` and `y`)

# In[52]:

FVB = pd.read_csv('./results/04_multispectral_mri_training_data_from_manual_png_mask.csv')
FVB.head()

# ## Separate the data matrix `X_train` from the corresponding vector of labels `y_train` in the FVB dataframe

# In[53]:

X_train = FVB.iloc[:, :-1].values.astype(float)  # The data X
y_train = FVB.iloc[:, -1].values                 # Last column is 'Class'

# In[54]:

X_train.shape

# # Load the test data set `X_test` defined by the head `ROI mask`

# To be able to map classified pixels (i.e. pixel-based feature vectors in **signal intensity space** $\subset \mathbf R^4$) back into **image space** $\subset \mathbf N^+ \times \mathbf N^+$, we will need to store their row-column $(i, j)$ locations - not only the feature vectors.

# In[55]:

# fn_roimsk = './data/mri/brain_roi_mask.nii.gz'
fn_roimsk = './results/04_flash_060_brain_mask.nii.gz'
roimsk_img = nib.load(fn_roimsk)
if roimsk_img.header.get_data_dtype() == 'uint8':
    roimsk_data = np.uint8(roimsk_img.get_fdata())
print(roimsk_data.shape)
print(roimsk_data.min())
print(roimsk_data.max())
# For using the whole image as ROI, uncomment:
# roimsk_data = np.ones(roimsk_data.shape)

# In[56]:

plt.imshow(roimsk_data[:,:,:], origin='lower')
plt.show()

# In[57]:

fn_multispectral = './results/04_multispectral_060.nii.gz'
img = nib.load(fn_multispectral)
if img.header.get_data_dtype() == 'int16':
    # Collapse any singular (z-)dimension
    data = np.int16(img.get_fdata())

# In[58]:

fig, axes = plt.subplots(2,2, figsize=(10,10))
ax = axes.ravel()
for k, ch in enumerate(chn_names):
    ax[k].imshow(data[:,:,k].squeeze() + roimsk_data[:,:,1].squeeze(), cmap='gray', origin='lower')
    ax[k].set_title('%d-%s' % (k+1, ch))
    ax[k].set(xlabel="")
    ax[k].axis('off')
plt.show()

# ## Find all pixel locations in the ROI to extract the test data matrix, `X_test`
#
# **and save the test data samples and their corresponding pixel locations as a Pandas dataframe**

# In[59]:

# Find pixel locations corresponding to the brain ROI (value red = [255, 0, 0])
ind_test = np.where((roimsk_data[:,:,0] == 255) & (roimsk_data[:,:,1] == 0) & (roimsk_data[:,:,2] == 0))
X_test = np.asarray(data[ind_test[0][:],ind_test[1][:],:])  # The multispectral signal intensities
roimask = np.zeros(roimsk_data[:,:,0].shape)
roimask[ind_test[0][:], ind_test[1][:]] = 1  # The ROI brain mask values

dfTest = pd.DataFrame(X_test, columns=chn_names)
dfTest.insert(loc=len(dfTest.columns), column='row', value=ind_test[0])  # Row of pixel location
dfTest.insert(loc=len(dfTest.columns), column='col', value=ind_test[1])  # Col of pixel location

# In[60]:

plt.imshow(roimask, cmap='gray', origin='lower')
plt.show()
print(roimask.max(), roimask.shape)

# In[61]:

dfTest.head()

# In[62]:

dfTest.tail()

# ## Scaling of feature vectors (MRI signal intensities) in `X_train` and `X_test`
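# Before answering the question below, it can help to inspect the raw feature scales.
# A quick check (a small addition to the original notebook), assuming `X_train` from above:

# In[ ]:

# Per-channel mean and standard deviation of the raw training features;
# channels with much larger numeric ranges dominate Euclidean distances in K-NN.
print(pd.DataFrame(X_train, columns=chn_names).agg(['mean', 'std']).round(2))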
# ### $\rightarrow$ Your turn
#
# #### Why should we consider feature scaling (normalization) in machine learning?
#
# (Hint: See https://en.wikipedia.org/wiki/Feature_scaling)

# We will use the [`StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) from `scikit-learn`

# In[63]:

scaler = StandardScaler()
scaler.fit(X_train.astype(float))
X_train_scaled = scaler.transform(X_train.astype(float))
X_test_scaled = scaler.transform(X_test.astype(float))

# We make Pandas dataframes from the scaled X_train (`dX`) and the corresponding y_train (`dy`) to confirm the effect of feature scaling, i.e.

# In[64]:

dX = pd.DataFrame(X_train_scaled, columns=chn_names)
dy = pd.DataFrame(y_train, columns=['Class'])
FVB_train = pd.concat([dX, dy], axis=1)
FVB_train.describe(percentiles=[0.5]).round(4).T

# **this demonstrates that we have obtained `zero mean` and `unit variance` for each feature in the training data after scaling**

# # Estimate (fit) a KNN classifier given the training data set
#
# **for (non-scaled) training data**

# We use [`KNeighborsClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) from `scikit-learn`
#
# We try K=1

# In[65]:

K = 1
classifier = KNeighborsClassifier(n_neighbors=K)
classifier.fit(X_train, y_train)

# ## Prediction (`y_train_pred`) on the training dataset (`X_train`)
#
# - In our case, we have not labeled all the pixels in the ROI (that would take a lot of time and effort, and would also be prone to intra- and inter-observer variation).
#
# - Thus, we do not have the "ground truth", i.e. y_test (only y_train).
#
# - We have to rely on **plausible results** (not quantitative performance metrics) regarding the test dataset (see later).
#
# - Initially, we examine the performance of our classifier on the training dataset.
#
# - Usually, a classifier should perform very well on training data (by the very construction of the classifier), and we will examine this situation.
#
# - We obtain perfect classification! (as seen below)

# In[66]:

y_train_pred = classifier.predict(X_train)

# ## Performance evaluation on the training dataset using the `confusion matrix`

# In[67]:

print(confusion_matrix(y_train, y_train_pred))

# ### $\rightarrow$ Your turn!
#
# > #### Why are we guaranteed that the confusion matrix is **diagonal** in this case?
# > #### What assumption do we make to give such a guarantee?

# ## Evaluation on the training dataset according to a `classification_report`

# Check the documentation of [`classification_report`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html) from the `scikit-learn` library

# In[68]:

print(classification_report(y_train, y_train_pred))

# # Now, we try K >> 1

# In[69]:

K = 51
classifier = KNeighborsClassifier(n_neighbors=K)
classifier.fit(X_train, y_train)
y_train_pred = classifier.predict(X_train)

# ## Performance evaluation

# In[70]:

print(confusion_matrix(y_train, y_train_pred))

# In[71]:

print(classification_report(y_train, y_train_pred))

# ### $\rightarrow$ Your turn!
#
# > #### Do experiments on the training dataset with K-NN for different values of K, e.g. $K \in \{3, 5, 27, 51\}$, and check the corresponding confusion matrix (see the sketch below).
# > #### Why do we frequently get non-diagonal confusion matrices in this case?
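# A compact sketch for the experiment suggested above (a small addition; it assumes
# `X_train` and `y_train` from the cells above): loop over a few values of K and report
# the training accuracy, which equals 1.0 exactly when the confusion matrix is diagonal.

# In[ ]:

for K_try in [1, 3, 5, 27, 51]:
    clf = KNeighborsClassifier(n_neighbors=K_try)
    clf.fit(X_train, y_train)
    acc = clf.score(X_train, y_train)  # accuracy on the training data itself
    print('K = %2d: training accuracy = %.4f' % (K_try, acc))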
# # Prediction on the scaled `X_test` dataset
# ### (delineated by the head ROI mask)

# **Initially we let K=5 and use feature scaling ("zero mean, unit variance") across all classes**

# In[72]:

K = 5
classifier = KNeighborsClassifier(n_neighbors=K)
classifier.fit(X_train_scaled, y_train)
# classifier.fit(X_train, y_train)

# **Prediction on the scaled `X_test_scaled` dataset (delineated by the head ROI mask)**

# In[73]:

y_pred = classifier.predict(X_test_scaled)
# y_pred = classifier.predict(X_test)

# ## Prediction results

# In[74]:

print('Number of classified pixels:', len(y_pred))
print('\nThe first 10 predictions in ROI:', y_pred[:10])

# - **Make a Pandas dataframe for the prediction**
#
# - **Later we will transform named tissue types to a numerical encoding using a dictionary**

# In[75]:

df_y_pred = pd.DataFrame(y_pred, columns=['Class'])
print(df_y_pred.info())
df_y_pred.head(10).T

# In[76]:

df_y_pred.tail(10).T

# Construct a **dictionary** that defines a one-to-one map between tissue type *names* and tissue type classes as *numbers*

# In[77]:

cla_code = {
    'BCK': 0,  # White (background); color-coding according to cla_cmap defined above
    'AIR': 1,  # Magenta
    'GM':  2,  # Red
    'WM':  3,  # Cyan
    'CSF': 4,  # Blue
    'MUS': 5,  # Green
    'FAT': 6   # Yellow
}

df_y_pred['Class'] = df_y_pred['Class'].map(cla_code)
# Note: if the dictionary does not exhaustively map all entries,
# then non-matched entries are changed to NaNs

# ## We select the FLASH channel (high CNR) for the superposition of the KNN tissue type classification

# **Initially, we "lift" (i.e. add a constant to) the range of signal intensities in the FLASH channel to avoid mixing data values in FLASH with the overlay of tissue class numbers**

# In[78]:

# Adding max class number + 1 to the FLASH data to avoid mixing data values with predicted class numbers
cla_data = np.int16((data[:,:,0].copy().squeeze() + (df_y_pred.values.max() + 1)*np.ones(roimsk_data[:,:,0].shape)))
mx = cla_data.max()
mn = cla_data.min()
print('min - max =', mn, '-', mx)
# Scaling to range [0, 1]
cla_data = cla_data/mx
mx = cla_data.max().round(4)
mn = cla_data.min().round(4)
print('min - max =', mn, '-', mx)
cla_data.shape

# **Fill the `cla_data` matrix with the pixel-wise KNN predictions (scaled to the interval [0, 1])**

# In[79]:

mx = df_y_pred['Class'].values.max()  # Max class value

# **Insert the predicted class numbers in the corresponding pixel locations (row, col) in the head ROI mask, i.e. going from feature space to image space**
# In[80]:

cla_data[dfTest['row'].values, dfTest['col'].values] = df_y_pred['Class'].values / mx

# ## Inspect the tissue classification using both color-coding and gray-level coding

# In[81]:

fig, axes = plt.subplots(2,2, figsize=(15,15))
ax = axes.ravel()

ax[0].imshow(c_tmsk_data[:, :], cmap=c_cla_cmap, origin='lower')
ax[0].set_title('Color-coded corrected training mask')
ax[0].axis('on')

ax[1].imshow(data[:, :, 0].squeeze(), cmap='gray', origin='lower')
ax[1].set_title('FLASH')

ax[2].imshow(cla_data[:, :] / mx, cmap='gray', origin='lower')
ax[2].set_title('Tissue-class prediction (gray-scale)')

# Hadamard (element-wise) product np.multiply(a,b)
cla_within_roi = np.multiply(cla_data[:, :], roimask)  # / mx
u, c = np.unique(cla_within_roi, return_counts=True)
s = 0.5*(u[-1]-u[0])/(c_ncla+1)
c_cla_names_num = []
for i in range(c_ncla):
    lbl = c_cla_names[i] + ' [%d]: card=%d' % (i, c[i])
    c_cla_names_num.append(lbl)
cmsk = ax[3].imshow(cla_within_roi)
clim = cmsk.properties()['clim']
cax = ax[3].imshow(cla_within_roi, cmap=c_cla_cmap, origin='lower', clim=clim)
cbar = fig.colorbar(cax, shrink=0.4, label='Tissue classes [0-%d] and their cardinality' % (c_ncla-1))
tick_locs = np.linspace(u[0]+s, u[-1]-s, c_ncla)
cbar.set_ticks(tick_locs)
cbar.ax.set_yticklabels(c_cla_names_num)
ax[3].set_title('Tissue-class prediction (color-coded)')

plt.suptitle('KNN classification (K=%d) of the multispectral MRI slice within the ROI brain mask' % K, size=16)
plt.tight_layout()
plt.show()
fig.savefig('./results/04_KNN_(K=%d)_classification_results_on_flash.png' % K, transparent=False, dpi=300, bbox_inches="tight")

# ### $\rightarrow$ Your turn!
#
# > #### Discuss the K-NN classification results (e.g. plausibility? any likely misclassifications?)
# > #### Repeat the KNN classification experiments using the original signal intensities (no feature scaling)
# > #### Compare and discuss your findings with those obtained using feature scaling
# > #### How could you quantitatively assess the difference in predictions between *feature scaling* and *no feature scaling*? (One option is sketched after the visual comparison below.)

# # EXTRA: Tissue classification using the Random Forest classifier
#
# **where the class labels are mapped to numeric values**

# In[82]:

df_y_train = pd.DataFrame(y_train, columns=['Class'])
print(df_y_train.info())
df_y_train.head().T

# In[83]:

df_y_train['Class'] = df_y_train['Class'].map(cla_code)
y_train_num = df_y_train['Class'].values

# The **Random Forest** classifier:

# In[84]:

classifierRF = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)

# In[85]:

classifierRF.fit(X_train_scaled, y_train_num)

# In[86]:

y_rf_pred = classifierRF.predict(X_test_scaled)

# Fill a data matrix `cla_data_rf` with the pixel-wise RF predictions (scaled to the interval [0, 1])

# In[87]:

mx = df_y_pred['Class'].values.max()  # Max class value
cla_data_rf = cla_data.copy()
cla_data_rf[dfTest['row'].values, dfTest['col'].values] = y_rf_pred / mx

# # Visual comparison between the KNN and the RF predictions

# In[88]:

fig, axes = plt.subplots(1,2, figsize=(20,10))
ax = axes.ravel()

ax[0].imshow(np.multiply(cla_data[:, :], roimask) / mx, cmap=c_cla_cmap, origin='lower')
ax[0].set_title('Tissue-class KNN prediction (K=%d)' % K, fontsize=18)

ax[1].imshow(np.multiply(cla_data_rf[:, :], roimask) / mx, cmap=c_cla_cmap, origin='lower')
ax[1].set_title('Tissue-class RF prediction', fontsize=18)

plt.show()
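# As a simple quantitative complement to the visual comparison (a small addition; it
# assumes `df_y_pred` and `y_rf_pred` from the cells above): the fraction of ROI pixels
# on which the two classifiers agree, plus Cohen's kappa for chance-corrected agreement.
# The same approach can be used for the scaling vs. no-scaling comparison suggested earlier.

# In[ ]:

from sklearn.metrics import cohen_kappa_score

agreement = np.mean(df_y_pred['Class'].values == y_rf_pred)  # fraction of identically labeled pixels
kappa = cohen_kappa_score(df_y_pred['Class'].values, y_rf_pred)
print('KNN vs RF: pixel-wise agreement = %.3f, Cohen kappa = %.3f' % (agreement, kappa))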
# ### $\rightarrow$ Your turn!
#
# > #### Which classifier, KNN or RF, seems to do the best job (giving the most plausible results)?
# > #### The RF classifier has several hyperparameters. Could you tune these to improve performance (visually assessed)?

# ## Evaluation of results - Confusion matrix between KNN and RF classification

# **i.e. y_pred_knn ("true") versus y_pred_rf**

# In[89]:

from utils import plot_confusion_matrix, plot_confusion_matrix_with_colorbar

# In[90]:

y_pred_knn = df_y_pred['Class'].values

# In[91]:

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_pred_knn, y_rf_pred)
print(cm)

# In[92]:

plot_confusion_matrix_with_colorbar(cm, classes=class_names,
                                    title='Confusion matrix - Predicted=RF, "True"=KNN (K=%d)' % K,
                                    figsize=(10,10))
plt.ylabel('KNN label ("True")')
plt.xlabel('RF label (Predicted)')
plt.show()

# Or, using Seaborn's `heatmap`:

# In[93]:

import seaborn as sn

df_cm = pd.DataFrame(cm, index=[i for i in class_names], columns=[i for i in class_names])
plt.figure(figsize=(12,10))
ax = sn.heatmap(df_cm, annot=True, annot_kws={"size": 16}, fmt='d')
sn.set(font_scale=2)  # for label size
cbar = ax.collections[0].colorbar
cbar.ax.tick_params(labelsize=14)
plt.show()

# ### $\rightarrow$ Your turn!
#
# > #### Could you suggest (or try out) any other classification methods for this task and the given data (see the sketch below)?

# HINT: Check https://scikit-learn.org/stable/supervised_learning.html#supervised-learning
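# Following the hint above, one possibility (a hedged sketch, not part of the original
# notebook) is a linear support vector machine from the same scikit-learn collection;
# any estimator from the linked page could be swapped in. It assumes `X_train_scaled`,
# `y_train` and `X_test_scaled` from the cells above.

# In[ ]:

from sklearn.svm import LinearSVC

clf_svm = LinearSVC(max_iter=10000)  # a linear SVM; it benefits from the feature scaling done above
clf_svm.fit(X_train_scaled, y_train)
y_svm_pred = clf_svm.predict(X_test_scaled)
print('First 10 SVM predictions in the ROI:', y_svm_pred[:10])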