This notebook was made as part of the Health and Bioscience IDEAS training programme, funded by the UKRI Innovation Scholars Data Science Training in Health and Bioscience. Please visit our website for more information.
This notebook will give a brief example of how to correct, or harmonize for site-related differences in image biomarkers. This problem is cropping up more and more often as we need multi-site data to have large enough sample sizes for many research applications, particularly for those that are based on machine learning and need a large enough training set to produce an accurate model. For more background on this, please see the accompanying documenation.
import os
import nest_asyncio
nest_asyncio.apply()
import numpy as np
import pandas as pd
import datalad.api as dl
As in the previous notebook, we will grab the data for the lesson using datalad. This time we will ne using a spreadsheet that has all the imgaging volumes for the study and read it into memory. This data is a limited dataset from GENFI a multi-site study of genetic frontotemporal dementia.
dl_source='https://github.com/HealthBioscienceIDEAS/demon-imaging-data.git'
sample=dl.clone(dl_source,path='/tmp/sample',description='Cloned sample dataset for import')
sample.update(merge=True)
sample.siblings(action='enable',name='demons-data')
in_file=sample.get('./GENFI_DEMON_SPREADSHEET.xlsx')
[INFO] Fetching updates for Dataset(/tmp/sample)
.: demons-data(?) [git]
For the pusporses of some of the regression methods, I am making some of the variables categorical in nature.
df_xsec=pd.read_excel(in_file[0]['path'])
df_xsec['Sex']=pd.Categorical(df_xsec['Sex'],categories=[0,1])
df_xsec['Sex']=df_xsec['Sex'].cat.rename_categories(['Female','Male'])
df_xsec['Site'] = pd.Categorical(df_xsec['Site'],categories=np.arange(23))
df_xsec['Group']=pd.Categorical(df_xsec['Group'],categories=[0,1,2])
df_xsec['Group']=df_xsec['Group'].cat.rename_categories(['Non-carrier','Presymptomatic','Symptomatic'])
Next I will produce a quick crosstab to show the number of individuals scanned at each site and with each scanner. Here are some of the key points for this table:
pd.crosstab(df_xsec['Site'],df_xsec['Scanner'])
Scanner | GE 1.5T | GE 3T | Philips 3T | Siemens 1.5T | Siemens Prisma 3T | Siemens Skyra 3T | Siemens Trio 3T |
---|---|---|---|---|---|---|---|
Site | |||||||
0 | 0 | 0 | 0 | 0 | 0 | 8 | 20 |
1 | 0 | 0 | 45 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 3 | 0 | 0 |
3 | 0 | 0 | 143 | 0 | 0 | 0 | 0 |
4 | 0 | 0 | 3 | 0 | 0 | 0 | 0 |
5 | 0 | 25 | 0 | 0 | 6 | 0 | 0 |
6 | 0 | 0 | 18 | 0 | 0 | 0 | 0 |
7 | 0 | 0 | 0 | 0 | 10 | 0 | 12 |
8 | 0 | 0 | 0 | 33 | 0 | 36 | 0 |
9 | 8 | 0 | 1 | 0 | 0 | 0 | 0 |
10 | 0 | 0 | 0 | 5 | 0 | 0 | 0 |
11 | 0 | 0 | 0 | 0 | 0 | 0 | 35 |
12 | 0 | 0 | 0 | 0 | 0 | 23 | 0 |
13 | 0 | 0 | 0 | 0 | 0 | 0 | 30 |
14 | 0 | 0 | 0 | 0 | 2 | 0 | 0 |
15 | 0 | 0 | 0 | 0 | 0 | 10 | 0 |
16 | 0 | 0 | 0 | 0 | 0 | 0 | 28 |
17 | 0 | 0 | 29 | 0 | 0 | 0 | 0 |
18 | 0 | 0 | 0 | 0 | 0 | 0 | 14 |
19 | 0 | 0 | 0 | 0 | 0 | 0 | 8 |
20 | 0 | 0 | 0 | 0 | 0 | 0 | 16 |
21 | 0 | 0 | 0 | 0 | 14 | 0 | 37 |
22 | 0 | 2 | 0 | 0 | 0 | 0 | 0 |
The participants in the dataset fall into thre main clinical classifications (variable name Group). We want to examine what the between-group differences are for the volume of the insula, a key brain structure known to be affected early in frontotemporal dementia. But we do any statstical analysis, let;s get a few summary statistics, including the sample size and mean(standard deviation), min and max of three variables:
df_xsec.groupby("Group").agg(
{
"ID": ["count"],
"Age": ["mean", "std", "min", "max" ],
"EYO": ["mean", "std", "min", "max" ],
"TIV": ["mean", "std", "min", "max" ]
}
).style.format(precision=1)
ID | Age | EYO | TIV | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | mean | std | min | max | mean | std | min | max | mean | std | min | max | |
Group | |||||||||||||
Non-carrier | 251 | 47.0 | 13.7 | 18.6 | 85.7 | -12.8 | 14.3 | -50.0 | 27.6 | 1420.5 | 142.5 | 1100.7 | 2051.7 |
Presymptomatic | 266 | 44.8 | 11.9 | 20.1 | 75.5 | -13.8 | 11.5 | -47.4 | 15.7 | 1412.2 | 137.7 | 1074.7 | 1811.5 |
Symptomatic | 107 | 63.0 | 8.2 | 32.9 | 78.7 | 3.4 | 6.9 | -26.1 | 21.7 | 1442.3 | 157.8 | 1129.5 | 1848.9 |
Some things to note from this table:
One more thing to check, the breakdown of sex between groups.
pd.crosstab(df_xsec["Group"],df_xsec["Sex"],normalize='index').style.format(precision=3)
Sex | Female | Male |
---|---|---|
Group | ||
Non-carrier | 0.574 | 0.426 |
Presymptomatic | 0.632 | 0.368 |
Symptomatic | 0.430 | 0.570 |
The non-carrier and presymptomatic carriers have pretty much the same proportion of females and males, while the symptomaitc group have more males (Note: test not performed!)
The first analysis we will do is a very simple linear regression where site is not accounted for. We will, however, include sex, age, and TIV as covariates to deteminre what differences there are between groups when adjusting for these variables.
import statsmodels.api as stats
import statsmodels.formula.api as statsfx
If you are familiar with R, the regression equations in statsmodels are setup in a similar fashion. So the first model will take into account age, sex, and TIV but not site. We will center age and TIV so that the resulting intercept from the regression is more interpetable.
This regression corresponds to the following formula:
$y_i = \beta_0 + \beta_1x_{i,Age} + \beta_2x_{i,Male} + \beta_3x_{i,TIV} + \epsilon_i$
df_xsec=df_xsec.dropna(subset=["Group","Age","Sex","TIV"])
df_xsec["Age_centered"]=df_xsec['Age']-df_xsec['Age'].mean()
df_xsec["TIV_centered"]=df_xsec['TIV']-df_xsec['TIV'].mean()
results= statsfx.ols('Insula_volume ~ Group + Age_centered + Sex + TIV_centered',data=df_xsec).fit()
print(results.summary())
OLS Regression Results ============================================================================== Dep. Variable: Insula_volume R-squared: 0.711 Model: OLS Adj. R-squared: 0.709 Method: Least Squares F-statistic: 303.5 Date: Wed, 15 Sep 2021 Prob (F-statistic): 1.72e-163 Time: 10:31:31 Log-Likelihood: -5109.0 No. Observations: 622 AIC: 1.023e+04 Df Residuals: 616 BIC: 1.026e+04 Df Model: 5 Covariance Type: nonrobust =========================================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------------------- Intercept 1.073e+04 69.720 153.853 0.000 1.06e+04 1.09e+04 Group[T.Presymptomatic] -223.4410 79.535 -2.809 0.005 -379.635 -67.247 Group[T.Symptomatic] -2147.2374 114.987 -18.674 0.000 -2373.052 -1921.423 Sex[T.Male] -100.2846 95.156 -1.054 0.292 -287.154 86.584 Age_centered -40.6126 2.975 -13.652 0.000 -46.455 -34.770 TIV_centered 6.1774 0.326 18.977 0.000 5.538 6.817 ============================================================================== Omnibus: 3.180 Durbin-Watson: 2.069 Prob(Omnibus): 0.204 Jarque-Bera (JB): 3.453 Skew: 0.055 Prob(JB): 0.178 Kurtosis: 3.348 Cond. No. 501. ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Let's look at the results from the regression output:
Before we go on, let's take a look and see if we can spot differences by site, even though it is not yet in the model. To do that, we will plot the residuals from the regression based on the lienar regression, and group them by site.
# Get residuals from model
df_xsec['residuals'] = results.resid
# Get mean of residuals by site
site_resid = df_xsec.groupby('Site').mean()['residuals']
#sort the residuals from smallest to biggest
site_rank = site_resid.rank().astype('int')
df_xsec.boxplot('residuals',by=['Site'],positions=site_rank.values.tolist(),
showmeans=True,figsize=(12,6))
df_nc=df_xsec.query('Group =="Non-carrier"')
df_nc.boxplot('residuals',by=['Site'],positions=site_rank.values.tolist(),
showmeans=True,figsize=(12,6))
df_pmc=df_xsec.query('Group =="Presymptomatic"')
df_pmc.boxplot('residuals',by=['Site'],positions=site_rank.values.tolist(),
showmeans=True,figsize=(12,6))
df_smc=df_xsec.query('Group =="Symptomatic"')
df_smc.boxplot('residuals',by=['Site'],positions=site_rank.values.tolist(),
showmeans=True,figsize=(12,6))
In general, we can see that the predicted values not only have different mean values (green triangles), but different amounts of variability as well. There are also some sites that have very low samples, making the abiity to identify their influence on the insual volume difficult to determine.
Are there differences in site or scanner that are occuring in this data? A common approach has been to include a mutl-level factor variable for site or scanner intot he lineear regression. Let's run the same analysis again including the site variable as a covariate. The regression now changes to:
$y_i = \beta_0 + \beta_1x_{i,Age} + \beta_2x_{i,Male} + \beta_3x_{i,TIV} + \beta_4x_{i,Site_1} + \beta5x_{i,Site_2} + \ldots + \beta_{n+3}x_{i,Site_n} + \epsilon_i$
# Perform the regression
site_results= statsfx.ols('Insula_volume ~ Group + Age_centered + Sex + TIV_centered + Site',data=df_xsec).fit()
# Print the output summary table
print(site_results.summary())
# Show the ANOVA test for the different covariates
table=stats.stats.anova_lm(site_results, typ=2)
print(table)
# Show a likelihood ratio test comparing this with the base model above (no Site covariate)
lr_test=site_results.compare_lr_test(results)
print('P-value for likelihood ratio test with original mode: ', np.format_float_positional(lr_test[1],precision=6))
Overall the estimates change only slightly:
However, as you can see from the model output that all is not well. Some of this may arise due to the small sample size in some of the factors making ti difficult to get a reliabl estimate for these sites.
It could be that this the difference is primarily which scanner was used, not which site it was acquired in. In fact some sites have used more than one scanner. Soo we could run this again with scanner as the factor variabke rather than site. The primary scanner that the protocol was developed on was a Siemens Trio 3T, so let's use that as our baseline level to see how the other scanners compare.
df_xsec['Scanner_cat'] = pd.Categorical(df_xsec['Scanner'],
categories=['Siemens Trio 3T',
'Siemens Prisma 3T',
'Siemens Skyra 3T',
'Siemens 1.5T',
'Philips 3T',
'GE 1.5T',
'GE 3T'])
# Perform the new regression model
scanner_results= statsfx.ols('Insula_volume ~ Group + Age_centered + Sex + TIV_centered + Scanner_cat',data=df_xsec).fit()
# Print the output summary table
print(scanner_results.summary())
# Show the ANOVA test for the different covariates
table=stats.stats.anova_lm(scanner_results, typ=2)
print(table)
# Show a likelihood ratio test comparing this with the base model above (no Site covariate)
lr_test=scanner_results.compare_lr_test(results)
print('P-value for likelihood ratio test with original mode: ', np.format_float_positional(lr_test[1],precision=6))
The model fits are overall pretty similar:
Notce the warning message that comes up in both regression equations when site or scanner is included. Given the large imbalance, small sample size, and potential heteroskedasticity between levels of this value, this is probably not an ideal, or even appropriate way to deal with the multi-site data. So where now?
Now that you have seen how to do the simple regression methods above. Try them with another one of the volumes. I'll put the names below so that you can choose one have have a go.
list(df_xsec.columns)[10:45]
And in the empty cell below, you can try various regressions using any of the regression models mentioend.
A more recent, advanced approach to handle multi-site data is the ComBat method, originally proposed for genomic analysis and now applied quite often to imaging analyses. ComBat has a lot of advantages, as it was designed to help both with the additive bias AND the multiplicative bias (heteroskedasticity). In addition, it has far better suited to handle uneven data between sitesa dn samll sample sizes for those sites. So let's take a look at This is really helpful, but what if there are difference in standard error between sites, and how can we trust those small sample size in some cells. ComBat provides "harmonised" values out of the analyses that have accounted for the batch variable (here site) so that you can perform your analysis on them. There are many implementations of ComBat around for R and for Python. I am using neuroCombat for Python, but there is also other implementations/extensions, such as neuroHarmonize.
ComBat prefers lots of different features, so unlike the simple linear regression example, we will include lots of different volumes here, from the right accumbens (column 10) to the Insula volume (column 32)
from neuroCombat import neuroCombat
covars=df_xsec.filter(["Age_centered","TIV_centered"])
covars["Group"]=df_xsec['Group'].cat.codes
covars["Sex"]=df_xsec['Sex'].cat.codes
covars["Site"]=df_xsec['Site'].cat.codes
covars=covars[["Group","Age_centered","Sex","TIV_centered","Site"]]
categorical_cols=["Group","Sex"]
batch_col="Site"
data=df_xsec.iloc[:,10:33].to_numpy(copy=True).transpose()
data_combat = neuroCombat(dat=data,
covars=covars,
batch_col=batch_col,
categorical_cols=categorical_cols)
So now we have harmonised data, let's run again the results to see what infomration we now get for these volumes. First I'll grab the harmoinsed insula volume and add it to the data frame.
df_xsec['Harmonised_Insula'] = data_combat["data"][22,:].transpose()
Then I'll perform a similar regression again, this time with no site variable (as it should be harmonised by ComBat!)
# Perform linear regression on harmonised values.
harmonised_results= statsfx.ols('Harmonised_Insula ~ Group + Age_centered + Sex + TIV_centered',data=df_xsec).fit()
#Print Summary of regression.
print(harmonised_results.summary())
If anything, the models are a touch stronger:
What happens if we were to include Site as a covariate now?
harmonised_site_results= statsfx.ols('Harmonised_Insula ~ Group + Age_centered + Sex + TIV_centered + Site',data=df_xsec).fit()
print(harmonised_site_results.summary())
#ANOVA test
table=stats.stats.anova_lm(harmonised_site_results, typ=2)
print(table)
# Show a likelihood ratio test comparing this with the base model above (no Site covariate)
lr_test=harmonised_site_results.compare_lr_test(harmonised_results)
print('P-value for likelihood ratio test with original mode: ', np.format_float_positional(lr_test[1],precision=6))
Notice now that there is NOT statsitcal evidence in the ANOVA test, nor in the likelihood ratio test. With harmonised values, site no longer adds to the data, as would be expected. Below are the residuals plotted agian, the left being the original plot and the right being residuals from the harmoinsed regression. Notices how the residuals are far more centered around zero.
df_xsec['harmonised_residuals']=harmonised_results.resid
df_xsec.boxplot(['residuals','harmonised_residuals'],by=['Site'],positions=site_rank,figsize=(12,6))
Now that you can see how neuroCombat runs, grab one of the harmonised columns (perhaps one of the variables you tried in the regression above, and run the same regression with the haronised verion. Are the results similar to the insula?
Hopefully, this demonstration showed the effects of site on statistical analysis and how ComBat is a worthwhile appproach to pursue when you need to correct for these batch effects.