Analysis of cohort perturbation experiments including nested group designs - statistical models to multicell correlation networks • scglmmr

Pseudobulk mixed effects model workflow

These functions implement wrappers dream from the variancePartition package to enable mixed (i.e. varying) effects models of gene expression. This approach is described in Hoffman et al Bininformatics 2020 this paper should be cited if using the wrapper below FitDream.

We then run enrichment testing with fgsea to define and use network methods to understand relationships between different

These models: 1. Allow estimation of treatment effects and treatment differences across groups in multi-subject perturbation experiments 1. Adjust for non-independence of repeated measures from the same donors with random effects
2. Utilize observational level weights to enable normal methods. 3. Shrink estimates toward genome wide trends with empirical Bayes methods tailored to lme4 based models.

The experiment design shown below includes multiple subjects (subjectid), each with multiple measurements (sample) corresponding to a pre and post treatment timepoint. Additionally subjects are nested into different pre-defined response groups (e.g. high and low responder). Below the workflow will describe defining pre-treatment baseline effect differences, treatment effects across all subjects and the difference in treatment effects between the groups including covariate adjustment.

sample	subjectid	time	group	sex
101_t0	101	d0	low	F
101_t1	101	d1	high	F
102_t0	102	d0	low	M
102_t1	102	d1	high	M
… (x n donors)	… (x n donors)	… (x n donors)	… (x n donors)	… (x n donors)

Getting started - setup, data loading, and quaity control

In the workflow below, mixed effect model fits and gene set enrichment steps are parallelized with the BiocParallel package.

#devtools::install_github(repo = "https://github.com/MattPM/scglmmr")
suppressMessages(library(scglmmr))
suppressMessages(library(magrittr))
suppressMessages(library(BiocParallel))
#the mixed model fits and gsea are parallelized 
BiocParallel::register(BiocParallel::SnowParam(4))
pparam = BiocParallel::SnowParam(workers = 4, type = "SOCK", progressbar = TRUE)

Load single cell data and quality control


datapath = "mypath/"

# load seurat or sce object etc. 
s = readRDS("path/seuratobject.rds")

# define counts and metadata and subset to cells above rm seurat object from workspace 
meta = s@meta.data
umi = s@assays$RNA@counts
rm(s); gc()

# QC contingency of cells by subject for each celltype 
tab = scglmmr::SubjectCelltypeTable(metadata = meta, celltype_column = "celltype", sample_column = "sample")
tab$celltypes_remove; tab$`low representation celltypes`; tab$table

# remove cells prior to pseudobulk analysis with 0 representation in some donors. 
meta = meta[!meta$celltype %in% tab$celltypes_remove, ]

# subset data 
umi = umi[ ,rownames(meta)]

# Create aggregated pseudobulk data
pb = scglmmr::PseudobulkList(
  rawcounts = umi,
  metadata = meta, 
  sample_col = "sample",
  celltype_col = "celltype",
  avg_or_sum = "sum"
  )


# Create aggretated sample level metadata
met = scglmmr::AggregateCellMetadata(
  cell.metadata = meta, 
  sample_column = 'sample', 
  variable_columns = c('subjectid', 'timepoint', 'response', 'age', 'sex'),
  pseudobulk.List = pb
  )

Now we create a combined grouping factor indicating both timepoint and response group for each sample which will help define any contrast of effects with respect to time and group, e.g. differences in treatment effects.

Note in a one way design where the target estimand is for example the treatment effect only, this step is not needed as we can simply extract the effect of “time” from the lme4 fits.


met$group.time = paste(met$group, met$timepoint, sep = '_')

# make sure this is a factor and the levels are in a coherent order for the 
# contrast matrix -- see below. here: 
# time 0 = 0, time 1 = 1. low group = 0 high group = 1. 
met$group.time = factor(
  met$group.time,
  levels = c("1_0", "1_1", "0_0", "0_1")
  )

# now filter genes within each cell type that are reasonably expressed. 
design = model.matrix( ~ 0 + met$group.time)
dge = Normalize(pseudobulk.list = pb, design = design, minimum.gene.count = 5)

Fit models and specify a priori contrasts corresponding to the desired effect comparisons

Below shows 2 steps
1: Specify a varying intercept model using lme4 symbolic model formula. The 1|subjectID term treats individual as a random effect to adjust for the non independence in expression between the same individual by estimating baseline expression variation with a normal distribution for each gene in the genome.

In the example below, we want estimate baseline differences, treatment effects across all individuals and the difference in fold changes between groups adjusting estimates for age and sex (and modeling the individual variation with a varying effect). To do this we specify a custom a. priori contrast matrix.

2: Using the factor variable combining group and time, we create contrasts over the levels to define effects of interest. The function makeContrastsDream is used for this purpose which is similar to the limma function makeContrasts. More simple contrasts or more complex contrasts are also possible. This step is critical for deriving the correct effect size estimates. More information on specifying contrast matrices is available here: A guide to creating design matrices for gene expression experiments

# specify a random intercept model with lme4 syntax.
f1 <- ~ 0 + age + sex + group.time + (1|subjectid) 

# make contrast matrix 
L2 = makeContrastsDream(
  formula = f1, 
  data = samplemd,
  contrasts = c(
    baseline = "group.time1_0 - group.time0_0", # note fixef model also fit for this single time point contrast 
    treatment_delta = "( group.time1_1 - group.time1_0 ) - ( group.time0_1 - group.time0_0 )",
    treatment = "( group.time1_1 + group.time0_1 ) / 2 - ( group.time1_0 + group.time0_0 ) / 2 "
    )
  )

# visualize the contrast matrix, compare to levels of group.time variable
# to check for correctly specified effects. 
plotContrasts(L2)

Fit the models with variancePartition::dream which uses voom weights in lme4 mixed effects models with an empirical bayes shrinkage toward genome wide trend accounting for per gene degrees of freedom.


fit1 = FitDream(pb.list = dge, 
                sample.metadata = met, 
                lme4.formula = f1,
                dream.contrast.matrix = L2,
                ncores = 4)

Note, variancePartition::dream now incorporates an emperical Bayes step derived for models fit with lme4.

Downstream gene set enrichment analysis within celltypes for different effects

Run gene set enrichment analysis within each cell type based on genes ranked by effect size for each of the effects defined above.

Here within each cell type and for each contrast we run gene set enrichment analysis using 2 functions ExtractResult and FgseaList. ExtractResult is used to extract a list of gene ranks (a wrapper around dream::topTable and limma::topTable to return a full list of results). Since we had multiple covariates in a mixed model and we used a custom contrast matrix, we specify the covariate of interest from the contrast using arguments coefficient.number and coef.name which are output in results based on the names of the contrasts.

Based on our contrast matrix (the object L2 we created above with makeContrastsDream), coefficient.number = 1 corresponds to the baseline difference between groups, coefficient.number = 2 is the difference in fold changes between the groups and coefficient.number = 3 is the pre vs post perturbation difference across subjects in both groups. Estimates for the covariates are also availabe.

Examples of extracting gene ranks based on contrasts and running gene set enrichment

# msigDB hallmark pathways are included in the scglmmr package
# can load any custom gene set 
hlmk = scglmmr::hallmark


# extract genes ranked by treatment effect (across all donors) 
rtreat = ExtractResult(model.fit.list = fit1,
                       what = 'lmer.z.ranks',
                       coefficient.number = 3,
                       coef.name = 'treatment')
# run fgsea 
hlmk.treat = FgseaList(rank.list.celltype = rtreat,
                       pathways = hlmk,
                       BPPARAM = pparam)



# extract gene ranks of treatment effect (across all donors) 
rdelta = ExtractResult(model.fit.list = fit1,
                       what = 'lmer.z.ranks',
                       coefficient.number = 2,
                       coef.name = 'treatment_delta')
hlmk.delta = RunFgseaOnRankList(rank.list.celltype = rdelta,
                                pathways = hlmk,
                                BPPARAM = pparam)

On a technical note, it’s also possible to fit a simple model for the baseline contrast not estimating the variation across subjects with a random effect as this contrast is more typically estimated with a standard group 1 vs 2 least squares approach. The function RunVoomLimma is provided to fit these models. If you compare the effect size estimates for an individual cell type for the coefficient.number = 1 from the mixed model fits (object fit1 above) to the models below they are highly correlated along the diagonal with some differences arising to the different degrees of freedom, sample size and variance coming from the additional layer of variation estimated by the mixed model. The choice of model to use is up to the user.

# fit simple linear model for the baseline group level contrast 
design.2 = model.matrix(~0 + age + sex + group.time, data = met)
fit0 = scglmmr::RunVoomLimma(dgelists = dge, 
                           design_matrix = design.2, 
                           do_contrast_fit = T,
                           # we use only the first row of the contrast matrix L2
                           my_contrast_matrix = L2[ ,1])


# extract fixed efect model estimate of baseline contrast (pre treatment differences between groups)
r0 = ExtractResult(model.fit.list = fit0,
                   what = 'gene.t.ranks',
                   coefficient.number = 1,
                   coef.name = 'baseline')
## run gene set enrichment 
hlmk.0 = FgseaList(rank.list.celltype = r0,
                   pathways = hlmk,
                   BPPARAM = pparam)

Further analysis and curation of enrichment results

# full set of leading edge genes indexed by celltype x effect x module 
lefull = scglmmr::GetLeadingEdgeFull(gsea.list = hlmk.treat,
                                     padj.filter = 0.02, 
                                     NES.filter = -Inf)

# extract model fit results instead of ranks
# automatically this sets argument `result` to 'statistics'
fit1.res = scglmmr::ExtractResult(model.fit.list = fit1 , 
                                  coefficient.number = 3, 
                                  coef.name = 'treatment')

# combine GSEA results with model coefficient for each gene in leading edge 
# include all leading edge genes irrespective of individual gene p value 
cr = scglmmr::CombineResults(gsealist = hlmk.treat, 
                             contrastlist = fit1.res, 
                             gseafdr = 0.02, 
                             genefdr = 1)

Extract all leading edge genes indexed by cell type

This outputs a nested list of leading edge genes from each enrichment across cell types. list level 1 is cell type, level 2 is enrichment signal.

li = scglmmr::LeadingEdgeIndexed(gsea.result.list = hlmk.treat, padj.threshold = 0.02)

Calculate the Jaccard similarity of the leading edge genes for enrichments within a given cell type and effect

This is useful for understanding which enrichment signals may come from the same vs distinct genes.

# figpath.temp = here('figures')
treat.JI = EnrichmentJaccard(gsealist = hlmk.treat, 
                          indexedgenes = li, 
                          #saveplot = TRUE, 
                          #figpath = figpath.temp,
                          returnJaccardMtx = TRUE)

# curate results 
results.sorted = treat.JI$sortedgsea %>%
  dplyr::mutate(signal = paste(celltype, pathway, sep = '~'))
  
# data.table::fwrite(results.sorted, file = paste0(datapath, 'g0.result.sort.txt'),sep = "\t")

Network analysis of relationships between cell types based on correlation and shared information.

See separate networks vignette (coming soon).

Example usage provided below.

For details on implementation see methods section of https://www.medrxiv.org/content/10.1101/2023.03.20.23287474v1


index = LeadingEdgeIndexed(gsea.result.list = glist,padj.threshold = 0.05)
jmat = Jmatlist(index)
sig = Signaturescores(feature.list = gexp, indexedgenes = index)
sli.list = SLImatrix(sig.scores = sig, jmat.list = jmat, correlation.type = 'spearman')

Visualization of enrichment results

Create a bubble plot heatmap of enrichment results within clusters

# NES_filter to -Inf to plot positive and negative enrichment
p = PlotFgsea(gsea_result_list = hlmk.treat, NES_filter = -Inf,padj_filter = 0.05)

Create heatmap of gene perturbation fold changes across cell subsets based on model fit coefficients

Extract and make a heatmap of log fold changes across celltypes. HeatmapDiag uses the package slanter which accepts the same arguments as pheatmap.

gene.mat = GetGeneMatrix(result.list = fit1, 
                         pvalfilter = 0.05, 
                         stat_for_matrix = 'logFC',
                         logfcfilter = 0.1)

#make heatmap  e.g. 
pheatmap::pheatmap(gene.mat, fontsize_row = 5)

# diagnoalize the heatmap so that the most correlated signals are grouped 
# this uses the package slanter 
HeatmapDiag(matrix = gene.mat, fontsize_row = 5)

Create heatmap of gene perturbation fold changes from enrichments across individuals within a cell subset

Visualize the log counts per million at the sample level of the gene set enrichment results.


# make tidy average data for visualization of weighted pb results 
lcmp = lapply(pb, edgeR::cpm, log = TRUE )

# 
le_expr = scglmmr::LeadEdgeTidySampleExprs(av.exprs.list = lcmp,
                                           gsea.list = hlmk.treat, 
                                           padj.filter = 0.1,
                                           NES.filter = -Inf)


# example plot of sample level average leading edge genes annotated 
scglmmr::LeadEdgeSampleHeatmap(tidy.exprs.list = le_expr,
                               modulename = "HALLMARK_HYPOXIA",
                               elltype_plot = "CD4_NaiveTcell",
                               metadata = meta, 
                               metadata_annotate = c('group', 'timepoint', 'age', 'sex'),
                               sample_column = 'sample',
                               returnmat = FALSE, 
                               savepath = figpath, 
                               savename = "filename")

# see also: 
# scglmmr::TopGenesTidySampleExprs() - same as aobve for 'top' de genes estimated by model coeffcient / p value. 
# scglmmr::GetTidySummary() - for custom plotting


# sample level vsualization of average data above 
# repeat this for all enriched pathways 
heatpath = here("sime/path"); dir.create(heatpath)
for (i in 1:length(le_expr)) {
  cdat = le_expr[[i]]
  ctype = names(le_expr[i])
  umod = unique(cdat$module)
  for (u in 1:length(umod)) {
    scglmmr::LeadEdgeSampleHeatmap(tidy.exprs.list = le_expr, 
                          modulename = umod[u], 
                          celltype_plot = ctype,
                          metadata = meta, metadata_annotate = c('group', 'timepoint', 'age', 'gender'),
                          sample_column = 'sample',
                          returnmat = F, 
                          savepath = heatpath,
                          savename = paste0(ctype, " ",umod[u],'.pdf'))
  }
}

You can also create a customized map by returning the matrix from LeadEdgeSampleHeatmap.


# scglmmr function to extract leading edge genes 
lexp1 = LeadEdgeTidySampleExprs(av.exprs.list = lcpm, gsea.list = g1c, padj.filter = 0.05, NES.filter = 0)

# annotate time and batch on the heatmap
heatmap_anno = meta[, c('batch', 'timepoint')]
anno_color = list(
  timepoint = c('1' = "orange", '2' = 'red',  "0" = "white"),
  batch = c('1' = "black", '2' = "white")
)

# define your own custom color vector for the log fold change values
cu = c("#053061", "#1E61A5", "#3C8ABE", "#7CB7D6", "#BAD9E9", "#E5EEF3", 
       "#F9EAE1", "#F9C7AD", "#EB9273", "#CF5246", "#AB1529", "#67001F")

# scglmmr function for leading edge gene matrix across donors
mat2 = scglmmr::LeadEdgeSampleHeatmap(tidy.exprs.list = le_expr, 
                                 modulename = "reactome interferon signaling",
                                 celltype_plot = 'CD14_Mono',
                                 # metadata = meta, # unused  
                                 # metadata_annotate = c('batch'), # unused 
                                 sample_column = 'sample',
                                 returnmat = TRUE)
# draw heatmap 
pheatmap::pheatmap(mat2, 
                   border_color = NA,
                   treeheight_row = 0, treeheight_col = 10,
                   annotation = heatmap_anno,
                   annotation_colors = anno_color,
                   color = cu,
                   width = 5,  height = 7.6,
                   # this can be set to false to look at raw expression e.g. 
                   scale = "row",
                   filename = paste0(figpath, "mono_d1_ReactomeIFN.pdf")
                   )

sessionInfo()