Analysis of cohort perturbation experiments including nested group designs - statistical models to multicell correlation networks
Source:vignettes/pseudobulk_mixed_effects.rmd
pseudobulk_mixed_effects.rmd
Pseudobulk mixed effects model workflow
These functions implement wrappers dream
from the
variancePartition
package to enable mixed (i.e. varying)
effects models of gene expression. This approach is described in Hoffman et al
Bininformatics 2020 this paper should be cited if using the
wrapper below FitDream
.
We then run enrichment testing with fgsea
to define and
use network methods to understand relationships between different
These models: 1. Allow estimation of treatment effects and treatment
differences across groups in multi-subject perturbation experiments 1.
Adjust for non-independence of repeated measures from the same donors
with random effects
2. Utilize observational level weights to enable normal methods. 3.
Shrink estimates toward genome wide trends with empirical Bayes methods
tailored to lme4 based models.
The experiment design shown below includes multiple subjects (subjectid), each with multiple measurements (sample) corresponding to a pre and post treatment timepoint. Additionally subjects are nested into different pre-defined response groups (e.g. high and low responder). Below the workflow will describe defining pre-treatment baseline effect differences, treatment effects across all subjects and the difference in treatment effects between the groups including covariate adjustment.
sample | subjectid | time | group | sex |
---|---|---|---|---|
101_t0 | 101 | d0 | low | F |
101_t1 | 101 | d1 | high | F |
102_t0 | 102 | d0 | low | M |
102_t1 | 102 | d1 | high | M |
… (x n donors) | … (x n donors) | … (x n donors) | … (x n donors) | … (x n donors) |
Getting started - setup, data loading, and quaity control
In the workflow below, mixed effect model fits and gene set enrichment steps are parallelized with the BiocParallel package.
#devtools::install_github(repo = "https://github.com/MattPM/scglmmr")
suppressMessages(library(scglmmr))
suppressMessages(library(magrittr))
suppressMessages(library(BiocParallel))
#the mixed model fits and gsea are parallelized
BiocParallel::register(BiocParallel::SnowParam(4))
pparam = BiocParallel::SnowParam(workers = 4, type = "SOCK", progressbar = TRUE)
Load single cell data and quality control
datapath = "mypath/"
# load seurat or sce object etc.
s = readRDS("path/seuratobject.rds")
# define counts and metadata and subset to cells above rm seurat object from workspace
meta = s@meta.data
umi = s@assays$RNA@counts
rm(s); gc()
# QC contingency of cells by subject for each celltype
tab = scglmmr::SubjectCelltypeTable(metadata = meta, celltype_column = "celltype", sample_column = "sample")
tab$celltypes_remove; tab$`low representation celltypes`; tab$table
# remove cells prior to pseudobulk analysis with 0 representation in some donors.
meta = meta[!meta$celltype %in% tab$celltypes_remove, ]
# subset data
umi = umi[ ,rownames(meta)]
# Create aggregated pseudobulk data
pb = scglmmr::PseudobulkList(
rawcounts = umi,
metadata = meta,
sample_col = "sample",
celltype_col = "celltype",
avg_or_sum = "sum"
)
# Create aggretated sample level metadata
met = scglmmr::AggregateCellMetadata(
cell.metadata = meta,
sample_column = 'sample',
variable_columns = c('subjectid', 'timepoint', 'response', 'age', 'sex'),
pseudobulk.List = pb
)
Now we create a combined grouping factor indicating both timepoint and response group for each sample which will help define any contrast of effects with respect to time and group, e.g. differences in treatment effects.
Note in a one way design where the target estimand is for example the treatment effect only, this step is not needed as we can simply extract the effect of “time” from the lme4 fits.
met$group.time = paste(met$group, met$timepoint, sep = '_')
# make sure this is a factor and the levels are in a coherent order for the
# contrast matrix -- see below. here:
# time 0 = 0, time 1 = 1. low group = 0 high group = 1.
met$group.time = factor(
met$group.time,
levels = c("1_0", "1_1", "0_0", "0_1")
)
# now filter genes within each cell type that are reasonably expressed.
design = model.matrix( ~ 0 + met$group.time)
dge = Normalize(pseudobulk.list = pb, design = design, minimum.gene.count = 5)
Fit models and specify a priori contrasts corresponding to the desired effect comparisons
Below shows 2 steps
1: Specify a varying intercept model using lme4 symbolic model
formula. The 1|subjectID
term treats individual as a
random effect to adjust for the non independence in expression between
the same individual by estimating baseline expression variation with a
normal distribution for each gene in the genome.
In the example below, we want estimate baseline differences, treatment effects across all individuals and the difference in fold changes between groups adjusting estimates for age and sex (and modeling the individual variation with a varying effect). To do this we specify a custom a. priori contrast matrix.
2: Using the factor variable combining group and time, we create
contrasts over the levels to define effects of interest. The function
makeContrastsDream
is used for this purpose which is
similar to the limma function makeContrasts. More simple contrasts or
more complex contrasts are also possible. This step is critical for
deriving the correct effect size estimates. More information on
specifying contrast matrices is available here: A
guide to creating design matrices for gene expression
experiments
# specify a random intercept model with lme4 syntax.
f1 <- ~ 0 + age + sex + group.time + (1|subjectid)
# make contrast matrix
L2 = makeContrastsDream(
formula = f1,
data = samplemd,
contrasts = c(
baseline = "group.time1_0 - group.time0_0", # note fixef model also fit for this single time point contrast
treatment_delta = "( group.time1_1 - group.time1_0 ) - ( group.time0_1 - group.time0_0 )",
treatment = "( group.time1_1 + group.time0_1 ) / 2 - ( group.time1_0 + group.time0_0 ) / 2 "
)
)
# visualize the contrast matrix, compare to levels of group.time variable
# to check for correctly specified effects.
plotContrasts(L2)
Fit the models with variancePartition::dream
which uses
voom weights in lme4 mixed effects models with an empirical bayes
shrinkage toward genome wide trend accounting for per gene degrees of
freedom.
fit1 = FitDream(pb.list = dge,
sample.metadata = met,
lme4.formula = f1,
dream.contrast.matrix = L2,
ncores = 4)
Note, variancePartition::dream
now incorporates
an emperical Bayes step derived for models fit with lme4.
Downstream gene set enrichment analysis within celltypes for different effects
Run gene set enrichment analysis within each cell type based on genes ranked by effect size for each of the effects defined above.
Here within each cell type and for each contrast we run gene set
enrichment analysis using 2 functions ExtractResult
and
FgseaList
. ExtractResult
is used to extract a
list of gene ranks (a wrapper around dream::topTable and limma::topTable
to return a full list of results). Since we had multiple covariates in a
mixed model and we used a custom contrast matrix, we specify the
covariate of interest from the contrast using arguments
coefficient.number
and coef.name
which are
output in results based on the names of the contrasts.
Based on our contrast matrix (the object L2
we created
above with makeContrastsDream
),
coefficient.number = 1
corresponds to the baseline
difference between groups, coefficient.number = 2
is the
difference in fold changes between the groups and
coefficient.number = 3
is the pre vs post perturbation
difference across subjects in both groups. Estimates for the covariates
are also availabe.
Examples of extracting gene ranks based on contrasts and running gene set enrichment
# msigDB hallmark pathways are included in the scglmmr package
# can load any custom gene set
hlmk = scglmmr::hallmark
# extract genes ranked by treatment effect (across all donors)
rtreat = ExtractResult(model.fit.list = fit1,
what = 'lmer.z.ranks',
coefficient.number = 3,
coef.name = 'treatment')
# run fgsea
hlmk.treat = FgseaList(rank.list.celltype = rtreat,
pathways = hlmk,
BPPARAM = pparam)
# extract gene ranks of treatment effect (across all donors)
rdelta = ExtractResult(model.fit.list = fit1,
what = 'lmer.z.ranks',
coefficient.number = 2,
coef.name = 'treatment_delta')
hlmk.delta = RunFgseaOnRankList(rank.list.celltype = rdelta,
pathways = hlmk,
BPPARAM = pparam)
On a technical note, it’s also possible to fit a simple model for the
baseline contrast not estimating the variation across subjects with a
random effect as this contrast is more typically estimated with a
standard group 1 vs 2 least squares approach. The function
RunVoomLimma
is provided to fit these models. If you
compare the effect size estimates for an individual cell type for the
coefficient.number = 1
from the mixed model fits (object
fit1
above) to the models below they are highly correlated
along the diagonal with some differences arising to the different
degrees of freedom, sample size and variance coming from the additional
layer of variation estimated by the mixed model. The choice of model to
use is up to the user.
# fit simple linear model for the baseline group level contrast
design.2 = model.matrix(~0 + age + sex + group.time, data = met)
fit0 = scglmmr::RunVoomLimma(dgelists = dge,
design_matrix = design.2,
do_contrast_fit = T,
# we use only the first row of the contrast matrix L2
my_contrast_matrix = L2[ ,1])
# extract fixed efect model estimate of baseline contrast (pre treatment differences between groups)
r0 = ExtractResult(model.fit.list = fit0,
what = 'gene.t.ranks',
coefficient.number = 1,
coef.name = 'baseline')
## run gene set enrichment
hlmk.0 = FgseaList(rank.list.celltype = r0,
pathways = hlmk,
BPPARAM = pparam)
Further analysis and curation of enrichment results
# full set of leading edge genes indexed by celltype x effect x module
lefull = scglmmr::GetLeadingEdgeFull(gsea.list = hlmk.treat,
padj.filter = 0.02,
NES.filter = -Inf)
# extract model fit results instead of ranks
# automatically this sets argument `result` to 'statistics'
fit1.res = scglmmr::ExtractResult(model.fit.list = fit1 ,
coefficient.number = 3,
coef.name = 'treatment')
# combine GSEA results with model coefficient for each gene in leading edge
# include all leading edge genes irrespective of individual gene p value
cr = scglmmr::CombineResults(gsealist = hlmk.treat,
contrastlist = fit1.res,
gseafdr = 0.02,
genefdr = 1)
Extract all leading edge genes indexed by cell type
This outputs a nested list of leading edge genes from each enrichment across cell types. list level 1 is cell type, level 2 is enrichment signal.
li = scglmmr::LeadingEdgeIndexed(gsea.result.list = hlmk.treat, padj.threshold = 0.02)
Calculate the Jaccard similarity of the leading edge genes for enrichments within a given cell type and effect
This is useful for understanding which enrichment signals may come from the same vs distinct genes.
# figpath.temp = here('figures')
treat.JI = EnrichmentJaccard(gsealist = hlmk.treat,
indexedgenes = li,
#saveplot = TRUE,
#figpath = figpath.temp,
returnJaccardMtx = TRUE)
# curate results
results.sorted = treat.JI$sortedgsea %>%
dplyr::mutate(signal = paste(celltype, pathway, sep = '~'))
# data.table::fwrite(results.sorted, file = paste0(datapath, 'g0.result.sort.txt'),sep = "\t")
Network analysis of relationships between cell types based on correlation and shared information.
See separate networks vignette (coming soon).
Example usage provided below.
For details on implementation see methods section of https://www.medrxiv.org/content/10.1101/2023.03.20.23287474v1
index = LeadingEdgeIndexed(gsea.result.list = glist,padj.threshold = 0.05)
jmat = Jmatlist(index)
sig = Signaturescores(feature.list = gexp, indexedgenes = index)
sli.list = SLImatrix(sig.scores = sig, jmat.list = jmat, correlation.type = 'spearman')
Visualization of enrichment results
Create a bubble plot heatmap of enrichment results within clusters
# NES_filter to -Inf to plot positive and negative enrichment
p = PlotFgsea(gsea_result_list = hlmk.treat, NES_filter = -Inf,padj_filter = 0.05)
Create heatmap of gene perturbation fold changes across cell subsets based on model fit coefficients
Extract and make a heatmap of log fold changes across celltypes.
HeatmapDiag uses the package slanter which
accepts the same arguments as pheatmap
.
gene.mat = GetGeneMatrix(result.list = fit1,
pvalfilter = 0.05,
stat_for_matrix = 'logFC',
logfcfilter = 0.1)
#make heatmap e.g.
pheatmap::pheatmap(gene.mat, fontsize_row = 5)
# diagnoalize the heatmap so that the most correlated signals are grouped
# this uses the package slanter
HeatmapDiag(matrix = gene.mat, fontsize_row = 5)
Create heatmap of gene perturbation fold changes from enrichments across individuals within a cell subset
Visualize the log counts per million at the sample level of the gene set enrichment results.
# make tidy average data for visualization of weighted pb results
lcmp = lapply(pb, edgeR::cpm, log = TRUE )
#
le_expr = scglmmr::LeadEdgeTidySampleExprs(av.exprs.list = lcmp,
gsea.list = hlmk.treat,
padj.filter = 0.1,
NES.filter = -Inf)
# example plot of sample level average leading edge genes annotated
scglmmr::LeadEdgeSampleHeatmap(tidy.exprs.list = le_expr,
modulename = "HALLMARK_HYPOXIA",
elltype_plot = "CD4_NaiveTcell",
metadata = meta,
metadata_annotate = c('group', 'timepoint', 'age', 'sex'),
sample_column = 'sample',
returnmat = FALSE,
savepath = figpath,
savename = "filename")
# see also:
# scglmmr::TopGenesTidySampleExprs() - same as aobve for 'top' de genes estimated by model coeffcient / p value.
# scglmmr::GetTidySummary() - for custom plotting
# sample level vsualization of average data above
# repeat this for all enriched pathways
heatpath = here("sime/path"); dir.create(heatpath)
for (i in 1:length(le_expr)) {
cdat = le_expr[[i]]
ctype = names(le_expr[i])
umod = unique(cdat$module)
for (u in 1:length(umod)) {
scglmmr::LeadEdgeSampleHeatmap(tidy.exprs.list = le_expr,
modulename = umod[u],
celltype_plot = ctype,
metadata = meta, metadata_annotate = c('group', 'timepoint', 'age', 'gender'),
sample_column = 'sample',
returnmat = F,
savepath = heatpath,
savename = paste0(ctype, " ",umod[u],'.pdf'))
}
}
You can also create a customized map by returning the matrix from
LeadEdgeSampleHeatmap
.
# scglmmr function to extract leading edge genes
lexp1 = LeadEdgeTidySampleExprs(av.exprs.list = lcpm, gsea.list = g1c, padj.filter = 0.05, NES.filter = 0)
# annotate time and batch on the heatmap
heatmap_anno = meta[, c('batch', 'timepoint')]
anno_color = list(
timepoint = c('1' = "orange", '2' = 'red', "0" = "white"),
batch = c('1' = "black", '2' = "white")
)
# define your own custom color vector for the log fold change values
cu = c("#053061", "#1E61A5", "#3C8ABE", "#7CB7D6", "#BAD9E9", "#E5EEF3",
"#F9EAE1", "#F9C7AD", "#EB9273", "#CF5246", "#AB1529", "#67001F")
# scglmmr function for leading edge gene matrix across donors
mat2 = scglmmr::LeadEdgeSampleHeatmap(tidy.exprs.list = le_expr,
modulename = "reactome interferon signaling",
celltype_plot = 'CD14_Mono',
# metadata = meta, # unused
# metadata_annotate = c('batch'), # unused
sample_column = 'sample',
returnmat = TRUE)
# draw heatmap
pheatmap::pheatmap(mat2,
border_color = NA,
treeheight_row = 0, treeheight_col = 10,
annotation = heatmap_anno,
annotation_colors = anno_color,
color = cu,
width = 5, height = 7.6,
# this can be set to false to look at raw expression e.g.
scale = "row",
filename = paste0(figpath, "mono_d1_ReactomeIFN.pdf")
)