CD4 protein vs CD4 mRNA in CITE-seq data

Ideas / critique / comments? contact me on twitter or by email: mattmule AT gmail.com

Ming Tang tweeted:

1/ a question on CD4 mRNA vs protein. @CalebLareau I saw “CD4+ T cells express low levels of the CD4 transcript but very high levels of CD4 protein (Stoeckius et al., 2017)” in your paper 2/ is it based on sup figure 7 of Stoeckius et al. 2017? I used to think CD4 mRNA is not well captured in 10x

To investigate noise in CD4 mRNA, below I look at bone marrow CITE-seq data generated by the Satija lab in the SeuratData package and extract out CD4 mRNA and protein.

library(Seurat)
library(SeuratData)

data('bmcite')
s = bmcite
rm(bmcite)

# extract out CD4 UMI counts for rna and adt
prot = GetAssayData(object = s,slot = 'counts', assay = 'ADT')['CD4', ]
mRNA = GetAssayData(object = s,slot = 'counts', assay = 'RNA')['CD4', ]

d = cbind(
  s@meta.data, 
  CD4_PROT = prot, 
  CD4_mRNA = mRNA
  )

Since a major component of ADT noise in CITE-seq data is due to ambient capture of unbound ADTs nearly ALL of the cells in this experiment are ‘positive’ for CD4 protein. For example in the raw umi count space, what fraction of cells are positive for CD4 ADT?

length(prot[prot > 0]) / length(prot)

## [1] 0.9999348

That’s all the cells except for 2.

The cell populations with the lowest levels of CD4 protein still have a mean of ~4.5 in log2 space, about 25 UMI counts. The highest CD4 ADT expression is seen in the CD4 clusters have about 1000 raw UMI counts. This is very different from the type of noise we see in mRNA data! Also we can start to see that CD4 is actually not a very specific T cell marker – this is actually known CD4 is ubiquitously expressed at lower levels on monocytes as we can see below.

library(ggplot2);  theme_set(theme_bw())
ggplot(d, aes( x = reorder(celltype.l2, log2(CD4_PROT)), y = log2(CD4_PROT))) +
  xlab("")+
  geom_boxplot(outlier.size = 0.1, size = 0.1) + 
  coord_flip() + 
  ggtitle('black line = 10 ADT counts')

For the sake of understanding noise in CD4 mRNA counts, we can roughly gate out the ‘ground truth’ helper T cells based on expression of CD4 protein. We can also subset out only the CD4 naive T cells from the cell type annotations in this dataset to get a more homogeneous population.

cd4_cells = d[log2(d$CD4_PROT) > 8 & d$celltype.l2 == 'CD4 Naive', ]
th_d = d[rownames(cd4_cells), ]

In Svensson 2020 observed vs expected 0 rates were modeled with count distributions (i.e. error models) for datasets with droplets loaded a solution of mRNAs. The expected distribution of RNA in these control droplets was the same, there was no expected biological variations.

The CD4 protein positive cells can in principle serve as a similar biological variation negative control. It’s not a perfect experiment but we expect far less biological CD4 mRNA variation within this subset of gated naive cells than the variation across PBMCs.

Motivated by the approach in the Svensson paper we can simulate a poisson distribution with a rate of detection lambda equal to the CD4 mRNA mean in the gated population.

sim.pois = rpois(n = nrow(th_d), lambda = mean(th_d$CD4_mRNA))

Next we can calculate what the expected proportion of 0s are for poisson distributed counts with a rate parameter equal to the observed rate of detection in ground truth CD4 T cells:

sim.0 = sum(sim.pois == 0) / length(sim.pois)
sim.0

## [1] 0.9078829

The expected rate of 0 counts for a poisson distribution is e^-lambda which is consistent with the simulated data above.

exp(-mean(th_d$CD4_mRNA))

## [1] 0.9042262

If we compare that simulated proportion of 0’s to the stochastic sampling rate for the CD4 transcript in gated ‘ground truth’ CD4+ protein based T cells, we can get a feel for whether there are more 0 counts in the CD4 cells than expected.

The actual rate of 0 counts in CD4 T cells:

sum(th_d$CD4_mRNA == 0) / nrow(th_d)

## [1] 0.9045045

The number of 0s is exactly what we would expect for poisson distributed count data with that rate parameter. This is consistent with the Svensson 2020 paper. It suggests there is no evidence of ‘0 inflation’ for the CD4 gene, but how does that rate parameter compare to other genes in naive CD4 T cells?

For ease of visualization calculate log cpm vs standard deviation of genes in the gated Naive CD4 cells

library(magrittr)
umi = s@assays$RNA@counts[ ,rownames(cd4_cells)]

cd4 = subset(s,cells = rownames(cd4_cells))

# remove the non-expressed genes and compute counts per million
cd4 = CreateSeuratObject(counts = cd4@assays$RNA@counts, min.cells = 10)
cd4 = NormalizeData(s,normalization.method = 'RC')

# extract gene stats 
gene.stats = data.frame(
  mean = rowMeans(as.matrix(cd4@assays$RNA@data)), 
  sd = matrixStats::rowSds(as.matrix(cd4@assays$RNA@data))
) 
gene.stats$gene = rownames(gene.stats)

# visualize mean vs sd 
# subset of T cell genes to label
gene.lab = c('CD3E', 'MALAT1', 'CCR7', 'STAT4', 'IL7R', 
             'ID3', 'TCF7',  'GATA3', 'GZMA', 'CD4','RUNX1', 
             'RPS27','HBB', 'LYZ', 'B2M')


ggplot(gene.stats, aes(x = log2(mean + 1), y = log2(sd + 1), label = gene)) + 
  geom_point(size = 0.2) + 
  ggrepel::geom_text_repel( 
    data = gene.stats %>% dplyr::filter(gene %in% gene.lab ),
    col = 'red', force = 80, seed = 1990) +
  geom_density_2d(size = 0.4, color = 'blue')

CD4 has relatively low expression but not as low as other CD4 naive T cell defining genes especially some transcription factors.

Given this detection rate vs background genes does CD4 have any power for classifying CD4 cells? With Seurat we can implement a ROC test for classifying all protein + CD4 T cells.

# again removign the basically non-expressed genes 
s = CreateSeuratObject(counts = s@assays$RNA@counts,min.cells = 1, meta.data = s@meta.data)
s = NormalizeData(s, normalization.method = 'LogNormalize')
s$iscd4 = ifelse(s$celltype.l2 %in% c("CD4 Naive", "CD4 Memory"), 'cd4', 'other')

# roc test for classification power of the CD4 cells
Idents(s) <- 'iscd4'
cd4.genes = FindMarkers(object = s,
                        ident.1 = 'cd4',
                        ident.2 = 'other',
                        test.use = 'roc',
                        logfc.threshold = 0)

cd4.d = cd4.genes %>% 
  tibble::rownames_to_column('gene') %>% 
  dplyr::arrange(desc(myAUC)) 
  
ggplot(cd4.d, aes(x = power, y = avg_log2FC, label = gene)) + 
  geom_point() + 
   ggrepel::geom_text_repel( 
    data = cd4.d %>% dplyr::filter(gene %in% gene.lab ),
    col = 'red', force = 80, seed = 1990)

Despite its role as a protein marker for helper T cells, CD4 is expressed on multiple cell types including monocytes, albeit at lower levels. This is likely one reason it is not a great classifier.

In addition to the Svensson paper, some other recent work was published investigating different error models for modeling scRNAseq counts that were quite helpful for me in understanding noise in mRNA data:
Sarkar and Stephens
Choudhary and Satija

In this example my takeaway is less that CD4 mRNA has a poor detection, it’s more that any individual gene is not going to be a perfect classifier, we need to combine genes to characterize cell types and states. Of course the great sensitivity of ADT detection makes CITE-seq such an exciting development for immunology research.

Final note–we don’t expect that all the cells are actually expressing CD4 protein–this was added noise. Quick shameless plug–we did experiments to define these noise sources and developed an R package you can use to denoise the ADT data based on expected noise from empty droplets as well as remove to cell to cell technical variations.

sessionInfo()

## R version 4.0.5 Patched (2021-03-31 r80136)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Mojave 10.14.6
## 
## Matrix products: default
## BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] magrittr_2.0.1                 ggplot2_3.3.3                  pbmcsca.SeuratData_3.0.0      
##  [4] cbmc.SeuratData_3.1.4          bonemarrowref.SeuratData_1.0.0 bmcite.SeuratData_0.3.0       
##  [7] SeuratData_0.2.1               sp_1.4-5                       SeuratObject_4.1.0            
## [10] Seurat_4.0.1                   here_1.0.1                    
## 
## loaded via a namespace (and not attached):
##   [1] Rtsne_0.15            colorspace_2.0-0      deldir_1.0-6          ellipsis_0.3.2        ggridges_0.5.3       
##   [6] rprojroot_2.0.2       rstudioapi_0.13       spatstat.data_2.1-0   farver_2.0.3          leiden_0.3.7         
##  [11] listenv_0.8.0         ggrepel_0.9.1         codetools_0.2-18      splines_4.0.5         knitr_1.39           
##  [16] polyclip_1.10-0       jsonlite_1.7.2        packrat_0.7.0         ica_1.0-2             cluster_2.1.2        
##  [21] png_0.1-7             rgeos_0.5-9           uwot_0.1.10           shiny_1.6.0           sctransform_0.3.2    
##  [26] spatstat.sparse_2.0-0 compiler_4.0.5        httr_1.4.2            assertthat_0.2.1      Matrix_1.4-1         
##  [31] fastmap_1.1.0         lazyeval_0.2.2        cli_3.3.0             later_1.1.0.1         htmltools_0.5.2      
##  [36] tools_4.0.5           igraph_1.2.6          gtable_0.3.0          glue_1.6.2            RANN_2.6.1           
##  [41] reshape2_1.4.4        dplyr_1.0.4           rappdirs_0.3.3        Rcpp_1.0.9            scattermore_0.7      
##  [46] jquerylib_0.1.3       vctrs_0.4.1           nlme_3.1-152          progressr_0.10.0      lmtest_0.9-38        
##  [51] xfun_0.30             stringr_1.4.0         globals_0.14.0        mime_0.10             miniUI_0.1.1.1       
##  [56] lifecycle_1.0.0       irlba_2.3.3           goftest_1.2-2         future_1.21.0         MASS_7.3-53.1        
##  [61] zoo_1.8-8             scales_1.1.1          spatstat.core_2.0-0   promises_1.2.0.1      spatstat.utils_2.3-0 
##  [66] parallel_4.0.5        RColorBrewer_1.1-2    yaml_2.2.1            reticulate_1.18       pbapply_1.4-3        
##  [71] gridExtra_2.3         sass_0.4.0            rpart_4.1-15          stringi_1.5.3         highr_0.8            
##  [76] rlang_1.0.2           pkgconfig_2.0.3       matrixStats_0.58.0    evaluate_0.15         lattice_0.20-41      
##  [81] ROCR_1.0-11           purrr_0.3.4           tensor_1.5            labeling_0.4.2        patchwork_1.1.1      
##  [86] htmlwidgets_1.5.3     cowplot_1.1.1         tidyselect_1.1.0      parallelly_1.23.0     RcppAnnoy_0.0.18     
##  [91] plyr_1.8.6            R6_2.5.0              generics_0.1.2        DBI_1.1.1             withr_2.4.3          
##  [96] mgcv_1.8-34           pillar_1.4.7          fitdistrplus_1.1-3    survival_3.2-10       abind_1.4-5          
## [101] tibble_3.0.6          future.apply_1.7.0    crayon_1.4.1          KernSmooth_2.23-18    spatstat.geom_2.4-0  
## [106] plotly_4.9.3          rmarkdown_2.9         isoband_0.2.3         grid_4.0.5            data.table_1.14.0    
## [111] digest_0.6.27         xtable_1.8-4          tidyr_1.1.2           httpuv_1.5.5          munsell_0.5.0        
## [116] viridisLite_0.3.0     bslib_0.3.1

CD4 protein vs CD4 mRNA in CITE-seq data

Matt Mulè