

基于单细胞数据进行Bulk定量之MuSiC

R语言数据分析指南

2021-12-02

导读：前言Immugent最近在读文献时一直遇到各种利用deconvolution基于单细胞参考数据集进行bulk

前言

Immugent最近在读文献时一直遇到各种利用deconvolution基于单细胞参考数据集进行bulk细胞定量的文章，于是专门系统的学习了一下此类常用的几个软件，今天就做一个简单的分享。

其实在之前IOBR:一个R包带你走进数据挖掘的殿堂一文中就已经系统的介绍各种基于参考基因集进行bulk细胞定量的算法,但是这些算法都是bulk对bulk,要么对基因的选择要求很高，要么识别的细胞亚群有限。而近些年大量单细胞数据集的出现，让我们想到如果利用定义好的单个细胞水平的数据构建reference，结合deconvolution对bulk数据进行细胞定量，那岂不是美滋滋。

于是deconvolution的门面算法之一CIBERSORT为实现这个理念，就在2019年推出了新版CIBERSORTx,相应文章发表在了Nature Biotec hnology (IF:54.9),这个杂志可谓生信甚至实验技术的天花板了，可见其价值之高。就当小编兴致冲冲打算用CIBERSORTx玩一圈TCGA数据时，出现了很多bug，因为目前这个软件只有网页版的工具，想要做分析必须将自己的数据传上去，然鹅系统并不能识别新上传的文件，就很难受了。虽然网页友好的给出了镜像文件，但是又搞了个什么需要用邮箱申请密钥，于是小编等了好几天都没等到，干脆不等了。

于是小编开始寻求新的此类软件，就发现了同年发表的另一个软件MuSiC，虽然杂志没有上一个好，但也算不错了，而且两年内被引用了几百次了，可见这个算法还是值得信赖的。

最难能可贵的是这个软件是一个非常容易安装的R包，运算也很快，非常的轻便。最给力的是它还可以对小鼠和大鼠的bulk数据进行deconvolution,因此深得Immugent的喜爱，就走了一下分析流程。

主要流程

作者给出了以下几套数据集(都是已经发表的GE数据)，以便于熟悉此软件的用法。

setwd("D:\\Music")
library(MuSiC)

# bulk data
GSE50244.bulk.eset = readRDS('GSE50244bulkeset.rds')
GSE50244.bulk.eset

# EMTAB single cell dataset
EMTAB.eset = readRDS('EMTABesethealthy.rds')
EMTAB.eset

# Download Xin et al. single cell dataset from Github
XinT2D.eset = readRDS('XinT2Deset.rds')
XinT2D.eset

# Estimate cell type proportions
Est.prop.GSE50244 = music_prop(bulk.eset = GSE50244.bulk.eset, sc.eset = EMTAB.eset, clusters = 'cellType',
                               samples = 'sampleID', select.ct = c('alpha', 'beta', 'delta', 'gamma',
                                                                   'acinar', 'ductal'), verbose = F)
                                                            
                                                            
names(Est.prop.GSE50244)
#[1] "Est.prop.weighted" "Est.prop.allgene"  "Weight.gene"       "r.squared.full"    "Var.prop"  

# Jitter plot of estimated cell type proportions
jitter.fig = Jitter_Est(list(data.matrix(Est.prop.GSE50244$Est.prop.weighted),
                             data.matrix(Est.prop.GSE50244$Est.prop.allgene)),
                        method.name = c('MuSiC', 'NNLS'), title = 'Jitter plot of Est Proportions')

# A more sophisticated jitter plot is provided as below. We seperated the T2D subjects and normal 
#subjects by their HbA1c levels.
library(reshape2)
m.prop.GSE50244 = rbind(melt(Est.prop.GSE50244$Est.prop.weighted), 
                        melt(Est.prop.GSE50244$Est.prop.allgene))

colnames(m.prop.GSE50244) = c('Sub', 'CellType', 'Prop')
m.prop.GSE50244$CellType = factor(m.prop.GSE50244$CellType, levels = c('alpha', 'beta', 'delta', 'gamma', 'acinar', 'ductal'))
m.prop.GSE50244$Method = factor(rep(c('MuSiC', 'NNLS'), each = 89*6), levels = c('MuSiC', 'NNLS'))
m.prop.GSE50244$HbA1c = rep(GSE50244.bulk.eset$hba1c, 2*6)
m.prop.GSE50244 = m.prop.GSE50244[!is.na(m.prop.GSE50244$HbA1c), ]
m.prop.GSE50244$Disease = factor(c('Normal', 'T2D')[(m.prop.GSE50244$HbA1c > 6.5)+1], levels = c('Normal', 'T2D'))
m.prop.GSE50244$D = (m.prop.GSE50244$Disease == 'T2D')/5
m.prop.GSE50244 = rbind(subset(m.prop.GSE50244, Disease == 'Normal'), subset(m.prop.GSE50244, Disease != 'Normal'))

jitter.new = ggplot(m.prop.GSE50244, aes(Method, Prop)) + 
  geom_point(aes(fill = Method, color = Disease, stroke = D, shape = Disease), 
             size = 2, alpha = 0.7, position = position_jitter(width=0.25, height=0)) +
  facet_wrap(~ CellType, scales = 'free') + scale_colour_manual( values = c('white', "gray20")) +
  scale_shape_manual(values = c(21, 24))+ theme_minimal()
library(cowplot)
plot_grid(jitter.fig, jitter.new, labels = 'auto')

简单一个函数即可实现基于单细胞数据对bulk进行细胞定可谓非常轻便，除了使用自身算法进行定量，MuSiC还嵌入了非负矩阵的算法来进行对比，从上图可以看出两者间相似性非常高。

关联分析

众所周知，β细胞比例与T2D疾病状态有关。在T2D过程中，β细胞数量减少。其中最重要的T2D检测是糖化血红蛋白(HbA1c)检测。当糖化血红蛋白水平大于6.5%时，诊断为T2D。我们可以看β细胞比例与糖化血红蛋白水平的关系。


# Create dataframe for beta cell proportions and HbA1c levels
m.prop.ana = data.frame(pData(GSE50244.bulk.eset)[rep(1:89, 2), c('age', 'bmi', 'hba1c', 'gender')],
                        ct.prop = c(Est.prop.GSE50244$Est.prop.weighted[, 2], 
                                    Est.prop.GSE50244$Est.prop.allgene[, 2]), 
                        Method = factor(rep(c('MuSiC', 'NNLS'), each = 89), 
                                        levels = c('MuSiC', 'NNLS')))
colnames(m.prop.ana)[1:4] = c('Age', 'BMI', 'HbA1c', 'Gender')
m.prop.ana = subset(m.prop.ana, !is.na(HbA1c))
m.prop.ana$Disease = factor( c('Normal', 'T2D')[(m.prop.ana$HbA1c > 6.5) + 1], c('Normal', 'T2D') )
m.prop.ana$D = (m.prop.ana$Disease == 'T2D')/5

ggplot(m.prop.ana, aes(HbA1c, ct.prop)) + 
  geom_smooth(method = 'lm',  se = FALSE, col = 'black', lwd = 0.25) +
  geom_point(aes(fill = Method, color = Disease, stroke = D, shape = Disease), size = 2, alpha = 0.7) +  facet_wrap(~ Method) + 
  ggtitle('HbA1c vs Beta Cell Type Proportion') + theme_minimal() + 
  scale_colour_manual( values = c('white', "gray20")) + 
  scale_shape_manual(values = c(21, 24))


lm.beta.MuSiC = lm(ct.prop ~ HbA1c + Age + BMI + Gender, data = subset(m.prop.ana, Method == 'MuSiC'))
lm.beta.NNLS = lm(ct.prop ~ HbA1c + Age + BMI + Gender, data = subset(m.prop.ana, Method == 'NNLS'))
summary(lm.beta.MuSiC)
summary(lm.beta.NNLS)

从上图我们可以看出β细胞比例与糖化血红蛋白水平呈现明显的负调节关系，而且MuSiC算出的系数比非负矩阵更显著。

对小鼠bulk数据进行分析

# Mouse bulk dataset
Mouse.bulk.eset = readRDS('Mousebulkeset.rds')
Mouse.bulk.eset

# Download EMTAB single cell dataset from Github
Mousesub.eset = readRDS('Mousesubeset.rds')
Mousesub.eset

levels(Mousesub.eset$cellType)
# [1] "Endo"     "Podo"     "PT"       "LOH"      "DCT"      "CD-PC"    "CD-IC"    "CD-Trans" "Novel1"  
#[10] "Fib"      "Macro"    "Neutro"   "B lymph"  "T lymph"  "NK"       "Novel2"    

# Produce the first step information
Mousesub.basis = music_basis(Mousesub.eset, clusters = 'cellType', samples = 'sampleID', 
                             select.ct = c('Endo', 'Podo', 'PT', 'LOH', 'DCT', 'CD-PC', 'CD-IC', 'Fib',
                                           'Macro', 'Neutro','B lymph', 'T lymph', 'NK'))

# Plot the dendrogram of design matrix and cross-subject mean of realtive abundance
par(mfrow = c(1, 2))
d <- dist(t(log(Mousesub.basis$Disgn.mtx + 1e-6)), method = "euclidean")
# Hierarchical clustering using Complete Linkage
hc1 <- hclust(d, method = "complete" )
# Plot the obtained dendrogram
plot(hc1, cex = 0.6, hang = -1, main = 'Cluster log(Design Matrix)')
d <- dist(t(log(Mousesub.basis$M.theta + 1e-8)), method = "euclidean")
# Hierarchical clustering using Complete Linkage
# hc2 <- hclust(d, method = "complete" )
hc2 <- hclust(d, method = "complete")
# Plot the obtained dendrogram
plot(hc2, cex = 0.6, hang = -1, main = 'Cluster log(Mean of RA)')

这个图是对不同细胞进行聚类，可以看出免疫细胞和非免疫细胞聚类不同。

clusters.type = list(C1 = 'Neutro', C2 = 'Podo', C3 = c('Endo', 'CD-PC', 'LOH', 'CD-IC', 'DCT', 'PT'), C4 = c('Macro', 'Fib', 'B lymph', 'NK', 'T lymph'))

cl.type = as.character(Mousesub.eset$cellType)

for(cl in 1:length(clusters.type)){
  cl.type[cl.type %in% clusters.type[[cl]]] = names(clusters.type)[cl]
}
pData(Mousesub.eset)$clusterType = factor(cl.type, levels = c(names(clusters.type), 'CD-Trans', 'Novel1', 'Novel2'))

# 13 selected cell types
s.mouse = unlist(clusters.type)
s.mouse


load('IEmarkers.RData')
# This RData file provides two vectors of gene names Epith.marker and Immune.marker

# We now construct the list of group marker
IEmarkers = list(NULL, NULL, Epith.marker, Immune.marker)
names(IEmarkers) = c('C1', 'C2', 'C3', 'C4')
# The name of group markers should be the same as the cluster names

Est.mouse.bulk = music_prop.cluster(bulk.eset = Mouse.bulk.eset, sc.eset = Mousesub.eset, group.markers = IEmarkers, 
                                    clusters = 'cellType', groups = 'clusterType', samples = 'sampleID', clusters.type = clusters.type)

用已知细胞构成对MuSiC定量结果进行评估

XinT2D.construct.full = bulk_construct(XinT2D.eset, clusters = 'cellType', samples = 'SubjectName')

names(XinT2D.construct.full)
# [1] "Bulk.counts" "num.real" 

XinT2D.construct.full$Bulk.counts

head(XinT2D.construct.full$num.real)
XinT2D.construct.full$prop.real = relative.ab(XinT2D.construct.full$num.real, by.col = FALSE)
head(XinT2D.construct.full$prop.real)

# Estimate cell type proportions of artificial bulk data
Est.prop.Xin = music_prop(bulk.eset = XinT2D.construct.full$Bulk.counts, sc.eset = EMTAB.eset,
                          clusters = 'cellType', samples = 'sampleID', 
                          select.ct = c('alpha', 'beta', 'delta', 'gamma'))

# Estimation evaluation
Eval_multi(prop.real = data.matrix(XinT2D.construct.full$prop.real), 
           prop.est = list(data.matrix(Est.prop.Xin$Est.prop.weighted), 
                           data.matrix(Est.prop.Xin$Est.prop.allgene)), 
           method.name = c('MuSiC', 'NNLS'))

prop.comp.fig = Prop_comp_multi(prop.real = data.matrix(XinT2D.construct.full$prop.real),
                                prop.est = list(data.matrix(Est.prop.Xin$Est.prop.weighted),
                                                data.matrix(Est.prop.Xin$Est.prop.allgene)),
                                method.name = c('MuSiC', 'NNLS'), 
                                title = 'Heatmap of Real and Est. Prop' )

abs.diff.fig = Abs_diff_multi(prop.real = data.matrix(XinT2D.construct.full$prop.real),
                              prop.est = list(data.matrix(Est.prop.Xin$Est.prop.weighted),
                                              data.matrix(Est.prop.Xin$Est.prop.allgene)),
                              method.name = c('MuSiC', 'NNLS'), 
                              title = 'Abs.Diff between Real and Est. Prop' )

plot_grid(prop.comp.fig, abs.diff.fig, labels = "auto", rel_widths = c(4,3))

从这个图中我们可以看出基于MuSiC算出的各种细胞比例和实际细胞比例相似度很高，比非负矩阵的结果要好很多。在原文中作者还和 CIBERSORT (Newman et al. 2015) and bseq-sc ( Baron et al. 2016)都进行了对比，感兴趣的小伙伴可以去看看原文。

小结

虽然小编夸了MuSiC一路，但是还是要实事求是的看待这个软件，作者上述的结果肯定都是选择利于自身呈现软件功能进行挑选过的。在实际应用过程中小编发现MuSiC有一些问题还需要理性对待，如它对很多比例很低的细胞群识别能力很差；而且出现要基于整个组织全部细胞进行计算才比较准确，无法自定义检测某一些特殊的细胞亚群。但是好在这个软件很轻便，我们可以结合实验或者其它算法来综合使用。

本期推文讲到这就结束啦，希望小伙伴一直关注生信宝库，我们后续还会推出更实用的软件，敬请期待！