The concepts of the CIC, DCA, calibration, ROC, and PR curves, their differences, and their applications in medical statistics.
CIC curve:
Evaluates the real-world clinical impact of a diagnostic test or prediction model by showing how many people are correctly/incorrectly classified at each threshold
x-axis: threshold probability
y-axis: number classified as high risk (red axis) and number of true positives (blue axis)
DCA curve:
Measures the clinical decision value of a model across thresholds via the net benefit (a short worked example follows this overview)
x-axis: threshold probability
y-axis: net benefit ( = TP/n - FP/n × pt/(1-pt) )
Calibration curve:
Tests the agreement between the model's predicted probabilities and the observed event rates
x-axis: binned predicted probability
y-axis: observed event rate (ideally a 45-degree line)
ROC curve:
Evaluates the model's discrimination; the relationship between TPR and FPR reflects diagnostic accuracy
x-axis: FPR (false positive rate)
y-axis: TPR (true positive rate)
PR curve:
Focuses on positive-class performance; well suited to evaluating imbalanced data
x-axis: recall
y-axis: precision
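To make the net-benefit formula above concrete before moving on, here is a minimal worked sketch in Python; the counts are purely illustrative (they loosely mirror the breast-cancer screening numbers in the applications section below):
# Hypothetical counts at a threshold of pt = 0.20, in a cohort of 1,000 people (illustrative only)
n, tp, fp, pt = 1000, 45, 105, 0.20
# Net benefit = TP/n - FP/n * (pt / (1 - pt))
net_benefit = tp / n - fp / n * (pt / (1 - pt))
print(f"Net benefit at pt = {pt}: {net_benefit:.3f}")  # 0.045 - 0.105 * 0.25, about 0.019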
1. CIC curve
Clinical applications: cancer-screening strategy selection: comparing the true-positive detection efficiency of different models in high-risk populations
Resource allocation: when forecasting ICU bed demand, estimating how many patients fall above each risk threshold
Example: in breast-cancer screening with the risk threshold set to 20%, the CIC shows that 15% of the population is classified as high risk (about 150 per 1,000 screened),
of whom 30% are true positives (45 actual cases)
2. DCA curve
Clinical applications: treatment decision support: comparing the net clinical benefit of chemotherapy strategies across risk thresholds
Model selection: among several candidate prediction models, choosing the one with the highest net benefit for clinical use
Example: for a prostate-cancer prediction model, over the threshold range 10%-30% the DCA shows the net benefit rising from 0.15 to 0.25,
clearly better than the "treat all" or "treat none" strategies
3. Calibration curve
Clinical applications: model validation: assessing the predictive accuracy of myocardial-infarction risk scores (e.g., the GRACE score)
Probability calibration: applying Platt scaling to the outputs of machine-learning models (e.g., XGBoost)
Example: a stroke prediction model over-predicts in the 0-20% low-risk range (observed rate 10% vs predicted 15%) and needs recalibration
4. ROC curve
Clinical applications: biomarker evaluation: comparing the diagnostic performance of PSA and PCA3 for prostate cancer (AUC comparison)
Radiomics models: evaluating the sensitivity/specificity trade-off of CT features for distinguishing benign from malignant lung nodules
Example: for a COVID-19 prediction model, the ROC curve gives an optimal threshold of 0.32 (maximum Youden index),
with sensitivity 85% and specificity 76%
5. PR curve
Clinical applications: rare-disease detection: assessing the precision-recall balance of newborn genetic-screening models
Adverse-drug-reaction alerts: identifying high-recall models in data where the ADR rate is below 1%
Example: on early pancreatic-cancer screening data (prevalence 2%): model A: precision 40%, recall 80%;
model B: precision 60%, recall 50%;
the model emphasizing recall or precision is then chosen according to clinical need (a short calculation on the effect of prevalence follows)
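To see why the PR curve is preferred when prevalence is low, a short worked sketch: at the 2% prevalence of the pancreatic-cancer example above, and with sensitivity and specificity values assumed purely for illustration, Bayes' rule gives the implied precision (PPV):
# Assumed values for illustration only; the prevalence is taken from the example above
prevalence, sensitivity, specificity = 0.02, 0.80, 0.95
# Precision (PPV) = P(disease | positive test) by Bayes' rule
precision = (sensitivity * prevalence) / (
    sensitivity * prevalence + (1 - specificity) * (1 - prevalence))
print(f"Implied precision at 2% prevalence: {precision:.1%}")  # roughly 25%
Even with 95% specificity the precision stays near 25% at this prevalence, which is why the PR curve, which tracks precision directly, is more informative than the ROC curve on such imbalanced data.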
So how do we draw these model-evaluation curves (CIC, DCA, calibration, ROC, and PR) in Python and in R? As before, we use the familiar example dataset for a simple walkthrough.
We first fit a logistic regression in Python and generate predictions for the test set:
# Load the required packages (openpyxl, pandas, etc.)
# Read the example xlsx file with pandas
import openpyxl
import pandas as pd
import matplotlib.pyplot as plt
datalr = pd.read_excel(r'C:\Users\L\Desktop\示例数据.xlsx')
# Inspect the first few rows
print(datalr.head())
# Fit a logistic regression
# Import the modelling functions from scikit-learn
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Separate the features and the outcome variable
X = datalr[['指标1', '指标2', '指标3','指标4','指标5','指标6']]
y = datalr['结局']
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Create and train the logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)
# Print the fitted model's parameters (intercept and regression coefficients)
print("Intercept and coefficients of the logistic model:")
print(model.intercept_, model.coef_)
# Predict on the test set (hard class labels and predicted probabilities of the positive class)
y_pred = model.predict(X_test)
prob = model.predict_proba(X_test)[:, 1]
# Evaluate model accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Model accuracy: {accuracy}')
# Fit the logistic regression with statsmodels to obtain coefficient estimates and, from them, OR values
import statsmodels.api as sm
# statsmodels does not add an intercept automatically, so add a constant column first
X_train_sm = sm.add_constant(X_train)
modellr = sm.Logit(y_train, X_train_sm)
# Fit the model
result = modellr.fit()
# Print the summary (log-odds coefficients, standard errors, p-values)
print(result.summary())
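The summary above reports log-odds coefficients rather than odds ratios; as a small follow-up sketch using the result object fitted above, the coefficients and their 95% confidence limits can be exponentiated to obtain the ORs:
import numpy as np
# Exponentiate the coefficients and their 95% CIs to obtain odds ratios
or_table = pd.DataFrame({
    'OR': np.exp(result.params),
    'CI_lower': np.exp(result.conf_int()[0]),
    'CI_upper': np.exp(result.conf_int()[1]),
    'p': result.pvalues,
})
print(or_table)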
# Load the libraries used to draw the DCA, CIC, calibration, ROC, and PR curves
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.calibration import calibration_curve
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.isotonic import IsotonicRegression
from sklearn.utils import resample
# Draw the DCA curve
# Net-benefit calculation for decision curve analysis
def calculate_dca(y_true, prob, thresholds):
    net_benefits = []
    for pt in thresholds:
        tp = np.sum((prob >= pt) & (y_true == 1))
        fp = np.sum((prob >= pt) & (y_true == 0))
        nb = (tp / len(y_true)) - (fp / len(y_true)) * (pt / (1 - pt))
        net_benefits.append(nb)
    return np.array(net_benefits)

# Plot the DCA curve (net benefit is computed from the predicted probabilities, not the hard class labels)
thresholds = np.linspace(0.01, 0.99, 100)
nb_model = calculate_dca(y_test, prob, thresholds)
nb_all = np.zeros_like(thresholds)  # "treat none" baseline (net benefit = 0)
nb_treat_all = y_test.mean() - (1 - y_test.mean()) * (thresholds / (1 - thresholds))  # "treat all" baseline
plt.plot(thresholds, nb_model, label='Prediction Model')
plt.plot(thresholds, nb_all, 'k--', label='Treat None')
plt.plot(thresholds, nb_treat_all, 'k:', label='Treat All')
plt.xlabel('Threshold Probability')
plt.ylabel('Net Benefit')
plt.legend()
plt.show()
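As a quick numerical check of the plot, and of the idea that a useful model should outperform both the "treat all" and "treat none" strategies over some range of thresholds (as in the prostate-cancer example earlier), a minimal sketch reusing the arrays computed above:
# Thresholds at which the model's net benefit exceeds both reference strategies
better = (nb_model > nb_treat_all) & (nb_model > 0)
if better.any():
    print(f"Model has the highest net benefit for thresholds roughly in "
          f"[{thresholds[better].min():.2f}, {thresholds[better].max():.2f}]")
else:
    print("Model never outperforms both reference strategies")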
# clinical_impact: draw the clinical impact curve (CIC)
# At each threshold, count how many people are classified as high risk and how many of them truly have the outcome
def clinical_impact(y_true, prob, thresholds):
    high_risk = []
    true_pos = []
    for pt in thresholds:
        high_risk.append(np.sum(prob >= pt))
        true_pos.append(np.sum((prob >= pt) & (y_true == 1)))
    return high_risk, true_pos

high_risk, true_pos = clinical_impact(y_test, prob, thresholds)
fig, ax1 = plt.subplots()
ax2 = ax1.twinx()
ax1.plot(thresholds, high_risk, 'r-', label='High Risk')
ax2.plot(thresholds, true_pos, 'b-', label='True Positive')
ax1.set_xlabel('Threshold Probability')
ax1.set_ylabel('High Risk Cases', color='r')
ax2.set_ylabel('True Positives', color='b')
plt.show()
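Clinical impact curves are conventionally reported per 1,000 people (as in the screening example earlier, "about 150 per 1,000"); a minimal sketch that rescales the counts returned by clinical_impact above to a standardized population:
# Rescale the counts to a standardized population of 1,000 people
pop_size = 1000
scale = pop_size / len(y_test)
high_risk_per_1000 = np.array(high_risk) * scale
true_pos_per_1000 = np.array(true_pos) * scale
fig, ax = plt.subplots()
ax.plot(thresholds, high_risk_per_1000, 'r-', label='Number high risk (per 1000)')
ax.plot(thresholds, true_pos_per_1000, 'b--', label='Number high risk with event (per 1000)')
ax.set_xlabel('Threshold Probability')
ax.set_ylabel('Number per 1,000 people')
ax.legend()
plt.show()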
# cic_composite: an alternative CIC plot built from a DataFrame
import seaborn as sns

def cic_composite(y_true, prob, thresholds):
    df = pd.DataFrame({
        'Threshold': thresholds,
        'HighRisk': [np.sum(prob >= pt) for pt in thresholds],
        'TP': [np.sum((prob >= pt) & (y_true == 1)) for pt in thresholds]
    })
    # Dual-axis visualisation
    fig, ax1 = plt.subplots()
    sns.lineplot(data=df, x='Threshold', y='HighRisk', ax=ax1, color='r')
    ax2 = ax1.twinx()
    sns.lineplot(data=df, x='Threshold', y='TP', ax=ax2, color='b')
    plt.show()
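cic_composite above is only defined, not called; a one-line usage sketch with the objects created earlier:
# Draw the composite CIC from the test-set predicted probabilities
cic_composite(y_test, prob, thresholds)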
# == Calibration curve parameters via sklearn's built-in calibration_curve ==
# Note: calibration_curve returns (fraction of positives, mean predicted probability), in that order
sklearn_mean_true, sklearn_mean_pred = calibration_curve(y_test, prob, n_bins=10, strategy='quantile')
# ===== Manual binning implementation =====
def manual_calibration_curve(y_true, prob, n_bins=10, strategy='quantile'):
    df = pd.DataFrame({'true': y_true, 'pred': prob})
    if strategy == 'quantile':
        df['bin'] = pd.qcut(df['pred'], q=n_bins, duplicates='drop')
    elif strategy == 'uniform':
        df['bin'] = pd.cut(df['pred'], bins=n_bins)
    bin_stats = df.groupby('bin').agg(
        mean_pred=('pred', 'mean'),
        mean_true=('true', 'mean'),
        count=('true', 'count')
    ).reset_index()
    return bin_stats['mean_pred'], bin_stats['mean_true']
# ===== Calibration curve visualisation =====
plt.figure(figsize=(12, 8))
# Ideal diagonal
plt.plot([0, 1], [0, 1], 'k--', label='Perfect calibration')
# Manually binned estimates
manual_pred, manual_true = manual_calibration_curve(y_test, prob)
plt.plot(manual_pred, manual_true, 's-', label='Manual Binning')
# sklearn estimates
plt.plot(sklearn_mean_pred, sklearn_mean_true, 'o-', label='sklearn API')
# Plot decorations
plt.xlabel('Predicted probability', fontsize=12)
plt.ylabel('True probability', fontsize=12)
plt.title('Calibration Curve Comparison (n_bins=10)', fontsize=14)
plt.legend(loc='upper left')
plt.grid(alpha=0.3)
plt.show()
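The IsotonicRegression import earlier is not otherwise used. As a hedged sketch of the recalibration step mentioned in the overview (isotonic regression used here as a non-parametric alternative to Platt scaling), it can be fitted on the training-set predictions, applied to the test-set probabilities, and the calibration curve recomputed:
# Fit isotonic regression on the training predictions, then recalibrate the test-set probabilities
iso = IsotonicRegression(out_of_bounds='clip')
prob_train = model.predict_proba(X_train)[:, 1]
iso.fit(prob_train, y_train)
prob_calibrated = iso.transform(prob)
# Compare calibration before and after recalibration
true_raw, pred_raw = calibration_curve(y_test, prob, n_bins=10, strategy='quantile')
true_iso, pred_iso = calibration_curve(y_test, prob_calibrated, n_bins=10, strategy='quantile')
plt.plot([0, 1], [0, 1], 'k--', label='Perfect calibration')
plt.plot(pred_raw, true_raw, 'o-', label='Raw model')
plt.plot(pred_iso, true_iso, 's-', label='Isotonic recalibration')
plt.xlabel('Predicted probability')
plt.ylabel('True probability')
plt.legend()
plt.show()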
# ROC curve and PR curve
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc, precision_recall_curve
# ROC curve
fpr, tpr, _ = roc_curve(y_test, prob)
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, label=f'ROC (AUC={roc_auc:.2f})')
plt.plot([0,1],[0,1],'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()
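The ROC example in the overview mentioned choosing the cut-off that maximizes the Youden index; a minimal sketch that re-requests the thresholds from roc_curve and reports the corresponding sensitivity and specificity:
# Optimal cut-off by Youden's J = sensitivity + specificity - 1 = TPR - FPR
fpr, tpr, roc_thresholds = roc_curve(y_test, prob)
best_idx = np.argmax(tpr - fpr)
print(f"Best threshold: {roc_thresholds[best_idx]:.2f}, "
      f"sensitivity: {tpr[best_idx]:.2%}, specificity: {1 - fpr[best_idx]:.2%}")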
# PR curve
precision, recall, _ = precision_recall_curve(y_test, prob)
pr_auc = auc(recall, precision)
plt.plot(recall, precision,
label=f'PR (AUC={pr_auc:.2f})')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.legend()
plt.show()
# Evaluate the model with a confusion matrix
# Import the metrics module
from sklearn import metrics
# Confusion matrix
print("Confusion matrix (2x2 table):")
print(metrics.confusion_matrix(y_test, y_pred, labels=[0, 1]))
Accuracy = metrics.accuracy_score(y_test, y_pred)
Sensitivity = metrics.recall_score(y_test, y_pred)
Specificity = metrics.recall_score(y_test, y_pred, pos_label=0)
print("Confusion-matrix evaluation of the logistic regression model:")
print('Accuracy: %.2f%%' % (Accuracy * 100))
print('Sensitivity (recall of the positive class): %.2f%%' % (Sensitivity * 100))
print('Specificity (recall of the negative class): %.2f%%' % (Specificity * 100))
# Plot the confusion matrix as a Seaborn heatmap
import matplotlib.pyplot as plt
import seaborn as sns
sns.heatmap(metrics.confusion_matrix(y_test, y_pred), annot=True, fmt='d')
plt.title('Confusion Matrix')
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.show()
Turning to R, we first clear the workspace, load the packages, and read in the data:
rm(list=ls()) # remove all objects from the workspace
install.packages("glmnet") # install a package (run once if not already installed)
#1. Data and package preparation
# Load the packages and read the Excel data
library(glmnet)   # penalised regression
library(Matrix)   # matrix utilities
library(ggplot2)  # plotting
library(caret)    # stratified sampling / data splitting
library(lattice)  # plotting and visualisation
library(pROC)     # ROC curves
library(Hmisc)    # nomogram support
library(rms)      # nomogram and calibration curves
library(rmda)     # decision curve and clinical impact curves
library(dplyr)    # data manipulation
library(readxl)   # reading Excel files
data <- read_excel("C:/Users/L/Desktop/示例数据.xlsx")
# Basic analysis of the full dataset
# str() shows the data structure
str(data)
summary(data)
# Overview of all variables and group comparison
library(gtsummary)
data %>% tbl_summary(by=结局) %>%
  add_p(pvalue_fun=label_style_pvalue(digits=2))
#2. Nomogram and calibration curve for the logistic regression model
# Nomogram
data$Group <- as.factor(data$结局)
dd <- datadist(data)
options(datadist="dd")
f_lrm <- lrm(Group~指标1+指标2+指标3+指标4+指标5+指标6,
             data=data, x=TRUE, y=TRUE)
f_lrm
summary(f_lrm)
par(mgp=c(1.6,0.6,0), mar=c(5,5,3,1))
nomogram <- nomogram(f_lrm, fun=function(x) 1/(1+exp(-x)),
                     fun.at=c(0.01,0.05,0.2,0.5,0.8,0.95,1),
                     funlabel="Prob of 结局",
                     conf.int=F,
                     abbrev=F)
plot(nomogram)
# Draw the calibration curve
summary(f_lrm)
cal1 <- calibrate(f_lrm, method="boot", B=1000)
plot(cal1)
# Add the model statistics as text below the plot
mtext("C=0.996 LR.chi2=337.49 d.f.=5 Pr(> chi2) <0.0001", side=1, line=3)
# 3. Logistic regression with ROC, DCA, and CIC curves
logit.model <- glm(Group~指标1+指标2+指标3+指标4+指标5+指标6,
data=data, family=binomial(link="logit"))
logit.result <- as.data.frame(summary(logit.model)$coefficients[, c(1, 4)])
logit.result <- cbind(logit.result, confint(logit.model))
colnames(logit.result) <- c("Coef", "p", "CI_lower", "CI_upper")
logit.result
# ROC curve
library(pROC)
logit.pred.prob <- predict(logit.model, newdata=data, type="response")
par(las=1, cex.axis=.8)
logit.roc <- roc(
y ~ pred,
data=data.frame(y=data$Group, pred=logit.pred.prob),
plot=T, ci=T, main="ROC Curve of Logistic Regression",
print.auc=T, print.auc.cex=.8,
print.thres=T, print.thres.cex=.8,
auc.polygon=T, max.auc.polygon=T, grid=T)
# DCA and CIC curves
# Generate predicted probabilities from the logistic model
data$pred_prob <- predict(logit.model, type = "response")
data$Group <- as.numeric(data$结局)
# Decision curve analysis (the result also feeds the clinical impact curve)
dca_result <- decision_curve(
  Group ~ pred_prob,
  data = data,
  policy = "opt-in",                  # clinical intervention policy
  thresholds = seq(0, 1, by = 0.05),  # threshold range
  bootstraps = 1000                   # number of bootstrap resamples
)
# Plot the DCA curve
plot_decision_curve(dca_result,
                    curve.names = "Model:结局~",
                    col = c("red"),
                    confidence.intervals = F) # suppress confidence intervals
# Re-run the decision curve analysis over a narrower threshold range for the clinical impact curve
dca_result <- decision_curve(
  Group ~ pred_prob,
  data = data,
  policy = "opt-in",                    # clinical intervention policy
  thresholds = seq(0, 0.3, by = 0.05),  # threshold range
  bootstraps = 1000                     # number of bootstrap resamples
)
# Plot the clinical impact curve
plot_clinical_impact(dca_result,
                     population.size = 322,
                     cost.benefit.axis = TRUE,
                     n.cost.benefits = 8,
                     col = c("red", "blue"))
Finally, let us review once more the concepts and core metrics of these five evaluation curves.

