生存分析云笔记

本文系复旦大学人口学系张震老师和华东师范大学人口研究所李强老师的数据分析课程笔记。

一个R生存分析应用的介绍网页Survival Analysis with R

生存分析应用:移动通讯运营商的用户流失问题分析

两篇中文文献:

李强, 张震. (2009). 生存分析中时间变量的选择[J]. 中国人口科学(6), 88-95.

李强, 徐刚, 陈丽梅.(2019) 生存分析的应用误区[J]. 中国人口科学(01):101-112.


基本概念

  • 社会调查只能观察到状态。人口学关心从一个状态到另一个状态转移的风险。试想每一种状态是一个格子,格子内部是人口频数。人口学能通过数频数的方式估计状态转移的风险。在一个状态内待的时间和风险有着密切的关系

\(T\)为随机变量,上帝也不知道

生存函数:\(S(t) = P(T > t)\)。是存活概率也是存活百分比。

失效函数(Failure Function):\(F(t)\),\(S(t) = 1 - F(t)\), \(F(t)\)是T的累积分布函数。

\(f(t) = \frac{{dF(t)}}{{dt}}\),即\(S(t)\)的斜率。等价于生存时间的概率密度函数(直方图)。是无条件的风险。

风险函数: \[h(t) = \mathop {\lim }\limits_{\Delta t \to 0} \frac{{P(t \le T < t + \Delta t|T \ge t)}}{{\Delta t}},0 < h(t) < + \infty \]

\(h(t)\)不是密度也不是概率。

  • censor(删截): 知道事件发生,但不知道事件何时发生。事件观测期内无法观测。

  • truncate(截平): 只有在给定观测期内的个体才会被观测到的状况。样本选择问题。生存函数对应于某种条件概率。

    • censor涉及到事件,truncate涉及到暴露量。
  • 能活到预期寿命者总是62%


https://cmdlinetips.com/2019/03/introduction-to-maximum-likelihood-estimation-in-r/

最大似然估计(MLE)

Cox回归避开了模型分布假定的问题。

  • 假定数据独立同分布。

    似然函数:

\[{\rm{L}}(\lambda ){\rm{ = }}\prod\limits_{i = 1}^\pi {P(T = t)} = \prod {f({t_i})}\]
\(\delta\)是积分小区间

\[P(T = t) \approx 2\delta f(t)\]

  • 发生风险率(Occurrence-exposure rate) \[\widehat \lambda = \frac{1}{{\bar t }} = \frac{n}{{\sum\limits_{}^{} {{t_i}} }}\] \(\sum\limits_{}^{} {{t_i}}\)是历险人年数。\(n\)是事件发生数。
    \(\lambda\)是个概率估计。
library(survival)
library(haven) 
library(ggplot2)
library(dplyr)
library(ggfortify)
library(survminer)
library(rpart)
library(rpart.plot)
library(stats4)
y <- rpois(n=10000, lambda=7)
df_Poisson <- data.frame(data=y)
#df_Poisson
df_Poisson %>% 
  ggplot(aes(data))+ 
  geom_histogram() +
  ylab("Count")+ xlab("data")+
  theme_bw(base_size = 16) +
  ggtitle("Poisson Distribution: rpois(10000,7)")

NROW(df_Poisson)
[1] 10000
llh_poisson <- function(lambda,y){
  # log(likelihood) by summing 
   llh <- sum(dpois(df_Poisson$data,lambda, log=T))
   return(llh)
}

lambdas <- seq(1,15, by=0.2)
ll <- sapply(lambdas,llh_poisson)
 
# save the lambdas and log-likelihoods in a data frame
df <- data.frame(ll=ll, lambda=lambdas)


df %>% 
  ggplot(aes(lambda,ll))+
  geom_point(size=4,color="dodgerblue")+
  xlab("Lambda") +
  ylab("Log Likelihood")+
  theme_bw(base_size = 16) +
  geom_vline(xintercept = lambdas[which.max(ll)], color="red",size=2)

ll <- function(lambda){
  # log(likelihood) by summing 
   -sum(dpois(df_Poisson$data,lambda, log=T))
}

fit0 <- mle(ll, start = list(lambda = 1), nobs = NROW(df_Poisson))
fit0

Call:
mle(minuslogl = ll, start = list(lambda = 1), nobs = NROW(df_Poisson))

Coefficients:
 lambda 
7.00488 
summary(fit0)
Maximum likelihood estimation

Call:
mle(minuslogl = ll, start = list(lambda = 1), nobs = NROW(df_Poisson))

Coefficients:
       Estimate Std. Error
lambda  7.00488 0.02646689

-2 log L: 47766.8 

删截(Censored)样本的似然函数

  • 完全样本的似然函数是在该时点t的密度函数\(f({t_i})\)。右删截数据是生存函数\(S({C_r})\)。左删截数据为累计分布函数\(1 - S({C_l})\)。区间删截为在该时段内的发生概率\(S(L) - S(R)\)

右删截: \[P({t_i} > 40) = S(40)\]

  • 40岁删截

\[L(\lambda ) = \prod\limits_{observed} {f({t_i})} \cdot \prod\limits_{censored} {S(40)}\]

Assume the first \(d\) individuals died, \((n-d)\) ware alive at 40:

\[\hat \lambda = \frac{d}{{\sum {{t_i} - (n - d) \cdot 40} }}\]

分子为事件发生数。

人口学、医学中最多的情况是left-truncated & right-cencored(LT-RC)数据。

非参数模型Aalen’s additive regression model可以替代Cox模型。R的survival包的aareg函数可以进行拟合。


生存分析估计

R生存分析介绍:survival analysis in R

使用?rpart::stagec查看数据

Kaplan–Meier estimator | 卡普兰迈耶法

卡普兰迈耶法又称乘积极限法(Product-Limit Method), 公式如下:

\[{\rm{\hat S}}(t) = \prod\limits_{{t_i} \le t} {(1 - {{\hat q}_i})} = \prod\limits_{{t_i} \le t} {(1 - {{{d_i}} \over {{n_i}}})}\]

\(n_i\)是在时间\(t_i\)中的历险人数,\(d_i\)是事件发生数。

table(stagec$pgstat)

 0  1 
92 54 
km <- with(stagec, Surv(pgtime, pgstat == 1))
head(km,10) 
 [1]  6.1+  9.4+  5.2   3.2   1.9   4.8+  5.8+  7.3+  3.7  15.9+
km_fit <- survfit(km ~ 1, stagec)
km_fit 
Call: survfit(formula = km ~ 1, data = stagec)

      n  events  median 0.95LCL 0.95UCL 
  146.0    54.0      NA     7.7      NA 
生存函数的Kaplan–Meier estimator是生命表$nPx$列的累乘。

Figure 1: 生存函数的Kaplan–Meier estimator是生命表\(nPx\)列的累乘。

summary(km_fit, times = c(0.2*(1:25))) #这个表和生命表等价
Call: survfit(formula = km ~ 1, data = stagec)

 time n.risk n.event survival std.err lower 95% CI upper 95% CI
  0.2    146       0    1.000 0.00000        1.000        1.000
  0.4    144       2    0.986 0.00962        0.968        1.000
  0.6    143       1    0.979 0.01174        0.957        1.000
  0.8    143       0    0.979 0.01174        0.957        1.000
  1.0    143       2    0.966 0.01505        0.937        0.996
  1.2    139       2    0.952 0.01768        0.918        0.987
  1.4    138       1    0.945 0.01883        0.909        0.983
  1.6    135       4    0.918 0.02273        0.874        0.963
  1.8    132       1    0.911 0.02357        0.866        0.958
  2.0    130       3    0.890 0.02589        0.841        0.942
  2.2    128       1    0.883 0.02659        0.833        0.937
  2.4    126       1    0.876 0.02727        0.825        0.932
  2.6    125       3    0.856 0.02916        0.800        0.915
  2.8    122       1    0.849 0.02974        0.792        0.909
  3.0    120       3    0.828 0.03135        0.769        0.892
  3.2    118       2    0.814 0.03235        0.753        0.880
  3.4    111       4    0.786 0.03414        0.722        0.856
  3.6    111       0    0.786 0.03414        0.722        0.856
  3.8    106       3    0.764 0.03538        0.698        0.837
  4.0    106       0    0.764 0.03538        0.698        0.837
  4.2    105       2    0.750 0.03615        0.682        0.824
  4.4    103       1    0.743 0.03653        0.674        0.818
  4.6    101       0    0.743 0.03653        0.674        0.818
  4.8     97       2    0.728 0.03727        0.658        0.805
  5.0     94       1    0.720 0.03763        0.650        0.798
生命表示例

Figure 2: 生命表示例

kme <- survfit(Surv(pgtime, pgstat) ~ 1,stagec)
autoplot(kme, surv.colour = 'orange', censor.colour = 'red')
Kaplan Meier Survival Curve

Figure 3: Kaplan Meier Survival Curve

km_trt_fit <- survfit(Surv(pgtime, pgstat) ~ grade,stagec)
autoplot(km_trt_fit)  
Kaplan Meier Survival Curve

Figure 4: Kaplan Meier Survival Curve

Cox比例风险模型

\[\ln{h_i}(t) = \ln {h_0}(t) + {\beta _1}{X_1} + {\beta _2}{X_2} + \cdot \cdot \cdot + {\beta _p}{X_p}\]

  • \({h_i}(t)\)是个体i在时间t时的风险值。\({h_0}(t)\)是所有个体在时间t时的非参数风险函数值。\({\beta_p}{X_p}\)是对个体i的解释变量和系数。

\[{{h(t|{X_i})} \over {h(t|{X_j})}} = {{{h_0}(t)\exp (X_i^T\beta )} \over {{h_0}(t)\exp (X_j^T\beta )}} = \exp [(X_i^T - X_j^T)\beta ]\]

  • 两个分别具有协变量\({X_i}\)\({X_j}\)的个体,其风险函数之比为相对危险度(risk ration,RR)或风险比(hazard ration,HR)

  • \({h_i}(t)\)\({h_0}(t)\)随时间变化。在满足比例风险假定后,不同特征样本的风险比(hazard ration,HR)不随时间变化。所以系数是固定的。无时变变量的Cox模型为宽数据(wide data)。

coxreg <- coxph(Surv(pgtime , pgstat)~age + eet + g2 + grade + gleason + ploidy,stagec)
summary(coxreg)
Call:
coxph(formula = Surv(pgtime, pgstat) ~ age + eet + g2 + grade + 
    gleason + ploidy, data = stagec)

  n= 134, number of events= 49 
   (12 observations deleted due to missingness)

                     coef exp(coef) se(coef)      z Pr(>|z|)   
age              -0.01859   0.98158  0.02899 -0.641  0.52139   
eet               0.09073   1.09498  0.37418  0.242  0.80840   
g2               -0.05569   0.94583  0.02664 -2.091  0.03654 * 
grade             1.37745   3.96478  0.43615  3.158  0.00159 **
gleason           0.21023   1.23397  0.17099  1.230  0.21887   
ploidytetraploid  0.98917   2.68901  0.42622  2.321  0.02030 * 
ploidyaneuploid   1.22007   3.38742  0.66723  1.829  0.06747 . 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

                 exp(coef) exp(-coef) lower .95 upper .95
age                 0.9816     1.0188    0.9274    1.0390
eet                 1.0950     0.9133    0.5259    2.2799
g2                  0.9458     1.0573    0.8977    0.9965
grade               3.9648     0.2522    1.6864    9.3212
gleason             1.2340     0.8104    0.8826    1.7252
ploidytetraploid    2.6890     0.3719    1.1663    6.2000
ploidyaneuploid     3.3874     0.2952    0.9161   12.5261

Concordance= 0.777  (se = 0.032 )
Likelihood ratio test= 47.93  on 7 df,   p=4e-08
Wald test            = 40.97  on 7 df,   p=8e-07
Score (logrank) test = 47.66  on 7 df,   p=4e-08
coxreg
Call:
coxph(formula = Surv(pgtime, pgstat) ~ age + eet + g2 + grade + 
    gleason + ploidy, data = stagec)

                     coef exp(coef) se(coef)      z       p
age              -0.01859   0.98158  0.02899 -0.641 0.52139
eet               0.09073   1.09498  0.37418  0.242 0.80840
g2               -0.05569   0.94583  0.02664 -2.091 0.03654
grade             1.37745   3.96478  0.43615  3.158 0.00159
gleason           0.21023   1.23397  0.17099  1.230 0.21887
ploidytetraploid  0.98917   2.68901  0.42622  2.321 0.02030
ploidyaneuploid   1.22007   3.38742  0.66723  1.829 0.06747

Likelihood ratio test=47.93  on 7 df, p=3.67e-08
n= 134, number of events= 49 
   (12 observations deleted due to missingness)
ggforest(coxreg,stagec)

#library(rms)
#dd=datadist(stagec)
#options(datadist="dd")
#f2 <- psm(Surv(time,status) ~ age+gender+grade,data =  #LIHC, dist='lognormal')
#med <- Quantile(f2) # 计算中位生存时间
#surv <- Survival(f2) # 构建生存概率函数

## 绘制COX回归中位生存时间的Nomogram图
#nom <- nomogram(f2, fun=function(x) #med(lp=x),funlabel="Median Survival Time")

#plot(nom)

#cox_fit <- survfit(coxreg)
#autoplot(cox_fit) #拟合值
##C-指数 Harrell's concordance index
#coxreg$concordance
#concordance(Surv(pgtime , pgstat) ~ predict(coxreg),stagec)

Aalen’s additive regression model

aaregfit <- aareg(Surv(pgtime, pgstat)~age + eet + g2 + grade + gleason + ploidy,stagec)
aaregfit
Call:
aareg(formula = Surv(pgtime, pgstat) ~ age + eet + g2 + grade + 
    gleason + ploidy, data = stagec)

  n=134 (12 observations deleted due to missingness)
    38 out of 38 unique event times used

                    slope      coef se(coef)      z       p
Intercept        -0.16500 -0.008530 0.022600 -0.377 0.70600
age              -0.00230 -0.000206 0.000312 -0.660 0.50900
eet              -0.01000 -0.001460 0.003910 -0.374 0.70900
g2               -0.00598 -0.000649 0.000246 -2.640 0.00832
grade             0.12200  0.010800 0.004560  2.380 0.01740
gleason           0.02260  0.001750 0.002250  0.780 0.43500
ploidytetraploid  0.09670  0.010700 0.004330  2.480 0.01320
ploidyaneuploid   0.37900  0.021700 0.019900  1.090 0.27400

Chisq=29.68 on 7 df, p=0.000109; test weights=aalen
ggforest(aaregfit,stagec)
Error in ggforest(aaregfit, stagec): 不是所有的class(model) == "coxph"都是TRUE
autoplot(aaregfit)

CART树生存分析

fit <- rpart(Surv(pgtime , pgstat)~age +
eet + g2 + grade + gleason + ploidy,stagec,method='exp')
fit
n= 146 

node), split, n, deviance, yval
      * denotes terminal node

 1) root 146 192.111100 1.0000000  
   2) grade< 2.5 61  44.799010 0.3634439  
     4) g2< 11.36 33   9.117405 0.1229835 *
     5) g2>=11.36 28  27.602190 0.7345610  
      10) gleason< 5.5 20  14.297110 0.5304115 *
      11) gleason>=5.5 8  11.094650 1.3069940 *
   3) grade>=2.5 85 122.441500 1.6148600  
     6) age>=56.5 75 103.062900 1.4255040  
      12) gleason< 7.5 50  66.119800 1.1407320  
        24) g2< 13.475 24  27.197170 0.8007306 *
        25) g2>=13.475 26  36.790960 1.4570210  
          50) g2>=17.915 15  20.332740 0.9789825 *
          51) g2< 17.915 11  13.459010 2.1714480 *
      13) gleason>=7.5 25  33.487250 2.0307290  
        26) g2>=15.29 10  11.588480 1.2156230 *
        27) g2< 15.29 15  18.939150 2.7053610 *
     7) age< 56.5 10  13.769010 3.1822320 *
rpart.plot(fit,shadow.col = 'gray',branch = 0.4)
CART树剪枝前

Figure 5: CART树剪枝前

fit2 <- prune(fit, cp = 0.017)
fit2$cptable
          CP nsplit rel error    xerror       xstd
1 0.12945955      0 1.0000000 1.0044808 0.07011258
2 0.04205598      1 0.8705405 0.8818044 0.07347477
3 0.02919986      2 0.8284845 0.8982644 0.07797793
4 0.01798864      3 0.7992846 0.9176689 0.08186875
5 0.01700000      4 0.7812960 0.9778645 0.08680235
rpart.plot(fit2,shadow.col = 'gray',branch = 0.4)
CART树剪枝后

Figure 6: CART树剪枝后

temp <- snip.rpart(fit2,6)
km <- survfit(Surv(pgtime, pgstat) ~ temp$where, stagec)
autoplot(km)
每个结点患者的生存风险估计

Figure 7: 每个结点患者的生存风险估计

ggplot(stagec)+
  geom_point(aes(age,g2,shape=factor(pgstat),
                 col=factor(pgstat)))+
  facet_wrap(~grade)
年龄、G2阶段的瘤细胞比例、肿瘤等级grade的关系

Figure 8: 年龄、G2阶段的瘤细胞比例、肿瘤等级grade的关系