R模型可视化

PDF版本下载

English Version

模型可视化是应用统计学的重要内容。任何模型都离不开结果的可视化。所谓模型,不过是将一堆散点简化为一条线。结果的可视化需要预测值。Hadley Wickham的modelr包提供用于预测的函数。预测的结果可以直接被ggplot2使用并画图。modelr支持管道操作,是将数据分析流程化的利器。

modelr包的主要函数有:

data_grid: 生成预测数据

add_predictions: 加入预测值

crossv_kfoldcrossv_mccrossv_loo: 交叉验证

library(dplyr)
library(tidyr)
library(ggplot2)
library(modelr)
library(haven)
library(cowplot)
library(stargazer)
`%>%` <- magrittr::`%>%`

基础回归

hatdt为作者个人整理的中国家庭追踪调查(CFPS)收入数据1

hatdt <- hatdt %>% 
  filter(type=='个人收入(元)') %>% 
  drop_na(agem,inc,fswt_nat)

set.seed(20191001)
sample <- sample(1:nrow(hatdt),600,replace = F)
sampled <- hatdt[sample,]

plota <- ggplot(hatdt,aes(agem,inc,weight=fswt_nat)) +
  geom_jitter(data=sampled,height=550,width=5,
              size =1.5,alpha=1/3) +
  geom_smooth(span =10,size=1) + 
  geom_smooth(method='lm',size=1,color='red') +
  ylim(0, 20000) +
  labs(x = "年龄",y = "人民币(元)") +
  theme_bw()

plotb <- ggplot() +
  geom_jitter(data=sampled,aes(agem,inc),
              height=550,width=5,size =1.5,alpha=1/3) +
  geom_quantile(data=hatdt,
  aes(agem,inc,weight=fswt_nat),
  size=1,color='red')+
  ylim(0, 20000) +
  labs(x = "年龄",y = "人民币(元)") +
  theme_bw()
plot_grid(plota,plotb,ncol = 2)
个人收入与年龄。左图:红线为线性回归模型。蓝色曲线为非参数回归。右图:三条线分别是分位数回归。高收入者收入随年龄下降的速度快于低收入者。可将中位数回归与左图线性回归相比较,观测其中的差异。

Figure 1: 个人收入与年龄。左图:红线为线性回归模型。蓝色曲线为非参数回归。右图:三条线分别是分位数回归。高收入者收入随年龄下降的速度快于低收入者。可将中位数回归与左图线性回归相比较,观测其中的差异。

交互项

交互项是计量经济学和应用统计学常用的机制分析技术。公式如下:

\[y = {\alpha _0} + {\alpha _1}{x_1} + {\alpha _2}{x_2} + {\alpha _3}{x_1}{x_2}\]

par(mar = c(4,4,1,0.5), mfrow = c(1, 2), cex.main = 1)
sq = 1:10
x = rep(sq, 10)
z = rep(sq, each = 10)
y = c(outer(sq, sq, function(x, z) 2 + x + 0.5 *
z + 0.5 * x * z + runif(1)))
symbols(x, z, y, bg = rgb(0, 1, 0, 0.3), fg = "blue",
main = "",
inches = 0.4)
y = c(outer(sq, sq, function(x, z) 2 + x + 0.5 *
z + runif(1)))
symbols(x, z, y, bg = rgb(0, 1, 0, 0.3), fg = "blue",
main = "", inches = 0.2)
谢益辉的交互效应表示方法。左图:$y = 2 + x + 0.5 z + 0.5 x z + \epsilon$。右图:$y = 2 + x + 0.5 z + \epsilon$。圆圈面积表示因变量$y$的大小;坐标轴分别表示自变量$x$和$z$。

Figure 2: 谢益辉的交互效应表示方法。左图:\(y = 2 + x + 0.5 z + 0.5 x z + \epsilon\)。右图:\(y = 2 + x + 0.5 z + \epsilon\)。圆圈面积表示因变量\(y\)的大小;坐标轴分别表示自变量\(x\)\(z\)

下面使用R自带数据,1994年加拿大劳动与收入动态调查(SLID)。详细信息请在R中输入?carData::SLID查看。

分类变量与连续变量交互

因变量为收入。自变量为教育年限(年)和使用的语言(英语、法语、其他)。下面分别展示了没有交互项和有交互项的模型。

#?carData::SLID

data(SLID,package = 'carData')
SLID <- SLID %>% drop_na()

mod1 <- lm(wages ~ education + language,SLID)
mod2 <- lm(wages ~ education * language,SLID)

grid <- SLID %>%
data_grid(education,language) %>%
gather_predictions(mod1,mod2)

ggplot(SLID,aes(education,wages))+
  geom_jitter(size=1,width=2,height=10,alpha=1/7)+
  geom_line(data=grid,
            aes(education,pred,color=language),size=1)+
  facet_wrap(~model)+
  xlim(0,25)+ ylim(0,40)+
  theme_bw()
左图:语言不与教育年限交互。不同语言使用者的斜率相同但截距不同。右图:交互模型,英语使用者的工资随教育回报率更高,假定其他条件不变。英语使用者在15年处超越了其他语言使用者。

Figure 3: 左图:语言不与教育年限交互。不同语言使用者的斜率相同但截距不同。右图:交互模型,英语使用者的工资随教育回报率更高,假定其他条件不变。英语使用者在15年处超越了其他语言使用者。

两个连续变量交互

对两个连续交互变量的可视化是一个难题。较好的解决办法是分箱。使用modelrseq_range函数对其中一个连续变量进行分箱。

mod1 <- lm(wages ~ education + age,SLID)
mod2 <- lm(wages ~ education * age,SLID)

grid <- SLID %>%
data_grid(education,age = seq_range(age, 5)) %>%
gather_predictions(mod1,mod2)

ggplot(SLID,aes(education,wages))+
  geom_jitter(size=1,width=2,
              height=10,alpha=1/7)+
  geom_line(data=grid,aes(education,pred,
                          color=age,group=age),size=1)+
  facet_wrap(~model)+
  xlim(0,25)+ ylim(0,40)+
  theme_bw()
无交互效应和有交互效应的区别: 左图体现了不同年龄段者的教育回报率相同(斜率相同)。右图体现了一个因素的大小随着另一个因素的变化而变化。随着年龄的升高教育回报率也在升高。

Figure 4: 无交互效应和有交互效应的区别: 左图体现了不同年龄段者的教育回报率相同(斜率相同)。右图体现了一个因素的大小随着另一个因素的变化而变化。随着年龄的升高教育回报率也在升高。

来个负相关的:

data(freeny)
partial <- lm(y~lag.quarterly.revenue+price.index+
                income.level+market.potential,freeny)

modela <- lm(y~price.index+market.potential,freeny)
modelb <- lm(y~price.index*market.potential,freeny)

stargazer(modela,modelb,partial,
          title='回归结果',
          dep.var.caption='',
          dep.var.labels='Quarterly Revenue',
          header=F,keep.stat=c('n','rsq'), 
         no.space=T,type='html',align=T)
回归结果
Quarterly Revenue
(1) (2) (3)
lag.quarterly.revenue 0.124
(0.142)
price.index -0.414* -39.796*** -0.754***
(0.210) (5.737) (0.161)
income.level 0.767***
(0.134)
market.potential 4.030*** -10.270*** 1.331**
(0.434) (2.102) (0.509)
price.index:market.potential 2.979***
(0.434)
Constant -41.499*** 147.459*** -10.473*
(6.602) (27.863) (6.022)
Observations 39 39 39
R2 0.994 0.997 0.998
Note: p<0.1; p<0.05; p<0.01
gridt <- freeny %>% 
  data_grid(price.index,
            market.potential=
            seq_range(market.potential,5)) %>%
  gather_predictions(modela,modelb)


ggplot(freeny,aes(price.index,y,
                  color=market.potential))+
  geom_point()+
  geom_line(data=gridt,aes(price.index,pred,
color=market.potential,
group=market.potential))+
  facet_wrap(~model)+
  theme_bw()
左图无交互效应,可视为控制变量。右图为两个连续变量的交互效应

Figure 5: 左图无交互效应,可视为控制变量。右图为两个连续变量的交互效应

多项式回归

  • 多项式回归是平滑方法的基础。
set.seed(2019)
x <- seq(0,4,length=100)
y <- -x^2 + 3*x + jitter(rep(5:9,each =20),2) +3
df <- data.frame(x,y)

reg <- lm(y ~ x + I(x^2),df)

grid <- df %>%
data_grid(x) %>%
gather_predictions(reg)

ggplot(df,aes(x,y))+
  geom_point(size =2,alpha=1/3)+
  geom_line(data=grid,aes(x,pred),size=1,color='blue')+
  theme_bw()
对一个模拟数据进行二次项回归。

Figure 6: 对一个模拟数据进行二次项回归。

下面使用多项式回归拟合CFPS数据:

\[y = {\alpha _0} + {\alpha _1}{x_1} + {\alpha _2}x_1^2\] \[y = {\alpha _0} + {\alpha _1}{x_1} + {\alpha _2}x_1^2 + {\alpha _3}x_1^3\]

mtrga <- lm(inc~agem+I(agem^2),hatdt)
mtrgb <- lm(inc~agem+I(agem^2)+I(agem^3),hatdt)

grid <- hatdt %>%
data_grid(agem) %>%
gather_predictions(mtrga,mtrgb)

ggplot() +
  geom_jitter(data=sampled,aes(agem,inc),
              height=550,width=5,size =1.5,alpha=1/3) +
  geom_line(data=grid,aes(agem,pred),
            size=1,color='blue')+
  facet_wrap(~model) +
  ylim(0, 20000) +
  labs(x = "年龄",y = "人民币(元)") +
  theme_bw()
分别对CFPS数据进行二次项和三次项回归。三次项导致了过拟合。

Figure 7: 分别对CFPS数据进行二次项和三次项回归。三次项导致了过拟合。

局部加权回归散点平滑

  • Locally Weighted Scatterplot Smoother,LOWESS

\[{y_i} = g({x_i}) + {\varepsilon _i}\] \(g\)是在\(x\)带宽\(\alpha\)范围内进行的多项式回归。

data(PlantCounts,package = 'MSG')
g <- ggplot(PlantCounts,
            aes(altitude,counts)) +
  geom_point(size=1.5,alpha=1/3) +
  ylim(0,80)+
  theme_bw()

for (i in seq(1,1700,10)){
  col = rgb(0.4,i/1700,0.4)
  g <- g + stat_smooth(geom='line',
                       span=i/1000,
                       size=0.6,
                       se=F,color=col)
}

f <- ggplot(PlantCounts,
            aes(altitude, counts)) +
  geom_point(size=1.5,alpha=1/3) +
  ylim(0,80)+
  theme_bw()

for (i in 1:200) {
idx <- sample(nrow(PlantCounts),300,T)
df <- PlantCounts[idx,]
f <- f + stat_smooth(geom='line',
                     data=df,
                     aes(altitude,counts),
                     span=1,size=0.5,
                     se=F,alpha=1/10)
}

e <- ggplot(PlantCounts,
            aes(altitude, counts)) +
  geom_point(size=1.5,alpha=1/3) +
  geom_smooth(span=1,size=1)+
  ylim(0,80)+
  theme_bw()
plot_grid(g,f,e,ncol = 2)
左图:设置不同带宽进行LOWESS回归。右图:Bootstrap重抽样200次的结果。

Figure 8: 左图:设置不同带宽进行LOWESS回归。右图:Bootstrap重抽样200次的结果。

样条

  • Splines

  • 结点为\(a\), \(b\), \(c\)的样条回归函数为:

\[y = \alpha + {\beta _1}x + {\beta _2}{(x - a)_ + } + {\beta _3}{(x - b)_ + } + {\beta _4}{(x - c)_ + }\]

\({(\mu )_ + } = \mu\)\(\mu>0\),否则\({(\mu )_ + } = 0\)

library(ISLR)
library(splines)
data(wage,package = 'ISLR')
fita <- lm(wage ~ bs(age,degree=1,knots = c(25,40,60)),Wage)
fitb <- lm(wage ~ bs(age,knots = c(25,40,60)),Wage)
summary(fita)

Call:
lm(formula = wage ~ bs(age, degree = 1, knots = c(25, 40, 60)), 
    data = Wage)

Residuals:
    Min      1Q  Median      3Q     Max 
-99.795 -24.686  -4.856  15.344 204.671 

Coefficients:
                                            Estimate Std. Error t value
(Intercept)                                   54.333      5.957   9.120
bs(age, degree = 1, knots = c(25, 40, 60))1   37.645      6.817   5.522
bs(age, degree = 1, knots = c(25, 40, 60))2   65.847      6.019  10.940
bs(age, degree = 1, knots = c(25, 40, 60))3   63.850      6.319  10.104
bs(age, degree = 1, knots = c(25, 40, 60))4   33.772     10.580   3.192
                                            Pr(>|t|)    
(Intercept)                                  < 2e-16 ***
bs(age, degree = 1, knots = c(25, 40, 60))1 3.64e-08 ***
bs(age, degree = 1, knots = c(25, 40, 60))2  < 2e-16 ***
bs(age, degree = 1, knots = c(25, 40, 60))3  < 2e-16 ***
bs(age, degree = 1, knots = c(25, 40, 60))4  0.00143 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 39.91 on 2995 degrees of freedom
Multiple R-squared:  0.08665,   Adjusted R-squared:  0.08543 
F-statistic: 71.03 on 4 and 2995 DF,  p-value: < 2.2e-16
summary(fitb)

Call:
lm(formula = wage ~ bs(age, knots = c(25, 40, 60)), data = Wage)

Residuals:
    Min      1Q  Median      3Q     Max 
-98.832 -24.537  -5.049  15.209 203.207 

Coefficients:
                                Estimate Std. Error t value Pr(>|t|)    
(Intercept)                       60.494      9.460   6.394 1.86e-10 ***
bs(age, knots = c(25, 40, 60))1    3.980     12.538   0.317 0.750899    
bs(age, knots = c(25, 40, 60))2   44.631      9.626   4.636 3.70e-06 ***
bs(age, knots = c(25, 40, 60))3   62.839     10.755   5.843 5.69e-09 ***
bs(age, knots = c(25, 40, 60))4   55.991     10.706   5.230 1.81e-07 ***
bs(age, knots = c(25, 40, 60))5   50.688     14.402   3.520 0.000439 ***
bs(age, knots = c(25, 40, 60))6   16.606     19.126   0.868 0.385338    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 39.92 on 2993 degrees of freedom
Multiple R-squared:  0.08642,   Adjusted R-squared:  0.08459 
F-statistic: 47.19 on 6 and 2993 DF,  p-value: < 2.2e-16
grid <- Wage %>% 
  data_grid(age) %>%
  gather_predictions(fita,fitb)

ggplot(Wage,aes(age,wage))+
  geom_point(size=1,alpha=1/7)+
  geom_line(data=grid,aes(age,pred),
            size=1,color='purple')+
  facet_wrap(~model)+
  theme_bw()
左图:1次项样条。右图:3次项样条

Figure 9: 左图:1次项样条。右图:3次项样条

广义可加模型

\[{y_i} = {\beta _0} + {f_1}({x_1}) + {f_2}({x_2}) + \cdot \cdot \cdot + {f_k}({x_k})\]

library(mgcv)
par(mfrow=c(1,1),mar=c(2,2,2,2))

gamfit <- gam(wages~ s(education)+bs(age, degree = 2, knots = c(20,30,40,55)),data=SLID)
summary(gamfit)

Family: gaussian 
Link function: identity 

Formula:
wages ~ s(education) + bs(age, degree = 2, knots = c(20, 30, 
    40, 55))

Parametric coefficients:
                                                Estimate Std. Error
(Intercept)                                       9.9635     0.8844
bs(age, degree = 2, knots = c(20, 30, 40, 55))1  -1.9846     1.1801
bs(age, degree = 2, knots = c(20, 30, 40, 55))2   1.2099     0.9411
bs(age, degree = 2, knots = c(20, 30, 40, 55))3   7.6049     0.9651
bs(age, degree = 2, knots = c(20, 30, 40, 55))4   8.7600     0.9630
bs(age, degree = 2, knots = c(20, 30, 40, 55))5   8.7637     1.1253
bs(age, degree = 2, knots = c(20, 30, 40, 55))6   5.0601     1.8398
                                                t value Pr(>|t|)    
(Intercept)                                      11.266  < 2e-16 ***
bs(age, degree = 2, knots = c(20, 30, 40, 55))1  -1.682  0.09271 .  
bs(age, degree = 2, knots = c(20, 30, 40, 55))2   1.286  0.19869    
bs(age, degree = 2, knots = c(20, 30, 40, 55))3   7.880 4.20e-15 ***
bs(age, degree = 2, knots = c(20, 30, 40, 55))4   9.096  < 2e-16 ***
bs(age, degree = 2, knots = c(20, 30, 40, 55))5   7.788 8.62e-15 ***
bs(age, degree = 2, knots = c(20, 30, 40, 55))6   2.750  0.00598 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
               edf Ref.df     F p-value    
s(education) 5.339  6.439 93.08  <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.301   Deviance explained = 30.3%
GCV = 43.394  Scale est. = 43.259    n = 3987
vis.gam(gamfit,color='terrain',
        theta=30, phi=30,
        zlab='Wage')
教育年限,年龄与收入

Figure 10: 教育年限,年龄与收入

SLID <- SLID %>% 
  mutate(lgwage=log1p(wages))
gamfit2 <- gam(lgwage ~ s(education)+ bs(age, degree = 2, knots = c(20,30,40,55)),data = SLID)
summary(gamfit2)

Family: gaussian 
Link function: identity 

Formula:
lgwage ~ s(education) + bs(age, degree = 2, knots = c(20, 30, 
    40, 55))

Parametric coefficients:
                                                Estimate Std. Error
(Intercept)                                      2.20610    0.05091
bs(age, degree = 2, knots = c(20, 30, 40, 55))1 -0.10426    0.06794
bs(age, degree = 2, knots = c(20, 30, 40, 55))2  0.29695    0.05418
bs(age, degree = 2, knots = c(20, 30, 40, 55))3  0.62907    0.05556
bs(age, degree = 2, knots = c(20, 30, 40, 55))4  0.68858    0.05544
bs(age, degree = 2, knots = c(20, 30, 40, 55))5  0.67619    0.06478
bs(age, degree = 2, knots = c(20, 30, 40, 55))6  0.41298    0.10589
                                                t value Pr(>|t|)    
(Intercept)                                      43.330  < 2e-16 ***
bs(age, degree = 2, knots = c(20, 30, 40, 55))1  -1.535    0.125    
bs(age, degree = 2, knots = c(20, 30, 40, 55))2   5.481 4.49e-08 ***
bs(age, degree = 2, knots = c(20, 30, 40, 55))3  11.322  < 2e-16 ***
bs(age, degree = 2, knots = c(20, 30, 40, 55))4  12.421  < 2e-16 ***
bs(age, degree = 2, knots = c(20, 30, 40, 55))5  10.438  < 2e-16 ***
bs(age, degree = 2, knots = c(20, 30, 40, 55))6   3.900 9.77e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
               edf Ref.df     F p-value    
s(education) 5.398  6.501 73.54  <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =   0.34   Deviance explained = 34.2%
GCV = 0.14374  Scale est. = 0.14329   n = 3987
vis.gam(gamfit2,color='heat',
        theta=30, phi=30,
        zlab='log(Wage)')
教育年限,年龄与对数收入

Figure 11: 教育年限,年龄与对数收入

Box-Cox 变换

为保证变量的正态性进行的统计学转换。它同样是Power transform的一种。

\[ {y(\lambda)}=\left\{ \begin{aligned} {\textstyle{{{y^\lambda } - 1} \over \lambda }} & , & \mbox{if}\quad\lambda\ne 0 \\ \log y & , & \mbox{if}\quad\lambda = 0 \end{aligned} \right. \]

注意Box-Cox 变换只能应用于正数值。如果数据中有负数\(y<0\), \(y\) 会被整体加上一个 \(\lambda _2\) 值。如下所示:

\[ {y(\lambda)}=\left\{ \begin{aligned} {\textstyle{{{{(y + {\lambda _2})}^{{\lambda _1}}} - 1} \over {{\lambda _1}}}} & , & \mbox{if}\quad\lambda_1\ne 0 \\ \log (y + {\lambda _2}{\rm{)}} & , & \mbox{if}\quad\lambda_1 = 0 \end{aligned} \right. \]

library(MASS)
x = rf(500,30,30)
result = boxcox(x~1,lambda = seq(-0.5,0,5),
                plotit = F)
mylambda = result$x[which.max(result$y)]
mylambda
[1] -0.5
x2 <- (x^mylambda-1)/mylambda
x2 = as.data.frame(x2)

ggplot(x2,aes(x2)) +
  geom_density(data=as.data.frame(x),
               aes(x),fill='purple',
               col='purple',alpha=1/5)+
  geom_density(data=as.data.frame(x2),
               aes(x2),fill='green',
               col='green',alpha=1/5) +
  labs(x='')
紫色密度图为原分布;绿色密度图为Box-Cox 变换后的分布

Figure 12: 紫色密度图为原分布;绿色密度图为Box-Cox 变换后的分布

Python中的做法:scipy.special.boxcox

import numpy as np
import seaborn as sns
from scipy.stats import f
from scipy.stats import boxcox

r = f.rvs(30, 30, size=500)
np.random.seed(1234) 
transformed,fitted_lambda = boxcox(r)
print('lambda=',fitted_lambda)
sns.distplot(r, rug=True,color='green')
sns.distplot(transformed, rug=True)

  1. 可从Github下载↩︎