Cross-validation with the caret package

Basic concepts

Cross-validation is a standard model-selection technique in machine learning, and k-fold cross-validation is its most widely used variant. It reuses the data by partitioning them into k mutually exclusive subsets: each subset serves once as the test set, while the union of the remaining k-1 subsets serves as the training set. The training set is used to fit the model; the test set is used to evaluate and select it. When the sample size is limited, reusing the data in this way reduces the variability caused by any single split, and choosing the model with the smallest cross-validated test error improves its ability to generalize.

Splitting the data into training and test sets also gives a more objective assessment of a model's generalization performance, and cross-validation is the standard evaluation procedure when tuning a model's hyperparameters.
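To make the mechanics concrete, here is a minimal sketch of k-fold cross-validation done by hand with caret's createFolds(); the built-in mtcars data and the model mpg ~ wt + hp are illustrative assumptions, not part of the analysis that follows.

library(caret)
set.seed(1)
folds <- createFolds(mtcars$mpg, k = 5)              # list of 5 held-out index sets
cv_rmse <- sapply(folds, function(idx) {
  fit  <- lm(mpg ~ wt + hp, data = mtcars[-idx, ])   # train on the other k-1 folds
  pred <- predict(fit, newdata = mtcars[idx, ])      # predict the held-out fold
  sqrt(mean((mtcars$mpg[idx] - pred)^2))             # fold-level RMSE
})
mean(cv_rmse)                                        # cross-validated RMSE estimate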

The main types of cross-validation are:

  • Hold-out cross-validation

  • k-fold cross-validation

  • Leave-one-out cross-validation

How k-fold cross-validation works: the data are first split into a training set and a test set; the training set is then divided into k folds, and each fold in turn serves as the validation set while the remaining k-1 folds are used for training.

Figure 1: k-fold cross-validation: each of the k folds of the training set is used once as the validation set, with the remaining k-1 folds used for training.

Yihui Xie's statistical-animation R package animation gives an intuitive visual demonstration of cross-validation.
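A quick sketch of how that demonstration can be launched; the function name cv.ani() is an assumption based on the package's demo functions, so check help(package = "animation") if your version names it differently.

# install.packages("animation")   # if not yet installed
library(animation)
cv.ani(k = 10)   # assumed demo call: animates the folds of a 10-fold split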

Task: predicting Boston house prices from a set of features

Type ?mlbench::BostonHousing to see the documentation for the dataset. The examples below use the caret package (classification and regression training) for cross-validation.

`%>%` <- magrittr::`%>%`  # make the pipe available without attaching magrittr
library(caret)
library(readr)
library(tidyr)
library(dplyr)
library(ggplot2)
library(pROC)
library(stargazer)
library(ROSE)

R examples

10-fold cross-validation (k-fold cross-validation)

rfControl <- trainControl(   # 10-fold cross-validation
  method = "cv",
  number = 10
)

library(mlbench)  # load the dataset
data(BostonHousing)
head(BostonHousing)
     crim zn indus chas   nox    rm  age    dis rad tax ptratio      b
1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90
2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90
3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83
4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63
5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90
6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12
  lstat medv
1  4.98 24.0
2  9.14 21.6
3  4.03 34.7
4  2.94 33.4
5  5.33 36.2
6  5.21 28.7
nrow(BostonHousing)  # sample size
[1] 506
rpartFit <- train(medv ~ .,
                  data = BostonHousing,
                  method = "rpart",
                  trControl = rfControl)
rpartFit
CART 

506 samples
 13 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 456, 457, 455, 456, 454, 456, ... 
Resampling results across tuning parameters:

  cp          RMSE      Rsquared   MAE     
  0.07165784  5.798693  0.6079069  4.150778
  0.17117244  6.310756  0.5249764  4.646935
  0.45274420  8.005426  0.3648840  5.767876

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.07165784.

Partitioning the dataset D into k subsets can itself be done in many ways. To reduce the variability introduced by any particular partition, k-fold cross-validation is usually repeated p times with different random partitions, and the final estimate is the mean of these p rounds of k-fold cross-validation; "10 times 10-fold cross-validation" is a common choice. Each repetition uses a different set of training/validation splits.

repeatedcv <- trainControl(method = "repeatedcv", number = 10, repeats = 10, savePredictions = TRUE)
rpartFit <- train(medv ~ .,
                  data = BostonHousing,
                  method = "rpart",
                  trControl = repeatedcv)
rpartFit
CART 

506 samples
 13 predictor

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 10 times) 
Summary of sample sizes: 457, 456, 454, 456, 456, 455, ... 
Resampling results across tuning parameters:

  cp          RMSE      Rsquared   MAE     
  0.07165784  5.812733  0.6069060  4.175634
  0.17117244  6.662202  0.4786666  4.904228
  0.45274420  8.307043  0.3308596  6.034919

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.07165784.
importance <- varImp(rpartFit, scale = F)
imp <- as.data.frame(importance[["importance"]])

imp <- imp %>%
  mutate(name = rownames(imp)) %>%
  filter(Overall > 0) %>%
  rename('impt' = 'Overall')

ggplot(imp, aes(reorder(name, impt), impt)) +
  geom_segment(aes(x = reorder(name, impt), xend = reorder(name, impt), y = 0, yend = impt), size = 1) +
  geom_point(size = 3, color = "red") +
  coord_flip() +
  labs(x = "Variable", y = "Importance") +   # x is the variable name, y its importance
  theme_bw()
Figure 2: Variable importance of the rpart model.

Leave-one-out cross-validation (LOOCV)

rfControl <- trainControl(
  method = "LOOCV"   # leave-one-out cross-validation
)
rpartFit <- train(medv ~ .,
                  data = BostonHousing,
                  method = "rpart",
                  trControl = rfControl)
rpartFit
CART 

506 samples
 13 predictor

No pre-processing
Resampling: Leave-One-Out Cross-Validation 
Summary of sample sizes: 505, 505, 505, 505, 505, 505, ... 
Resampling results across tuning parameters:

  cp          RMSE      Rsquared      MAE     
  0.07165784  6.060240  0.5665051649  4.289854
  0.17117244  7.489499  0.3524415744  6.088882
  0.45274420  9.700709  0.0008358062  7.539715

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.07165784.
varImp(rpartFit, scale = F)
rpart variable importance

        Overall
lstat    0.8646
nox      0.5008
ptratio  0.4643
rm       0.4527
indus    0.2595
crim     0.2484
age      0.2089
tax      0.0000
zn       0.0000
dis      0.0000
rad      0.0000
b        0.0000
chas1    0.0000
  • Several variables turn out to contribute nothing to the prediction, so they can be dropped from subsequent analyses (see the sketch below).
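As a sketch of that pruning step, the predictors reported with zero importance above (tax, zn, dis, rad, b, chas) can be dropped before refitting; the reduced model is purely illustrative and was not part of the original run.

boston_reduced <- BostonHousing %>%
  select(-tax, -zn, -dis, -rad, -b, -chas)        # drop the zero-importance columns
rpartFit_reduced <- train(medv ~ .,
                          data = boston_reduced,
                          method = "rpart",
                          trControl = rfControl)  # same LOOCV control as above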

Training a k-nearest-neighbour model with caret

  • Data: the Pokemon dataset
  • Task: predict whether a Pokemon is Legendary from its attributes
  • ID: identifier of each Pokemon
  • Name: name of each Pokemon
  • Type1: primary type, e.g. Water or Fire
  • Type2: secondary type, for Pokemon that have more than one type
  • Total: overall strength; higher generally means stronger
  • HP: hit points
  • Attack: attack strength
  • Defense: defensive strength
  • SP Attack: special attack strength, usually higher than ordinary Attack
  • SP Defense: special defense strength, usually higher than ordinary Defense
  • Speed: speed
  • Generation: which generation the Pokemon belongs to (five generations in this data)
  • Legendary: whether the Pokemon is a legendary one, such as Mewtwo or Mew

Reference code

pokemon <- read_csv("E:/R_codes/others/Pokemon.csv")
colnames(pokemon)[1] <- 'id'      # rename variables
colnames(pokemon)[3] <- 'Type1'
colnames(pokemon)[4] <- 'Type2'

pokemon <- pokemon %>%
  drop_na(Type1, Type2, Total, HP, Defense, Attack, Speed, Generation) %>%
  select(Legendary, Type1, Type2, Total, HP, Defense, Attack, Speed, Generation) %>%
  mutate(Legendary = as.factor(Legendary),
         Type1 = recode(Type1, 'Fairy' = 'oth', 'Fighting' = 'oth', 'Flying' = 'oth'))
ggplot(pokemon,aes(Total,Attack,shape=Legendary,color=Legendary)) +
  geom_point(size=3)
Figure 3: Before oversampling

# balance the classes by resampling (ROSE::ovun.sample; method = "both" mixes over- and under-sampling)
balance <- ovun.sample(Legendary ~ Type1 + Type2 + Total + HP + Defense + Attack + Speed + Generation,
                       data = pokemon, method = "both", p = 0.5, N = 748, seed = 1)$data

sampled <- sample(1:nrow(balance), nrow(balance) * 0.7, replace = FALSE)  # 70/30 train/test split
trainpok <- balance[sampled, ]
testpok <- balance[-sampled, ]

ggplot(balance,aes(Total,Attack,shape=Legendary,color=Legendary)) +
  geom_jitter(size=1.5,height = 10, width = 40)
Figure 4: After oversampling

The k-nearest-neighbour algorithm (k-NN)

  • For each sample in the test set, the k-nearest-neighbour algorithm finds the k closest samples in the training set and assigns the test sample to the majority class among those k neighbours (see the sketch after this list).
  • When the sample size is small, k is usually chosen by cross-validation.
  • The larger k is, the simpler the model; the smaller k is, the more complex the model and the more prone it is to overfitting.
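Before handing the problem to caret, the raw algorithm can be seen directly with class::knn(), which classifies each test row by a majority vote of its k nearest training rows; restricting the sketch to a few numeric predictors (and skipping scaling) is an assumption made purely for illustration.

library(class)
num_vars <- c("Total", "HP", "Defense", "Attack", "Speed")   # numeric predictors only
knn_raw <- knn(train = trainpok[, num_vars],
               test  = testpok[, num_vars],
               cl    = trainpok$Legendary,
               k     = 9)
table(predicted = knn_raw, actual = testpok$Legendary)

caret's method = "knn" wraps this algorithm and, combined with trainControl(), tunes k by repeated cross-validation: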
trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10, savePredictions = TRUE)

set.seed(213)
knn_fit <- train(Legendary ~ Type1 + Type2 + Total + HP + Defense + Attack + Speed + Generation,
                 data = trainpok, method = "knn",
                 trControl = trctrl,
                 tuneLength = 10)
knn_fit
k-Nearest Neighbors 

523 samples
  8 predictor
  2 classes: 'FALSE', 'TRUE' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 10 times) 
Summary of sample sizes: 470, 470, 471, 471, 472, 471, ... 
Resampling results across tuning parameters:

  k   Accuracy   Kappa    
   5  0.9430503  0.8863625
   7  0.9449590  0.8901647
   9  0.9501266  0.9004474
  11  0.9375277  0.8753408
  13  0.9388303  0.8779424
  15  0.9414972  0.8832631
  17  0.9416970  0.8836506
  19  0.9436204  0.8874898
  21  0.9447597  0.8897732
  23  0.9422666  0.8848008

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 9.
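tuneLength = 10 lets caret pick the candidate values of k itself; they can also be supplied explicitly through tuneGrid, as in this sketch that reuses trctrl from above (the grid seq(5, 23, by = 2) is an arbitrary choice).

set.seed(213)
knn_grid_fit <- train(Legendary ~ Type1 + Type2 + Total + HP + Defense + Attack + Speed + Generation,
                      data = trainpok, method = "knn",
                      trControl = trctrl,
                      tuneGrid = data.frame(k = seq(5, 23, by = 2)))  # odd values of k from 5 to 23

Returning to knn_fit, the cross-validated accuracy for each candidate k can be plotted: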
kacc <- as.data.frame(knn_fit[["results"]])
kacc <- kacc[, 1:2]   # keep k and Accuracy
ggplot(kacc, aes(k, Accuracy)) +
  geom_line() +
  geom_point() +
  scale_x_continuous(breaks = seq(5, 26, 1))
Figure 5: Accuracy of the k-NN model as a function of k

testpok <- testpok %>%
  mutate(pred=predict(knn_fit,testpok))

prob1 <- predict(knn_fit,testpok,type='prob')
pr1 <- prob1[,2]   # predicted probability of class 'TRUE'

trainpok <- trainpok %>%
  mutate(pred=predict(knn_fit,trainpok))

prob2 <- predict(knn_fit,trainpok,type='prob')
pr2 <- prob2[,2]   # predicted probability of class 'TRUE'

conMatrix1 <- confusionMatrix(testpok$pred,testpok$Legendary) 
#matrix <- conMatrix1[["table"]]
conMatrix1
Confusion Matrix and Statistics

          Reference
Prediction FALSE TRUE
     FALSE   106    0
     TRUE     11  108
                                          
               Accuracy : 0.9511          
                 95% CI : (0.9142, 0.9753)
    No Information Rate : 0.52            
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.9024          
                                          
 Mcnemar's Test P-Value : 0.002569        
                                          
            Sensitivity : 0.9060          
            Specificity : 1.0000          
         Pos Pred Value : 1.0000          
         Neg Pred Value : 0.9076          
             Prevalence : 0.5200          
         Detection Rate : 0.4711          
   Detection Prevalence : 0.4711          
      Balanced Accuracy : 0.9530          
                                          
       'Positive' Class : FALSE           
                                          
conMatrix2 <- confusionMatrix(trainpok$pred,trainpok$Legendary)
#matrix <- conMatrix[["table"]]
conMatrix2
Confusion Matrix and Statistics

          Reference
Prediction FALSE TRUE
     FALSE   245    0
     TRUE     22  256
                                         
               Accuracy : 0.9579         
                 95% CI : (0.937, 0.9735)
    No Information Rate : 0.5105         
    P-Value [Acc > NIR] : < 2.2e-16      
                                         
                  Kappa : 0.916          
                                         
 Mcnemar's Test P-Value : 7.562e-06      
                                         
            Sensitivity : 0.9176         
            Specificity : 1.0000         
         Pos Pred Value : 1.0000         
         Neg Pred Value : 0.9209         
             Prevalence : 0.5105         
         Detection Rate : 0.4685         
   Detection Prevalence : 0.4685         
      Balanced Accuracy : 0.9588         
                                         
       'Positive' Class : FALSE          
                                         
testpok$Legendary <- as.ordered(testpok$Legendary)
trainpok$Legendary <- as.ordered(trainpok$Legendary)

roca <- roc(testpok$Legendary,pr1)
rocb <- roc(trainpok$Legendary,pr2)

ROC curves

ggroc(list('Test set' = roca, 'Training set' = rocb), size = 1) +
  geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1), color = 'black', linetype = 'dashed') +
  theme(legend.title = element_blank())
Figure 6: ROC curves, drawn with the ggroc() function from the pROC package. The model performs slightly worse on the test set than on the training set.
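The gap between the two curves can be quantified with the area under the curve; pROC's auc() works directly on the roc objects created above.

auc(roca)   # test-set AUC
auc(rocb)   # training-set AUC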

Titanic

library(readxl)
titanic <- read_excel('E:/MaLearning/titanic.xls')
sampled <- sample(1:nrow(titanic),nrow(titanic)*0.75,replace=F)
titanic_train <- titanic[sampled,]
titanic_test <- titanic[-sampled,]

titanic_train <- titanic_train %>% 
  drop_na(survived,sex,age,sibsp,parch,fare,embarked) %>% 
  mutate(pclass=as.factor(pclass))

titanic_test <- titanic_test %>% 
  drop_na(survived,sex,age,sibsp,parch,fare,embarked) %>% 
  mutate(pclass=as.factor(pclass))

control <- trainControl(method = 'repeatedcv', number = 10, repeats = 10, savePredictions = TRUE)

set.seed(213)
# survived is kept numeric (0/1), so caret treats this as a regression (hence the RMSE/Rsquared/MAE
# output below); the predictions are probabilities, thresholded at 0.5 further down.
logfit <- train(survived ~ sex + age + sibsp + parch + fare + embarked,
                data = titanic_train,
                method = 'glm', family = 'binomial',
                trControl = control,
                tuneLength = 2)
logfit
Generalized Linear Model 

789 samples
  6 predictor

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 10 times) 
Summary of sample sizes: 710, 710, 711, 710, 710, 710, ... 
Resampling results:

  RMSE       Rsquared   MAE     
  0.4033302  0.3396324  0.321572
titanic_test <- titanic_test %>% 
  mutate(pred=predict(logfit,titanic_test),
         pred=ifelse(pred>=0.5,'1','0'),
         pred=as.factor(pred),
         survived=as.factor(survived))

prob1 <- predict(logfit,titanic_test)

titanic_train <- titanic_train %>%
  mutate(pred=predict(logfit,titanic_train),
         pred=ifelse(pred>=0.5,'1','0'),
         pred=as.factor(pred),
         survived=as.factor(survived))

prob2 <- predict(logfit,titanic_train)

conMatrix1 <- confusionMatrix(titanic_test$pred, titanic_test$survived)
matrix <- conMatrix1[["table"]]
names(dimnames(matrix)) <- c('Predicted', 'Actual')   # label the confusion matrix
matrix
         Actual
Predicted   0   1
        0 127  26
        1  34  67
conMatrix2 <- confusionMatrix(titanic_train$pred, titanic_train$survived)
matrix2 <- conMatrix2[["table"]]

names(dimnames(matrix2)) <- c('Predicted', 'Actual')
matrix2
         Actual
Predicted   0   1
        0 388 105
        1  69 227
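The overall accuracy implied by each table can also be read off the confusionMatrix objects directly; the arithmetic in the comments simply restates the tables above.

conMatrix1$overall["Accuracy"]   # test set:     (127 + 67) / 254 ≈ 0.764
conMatrix2$overall["Accuracy"]   # training set: (388 + 227) / 789 ≈ 0.780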
titanic_test$survived <- as.ordered(titanic_test$survived)    # same column that roc() uses below
titanic_train$survived <- as.ordered(titanic_train$survived)

roca <- roc(titanic_test$survived,prob1)
rocb <- roc(titanic_train$survived,prob2)
ggroc(list('test ROC'=roca,'train ROC'=rocb),size=1)+
  geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1), color='black', linetype='solid')+
  theme(legend.title=element_blank()) 

Cross-validation in Python

In Python, cross-validation is provided by scikit-learn through functions and classes such as KFold, cross_validate, cross_val_score, GridSearchCV, and RandomizedSearchCV.

Of these, KFold and the other splitter classes handle only the "external" part of cross-validation: they generate the fold indices and do not fit a model themselves, while cross_validate and cross_val_score take an estimator and evaluate it on each fold.

GridSearchCV and RandomizedSearchCV perform cross-validation internally: they split the data and refit the model for every candidate hyperparameter setting, keeping the best one.

There are also other splitting strategies, such as ShuffleSplit, which draws repeated random train/test splits.