Trying out autoxgboost - 琥珀色呑んだくれ備忘録

I'm trying out autoxgboost, which @hoxo_m tweeted about at the last TokyoR.

For automatic parameter tuning of xgboost, something called autoxgboost looked handy. #tokyor https://t.co/LvwY9U2zyx
- hoxo_m (@hoxo_m), July 15, 2018

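What is it?

autoxgboost is a wrapper that uses mlr and mlrMBO to tune xgboost models with Bayesian optimization. Even if you only use it to get a baseline before digging into a problem seriously, I think it makes life considerably easier.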


Setup

Install from GitHub.

install.packages("devtools") # if you have not installed "devtools" package
devtools::install_github("ja-thomas/autoxgboost")

This time I use the spam dataset.

set.seed(0)
data(spam, package = "kernlab")

70% of the rows are used as training data. The label in this dataset is the type column.

train.set <- sample(1:NROW(spam), NROW(spam) * 0.7)
table(spam$type[train.set])
#> 
#> nonspam    spam 
#>    1954    1266

The remaining rows are used as test data.

test.set <- setdiff(1:NROW(spam), train.set)
table(spam$type[test.set])
#> 
#> nonspam    spam 
#>     834     547
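
As a quick sanity check on this kind of split, here is a minimal sketch on the built-in iris data (a stand-in I picked so it runs without kernlab; the variable names are mine). It also makes one detail explicit: NROW(spam) * 0.7 is not a whole number, and the counts above (3220 training rows out of 4601) show that sample() simply truncated the size. Writing floor() says so directly.

```r
set.seed(0)
n <- NROW(iris)                                   # iris is only a stand-in dataset
train.idx <- sample(seq_len(n), floor(n * 0.7))   # make the truncation explicit
test.idx <- setdiff(seq_len(n), train.idx)

# The two index sets must not overlap and must cover every row exactly once
stopifnot(length(intersect(train.idx, test.idx)) == 0)
stopifnot(setequal(c(train.idx, test.idx), seq_len(n)))
```

If either stopifnot() fails, some rows leaked between training and test data or were dropped.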

Load autoxgboost. The packages it depends on are loaded along with it.

require(autoxgboost)
#> Loading required package: autoxgboost
#> Loading required package: ParamHelpers
#> Warning: replacing previous import 'BBmisc::isFALSE' by
#> 'backports::isFALSE' when loading 'ParamHelpers'
#> Loading required package: mlr
#> Loading required package: mlrMBO
#> Loading required package: smoof
#> Loading required package: BBmisc
#> Loading required package: checkmate
#> Loading required package: mlrCPO

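Tuning the model

I tune on the training data first and then evaluate on the test data. The code follows the manual almost verbatim. I want to go through the detailed settings of makeMBOControl() another day.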

tune.task <- makeClassifTask(data = spam[train.set, ], target = "type")

ctrl <- makeMBOControl()
ctrl <- setMBOControlTermination(ctrl, iters = 1L) #Speed up Tuning by only doing 1 iteration
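
A single iteration is only good for a smoke test. As a sketch of a longer run, setMBOControlTermination() also accepts a time.budget in seconds next to iters. The values 20 and 600 below, and the name ctrl.long, are arbitrary choices of mine and not recommendations from the manual.

```r
library(mlrMBO)

# Hypothetical budget: stop after 20 MBO iterations or roughly 10 minutes,
# whichever termination condition is reached first
ctrl.long <- makeMBOControl()
ctrl.long <- setMBOControlTermination(ctrl.long, iters = 20L, time.budget = 600)
print(ctrl.long)
```

Passing ctrl.long instead of ctrl to autoxgboost() should then give the optimizer more room, at the cost of a longer wait.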

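Run the tuning. This is again almost straight from the manual. The default measure for mlr classification tasks is the mean misclassification error (mmce), so I change it to auc. Pick whichever measure suits your task.

There are arguments I don't understand yet, such as design.size and tune.threshold, so I'll look into those in due course as well.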

res <- autoxgboost(tune.task, control = ctrl, measure = auc, nthread = 4, tune.threshold = FALSE)
#> Computing y column(s) for design. Not provided.
#> [mbo] 0: eta=0.0992; gamma=18.4; max_depth=19; colsample_bytree=0.521; colsample_bylevel=0.529; lambda=5.39; alpha=35.8; subsample=0.733; scale_pos_weight=0.00616 : y = 0.968 : 0.7 secs : initdesign
#> [mbo] 0: eta=0.0715; gamma=0.109; max_depth=9; colsample_bytree=0.641; colsample_bylevel=0.952; lambda=1.46; alpha=480; subsample=0.865; scale_pos_weight=0.173 : y = 0.5 : 0.3 secs : initdesign
#> [mbo] 0: eta=0.181; gamma=25.6; max_depth=13; colsample_bytree=0.928; colsample_bylevel=0.618; lambda=10.6; alpha=7.48; subsample=0.569; scale_pos_weight=0.606 : y = 0.97 : 0.8 secs : initdesign
#> [mbo] 0: eta=0.138; gamma=0.0292; max_depth=16; colsample_bytree=0.728; colsample_bylevel=0.596; lambda=452; alpha=0.343; subsample=0.618; scale_pos_weight=0.00176 : y = 0.969 : 1.2 secs : initdesign
#> [mbo] 0: eta=0.0208; gamma=2.48; max_depth=4; colsample_bytree=0.744; colsample_bylevel=0.752; lambda=0.0143; alpha=0.859; subsample=0.635; scale_pos_weight=0.0286 : y = 0.965 : 0.4 secs : initdesign
#> [mbo] 0: eta=0.0567; gamma=36.3; max_depth=15; colsample_bytree=0.542; colsample_bylevel=0.769; lambda=30.5; alpha=153; subsample=0.979; scale_pos_weight=396 : y = 0.944 : 0.6 secs : initdesign
#> [mbo] 0: eta=0.13; gamma=5.39; max_depth=17; colsample_bytree=0.785; colsample_bylevel=0.809; lambda=0.0852; alpha=0.192; subsample=0.672; scale_pos_weight=5.51 : y = 0.981 : 0.7 secs : initdesign
#> [mbo] 0: eta=0.17; gamma=1.11; max_depth=8; colsample_bytree=0.569; colsample_bylevel=0.985; lambda=2.84; alpha=0.0632; subsample=0.785; scale_pos_weight=3.19 : y = 0.986 : 0.6 secs : initdesign
#> [mbo] 0: eta=0.158; gamma=0.0599; max_depth=19; colsample_bytree=0.671; colsample_bylevel=0.693; lambda=275; alpha=0.0142; subsample=0.525; scale_pos_weight=131 : y = 0.954 : 0.3 secs : initdesign
#> [mbo] 0: eta=0.0861; gamma=0.441; max_depth=5; colsample_bytree=0.818; colsample_bylevel=0.905; lambda=0.00104; alpha=0.00136; subsample=0.935; scale_pos_weight=0.0513 : y = 0.962 : 0.3 secs : initdesign
#> [mbo] 0: eta=0.0414; gamma=0.0134; max_depth=3; colsample_bytree=0.999; colsample_bylevel=0.65; lambda=0.0032; alpha=19.7; subsample=0.914; scale_pos_weight=614 : y = 0.969 : 0.5 secs : initdesign
#> [mbo] 0: eta=0.0284; gamma=0.0255; max_depth=11; colsample_bytree=0.943; colsample_bylevel=0.724; lambda=0.0355; alpha=1.83; subsample=0.547; scale_pos_weight=1.06 : y = 0.969 : 0.8 secs : initdesign
#> [mbo] 0: eta=0.0812; gamma=0.185; max_depth=14; colsample_bytree=0.878; colsample_bylevel=0.565; lambda=0.115; alpha=0.0313; subsample=0.828; scale_pos_weight=0.0112 : y = 0.985 : 1.1 secs : initdesign
#> [mbo] 0: eta=0.191; gamma=8.07; max_depth=7; colsample_bytree=0.851; colsample_bylevel=0.894; lambda=134; alpha=174; subsample=0.877; scale_pos_weight=19.5 : y = 0.945 : 0.3 secs : initdesign
#> [mbo] 0: eta=0.12; gamma=0.542; max_depth=11; colsample_bytree=0.615; colsample_bylevel=0.846; lambda=0.534; alpha=0.00287; subsample=0.764; scale_pos_weight=28 : y = 0.981 : 0.7 secs : initdesign
#> Loading required package: rgenoud
#> ##  rgenoud (Version 5.8-2.0, Build Date: 2018-04-03)
#> ##  See http://sekhon.berkeley.edu/rgenoud for additional documentation.
#> ##  Please cite software as:
#> ##   Walter Mebane, Jr. and Jasjeet S. Sekhon. 2011.
#> ##   ``Genetic Optimization Using Derivatives: The rgenoud package for R.''
#> ##   Journal of Statistical Software, 42(11): 1-26. 
#> ##
#> [mbo] 1: eta=0.125; gamma=0.0108; max_depth=5; colsample_bytree=0.978; colsample_bylevel=0.618; lambda=0.00409; alpha=0.137; subsample=0.799; scale_pos_weight=0.625 : y = 0.983 : 0.5 secs : infill_cb
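
The log shows the two phases of the search: 15 evaluations tagged initdesign, then one model-based proposal tagged infill_cb (only one, because iters = 1L). The toy loop below mimics that shape in base R. It is not what mlrMBO does internally: I swapped in a quadratic lm() fit as the surrogate and a plain grid search in place of an infill criterion, and the objective function is made up.

```r
set.seed(1)

# Hypothetical black-box objective to minimize (true optimum at x = 0.3)
objective <- function(x) (x - 0.3)^2

# Phase 1: initial design, evaluated at random points
design <- data.frame(x = runif(5))
design$y <- objective(design$x)

# Phase 2: fit a surrogate, evaluate the point it likes best, repeat
grid <- data.frame(x = seq(0, 1, length.out = 101))
for (i in 1:3) {
  surrogate <- lm(y ~ poly(x, 2), data = design)
  x.new <- grid$x[which.min(predict(surrogate, grid))]
  design <- rbind(design, data.frame(x = x.new, y = objective(x.new)))
}

best <- design[which.min(design$y), ]
print(best)
```

The point is only the structure: a handful of space-filling evaluations first, then proposals driven by a model fitted to everything seen so far.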

Printing the tuning result reports the recommended parameter set.

print(res)
#> Autoxgboost tuning result
#> 
#> Recommended parameters:
#>               eta: 0.170
#>             gamma: 1.111
#>         max_depth: 8
#>  colsample_bytree: 0.569
#> colsample_bylevel: 0.985
#>            lambda: 2.836
#>             alpha: 0.063
#>         subsample: 0.785
#>  scale_pos_weight: 3.189
#>           nrounds: 25
#> 
#> 
#> Preprocessing pipeline:
#> dropconst(rel.tol = 1e-08, abs.tol = 1e-08, ignore.na = FALSE)
#> 
#> With tuning result: auc = 0.986
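
The result is reported as auc rather than mlr's default mmce. To make the difference between the two concrete, here is a small base-R sketch on made-up labels and probabilities. The AUC is computed with the rank-sum (Mann-Whitney) formula, which is my choice for illustration and not necessarily how mlr computes it internally.

```r
# Toy data (made up): 1 = positive class, prob = predicted P(positive)
truth <- c(1, 1, 1, 0, 0, 0)
prob  <- c(0.9, 0.8, 0.4, 0.6, 0.3, 0.1)

# mmce: share of wrong labels after thresholding the probability at 0.5
pred.label <- as.integer(prob > 0.5)
mmce.value <- mean(pred.label != truth)    # 2 of 6 labels are wrong

# AUC: share of (positive, negative) pairs where the positive scores higher
n.pos <- sum(truth == 1)
n.neg <- sum(truth == 0)
auc.value <- (sum(rank(prob)[truth == 1]) - n.pos * (n.pos + 1) / 2) /
  (n.pos * n.neg)                          # 8 of 9 pairs are ordered correctly

print(c(mmce = mmce.value, auc = auc.value))
```

mmce depends on where the threshold sits, while auc only looks at the ordering of the scores.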

The structure of the result object is as follows.

str(res, 1)
#> List of 5
#>  $ optim.result    :List of 9
#>   ..- attr(*, "class")= chr [1:2] "MBOSingleObjResult" "MBOResult"
#>  $ final.learner   :List of 11
#>   ..- attr(*, "class")= chr [1:3] "CPOLearner" "BaseWrapper" "Learner"
#>  $ final.model     :List of 8
#>   ..- attr(*, "class")= chr [1:3] "CPOModel" "BaseWrapperModel" "WrappedModel"
#>  $ measure         :List of 10
#>   ..- attr(*, "class")= chr "Measure"
#>  $ preproc.pipeline:List of 24
#>   ..- attr(*, "class")= chr [1:2] "CPOPrimitive" "CPO"
#>  - attr(*, "class")= chr "AutoxgbResult"

Evaluating the tuned model

The best learner and model are also kept in the result object, so I use those.

res$final.learner
#> Learner classif.xgboost.custom.dropconst from package xgboost
#> Type: classif
#> Name: ; Short name: 
#> Class: CPOLearner
#> Properties: numerics,missings,twoclass,multiclass,prob,featimp
#> Predict-Type: prob
#> Hyperparameters: nrounds=25,verbose=0,objective=binary:logistic,eta=0.17,gamma=1.11,max_depth=8,colsample_bytree=0.569,colsample_bylevel=0.985,lambda=2.84,alpha=0.0632,subsample=0.785,scale_pos_weight=3.19
res$final.model
#> Model for learner.id=classif.xgboost.custom.dropconst; learner.class=CPOLearner
#> Trained on: task.id = spam[train.set, ]; obs = 3220; features = 57
#> Hyperparameters: nrounds=25,verbose=0,objective=binary:logistic,eta=0.17,gamma=1.11,max_depth=8,colsample_bytree=0.569,colsample_bylevel=0.985,lambda=2.84,alpha=0.0632,subsample=0.785,scale_pos_weight=3.19

Evaluate predictive performance on the held-out test data.

test.task <- makeClassifTask(data = spam[test.set, ], target = "type")

pred <- predict(res$final.model, task = test.task)

performance(pred, measures = auc)
#>       auc 
#> 0.9851731

On this dataset, test-set performance (auc = 0.985) is about the same as the value seen during tuning (auc = 0.986).
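
To try the same task / predict / performance pattern without waiting for a tuning run, here is a minimal sketch using mlr alone. sonar.task is an example task shipped with mlr, and classif.rpart is a plain decision tree that I swapped in for the tuned xgboost learner, so the numbers it prints say nothing about autoxgboost.

```r
library(mlr)

set.seed(0)
n <- getTaskSize(sonar.task)
train.idx <- sample(seq_len(n), floor(n * 0.7))
test.idx <- setdiff(seq_len(n), train.idx)

# predict.type = "prob" is needed because auc is computed from probabilities
lrn <- makeLearner("classif.rpart", predict.type = "prob")
mod <- train(lrn, sonar.task, subset = train.idx)
pred <- predict(mod, task = sonar.task, subset = test.idx)

# Several measures can be requested at once by passing them as a list
perf <- performance(pred, measures = list(auc, mmce))
print(perf)
```

Asking for auc and mmce together is a cheap way to see whether a model that ranks well also classifies well at the default threshold.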