
Machine Learning: XGBoost Hyperparameter Tuning

2019-01-25

XGBoost is an excellent algorithm for building predictive models, developed as a further improvement on GBDT. Here I walk through a hands-on XGBoost tuning exercise based on a project I am currently working on. The concrete steps follow:

1. Commonly used parameters

  1. estimator: the model to be tuned; when using XGBoost, this is the model you construct, e.g. model = xgb.XGBRegressor(**other_params)
  2. param_grid: a dict (or a list of dicts) giving the parameter values to search over, e.g. cv_params = {'n_estimators': [550, 575, 600, 650, 675]}
  3. scoring: the evaluation metric. It defaults to None, in which case the estimator's own score method is used. It can also be a string such as scoring='roc_auc' (which metric is appropriate depends on the model), or a callable with the signature scorer(estimator, X, y). The available scoring strings are listed at:

https://scikit-learn.org/stable/modules/model_evaluation.html
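To make the callable form concrete, here is a minimal sketch of a custom scorer (my own illustration, not from the original project; rmse_scorer is a hypothetical name):

import numpy as np
from sklearn.metrics import mean_squared_error

# A callable scorer must follow the scorer(estimator, X, y) signature.
# GridSearchCV maximizes the score, so RMSE is negated.
def rmse_scorer(estimator, X, y):
    pred = estimator.predict(X)
    return -np.sqrt(mean_squared_error(y, pred))

# Usage: GridSearchCV(estimator=model, param_grid=cv_params, scoring=rmse_scorer, cv=5)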

Tuning usually starts from a set of initial values:

  • learning_rate: 0.1
  • n_estimators: 500
  • max_depth: 5
  • min_child_weight: 1
  • subsample: 0.8
  • colsample_bytree: 0.8
  • gamma: 0
  • reg_alpha: 0
  • reg_lambda: 1

(Figure: overview table of common XGBoost parameters)

2. Tuning procedure

Tuning generally proceeds in the following order:

1. Optimal number of boosting rounds: n_estimators

import pandas as pd
import xgboost as xgb
from sklearn.model_selection import GridSearchCV

if __name__ == '__main__':
    trainFilePath = 'dataset/soccer/train.csv'
    testFilePath = 'dataset/soccer/test.csv'
    data = pd.read_csv(trainFilePath)
    X_train, y_train = featureSet(data)    # project-specific feature engineering helper
    X_test = loadTestData(testFilePath)    # project-specific test-set loader

    cv_params = {'n_estimators': [400, 500, 600, 700, 800]}
    other_params = {'learning_rate': 0.1, 'n_estimators': 500, 'max_depth': 5, 'min_child_weight': 1, 'seed': 0,
                    'subsample': 0.8, 'colsample_bytree': 0.8, 'gamma': 0, 'reg_alpha': 0, 'reg_lambda': 1}

    model = xgb.XGBRegressor(**other_params)
    optimized_GBM = GridSearchCV(estimator=model, param_grid=cv_params, scoring='r2', cv=5, verbose=1, n_jobs=4)
    optimized_GBM.fit(X_train, y_train)
    # Note: grid_scores_ was removed in scikit-learn 0.20; see the cv_results_ variant below
    evalute_result = optimized_GBM.grid_scores_
    print('Results for each candidate: {0}'.format(evalute_result))
    print('Best parameter values: {0}'.format(optimized_GBM.best_params_))
    print('Best model score: {0}'.format(optimized_GBM.best_score_))
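On scikit-learn 0.20 and newer, grid_scores_ no longer exists; the same per-candidate view can be rebuilt from cv_results_ (a small sketch of my own, using only standard GridSearchCV attributes):

res = optimized_GBM.cv_results_
for mean, std, params in zip(res['mean_test_score'], res['std_test_score'], res['params']):
    print('mean: {:.5f}, std: {:.5f}, params: {}'.format(mean, std, params))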

My results were as follows:

Results for each candidate: [mean: 0.64585, std: 0.00735, params: {'n_estimators': 400}, mean: 0.64577, std: 0.00760, params: {'n_estimators': 500}, mean: 0.64517, std: 0.00750, params: {'n_estimators': 600}, mean: 0.64444, std: 0.00775, params: {'n_estimators': 700}, mean: 0.64368, std: 0.00766, params: {'n_estimators': 800}]
Best parameter values: {'n_estimators': 400}
Best model score: 0.6458464912888138

So we continue with a finer grid around the best value:

cv_params = {'n_estimators': [350, 375, 400, 425, 450]}

Results:

Results for each candidate: [mean: 0.64750, std: 0.00506, params: {'n_estimators': 350}, mean: 0.64760, std: 0.00506, params: {'n_estimators': 375}, mean: 0.64786, std: 0.00505, params: {'n_estimators': 400}, mean: 0.64805, std: 0.00479, params: {'n_estimators': 425}, mean: 0.64813, std: 0.00471, params: {'n_estimators': 450}]
Best parameter values: {'n_estimators': 450}
Best model score: 0.6481318637692152

Narrowing further:

cv_params = {'n_estimators': [450, 460, 470, 480, 490]}

(The grid does not have to be this fine; a finer grid locates the optimum more precisely, at the cost of more fits.)

Results:

Results for each candidate: [mean: 0.64785, std: 0.00566, params: {'n_estimators': 450}, mean: 0.64791, std: 0.00572, params: {'n_estimators': 460}, mean: 0.64783, std: 0.00568, params: {'n_estimators': 470}, mean: 0.64785, std: 0.00570, params: {'n_estimators': 480}, mean: 0.64798, std: 0.00578, params: {'n_estimators': 490}]
Best parameter values: {'n_estimators': 490}
Best model score: 0.6479752914872792

So we settle on n_estimators = 490 (the scores in this range differ only in the fourth decimal place, well within the fold-to-fold noise).
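This coarse-to-fine narrowing can also be scripted rather than done by hand. A minimal sketch under my own assumptions (refine_grid is a hypothetical helper, not part of the original code):

def refine_grid(best, step, n=5, lo=1):
    # Build a grid of n values centered on `best`, spaced by `step`, floored at `lo`.
    half = n // 2
    return [max(lo, best + (i - half) * step) for i in range(n)]

# refine_grid(400, 25) -> [350, 375, 400, 425, 450]
# Rerun GridSearchCV with cv_params = {'n_estimators': refine_grid(best, step)} at each stage.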

2. With n_estimators updated, tune min_child_weight and max_depth

cv_params = {'max_depth': [3, 4, 5, 6, 7, 8, 9, 10], 'min_child_weight': [1, 2, 3, 4, 5, 6]}
other_params = {'learning_rate': 0.1, 'n_estimators': 490, 'max_depth': 5, 'min_child_weight': 1, 'seed': 0,
                'subsample': 0.8, 'colsample_bytree': 0.8, 'gamma': 0, 'reg_alpha': 0, 'reg_lambda': 1}

Results:

Fitting 5 folds for each of 48 candidates, totalling 240 fits
[Parallel(n_jobs=4)]: Done 42 tasks | elapsed: 4.2min
[Parallel(n_jobs=4)]: Done 192 tasks | elapsed: 28.6min
[Parallel(n_jobs=4)]: Done 240 out of 240 | elapsed: 59.0min finished
Results for each candidate, mean r2 (std), rows = max_depth, columns = min_child_weight:

              1                  2                  3                  4                  5                  6
 3   0.63650 (0.00727)  0.63667 (0.00713)  0.63681 (0.00683)  0.63737 (0.00731)  0.63715 (0.00755)  0.63597 (0.00759)
 4   0.64343 (0.00821)  0.64373 (0.00810)  0.64361 (0.00804)  0.64384 (0.00789)  0.64308 (0.00808)  0.64308 (0.00793)
 5   0.64679 (0.00899)  0.64596 (0.00869)  0.64559 (0.00890)  0.64628 (0.00930)  0.64576 (0.00975)  0.64600 (0.00998)
 6   0.64557 (0.00939)  0.64626 (0.00942)  0.64656 (0.00950)  0.64518 (0.00933)  0.64602 (0.00997)  0.64620 (0.00976)
 7   0.64369 (0.00917)  0.64400 (0.00967)  0.64454 (0.00982)  0.64405 (0.00995)  0.64430 (0.00999)  0.64532 (0.00954)
 8   0.64014 (0.00994)  0.63977 (0.00986)  0.64029 (0.01102)  0.64113 (0.01123)  0.64147 (0.01108)  0.64259 (0.01052)
 9   0.63411 (0.01011)  0.63434 (0.01066)  0.63523 (0.01031)  0.63729 (0.01088)  0.63817 (0.01094)  0.64045 (0.01124)
10   0.62545 (0.01010)  0.62855 (0.00963)  0.63052 (0.01056)  0.63281 (0.01078)  0.63452 (0.01109)  0.63628 (0.01142)

Best parameter values: {'max_depth': 5, 'min_child_weight': 1}
Best model score: 0.6467860854227112
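With a two-parameter grid this size, a heatmap is often easier to scan than the raw list. A sketch of my own (not from the original post), assuming the plain-dict grid above, for which scikit-learn enumerates candidates with the last sorted key, min_child_weight, varying fastest:

import numpy as np
from matplotlib import pyplot as plt

depths = [3, 4, 5, 6, 7, 8, 9, 10]
weights = [1, 2, 3, 4, 5, 6]
# Reshape the flat candidate list into a depth-by-weight grid
scores = np.array(optimized_GBM.cv_results_['mean_test_score']).reshape(len(depths), len(weights))
plt.imshow(scores)
plt.xticks(range(len(weights)), weights)
plt.yticks(range(len(depths)), depths)
plt.xlabel('min_child_weight')
plt.ylabel('max_depth')
plt.colorbar(label='mean CV r2')
plt.show()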

3. Next, tune gamma

cv_params = {'gamma': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]}
other_params = {'learning_rate': 0.1, 'n_estimators': 490, 'max_depth': 5, 'min_child_weight': 1, 'seed': 0,
                'subsample': 0.8, 'colsample_bytree': 0.8, 'gamma': 0, 'reg_alpha': 0, 'reg_lambda': 1}

Results:

Fitting 5 folds for each of 6 candidates, totalling 30 fits
[Parallel(n_jobs=4)]: Done 30 out of 30 | elapsed: 3.9min finished
Results for each candidate: [mean: 0.64692, std: 0.01050, params: {'gamma': 0.1}, mean: 0.64667, std: 0.01054, params: {'gamma': 0.2}, mean: 0.64608, std: 0.01010, params: {'gamma': 0.3}, mean: 0.64646, std: 0.01048, params: {'gamma': 0.4}, mean: 0.64535, std: 0.01040, params: {'gamma': 0.5}, mean: 0.64516, std: 0.01003, params: {'gamma': 0.6}]
Best parameter values: {'gamma': 0.1}
Best model score: 0.6469161452551757

4. Next, subsample and colsample_bytree

cv_params = {'subsample': [0.6, 0.7, 0.8, 0.9], 'colsample_bytree': [0.6, 0.7, 0.8, 0.9]}
other_params = {'learning_rate': 0.1, 'n_estimators': 490, 'max_depth': 5, 'min_child_weight': 1, 'seed': 0,
                'subsample': 0.8, 'colsample_bytree': 0.8, 'gamma': 0.1, 'reg_alpha': 0, 'reg_lambda': 1}

Results:

Fitting 5 folds for each of 16 candidates, totalling 80 fits
[Parallel(n_jobs=4)]: Done 42 tasks | elapsed: 5.1min
[Parallel(n_jobs=4)]: Done 80 out of 80 | elapsed: 10.0min finished
Results for each candidate, mean r2 (std), rows = colsample_bytree, columns = subsample:

            0.6                0.7                0.8                0.9
0.6   0.64621 (0.00702)  0.64761 (0.00700)  0.64828 (0.00740)  0.64980 (0.00781)
0.7   0.64492 (0.00787)  0.64679 (0.00727)  0.64874 (0.00819)  0.64986 (0.00826)
0.8   0.64518 (0.00753)  0.64659 (0.00771)  0.64837 (0.00788)  0.64923 (0.00789)
0.9   0.64528 (0.00779)  0.64621 (0.00717)  0.64807 (0.00812)  0.64943 (0.00739)

Best parameter values: {'colsample_bytree': 0.7, 'subsample': 0.9}
Best model score: 0.6498584508026598

5. Next, reg_alpha and reg_lambda

cv_params = {'reg_alpha': [0.05, 0.1, 1, 2, 3], 'reg_lambda': [0.05, 0.1, 1, 2, 3]}
other_params = {'learning_rate': 0.1, 'n_estimators': 490, 'max_depth': 5, 'min_child_weight': 1, 'seed': 0,
                'subsample': 0.9, 'colsample_bytree': 0.7, 'gamma': 0.1, 'reg_alpha': 0, 'reg_lambda': 1}

Results:

Fitting 5 folds for each of 25 candidates, totalling 125 fits
[Parallel(n_jobs=4)]: Done 42 tasks | elapsed: 4.9min
[Parallel(n_jobs=4)]: Done 125 out of 125 | elapsed: 14.2min finished
Results for each candidate, mean r2 (std), rows = reg_alpha, columns = reg_lambda:

             0.05               0.1                1                  2                  3
0.05   0.65114 (0.00495)  0.65142 (0.00508)  0.65137 (0.00509)  0.65164 (0.00458)  0.65162 (0.00450)
0.1    0.65121 (0.00506)  0.65119 (0.00576)  0.65161 (0.00511)  0.65105 (0.00459)  0.65147 (0.00490)
1      0.65222 (0.00450)  0.65223 (0.00531)  0.65241 (0.00451)  0.65244 (0.00533)  0.65184 (0.00494)
2      0.65210 (0.00470)  0.65213 (0.00474)  0.65193 (0.00469)  0.65176 (0.00461)  0.65166 (0.00476)
3      0.65177 (0.00450)  0.65149 (0.00451)  0.65127 (0.00453)  0.65100 (0.00412)  0.65089 (0.00447)

Best parameter values: {'reg_alpha': 1, 'reg_lambda': 2}
Best model score: 0.6524374554446325

6. Finally, learning_rate

At this stage it is common to try smaller learning rates as well:

cv_params = {'learning_rate': [0.01, 0.05, 0.07, 0.1, 0.2]}
other_params = {'learning_rate': 0.1, 'n_estimators': 490, 'max_depth': 5, 'min_child_weight': 1, 'seed': 0,
                'subsample': 0.9, 'colsample_bytree': 0.7, 'gamma': 0.1, 'reg_alpha': 1, 'reg_lambda': 2}

Results:

Fitting 5 folds for each of 5 candidates, totalling 25 fits
[Parallel(n_jobs=4)]: Done 25 out of 25 | elapsed: 3.1min finished
Results for each candidate: [mean: 0.62619, std: 0.00501, params: {'learning_rate': 0.01}, mean: 0.64688, std: 0.00429, params: {'learning_rate': 0.05}, mean: 0.64786, std: 0.00405, params: {'learning_rate': 0.07}, mean: 0.64837, std: 0.00385, params: {'learning_rate': 0.1}, mean: 0.64772, std: 0.00413, params: {'learning_rate': 0.2}]
Best parameter values: {'learning_rate': 0.1}
Best model score: 0.6483695525378442
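Note that learning_rate and n_estimators interact: a smaller learning rate typically needs more trees, so fixing n_estimators = 490 puts the small learning rates above at a disadvantage. An alternative, sketched here under my own assumptions rather than as the post's method, is to let xgb.cv pick the number of rounds for each learning rate via early stopping:

# Native-API parameter names: eta = learning_rate, alpha = reg_alpha, lambda = reg_lambda
params = {'eta': 0.05, 'max_depth': 5, 'min_child_weight': 1, 'subsample': 0.9,
          'colsample_bytree': 0.7, 'gamma': 0.1, 'alpha': 1, 'lambda': 2, 'seed': 0}
dtrain = xgb.DMatrix(X_train, label=y_train)
cv_res = xgb.cv(params, dtrain, num_boost_round=2000, nfold=5,
                metrics='rmse', early_stopping_rounds=50)
print('best number of rounds at eta=0.05:', cv_res.shape[0])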

7. Finally, train the model with the best values found for all parameters

other_params = {'learning_rate': 0.1, 'n_estimators': 490, 'max_depth': 5, 'min_child_weight': 1, 'seed': 0,
                'subsample': 0.9, 'colsample_bytree': 0.7, 'gamma': 0.1, 'reg_alpha': 1, 'reg_lambda': 2}
bst = xgb.XGBRegressor(**other_params)
bst.fit(X_train, y_train)      # train on the full training set
ypred = bst.predict(X_test)    # predict on the competition test set
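As a sanity check, my own addition (the competition test set here is unlabeled, so this assumes you carve a validation split out of the training data):

from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=0)
val_model = xgb.XGBRegressor(**other_params)
val_model.fit(X_tr, y_tr)
print('hold-out r2:', r2_score(y_val, val_model.predict(X_val)))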

3. With the model trained, analyze feature importance from the importance plot

from xgboost import plot_importance
from matplotlib import pyplot as plt
plot_importance(bst)
plt.show()
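The importances can also be read programmatically instead of from the plot; a short sketch of my own, assuming a reasonably recent xgboost (get_booster() replaced the older booster() accessor):

# Gain-based importance per feature, highest first
scores = bst.get_booster().get_score(importance_type='gain')
for feat, gain in sorted(scores.items(), key=lambda kv: -kv[1])[:10]:
    print(feat, round(gain, 4))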

4. Summary

We found that tuning can raise a model's score somewhat, but the gains are modest; the biggest improvements still come from data cleaning, feature selection, pre-pruning, and similar steps.
