A Beginner's One-Stop Template for Machine-Learning Modeling
1. Loading the data:
%python
import xgbt_model
%python
import pandas_utils
'''
Titanic dataset field descriptions:
Survived: 0 = died, 1 = survived [the y label]
Pclass: ticket class, one of three values (1, 2, 3) [convert to one-hot encoding]
Name: passenger name [dropped]
Sex: passenger sex [convert to a boolean feature]
Age: passenger age (has missing values) [numerical feature; add an "is Age missing" auxiliary feature]
SibSp: number of siblings/spouses aboard (integer) [numerical feature]
Parch: number of parents/children aboard (integer) [numerical feature]
Ticket: ticket number (string) [dropped]
Fare: ticket price (float, roughly 0-500) [numerical feature]
Cabin: passenger's cabin (has missing values) [add an "is Cabin missing" auxiliary feature]
Embarked: port of embarkation: S, C, Q (has missing values) [convert to one-hot encoding, four dimensions: S, C, Q, nan]
'''
import pandas as pd

training_dataset = pandas_utils.read_csv_from_hdfs('train.csv')
testing_dataset = pandas_utils.read_csv_from_hdfs('test.csv')
# DataFrame.append is deprecated (removed in pandas 2.0); pd.concat is the replacement
dataset = pd.concat([training_dataset, testing_dataset]).reset_index(drop=True)
dataset.head()
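`pandas_utils.read_csv_from_hdfs` appears to be an internal wrapper around pandas; locally the equivalent is `pd.read_csv`. Before feature engineering, it is worth confirming the combined frame's shape and per-column missingness, since several fields above are flagged as having missing values. A minimal sketch with a tiny stand-in frame:

```python
import pandas as pd

# Tiny stand-in for the combined Titanic frame from the step above
dataset = pd.DataFrame({
    "Survived": [0.0, 1.0, None],   # test rows have no label
    "Age": [22.0, None, 26.0],
    "Cabin": [None, "C85", None],
})
print(dataset.shape)         # overall size
print(dataset.isna().sum())  # per-column missing counts
```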
2. Down-sampling the negatives:
%python
# If the positive/negative classes are heavily imbalanced, the negatives are usually down-sampled
# so that positives:negatives >= 1:3 (as a rule of thumb).
# For example, if positives:negatives = 1:10, setting sampling_ratio=0.2 brings the final ratio to 1:2.
dataset = xgbt_model.negative_down_sampling(dataset, sampling_ratio=0.9, label_column='Survived')
positive data size: (342, 11)
negative data size before sampling: (549, 11)
negative data size after sampling: (494, 11)
total data size after sampling: (836, 11)
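`xgbt_model.negative_down_sampling` is an internal helper, but the log above shows what it does: keep every positive row and sample the given fraction of negatives. A minimal sketch of the same idea in plain pandas (the function name and `random_state` here are illustrative, not the wrapper's actual internals):

```python
import pandas as pd

def negative_down_sampling(df, sampling_ratio, label_column):
    """Keep all positives; keep a sampling_ratio fraction of the negatives."""
    positives = df[df[label_column] == 1]
    negatives = df[df[label_column] == 0].sample(frac=sampling_ratio, random_state=42)
    return pd.concat([positives, negatives]).reset_index(drop=True)

# 1:10 imbalance, sampling_ratio=0.2 -> roughly 1:2, as described above
df = pd.DataFrame({"Survived": [1] * 10 + [0] * 100})
sampled = negative_down_sampling(df, sampling_ratio=0.2, label_column="Survived")
print((sampled["Survived"] == 1).sum(), (sampled["Survived"] == 0).sum())  # 10 20
```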
3. Defining the modeling features
%python
# Note: these features and the label must all exist in the dataset produced by Step 2, or an error is raised
# Numerical features to use
numerical_features = [
'Age',
'SibSp',
'Parch',
'Fare'
]
# Categorical features to use
categorical_features = [
'Pclass',
'Sex',
'Cabin',
'Embarked'
]
label_column = 'Survived'
4. Converting raw fields into model-ready features
%python
features, labels = xgbt_model.preprocessing(dataset, numerical_features, categorical_features, label_column)
features.head()
[In preprocessing] shape of numerical features = (836, 4)
[In preprocessing] shape of categorical features = (836, 151)
[In preprocessing] shape of all features = (836, 155)
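The preprocessing log shows the 4 numerical columns kept as-is and the 4 categorical columns expanded to 151 one-hot dimensions (Cabin alone has many levels, plus a nan level per field as described in the docstring). A minimal sketch of this kind of preprocessing with plain pandas, on a toy frame; the auxiliary "is missing" flag and the nan one-hot column follow the plan stated in Step 1:

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [22.0, None, 26.0],
    "Fare": [7.25, 71.28, 8.05],
    "Sex": ["male", "female", "male"],
    "Embarked": ["S", "C", None],
    "Survived": [0, 1, 1],
})

numerical = df[["Age", "Fare"]].copy()
numerical["Age_missing"] = numerical["Age"].isna().astype(int)  # auxiliary missing flag

# dummy_na=True gives every categorical field an extra nan dimension
categorical = pd.get_dummies(df[["Sex", "Embarked"]], dummy_na=True)

features = pd.concat([numerical, categorical], axis=1)
labels = df["Survived"]
print(features.shape)  # 3 numerical + 3 Sex + 3 Embarked columns
```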
5. Inspecting a feature's distribution under each label
%python
%matplotlib inline
xgbt_model.plot_feature_vs_label(features, labels, ['Age', 'Fare'])
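`plot_feature_vs_label` presumably overlays the feature's distribution for each label value, which is useful for spotting discriminative features. A minimal matplotlib sketch of the same idea (the toy data and plotting choices are illustrative, not the wrapper's actual output):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for non-interactive environments
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "Age": [22, 38, 26, 35, 54, 2, 27, 14],
    "Survived": [0, 1, 1, 1, 0, 0, 1, 0],
})

fig, ax = plt.subplots()
for label, group in df.groupby("Survived"):
    # one semi-transparent histogram per label value
    ax.hist(group["Age"], bins=8, alpha=0.5, label=f"Survived={label}")
ax.set_xlabel("Age")
ax.set_ylabel("count")
ax.legend()
```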
6. Training the model
%python
# Define the model parameters
max_tree_number = 50
max_depth = 3
eval_metric = 'auc'
random_seed = 123
# Split the full dataset into training and evaluation sets
training_features, eval_features, training_labels, eval_labels = xgbt_model.split_train_test(features, labels, test_size=0.2, random_state=1231)
# Train the model on the training split
model = xgbt_model.train(
training_features,
training_labels,
max_tree_number=max_tree_number,
max_depth=max_depth,
random_seed=random_seed,
eval_metric=eval_metric,
eval_features=eval_features,
eval_labels=eval_labels
)
[0] train-auc:0.855213 eval-auc:0.839904
Multiple eval metrics have been passed: 'eval-auc' will be used for early stopping.
Will train until eval-auc hasn't improved in 10 rounds.
[1] train-auc:0.869504 eval-auc:0.839265
[2] train-auc:0.871955 eval-auc:0.839833
[3] train-auc:0.878054 eval-auc:0.843945
[4] train-auc:0.882113 eval-auc:0.844796
[5] train-auc:0.883051 eval-auc:0.842385
[6] train-auc:0.883121 eval-auc:0.853517
[7] train-auc:0.888627 eval-auc:0.852241
[8] train-auc:0.893057 eval-auc:0.84671
[9] train-auc:0.8968 eval-auc:0.846356
[10] train-auc:0.899713 eval-auc:0.84437
[11] train-auc:0.901343 eval-auc:0.844654
[12] train-auc:0.901079 eval-auc:0.844299
[13] train-auc:0.901103 eval-auc:0.843945
[14] train-auc:0.905336 eval-auc:0.843023
[15] train-auc:0.90649 eval-auc:0.847703
[16] train-auc:0.909559 eval-auc:0.845647
Stopping. Best iteration:
[6] train-auc:0.883121 eval-auc:0.853517
7. Viewing the feature-importance distribution
%python
%matplotlib inline
xgbt_model.plot_feature_importance(model)
8. Examining the model structure in detail
%python
%matplotlib inline
xgbt_model.plot_tree(model, num_trees=0)
9. Checking the model's KS metric and choosing the best split threshold
%python
%matplotlib inline
xgbt_model.plot_model_ks(model, eval_features, eval_labels)
10. Prediction
%python
xgbt_model.predict(model, eval_features)
array([0.82655984, 0.27785543, 0.23938118, 0.12750405, 0.09729803,
0.95870715, 0.7831825 , 0.5179815 , 0.8811106 , 0.03982455,
0.32331768, 0.36604008, 0.720063 , 0.11601663, 0.11403432,
0.39625728, 0.94897515, 0.13791059, 0.37203276, 0.13597682,
0.5418501 , 0.94245553, 0.7604762 , 0.2135466 , 0.12084755,
0.13791059, 0.96606845, 0.7395714 , 0.8811106 , 0.9649121 ,
0.9573398 , 0.8102147 , 0.11403432, 0.53544307, 0.22115439,
0.08738643, 0.19681343, 0.3295594 , 0.6055345 , 0.12084755,
0.95408344, 0.9481865 , 0.95261663, 0.2135466 , 0.95870715,
0.11403432, 0.11403432, 0.44013265, 0.05468271, 0.2422589 ,
0.21938527, 0.94245553, 0.11403432, 0.09064003, 0.11403432,
0.09729803, 0.39625728, 0.81641793, 0.11403432, 0.5180364 ,
0.13791059, 0.1561236 , 0.67381006, 0.0916443 , 0.60591793,
0.81304884, 0.08536454, 0.1791531 , 0.13597682, 0.1228346 ,
0.5179815 , 0.47453502, 0.83776534, 0.20418522, 0.88687545,
0.1228346 , 0.3632514 , 0.858027 , 0.20418522, 0.95468086,
0.1228346 , 0.3512447 , 0.13009362, 0.36939034, 0.4735429 ,
0.15238559, 0.22115439, 0.9314935 , 0.35052568, 0.95644695,
0.95335037, 0.13791059, 0.13178019, 0.95408344, 0.12084755,
0.89744616, 0.19502547, 0.82655984, 0.03466938, 0.46487603,
0.88687545, 0.5179815 , 0.41726422, 0.14591543, 0.949904 ,
0.19681343, 0.89744616, 0.13791059, 0.09608363, 0.09957181,
0.5337448 , 0.5261282 , 0.35052568, 0.81304884, 0.1561236 ,
0.13597682, 0.13597682, 0.1561236 , 0.23731515, 0.08738643,
0.1228346 , 0.09064003, 0.73137987, 0.03982455, 0.11601663,
0.14591543, 0.32331768, 0.95870715, 0.12084755, 0.26739913,
0.94375545, 0.11403432, 0.11403432, 0.9481865 , 0.3613689 ,
0.87709194, 0.19907987, 0.8183291 , 0.1228346 , 0.9511413 ,
0.0355372 , 0.85829735, 0.8906141 , 0.83163077, 0.09729803,
0.13597682, 0.13597682, 0.82655984, 0.73431736, 0.8522414 ,
0.35052568, 0.67381006, 0.96101946, 0.94900113, 0.9553306 ,
0.3613689 , 0.11403432, 0.94900113, 0.7277221 , 0.11403432,
0.20446633, 0.13791059, 0.40981635, 0.16876496, 0.32134193,
0.04640677, 0.67381006, 0.17447259], dtype=float32)
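The scores above are probabilities, so a final binary decision applies the threshold chosen in Step 9. A trivial sketch (the threshold value here is illustrative, not the one the KS plot would pick):

```python
import numpy as np

probs = np.array([0.826, 0.278, 0.958, 0.097])  # scores like those returned by predict
threshold = 0.5  # e.g. the cut-off chosen from the KS plot in Step 9
preds = (probs >= threshold).astype(int)
print(preds)  # [1 0 1 0]
```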