1. Data loading:

%python
import xgbt_model
%python
import pandas_utils

'''
Titanic dataset field descriptions:

    Survived: 0 = died, 1 = survived [the y label]
    Pclass: ticket class, three possible values (1, 2, 3) [one-hot encoded]
    Name: passenger name [dropped]
    Sex: passenger sex [converted to a boolean feature]
    Age: passenger age (has missing values) [numeric feature; add an "is Age missing" auxiliary feature]
    SibSp: number of siblings/spouses aboard (integer) [numeric feature]
    Parch: number of parents/children aboard (integer) [numeric feature]
    Ticket: ticket number (string) [dropped]
    Fare: ticket price (float, roughly 0-500) [numeric feature]
    Cabin: passenger's cabin (has missing values) [add an "is Cabin missing" auxiliary feature]
    Embarked: port of embarkation: S, C, Q (has missing values) [one-hot encoded, four dimensions: S, C, Q, nan]

'''

import pandas as pd

training_dataset = pandas_utils.read_csv_from_hdfs('train.csv')
testing_dataset = pandas_utils.read_csv_from_hdfs('test.csv')

# DataFrame.append was removed in pandas 2.0; pd.concat is the stable equivalent
dataset = pd.concat([training_dataset, testing_dataset]).reset_index(drop=True)
dataset.head()

[image: dataset.head() output]
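The loading step can be sketched with plain pandas, assuming local CSVs in place of the tutorial's `pandas_utils.read_csv_from_hdfs` helper. The sketch also adds the "is missing" auxiliary columns for Age and Cabin described in the field list above; `combine_and_flag` is a hypothetical name, not part of the tutorial's library:

```python
import pandas as pd

# Hypothetical stand-in for pandas_utils.read_csv_from_hdfs: plain local CSVs.
# train_df = pd.read_csv('train.csv')
# test_df = pd.read_csv('test.csv')

def combine_and_flag(train_df, test_df):
    """Stack train and test into one frame, then add the 'is missing'
    helper columns described above for Age and Cabin."""
    combined = pd.concat([train_df, test_df], ignore_index=True)
    combined['Age_missing'] = combined['Age'].isna().astype(int)
    combined['Cabin_missing'] = combined['Cabin'].isna().astype(int)
    return combined
```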

2. Negative down-sampling:

%python

# If positive and negative samples are heavily imbalanced, the negatives are usually
# down-sampled so that positives : negatives >= 1/3 (as a rule of thumb).
# For example, with positives : negatives = 1:10, setting sampling_ratio=0.2
# brings the final ratio to 1:2.

dataset = xgbt_model.negative_down_sampling(dataset, sampling_ratio=0.9, label_column='Survived')
positive data size: (342, 11)
negative data size before sampling: (549, 11)
negative data size after sampling: (494, 11)
total data size after sampling: (836, 11)
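The down-sampling rule described in the comment above can be sketched in a few lines of pandas. This is an assumed re-implementation for illustration, not the actual `xgbt_model.negative_down_sampling`:

```python
import pandas as pd

def negative_down_sampling(df, sampling_ratio, label_column, seed=0):
    """Keep all positives; keep a sampling_ratio fraction of the negatives.
    Sketch of what xgbt_model.negative_down_sampling might do internally."""
    pos = df[df[label_column] == 1]
    neg = df[df[label_column] == 0]
    neg_sampled = neg.sample(frac=sampling_ratio, random_state=seed)
    print('positive data size:', pos.shape)
    print('negative data size before sampling:', neg.shape)
    print('negative data size after sampling:', neg_sampled.shape)
    return pd.concat([pos, neg_sampled]).reset_index(drop=True)
```

With `sampling_ratio=0.9` this matches the log above: 549 negatives shrink to roughly 549 × 0.9 ≈ 494.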

3. Define the features used for modeling

%python

# Note: these features and the label must all exist in the dataset from Step 2, otherwise an error is raised

# Numerical features to use
numerical_features = [
    'Age',
    'SibSp',
    'Parch',
    'Fare'
]

# Categorical features to use
categorical_features = [
    'Pclass',
    'Sex',
    'Cabin',
    'Embarked'
]

label_column = 'Survived'

4. Convert raw features into model-ready features

%python

features, labels = xgbt_model.preprocessing(dataset, numerical_features, categorical_features, label_column)

features.head()
[In preprocessing] shape of numerical features = (836, 4)
[In preprocessing] shape of categorical features = (836, 151)
[In preprocessing] shape of all features = (836, 155)

[image: features.head() output]
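What the preprocessing step likely does can be sketched with `pd.get_dummies`; `dummy_na=True` produces the extra nan dimension mentioned for Embarked in the field list. The real `xgbt_model.preprocessing` may differ in detail (e.g. how the auxiliary missing-value features are added):

```python
import pandas as pd

def preprocessing(df, numerical_features, categorical_features, label_column):
    """Sketch: numeric columns pass through unchanged; categorical columns
    are one-hot encoded with an extra column for NaN values."""
    num = df[numerical_features]
    # astype(object) so integer-coded categories like Pclass get encoded too
    cat = pd.get_dummies(df[categorical_features].astype(object), dummy_na=True)
    features = pd.concat([num, cat], axis=1)
    labels = df[label_column]
    return features, labels
```

The one-hot expansion explains the column counts in the log: 4 numeric columns plus 151 one-hot columns (Cabin alone has many distinct values) give 155 features in total.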

5. Inspect how a feature is distributed under each label

%python
%matplotlib inline
xgbt_model.plot_feature_vs_label(features, labels, ['Age', 'Fare'])

[images: distributions of Age and Fare, split by label]

6. Train the model

%python
# Model hyper-parameters
max_tree_number = 50
max_depth = 3
eval_metric = 'auc'
random_seed = 123

# Split the full dataset into training and evaluation sets
training_features, eval_features, training_labels, eval_labels = xgbt_model.split_train_test(features, labels, test_size=0.2, random_state=1231)

# Train the model on the training data
model = xgbt_model.train(
    training_features, 
    training_labels, 
    max_tree_number=max_tree_number, 
    max_depth=max_depth, 
    random_seed=random_seed, 
    eval_metric=eval_metric,
    eval_features=eval_features, 
    eval_labels=eval_labels
)
[0]	train-auc:0.855213	eval-auc:0.839904
Multiple eval metrics have been passed: 'eval-auc' will be used for early stopping.

Will train until eval-auc hasn't improved in 10 rounds.
[1]	train-auc:0.869504	eval-auc:0.839265
[2]	train-auc:0.871955	eval-auc:0.839833
[3]	train-auc:0.878054	eval-auc:0.843945
[4]	train-auc:0.882113	eval-auc:0.844796
[5]	train-auc:0.883051	eval-auc:0.842385
[6]	train-auc:0.883121	eval-auc:0.853517
[7]	train-auc:0.888627	eval-auc:0.852241
[8]	train-auc:0.893057	eval-auc:0.84671
[9]	train-auc:0.8968	eval-auc:0.846356
[10]	train-auc:0.899713	eval-auc:0.84437
[11]	train-auc:0.901343	eval-auc:0.844654
[12]	train-auc:0.901079	eval-auc:0.844299
[13]	train-auc:0.901103	eval-auc:0.843945
[14]	train-auc:0.905336	eval-auc:0.843023
[15]	train-auc:0.90649	eval-auc:0.847703
[16]	train-auc:0.909559	eval-auc:0.845647
Stopping. Best iteration:
[6]	train-auc:0.883121	eval-auc:0.853517
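The stopping rule visible in the log (stop once eval-auc has not improved for 10 rounds, then report the best round) can be reproduced in a few lines of plain Python; feeding it the eval-auc values above recovers best iteration 6:

```python
def best_iteration(eval_scores, patience=10):
    """Early-stopping rule: track the best eval score seen so far and
    stop once `patience` rounds pass without improvement."""
    best_score, best_iter = float('-inf'), -1
    for i, score in enumerate(eval_scores):
        if score > best_score:
            best_score, best_iter = score, i
        elif i - best_iter >= patience:
            break
    return best_iter, best_score
```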

7. Inspect the feature importance distribution

%python
%matplotlib inline
xgbt_model.plot_feature_importance(model)

[image: feature importance bar chart]

8. Inspect the model structure in detail

%python
%matplotlib inline
xgbt_model.plot_tree(model, num_trees=0)

[image: structure of the first tree]

9. Inspect the model's KS metric and choose the best split threshold

%python
%matplotlib inline
xgbt_model.plot_model_ks(model, eval_features, eval_labels)

[image: KS curve]
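The KS statistic behind `plot_model_ks` is the maximum gap between the cumulative distributions of positive and negative scores; the score at which the gap peaks is a natural split threshold. A minimal sketch of that idea (`ks_statistic` is a hypothetical helper, not part of `xgbt_model`):

```python
import numpy as np

def ks_statistic(scores, labels):
    """Return (KS value, threshold). Walking from high scores to low,
    the gap TPR - FPR peaks at the KS-optimal cutoff."""
    order = np.argsort(scores)[::-1]                   # highest score first
    labels = np.asarray(labels)[order]
    tpr = np.cumsum(labels) / labels.sum()             # positives captured so far
    fpr = np.cumsum(1 - labels) / (1 - labels).sum()   # negatives captured so far
    gap = tpr - fpr
    best = int(np.argmax(gap))
    return float(gap[best]), float(np.asarray(scores)[order][best])
```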

10. Predict

%python

xgbt_model.predict(model, eval_features)
array([0.82655984, 0.27785543, 0.23938118, 0.12750405, 0.09729803,
       0.95870715, 0.7831825 , 0.5179815 , 0.8811106 , 0.03982455,
       0.32331768, 0.36604008, 0.720063  , 0.11601663, 0.11403432,
       0.39625728, 0.94897515, 0.13791059, 0.37203276, 0.13597682,
       0.5418501 , 0.94245553, 0.7604762 , 0.2135466 , 0.12084755,
       0.13791059, 0.96606845, 0.7395714 , 0.8811106 , 0.9649121 ,
       0.9573398 , 0.8102147 , 0.11403432, 0.53544307, 0.22115439,
       0.08738643, 0.19681343, 0.3295594 , 0.6055345 , 0.12084755,
       0.95408344, 0.9481865 , 0.95261663, 0.2135466 , 0.95870715,
       0.11403432, 0.11403432, 0.44013265, 0.05468271, 0.2422589 ,
       0.21938527, 0.94245553, 0.11403432, 0.09064003, 0.11403432,
       0.09729803, 0.39625728, 0.81641793, 0.11403432, 0.5180364 ,
       0.13791059, 0.1561236 , 0.67381006, 0.0916443 , 0.60591793,
       0.81304884, 0.08536454, 0.1791531 , 0.13597682, 0.1228346 ,
       0.5179815 , 0.47453502, 0.83776534, 0.20418522, 0.88687545,
       0.1228346 , 0.3632514 , 0.858027  , 0.20418522, 0.95468086,
       0.1228346 , 0.3512447 , 0.13009362, 0.36939034, 0.4735429 ,
       0.15238559, 0.22115439, 0.9314935 , 0.35052568, 0.95644695,
       0.95335037, 0.13791059, 0.13178019, 0.95408344, 0.12084755,
       0.89744616, 0.19502547, 0.82655984, 0.03466938, 0.46487603,
       0.88687545, 0.5179815 , 0.41726422, 0.14591543, 0.949904  ,
       0.19681343, 0.89744616, 0.13791059, 0.09608363, 0.09957181,
       0.5337448 , 0.5261282 , 0.35052568, 0.81304884, 0.1561236 ,
       0.13597682, 0.13597682, 0.1561236 , 0.23731515, 0.08738643,
       0.1228346 , 0.09064003, 0.73137987, 0.03982455, 0.11601663,
       0.14591543, 0.32331768, 0.95870715, 0.12084755, 0.26739913,
       0.94375545, 0.11403432, 0.11403432, 0.9481865 , 0.3613689 ,
       0.87709194, 0.19907987, 0.8183291 , 0.1228346 , 0.9511413 ,
       0.0355372 , 0.85829735, 0.8906141 , 0.83163077, 0.09729803,
       0.13597682, 0.13597682, 0.82655984, 0.73431736, 0.8522414 ,
       0.35052568, 0.67381006, 0.96101946, 0.94900113, 0.9553306 ,
       0.3613689 , 0.11403432, 0.94900113, 0.7277221 , 0.11403432,
       0.20446633, 0.13791059, 0.40981635, 0.16876496, 0.32134193,
       0.04640677, 0.67381006, 0.17447259], dtype=float32)
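The probabilities returned by `predict` can be turned into 0/1 survival labels with a cutoff, e.g. the KS-optimal threshold chosen in step 9 (the 0.5 default below is only illustrative):

```python
import numpy as np

def to_labels(probs, threshold=0.5):
    """Convert predicted survival probabilities into 0/1 labels."""
    return (np.asarray(probs) >= threshold).astype(int)
```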