
Credit Default Risk Prediction (4): Training Models

Date: 2021-06-22 05:57:52


Previously, starting from application_train.csv and application_test.csv, we did some simple feature engineering and produced the following files (three different feature treatments applied to the recoded data):

Polynomial Features: poly_train_data.csv, poly_test_data.csv
Domain Knowledge Features: domain_train_data.csv, domain_test_data.csv
Featuretools: auto_train_data.csv, auto_test_data.csv

Training model

Logistic Regression

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# note: Imputer was removed in scikit-learn 0.22; newer versions use sklearn.impute.SimpleImputer
from sklearn.preprocessing import MinMaxScaler, Imputer
from sklearn.linear_model import LogisticRegression

# load data (the domain_* frames are used below, so they must be read in as well)
poly_train = pd.read_csv('data/poly_train_data.csv')
poly_test = pd.read_csv('data/poly_test_data.csv')
domain_train = pd.read_csv('data/domain_train_data.csv')
domain_test = pd.read_csv('data/domain_test_data.csv')

auto_train = pd.read_csv('data/auto_train_data.csv')
auto_test = pd.read_csv('data/auto_test_data.csv')

Missing-value imputation and scaling

target = poly_train['TARGET']
Id = poly_test[['SK_ID_CURR']]

polynomial features

poly_train = poly_train.drop(['TARGET'], axis=1)
# feature names
poly_features = list(poly_train.columns)

# fill missing values with the median
imputer = Imputer(strategy='median')
# scale features to the range [0, 1]
scaler = MinMaxScaler(feature_range=(0, 1))
# fit on the train data only
imputer.fit(poly_train)
# transform train and test data
poly_train = imputer.transform(poly_train)
poly_test = imputer.transform(poly_test)
# scale
scaler.fit(poly_train)
poly_train = scaler.transform(poly_train)
poly_test = scaler.transform(poly_test)
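Since `Imputer` no longer exists in current scikit-learn, here is a minimal sketch of the same median-impute-then-scale step using `sklearn.impute.SimpleImputer` inside a `Pipeline` (the toy frames and column names `f1`, `f2` are invented for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# toy frames standing in for poly_train / poly_test; NaNs mark missing values
train = pd.DataFrame({'f1': [1.0, np.nan, 3.0, 4.0], 'f2': [10.0, 20.0, np.nan, 40.0]})
test = pd.DataFrame({'f1': [2.0, np.nan], 'f2': [np.nan, 30.0]})

# median imputation followed by 0-1 scaling, fit on train data only
preprocess = Pipeline([('impute', SimpleImputer(strategy='median')),
                       ('scale', MinMaxScaler(feature_range=(0, 1)))])
train_arr = preprocess.fit_transform(train)
test_arr = preprocess.transform(test)  # reuses the train medians and min/max
print(train_arr.min(), train_arr.max())  # 0.0 1.0
```

Fitting the pipeline on the training data and only calling `transform` on the test data keeps the imputation and scaling statistics from leaking test information, exactly as the fit/transform split above does.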

domain features

domain_train = domain_train.drop(['TARGET'], axis=1)
domain_features = list(domain_train.columns)
# fit on the train data
imputer.fit(domain_train)
# transform train and test data
domain_train = imputer.transform(domain_train)
domain_test = imputer.transform(domain_test)
# scale
scaler.fit(domain_train)
domain_train = scaler.transform(domain_train)
domain_test = scaler.transform(domain_test)

Featuretools

auto_train = auto_train.drop(['TARGET'], axis=1)
auto_features = list(auto_train.columns)
# fit on the train data
imputer.fit(auto_train)
# transform train and test data
auto_train = imputer.transform(auto_train)
auto_test = imputer.transform(auto_test)
# scale
scaler.fit(auto_train)
auto_train = scaler.transform(auto_train)
auto_test = scaler.transform(auto_test)

print('poly_train', poly_train.shape)
print('poly_test', poly_test.shape)
print('domain_train', domain_train.shape)
print('domain_test', domain_test.shape)
print('auto_train', auto_train.shape)
print('auto_test', auto_test.shape)

poly_train (307511, 274)
poly_test (48744, 274)
domain_train (307511, 244)
domain_test (48744, 244)
auto_train (307511, 239)
auto_test (48744, 239)

LogisticRegression

lr = LogisticRegression(C=0.0001, class_weight='balanced')  # C: inverse regularization strength

Polynomial

lr.fit(poly_train, target)
lr_poly_pred = lr.predict_proba(poly_test)[:, 1]
# submission dataframe
submit = Id.copy()
submit['TARGET'] = lr_poly_pred
submit.to_csv('lr_poly_submit.csv', index=False)

Domain Knowledge

lr = LogisticRegression(C=0.0001, class_weight='balanced', solver='sag')
lr.fit(domain_train, target)
lr_domain_pred = lr.predict_proba(domain_test)[:, 1]
submit = Id.copy()
submit['TARGET'] = lr_domain_pred
submit.to_csv('lr_domain_submit.csv', index=False)

FeatureTools

lr = LogisticRegression(C=0.0001, class_weight='balanced', solver='sag')
lr.fit(auto_train, target)
lr_auto_pred = lr.predict_proba(auto_test)[:, 1]
submit = Id.copy()
submit['TARGET'] = lr_auto_pred
submit.to_csv('lr_auto_submit.csv', index=False)

Online evaluation results:

Polynomial:   0.723
Domain:       0.670
Featuretools: 0.669
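Leaderboard submissions are limited, so it helps to estimate AUC locally before submitting. A minimal sketch using `cross_val_score` on synthetic data (the data and class imbalance here are stand-ins, not the notebook's actual features):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# synthetic, imbalanced stand-in for the preprocessed training matrix
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.92, 0.08], random_state=0)

lr = LogisticRegression(C=0.0001, class_weight='balanced')
# 5-fold cross-validated AUC, the same metric the competition uses
auc_scores = cross_val_score(lr, X, y, cv=5, scoring='roc_auc')
print(auc_scores.mean())
```

The mean of the fold AUCs gives a rough local estimate; because AUC is rank-based, even a heavily regularized model (C=0.0001) can score well above 0.5.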

Next, let's upgrade the algorithm and train on the same three datasets with a random forest.

Random Forest

from sklearn.ensemble import RandomForestClassifier

Polynomial

random_forest = RandomForestClassifier(n_estimators=100, random_state=55, verbose=1, n_jobs=-1)
random_forest.fit(poly_train, target)
# extract feature importances
poly_importance_feature_values = random_forest.feature_importances_
poly_importance_features = pd.DataFrame({'feature': poly_features,
                                         'importance': poly_importance_feature_values})
rf_poly_pred = random_forest.predict_proba(poly_test)[:, 1]
# submission
submit = Id.copy()
submit['TARGET'] = rf_poly_pred
submit.to_csv('rf_poly_submit.csv', index=False)

[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:  3.7min finished
[Parallel(n_jobs=4)]:  Done  42 tasks      | elapsed:    0.4s
[Parallel(n_jobs=4)]:  Done 100 out of 100 | elapsed:    0.9s finished

Feature importance

poly_importance_features = poly_importance_features.set_index(['feature'])

poly_importance_features.sort_values(by = 'importance').plot(kind='barh',figsize=(10, 120))

Based on the chart above, we can do some feature selection: drop the features that contribute nothing, which also reduces the dimensionality of the data. Then let's upgrade the algorithm once more, to one of machine learning's heavy hitters, the Light Gradient Boosting Machine.
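One simple way to act on the importance chart is to drop every feature the forest never used (importance exactly 0). A sketch on a toy importance table (the feature names here are invented for illustration):

```python
import pandas as pd

# toy version of poly_importance_features
imp = pd.DataFrame({'feature': ['EXT_SOURCE_2', 'AMT_CREDIT', 'FLAG_DOC_12', 'FLAG_DOC_17'],
                    'importance': [0.31, 0.12, 0.0, 0.0]}).set_index('feature')

# keep only the features the random forest actually used
useful = imp[imp['importance'] > 0].index.tolist()
print(useful)  # ['EXT_SOURCE_2', 'AMT_CREDIT']
```

The same boolean filter applied to the real `poly_importance_features` frame yields the column list to keep; the sections below instead drop the N least important features, which is the same idea with a fixed cutoff.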

from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score
import lightgbm as lgb
import gc
import numpy as np
import warnings
warnings.filterwarnings('ignore')

def model(features, test_features, n_folds=10):
    # extract the ID columns
    train_ids = features['SK_ID_CURR']
    test_ids = test_features['SK_ID_CURR']
    # TARGET
    labels = features[['TARGET']]
    # drop ID and TARGET
    features = features.drop(['SK_ID_CURR', 'TARGET'], axis=1)
    test_features = test_features.drop(['SK_ID_CURR'], axis=1)
    # feature names
    feature_names = list(features.columns)
    # (optional) DataFrame --> array
    # features = np.array(features)
    # test_features = np.array(test_features)
    # randomly split the train data into n_folds folds: train on n-1, validate on 1
    k_fold = KFold(n_splits=n_folds, shuffle=True, random_state=50)
    # test predictions
    test_predictions = np.zeros(test_features.shape[0])
    # validation (out-of-fold) predictions
    out_of_fold = np.zeros(features.shape[0])
    # record the scores of each fold
    valid_scores = []
    train_scores = []
    # iterate through each fold
    count = 0
    for train_indices, valid_indices in k_fold.split(features):
        # training data for the fold
        train_features = features.loc[train_indices, :]
        train_labels = labels.loc[train_indices, :]
        # validation data for the fold
        valid_features = features.loc[valid_indices, :]
        valid_labels = labels.loc[valid_indices, :]
        # create the model
        model = lgb.LGBMClassifier(n_estimators=10000, objective='binary',
                                   class_weight='balanced', learning_rate=0.05,
                                   reg_alpha=0.1, reg_lambda=0.1, subsample=0.8,
                                   n_jobs=-1, random_state=50)
        # train the model
        model.fit(train_features, train_labels, eval_metric='auc',
                  eval_set=[(valid_features, valid_labels), (train_features, train_labels)],
                  eval_names=['valid', 'train'], categorical_feature='auto',
                  early_stopping_rounds=100, verbose=200)
        # record the best iteration
        best_iteration = model.best_iteration_
        # test-set predictions, averaged over the folds
        test_predictions += model.predict_proba(test_features, num_iteration=best_iteration)[:, 1] / n_folds
        # validation-set predictions
        out_of_fold[valid_indices] = model.predict_proba(valid_features, num_iteration=best_iteration)[:, 1]
        # record the best scores
        valid_score = model.best_score_['valid']['auc']
        train_score = model.best_score_['train']['auc']
        valid_scores.append(valid_score)
        train_scores.append(train_score)
        # clean up memory
        gc.enable()
        del model, train_features, valid_features
        gc.collect()
        count += 1
        print("%d_fold is over" % count)
    # make the submission dataframe
    submission = pd.DataFrame({'SK_ID_CURR': test_ids, 'TARGET': test_predictions})
    # overall validation score
    valid_auc = roc_auc_score(labels, out_of_fold)
    # add the overall scores to the metrics
    valid_scores.append(valid_auc)
    train_scores.append(np.mean(train_scores))
    # fold names for the validation-score dataframe
    fold_names = list(range(n_folds))
    fold_names.append('overall')
    # dataframe of validation scores
    metrics = pd.DataFrame({'fold': fold_names, 'train': train_scores, 'valid': valid_scores})
    return submission, metrics

# load data
poly_train = pd.read_csv('data/poly_train_data.csv')
poly_test = pd.read_csv('data/poly_test_data.csv')
print('poly_train:', poly_train.shape)
print('poly_test:', poly_test.shape)

poly_train: (307511, 275)
poly_test: (48744, 274)

Select Features

Rank the features by importance, from smallest to largest:

poly_importance_features = poly_importance_features.sort_values(by = 'importance')

Drop these 20 features:

poly_importance_features.head(20).plot(kind = 'barh')

s_train_1 = poly_train.copy()
s_test_1 = poly_test.copy()
# names of the features to drop
drop_feature_names = poly_importance_features.index[:20]
# drop the 20 least important features
s_train_1 = s_train_1.drop(drop_feature_names, axis=1)
s_test_1 = s_test_1.drop(drop_feature_names, axis=1)
submit2, metrics2 = model(s_train_1, s_test_1, n_folds=5)

Training until validation scores don't improve for 100 rounds.
[200]   train's auc: 0.800686   valid's auc: 0.755447
[400]   train's auc: 0.831722   valid's auc: 0.755842
Early stopping, best iteration is:
[351]   train's auc: 0.824767   valid's auc: 0.756092
Training until validation scores don't improve for 100 rounds.
[200]   train's auc: 0.800338   valid's auc: 0.757507
[400]   train's auc: 0.831318   valid's auc: 0.757378
Early stopping, best iteration is:
[307]   train's auc: 0.818238   valid's auc: 0.757819
Training until validation scores don't improve for 100 rounds.
[200]   train's auc: 0.799557   valid's auc: 0.762719
Early stopping, best iteration is:
[160]   train's auc: 0.791849   valid's auc: 0.763023
Training until validation scores don't improve for 100 rounds.
[200]   train's auc: 0.80053    valid's auc: 0.758546
Early stopping, best iteration is:
[224]   train's auc: 0.804828   valid's auc: 0.758703
Training until validation scores don't improve for 100 rounds.
[200]   train's auc: 0.799826   valid's auc: 0.758312
[400]   train's auc: 0.831271   valid's auc: 0.758623
Early stopping, best iteration is:
[319]   train's auc: 0.819603   valid's auc: 0.758971

metrics2

submit2.to_csv('submit2.csv',index = False)

Leaderboard score: 0.734

Dropping a few features improved the result slightly. Let's try removing 30.

s_train_2 = poly_train.copy()
s_test_2 = poly_test.copy()
# names of the features to drop
drop_feature_names = poly_importance_features.index[:30]
# drop the 30 least important features
s_train_2 = s_train_2.drop(drop_feature_names, axis=1)
s_test_2 = s_test_2.drop(drop_feature_names, axis=1)
submit3, metrics3 = model(s_train_2, s_test_2, n_folds=5)

Training until validation scores don't improve for 100 rounds.
[200]   train's auc: 0.800547   valid's auc: 0.755442
Early stopping, best iteration is:
[267]   train's auc: 0.81211    valid's auc: 0.755868
Training until validation scores don't improve for 100 rounds.
[200]   train's auc: 0.80048    valid's auc: 0.757653
Early stopping, best iteration is:
[258]   train's auc: 0.81057    valid's auc: 0.758107
Training until validation scores don't improve for 100 rounds.
[200]   train's auc: 0.799261   valid's auc: 0.76291
Early stopping, best iteration is:
[189]   train's auc: 0.797314   valid's auc: 0.762962
Training until validation scores don't improve for 100 rounds.
[200]   train's auc: 0.800499   valid's auc: 0.758385
Early stopping, best iteration is:
[202]   train's auc: 0.800851   valid's auc: 0.758413
Training until validation scores don't improve for 100 rounds.
[200]   train's auc: 0.799977   valid's auc: 0.758234
Early stopping, best iteration is:
[284]   train's auc: 0.814454   valid's auc: 0.758612

metrics3

submit3.to_csv('submit3.csv',index = False)

The score is 0.733, essentially unchanged. Using only the main application table will never be enough; this round was purely for fun.
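To push past this ceiling, the competition's auxiliary tables (bureau.csv, previous_application.csv, and so on) have to be aggregated to one row per client and merged into the main table. A minimal sketch of that pattern with invented toy data (the real bureau table has many more columns):

```python
import pandas as pd

# toy main table and a toy child table, both keyed by SK_ID_CURR
app = pd.DataFrame({'SK_ID_CURR': [1, 2, 3], 'AMT_CREDIT': [1000, 2000, 1500]})
bureau = pd.DataFrame({'SK_ID_CURR': [1, 1, 2],
                       'AMT_CREDIT_SUM': [500, 700, 300]})

# aggregate the child table to one row per client
agg = bureau.groupby('SK_ID_CURR')['AMT_CREDIT_SUM'].agg(['mean', 'count'])
agg.columns = ['bureau_credit_mean', 'bureau_count']

# left-join onto the main table; clients with no bureau rows get NaN
merged = app.merge(agg, on='SK_ID_CURR', how='left')
print(merged.shape)  # (3, 4)
```

The left join keeps every client from the main table, so the new columns need the same median imputation as the rest of the features before training.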
