失眠网 > 信贷违约风险评估模型（下篇）：机器学习模型

信贷违约风险评估模型（下篇）：机器学习模型

时间：2022-12-22 18:09:46

机器学习训练营——机器学习爱好者的自由交流空间（入群联系qq：2279055353）

机器学习模型

Logistic回归模型

作为一个基础模型，我们将使用scikit-learn库的LogisticRegression, 建立Logistic模型。为此，我们将使用所有的特征，我们也将填补缺失值，归一化特征。

from sklearn.preprocessing import MinMaxScaler, Imputer# Drop the target from the training dataif 'TARGET' in app_train:train = app_train.drop(columns = ['TARGET'])else:train = app_train.copy()# Feature namesfeatures = list(train.columns)# Copy of the testing datatest = app_test.copy()# Median imputation of missing valuesimputer = Imputer(strategy = 'median')# Scale each feature to 0-1scaler = MinMaxScaler(feature_range = (0, 1))# Fit on the training dataimputer.fit(train)# Transform both training and testing datatrain = imputer.transform(train)test = imputer.transform(app_test)# Repeat with the scalerscaler.fit(train)train = scaler.transform(train)test = scaler.transform(test)print('Training data shape: ', train.shape)print('Testing data shape: ', test.shape)

Training data shape: (307511, 240)

Testing data shape: (48744, 240)

我们只改变一个默认参数，正则参数C, 它用来控制过度拟合程度，降低它的值将减小过度拟合度。在这里，我们使用常见的scikit-learn建模语法规则：

产生模型

使用.fit训练模型

使用.predict_proba在检验集上预测客户不还款的概率

from sklearn.linear_model import LogisticRegression# Make the model with the specified regularization parameterlog_reg = LogisticRegression(C = 0.0001)# Train on the training datalog_reg.fit(train, train_labels)# Make predictions# Make sure to select the second column onlylog_reg_pred = log_reg.predict_proba(test)[:, 1]

改善模型：随机森林

建立了基础模型后，我们考虑升级算法。让我们在相同的训练集上使用随机森林(Random Forest)算法，看看预测效果。实践证实，当使用几百棵树训练模型时，随机森林是一个更有力的模型。在这里，我们使用100棵树。

from sklearn.ensemble import RandomForestClassifier# Make the random forest classifierrandom_forest = RandomForestClassifier(n_estimators = 100, random_state = 50, verbose = 1, n_jobs = -1)# Train on the training datarandom_forest.fit(train, train_labels)# Extract feature importancesfeature_importance_values = random_forest.feature_importances_feature_importances = pd.DataFrame({'feature': features, 'importance': feature_importance_values})# Make predictions on the test datapredictions = random_forest.predict_proba(test)[:, 1]

特征工程预测

了解多项式特征和域知识特征是否改善模型的唯一方法，是在这些特征上训练一个检验模型，对比有无这些特征对预测准确率的影响。

poly_features_names = list(app_train_poly.columns)# Impute the polynomial featuresimputer = Imputer(strategy = 'median')poly_features = imputer.fit_transform(app_train_poly)poly_features_test = imputer.transform(app_test_poly)# Scale the polynomial featuresscaler = MinMaxScaler(feature_range = (0, 1))poly_features = scaler.fit_transform(poly_features)poly_features_test = scaler.transform(poly_features_test)random_forest_poly = RandomForestClassifier(n_estimators = 100, random_state = 50, verbose = 1, n_jobs = -1)# Train on the training datarandom_forest_poly.fit(poly_features, train_labels)# Make predictions on the test datapredictions = random_forest_poly.predict_proba(poly_features_test)[:, 1]

模型解释：特征重要性

我们通过查看随机森林模型的特征重要性，反映哪些变量是最相关的。

def plot_feature_importances(df):"""Plot importances returned by a model. This can work with any measure offeature importance provided that higher importance is better. Args:df (dataframe): feature importances. Must have the features in a columncalled `features` and the importances in a column called `importanceReturns:shows a plot of the 15 most importance featuresdf (dataframe): feature importances sorted by importance (highest to lowest) with a column for normalized importance"""# Sort features according to importancedf = df.sort_values('importance', ascending = False).reset_index()# Normalize the feature importances to add up to onedf['importance_normalized'] = df['importance'] / df['importance'].sum()# Make a horizontal bar chart of feature importancesplt.figure(figsize = (10, 6))ax = plt.subplot()# Need to reverse the index to plot most important on topax.barh(list(reversed(list(df.index[:15]))), df['importance_normalized'].head(15), align = 'center', edgecolor = 'k')# Set the yticks and labelsax.set_yticks(list(reversed(list(df.index[:15]))))ax.set_yticklabels(df['feature'].head(15))# Plot labelingplt.xlabel('Normalized Importance'); plt.title('Feature Importances')plt.show()return df# Show the feature importances for the default featuresfeature_importances_sorted = plot_feature_importances(feature_importances)

最重要的特征是我们在EDA阶段分析的EXT_SOURCEandDAYS_BIRTH. 我们还看到，仅仅有少数的特征对模型是重要的，这说明很多不重要的特征可以扔掉。特征重要性对于解释模型或降维来说，并不是一个特别复杂的方法，但是它提示我们在建模时应该考虑哪些因素。

feature_importances_domain_sorted = plot_feature_importances(feature_importances_domain)

结论

在这个案例里，我们首先要理解数据，理解任务，明确测度。然后，我们作探索性数据分析(EDA)尝试识别变量的关系、趋势、异常。在此过程中，我们要做必要的预处理，例如，编码类别变量、填补缺失值、归一化特征。然后，我们使用已有数据加工新特征，并在随后的模型里检验它们的作用。一旦完成了数据探索、数据准备和特征工程，我们先建立一个基准模型，该模型主要用于对比后续的改善模型，比较改善的效果。

我们总结出一个机器学习项目的任务流程：

理解问题和数据

数据清洗与格式化

探索性数据分析

基准模型

改善模型

模型解释

如果觉得《信贷违约风险评估模型（下篇）：机器学习模型》对你有帮助，请点赞、收藏，并留下你的观点哦！

本内容不代表本网观点和政治立场，如有侵犯你的权益请联系我们处理。

网友评论

网友评论仅供其表达个人看法，并不表明网站立场。