
7. Python Data Analysis Project: Bank Customer Churn Analysis

Posted: 2021-01-17 14:21:39


1. Summary

A predictive data analysis project.

Viewing counts: bar chart

Viewing proportions: pie chart

Viewing how data is distributed across value ranges: probability density plot

Viewing correlations: bar chart, heat map

Distribution analysis: categorical count plot (countplot); distribution plot, a histogram with a fitted density curve (distplot)

Natural language processing project: recommender system

2. Bank Customer Segments and Product Categories

Bank customer segments

Personal customers

The bank's personal-customer business centers on the sensible management of customers' personal finances: it provides deposit and withdrawal services, small loans, investment and wealth-management agency services, information consulting, and other intermediary services, earning returns for customers and helping them manage risk while improving the bank's own profitability.

Corporate customers

Corporate customers are the enterprises, public institutions, and government agencies that do business with the bank, with enterprises forming the main body. They bring the bank large volumes of deposits, loans, and fee-based business, and are an important source of its profit.

Retail customers

Including consumer-credit customers, credit-card customers, VIP wealth-management customers, and so on.

Bank product categories

Credit assets

Credit loans, mortgage loans, guaranteed loans, loan securitization

Liability business

Demand deposits, time deposits, savings deposits, negotiable certificates of deposit, and other types

3. Building a Customer Churn Early-Warning Model

3.1 Business Significance of a Churn Early-Warning Model

Strictly speaking, customer churn means that a customer terminates all business with the bank and closes the account. In practice, however, an individual business unit may define churn for itself as the customer terminating all, or only some, of the business handled by that unit.

Interviews and surveys with experts and financial-industry insiders indicate that customer churn at commercial banks is fairly severe: at domestic commercial banks the churn rate can reach 20% or more, while acquiring a new customer can cost up to five times as much as retaining an existing one.

It is therefore especially important to mine the information that influences churn from the mass of customer transaction records and to build an effective churn early-warning system.

3.2 Main Causes of Customer Churn

Price-driven churn, product-driven churn, service-driven churn, market-driven churn, promotion-driven churn, technology-driven churn, politically driven churn

3.3 Basic Ways to Maintain Customer Relationships

Tracking systems, product follow-up, expanding sales, maintenance visits, upkeep of relationship mechanisms

3.4 Build a Quantitative Model to Predict a Customer Group's Potential Churn Risk

Commonly used risk factors: the number and types of products the customer holds; the customer's age and gender; geographic region; product category; the time between transactions; marketing and promotion tactics; the bank's manner and attitude of service

4. Data Description

import warnings
warnings.filterwarnings('ignore')

import math
import numbers
import random

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import metrics
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split, GridSearchCV

%matplotlib inline

# Read the bank's internal data. The key column is CHURN_CUST_IND,
# the churn indicator (1 = churned, 0 = retained).
bankChurn = pd.read_csv('bankChurn.csv')
externaData = pd.read_csv('ExternalData.csv')

bankChurn.head()
bankChurn.describe()                      # numerical columns
bankChurn.describe(include=np.object_)    # string (object) columns

externaData.head()
externaData.describe()                    # numerical columns
externaData.describe(include=np.object_)  # string (object) columns
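Before plotting anything, it is worth checking how imbalanced the churn label is; a one-line sketch using the data loaded above:

# Share of churned (1) vs. retained (0) customers
print(bankChurn['CHURN_CUST_IND'].value_counts(normalize=True))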

5. Data Analysis

5.1 Plotting the Distribution of Each Feature

# Plot the distribution of every feature.
# dataset: the DataFrame to plot; cols: number of subplot columns per row
def plot_distribution(dataset, cols=5, width=20, height=15, hspace=0.2, wspace=0.5):
    plt.style.use('seaborn-whitegrid')
    fig = plt.figure(figsize=(width, height))
    fig.subplots_adjust(left=None, bottom=None, right=None, top=None,
                        wspace=wspace, hspace=hspace)
    rows = math.ceil(float(dataset.shape[1]) / cols)  # number of subplot rows
    for i, column in enumerate(dataset.columns):  # iterate over every feature
        ax = fig.add_subplot(rows, cols, i + 1)
        ax.set_title(column)
        if dataset.dtypes[column] == np.object_:  # categorical feature
            g = sns.countplot(y=column, data=dataset)  # count plot
            substrings = [s.get_text()[:18] for s in g.get_yticklabels()]
            g.set(yticklabels=substrings)
            plt.xticks(rotation=25)
        else:  # numerical feature
            g = sns.distplot(dataset[column])  # histogram with density curve
            plt.xticks(rotation=25)

plot_distribution(bankChurn, cols=5, width=20, height=160, hspace=0.45, wspace=0.5)

5.2 Plotting Numerical Variables

# Plot a numerical variable against the churn indicator:
# analyze the effect of column col on the target in the dataset df.
def NumVarPerf(df, col, target, truncation=False):
    '''
    :param df: dataset containing the numerical independent variable and the dependent variable
    :param col: independent variable of numerical type
    :param target: dependent variable, a 0-1 class
    :param truncation: whether to truncate outliers
    :return: the descriptive statistics
    '''
    # keep only the rows where col is not NaN (NaN != NaN), with col and target
    validDf = df.loc[df[col] == df[col]][[col, target]]
    # percentage of valid (non-missing) records
    validRcd = validDf.shape[0] * 1.0 / df.shape[0]
    validRcdFmt = "%.2f%%" % (validRcd * 100)
    # descriptive statistics of the numerical column
    descStats = validDf[col].describe()
    mu = "%.2e" % descStats['mean']
    std = "%.2e" % descStats['std']
    maxVal = "%.2e" % descStats['max']
    minVal = "%.2e" % descStats['min']
    # show the distribution separately for churned / retained customers
    x = validDf.loc[validDf[target] == 1][col]  # churned customers
    y = validDf.loc[validDf[target] == 0][col]  # retained customers
    xweights = 100.0 * np.ones_like(x) / x.size
    yweights = 100.0 * np.ones_like(y) / y.size
    # if requested, cap values at the 95th percentile
    if truncation == True:
        pcnt95 = np.percentile(validDf[col], 95)  # 95th percentile of col
        x = x.map(lambda v: min(v, pcnt95))  # values above it are outliers: cap them
        y = y.map(lambda v: min(v, pcnt95))
    fig, ax = plt.subplots()
    # weights: each element counts with its weight, so each group's bars sum to 100%
    ax.hist(x, weights=xweights, alpha=0.5, label='Attrition')  # churned
    ax.hist(y, weights=yweights, alpha=0.5, label='Retained')   # retained
    titleText = ('Histogram of ' + col + '\n' + 'valid pcnt =' + validRcdFmt +
                 ', Mean =' + mu + ', Std=' + std + '\n max=' + maxVal + ', min=' + minVal)
    ax.set(title=titleText, ylabel='% of Dataset in Bin')
    ax.margins(0.05)
    ax.set_ylim(bottom=0)
    plt.legend(loc='upper right')
    plt.show()

# Without truncating outliers, the plot does not show the relationship
# between savings balance and churn well.
NumVarPerf(bankChurn, 'SAV_CUR_ALL_BAL', 'CHURN_CUST_IND', truncation=False)
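With truncation enabled, values above the 95th percentile are capped, which usually makes the comparison between churned and retained customers far more readable:

# Truncate outliers at the 95th percentile before plotting
NumVarPerf(bankChurn, 'SAV_CUR_ALL_BAL', 'CHURN_CUST_IND', truncation=True)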

5.3 Plotting Categorical Variables

# Plot a categorical variable against the churn indicator.
def CharVarPerf(df, col, target):
    '''
    :param df: dataset containing the categorical independent variable and the dependent variable
    :param col: independent variable of categorical type
    :param target: dependent variable, a 0-1 class
    :return: the descriptive statistics
    '''
    validDf = df.loc[df[col] == df[col]][[col, target]]  # rows where col is not NaN
    validRcd = validDf.shape[0] * 1.0 / df.shape[0]
    recdNum = validDf.shape[0]
    validRcdFmt = "%.2f%%" % (validRcd * 100)
    freqDict = {}
    churnRateDict = {}
    # for each category, compute its frequency and its churn rate
    for v in set(validDf[col]):  # distinct values of col (set removes duplicates)
        vDf = validDf.loc[validDf[col] == v]
        freqDict[v] = vDf.shape[0] * 1.0 / recdNum                # share of this category
        churnRateDict[v] = sum(vDf[target]) * 1.0 / vDf.shape[0]  # churn rate of this category
    descStats = pd.DataFrame({'percent': freqDict, 'churn rate': churnRateDict})
    fig = plt.figure()            # create matplotlib figure
    ax = fig.add_subplot(111)     # create matplotlib axes
    ax2 = ax.twinx()              # second axes sharing the same x-axis
    plt.title('The percentage and churn rate for ' + col +
              '\n valid pcnt =' + validRcdFmt)
    descStats['churn rate'].plot(kind='line', color='red', ax=ax)
    descStats.percent.plot(kind='bar', color='blue', ax=ax2, width=0.2, position=1)
    ax.set_ylabel('churn rate')
    ax2.set_ylabel('percentage')
    plt.show()

# relationship between GENDER_CD (gender code) and churn
CharVarPerf(bankChurn, 'GENDER_CD', 'CHURN_CUST_IND')
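The same analysis extends to every categorical column. A short sketch, under the assumption that all object-dtype columns in bankChurn are categorical:

# Run the categorical analysis for every object-dtype column
for c in bankChurn.select_dtypes(include='object').columns:
    CharVarPerf(bankChurn, c, 'CHURN_CUST_IND')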

6. Data Preprocessing

6.1 Filling Missing Values (for Numerical Data)

# Fill missing values in a numerical column.
def MakeupMissing(df, col, types, method):
    '''
    df: the DataFrame
    col: column name
    types: variable type ("Continuous" for numerical columns)
    method: fill strategy ("Mean" or "Random")
    '''
    validDf = df.loc[df[col] == df[col]][[col]]  # rows where col is not NaN
    if validDf.shape[0] == df.shape[0]:  # no missing values at all
        return "column {} has no missing values".format(col)
    missingList = [i for i in df[col]]  # all values of col, NaNs included
    if types == "Continuous":  # only numerical columns are handled
        if method not in ["Mean", "Random"]:
            return "the fill method must be Mean or Random"
        descStats = validDf[col].describe()  # statistics of the valid values
        mu = descStats["mean"]
        std = descStats["std"]
        maxVal = descStats["max"]
        # cap outliers using the 3-sigma rule
        if maxVal > mu + 3 * std:  # does the maximum valid value exceed the bound?
            for i in list(validDf.index):
                if validDf.loc[i, col] > mu + 3 * std:  # is this row an outlier?
                    validDf.loc[i, col] = mu + 3 * std  # replace the outlier
            mu = validDf[col].describe()['mean']  # recompute the mean after capping
        for i in list(df.index):  # assumes a default RangeIndex (0..N-1)
            if df.loc[i, col] != df.loc[i, col]:  # the current value is NaN
                if method == "Mean":
                    missingList[i] = mu  # fill with the (capped) mean
                elif method == "Random":  # fill with a random valid value
                    missingList[i] = random.sample(list(validDf[col]), 1)[0]
    print("finished filling the missing values of column {}".format(col))
    return missingList
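As an illustrative call (using ASSET_MON_AVG_BAL, one of the balance columns that appears later in the feature-engineering step), a numerical column could be filled like this. Note the function returns a message string when the column has no missing values, so only assign the result back when NaNs are actually present, as section 6.5 does:

# Illustrative: fill missing values with the (outlier-capped) column mean
if bankChurn['ASSET_MON_AVG_BAL'].isnull().any():
    bankChurn['ASSET_MON_AVG_BAL'] = MakeupMissing(bankChurn, 'ASSET_MON_AVG_BAL', 'Continuous', 'Mean')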

6.2 Numerical Encoding

## Numerically encode a categorical variable: replace each distinct value
## with the churn rate observed for that value.
def Encoder(df, col, target):
    encoder = {}
    for v in set(df[col]):  # distinct values of col (set removes duplicates)
        if v == v:  # v is not NaN
            subDf = df[df[col] == v]  # rows taking this value
        else:  # v is NaN
            xList = list(df[col])
            # positions of all NaN entries in col
            nanInd = [i for i in range(len(xList)) if xList[i] != xList[i]]
            subDf = df.loc[nanInd]  # all rows with a missing value
        # churn rate for this value of col
        encoder[v] = sum(subDf[target]) * 1.0 / subDf.shape[0]
    # map every value to its churn rate (the NaN lookup relies on object
    # columns storing the single np.nan object)
    newCol = [encoder[i] for i in df[col]]
    return newCol
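For example, the gender code column could be encoded with its observed churn rates (an illustrative call, mirroring how section 6.5 applies the function):

# Illustrative: replace each gender code with that group's churn rate
bankChurn['GENDER_CD'] = Encoder(bankChurn, 'GENDER_CD', 'CHURN_CUST_IND')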

6.3 Ratio of Two Variables

## Compute the element-wise ratio of two columns, guarding against division by zero.
def ColumnDivide(df, colNumerator, colDenominator):
    N = df.shape[0]  # number of rows
    rate = [0] * N
    xNum = list(df[colNumerator])
    xDenom = list(df[colDenominator])
    for i in range(N):
        if xDenom[i] > 0:
            rate[i] = xNum[i] * 1.0 / xDenom[i]  # fill in the ratio
        else:
            rate[i] = 0  # denominator is zero or negative: default to 0
    return rate

6.4 Merging the Datasets

# Merge the internal and external datasets on the customer ID
ALLData = pd.merge(bankChurn, externaData, on='CUST_ID')
ALLData
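A quick sanity check on the merge may be worthwhile, since an inner join on CUST_ID silently drops customers missing from either file (a sketch):

# Compare row counts before and after the merge, and check CUST_ID uniqueness
print(bankChurn.shape, externaData.shape, ALLData.shape)
print(ALLData['CUST_ID'].duplicated().sum())  # expect 0 duplicates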

6.5 Applying the Preprocessing Functions

modelData = ALLData.copy()
indepCols = list(modelData.columns)
indepCols.remove('CHURN_CUST_IND')  # drop the target column
indepCols.remove('CUST_ID')         # drop the customer ID column
except_var = []  # columns that raise an exception during processing
for var in list(indepCols):  # iterate over a copy, since indepCols may shrink
    try:
        x0 = list(set(modelData[var]))  # distinct values of the current column
        if var == 'forgntvl':  # special handling for forgntvl (overseas travel flag)
            x00 = [np.nan]
            for i in x0:  # keep np.nan plus the distinct non-NaN values
                if i not in x00 and i == i:
                    x00.append(i)
            x0 = x00
        if len(x0) == 1:  # constant column: remove it
            print('Remove the constant column {}'.format(var))
            indepCols.remove(var)
            continue
        x = [i for i in x0 if i == i]  # distinct non-NaN values
        if isinstance(x[0], numbers.Real) and len(x) > 4:  # numerical column
            if np.nan in x0:  # fill missing values if there are any
                print('nan is found in column {}, so we need to make up the missing value'.format(var))
                modelData[var] = MakeupMissing(modelData, var, 'Continuous', 'Random')
        else:  # categorical column: encode with churn rates
            print('Encode {} using numerical representative'.format(var))
            modelData[var] = Encoder(modelData, var, 'CHURN_CUST_IND')
    except Exception:
        print("something is wrong with {}".format(var))
        except_var.append(var)
        continue
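After the loop finishes, it is worth inspecting which columns failed so they can be fixed by hand or excluded; one option, as a sketch:

# Columns that raised an exception during preprocessing
print(except_var)
# e.g. drop them before modeling (uncomment if appropriate)
# modelData = modelData.drop(columns=except_var)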

7. Feature Engineering

Common methods for generating new features (feature derivation):

Compute business-relevant ratios.

Sum related columns where the business logic supports it.

Take a representative value across several related columns (here, the max).

Drop features that are strongly correlated with, or fully determined by, others.

# Ratio features: average transaction amounts (amount / count)
modelData['AVG_LOCAL_CUR_TRANS_TX_AMT'] = ColumnDivide(modelData, 'LOCAL_CUR_TRANS_TX_AMT', 'LOCAL_CUR_TRANS_TX_NUM')
modelData['AVG_LOCAL_CUR_LASTSAV_TX_AMT'] = ColumnDivide(modelData, 'LOCAL_CUR_LASTSAV_TX_AMT', 'LOCAL_CUR_LASTSAV_TX_NUM')

#### Row-wise maximum of five volatility (slope) columns as the new feature volatilityMax
maxValueFeatures = ['LOCAL_CUR_SAV_SLOPE', 'LOCAL_BELONEYR_FF_SLOPE', 'LOCAL_OVEONEYR_FF_SLOPE',
                    'LOCAL_SAV_SLOPE', 'SAV_SLOPE']
modelData['volatilityMax'] = modelData[maxValueFeatures].apply(max, axis=1)

## Drop the redundant feature LOCAL_CUR_MON_AVG_BAL_PROP:
## current-deposit share of the monthly average balance = 1 - time-deposit share
del modelData['LOCAL_CUR_MON_AVG_BAL_PROP']

## Sum related column pairs into new aggregate features
sumupCols0 = ['LOCAL_CUR_MON_AVG_BAL', 'LOCAL_FIX_MON_AVG_BAL']
sumupCols1 = ['LOCAL_CUR_WITHDRAW_TX_NUM', 'LOCAL_FIX_WITHDRAW_TX_NUM']
sumupCols2 = ['LOCAL_CUR_WITHDRAW_TX_AMT', 'LOCAL_FIX_WITHDRAW_TX_AMT']
sumupCols3 = ['COUNTER_NOT_ACCT_TX_NUM', 'COUNTER_ACCT_TX_NUM']
sumupCols4 = ['ATM_ALL_TX_NUM', 'COUNTER_ALL_TX_NUM']
sumupCols5 = ['ATM_ACCT_TX_NUM', 'COUNTER_ACCT_TX_NUM']
sumupCols6 = ['ATM_ACCT_TX_AMT', 'COUNTER_ACCT_TX_AMT']
sumupCols7 = ['ATM_NOT_ACCT_TX_NUM', 'COUNTER_NOT_ACCT_TX_NUM']
modelData['TOTAL_LOCAL_MON_AVG_BAL'] = modelData[sumupCols0].apply(sum, axis=1)
modelData['TOTAL_WITHDRAW_TX_NUM'] = modelData[sumupCols1].apply(sum, axis=1)
modelData['TOTAL_WITHDRAW_TX_AMT'] = modelData[sumupCols2].apply(sum, axis=1)
modelData['TOTAL_COUNTER_TX_NUM'] = modelData[sumupCols3].apply(sum, axis=1)
modelData['TOTAL_ALL_TX_NUM'] = modelData[sumupCols4].apply(sum, axis=1)
modelData['TOTAL_ACCT_TX_NUM'] = modelData[sumupCols5].apply(sum, axis=1)
modelData['TOTAL_ACCT_TX_AMT'] = modelData[sumupCols6].apply(sum, axis=1)
modelData['TOTAL_NOT_ACCT_TX_NUM'] = modelData[sumupCols7].apply(sum, axis=1)

## More ratio features from matching numerator/denominator column pairs
numeratorCols = ['LOCAL_SAV_CUR_ALL_BAL', 'SAV_CUR_ALL_BAL', 'ASSET_CUR_ALL_BAL',
                 'LOCAL_CUR_WITHDRAW_TX_NUM', 'LOCAL_CUR_WITHDRAW_TX_AMT',
                 'COUNTER_NOT_ACCT_TX_NUM', 'ATM_ALL_TX_NUM', 'ATM_ACCT_TX_AMT',
                 'ATM_NOT_ACCT_TX_NUM']
denominatorCols = ['LOCAL_SAV_MON_AVG_BAL', 'SAV_MON_AVG_BAL', 'ASSET_MON_AVG_BAL',
                   'TOTAL_WITHDRAW_TX_NUM', 'TOTAL_WITHDRAW_TX_AMT',
                   'TOTAL_COUNTER_TX_NUM', 'TOTAL_ACCT_TX_NUM', 'TOTAL_ACCT_TX_AMT',
                   'TOTAL_NOT_ACCT_TX_NUM']
newColName = ["RATIO_" + str(i) for i in range(len(numeratorCols))]
# compute each ratio and add it as a new feature
for i in range(len(numeratorCols)):
    modelData[newColName[i]] = ColumnDivide(modelData, numeratorCols[i], denominatorCols[i])
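As noted in the summary, a heat map is a convenient way to check the derived ratio features for strong pairwise correlations before modeling; a minimal sketch using the newColName list built above:

# Correlation heat map of the derived RATIO_ features
plt.figure(figsize=(8, 6))
sns.heatmap(modelData[newColName].corr(), annot=True, cmap='coolwarm')
plt.show()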

8. Modeling

For modeling, this project reads modelData.csv, a version of the data to which further, more detailed feature engineering has already been applied, and trains a GradientBoostingClassifier.

# Build the model from modelData.csv (which already includes the
# more detailed feature engineering)
modelData = pd.read_csv("modelData.csv")
allFeatures = list(modelData.columns)  # all column names
# drop the columns that are not model inputs
allFeatures.remove("CUST_ID")          # customer ID
allFeatures.remove("CHURN_CUST_IND")   # churn label
# split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    modelData[allFeatures], modelData["CHURN_CUST_IND"], random_state=10)
# fit a gradient boosting classifier
gbc = GradientBoostingClassifier(random_state=10)
gbc.fit(X_train, y_train)     # train on the training set
y_pred = gbc.predict(X_test)  # predict on the test set
print("accuracy on the test set:", metrics.accuracy_score(y_test, y_pred))
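Gradient boosting also exposes per-feature importances, which helps relate the fitted model back to the churn risk factors of section 3.4; a short sketch:

# Ten most important features according to the fitted model
importances = pd.Series(gbc.feature_importances_, index=allFeatures)
print(importances.sort_values(ascending=False).head(10))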

9. Hyperparameter Tuning

Tuning means adjusting a model's hyperparameters to achieve better predictive accuracy.

This project uses a GridSearchCV object to search for the best hyperparameters via cross-validation.

# Cross-validated hyperparameter search: define the grid of
# GradientBoostingClassifier parameters to explore
param_test = {'n_estimators': range(20, 81, 10),
              'max_depth': range(3, 8, 2),
              'min_samples_split': range(100, 500, 200)}
# grid search with 5-fold cross-validation, scored by ROC AUC
gsearch1 = GridSearchCV(
    estimator=GradientBoostingClassifier(learning_rate=0.1, min_samples_split=300,
                                         min_samples_leaf=20, max_depth=8,
                                         max_features='sqrt', subsample=0.8,
                                         random_state=10),
    param_grid=param_test, scoring='roc_auc', cv=5)
gsearch1.fit(X_train, y_train)  # search on the training set
# best parameter combination and its cross-validated score
gsearch1.best_params_, gsearch1.best_score_

Finally, set the best parameters found on the model and evaluate it on the test set.

# Refit the model with the best parameters found by the search
gbc = GradientBoostingClassifier(learning_rate=0.1, min_samples_split=100,
                                 min_samples_leaf=20, max_depth=3,
                                 max_features='sqrt', subsample=0.8,
                                 random_state=10, n_estimators=60)
gbc.fit(X_train, y_train)     # train on the training set
y_pred = gbc.predict(X_test)  # predict on the test set
print("accuracy on the test set:", metrics.accuracy_score(y_test, y_pred))
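Because the grid search optimizes ROC AUC, and accuracy alone can be misleading when the churn classes are imbalanced, it makes sense to report AUC on the test set as well; a sketch:

# ROC AUC on the test set, from predicted churn probabilities
y_proba = gbc.predict_proba(X_test)[:, 1]
print("ROC AUC on the test set:", metrics.roc_auc_score(y_test, y_proba))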
