
What's the difference between high-price houses and low-price houses on Airbnb?

Time: 2021-10-26 06:43:25


Analysis Of AirBNB

1. Business understanding
2. Data understanding
2.1 Load the data
2.2 Preview the data
3. Data preparation - Data cleaning
3.1 First process
3.2 Choose variables to keep observing
3.3 Variable transformation
3.4 Numerical variable processing
3.5 Categorical variable processing
3.6 Review again
4. EDA
Question 1: What's the difference between high-price houses and low-price houses?
5. Build the model
Question 2: If you host a low-price or high-price house, what should you do to improve the review score value?
Question 3: Which features influence whether a host is a superhost, for high-price and for low-price houses?

1. Business understanding

The problem I want to solve:

I split the Seattle houses into two groups by price. High-price houses cost more than the median price (119); low-price houses cost the median price (119) or less.

Then I want to find out:

Question 1. What's the difference between high-price houses and low-price houses?

Question 2. If you host a low-price or high-price house, what should you do to improve the review score value?

Question 3. If we are hosts and want to become superhosts, what should we do as a high-price host or as a low-price host?

import numbers
import random
from datetime import datetime

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.ensemble import (AdaBoostRegressor, GradientBoostingRegressor,
                              RandomForestClassifier, RandomForestRegressor)
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler
# import ImputingValues as t
# from mpl_toolkits.basemap import Basemap
# from helper import *

# Show every column and row when a DataFrame is displayed
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
%matplotlib inline

2. Data understanding

2.1 Load the data

path = 'D:/Code/Udacity/02_DataScientist/Write_A_Data_Science_Blog_Post/My_Analysis_Of_ArBNB_new/data/Seattle_AirBNB_Data/'
df_Seattle_listings = pd.read_csv(path + 'listings.csv')
df_Seattle_listings.head(3)

2.2 Preview the data

The data are mainly divided into the following aspects:

Host information

host_response_time,host_response_rate,host_is_superhost,host_listings_count,host_total_listings_count

House hardware information

neighbourhood_group_cleansed,zipcode,property_type,room_type,accommodates,bathrooms,bedrooms

House other information

price,security_deposit,cleaning_fee,minimum_nights,maximum_nights,availability_365,instant_bookable,cancellation_policy

House score information

review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value

def Value_counts(das, nhead=5):
    '''One-row table with the nhead most frequent values of a column,
    their counts, the null count and the count of all other values.'''
    counts = das.value_counts().head(nhead)
    # Pad with NaN when the column has fewer than nhead distinct values
    pad = [np.nan] * (nhead - len(counts))
    row = {}
    for i, v in enumerate(list(counts.index) + pad):
        row['value {}'.format(i)] = v
    for i, f in enumerate(list(counts.values) + pad):
        row['freq {}'.format(i)] = f
    nnull = das.isnull().sum()
    row['freqNull'] = nnull
    row['freqOther'] = das.shape[0] - nnull - counts.sum()
    return pd.DataFrame(row, index=[das.name])

def Summary(da):
    '''Per-column summary: dtype, non-null count, describe() statistics
    and the most frequent values.'''
    op = pd.concat([pd.DataFrame({"type": da.dtypes, "n": da.notnull().sum(axis=0)}),
                    da.describe().T.iloc[:, 1:],
                    pd.concat([Value_counts(da.loc[:, i]) for i in da.columns])],
                   axis=1).loc[da.columns]
    op.index.name = "Columns"
    return op

def MissingCategorial(df, x):
    # Share of missing values in a categorical column (NaN != NaN)
    missing_vals = df[x].map(lambda x: int(x != x))
    return sum(missing_vals) * 1.0 / df.shape[0]

def MissingContinuous(df, x):
    # Share of missing values in a numerical column
    missing_vals = df[x].map(lambda x: int(np.isnan(x)))
    return sum(missing_vals) * 1.0 / df.shape[0]

df_Seattle_listings_summary = Summary(df_Seattle_listings).reset_index()
df_Seattle_listings_summary.to_csv(path + 'df_Seattle_listings_summary.csv')
df_Seattle_listings_summary

The table above shows, for each column:

n. The number of non-null values in the column.

type. The dtype of the column.

mean, std, min, max. The mean, standard deviation, minimum and maximum of the column; for object-type columns these are null.

25%, 50%, 75%. The quartiles of the column.

value 0, value 1, value 2, value 3, value 4. The five most frequent values of the column.

freq 0, freq 1, freq 2, freq 3, freq 4. The counts of those five most frequent values.

freqNull, freqOther. The count of null values, and the count of all remaining values outside the top five.
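The value/freq/freqNull/freqOther columns can be sketched on a single toy column. The data below is made up for illustration, and the sketch keeps only the top 2 values instead of the top 5 so the result stays readable; it is not the `Value_counts` helper itself, just the same arithmetic.

```python
import pandas as pd

# Made-up column with one missing entry, for illustration only.
s = pd.Series(["House", "Apartment", "House", None, "Condo", "House"],
              name="property_type")

counts = s.value_counts()        # counts of non-null values, most frequent first
top = counts.head(2)             # the post keeps the top 5; 2 is enough here
n_null = int(s.isnull().sum())   # corresponds to freqNull
# freqOther: non-null rows whose value is not among the top values
n_other = int(len(s) - n_null - top.sum())

print(top.to_dict(), n_null, n_other)
```

The three pieces always add up to the number of rows, which is a quick sanity check for the summary table.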

Discussion:

From the table above, we can see that several features hold a single value, have a high miss rate, or are dominated by one value. Those features add little to the analysis, so we process them first.

3. Data preparation - Data cleaning

3.1 First process.

1. Single-value process. If a feature has only one unique value, it has no value for our analysis. In the end we delete scrape_id, experiences_offered, and so on.

2. Null-value process. If a feature has a miss rate above 0.85, it has no value for our analysis. In the end we delete thumbnail_url, xl_picture_url, and so on.

3. Big-proportion process. If a single value accounts for more than 0.9 of a feature, the feature has little value for our analysis. In the end we delete host_has_profile_pic, street, and so on.
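The three rules above can be sketched as one screening pass over a toy table. The column names, data and the `screen` helper are hypothetical; only the thresholds (0.85 and 0.9) come from the text.

```python
import numpy as np
import pandas as pd

# Made-up table with 20 rows, for illustration only.
toy = pd.DataFrame({
    "constant": ["a"] * 20,
    "mostly_missing": [1.0, 2.0] + [np.nan] * 18,
    "dominant": ["t"] * 19 + ["f"],
    "useful": list(range(20)),
})

def screen(df, miss_thr=0.85, prop_thr=0.9):
    """Return {column: reason} for the columns the three rules would drop."""
    reasons = {}
    for col in df.columns:
        counts = df[col].value_counts()
        top_share = counts.iloc[0] / len(df) if len(counts) else 0.0
        if df[col].nunique(dropna=False) == 1:
            reasons[col] = "single value"
        elif df[col].isnull().mean() > miss_thr:
            reasons[col] = "miss rate is too high"
        elif top_share > prop_thr:
            reasons[col] = "high proportion"
    return reasons

to_drop = screen(toy)
cleaned = toy.drop(columns=list(to_drop))
print(to_drop)
```

Collecting the reasons first and dropping afterwards avoids changing the column list while looping over it.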

1. Single-value process.

If a feature has only one unique value, it has no value for our analysis. In the end we delete scrape_id, experiences_offered, and so on.

def delete_singe_value_features(df, col, all_features, remove_features):
    '''
    Usage: delete the feature if it holds a single value
    Input:
    df - input dataframe
    col - the feature to be processed
    all_features - all features in the df
    remove_features - list that records the deleted features
    Output:
    df - the processed dataframe
    all_features - the features we are still watching
    remove_features - the features removed so far
    '''
    if len(set(df[col])) == 1:
        print('delete {} from the dataset because it is a constant'.format(col))
        del df[col]
        all_features.remove(col)
        remove_features.append({col: 'single_value'})
    return df, all_features, remove_features

all_features = list(df_Seattle_listings.columns)
select_features = all_features
remove_features = []
# Loop over a copy: the function removes items from select_features,
# and removing from a list while iterating over it skips elements
for col in list(select_features):
    df_Seattle_listings, select_features, remove_features = delete_singe_value_features(
        df_Seattle_listings, col, select_features, remove_features)
remove_features

2. Null-value process.

If a feature has a miss rate above 0.85, it has no value for our analysis. In the end we delete thumbnail_url, xl_picture_url, and so on.

def process_null_value(df, col, all_features, remove_features, threshold_rate):
    '''
    Usage: drop the col if its miss rate is above threshold_rate
    Input:
    df - input dataframe
    col - the feature to be processed
    all_features - the features we are still watching
    remove_features - list that records the deleted features
    threshold_rate - threshold rate
    Output:
    df - the processed dataframe
    all_features - the features we are still watching
    remove_features - the features removed so far
    '''
    miss_rate = df[col].isnull().sum() / df.shape[0]
    if miss_rate > threshold_rate:
        print('{} has a miss rate {} and be removed'.format(col, miss_rate))
        df = df.drop([col], axis=1)
        remove_features.append({col: 'miss rate is too high'})
        all_features.remove(col)
    return df, all_features, remove_features

# Drop the columns with too many missing values
threshold_rate = 0.85
# Loop over a copy because the function removes items from select_features
for col in list(select_features):
    df_Seattle_listings, select_features, remove_features = process_null_value(
        df_Seattle_listings, col, select_features, remove_features, threshold_rate)
remove_features

3. Big-proportion process.

If a single value accounts for more than 0.9 of a feature, the feature has little value for our analysis. In the end we delete host_has_profile_pic, street, and so on.

def delete_high_proportion_features(df, col, all_features, remove_features, threshold_rate=0.9):
    '''
    Usage: drop the col if its most frequent value has a share above threshold_rate
    Input:
    df - input dataframe
    col - the feature to be processed
    all_features - the features we are still watching
    remove_features - list that records the deleted features
    threshold_rate - threshold rate
    Output:
    df - the processed dataframe
    all_features - the features we are still watching
    remove_features - the features removed so far
    '''
    # value_counts() is sorted by count, so the first entry is the most frequent value
    counts = df[col].value_counts()
    most_proportion = counts.iloc[0] / df.shape[0] if len(counts) > 0 else 0
    # print("we are processing {} .....".format(col))
    if most_proportion > threshold_rate:
        df = df.drop([col], axis=1)
        all_features.remove(col)
        remove_features.append({col: 'high proportion'})
        print('{} has a most proportion ={} ,and be removed'.format(col, most_proportion))
    return df, all_features, remove_features

# Drop the columns where a single value accounts for more than 0.9 of the rows
threshold_rate = 0.9
# Loop over a copy because the function removes items from select_features
for col in list(select_features):
    df_Seattle_listings, select_features, remove_features = delete_high_proportion_features(
        df_Seattle_listings, col, select_features, remove_features, threshold_rate)
remove_features

# Inspect the remaining variables
df_Seattle_listings_summary[df_Seattle_listings_summary['Columns'].isin(all_features)]

3.2 Choose variables to keep observing

After the first processing step, we select the features to keep watching. The goal is to find, within each price range, the factors that most influence repeat bookings, so we choose from the following groups:

Host information. host_response_time, host_response_rate, host_is_superhost, host_listings_count, host_total_listings_count

House hardware information. neighbourhood_group_cleansed, zipcode, property_type, room_type, accommodates, bathrooms, bedrooms

House other information. price, security_deposit, cleaning_fee, minimum_nights, maximum_nights, availability_365, instant_bookable, cancellation_policy

House score information. review_scores_rating, review_scores_accuracy, review_scores_cleanliness, review_scores_checkin, review_scores_communication, review_scores_location, review_scores_value

select_features = ['host_response_time', 'host_response_rate', 'host_is_superhost',
                   'host_total_listings_count', 'neighbourhood_group_cleansed',
                   'zipcode', 'property_type', 'room_type', 'accommodates', 'bathrooms',
                   'bedrooms', 'beds', 'price', 'security_deposit', 'cleaning_fee',
                   'minimum_nights', 'maximum_nights', 'availability_365',
                   'number_of_reviews', 'review_scores_rating', 'review_scores_accuracy',
                   'review_scores_cleanliness', 'review_scores_checkin',
                   'review_scores_communication', 'review_scores_location',
                   'review_scores_value', 'instant_bookable', 'cancellation_policy']
df_Seattle_listings_summary[df_Seattle_listings_summary['Columns'].isin(select_features)].reset_index().drop(['index'], axis=1)

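# Take a copy so that later column assignments do not act on a slice of the original frame
df_Seattle_listings = df_Seattle_listings[select_features].copy()
# Keep a backup of the data
df_Seattle_listings_bak = df_Seattle_listings.copy()
df_Seattle_listings.columns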

3.3 Variable transformation

host_response_time. A faster response time suggests better service from the host, so I turn this feature into an ordinal variable: the larger the value, the better the service.

host_response_rate. The host_response_rate should be a numerical value, so I strip the "%" from it.

price, security_deposit, cleaning_fee. These three columns are money values, so I strip the "$" (and the thousands separator) from them.
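The string cleanup described above can also be written with pandas string methods. This is a minimal sketch on made-up values, not the code used in this post; only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows, for illustration only.
toy = pd.DataFrame({
    "host_response_rate": ["95%", "100%", np.nan],
    "price": ["$85.00", "$1,200.00", np.nan],
})

# Strip the trailing percent sign, then convert to numbers (NaN stays NaN).
rate = pd.to_numeric(toy["host_response_rate"].str.rstrip("%"))

# Remove the dollar sign and the thousands separator before converting.
price = pd.to_numeric(toy["price"].str.replace("[$,]", "", regex=True))

print(rate.tolist())
print(price.tolist())
```

Missing entries pass through untouched, so they can still be counted and filled in the later steps.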

# Variable cleaning
# host_response_time: a faster response is taken to mean better service,
# so map it to an ordinal scale (larger is better)
host_response_time_mapping = {'a few days or more': 1, 'within a day': 2,
                              'within a few hours': 3, 'within an hour': 4}
df_Seattle_listings['host_response_time'] = df_Seattle_listings['host_response_time'].map(host_response_time_mapping)

# host_response_rate: strip the % sign (x == x is False only for NaN)
df_Seattle_listings['host_response_rate'] = df_Seattle_listings['host_response_rate'].apply(
    lambda x: int(str(x).split('%')[0]) if x == x else x)

# Money variables: price, security_deposit, cleaning_fee
money_features = ['price', 'security_deposit', 'cleaning_fee']
for col in money_features:
    df_Seattle_listings[col] = df_Seattle_listings[col].apply(
        lambda x: float(str(x).replace('$', '').replace(',', '')) if x == x else x)

3.4 Numerical variable processing

We process the numerical features (process_num_features) as follows:

If a feature has any missing values, add a flag column that indicates whether each value was present. If its miss rate is above the threshold (0.8 in the code below), delete the feature.

Otherwise, fill the missing values with values drawn at random from the non-missing ones.
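The random fill can be sketched with NumPy on a toy column. The values and the fixed seed are assumptions for the illustration; drawing with replacement is a choice made here so that the fill also works when more values are missing than observed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Made-up column, for illustration only.
s = pd.Series([1.0, np.nan, 2.0, 2.0, np.nan, 1.0], name="bathrooms")

# 1 = value was observed, 0 = value was missing (same convention as the *_flag columns)
flag = s.notnull().astype(int)

observed = s.dropna().to_numpy()
mask = s.isnull()
filled = s.copy()
# Draw replacements from the observed values, which keeps their empirical distribution
filled[mask] = rng.choice(observed, size=int(mask.sum()), replace=True)

print(flag.tolist())
print(filled.tolist())
```

Compared with filling by the mean, this keeps the spread of the column intact, at the cost of adding some noise to the filled rows.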

# accommodates,bathrooms,beds,minimum_nights,maximum_nights,availability_365,number_of_reviews,review_scores_rating,review_scores_accuracy
# review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value
process_num_features = ['price', 'security_deposit', 'cleaning_fee', 'host_total_listings_count',
                        'accommodates', 'bathrooms', 'bedrooms', 'beds', 'minimum_nights',
                        'maximum_nights', 'availability_365', 'number_of_reviews',
                        'review_scores_rating', 'review_scores_accuracy',
                        'review_scores_cleanliness', 'review_scores_checkin',
                        'review_scores_communication', 'review_scores_location',
                        'review_scores_value', 'host_response_time', 'host_response_rate']

def fill_numeric_null_value(df, col, process_num_features, remove_features, threshold_rate=0.8):
    '''
    Usage: fill the Null values of the col
    Input:
    df - input dataframe
    col - the feature to be processed
    Output: filled dataframe including the fixed col and a filled flag
    '''
    allFeatures = list(df.columns)
    if col in allFeatures:
        miss_rate = df[col].isnull().sum() / df.shape[0]
        print('{} miss rate is {}'.format(col, miss_rate))
        if miss_rate > 0:
            # Flag column: 0 where the value was missing, 1 where it was present
            col_flag = str(col) + '_flag'
            df[col_flag] = df[col].map(lambda x: 0 if x != x else 1)
            if miss_rate > threshold_rate:
                df = df.drop([col], axis=1)
                process_num_features.remove(col)
                remove_features.append({col: 'miss rate is too high'})
            else:
                # Non-missing values
                not_missing = df.loc[df[col] == df[col], col]
                # Positions of the missing values
                missing_index = df.loc[df[col] != df[col], col].index
                # Draw the replacements at random, with replacement, so this also
                # works when more values are missing than observed
                miss_makeup = random.choices(list(not_missing), k=len(missing_index))
                # Fill the missing values
                df.loc[missing_index, col] = miss_makeup
    return df, process_num_features, remove_features

threshold_rate = 0.8
# Loop over a copy because the function removes items from process_num_features
for col in list(process_num_features):
    df_Seattle_listings, process_num_features, remove_features = fill_numeric_null_value(
        df_Seattle_listings, col, process_num_features, remove_features, threshold_rate)
remove_features

3.5 Categorical variable processing

We process the categorical variables in the following ways:

1) Null values. If the miss rate is more than 0.8, delete the variable; otherwise fill the missing values with a special value (-1 for numeric codes, the string 'NAN' for text values).

2) One-hot encoding.
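The two steps above can be sketched on a toy column. The rows are made up, and the placeholder string "-1" is an arbitrary choice for this sketch.

```python
import numpy as np
import pandas as pd

# Made-up rows, for illustration only.
toy = pd.DataFrame({
    "room_type": ["Private room", np.nan, "Entire home/apt", "Private room"],
})

# Step 1: give missing entries their own category via a placeholder value.
toy["room_type"] = toy["room_type"].fillna("-1")

# Step 2: one-hot encode; each category becomes its own indicator column.
encoded = pd.get_dummies(toy, columns=["room_type"])

print(sorted(encoded.columns))
```

Because the placeholder becomes a category of its own, the model can still tell which rows had no value.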

def fill_categorical_null_value(df, col, process_cat_features, remove_features, threshold_rate=0.8):
    '''
    Usage: fill the Null values of the col
    Input:
    df - input dataframe
    col - the feature to be processed
    Output: filled dataframe including the fixed col
    '''
    allFeatures = list(df.columns)
    if col in allFeatures:
        missingRate = MissingCategorial(df, col)
        print('{0} has missing rate as {1}'.format(col, missingRate))
        if missingRate > threshold_rate:
            process_cat_features.remove(col)
            remove_features.append({col: 'miss rate is too high'})
            del df[col]
        elif missingRate > 0:
            uniq_valid_vals = [i for i in df[col] if i == i]
            uniq_valid_vals = list(set(uniq_valid_vals))
            if isinstance(uniq_valid_vals[0], numbers.Real):
                # Numeric codes: fill the missing positions with -1
                missing_position = df.loc[df[col] != df[col]][col].index
                not_missing_sample = [-1] * len(missing_position)
                df.loc[missing_position, col] = not_missing_sample
            else:
                # In this way we convert NaN to NAN, which is a string instead of np.nan
                df[col] = df[col].map(lambda x: str(x).upper())
    return df, process_cat_features, remove_features

# One-hot encode the categorical variables
process_cat_features = ['host_is_superhost', 'neighbourhood_group_cleansed', 'zipcode',
                        'property_type', 'room_type', 'instant_bookable', 'cancellation_policy']
threshold_rate = 0.8
# Loop over a copy because the function removes items from process_cat_features
for col in list(process_cat_features):
    df_Seattle_listings, process_cat_features, remove_features = fill_categorical_null_value(
        df_Seattle_listings, col, process_cat_features, remove_features, threshold_rate)
df_Seattle_listings = pd.get_dummies(data=df_Seattle_listings, columns=process_cat_features)
df_listings_clean = df_Seattle_listings.copy()
df_listings_clean_summary = Summary(df_listings_clean).reset_index()
df_listings_clean_summary.to_csv(path + 'df_listings_clean_summary.csv')
df_listings_clean_summary

3.6 Review again

After processing the features in all the ways above, we run the single-value and big-proportion checks again, because filling and one-hot encoding can create new constant or dominated columns.

clean_all_features = list(df_listings_clean.columns)
# Loop over a copy because the function removes items from clean_all_features
for col in list(clean_all_features):
    df_listings_clean, clean_all_features, remove_features = delete_singe_value_features(
        df_listings_clean, col, clean_all_features, remove_features)

# Drop the columns where a single value accounts for more than 0.85 of the rows
threshold_rate = 0.85
for col in list(clean_all_features):
    df_listings_clean, clean_all_features, remove_features = delete_high_proportion_features(
        df_listings_clean, col, clean_all_features, remove_features, threshold_rate)
remove_features

Because host_is_superhost_f and host_is_superhost_t are strongly correlated (each is the complement of the other), we keep only one of them.

We do the same for instant_bookable_f and instant_bookable_t.
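The redundancy of the two indicator columns is easy to check on a toy column. The values below are made up; only the column name follows the dataset.

```python
import pandas as pd

# Made-up flags, for illustration only.
toy = pd.DataFrame({"host_is_superhost": ["t", "f", "f", "t", "f"]})

dummies = pd.get_dummies(toy, columns=["host_is_superhost"]).astype(int)

# With only two categories the indicators are complements: one is 1 exactly when the other is 0.
corr = dummies["host_is_superhost_f"].corr(dummies["host_is_superhost_t"])

# Keeping a single indicator loses no information.
kept = dummies.drop(columns=["host_is_superhost_f"])

print(round(corr, 6), list(kept.columns))
```

The correlation of exactly -1 is why one column of each binary pair can be dropped safely.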

df_listings_clean = df_listings_clean.drop(['host_is_superhost_f', 'instant_bookable_f'], axis=1)
remove_features.append({'host_is_superhost_f': 'Binary redundant variable'})
remove_features.append({'instant_bookable_f': 'Binary redundant variable'})
df_listings_clean.to_csv(path + 'df_listings_clean.csv')

4. EDA

I want to find out:

What is the difference between high-price houses and low-price houses?

If we are hosts of a high-price or low-price house, what should we do to improve the review score?

Which features go with being a superhost, and do they differ between high-price and low-price houses?

obs_cols = ['accommodates', 'bathrooms', 'bedrooms', 'beds', 'security_deposit', 'cleaning_fee',
            'minimum_nights', 'maximum_nights', 'review_scores_cleanliness',
            'review_scores_location', 'host_response_time_flag', 'host_response_rate_flag',
            'security_deposit_flag', 'host_is_superhost_t', 'property_type_Apartment',
            'property_type_House', 'room_type_Entire home/apt', 'room_type_Private room',
            'instant_bookable_t', 'cancellation_policy_flexible', 'cancellation_policy_moderate',
            'cancellation_policy_strict_14_with_grace_period', 'review_scores_value']

# Split the listings at the median price into a low group and a high group,
# then look at what differs between the two groups
price_mid = df_listings_clean['price'].quantile(0.5)
print('price_mid = {}'.format(price_mid))
df_listings_clean['price_flag'] = df_listings_clean['price'].apply(lambda x: 'low' if x <= price_mid else 'high')
df_low_price = df_listings_clean[df_listings_clean['price_flag'] == 'low']
df_high_price = df_listings_clean[df_listings_clean['price_flag'] == 'high']

def get_x_y(df, col):
    # Labels (top five values, Null, Other) and their counts for one column of a summary table
    indexNum = df[df['Columns'] == col].index.tolist()[0]
    x = df.loc[df['Columns'] == col][['value 0', 'value 1', 'value 2', 'value 3', 'value 4']]
    xList = list(x.T[indexNum])
    xList.append('Null Value')
    xList.append('Other')
    y = df.loc[df['Columns'] == col][['freq 0', 'freq 1', 'freq 2', 'freq 3', 'freq 4', 'freqNull', 'freqOther']]
    yList = list(y.T[indexNum])
    return xList, yList

def campare_plot(df_high, df_low, col):
    # Side-by-side bar charts of one feature for high-price and low-price houses
    x1, y1 = get_x_y(df_high, col)
    x2, y2 = get_x_y(df_low, col)
    font2 = {'weight': 'normal', 'size': 20}
    colors = ['lightcoral', 'gold', 'g', 'c', 'm', 'crimson', 'brown']
    plt.figure(figsize=(14, 6))
    plt.subplot(1, 2, 1)
    plt.title('high price house', font2)
    plt.xlabel(col, font2)
    plt.ylabel('count', font2)
    plt.xticks(np.arange(len(x1)), x1)
    plt.bar(np.arange(len(x1)), y1, color=colors, linewidth=20.0)
    plt.subplot(1, 2, 2)
    plt.xlabel(col, font2)
    plt.title('low price house', font2)
    plt.ylabel('count', font2)
    plt.xticks(np.arange(len(x2)), x2)
    plt.bar(np.arange(len(x2)), y2, color=colors)

df_high_price_summary = Summary(df_high_price).reset_index()
df_low_price_summary = Summary(df_low_price).reset_index()
df_high_price_summary.to_csv(path + 'df_high_price_summary.csv')
df_low_price_summary.to_csv(path + 'df_low_price_summary.csv')

Question 1: What's the difference between high-price houses and low-price houses?

Figure explanation:

In the following figures, I plot the five most frequent values of each feature, plus the Null and Other counts, to compare high-price houses with low-price houses.

accommodates

The five most frequent accommodates values for high-price houses are (4, 2, 6, 3, 5), while for low-price houses they are (2, 4, 3, 1, 5).

So high-price houses accommodate more guests than low-price houses.

campare_plot(df_high_price_summary,df_low_price_summary,obs_cols[0])

bathrooms

In general, most high-price and low-price houses have only one bathroom, but on average high-price houses have more bathrooms than low-price houses.

campare_plot(df_high_price_summary,df_low_price_summary,obs_cols[1])

bedrooms

As with bathrooms, most houses, whether high-price or low-price, have only one bedroom, but on average high-price houses have more bedrooms than low-price houses.

campare_plot(df_high_price_summary,df_low_price_summary,obs_cols[2])

beds

In general, most high-price and low-price houses have only one bed, but on average high-price houses have more beds than low-price houses.

campare_plot(df_high_price_summary,df_low_price_summary,obs_cols[3])

security deposit

In general, the security deposit of high-price houses is much higher than that of low-price houses.

campare_plot(df_high_price_summary,df_low_price_summary,obs_cols[4])

cleaning fee

In general, the cleaning fee of high-price houses is much higher than that of low-price houses.

campare_plot(df_high_price_summary,df_low_price_summary,obs_cols[5])

minimum_nights

In general, the minimum nights of high-price houses is slightly higher than that of low-price houses.

campare_plot(df_high_price_summary,df_low_price_summary,obs_cols[6])

review scores cleanliness

In general, the cleanliness review score of high-price houses is slightly lower than that of low-price houses.

campare_plot(df_high_price_summary,df_low_price_summary,obs_cols[8])

host_is_superhost_t

In general, low-price houses have more superhosts than high-price houses.

campare_plot(df_high_price_summary,df_low_price_summary,'host_is_superhost_t')

cancellation_policy_flexible

In general, more low-price houses than high-price houses have a flexible cancellation policy.

campare_plot(df_high_price_summary,df_low_price_summary,'cancellation_policy_flexible')

review_scores_value

In general, the review_scores_value of high-price houses is slightly higher than that of low-price houses, but not by much.

campare_plot(df_high_price_summary,df_low_price_summary,'review_scores_value')

Question 1: What's the difference between high-price houses and low-price houses?

Conclusion

House hardware. High-price houses offer more capacity than low-price houses: they accommodate more guests and have more bedrooms, bathrooms and beds.

House service. Low-price houses perform better than high-price houses: for example, they charge lower cleaning fees, and a larger proportion of their hosts are superhosts.

Review score value. Price does not influence review_scores_value very much.
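The "on average" comparisons behind this conclusion can be sketched as a group-by over the price flag. The rows below are made up for illustration; only the column names and the split rule (at or below the median is "low") follow this post.

```python
import pandas as pd

# Made-up listings, for illustration only.
toy = pd.DataFrame({
    "price": [60.0, 85.0, 119.0, 150.0, 300.0, 450.0],
    "bedrooms": [1, 1, 1, 2, 3, 4],
    "cleaning_fee": [10.0, 20.0, 30.0, 60.0, 90.0, 120.0],
})

# Split at the median: at or below is "low", above is "high".
price_mid = toy["price"].median()
toy["price_flag"] = (toy["price"] > price_mid).map({True: "high", False: "low"})

# Mean of each feature per price group.
summary = toy.groupby("price_flag")[["bedrooms", "cleaning_fee"]].mean()
print(summary)
```

A table of group means like this complements the bar charts, which only show the most frequent values.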

5. Build the model

Question 2: If you host a low-price or high-price house, what should you do to improve the review score value?

def ROC_AUC(df, score, target, plot=True):
    df2 = df.copy()
    s = sorted(set(df2[score]))
    tpr_list = [0]
    fpr_list = [0]
    for k in s:
        # Predict positive when the score reaches the threshold k
        df2['label_temp'] = df[score].map(lambda x: int(x >= k))
        TP = df2[(df2.label_temp == 1) & (df2[target] == 1)].shape[0]
        FP = df2[(df2.label_temp == 1) & (df2[target] == 0)].shape[0]
        FN = df2[(df2.label_temp == 0) & (df2[target] == 1)].shape[0]
        TN = df2[(df2.label_temp == 0) & (df2[target] == 0)].shape[0]
        TPR = TP / (TP + FN) if (TP + FN) > 0 else 0
        FPR = FP / (FP + TN) if (FP + TN) > 0 else 0
        tpr_list.append(TPR)
        fpr_list.append(FPR)
    tpr_list.append(1)
    fpr_list.append(1)
    ROC_df = pd.DataFrame({'tpr': tpr_list, 'fpr': fpr_list})
    ROC_df = ROC_df.sort_values(by=['fpr', 'tpr']).drop_duplicates()
    # Trapezoidal area under the curve: integrate tpr over fpr
    fpr = ROC_df['fpr'].values
    tpr = ROC_df['tpr'].values
    auc = 0
    for i in range(1, len(fpr)):
        auc = auc + (tpr[i] + tpr[i - 1]) * (fpr[i] - fpr[i - 1]) * 0.5
    if plot:
        plt.plot(ROC_df['fpr'], ROC_df['tpr'])
        plt.plot([0, 1], [0, 1])
        plt.title("AUC={}%".format(int(auc * 100)))
    return auc

def GridSearch(X_train, X_test, y_train, y_test, criterion=['squared_error'], tree_Flag='regression',
               n_estimators=[300, 600], method='RF', learning_rate=0.5, validate=False, cv=5,
               max_features=[1.0], max_depth=[10, 20, 40], min_samples_leaf=[2, 4],
               min_samples_split=[10, 20, 40], n_jobs=-1):
    '''
    Usage: use grid search to find good parameters for the random forest (RF) model.
    Input: training and testing sets from X and y variables
    Output: the best regressor or classifier
    '''
    best_clf = None
    # Regression model
    if tree_Flag == 'regression':
        parameters = {'criterion': criterion, 'n_estimators': n_estimators,
                      'max_depth': max_depth, 'min_samples_leaf': min_samples_leaf,
                      'max_features': max_features, 'min_samples_split': min_samples_split}
        clf = RandomForestRegressor(random_state=42, n_jobs=n_jobs)
        # Use grid search to find the best model parameters.
        grid_obj = GridSearchCV(clf, parameters, cv=cv)
        grid_fit = grid_obj.fit(X_train, y_train)
        # Take the best model and fit it to the training set
        best_clf = grid_fit.best_estimator_
        best_clf.fit(X_train, y_train)
        # Make predictions using the new model.
        best_train_predictions = best_clf.predict(X_train)
        print('The training MSE Score is', mean_squared_error(y_train, best_train_predictions))
        print('The training R2 Score is', r2_score(y_train, best_train_predictions))
        if validate:
            best_test_predictions = best_clf.predict(X_test)
            print('The testing MSE Score is', mean_squared_error(y_test, best_test_predictions))
            print('The testing R2 Score is', r2_score(y_test, best_test_predictions))
    # Classification model
    elif tree_Flag == 'classifier':
        # Tune in three stages: number of trees, then tree shape, then max_features
        param_test1 = {'n_estimators': n_estimators}
        gsearch1 = GridSearchCV(estimator=RandomForestClassifier(),
                                param_grid=param_test1, scoring='roc_auc', cv=5)
        gsearch1.fit(X_train, y_train)
        best_n_estimators = gsearch1.best_params_['n_estimators']
        param_test2 = {'max_depth': max_depth, 'min_samples_split': min_samples_split,
                       'min_samples_leaf': min_samples_leaf}
        gsearch2 = GridSearchCV(estimator=RandomForestClassifier(n_estimators=best_n_estimators),
                                param_grid=param_test2, scoring='roc_auc', cv=5)
        gsearch2.fit(X_train, y_train)
        best_max_depth = gsearch2.best_params_['max_depth']
        best_min_samples_split = gsearch2.best_params_['min_samples_split']
        best_min_samples_leaf = gsearch2.best_params_['min_samples_leaf']
        param_test3 = {'max_features': ['sqrt', 'log2']}
        gsearch3 = GridSearchCV(estimator=RandomForestClassifier(n_estimators=best_n_estimators,
                                                                 max_depth=best_max_depth,
                                                                 min_samples_split=best_min_samples_split,
                                                                 min_samples_leaf=best_min_samples_leaf),
                                param_grid=param_test3, scoring='roc_auc', cv=5)
        gsearch3.fit(X_train, y_train)
        best_max_features = gsearch3.best_params_['max_features']
        best_clf = RandomForestClassifier(oob_score=True, n_estimators=best_n_estimators,
                                          max_depth=best_max_depth,
                                          min_samples_split=best_min_samples_split,
                                          min_samples_leaf=best_min_samples_leaf,
                                          max_features=best_max_features)
        best_clf.fit(X_train, y_train)
        # print(best_clf.oob_score_)
        y_predprob = best_clf.predict_proba(X_train)[:, 1]
        result = pd.DataFrame({'real': y_train, 'pred': y_predprob})
        auc = ROC_AUC(result, 'pred', 'real', False)
        print('The training Auc is', auc)
        if validate:
            y_predprob = best_clf.predict_proba(X_test)[:, 1]
            result = pd.DataFrame({'real': y_test, 'pred': y_predprob})
            auc = ROC_AUC(result, 'pred', 'real', False)
            print('The testing Auc is', auc)
    return best_clf

# Feature importance for low-price houses
df_low_price_X = df_low_price[obs_cols].drop(['review_scores_value'], axis=1)
df_low_price_y = df_low_price[['review_scores_value']].iloc[:, 0]
df_low_price_X_train, df_low_price_X_test, df_low_price_y_train, df_low_price_y_test = train_test_split(
    df_low_price_X, df_low_price_y, test_size=0.3, random_state=42)

criterion = ['squared_error']  # named 'mse' before scikit-learn 1.0
method = 'RF'
n_estimators = [200, 400]
max_features = [10, 15, 22]
max_depth = [10, 20, 40]
min_samples_leaf = [2, 4]
learning_rate = 0.001
best_clf = GridSearch(df_low_price_X_train, df_low_price_X_test, df_low_price_y_train, df_low_price_y_test,
                      tree_Flag='regression', method=method, learning_rate=learning_rate,
                      criterion=criterion, n_estimators=n_estimators,
                      max_features=max_features, max_depth=max_depth, min_samples_leaf=min_samples_leaf)

def show_importances(best_clf, df):
    # Horizontal bar chart of the ten most important features
    importances = best_clf.feature_importances_
    feat_names = df.columns
    tree_result = pd.DataFrame({'feature': feat_names, 'importance': importances})
    tree_result.sort_values(by='importance', ascending=True)[-10:].plot(x='feature', y='importance', kind='barh')

show_importances(best_clf, df_low_price_X_train)

# Feature importance for high-price houses
df_high_price_X = df_high_price[obs_cols].drop(['review_scores_value'], axis=1)
df_high_price_y = df_high_price[['review_scores_value']].iloc[:, 0]
df_high_price_X_train, df_high_price_X_test, df_high_price_y_train, df_high_price_y_test = train_test_split(
    df_high_price_X, df_high_price_y, test_size=0.3, random_state=42)

criterion = ['squared_error']  # named 'mse' before scikit-learn 1.0
method = 'RF'
n_estimators = [200, 400]
max_features = [10, 15, 22]
max_depth = [10, 20, 40]
min_samples_leaf = [2, 4]
learning_rate = 0.001
best_clf = GridSearch(df_high_price_X_train, df_high_price_X_test, df_high_price_y_train, df_high_price_y_test,
                      tree_Flag='regression', method=method, learning_rate=learning_rate,
                      criterion=criterion, n_estimators=n_estimators,
                      max_features=max_features, max_depth=max_depth, min_samples_leaf=min_samples_leaf)

show_importances(best_clf,df_high_price_X_train)

Question 2: If you host a low-price or high-price house, what should you do to improve the review score value?

Conclusion

From the figures above, we can see that guests of both high-price and low-price houses care about review_scores_cleanliness, cleaning_fee, security_deposit, maximum_nights, minimum_nights and accommodates.

If you host a low-price house, you should first try to become a superhost, and you should probably avoid the strict (14 days with grace period) cancellation policy.

If you host a high-price house, pay more attention to beds and bedrooms, and to whether the house is an apartment.
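How a random forest ranks features can be sketched on synthetic data. Everything below is made up for illustration (the feature names `signal`, `noise_a`, `noise_b` are hypothetical); it only shows the mechanics of `feature_importances_`, not the results of this post.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
n = 300

# Synthetic features: only "signal" actually drives the target.
X = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
})
y = 3.0 * X["signal"] + rng.normal(scale=0.1, size=n)

model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X, y)

# Impurity-based importances sum to 1; larger means the trees split on the feature more usefully.
importance = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importance)
```

These importances describe which features the model relied on to predict the target. They do not show the direction of an effect, so a high rank alone does not say whether raising or lowering a feature would improve the score.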

Question 3: Which features influence whether a host is a superhost, for high-price and for low-price houses?

# Feature importance for low-price houses
df_low_price_X = df_low_price[obs_cols].drop(['host_is_superhost_t'], axis=1)
df_low_price_y = df_low_price[['host_is_superhost_t']].iloc[:, 0]
df_low_price_X_train, df_low_price_X_test, df_low_price_y_train, df_low_price_y_test = train_test_split(
    df_low_price_X, df_low_price_y, test_size=0.3, random_state=42)

criterion = ['gini']
method = 'RF'
n_estimators = [200, 400]
max_features = [10, 15, 22]
max_depth = [10, 20, 40]
min_samples_split = [10, 20, 40]
min_samples_leaf = [2, 4]
learning_rate = 0.001
best_clf = GridSearch(df_low_price_X_train, df_low_price_X_test, df_low_price_y_train, df_low_price_y_test,
                      tree_Flag='classifier', method=method, learning_rate=learning_rate,
                      criterion=criterion, n_estimators=n_estimators,
                      max_features=max_features, max_depth=max_depth,
                      min_samples_leaf=min_samples_leaf, min_samples_split=min_samples_split)

show_importances(best_clf,df_low_price_X_train)

# Feature importance for high-price houses
df_high_price_X = df_high_price[obs_cols].drop(['host_is_superhost_t'], axis=1)
df_high_price_y = df_high_price[['host_is_superhost_t']].iloc[:, 0]
df_high_price_X_train, df_high_price_X_test, df_high_price_y_train, df_high_price_y_test = train_test_split(
    df_high_price_X, df_high_price_y, test_size=0.3, random_state=42)

criterion = ['gini']
method = 'RF'
n_estimators = [200, 400]
max_features = [10, 15, 22]
max_depth = [10, 20, 40]
min_samples_split = [10, 20, 40]
min_samples_leaf = [2, 4]
learning_rate = 0.001
best_clf = GridSearch(df_high_price_X_train, df_high_price_X_test, df_high_price_y_train, df_high_price_y_test,
                      tree_Flag='classifier', method=method, learning_rate=learning_rate,
                      criterion=criterion, n_estimators=n_estimators,
                      max_features=max_features, max_depth=max_depth,
                      min_samples_leaf=min_samples_leaf, min_samples_split=min_samples_split)

show_importances(best_clf,df_high_price_X_train)
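The two classifiers above are tuned and reported with ROC AUC. As a minimal sketch of what that number measures, the function below sweeps a threshold over the predicted scores, records the true-positive and false-positive rates, and integrates the curve with the trapezoid rule. The helper name `roc_auc` and the toy labels and scores are assumptions for this illustration.

```python
import numpy as np

def roc_auc(y_true, y_score):
    """Area under the ROC curve via a threshold sweep and the trapezoid rule."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    n_pos = (y_true == 1).sum()
    n_neg = (y_true == 0).sum()
    tpr, fpr = [0.0], [0.0]
    # Lowering the threshold step by step moves the curve from (0, 0) to (1, 1).
    for t in np.sort(np.unique(y_score))[::-1]:
        pred = y_score >= t
        tpr.append((pred & (y_true == 1)).sum() / n_pos)  # TP / (TP + FN)
        fpr.append((pred & (y_true == 0)).sum() / n_neg)  # FP / (FP + TN)
    auc = 0.0
    for i in range(1, len(fpr)):
        auc += (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
    return float(auc)

# Made-up labels and scores, for illustration only.
auc_mixed = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
auc_perfect = roc_auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])
print(auc_mixed, auc_perfect)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect separation of superhosts from other hosts.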

Question 3: If we are hosts and want to become superhosts, what should we do as a high-price host or as a low-price host?

Conclusion

From the figures above, we can see that for both low-price and high-price houses, superhost status is influenced by cleaning_fee, maximum_nights, review_scores_value, security_deposit, review_scores_cleanliness, host_response_rate_flag and host_response_time_flag.

So if we want to become superhosts, there is not much difference between low-price houses and high-price houses.
