
Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow (2nd Edition) — Chapter 2: An End-to-End Machine Learning Project


This chapter works through a case study using the California housing price dataset.

1. Import the required libraries

import os
import tarfile
import urllib.request
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

for i in [pd, np]:
    print(i.__name__, ": ", i.__version__, sep="")

Output:

pandas: 0.25.3
numpy: 1.17.4

2. Download the dataset and split it

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)
    housing_tgz.close()

fetch_housing_data()

def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

housing = load_housing_data()
housing.head()

Output: (first five rows of the housing DataFrame; table not reproduced here)

housing.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
longitude             20640 non-null float64
latitude              20640 non-null float64
housing_median_age    20640 non-null float64
total_rooms           20640 non-null float64
total_bedrooms        20433 non-null float64
population            20640 non-null float64
households            20640 non-null float64
median_income         20640 non-null float64
median_house_value    20640 non-null float64
ocean_proximity       20640 non-null object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB

housing["ocean_proximity"].value_counts()

Output:

<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: ocean_proximity, dtype: int64

housing.describe().T

Output: (transposed summary statistics of the numeric attributes; table not reproduced here)

housing.hist(bins=50, figsize=(20, 15))
plt.show()

Output: (histograms of every numeric attribute; plot not reproduced here)

np.random.seed(42)

def split_train_test(data, test_ratio):
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]

train_set, test_set = split_train_test(housing, 0.2)
print(train_set.shape, test_set.shape)

Output:

(16512, 10) (4128, 10)

from zlib import crc32

def test_set_check(identifier, test_ratio):
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32

def split_train_test_by_id(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]

Since this dataset has no identifier column, the simplest option is to use the row index as the ID:

housing_with_id = housing.reset_index()
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "index")
train_set.head()

Output: (first five rows of the resulting training set; table not reproduced here)

If you use the row index as a unique identifier, you must make sure that new data always gets appended to the end of the dataset and that no row is ever deleted. If that cannot be guaranteed, build the identifier from more stable features instead. For example, a district's longitude and latitude can be combined into an ID, since they are essentially fixed forever.

housing_with_id["id"] = housing["longitude"] * 1000 + housing["latitude"]train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "id")print(train_set.head())

输出:

longitude latitude housing_median_age total_rooms total_bedrooms \14196 -117.0332.7133.0 3126.0 627.0 8267-118.1633.7749.0 3382.0 787.0 17445 -120.4834.66 4.0 1897.0 331.0 14265 -117.1132.6936.0 1421.0 367.0 2271-119.8036.7843.0 2382.0 431.0 population households median_income median_house_value \141962300.0 623.0 3.2596 103000.0 8267 1314.0 756.0 3.8125 382100.0 17445 915.0 336.0 4.1563 172600.0 142651418.0 355.0 1.9425 93400.0 2271 874.0 380.0 3.5542 96500.0 ocean_proximity income_cat 14196NEAR OCEAN3 8267 NEAR OCEAN3 17445NEAR OCEAN3 14265NEAR OCEAN2 2271 INLAND3

sklearn provides several functions for splitting datasets.

train_test_split(): almost the same as the split_train_test() function defined above, with a couple of extras: it takes a random_state parameter to set the random seed, and it can be passed several datasets with the same number of rows and split them on the same indices, which is useful, for example, when the labels live in a separate DataFrame.

from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
print(test_set.head())

Output:

       longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
6        -119.01     36.06                25.0       1505.0             NaN
3024     -119.46     35.14                30.0       2943.0             NaN
15663    -122.44     37.80                52.0       3830.0             NaN
20484    -118.72     34.28                17.0       3051.0             NaN
9814     -121.93     36.62                34.0       2351.0             NaN

       population  households  median_income  median_house_value  \
6          1392.0       359.0         1.6812             47700.0
3024       1565.0       584.0         2.5313             45800.0
15663      1310.0       963.0         3.4801            500001.0
20484      1705.0       495.0         5.7376            218600.0
9814       1063.0       428.0         3.7250            278000.0

      ocean_proximity income_cat
6              INLAND          2
3024           INLAND          2
15663        NEAR BAY          3
20484       <1H OCEAN          4
9814       NEAR OCEAN          3

housing["median_income"].hist()

Output: (histogram of median_income; plot not reproduced here)

housing["income_cat"] = pd.cut(housing["median_income"],bins=[0.,1.5,3.0,4.5,6.0,np.inf],labels=[1,2,3,4,5])housing["income_cat"].hist()

输出:

现在可以根据收入进行分层抽样了:

from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

print(strat_train_set.head())

Output:

       longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
17606    -121.89     37.29                38.0       1568.0           351.0
18632    -121.93     37.05                14.0        679.0           108.0
14650    -117.20     32.77                31.0       1952.0           471.0
3230     -119.61     36.31                25.0       1847.0           371.0
3555     -118.59     34.23                17.0       6592.0          1525.0

       population  households  median_income  median_house_value  \
17606       710.0       339.0         2.7042            286600.0
18632       306.0       113.0         6.4214            340600.0
14650       936.0       462.0         2.8621            196900.0
3230       1460.0       353.0         1.8839             46300.0
3555       4459.0      1463.0         3.0347            254500.0

      ocean_proximity
17606       <1H OCEAN
18632       <1H OCEAN
14650      NEAR OCEAN
3230           INLAND
3555        <1H OCEAN

strat_test_set["income_cat"].value_counts()/len(strat_test_set)

Output:

3    0.350533
2    0.318798
4    0.176357
5    0.114583
1    0.039729
Name: income_cat, dtype: float64

housing["income_cat"].value_counts()/len(housing)

Output:

3    0.350581
2    0.318847
4    0.176308
5    0.114438
1    0.039826
Name: income_cat, dtype: float64

def income_cat_proportions(data):
    return data["income_cat"].value_counts() / len(data)

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

compare_props = pd.DataFrame({
    "Overall": income_cat_proportions(housing),
    "Stratified": income_cat_proportions(strat_test_set),
    "Random": income_cat_proportions(test_set),
}).sort_index()
compare_props["Rand. %error"] = 100 * compare_props["Random"] / compare_props["Overall"] - 100
compare_props["Strat. %error"] = 100 * compare_props["Stratified"] / compare_props["Overall"] - 100
compare_props

Output:

    Overall  Stratified    Random  Rand. %error  Strat. %error
1  0.039826    0.039729  0.040213      0.973236      -0.243309
2  0.318847    0.318798  0.324370      1.732260      -0.015195
3  0.350581    0.350533  0.358527      2.266446      -0.013820
4  0.176308    0.176357  0.167393     -5.056334       0.027480
5  0.114438    0.114583  0.109496     -4.318374       0.127011

As you can see, the income-category proportions in the stratified test set are almost identical to those in the full dataset, whereas the purely random split is noticeably skewed. This is why careful dataset splitting is such an important step in a machine learning project.

The income_cat column can now be dropped:

for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)

print(strat_train_set.head())

Output:

       longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
17606    -121.89     37.29                38.0       1568.0           351.0
18632    -121.93     37.05                14.0        679.0           108.0
14650    -117.20     32.77                31.0       1952.0           471.0
3230     -119.61     36.31                25.0       1847.0           371.0
3555     -118.59     34.23                17.0       6592.0          1525.0

       population  households  median_income  median_house_value  \
17606       710.0       339.0         2.7042            286600.0
18632       306.0       113.0         6.4214            340600.0
14650       936.0       462.0         2.8621            196900.0
3230       1460.0       353.0         1.8839             46300.0
3555       4459.0      1463.0         3.0347            254500.0

      ocean_proximity
17606       <1H OCEAN
18632       <1H OCEAN
14650      NEAR OCEAN
3230           INLAND
3555        <1H OCEAN

strat_train_set.shape

Output:

(16512, 10)

3. Data visualization and exploration

Now set the test set aside and explore only the training set. Work on a copy so that the original training set is not modified along the way.

housing = strat_train_set.copy()

3.1 Visualizing geographical data

housing.plot(kind="scatter",x="longitude",y="latitude")

Output: (scatter plot of all districts; not reproduced here)

A scatter plot of longitude versus latitude roughly traces the shape of California, but beyond that it is hard to read anything else from it at a glance.

Setting the alpha parameter makes it easier to see where the data points are densest.

housing.plot(kind="scatter",x="longitude",y="latitude",alpha=0.1)

Output: (the same scatter plot with alpha=0.1; high-density areas stand out; not reproduced here)

housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
             s=housing["population"]/100, label="population", figsize=(10,7),
             c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True,
             sharex=False)
plt.legend()

Output: (scatter plot with marker size proportional to population and color showing median house value; not reproduced here)

# Download a map image of California
image_path = os.path.join("./images/end_to_end_project")
os.makedirs(image_path, exist_ok=True)
DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
filename = "california.png"
print("Downloading", filename)
url = DOWNLOAD_ROOT + "images/end_to_end_project/" + filename
urllib.request.urlretrieve(url, os.path.join(image_path, filename))

import matplotlib.image as mpimg

california_img = mpimg.imread(os.path.join(image_path, filename))
ax = housing.plot(kind="scatter", x="longitude", y="latitude", figsize=(10,7),
                  s=housing["population"]/100, label="Population",
                  c="median_house_value", cmap=plt.get_cmap("jet"),
                  colorbar=False, alpha=0.4)
plt.imshow(california_img, extent=[-124.55, -113.80, 32.45, 42.05], alpha=0.5,
           cmap=plt.get_cmap("jet"))
plt.ylabel("Latitude", fontsize=14)
plt.xlabel("Longitude", fontsize=14)

prices = housing["median_house_value"]
tick_values = np.linspace(prices.min(), prices.max(), 11)
cbar = plt.colorbar()
cbar.ax.set_yticklabels(["$%dk" % (round(v/1000)) for v in tick_values], fontsize=14)
cbar.set_label("Median House Value", fontsize=16)

plt.legend(fontsize=16)
plt.show()

Output: (the same scatter plot overlaid on a map of California; not reproduced here)

Comparing with the map, most of the data is clustered around the Bay Area and around Los Angeles and San Diego, plus a long line of fairly dense districts in the Central Valley, in particular around Sacramento and Fresno.

This also suggests that housing prices are closely related to location and to population density.

3.2 Exploring correlations

Since the training set is not very large, we can simply compute the standard correlation coefficient (Pearson's r) between every pair of attributes.

corr_matrix = housing.corr()
corr_matrix["median_house_value"].sort_values(ascending=False)

Output:

median_house_value    1.000000
median_income         0.687160
total_rooms           0.135097
housing_median_age    0.114110
households            0.064506
total_bedrooms        0.047689
population           -0.026920
longitude            -0.047432
latitude             -0.142724
Name: median_house_value, dtype: float64

The correlation coefficient ranges from -1 to 1: values near 1 indicate a strong positive correlation, values near -1 a strong negative correlation, and 0 no linear correlation. Note that it only measures linear relationships and can completely miss nonlinear ones.
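As a small illustration (synthetic data, not from the original notebook), a relationship that is perfectly predictable but nonlinear can still have a correlation coefficient of roughly zero:

x = np.linspace(-1, 1, 201)
y = x ** 2                        # y is fully determined by x, but not linearly
print(np.corrcoef(x, y)[0, 1])    # approximately 0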

Pandas' scatter_matrix() function can plot these relationships. Since plotting every attribute against every other would produce a huge figure, we pick just a few attributes of interest:

attributes = ["median_house_value","median_income","total_rooms","housing_median_age"]pd.plotting.scatter_matrix(housing[attributes],figsize=(12,8))

输出:

从上述图中可以看出,median_house_value和median_income之间有较强的相关性。

将两个属性单独做图进行分析:

housing.plot(kind="scatter",x="median_income",y="median_house_value",alpha=0.1)

Output: (scatter plot of median_income vs. median_house_value; not reproduced here)

The plot confirms that the correlation is indeed strong. It also reveals a clear horizontal line of points around $500,000, which corresponds to the cap on median_house_value in this dataset.

3.3 Attribute combinations

The total number of rooms in a district is not very meaningful on its own; what matters is the average number of rooms per household in each district. The same reasoning applies to the number of bedrooms.

housing["rooms_per_household"] = housing["total_rooms"]/housing["households"] # 平均每个家庭房屋数housing["bedrooms_per_room"] = housing["total_bedrooms"]/housing["total_rooms"] # 平均每个房屋卧室数housing["population_per_household"] = housing["population"]/housing["households"] # 平均每个家庭人口数corr_matrix = housing.corr()corr_matrix["median_house_value"].sort_values(ascending=False)

输出:

median_house_value1.000000median_income0.687160rooms_per_household 0.146285total_rooms 0.135097housing_median_age0.114110households 0.064506total_bedrooms 0.047689population_per_household -0.021985population -0.026920longitude -0.047432latitude -0.142724bedrooms_per_room-0.259984Name: median_house_value, dtype: float64

Interestingly, apart from median income, the attribute most strongly correlated with house value is the newly created rooms_per_household.

We can also see that bedrooms_per_room has the strongest negative correlation with house value, which is easy to interpret: the smaller the share of rooms that are bedrooms, the more expensive the house tends to be.

4. Preparing the data for machine learning

First separate the training features from the labels:

housing = strat_train_set.drop("median_house_value", axis=1)
housing_labels = strat_train_set["median_house_value"].copy()

4.1 Data cleaning

There are three common ways to handle missing values in the data:

1. Drop the samples that have missing values.
2. Drop the entire attribute that has missing values.
3. Fill the missing values with some other value (for example 0, the mean, or the median).

sample_incomplete_rows = housing[housing.isnull().any(axis=1)]  # select all rows that contain a missing value
print(len(sample_incomplete_rows))
sample_incomplete_rows.head()

Output:

158
       longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
4629     -118.30     34.07                18.0       3759.0           433.0
6068     -117.86     34.01                16.0       4632.0           433.0
17923    -121.97     37.35                30.0       1955.0           433.0
13656    -117.30     34.05                 6.0       2155.0           433.0
19252    -122.79     38.48                 7.0       6837.0           433.0

       population  households  median_income ocean_proximity
4629       3296.0      1462.0         2.2708       <1H OCEAN
6068       3038.0       727.0         5.1762       <1H OCEAN
17923       999.0       386.0         4.6328       <1H OCEAN
13656      1039.0       391.0         1.6675          INLAND
19252      3468.0      1405.0         3.1662       <1H OCEAN

So there are 158 samples with missing values in the training set. Let's try each of the three methods in turn:

Method 1:

sample_incomplete_rows.dropna(subset=["total_bedrooms"])

Output: (only the column headers are shown — longitude, latitude, housing_median_age, total_rooms, total_bedrooms, population, households, median_income, ocean_proximity — since every one of these rows has a missing total_bedrooms and is dropped)

Method 2:

sample_incomplete_rows.drop("total_bedrooms",axis=1)

Output: (the same rows with the total_bedrooms column removed; table not reproduced here)

Method 3:

median = housing["total_bedrooms"].median()
sample_incomplete_rows["total_bedrooms"].fillna(median, inplace=True)
sample_incomplete_rows

Output: (the same rows with total_bedrooms filled in with the median; table not reproduced here)

Note: when using the third method, remember to save the computed value, because the test set will need to be filled with the same value later.

sklearn provides the SimpleImputer class to make handling missing values easier:

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")
housing_num = housing.drop("ocean_proximity", axis=1)  # the median can only be computed on numeric attributes, so drop the text column
imputer.fit(housing_num)

Output:

SimpleImputer(add_indicator=False, copy=True, fill_value=None,
              missing_values=nan, strategy='median', verbose=0)

imputer.statistics_

Output:

array([-118.51  ,   34.26  ,   29.    , 2119.5   ,  433.    , 1164.    ,
        408.    ,    3.5409])

housing_num.median().values

Output:

array([-118.51  ,   34.26  ,   29.    , 2119.5   ,  433.    , 1164.    ,
        408.    ,    3.5409])

As you can see, the imputer simply computed the median of every attribute and stored the results in its statistics_ attribute.

We already know that only total_bedrooms has missing values, but to be safe the imputer is applied to all of the numeric attributes.

X = imputer.transform(housing_num)
type(X)

Output:

numpy.ndarray

The result is a plain NumPy array; it can be converted back into a Pandas DataFrame:

housing_tr = pd.DataFrame(X, columns=housing_num.columns,
                          index=housing_num.index)
housing_tr.loc[sample_incomplete_rows.index.values]

Output: (the previously incomplete rows, now with total_bedrooms filled in; table not reproduced here)

All missing total_bedrooms values have been filled with the median, 433.

imputer.strategy

Output:

'median'

4.2 Handling text and categorical attributes

housing_cat = housing[["ocean_proximity"]]
housing_cat.head(10)

Output:

      ocean_proximity
17606       <1H OCEAN
18632       <1H OCEAN
14650      NEAR OCEAN
3230           INLAND
3555        <1H OCEAN
19480          INLAND
8879        <1H OCEAN
13685          INLAND
4937        <1H OCEAN
4861        <1H OCEAN

These are not arbitrary text values; they form a categorical attribute. sklearn's OrdinalEncoder class can convert them to numbers:

from sklearn.preprocessing import OrdinalEncoder

ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)
housing_cat_encoded[:10]

Output:

array([[0.],
       [0.],
       [4.],
       [1.],
       [0.],
       [1.],
       [0.],
       [1.],
       [0.],
       [0.]])

ordinal_encoder.categories_

Output:

[array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
       dtype=object)]

This attribute lists all of the category values the encoder learned.

The categories have no natural ordering, however. If we encode them as 0, 1, 2, 3, 4, machine learning algorithms will assume that two numerically close values are more similar than two distant ones, which is not the case here. For this situation one-hot encoding is the better choice, and sklearn provides the OneHotEncoder class:

from sklearn.preprocessing import OneHotEncoder

cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
housing_cat_1hot

Output:

<16512x5 sparse matrix of type '<class 'numpy.float64'>'
        with 16512 stored elements in Compressed Sparse Row format>

The result is not a NumPy array but a SciPy sparse matrix. With many categories, one-hot encoding creates a large number of new columns that are almost entirely zeros (a single 1 per row), which would waste a lot of memory. A sparse matrix only stores the locations of the non-zero entries, saving space. To convert it to a NumPy array, just call its toarray() method.

housing_cat_1hot.toarray()

Output:

array([[1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1.],
       ...,
       [0., 1., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0.]])

cat_encoder.categories_

Output:

[array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
       dtype=object)]

Alternatively, you can pass sparse=False, and the encoder will return a NumPy array directly:

cat_encoder = OneHotEncoder(sparse=False)
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
housing_cat_1hot

Output:

array([[1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1.],
       ...,
       [0., 1., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0.]])

4.3 Custom transformers

from sklearn.base import BaseEstimator, TransformerMixin

# column indices
rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):
        self.add_bedrooms_per_room = add_bedrooms_per_room

    def fit(self, X, y=None):
        return self  # nothing to fit

    def transform(self, X):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)

housing_extra_attribs = pd.DataFrame(
    housing_extra_attribs,
    columns=list(housing.columns) + ["rooms_per_household", "population_per_household"],
    index=housing.index)
housing_extra_attribs.head()

Output:

       longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
17606    -121.89     37.29                  38         1568             351
18632    -121.93     37.05                  14          679             108
14650     -117.2     32.77                  31         1952             471
3230     -119.61     36.31                  25         1847             371
3555     -118.59     34.23                  17         6592            1525

       population  households  median_income ocean_proximity  rooms_per_household  \
17606         710         339         2.7042       <1H OCEAN              4.62537
18632         306         113         6.4214       <1H OCEAN              6.00885
14650         936         462         2.8621      NEAR OCEAN              4.22511
3230         1460         353         1.8839          INLAND              5.23229
3555         4459        1463         3.0347       <1H OCEAN              4.50581

       population_per_household
17606                     2.0944
18632                    2.70796
14650                    2.02597
3230                     4.13598
3555                     3.04785

4.4 Feature scaling

Data usually needs to be scaled before it is fed to a machine learning model. Two methods are commonly used: min-max scaling and standardization.

Min-max scaling: subtract the minimum value and divide by (max - min), so every value ends up in the [0, 1] range. Outliers have a strong effect on this method. sklearn provides MinMaxScaler for it.
Standardization: subtract the mean and divide by the standard deviation, so the result has zero mean and unit variance. It does not bound values to a specific range, but it is much less affected by outliers. sklearn provides StandardScaler for it.
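
As a minimal sketch (not part of the original walkthrough), the two scalers can be compared side by side on a single numeric column; median_income is used here purely as an example:

from sklearn.preprocessing import MinMaxScaler, StandardScaler

income = housing_num[["median_income"]]

minmax_scaler = MinMaxScaler()    # rescales values into the [0, 1] range
std_scaler = StandardScaler()     # centers to zero mean and unit variance

print(minmax_scaler.fit_transform(income)[:3])
print(std_scaler.fit_transform(income)[:3])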

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("attribs_adder", CombinedAttributesAdder()),
    ("std_scaler", StandardScaler()),
])
housing_num_tr = num_pipeline.fit_transform(housing_num)
housing_num_tr

Output:

array([[-1.15604281,  0.77194962,  0.74333089, ..., -0.31205452,
        -0.08649871,  0.15531753],
       [-1.17602483,  0.6596948 , -1.1653172 , ...,  0.21768338,
        -0.03353391, -0.83628902],
       [ 1.18684903, -1.34218285,  0.18664186, ..., -0.46531516,
        -0.09240499,  0.422     ],
       ...,
       [ 1.58648943, -0.72478134, -1.56295222, ...,  0.3469342 ,
        -0.03055414, -0.52177644],
       [ 0.78221312, -0.85106801,  0.18664186, ...,  0.02499488,
         0.06150916, -0.30340741],
       [-1.43579109,  0.99645926,  1.85670895, ..., -0.22852947,
        -0.09586294,  0.10180567]])

Scikit-Learn 0.20 introduced the ColumnTransformer class, which can apply different transformations to different columns and process them all at once.

from sklearn.compose import ColumnTransformer

num_attribs = list(housing_num)      # names of the numeric columns
cat_attribs = ["ocean_proximity"]    # names of the categorical columns

full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])
housing_prepared = full_pipeline.fit_transform(housing)
housing_prepared

Output:

array([[-1.15604281,  0.77194962,  0.74333089, ...,  0.        ,
         0.        ,  0.        ],
       [-1.17602483,  0.6596948 , -1.1653172 , ...,  0.        ,
         0.        ,  0.        ],
       [ 1.18684903, -1.34218285,  0.18664186, ...,  0.        ,
         0.        ,  1.        ],
       ...,
       [ 1.58648943, -0.72478134, -1.56295222, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.78221312, -0.85106801,  0.18664186, ...,  0.        ,
         0.        ,  0.        ],
       [-1.43579109,  0.99645926,  1.85670895, ...,  0.        ,
         1.        ,  0.        ]])

housing_prepared.shape

Output:

(16512, 16)

housing.shape

Output:

(16512, 9)

Why does the result have 16 columns? The 8 numeric columns stay 8, the single categorical column becomes 5 one-hot columns, and the CombinedAttributesAdder adds 3 combined attributes (rooms_per_household, population_per_household and bedrooms_per_room): 8 + 5 + 3 = 16.

OneHotEncoder returns a sparse matrix, while num_pipeline returns a dense array. ColumnTransformer estimates the density of the final matrix: if the density is below a threshold (default 0.3) it returns a sparse matrix, otherwise a dense one.

In this case a dense matrix is returned. A rough sanity check: the columns produced by num_pipeline are all non-zero, and the one-hot part is only 5 of the 16 columns, so overall most entries are non-zero and the result is dense.
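
For reference, this threshold is the sparse_threshold parameter of ColumnTransformer (default 0.3). The following sketch, not from the original, raises it so that the combined result is returned as a sparse matrix instead:

from sklearn.compose import ColumnTransformer

sparse_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
], sparse_threshold=1.0)  # return sparse output whenever the overall density is below 1.0

print(type(sparse_pipeline.fit_transform(housing)))  # a SciPy sparse matrix rather than a dense array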

num_attribs

Output:

['longitude',
 'latitude',
 'housing_median_age',
 'total_rooms',
 'total_bedrooms',
 'population',
 'households',
 'median_income']

cat_encoder.categories_

Output:

[array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],dtype=object)]

5. Select and train a model

5.1 Linear regression

Start by training a linear regression model:

from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)

Output:

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

some_data = housing.iloc[:5]
some_labels = housing_labels[:5]
some_data_prepared = full_pipeline.transform(some_data)
print("Predictions: ", lin_reg.predict(some_data_prepared))
print("Truth Labels: ", list(some_labels))

Output:

Predictions:  [210644.60459286 317768.80697211 210956.43331178  59218.98886849
 189747.55849879]
Truth Labels:  [286600.0, 340600.0, 196900.0, 46300.0, 254500.0]

The predictions seem to be off by quite a large margin.

Let's compute the root mean squared error (RMSE):

from sklearn.metrics import mean_squared_error

housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)  # take the square root of the MSE
lin_rmse

Output:

68628.19819848923

Most districts' median house values lie between about $120,000 and $265,000, so a typical prediction error of $68,628 is not very satisfying. This is most likely a case of underfitting.

The main remedies for underfitting are: choose a more powerful model, feed the algorithm better features, or reduce the constraints on the model. Since this model is not regularized at all, we focus on the first two options.

Let's first try a different model, for example a decision tree:

5.2 Decision tree

from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(housing_prepared, housing_labels)

Output:

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=42, splitter='best')

housing_predictions = tree_reg.predict(housing_prepared)
tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse

Output:

0.0

An error of exactly 0? It is far more likely that the model has badly overfit the training data than that it is perfect. How can we tell whether it overfits? Evaluating on the test set would tell us, but the test set should only be touched once, at the very end of the project. The more common approach is to hold out part of the training set as a validation set, which is used for evaluating the model and tuning its hyperparameters.

5.3 Cross-validation

The straightforward approach is to split the training set into a smaller training set and a validation set, train on the former and evaluate on the latter. sklearn offers a more convenient tool: K-fold cross-validation.

from sklearn.model_selection import cross_val_score

scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)

sklearn's cross-validation expects a utility function (greater is better) rather than a cost function (lower is better), so the scoring function is the negative of the MSE. That is why the scores are negated before taking the square root.

def display_scores(scores):
    print("Scores: ", scores)
    print("Mean: ", scores.mean())
    print("Standard deviation: ", scores.std())

display_scores(tree_rmse_scores)

Output:

Scores:  [70194.33680785 66855.16363941 72432.58244769 70758.73896782
 71115.88230639 75585.14172901 70262.86139133 70273.6325285
 75366.87952553 71231.65726027]
Mean:  71407.68766037929
Standard deviation:  2439.4345041191004

The error is now about 71,407, nowhere near as good as it looked before (an RMSE of 0 on the training set). In fact the decision tree now appears to perform worse than linear regression, whose training RMSE was about 68,628.

Let's cross-validate the linear regression model as well:

lin_scores = cross_val_score(lin_reg, housing_prepared, housing_labels,
                             scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores)

Output:

Scores:  [66782.73843989 66960.118071   70347.95244419 74739.57052552
 68031.13388938 71193.84183426 64969.63056405 68281.61137997
 71552.91566558 67665.10082067]
Mean:  69052.46136345083
Standard deviation:  2731.6740017983425

5.4 Random forest

A random forest is an ensemble method: it trains many decision trees on random subsets of the data and averages their predictions.

from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor(n_estimators=100, random_state=42)  # n_estimators defaults to 100 from sklearn 0.22 onward
forest_reg.fit(housing_prepared, housing_labels)

Output:

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=100,
                      n_jobs=None, oob_score=False, random_state=42, verbose=0,
                      warm_start=False)

housing_predictions = forest_reg.predict(housing_prepared)
forest_mse = mean_squared_error(housing_labels, housing_predictions)
forest_rmse = np.sqrt(forest_mse)
forest_rmse

Output:

18603.515021376355

Clearly the random forest does better than the previous two models. Let's use cross-validation to check how much it overfits:

from sklearn.model_selection import cross_val_score

forest_scores = cross_val_score(forest_reg, housing_prepared, housing_labels,
                                scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)
display_scores(forest_rmse_scores)

Output:

Scores:  [49519.80364233 47461.9115823  50029.02762854 52325.28068953
 49308.39426421 53446.37892622 48634.8036574  47585.73832311
 53490.10699751 50021.5852922 ]
Mean:  50182.303100336096
Standard deviation:  2097.0810550985693

The error on the training set (about 18,600) is still much lower than the cross-validation error (about 50,200), so the model is still overfitting the training set to some degree.

5.5 Support vector machine

from sklearn.svm import SVR

svm_reg = SVR(kernel="linear")
svm_reg.fit(housing_prepared, housing_labels)
housing_predictions = svm_reg.predict(housing_prepared)
svm_mse = mean_squared_error(housing_labels, housing_predictions)
svm_rmse = np.sqrt(svm_mse)
svm_rmse

Output:

111094.6308539982

The error is far larger than for the other models, so a support vector machine with these settings is a poor fit for this regression task; SVMs are more commonly used for classification.

6. Fine-tuning the model

6.1 Grid search

The basic idea is to try many combinations of hyperparameter values and keep the combination that performs best. sklearn's GridSearchCV automates this work.

from sklearn.model_selection import GridSearchCV

para_grid = [
    {"n_estimators": [3, 10, 30], "max_features": [2, 4, 6, 8]},                 # 3*4 = 12 combinations
    {"bootstrap": [False], "n_estimators": [3, 10], "max_features": [2, 3, 4]},  # 1*2*3 = 6 combinations
]
forest_reg = RandomForestRegressor(random_state=42)
grid_search = GridSearchCV(forest_reg, para_grid, cv=5,
                           scoring="neg_mean_squared_error",  # 5-fold cross-validation for each combination
                           return_train_score=True)
grid_search.fit(housing_prepared, housing_labels)

Output:

GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=RandomForestRegressor(bootstrap=True, criterion='mse',
                                             max_depth=None, max_features='auto',
                                             max_leaf_nodes=None,
                                             min_impurity_decrease=0.0,
                                             min_impurity_split=None,
                                             min_samples_leaf=1,
                                             min_samples_split=2,
                                             min_weight_fraction_leaf=0.0,
                                             n_estimators='warn', n_jobs=None,
                                             oob_score=False, random_state=42,
                                             verbose=0, warm_start=False),
             iid='warn', n_jobs=None,
             param_grid=[{'max_features': [2, 4, 6, 8],
                          'n_estimators': [3, 10, 30]},
                         {'bootstrap': [False], 'max_features': [2, 3, 4],
                          'n_estimators': [3, 10]}],
             pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
             scoring='neg_mean_squared_error', verbose=0)

grid_search.best_params_

Output:

{'max_features': 8, 'n_estimators': 30}

grid_search.best_estimator_

Output:

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features=8, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=30,
                      n_jobs=None, oob_score=False, random_state=42, verbose=0,
                      warm_start=False)

cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

Output:

63669.05791727153 {'max_features': 2, 'n_estimators': 3}
55627.16171305252 {'max_features': 2, 'n_estimators': 10}
53384.57867637289 {'max_features': 2, 'n_estimators': 30}
60965.99185930139 {'max_features': 4, 'n_estimators': 3}
52740.98248528835 {'max_features': 4, 'n_estimators': 10}
50377.344409590376 {'max_features': 4, 'n_estimators': 30}
58663.84733372485 {'max_features': 6, 'n_estimators': 3}
5.15355973719 {'max_features': 6, 'n_estimators': 10}
50146.465964159885 {'max_features': 6, 'n_estimators': 30}
57869.25504027614 {'max_features': 8, 'n_estimators': 3}
51711.09443660957 {'max_features': 8, 'n_estimators': 10}
49682.25345942335 {'max_features': 8, 'n_estimators': 30}
62895.088889905004 {'bootstrap': False, 'max_features': 2, 'n_estimators': 3}
54658.14484390074 {'bootstrap': False, 'max_features': 2, 'n_estimators': 10}
59470.399594730654 {'bootstrap': False, 'max_features': 3, 'n_estimators': 3}
52725.01091081235 {'bootstrap': False, 'max_features': 3, 'n_estimators': 10}
57490.612956065226 {'bootstrap': False, 'max_features': 4, 'n_estimators': 3}
51009.51445842374 {'bootstrap': False, 'max_features': 4, 'n_estimators': 10}

pd.DataFrame(grid_search.cv_results_)

Output: (an 18 rows x 23 columns DataFrame with the full cross-validation results: fit and score times, parameters, per-split and mean test scores, ranks, and per-split train scores; table not reproduced here)

Grid search found that the combination max_features=8 and n_estimators=30 gives the best mean score, about 49,682, which is better than the earlier 50,182.

However, both of these values are the largest in their candidate lists, i.e. they sit on the edge of the search space, so enlarging the search ranges might find even better parameters.

Try a larger search space:

para_grid = [
    {"n_estimators": [30, 50, 80], "max_features": [8, 10, 12, 14]},                       # 3*4 = 12 combinations
    {"bootstrap": [False], "n_estimators": [20, 30, 80], "max_features": [4, 8, 12, 14]},  # 1*3*4 = 12 combinations
]
forest_reg = RandomForestRegressor(random_state=42)
grid_search = GridSearchCV(forest_reg, para_grid, cv=5,
                           scoring="neg_mean_squared_error",  # 5-fold cross-validation for each combination
                           return_train_score=True)
grid_search.fit(housing_prepared, housing_labels)

Output:

GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=RandomForestRegressor(bootstrap=True, criterion='mse',
                                             max_depth=None, max_features='auto',
                                             max_leaf_nodes=None,
                                             min_impurity_decrease=0.0,
                                             min_impurity_split=None,
                                             min_samples_leaf=1,
                                             min_samples_split=2,
                                             min_weight_fraction_leaf=0.0,
                                             n_estimators='warn', n_jobs=None,
                                             oob_score=False, random_state=42,
                                             verbose=0, warm_start=False),
             iid='warn', n_jobs=None,
             param_grid=[{'max_features': [8, 10, 12, 14],
                          'n_estimators': [30, 50, 80]},
                         {'bootstrap': [False], 'max_features': [4, 8, 12, 14],
                          'n_estimators': [20, 30, 80]}],
             pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
             scoring='neg_mean_squared_error', verbose=0)

grid_search.best_params_

Output:

{'bootstrap': False, 'max_features': 8, 'n_estimators': 80}

grid_search.best_estimator_

Output:

RandomForestRegressor(bootstrap=False, criterion='mse', max_depth=None,
                      max_features=8, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=80,
                      n_jobs=None, oob_score=False, random_state=42, verbose=0,
                      warm_start=False)

These results suggest that max_features=8 may indeed be the best value, while n_estimators might keep improving if it is increased further.

6.2 Randomized search

When the hyperparameter search space is large, i.e. there are many possible combinations, grid search becomes impractical; randomized search is much more useful in that case.

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

para_distribs = {
    "n_estimators": randint(low=1, high=200),
    "max_features": randint(low=1, high=8),
}
forest_reg = RandomForestRegressor(random_state=42)
rnd_search = RandomizedSearchCV(forest_reg, param_distributions=para_distribs,
                                n_iter=10, cv=5, scoring="neg_mean_squared_error",
                                random_state=42)
rnd_search.fit(housing_prepared, housing_labels)

Output:

RandomizedSearchCV(cv=5, error_score='raise-deprecating',
                   estimator=RandomForestRegressor(bootstrap=True,
                                                   criterion='mse',
                                                   max_depth=None,
                                                   max_features='auto',
                                                   max_leaf_nodes=None,
                                                   min_impurity_decrease=0.0,
                                                   min_impurity_split=None,
                                                   min_samples_leaf=1,
                                                   min_samples_split=2,
                                                   min_weight_fraction_leaf=0.0,
                                                   n_estimators='warn',
                                                   n_jobs=None, oob_score=False,
                                                   random_sta...
                                                   warm_start=False),
                   iid='warn', n_iter=10, n_jobs=None,
                   param_distributions={'max_features': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000000002AC750B8>,
                                        'n_estimators': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000000002AC5E160>},
                   pre_dispatch='2*n_jobs', random_state=42, refit=True,
                   return_train_score=False, scoring='neg_mean_squared_error',
                   verbose=0)

cvres = rnd_search.cv_results_
for mean_score, param in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), param)

Output:

49150.657232934034 {'max_features': 7, 'n_estimators': 180}
51389.85295710133 {'max_features': 5, 'n_estimators': 15}
50796.12045980556 {'max_features': 3, 'n_estimators': 72}
50835.09932039744 {'max_features': 5, 'n_estimators': 21}
49280.90117886215 {'max_features': 7, 'n_estimators': 122}
50774.86679035961 {'max_features': 3, 'n_estimators': 75}
50682.75001237282 {'max_features': 3, 'n_estimators': 88}
49608.94061293652 {'max_features': 5, 'n_estimators': 100}
50473.57642831875 {'max_features': 3, 'n_estimators': 150}
64429.763804893395 {'max_features': 5, 'n_estimators': 2}

The randomized search evaluated 10 randomly sampled hyperparameter combinations.

6.3 Analyzing the best model

feature_importances = grid_search.best_estimator_.feature_importances_
feature_importances

Output:

array([7.58744378e-02, 6.78499428e-02, 4.19007228e-02, 1.48764231e-02,
       1.38685455e-02, 1.41736267e-02, 1.38318766e-02, 3.70241621e-01,
       4.89380985e-02, 1.10122210e-01, 5.43566571e-02, 5.97927301e-03,
       1.62816474e-01, 7.91378906e-05, 2.11403803e-03, 2.97691572e-03])

extra_attribs = ["rooms_per_hhold", "pop_per_hhold", "bedrooms_per_room"]
cat_encoder = full_pipeline.named_transformers_["cat"]
cat_one_hot_attribs = list(cat_encoder.categories_[0])
attributes = num_attribs + extra_attribs + cat_one_hot_attribs
sorted(zip(feature_importances, attributes), reverse=True)

Output:

[(0.3702416205722281, 'median_income'),
 (0.16281647391862586, 'INLAND'),
 (0.11012221006942573, 'pop_per_hhold'),
 (0.07587443781723896, 'longitude'),
 (0.06784994279359013, 'latitude'),
 (0.05435665711948978, 'bedrooms_per_room'),
 (0.04893809850661322, 'rooms_per_hhold'),
 (0.041900722751888345, 'housing_median_age'),
 (0.014876423067328884, 'total_rooms'),
 (0.014173626700600368, 'population'),
 (0.01386854546599152, 'total_bedrooms'),
 (0.013831876573763422, 'households'),
 (0.005979273010634567, '<1H OCEAN'),
 (0.00297691571669059, 'NEAR OCEAN'),
 (0.0021140380252633183, 'NEAR BAY'),
 (7.913789062714212e-05, 'ISLAND')]

This output ranks the features by importance. Quite a few of them contribute very little, so they could be dropped to lighten the model.
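
As a hedged illustration (this helper does not appear in the original text), the low-importance features could be filtered out with a small custom transformer that keeps only the k features with the highest importances:

from sklearn.base import BaseEstimator, TransformerMixin

class TopFeatureSelector(BaseEstimator, TransformerMixin):
    # hypothetical helper: keep only the k features with the largest importances
    def __init__(self, feature_importances, k):
        self.feature_importances = feature_importances
        self.k = k

    def fit(self, X, y=None):
        # indices of the k largest importances
        self.feature_indices_ = np.argsort(self.feature_importances)[-self.k:]
        return self

    def transform(self, X):
        return X[:, self.feature_indices_]

top_k_selector = TopFeatureSelector(feature_importances, k=5)
housing_prepared_top_k = top_k_selector.fit_transform(housing_prepared)
housing_prepared_top_k.shape  # (16512, 5)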

It is also time to evaluate the final model's error on the test set and to look at where its errors come from:

final_model = grid_search.best_estimator_

X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()

X_test_prepared = full_pipeline.transform(X_test)
final_predictions = final_model.predict(X_test_prepared)

final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
final_rmse

Output:

46736.13265618231

from scipy import stats

# 95% confidence interval for the test RMSE
confidence = 0.95
squared_errors = (final_predictions - y_test) ** 2
np.sqrt(stats.t.interval(confidence, len(squared_errors) - 1,
                         loc=squared_errors.mean(),
                         scale=stats.sem(squared_errors)))

Output:

array([44718.52821289, 48670.16977354])

# Computing the same interval by hand:
m = len(squared_errors)
mean = squared_errors.mean()
tscore = stats.t.ppf((1 + confidence) / 2, df=m - 1)
tmargin = tscore * squared_errors.std(ddof=1) / np.sqrt(m)
np.sqrt(mean - tmargin), np.sqrt(mean + tmargin)

Output:

(44718.52821289402, 48670.16977353933)

# Alternatively, use z-scores instead of t-scores:
zscore = stats.norm.ppf((1 + confidence) / 2)
zmargin = zscore * squared_errors.std(ddof=1) / np.sqrt(m)
np.sqrt(mean - zmargin), np.sqrt(mean + zmargin)

Output:

(44719.133276515044, 48669.61382947091)

7. Deploy, monitor and maintain the system

Save the trained model, including the full preprocessing and prediction pipeline, and deploy it to production to serve predictions.

The model can be deployed on a dedicated server or on a cloud platform such as Google AI Platform.
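
A minimal sketch of persisting the fitted pipeline and model with joblib (the file names are only examples, not from the original):

import joblib

# save the fitted preprocessing pipeline and the final model
joblib.dump(full_pipeline, "housing_full_pipeline.pkl")
joblib.dump(final_model, "housing_final_model.pkl")

# later, in production, load them back and predict on new data
loaded_pipeline = joblib.load("housing_full_pipeline.pkl")
loaded_model = joblib.load("housing_final_model.pkl")
# predictions = loaded_model.predict(loaded_pipeline.transform(new_data))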
