失眠网 > 机器学习实践—基于Scikit-Learn Keras和TensorFlow2第二版—第4章训练模型

机器学习实践—基于Scikit-Learn Keras和TensorFlow2第二版—第4章训练模型

时间：2023-10-04 11:41:37

通过之前的学习，我们已经可以创建出手写数字训练的模型，并且经过不断优化能达到一个很好的效果，但是这些模型背后是如何工作，我们却一无所知，就像一个黑盒。深入了解模型背后的原理，可以使我们更快地适应模型、选择合适的算法、选择合适的超参数等等，甚至可以快速分析误差来源。

0. 导入所需的库

import numpy as npimport matplotlib as mplimport matplotlib.pyplot as plt%matplotlib inlineimport sklearnfor i in (np, mpl, sklearn):print(i.__name__,": ",i.__version__,sep="")

输出：

numpy: 1.17.4matplotlib: 3.1.2sklearn: 0.21.3

1. 线性回归

机器学习中，向量一般代表列向量，相当于只有一列的矩阵。

线性回归模型：

损失函数为圴方误差

闭型解：

更多关于线性回归及算法原理推导，请查看：/Jwenxue/article/details/106599344

1.1 线性回归求解

为了更好地进行案例分析，现在生成一些近似线性关系的数据：

X = 2 * np.random.rand(100,1)y = 4 + 3 * X + np.random.randn(100,1)plt.plot(X,y,"b.")plt.xlabel("$x_1$",fontsize=18)plt.ylabel("$y$",rotation=0, fontsize=18)plt.axis([0,2,0,15])plt.show()

输出：

上图所示为生成的数据。现在利用上面提到的闭型解公式求解参数θ：

X_b = np.c_[np.ones((100,1)),X] # 为了方便求解偏置b，在X中加入一列全1theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)theta_best

输出：

array([[3.97080953],[2.89942544]])

可以得出𝜃0=4.29050561θ0=4.29050561，𝜃1=2.703075θ1=2.703075，这与原始的4和3已经很接近了，但是由于数据中加入了高斯噪声，所以无法计算出确切的参数值，只能接近，甚至无限接近。

可以使用获得的参数进行预测：

X_new = np.array([[0],[2]])X_new_b = np.c_[np.ones((2,1)), X_new]y_predict = X_new_b.dot(theta_best)y_predict

输出：

array([[3.97080953],[9.7696604 ]])

plt.plot(X_new, y_predict,"r-",linewidth=2,label="Predictions")plt.plot(X,y,"b.")plt.xlabel("$x_1$",fontsize=18)plt.ylabel("$y$",rotation=0,fontsize=18)plt.axis([0,2,0,15])plt.legend(fontsize=14)plt.show()

输出：

使用sklearn工具完成线性回归模型：

from sklearn.linear_model import LinearRegressionlin_reg = LinearRegression()lin_reg.fit(X,y)lin_reg.intercept_, lin_reg.coef_

输出：

(array([3.97080953]), array([[2.89942544]]))

lin_reg.predict(X_new)

输出：

array([[3.97080953],[9.7696604 ]])

可以看到，sklearn结果与上面结果一致。

sklearn中线性回归类LinearRegression是基于Scipy中linalg.lstsq()函数，即最小二乘法：

theta_best_svd, residuals, rank, s = np.linalg.lstsq(X_b,y,rcond=1e-6)theta_best_svd

输出：

array([[3.97080953],[2.89942544]])

上述函数计算过程为：先计算X的伪逆，再乘y，即得到theta的值。也可直接手动调动完成：

np.linalg.pinv(X_b).dot(y)

输出：

array([[3.97080953],[2.89942544]])

计算伪逆可以通过矩阵的奇异值分解的方法，这种方法比上述提到直接计算闭型解的方法高效，同时奇异值分解可以很好地处理特殊情况。因为X.T.dot(X)有时是不可逆的，例如m<n，或者某些特征多余或重复的情况下。

1.2 计算复杂度

不同求解方法计算复杂度：

直接求闭型解：对于X加上偏置后其X.T.dot(X)为(n+1)*(n+1)的矩阵，根据不同算法实现方法的不同，计算复杂度为O（n^2.4)到O（n^3)之间。如果特征增加一倍，则计算量大约增加5.3倍到8位。奇异值分解：O(n^2)，如果特征增加一倍，则计算量大约增加4倍。

无论是直接求解的方法，还是奇异值分解的方法，当数据量很大的时候，需要将所有训练集读入内存，才能进行下一步的计算，这就要求大内存的平台。因此，梯度下降可以很好地解决这些问题。

2. 梯度下降

梯度下降的思想是通过不断地调整参数从而使损失函数最小化。

梯度下降中学习率是一个重要的超参数，不能太大，防止越过最小值或发生震荡，也不能太小，防止训练花费很长很长的时间。

均方误差函数是凸函数，也就是说函数上任意两个点组成的线段都不会穿过曲线。换句话说圴方误差函数没有局部极小值，只有一个全局最小值。同时圴方误差是连续函数。因此，以圴方误差为损失函数的模型，在设置合适学习率的情况下，一定会找到全局最优解。

梯度下降分为三类：

批量梯度下降：每个epoch中所有训练集都参与运算小批量梯度下降：每个epoch中训练集随机分成batch_size大小参与运算随机梯度下降：每个epoch中训练集每次一个样本一个样本参与运算

2.1 批量梯度下降

由偏导数组成的向量叫梯度，梯度的方向是函数值增大的方向。

eta = 0.1n_interations = 1000m = 100theta = np.random.randn(2,1)for iter in range(n_interations):gradients = 2/m * X_b.T.dot(X_b.dot(theta)-y) # 根据偏导数公式计算theta = theta - eta * gradientstheta

输出：

array([[3.97080953],[2.89942544]])

X_new_b.dot(theta)

输出：

array([[3.97080953],[9.7696604 ]])

批量梯度下降可以得出与SVD差不多的结果。

可以再观察分析不同学习率对梯度下降的影响：

theta_path_bgd = []def plot_gradient_descent(theta, eta, theta_path=None):m = len(X_b)plt.plot(X, y, "b.")n_iterations = 1000for iteration in range(n_interations):if iteration < 10:y_predict = X_new_b.dot(theta)style = "b-" if iteration > 0 else "r--"plt.plot(X_new, y_predict, style)gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)theta = theta - eta * gradientsif theta_path is not None:theta_path.append(theta)plt.xlabel("$x_1$",fontsize=18)plt.axis([0,2,0,15])plt.title(r"$\eta = {}$".format(eta), fontsize=16)np.random.seed(42)theta = np.random.randn(2,1)plt.figure(figsize=(12,5))plt.subplot(131);plot_gradient_descent(theta, eta=0.02)plt.ylabel("$y$", rotation=0, fontsize=18)plt.subplot(132);plot_gradient_descent(theta, eta=0.1,theta_path=theta_path_bgd)plt.subplot(133);plot_gradient_descent(theta, eta=0.5)plt.tight_layout()plt.show()

输出：

上图中红线代表theta值初始化时的值，其它9条蓝线是前9次梯度下降迭代后theta的值。可以看到中间图学习率为0.1时，经过9轮迭代就达到了不错的效果。而第一幅图学习为0.02，说明太小了，需要很多轮的迭代才能达到较好的效果。而第三幅图学习率为0.5，说明太大了，一下子越过了最好的位置。

如何找到最好的学习率，可以使用grid search的方法。如何设置迭代次数，可以先设一个比较大的迭代次数，然后在梯度变化很小时提前停止。

2.2 随机梯度下降

随机梯度下降每次选取一个样本进行训练。随机梯度下降每次迭代很快，因为只有一个样本数据，但是可能存在损失函数上下震荡的情况。

theta_path_sgd = []m = len(X_b)np.random.seed(42)n_epochs = 50t0, t1 = 5,50def learning_schedule(t):return t0/(t+t1)theta = np.random.randn(2,1)for epoch in range(n_epochs):for i in range(m):if epoch == 0 and i < 20:y_predict = X_new_b.dot(theta)style = "b-" if i > 0 else "r--"plt.plot(X_new, y_predict, style)random_index = np.random.randint(m)xi = X_b[random_index:random_index+1]yi = y[random_index:random_index+1]gradients = 2*xi.T.dot(xi.dot(theta)-yi)eta = learning_schedule(epoch*m+i)theta = theta - eta * gradientstheta_path_sgd.append(theta)plt.plot(X,y,"b.")plt.xlabel("$x_1$",fontsize=18)plt.ylabel("$y$",rotation=0,fontsize=18)plt.axis([0,2,0,15])plt.show()

输出：

theta

输出：

array([[4.03600306],[2.84237827]])

可以看到，随机梯度下降只训练了50轮就达到了不错的效果。上图中蓝色线条显示了前20轮的训练情况。

from sklearn.linear_model import SGDRegressorsgd_reg = SGDRegressor(max_iter=1000, tol=1e-3, penalty=None, eta0=0.1, random_state=42)sgd_reg.fit(X,y.ravel())

输出：

SGDRegressor(alpha=0.0001, average=False, early_stopping=False, epsilon=0.1,eta0=0.1, fit_intercept=True, l1_ratio=0.15,learning_rate='invscaling', loss='squared_loss', max_iter=1000,n_iter_no_change=5, penalty=None, power_t=0.25, random_state=42,shuffle=True, tol=0.001, validation_fraction=0.1, verbose=0,warm_start=False)

sgd_reg.intercept_, sgd_reg.coef_

输出：

(array([4.01500975]), array([2.9432024]))

可以看到越来越接近theta参数的真实值：4和3

2.3 小批量梯度下降

通过上面分析批量梯度下降和随机梯度下降，两者均有一些缺点，而小批量梯度下降刚好是两者的折中，有效地避免了批量梯度下降和随机梯度下降的缺点。

theta_path_mgd = []n_iternations = 50minibatch_size = 20np.random.seed(42)theta = np.random.randn(2,1)t0, t1 = 200, 1000def learning_schedule(t):return t0/(t+t1)t = 0for epoch in range(n_iternations):shuffled_indices = np.random.permutation(m)X_b_shuffled = X_b[shuffled_indices]y_shuffled = y[shuffled_indices]for i in range(0,m,minibatch_size):t += 1xi = X_b_shuffled[i:i+minibatch_size]yi = y_shuffled[i:i+minibatch_size]gradients = 2/minibatch_size * xi.T.dot(xi.dot(theta)-yi)theta = theta - eta * gradientstheta_path_mgd.append(theta)theta

输出：

array([[4.01532936],[2.98476426]])

可以看到，小批量随机梯度下降也能得到比较好的结果。

对三种梯度下降方法参数变化趋势进行作图：

theta_list = [theta_path_bgd, theta_path_mgd, theta_path_sgd]theta_type = ["Batch","Mini-batch","Stochastic"]theta_style = ["b-o","g-+","r-s"]plt.figure(figsize=(8,5))for i in reversed(range(3)):plt.plot(np.array(theta_list[i])[:,0],np.array(theta_list[i])[:,1],theta_style[i],linewidth=2,label=theta_type[i])plt.xlabel(r"$\theta_0$",fontsize=20)plt.ylabel(r"$\theta_1$",fontsize=20, rotation=0)plt.axis([2.5,4.5,2.3,3.9])plt.legend(fontsize=16,loc="upper left")plt.show()

输出：

从上图中可以看出，批量梯度下降最终能比较准确定位到损失函数的最优解，而随机梯度下降和小批量随机梯度下降都在最优解附近徘徊，并且随机梯度下降的参数随机摆动性更大。

3. 多项式回归

为了方便案例分析，先生成一些数据：

np.random.seed(42)m = 100X = 6 * np.random.rand(m,1)-3y = 0.5* X**2 + X + 2 + np.random.randn(m,1)X.shape, y.shape

输出：

((100, 1), (100, 1))

plt.plot(X,y,"b.")plt.xlabel("$x_1$",fontsize=18)plt.ylabel("$y$",fontsize=18, rotation=0)plt.show()

输出：

很明显一条直线无法很好地拟合该数据！

可以通过sklearn中的工具给数据X加上新特征，即原数据的平方：

from sklearn.preprocessing import PolynomialFeaturespoly_features = PolynomialFeatures(degree=2, include_bias=False)X_poly = poly_features.fit_transform(X)X[0], X_poly[0]

输出：

(array([-0.75275929]), array([-0.75275929, 0.56664654]))

此时数据X_poly有两个特征：原始数据和其平方。再用线性回归训练数据：

lin_reg = LinearRegression()lin_reg.fit(X_poly, y)lin_reg.intercept_, lin_reg.coef_

输出：

(array([1.78134581]), array([[0.93366893, 0.56456263]]))

X_new = np.linspace(-3,3,100).reshape(100,1)X_new_poly = poly_features.transform(X_new)y_new = lin_reg.predict(X_new_poly)plt.plot(X,y,"b.")plt.plot(X_new,y_new,"r-",linewidth=2,label="Predictions")plt.xlabel("$x_1$",fontsize=18)plt.ylabel("$y$",fontsize=18,rotation=0)plt.legend(fontsize=14)plt.show()

输出：

可以看到，拟合的模型效果还不错。

这种添加多项式的方式对只有一个特征的还比较有用。如果有两个特征a和b，那么多项式组合degree=2时有a平方、b平方和ab三种；如果degree=3时，则有很多种可能了。

4. 学习曲线

from sklearn.preprocessing import StandardScalerfrom sklearn.pipeline import Pipelinefor style, width, degree in (("g-", 1, 300), ("b--", 2, 2), ("r-+", 2, 1)):polybig_features = PolynomialFeatures(degree=degree, include_bias=False)std_scaler = StandardScaler()lin_reg = LinearRegression()polynomial_regression = Pipeline([("poly_features", polybig_features),("std_scaler", std_scaler),("lin_reg", lin_reg)])polynomial_regression.fit(X, y)y_newbig = polynomial_regression.predict(X_new)plt.plot(X_new, y_newbig, style, label=str(degree), linewidth=width)plt.plot(X, y, "b.", linewidth=3)plt.legend(loc="upper center")plt.xlabel("$x_1$", fontsize=18)plt.ylabel("$y$", rotation=0, fontsize=18)plt.axis([-3, 3, 0, 10])plt.show()

输出：

当分别用一元一次、一元二次、一元300次的多项式拟合上述数据时，如上图所示，可以看出一元300次的多项式过分地拟合数据，发生了很多来回震荡的情况，很明显这就是过拟合的现象。

然而一元一次的模型又拟合的不是很好，这属于欠拟合状态。而一元二次的模型拟合的比较好，实际上我们的数据也是通过一元二次公式生成的。

但是在实际应用中，并非知道哪种模型适合我们的数据。

关于过拟合和欠拟合，可以使用交叉验证的方式进行，如果训练集上效果好，验证集上效果不好，属于过拟合；如果两者效果都不好，属于欠拟合。（很少会出现训练集效果不好，而验证效果好的情况，这种一般可能是偶然）

而另一种关于判断过拟合和欠拟合的方法就是观察学习曲线：根据迭代次数对模型在训练集和验证集上的效果进行作图：

from sklearn.metrics import mean_squared_errorfrom sklearn.model_selection import train_test_splitdef plot_learning_curves(model, X, y):X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=10)train_errors, val_errors = [], []for m in range(1, len(X_train)):model.fit(X_train[:m], y_train[:m])y_train_predict = model.predict(X_train[:m])y_val_predict = model.predict(X_val)train_errors.append(mean_squared_error(y_train[:m],y_train_predict))val_errors.append(mean_squared_error(y_val, y_val_predict))plt.plot(np.sqrt(train_errors),"r-+",linewidth=2,label="train")plt.plot(np.sqrt(val_errors),"b-",linewidth=3,label="val")plt.legend()plt.xlabel("Training set size")plt.ylabel("RMSE")

lin_reg = LinearRegression()plot_learning_curves(lin_reg, X, y)plt.show()

输出：

上图结果显示，当不断加入训练数据时，模型的拟合效果不断下降，这是由于噪声不断增多，另外也不再是线性关系了。继续增加训练数据时达到一个平台期，拟合效果不再减小。最终两条线都达到了平台，但还是有一定的差距，模型在训练集和验证集上根均方误差都比较大，说明模型处于欠拟合状态。

注意：对于欠拟合模型，增加更多的数据是没有用的，而是应该选择更合适的模型算法或者寻找更合适的特征。

尝试10次多项式的模型：

from sklearn.pipeline import Pipelinepolynomial_regression = Pipeline([("poly_features",PolynomialFeatures(degree=10,include_bias=False)),("lin_reg",LinearRegression()) ])plot_learning_curves(polynomial_regression, X, y)plt.axis([0,80,0,3])plt.show()

输出：

可以看出，10次多项式模型在训练集和验证集上的误差都比线性回归模型的要好。

5. 正则化线性模型

减少过拟合的方法之一是模型中加入正则化。对于多项式模型可以减少多项式的次数来达到正则化的目的；对于线性模型，通常对权重进行约束达到正则化的目的。下面三种模型分别使用三种不同的权重约束方法：

Ridge RegressionLasso RegressionElastic Net

5.1 岭回归（Ridge Regression）

Ridge Regression就是进行L2正则化的线性回归模型，其基本思想是模型尽可能地拟合数据，同时要求模型权重参数越小越好。正则项直接添加到损失函数中。

注意：模型评估的时候不再添加正则化项。（在模型训练和评估时使用不同的损失函数是很常见的，可能训练的时候损失越好求导越好，而评估时越接近实际的需求越好。例如分类模型中训练损失函数常常用损失对数函数，而评估时候往往用精度和召回率）

注意：L2正则化权重项中不包括偏置项b。对于正则化模型模型，数据一定要做归一化操作。

np.random.seed(42)m = 20X = 3 * np.random.rand(m,1)y = 1+0.5*X+np.random.randn(m,1)/1.5X_new = np.linspace(0,3,100).reshape(100,1)

from sklearn.linear_model import Ridgeridge_reg = Ridge(alpha=1, solver="cholesky", random_state=42)ridge_reg.fit(X,y)ridge_reg.predict([[1.5]])

输出：

array([[1.55071465]])

ridge_reg = Ridge(alpha=1, solver="sag", random_state=42)ridge_reg.fit(X,y)ridge_reg.predict([[1.5]])

输出：

array([[1.5507201]])

def plot_model(model_class, polynomial, alphas, **model_kargs):for alpha, style in zip(alphas, ("b-","g--","r:")):model = model_class(alpha, **model_kargs) if alpha > 0 else LinearRegression()if polynomial:model = Pipeline([("poly_features",PolynomialFeatures(degree=10, include_bias=False)),("std_scaler",StandardScaler()),("regul_reg",model)])model.fit(X, y)y_new_regul = model.predict(X_new)lw = 2 if alpha > 0 else 1plt.plot(X_new, y_new_regul, style, linewidth=lw, label=r"$\alpha = {}$".format(alpha))plt.plot(X,y,"b.",linewidth=3)plt.legend(fontsize=18)plt.xlabel("$x_1$",fontsize=18)plt.axis([0,3,0,4])plt.figure(figsize=(10,5))plt.subplot(121)plot_model(Ridge, polynomial=False, alphas=(0,10,100), random_state=42)plt.ylabel("$y$",fontsize=18,rotation=0)plt.subplot(122)plot_model(Ridge, polynomial=True, alphas=(0,10**-5, 1), random_state=42)plt.show()

输出：

不同的alpha值表示不同的正则化惩罚力度，alpha越大表示惩罚越重，反之越小。

左图：利用原始数据进行岭回归模型的训练

右图：先对数据进行10次多项式扩展，再训练岭回归模型。观察可以发现，alpha为0时，过拟合有些严重，随着alpha的增大，过拟合慢慢减轻。

与线性回归一样，岭回归也可以直接通过闭型解进行参数求解，也可以使用梯度下降进行模型训练，但是利弊跟线性回归一样。

下面尝试利用随机梯度下降算法：

sgd_reg = SGDRegressor(penalty="l2", max_iter=1000, tol=1e-3,random_state=42)sgd_reg.fit(X, y.ravel()) # ravel：矩阵向量化

输出：

SGDRegressor(alpha=0.0001, average=False, early_stopping=False, epsilon=0.1,eta0=0.01, fit_intercept=True, l1_ratio=0.15,learning_rate='invscaling', loss='squared_loss', max_iter=1000,n_iter_no_change=5, penalty='l2', power_t=0.25, random_state=42,shuffle=True, tol=0.001, validation_fraction=0.1, verbose=0,warm_start=False)

sgd_reg.predict([[1.5]])

输出：

array([1.47012588])

5.2 Lasso Regression

Lasso：Least Absolute Shrinkage and Selection Operator Regression

Lasso回归是另一个线性回归的正则化版本，在损失函数中添加了L1正则化项。

from sklearn.linear_model import Lassoplt.figure(figsize=(10,5))plt.subplot(121);plot_model(Lasso, polynomial=False, alphas=(0,0.1,1),random_state=42)plt.ylabel("$y$",fontsize=18,rotation=0)plt.subplot(122);plot_model(Lasso, polynomial=True, alphas=(0,10**-7,1),random_state=42)plt.tight_layout()plt.show()

输出：

如上图所示，岭回归和Lasso回归似乎效果类似，但还是存在明显的差别。

Lasso回归的特点是逐渐减小非重要特征的权重，也就是说Lasso会进行特征的选择并输出稀疏模型。

from sklearn.linear_model import Lassolasso_reg = Lasso(alpha=0.1)lasso_reg.fit(X, y)lasso_reg.predict([[1.5]])

输出：

array([1.53788174])

5.3 Elastic Net(弹性网络)

Elastic Net是介于岭回归和Lasso回归之间的一种，综合了L1和L2正则化项，通过超参数r控制两者的比例。r=0时即为L2正则，r=1时即为L1正则。

在机器学习建模过程中，一般情况下建议不要使用纯的线性回归，加入少许正则会比较好。那么该选择使用什么样的正则呢？通常情况下默认使用L2正则，即岭回归算法。如果你怀疑样本数据集中有一些特征是没用的，那么可以考虑使用Lasso回归或者Elastic网络，因为这两个模型可以将那些无用特征的权重降低走近于0。而在Lasso和Elastic两个网络中，更建议使用Elastic网络，因为Lasso网络在特征个数远大于样本数量或者某些特征之间有强相关性时会出现不稳定的现象。

from sklearn.linear_model import ElasticNetelastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5, random_state=42)elastic_net.fit(X, y)elastic_net.predict([[1.5]])

输出：

array([1.54333232])

6. 提前停止模型训练(Early Stopping)

np.random.seed(42)m = 100X = 6*np.random.rand(m,1)-3y = 2+X+0.5*X**2+np.random.randn(m,1)X_train, X_val, y_train, y_val = train_test_split(X[:50],y[:50],test_size=0.5,random_state=42)X_train.shape, X_val.shape, y_train.shape, y_val.shape

输出：

((25, 1), (25, 1), (25, 1), (25, 1))

from sklearn.base import clonepoly_scaler = Pipeline([("poly_features",PolynomialFeatures(degree=90,include_bias=False)),("std_scaler",StandardScaler())])X_train_poly_scaled = poly_scaler.fit_transform(X_train)X_val_poly_scaled = poly_scaler.transform(X_val)sgd_reg = SGDRegressor(max_iter=1,tol=-np.infty,warm_start=True,penalty=None, learning_rate="constant",eta0=0.0005,random_state=42)minimum_val_error = float("inf")best_epoch = Nonebest_model = Nonefor epoch in range(1000):sgd_reg.fit(X_train_poly_scaled, y_train.ravel())y_val_predict = sgd_reg.predict(X_val_poly_scaled)val_error = mean_squared_error(y_val, y_val_predict)if val_error < minimum_val_error:minimum_val_error = val_errorbest_epoch = epochbest_model = clone(sgd_reg)best_epoch, best_model

输出：

(999,SGDRegressor(alpha=0.0001, average=False, early_stopping=False, epsilon=0.1,eta0=0.0005, fit_intercept=True, l1_ratio=0.15,learning_rate='constant', loss='squared_loss', max_iter=1,n_iter_no_change=5, penalty=None, power_t=0.25, random_state=42,shuffle=True, tol=-inf, validation_fraction=0.1, verbose=0,warm_start=True))

sgd_reg = SGDRegressor(max_iter=1, tol=-np.infty, warm_start=True,penalty=None, learning_rate="constant", eta0=0.0005, random_state=42)n_epochs = 5000train_errors, val_errors = [], []for epoch in range(n_epochs):sgd_reg.fit(X_train_poly_scaled, y_train.ravel())y_train_predict = sgd_reg.predict(X_train_poly_scaled)y_val_predict = sgd_reg.predict(X_val_poly_scaled)train_errors.append(mean_squared_error(y_train, y_train_predict))val_errors.append(mean_squared_error(y_val, y_val_predict))best_epoch = np.argmin(val_errors)best_val_rmse = np.sqrt(val_errors[best_epoch])plt.annotate('Best model',xy=(best_epoch, best_val_rmse),xytext=(best_epoch, best_val_rmse + 1),ha="center",arrowprops=dict(facecolor='black', shrink=0.05),fontsize=16,)best_val_rmse -= 0.03 # just to make the graph look betterplt.plot([0, n_epochs], [best_val_rmse, best_val_rmse], "k:", linewidth=2)plt.plot(np.sqrt(val_errors), "b-", linewidth=3, label="Validation set")plt.plot(np.sqrt(train_errors), "r--", linewidth=2, label="Training set")plt.legend(loc="upper right", fontsize=14)plt.xlabel("Epoch", fontsize=14)plt.ylabel("RMSE", fontsize=14)plt.show()

输出：

best_epoch, best_model

输出：

(1374,SGDRegressor(alpha=0.0001, average=False, early_stopping=False, epsilon=0.1,eta0=0.0005, fit_intercept=True, l1_ratio=0.15,learning_rate='constant', loss='squared_loss', max_iter=1,n_iter_no_change=5, penalty=None, power_t=0.25, random_state=42,shuffle=True, tol=-inf, validation_fraction=0.1, verbose=0,warm_start=True))

从上图中可以看出，随着训练的进行，验证集上的误差逐渐减小，但从箭头所指位置之后又慢慢升高，表明此时模型开始过拟合了。因此最好的办法就是在箭头所指的位置就停止模型训练。这种提前停止的方法也被深度学习大牛称为"beautiful free lunch"。

7. 逻辑回归

7.1 估计概率、损失函数、算法原理推导

逻辑回归概念、算法原理推导见：/Jwenxue/article/details/106605346

7.2 决策边界

Iris数据集包括三个类别：Iris setosa, Iris versicolor, and Iris virginica

from sklearn import datasetsiris = datasets.load_iris()list(iris)

输出：

['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename']

print(iris.DESCR)

输出：

.. _iris_dataset:Iris plants dataset--------------------**Data Set Characteristics:**:Number of Instances: 150 (50 in each of three classes):Number of Attributes: 4 numeric, predictive attributes and the class:Attribute Information:- sepal length in cm- sepal width in cm- petal length in cm- petal width in cm- class:- Iris-Setosa- Iris-Versicolour- Iris-Virginica:Summary Statistics:============== ==== ==== ======= ===== ====================Min Max Mean SD Class Correlation============== ==== ==== ======= ===== ====================sepal length: 4.3 7.9 5.84 0.83 0.7826sepal width: 2.0 4.4 3.05 0.43 -0.4194petal length: 1.0 6.9 3.76 1.76 0.9490 (high!)petal width: 0.1 2.5 1.20 0.76 0.9565 (high!)============== ==== ==== ======= ===== ====================:Missing Attribute Values: None:Class Distribution: 33.3% for each of 3 classes.:Creator: R.A. Fisher:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov):Date: July, 1988The famous Iris database, first used by Sir R.A. Fisher. The dataset is takenfrom Fisher's paper. Note that it's the same as in R, but not as in the UCIMachine Learning Repository, which has two wrong data points.This is perhaps the best known database to be found in thepattern recognition literature. Fisher's paper is a classic in the field andis referenced frequently to this day. (See Duda & Hart, for example.) Thedata set contains 3 classes of 50 instances each, where each class refers to atype of iris plant. One class is linearly separable from the other 2; thelatter are NOT linearly separable from each other... topic:: References- Fisher, R.A. "The use of multiple measurements in taxonomic problems"Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions toMathematical Statistics" (John Wiley, NY, 1950).- Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.(Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.- Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New SystemStructure and Classification Rule for Recognition in Partially ExposedEnvironments". IEEE Transactions on Pattern Analysis and MachineIntelligence, Vol. PAMI-2, No. 1, 67-71.- Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactionson Information Theory, May 1972, 431-433.- See also: 1988 MLC Proceedings, 54-64. Cheeseman et al"s AUTOCLASS IIconceptual clustering system finds 3 classes in the data.- Many, many more ...

X = iris["data"][:,3:] # 选择petal width花瓣宽度属性y = (iris["target"]==2).astype(np.int)X.shape, y.shape

输出：

((150, 1), (150,))

使用逻辑回归算法建模：

from sklearn.linear_model import LogisticRegressionlog_reg = LogisticRegression(solver="lbfgs", random_state=42)log_reg.fit(X,y)

输出：

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,intercept_scaling=1, l1_ratio=None, max_iter=100,multi_class='warn', n_jobs=None, penalty='l2',random_state=42, solver='lbfgs', tol=0.0001, verbose=0,warm_start=False)

X_new = np.linspace(0,3,1000).reshape(-1,1)y_proba = log_reg.predict_proba(X_new)plt.plot(X_new, y_proba[:,1],"g-",linewidth=2, label="Iris virgincia")plt.plot(X_new, y_proba[:,0],"b--", linewidth=2, label="Not Iris virginica")

输出：

X_new = np.linspace(0, 3, 1000).reshape(-1, 1)y_proba = log_reg.predict_proba(X_new)decision_boundary = X_new[y_proba[:, 1] >= 0.5][0]plt.figure(figsize=(8, 3))plt.plot(X[y==0], y[y==0], "bs")plt.plot(X[y==1], y[y==1], "g^")plt.plot([decision_boundary, decision_boundary], [-1, 2], "k:", linewidth=2)plt.plot(X_new, y_proba[:, 1], "g-", linewidth=2, label="Iris virginica")plt.plot(X_new, y_proba[:, 0], "b--", linewidth=2, label="Not Iris virginica")plt.text(decision_boundary+0.02, 0.15, "Decision boundary", fontsize=14, color="k", ha="center")plt.arrow(decision_boundary, 0.08, -0.3, 0, head_width=0.05, head_length=0.1, fc='b', ec='b')plt.arrow(decision_boundary, 0.92, 0.3, 0, head_width=0.05, head_length=0.1, fc='g', ec='g')plt.xlabel("Petal width (cm)", fontsize=14)plt.ylabel("Probability", fontsize=14)plt.legend(loc="center left", fontsize=14)plt.axis([0, 3, -0.02, 1.02])plt.show()

输出：

从上图可以看出，Iris virgincia花瓣宽度在1.4至2.5cm范围内，而其它的花瓣宽度在0.1至1.8范围内。对于花瓣宽度大于1.8的可以明确一定是Iris virgincia，小于1.4的可以明确分类为不是Iris virgincia。

但是在1.4至1.8范围就比较难判断了！

log_reg.predict([[1.8],[1.7],[1.6],[1.5],[1.4]])

输出：

array([1, 1, 0, 0, 0])

可以看到，模型将1.6的位置当作分界，即大于1.6的认为是Iris virgincia，小于的不是。

这个1.6就是模型的决策边界。也就是概率为0.5时对应特征的值：

decision_boundary

输出：

array([1.66066066])

from sklearn.linear_model import LogisticRegressionX = iris["data"][:, (2, 3)] # 花瓣长度、花瓣宽度y = (iris["target"] == 2).astype(np.int)log_reg = LogisticRegression(solver="lbfgs", C=10**10, random_state=42) #正则惩罚系数log_reg.fit(X, y)x0, x1 = np.meshgrid(np.linspace(2.9, 7, 500).reshape(-1, 1),np.linspace(0.8, 2.7, 200).reshape(-1, 1),)X_new = np.c_[x0.ravel(), x1.ravel()]y_proba = log_reg.predict_proba(X_new)plt.figure(figsize=(10, 4))plt.plot(X[y==0, 0], X[y==0, 1], "bs")plt.plot(X[y==1, 0], X[y==1, 1], "g^")zz = y_proba[:, 1].reshape(x0.shape)contour = plt.contour(x0, x1, zz, cmap=plt.cm.brg)left_right = np.array([2.9, 7])boundary = -(log_reg.coef_[0][0] * left_right + log_reg.intercept_[0]) / log_reg.coef_[0][1]plt.clabel(contour, inline=1, fontsize=12)plt.plot(left_right, boundary, "k--", linewidth=3)plt.text(3.5, 1.5, "Not Iris virginica", fontsize=14, color="b", ha="center")plt.text(6.5, 2.3, "Iris virginica", fontsize=14, color="g", ha="center")plt.xlabel("Petal length", fontsize=14)plt.ylabel("Petal width", fontsize=14)plt.axis([2.9, 7, 0.8, 2.7])plt.show()

输出：

和其它线性模型类似，逻辑回归也可以使用L1和L2正则化，sklearn默认使用L2正则化，并且需注意正则化惩罚系数不是alpha，而是C。C越大，惩罚力度越小。

8. Softmax回归

逻辑回归可以直接用于多分类的任务，而不用拆分成多个二分类任务。这种回归就叫Softmax回归，或叫多项式逻辑回归。

Softmax回归主要思想是：对线性回归加权求和的结果经过softmax函数处理，输出多分类各个类别的概率值。

Softmax虽然可以实现多分类任务，但是每次只输出一个类别，即概率最大的那个类别，而不是多个输出。

Softmax使用的损失函数是交叉熵损失函数。

X = iris["data"][:,(2,3)]y = iris["target"]softmax_reg = LogisticRegression(multi_class="multinomial",solver="lbfgs",C=10,random_state=42)softmax_reg.fit(X,y)

输出：

LogisticRegression(C=10, class_weight=None, dual=False, fit_intercept=True,intercept_scaling=1, l1_ratio=None, max_iter=100,multi_class='multinomial', n_jobs=None, penalty='l2',random_state=42, solver='lbfgs', tol=0.0001, verbose=0,warm_start=False)

在sklearn中LogisticRegression函数如果接受的输入数据是多个类别是默认会使用OVR（one versus the rest）的模式完成多分类任务。如果想用softmax完成多分类任务，就需要指定参数multi_class="multinomial"，并且注意指定的solver必须支持softmax算法，例如"lbfgs"

softmax_reg.predict([[5,2]])

输出：

array([2])

softmax_reg.predict_proba([[5,2]])

输出：

array([[6.38014896e-07, 5.74929995e-02, 9.42506362e-01]])

可以从输出概率得知，94.2%的概率是Iris virginica

x0, x1 = np.meshgrid(np.linspace(0, 8, 500).reshape(-1, 1),np.linspace(0, 3.5, 200).reshape(-1, 1))X_new = np.c_[x0.ravel(), x1.ravel()]y_proba = softmax_reg.predict_proba(X_new)y_predict = softmax_reg.predict(X_new)zz1 = y_proba[:, 1].reshape(x0.shape)zz = y_predict.reshape(x0.shape)plt.figure(figsize=(10, 5))plt.plot(X[y==2, 0], X[y==2, 1], "g^", label="Iris virginica")plt.plot(X[y==1, 0], X[y==1, 1], "bs", label="Iris versicolor")plt.plot(X[y==0, 0], X[y==0, 1], "yo", label="Iris setosa")from matplotlib.colors import ListedColormapcustom_cmap = ListedColormap(['#fafab0','#9898ff','#a0faa0'])plt.contourf(x0, x1, zz, cmap=custom_cmap)contour = plt.contour(x0, x1, zz1, cmap=plt.cm.brg)plt.clabel(contour, inline=1, fontsize=12)plt.xlabel("Petal length", fontsize=14)plt.ylabel("Petal width", fontsize=14)plt.legend(loc="center left", fontsize=14)plt.axis([0, 7, 0, 3.5])plt.show()

输出：

上图输出结果为决策边界，可以看到每两个类别之间的决策边界都是线性的

如果觉得《机器学习实践—基于Scikit-Learn Keras和TensorFlow2第二版—第4章训练模型》对你有帮助，请点赞、收藏，并留下你的观点哦！

本内容不代表本网观点和政治立场，如有侵犯你的权益请联系我们处理。

网友评论

网友评论仅供其表达个人看法，并不表明网站立场。