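Bookmark This! How to Reduce Dataset Dimensionality with Feature Extraction Techniques

Date: 2020-07-03 05:44:06

Length: 5,320 characters; estimated reading time: 20 minutes

Image source: https://blog.datasciencedojo.c

Introduction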

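Nowadays it is very common to work with datasets that have hundreds (or even thousands) of features. If the number of features is close to, or larger than, the number of observations stored in the dataset, a machine learning model is likely to overfit. To avoid this problem, regularization or dimensionality reduction (feature extraction) techniques are needed. In machine learning, the dimensionality of a dataset is equal to the number of variables used to represent it.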
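To make the overfitting risk concrete, here is a minimal sketch on purely synthetic noise (not the mushroom data). The names X_noise and y_noise, the sample sizes, and the choice of a weakly regularized LogisticRegression are all assumptions made for illustration. With far more features than observations, the model can fit random labels perfectly on the training set while staying near chance on unseen data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_samples, n_features = 40, 500          # many more features than observations

# Pure noise: the labels have no relationship to the features
X_noise = rng.normal(size=(2 * n_samples, n_features))
y_noise = rng.integers(0, 2, size=2 * n_samples)

X_tr, X_te = X_noise[:n_samples], X_noise[n_samples:]
y_tr, y_te = y_noise[:n_samples], y_noise[n_samples:]

# A very large C means almost no regularization
model = LogisticRegression(C=1e6, max_iter=5000).fit(X_tr, y_tr)
train_acc = model.score(X_tr, y_tr)
test_acc = model.score(X_te, y_te)

print("Training accuracy:", train_acc)
print("Test accuracy:", test_acc)
```

A large gap between the two numbers is the signature of overfitting that regularization or dimensionality reduction is meant to reduce.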

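Regularization certainly helps reduce the risk of overfitting, but feature extraction techniques offer further advantages, such as:

· Improved accuracy

· Reduced risk of overfitting

· Faster training

· Better data visualization

· Improved model explainability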

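Feature extraction aims to reduce the number of features in a dataset by creating new features from the existing ones (and then discarding the original features). This new, reduced feature set should be able to summarize most of the information contained in the original feature set. In this way, a condensed version of the original features is created from a combination of the original set.

Feature selection is another commonly used technique for reducing the number of features in a dataset. It differs from feature extraction in that feature selection ranks the importance of the existing features in the dataset and discards the less important ones (no new features are created).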
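The difference between the two approaches can be sketched on synthetic data. This is only an illustration: the data come from make_classification rather than the mushroom dataset, and SelectKBest with f_classif is just one possible feature selection method, chosen here as an assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in data: 200 samples, 10 features
X_demo, y_demo = make_classification(n_samples=200, n_features=10,
                                     n_informative=3, random_state=0)

# Feature selection: keep 3 of the original columns, unchanged
selector = SelectKBest(score_func=f_classif, k=3)
X_selected = selector.fit_transform(X_demo, y_demo)
kept_columns = selector.get_support(indices=True)

# Feature extraction: build 3 new columns as combinations of all 10
X_extracted = PCA(n_components=3).fit_transform(X_demo)

print("Indices of the selected original columns:", kept_columns)
print("Selected shape:", X_selected.shape, "Extracted shape:", X_extracted.shape)
# The selected features are exact copies of original columns
print(np.allclose(X_selected, X_demo[:, kept_columns]))
```

Both results have three columns, but only the selected ones can be traced back to individual original features.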

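This article uses the Kaggle Mushroom Classification Dataset as an example of how to apply feature extraction techniques. The goal is to predict whether a mushroom is poisonous by looking at the given features.

First, import all the required libraries.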

import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
import seaborn as sns
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.ensemble import RandomForestClassifier

The figure below shows the dataset used in this example.

Figure 1: Mushroom Classification dataset

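Before feeding the data into a machine learning model, split it into features (X) and labels (Y) and one-hot encode all the categorical variables.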

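# df is the DataFrame holding the mushroom data (the loading step is not shown)
X = df.drop(["class"], axis=1)
Y = df["class"]
# One-hot encode every categorical feature column
X = pd.get_dummies(X, prefix_sep="_")
# Encode the class labels as integers
Y = LabelEncoder().fit_transform(Y)
# Standardize each column to zero mean and unit variance
X = StandardScaler().fit_transform(X)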
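To see what this preprocessing does, here is a minimal sketch on a hypothetical three-row table. The toy DataFrame and its values are made up for illustration; only the column names follow the style of the mushroom dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical three-row table with single-letter category codes
toy = pd.DataFrame({"class": ["p", "e", "e"],
                    "cap-shape": ["x", "b", "x"],
                    "odor": ["p", "a", "l"]})

# Each category value becomes its own indicator column
toy_X = pd.get_dummies(toy.drop(["class"], axis=1), prefix_sep="_")
# Labels are mapped to integers in sorted order: "e" -> 0, "p" -> 1
toy_Y = LabelEncoder().fit_transform(toy["class"])

print(list(toy_X.columns))
# ['cap-shape_b', 'cap-shape_x', 'odor_a', 'odor_l', 'odor_p']
print(toy_Y)
# [1 0 0]
```

Two categorical columns turn into five indicator columns here, which is why one-hot encoding can increase the dimensionality of a dataset considerably.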

Next, create a function (forest_test) that splits the input data into training and test sets, then trains and tests a Random Forest classifier.

def forest_test(X, Y):
    # Hold out 30% of the data for testing
    X_Train, X_Test, Y_Train, Y_Test = train_test_split(X, Y,
                                                        test_size=0.30,
                                                        random_state=101)
    # Time only the training step
    start = time.process_time()
    trainedforest = RandomForestClassifier(n_estimators=700).fit(X_Train, Y_Train)
    print(time.process_time() - start)
    # Evaluate on the held-out test set
    predictionforest = trainedforest.predict(X_Test)
    print(confusion_matrix(Y_Test, predictionforest))
    print(classification_report(Y_Test, predictionforest))

We can now apply this function to the whole dataset first, and then apply it to each reduced dataset in turn to compare the results.

forest_test(X, Y)


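As shown below, training a Random Forest classifier on the full feature set gives 100% accuracy with a training time of about 2.2 seconds. In each of the following examples, the first line of output shows the training time for reference.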

2.2676709799999992
[[1274    0]
 [   0 1164]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      1274
           1       1.00      1.00      1.00      1164

    accuracy                           1.00      2438
   macro avg       1.00      1.00      1.00      2438
weighted avg       1.00      1.00      1.00      2438

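Feature Extraction

Principal Component Analysis (PCA)

PCA is a commonly used linear dimensionality reduction technique. PCA takes the original data as input and tries to find the combination of input features that best summarizes the distribution of the original data, thereby reducing its dimensionality. It does this by looking at pairwise distances, maximizing variance and minimizing reconstruction error. In PCA, the original data are projected onto a set of orthogonal axes, and the axes are ranked in order of importance.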
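The ideas of orthogonal axes, variance ranking, and reconstruction error can be sketched with plain NumPy. This is an illustrative implementation based on the singular value decomposition, run on synthetic data; it is not the code used elsewhere in this article, and the variable names (latent, mixing, data) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 3-D data that mostly lives in a 2-D plane, plus a little noise
latent = rng.normal(size=(500, 2))
mixing = np.array([[2.0, 0.5, 0.1],
                   [0.0, 1.0, 0.3]])
data = latent @ mixing + 0.05 * rng.normal(size=(500, 3))

# 1. Center the data
mean = data.mean(axis=0)
centered = data - mean
# 2. SVD: the rows of Vt are the orthonormal principal axes
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
# 3. Variance captured along each axis, sorted from largest to smallest
explained_variance = S ** 2 / (len(data) - 1)
# 4. Project onto the first two axes
projected = centered @ Vt[:2].T
# 5. Map back to 3-D and measure what was lost
reconstructed = projected @ Vt[:2] + mean
reconstruction_error = np.mean((data - reconstructed) ** 2)

print("Variance per axis:", explained_variance)
print("Mean squared reconstruction error:", reconstruction_error)
```

Keeping the axes with the largest variance is the same as discarding the axes with the smallest contribution to the reconstruction error, which is why the two objectives are mentioned together.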

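PCA is an unsupervised learning algorithm, so it pays no attention to the data labels and looks only at the variation in the features. In some cases this can lead to data being misclassified.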
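A small synthetic example shows how ignoring the labels can hurt. In this sketch (all names and numbers are assumptions for illustration), the class information sits on a low-variance axis, so a one-component PCA keeps the high-variance axis and throws the useful one away.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n = 300
labels = np.repeat([0, 1], n)
# Axis 0: large variance, no class information
axis0 = rng.normal(scale=10.0, size=2 * n)
# Axis 1: small variance, but the two classes are shifted apart
axis1 = rng.normal(scale=0.3, size=2 * n) + np.where(labels == 1, 1.0, -1.0)
points = np.column_stack([axis0, axis1])

# Keep only the first principal component
pc1 = PCA(n_components=1).fit_transform(points).ravel()

def class_gap(values, labels):
    """Distance between the class means, in units of the overall standard deviation."""
    return abs(values[labels == 1].mean() - values[labels == 0].mean()) / values.std()

print("Class gap along PC1:", class_gap(pc1, labels))
print("Class gap along the discarded axis:", class_gap(axis1, labels))
```

The classes are well separated along the axis PCA drops and barely separated along the one it keeps, which is the situation the paragraph above warns about.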

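In this example, PCA is first applied to the whole dataset to reduce the data to two dimensions, and then a DataFrame is built from the new features and their labels.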

from sklearn.decomposition import PCA

# Project the standardized features onto the first two principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
PCA_df = pd.DataFrame(data=X_pca, columns=["PC1", "PC2"])
# Attach the labels and encode them as integers
PCA_df = pd.concat([PCA_df, df["class"]], axis=1)
PCA_df["class"] = LabelEncoder().fit_transform(PCA_df["class"])
PCA_df.head()

Figure 2: PCA dataset

With the newly created DataFrame, the data distribution can now be plotted in a 2D scatter plot.

figure(num=None, figsize=(8, 8), dpi=80, facecolor="w", edgecolor="k")

classes = [1, 0]
colors = ["r", "b"]
# Draw one scatter series per class (1 = poisonous, 0 = edible)
for clas, color in zip(classes, colors):
    plt.scatter(PCA_df.loc[PCA_df["class"] == clas, "PC1"],
                PCA_df.loc[PCA_df["class"] == clas, "PC2"],
                c=color)

plt.xlabel("Principal Component 1", fontsize=12)
plt.ylabel("Principal Component 2", fontsize=12)
plt.title("2D PCA", fontsize=15)
plt.legend(["Poisonous", "Edible"])
plt.grid()

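Figure 3: 2D PCA visualization

The same procedure can now be repeated, this time reducing the data to three dimensions and using Plotly to create an animation.

With PCA, we can also examine how much of the variance of the original data is retained by using the explained_variance_ratio_ attribute that Scikit-learn exposes on a fitted PCA object. Once the variance ratios have been computed, they can be used to build informative visualizations.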
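As a minimal sketch of how these ratios might be used, the following example fits a full PCA on synthetic data (an assumption, standing in for the mushroom features) and finds how many components are needed to retain 90% of the variance. The 90% threshold is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with 20 standardized features
X_demo, _ = make_classification(n_samples=300, n_features=20,
                                n_informative=5, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

# Fit PCA with all components to see the full variance breakdown
pca_full = PCA().fit(X_demo)
ratios = pca_full.explained_variance_ratio_
cumulative = np.cumsum(ratios)

# Smallest number of components whose cumulative ratio reaches 90%
n_needed = int(np.searchsorted(cumulative, 0.90) + 1)

print("Variance ratio per component:", ratios.round(3))
print("Components needed for 90% of the variance:", n_needed)
```

Plotting cumulative against the component index gives the familiar "elbow" curve that is often used to choose n_components.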

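Running the Random Forest classifier again on the three-feature set constructed by PCA (instead of the whole dataset) gives a classification accuracy of 98%, while the two-feature set gives 95%.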

pca = PCA(n_components=3, svd_solver="full")
X_pca = pca.fit_transform(X)
# Variance captured by each of the three components
print(pca.explained_variance_)
forest_test(X_pca, Y)

[10.31484926  9.42671062  8.35720548]
2.769664902999999
[[1261   13]
 [  41 1123]]
              precision    recall  f1-score   support

           0       0.97      0.99      0.98      1274
           1       0.99      0.96      0.98      1164

    accuracy                           0.98      2438
   macro avg       0.98      0.98      0.98      2438
weighted avg       0.98      0.98      0.98      2438

In addition, with the two-dimensional dataset we can now visualize the decision boundary that the Random Forest uses to classify each data point.

# Recompute a two-component projection: X_pca currently holds three components,
# while the grid built below is two-dimensional
X_pca = PCA(n_components=2).fit_transform(X)

X_Reduced, X_Test_Reduced, Y_Reduced, Y_Test_Reduced = train_test_split(X_pca, Y,
                                                                        test_size=0.30,
                                                                        random_state=101)
trainedforest = RandomForestClassifier(n_estimators=700).fit(X_Reduced, Y_Reduced)

# Build a grid covering the projected data, with a margin of 1 on each side
x_min, x_max = X_Reduced[:, 0].min() - 1, X_Reduced[:, 0].max() + 1
y_min, y_max = X_Reduced[:, 1].min() - 1, X_Reduced[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1), np.arange(y_min, y_max, 0.1))

# Predict a class for every grid point and shade the regions accordingly
Z = trainedforest.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, alpha=0.4)
plt.scatter(X_Reduced[:, 0], X_Reduced[:, 1], c=Y_Reduced, s=20, edgecolor="k")
plt.xlabel("Principal Component 1", fontsize=12)
plt.ylabel("Principal Component 2", fontsize=12)
plt.title("Random Forest", fontsize=15)
plt.show()

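Figure 4: PCA Random Forest decision boundary

Independent Component Analysis (ICA)

ICA is a linear dimensionality reduction method that takes a mixture of independent components as input and aims to correctly identify each of them (removing all unnecessary noise). Two input features can be considered independent if both their linear and nonlinear dependence are equal to zero [1].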
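The distinction between linear and nonlinear dependence can be illustrated with a short NumPy sketch on synthetic data (the variables x and y are assumptions for illustration). Here y is fully determined by x, yet the ordinary linear correlation between them is close to zero.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100_000)
y = x ** 2   # fully determined by x, so clearly not independent of it

# Linear correlation is near zero because the relationship is symmetric around 0
linear_corr = np.corrcoef(x, y)[0, 1]
# Correlating y with a nonlinear function of x reveals the dependence
nonlinear_corr = np.corrcoef(x ** 2, y)[0, 1]

print("Linear correlation:", round(linear_corr, 3))
print("Correlation after a nonlinear transform:", round(nonlinear_corr, 3))
```

This is why being uncorrelated (which is what PCA achieves) is a weaker property than being independent (which is what ICA looks for).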

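ICA is widely used in medical applications such as EEG and MRI analysis, where it is commonly used to separate useful signals from unwanted ones.

A simple example of an ICA application: in an audio recording, two people are talking. ICA can separate the two independent components in the audio (that is, the two different voices). In this way, ICA can identify the different speakers in the conversation.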
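The two-speaker scenario can be sketched with synthetic signals. In the example below, a sine wave and a square wave stand in for the two voices, and the mixing matrix plays the role of two microphones; all of these are assumptions made for illustration, not real audio.

```python
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
speaker_1 = np.sin(2 * t)               # stand-in for the first voice
speaker_2 = np.sign(np.sin(3 * t))      # stand-in for the second voice
sources = np.column_stack([speaker_1, speaker_2])

# Each "microphone" records a different mixture of the two sources
mixing = np.array([[1.0, 0.5],
                   [0.5, 1.0]])
recordings = sources @ mixing.T

ica = FastICA(n_components=2, random_state=0)
recovered = ica.fit_transform(recordings)

# ICA recovers sources only up to order, sign, and scale,
# so compare them through absolute correlations
corr = np.abs(np.corrcoef(sources.T, recovered.T)[:2, 2:])
print(corr.round(2))
```

Each row of the printed matrix corresponds to one true source; a value near 1 in a row indicates which recovered component matches it.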

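Now ICA can be used to reduce the dataset to three dimensions again, test its accuracy with a Random Forest classifier, and plot the results in a 3D graph.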

from sklearn.decomposition import FastICA

# Extract three independent components from the standardized features
ica = FastICA(n_components=3)
X_ica = ica.fit_transform(X)
forest_test(X_ica, Y)

2.8933812039999793
[[1263   11]
 [  44 1120]]
              precision    recall  f1-score   support

           0       0.97      0.99      0.98      1274
           1       0.99      0.96      0.98      1164

    accuracy                           0.98      2438
   macro avg       0.98      0.98      0.98      2438
weighted avg       0.98      0.98      0.98      2438

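The animation below shows that, although PCA and ICA achieve the same accuracy, the three-dimensional distributions they produce are different.

Linear Discriminant Analysis (LDA)

LDA is a supervised dimensionality reduction technique and also a machine learning classifier.

LDA aims to maximize the distance between classes and minimize the distance within each class, so it uses within-class and between-class distances as its measures. Projecting the data into a lower-dimensional space in a way that maximizes the distance between classes leads to better classification results (less overlap between classes), which makes LDA a good choice in such cases.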
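For two classes, this trade-off between between-class and within-class distance can be sketched with Fisher's criterion in NumPy. The data below are synthetic Gaussians, and the helper fisher_ratio is a hypothetical name introduced only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two synthetic Gaussian classes that share one covariance matrix
cov = np.array([[3.0, 2.0],
                [2.0, 3.0]])
class_0 = rng.multivariate_normal([0.0, 0.0], cov, size=500)
class_1 = rng.multivariate_normal([2.0, 0.0], cov, size=500)

def fisher_ratio(direction):
    """Squared distance between projected class means over the within-class spread."""
    p0, p1 = class_0 @ direction, class_1 @ direction
    return (p1.mean() - p0.mean()) ** 2 / (p0.var() + p1.var())

# Fisher's direction is proportional to Sw^{-1} (mu_1 - mu_0),
# where Sw is the within-class scatter
Sw = np.cov(class_0, rowvar=False) + np.cov(class_1, rowvar=False)
w = np.linalg.solve(Sw, class_1.mean(axis=0) - class_0.mean(axis=0))
w /= np.linalg.norm(w)

# For comparison: the direction that simply joins the true class means
naive = np.array([1.0, 0.0])

print("Fisher ratio along the LDA direction:", fisher_ratio(w))
print("Fisher ratio along the mean-difference direction:", fisher_ratio(naive))
```

The LDA direction scores higher because it also accounts for how the points spread inside each class, not only for where the class centers are.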

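LDA assumes that the input data follow a Gaussian distribution (as in this example), so applying LDA to non-Gaussian data may lead to poor classification results.

In this example, LDA is run to reduce the dataset to one dimension, and then the accuracy is tested and the results are plotted.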

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis(n_components=1)
# Run an LDA and use it to transform the features
X_lda = lda.fit(X, Y).transform(X)
print("Original number of features:", X.shape[1])
print("Reduced number of features:", X_lda.shape[1])

Original number of features: 117

Reduced number of features: 1

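Since the data in this example follow a Gaussian distribution, LDA achieves very good results: tested with the Random Forest classifier, accuracy reaches 100%.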

forest_test(X_lda, Y)


1.2756952610000099
[[1274    0]
 [   0 1164]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      1274
           1       1.00      1.00      1.00      1164

    accuracy                           1.00      2438
   macro avg       1.00      1.00      1.00      2438
weighted avg       1.00      1.00      1.00      2438

# Use LDA itself as the classifier on the one-dimensional projection
X_Reduced, X_Test_Reduced, Y_Reduced, Y_Test_Reduced = train_test_split(X_lda, Y,
                                                                        test_size=0.30,
                                                                        random_state=101)
start = time.process_time()
lda = LinearDiscriminantAnalysis().fit(X_Reduced, Y_Reduced)
print(time.process_time() - start)
predictionlda = lda.predict(X_Test_Reduced)
print(confusion_matrix(Y_Test_Reduced, predictionlda))
print(classification_report(Y_Test_Reduced, predictionlda))

0.008464782999993758
[[1274    0]
 [   2 1162]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      1274
           1       1.00      1.00      1.00      1164

    accuracy                           1.00      2438
   macro avg       1.00      1.00      1.00      2438
weighted avg       1.00      1.00      1.00      2438

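Finally, we can see what the distributions of the two classes look like by plotting the one-dimensional data distribution.

Figure 5: LDA class separation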
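The plotting code for this figure is not shown in the text. One possible way to draw such a plot is sketched below with Matplotlib histograms. The function plot_class_distributions is a hypothetical helper, and the demo values are synthetic stand-ins for the one-dimensional LDA projection and its labels.

```python
import matplotlib.pyplot as plt
import numpy as np

def plot_class_distributions(values, labels, names=("Edible", "Poisonous")):
    """Overlay one normalized histogram per class for a 1-D projection."""
    fig, ax = plt.subplots(figsize=(8, 4))
    for class_id, name in enumerate(names):
        ax.hist(values[labels == class_id], bins=40, alpha=0.6,
                density=True, label=name)
    ax.set_xlabel("LDA component")
    ax.set_ylabel("Density")
    ax.legend()
    return ax

# Synthetic stand-ins for the projected values and the encoded labels
rng = np.random.default_rng(0)
demo_labels = np.repeat([0, 1], 500)
demo_values = np.concatenate([rng.normal(-3, 1, 500), rng.normal(3, 1, 500)])

ax = plot_class_distributions(demo_values, demo_labels)
```

With the variables from this article, the call would presumably be plot_class_distributions(X_lda[:, 0], Y), since the label encoding maps edible to 0 and poisonous to 1.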

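Locally Linear Embedding (LLE)

So far we have discussed methods such as PCA and LDA, which work well when the relationships between features are linear. We now turn to the nonlinear case.

LLE is a dimensionality reduction technique based on manifold learning. A manifold is a D-dimensional object embedded in a higher-dimensional space. Manifold learning aims to represent that object in its original D dimensions rather than in an unnecessarily larger space.

A typical example used to explain manifold learning in machine learning is the Swiss Roll manifold (Figure 6). The input is data distributed like a roll (in three-dimensional space), which is then unrolled so that the data are compressed into two-dimensional space.

Examples of manifold learning algorithms include Isomap, LLE, Modified Locally Linear Embedding, and Hessian Eigenmapping.

Figure 6: Manifold learning [2]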
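The Swiss Roll example can be reproduced with Scikit-learn's built-in generator. The sketch below is illustrative only: the sample count and the n_neighbors value are assumptions, and a good embedding usually requires tuning n_neighbors.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

# 3-D points lying on a rolled-up 2-D sheet; "position" is the location along the roll
roll, position = make_swiss_roll(n_samples=1000, random_state=0)

# Unroll the sheet into two dimensions
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, random_state=0)
unrolled = lle.fit_transform(roll)

print(roll.shape, "->", unrolled.shape)
```

Coloring a scatter plot of unrolled by position is a common way to check whether neighboring points on the roll stay neighbors after unrolling.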

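We will now walk through how to use LLE in this example. According to the Scikit-learn documentation [3]:

Locally linear embedding (LLE) seeks a lower-dimensional projection of the data which preserves distances within local neighborhoods. It can be thought of as a series of local Principal Component Analyses which are globally compared to find the best non-linear embedding.

LLE can now be run on the dataset to reduce the data to three dimensions, test the accuracy, and plot the results.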

from sklearn.manifold import LocallyLinearEmbedding

# Embed the standardized features into three dimensions
embedding = LocallyLinearEmbedding(n_components=3)
X_lle = embedding.fit_transform(X)
forest_test(X_lle, Y)

2.578125
[[1273    0]
 [1143   22]]
              precision    recall  f1-score   support

           0       0.53      1.00      0.69      1273
           1       1.00      0.02      0.04      1165

   micro avg       0.53      0.53      0.53      2438
   macro avg       0.76      0.51      0.36      2438
weighted avg       0.75      0.53      0.38      2438

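t-Distributed Stochastic Neighbor Embedding (t-SNE)

t-SNE is a nonlinear dimensionality reduction technique typically used to visualize high-dimensional data. Its main applications include natural language processing (NLP) and speech processing.

t-SNE works by minimizing the divergence between a distribution built from the pairwise probability similarities of the input features in the original high-dimensional space and the equivalent distribution in the reduced low-dimensional space. It uses the Kullback-Leibler (KL) divergence to measure the difference between the two distributions, and then minimizes the KL divergence using gradient descent.

In t-SNE, the high-dimensional space is modeled with a Gaussian distribution, while the low-dimensional space is modeled with a Student's t-distribution. This is done to avoid the imbalance in the distribution of distances between neighboring points caused by the transformation into a lower-dimensional space.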
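The quantity that t-SNE minimizes can be sketched in NumPy. This is a deliberately simplified illustration: real t-SNE picks a separate Gaussian bandwidth for every point from the perplexity setting and symmetrizes conditional probabilities, whereas this sketch assumes one fixed sigma. All function names here are hypothetical helpers.

```python
import numpy as np

def pairwise_sq_dists(points):
    """Squared Euclidean distance between every pair of rows."""
    diff = points[:, None, :] - points[None, :, :]
    return (diff ** 2).sum(axis=-1)

def gaussian_affinities(points, sigma=1.0):
    """Pairwise similarities P from a Gaussian kernel (fixed sigma for simplicity)."""
    weights = np.exp(-pairwise_sq_dists(points) / (2 * sigma ** 2))
    np.fill_diagonal(weights, 0.0)
    return weights / weights.sum()

def student_t_affinities(points):
    """Pairwise similarities Q from a Student's t kernel with one degree of freedom."""
    weights = 1.0 / (1.0 + pairwise_sq_dists(points))
    np.fill_diagonal(weights, 0.0)
    return weights / weights.sum()

def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q) summed over all pairs with nonzero P."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps))))

rng = np.random.default_rng(0)
high_dim = rng.normal(size=(50, 10))       # synthetic high-dimensional points
P = gaussian_affinities(high_dim, sigma=3.0)

# A random 2-D layout: the starting point that gradient descent would improve
random_embedding = rng.normal(size=(50, 2))
Q = student_t_affinities(random_embedding)

kl_random = kl_divergence(P, Q)
print("KL divergence for a random embedding:", kl_random)
```

Gradient descent in t-SNE moves the low-dimensional points so that this number decreases, which pulls similar points together and pushes dissimilar ones apart.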

We are now ready to use t-SNE and reduce the dataset to three dimensions.

from sklearn.manifold import TSNE

start = time.process_time()
# Note: newer scikit-learn releases rename the n_iter parameter to max_iter
tsne = TSNE(n_components=3, verbose=1, perplexity=40, n_iter=300)
X_tsne = tsne.fit_transform(X)
print(time.process_time() - start)

[t-SNE] Computing 121 nearest neighbors...
[t-SNE] Indexed 8124 samples in 0.139s...
[t-SNE] Computed neighbors for 8124 samples in 11.891s...