
Python Data Analysis and Machine Learning (NumPy, Pandas, Matplotlib)

Posted: 2021-01-07 20:41:06

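How should you study machine learning?

Machine learning combines mathematical derivation with practical technique, so you need to understand both how an algorithm is derived and how to apply it. Deep learning extends the neural-network branch of machine learning and is used most heavily in computer vision and natural language processing. Take your own notes from the very beginning.

How do you get hands-on practice, and where do you find examples?

The best resources are GitHub and Kaggle. Accumulating worked examples pays off, because in practice a project is rarely written from scratch. Learn to imitate first, then create.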

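The scientific computing library NumPy

NumPy (originally "Numerical Python extensions") is a third-party Python package for scientific computing. Its predecessor was an array-computation library whose development began in 1995. After a long period of development it has become the foundation of most scientific computing in Python, including the deep learning frameworks that provide a Python interface.

The numpy.genfromtxt function

Loads data from a text file and handles missing values in the way you specify.

delimiter: the string used to separate values. It can be a str, an int, or a sequence. By default, any run of consecutive whitespace acts as the delimiter.

dtype: the data type of the resulting array. If None, the dtype of each column is determined individually from its contents.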

import numpy
world_alcohol = numpy.genfromtxt("world_alcohol.txt", delimiter=",", dtype=str)
print(type(world_alcohol))
print(world_alcohol)
help(numpy.genfromtxt)  # use help() to read the documentation whenever you need the usage of numpy.genfromtxt

Output:

<class 'numpy.ndarray'>  # every NumPy array is an ndarray
[['Year' 'WHO region' 'Country' 'Beverage Types' 'Display Value']
 ['1986' 'Western Pacific' 'Viet Nam' 'Wine' '0']
 ['1986' 'Americas' 'Uruguay' 'Other' '0.5']
 ...,
 ['1987' 'Africa' 'Malawi' 'Other' '0.75']
 ['1989' 'Americas' 'Bahamas' 'Wine' '1.5']
 ['1985' 'Africa' 'Malawi' 'Spirits' '0.31']]
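The delimiter and dtype parameters described above can be tried without a file on disk. The following is a minimal sketch that assumes a made-up three-line CSV (the column names and values are hypothetical, not taken from world_alcohol.txt) and feeds it to genfromtxt through io.StringIO.

```python
import io
import numpy as np

# Hypothetical three-line CSV standing in for a file such as world_alcohol.txt.
csv_text = "Year,Country,Value\n1986,Uruguay,0.5\n1987,Malawi,\n"

# dtype=str reads every field as text, including the header row.
as_text = np.genfromtxt(io.StringIO(csv_text), delimiter=",", dtype=str)
print(as_text.shape)   # (3, 3)

# With the default float dtype, skip_header drops the header line and
# usecols selects one column; the empty field in the last row becomes nan.
values = np.genfromtxt(io.StringIO(csv_text), delimiter=",",
                       skip_header=1, usecols=2)
print(values)
```

Reading the numeric column separately is one way to get real floats instead of strings; missing entries then show up as nan and can be handled explicitly.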

numpy.array

Creates a vector or a matrix (a multidimensional array).

import numpy as np
a = [1, 2, 4, 3]   # vector
b = np.array(a)    # array([1, 2, 4, 3])
type(b)            # <class 'numpy.ndarray'>

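Operations on array elements, part 1

b.shape     # (4,)  (rows, columns) for a matrix, or the element count for a vector
b.argmax()  # 2     index of the maximum value
b.max()     # 4     maximum
b.min()     # 1     minimum
b.mean()    # 2.5   mean

NumPy requires every element of a numpy.array to have the same data type. The dtype attribute reports that type.

>>> a = [1,2,3,5]
>>> b = np.array(a)
>>> b.dtype
dtype('int64')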

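Operations on array elements, part 2

c = [[1, 2], [3, 4]]   # two-dimensional list
d = np.array(c)        # two-dimensional NumPy array
d.shape                # (2, 2)
d[1,1]                 # 4, matrix-style access by row and column
d.size                 # 4, number of elements in the array
d.max(axis=0)          # maximum along axis 0, the first dimension (one value per column): array([3, 4])
d.max(axis=1)          # maximum along axis 1, the last dimension here (one value per row): array([2, 4])
d.mean(axis=0)         # mean along axis 0, the first dimension: array([2., 3.])
d.flatten()            # flatten a NumPy array to one dimension: array([1, 2, 3, 4])
np.ravel(c)            # flatten any array-like structure to one dimension: array([1, 2, 3, 4])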
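The axis argument is easy to misread, so here is a small sketch on the same 2x2 array. The keepdims option is an addition for illustration and does not appear in the original notes.

```python
import numpy as np

d = np.array([[1, 2], [3, 4]])

# Reducing along axis 0 collapses the rows: one result per column.
col_max = d.max(axis=0)
# Reducing along axis 1 collapses the columns: one result per row.
row_max = d.max(axis=1)
# keepdims=True keeps the reduced axis with length 1, which helps broadcasting.
row_sum = d.sum(axis=1, keepdims=True)

print(col_max)        # [3 4]
print(row_max)        # [2 4]
print(row_sum.shape)  # (2, 1)
print(d / row_sum)    # each row now sums to 1
```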

Operations on array elements, part 3

import numpy as np
matrix = np.array([[5,10,15],[20,25,30],[35,40,45]])
print(matrix.sum(axis=1))  # axis=1: sum across each row

Output:

[ 30  75 120]

import numpy as np
matrix = np.array([
    [5,10,15],
    [20,25,30],
    [35,40,45]
])
print(matrix.sum(axis=0))  # axis=0: sum down each column

Output:

[60 75 90]

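Slicing also works on matrices

import numpy as np
vector = [1, 2, 4, 3]
print(vector[0:3])     # [1, 2, 4]: every element with index >= 0 and < 3
matrix = np.array([[5,10,15],[20,25,30],[35,40,45]])
print(matrix[:,1])     # [10 25 40]: every row of the column at index 1 (the second column)
print(matrix[:,0:2])   # every row of the first and second columns
# [[ 5 10]
#  [20 25]
#  [35 40]]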

A comparison on an array is applied to every element of the array

import numpy as np
matrix = np.array([[5,10,15],[20,25,30],[35,40,45]])
print(matrix == 25)

Output:

[[False False False]
 [False  True False]
 [False False False]]

second_column_25 = matrix[:,1] == 25
print(second_column_25)
print(matrix[second_column_25,:])  # a boolean array can also be used as an index

Output:

[False  True False]
[[20 25 30]]

Element-wise AND and OR on arrays

import numpy as np
vector = np.array([5,10,15,20])
equal_to_ten_and_five = (vector == 10) & (vector == 5)
print(equal_to_ten_and_five)

Output:

[False False False False]

import numpy as np
vector = np.array([5,10,15,20])
equal_to_ten_or_five = (vector == 10) | (vector == 5)
print(equal_to_ten_or_five)
vector[equal_to_ten_or_five] = 50  # with a boolean index, the positions that are True are selected
print(vector)

Output:

[ True  True False False]
[50 50 15 20]
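As a complement to the AND/OR examples, the sketch below combines conditions into a range test and shows two NumPy helpers, np.isin and np.where, that the original notes do not use. It is an illustration of boolean masks, not a replacement for the code above.

```python
import numpy as np

vector = np.array([5, 10, 15, 20])

# Each comparison yields a boolean array. The parentheses are required
# because & and | bind more tightly than comparison operators in Python.
in_range = (vector >= 10) & (vector < 20)
print(in_range)          # [False  True  True False]
print(vector[in_range])  # [10 15]

# np.isin is an alternative to chaining == with |.
five_or_ten = np.isin(vector, [5, 10])

# np.where builds a new array instead of assigning in place.
replaced = np.where(five_or_ten, 50, vector)
print(replaced)          # [50 50 15 20]
print(vector)            # [ 5 10 15 20], unchanged
```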

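Converting the element type of an array

import numpy as np
vector = np.array(['5', '10', '15', '20'])  # numeric strings; non-numeric strings such as 'lucy' would raise a ValueError
vector = vector.astype(float)  # astype converts the value type of the whole vector
print(vector.dtype)
print(vector)

Output:

float64
[ 5. 10. 15. 20.]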

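Common NumPy functions

The reshape method changes the dimensions of a matrix

import numpy as np
print(np.arange(15))
a = np.arange(15).reshape(3,5)  # turn the vector into a matrix with 3 rows and 5 columns
print(a)
print(a.shape)  # the shape attribute gives (rows, columns)

Output:

[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14]
[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]]
(3, 5)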
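A short sketch of how reshape behaves when one dimension is left for NumPy to infer with -1, and what happens when the element count does not fit. The array size of 12 is chosen only for illustration.

```python
import numpy as np

a = np.arange(12)

rows_fixed = a.reshape(3, -1)   # 3 rows, NumPy infers 4 columns
cols_fixed = a.reshape(-1, 6)   # 6 columns, NumPy infers 2 rows
print(rows_fixed.shape)         # (3, 4)
print(cols_fixed.shape)         # (2, 6)

# The total number of elements must stay the same, otherwise reshape fails.
try:
    a.reshape(5, -1)            # 12 is not divisible by 5
except ValueError as err:
    print("ValueError:", err)
```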

Initializing a matrix with zeros or ones

>>> import numpy as np
>>> np.zeros((3,4))  # a 3-row, 4-column matrix filled with 0

Output:

array([[ 0., 0., 0., 0.],
       [ 0., 0., 0., 0.],
       [ 0., 0., 0., 0.]])

>>> import numpy as np
>>> np.ones((3,4),dtype=np.int32)  # specify an integer type

Output:

array([[1, 1, 1, 1],
       [1, 1, 1, 1],
       [1, 1, 1, 1]], dtype=int32)

Building sequences

np.arange( 10, 30, 5 )  # start at 10, stop before 30, step 5

Output:

array([10, 15, 20, 25])

np.arange( 0, 2, 0.3 )

Output:

array([ 0. , 0.3, 0.6, 0.9, 1.2, 1.5, 1.8])

The random module

np.random.random((2,3))  # the random function of the random module: a 2-row, 3-column matrix of random values in [0, 1)

Output:

array([[ 0.40130659, 0.45452825, 0.79776512],
       [ 0.63220592, 0.74591134, 0.64130737]])

The linspace function returns x evenly spaced values from the start value to the end value, both included

import numpy as np
from numpy import pi
np.linspace( 0, 2*pi, 100 )

Output:

array([ 0.        , 0.06346652, 0.12693304, 0.19039955, 0.25386607,
        0.31733259, 0.38079911, 0.44426563, 0.50773215, 0.57119866,
        0.63466518, 0.6981317 , 0.76159822, 0.82506474, 0.88853126,
        0.95199777, 1.01546429, 1.07893081, 1.14239733, 1.20586385,
        1.26933037, 1.33279688, 1.3962634 , 1.45972992, 1.52319644,
        1.58666296, 1.65012947, 1.71359599, 1.77706251, 1.84052903,
        1.90399555, 1.96746207, 2.03092858, 2.0943951 , 2.15786162,
        2.22132814, 2.28479466, 2.34826118, 2.41172769, 2.47519421,
        2.53866073, 2.60212725, 2.66559377, 2.72906028, 2.7925268 ,
        2.85599332, 2.91945984, 2.98292636, 3.04639288, 3.10985939,
        3.17332591, 3.23679243, 3.30025895, 3.36372547, 3.42719199,
        3.4906585 , 3.55412502, 3.61759154, 3.68105806, 3.74452458,
        3.8079911 , 3.87145761, 3.93492413, 3.99839065, 4.06185717,
        4.12532369, 4.1887902 , 4.25225672, 4.31572324, 4.37918976,
        4.44265628, 4.5061228 , 4.56958931, 4.63305583, 4.69652235,
        4.75998887, 4.82345539, 4.88692191, 4.95038842, 5.01385494,
        5.07732146, 5.14078798, 5.2042545 , 5.26772102, 5.33118753,
        5.39465405, 5.45812057, 5.52158709, 5.58505361, 5.64852012,
        5.71198664, 5.77545316, 5.83891968, 5.9023862 , 5.96585272,
        6.02931923, 6.09278575, 6.15625227, 6.21971879, 6.28318531])
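The difference between linspace (number of points, stop included) and arange (step size, stop excluded) is a frequent source of off-by-one errors. The sketch below contrasts the two on a small interval; the endpoint and retstep options are shown for illustration and are not used in the original notes.

```python
import numpy as np

# linspace takes the number of points and includes the stop value by default.
five_points = np.linspace(0, 1, 5)
print(five_points)          # [0.   0.25 0.5  0.75 1.  ]

# endpoint=False drops the stop value, which changes the spacing.
no_end = np.linspace(0, 1, 5, endpoint=False)

# arange takes a step and excludes the stop value.
stepped = np.arange(0, 1, 0.25)
print(stepped)              # [0.   0.25 0.5  0.75]

# retstep=True also returns the spacing that linspace used.
samples, step = np.linspace(0, 2 * np.pi, 100, retstep=True)
print(len(samples), step)
```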

Arithmetic on arrays is applied element by element to the whole array

import numpy as np
a = np.array( [20,30,40,50] )
b = np.arange( 4 )  # [0 1 2 3]
c = a-b
print(c)     # [20 29 38 47]
print(b**2)  # [0 1 4 9]
print(a<35)  # [ True  True False False]

Matrix multiplication

A = np.array( [[1,1],[0,1]] )
B = np.array( [[2,0],[3,4]] )
print(A.dot(B))      # matrix product, first way
print(np.dot(A, B))  # matrix product, second way

Output:

[[5 4]
 [3 4]]
[[5 4]
 [3 4]]
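Because the previous section showed that arithmetic operators work element by element, it is worth seeing `*` next to a true matrix product. This sketch reuses the same A and B and adds the `@` operator, which the original notes do not mention.

```python
import numpy as np

A = np.array([[1, 1], [0, 1]])
B = np.array([[2, 0], [3, 4]])

elementwise = A * B   # multiplies matching entries
product = A @ B       # matrix product, same result as A.dot(B)

print(elementwise)    # [[2 0]
                      #  [0 4]]
print(product)        # [[5 4]
                      #  [3 4]]
```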

Exponential with base e and square root

import numpy as np
B = np.arange(3)
print(np.exp(B))   # [1.         2.71828183 7.3890561 ]  e raised to each element of B
print(np.sqrt(B))  # [0.         1.         1.41421356]

floor rounds down

import numpy as np
a = np.floor(10*np.random.random((3,4)))  # floor rounds each value down
print(a)
print(a.ravel())   # flatten the matrix into a single row
a.shape = (6, 2)   # a = a.reshape(6, -1) gives the same shape: -1 lets NumPy infer the column count from the row count
print(a)
print(a.T)         # transpose of a (rows and columns swapped)

Output:

[[ 8. 7. 2. 1.]
 [ 5. 2. 5. 1.]
 [ 8. 7. 7. 2.]]
[ 8. 7. 2. 1. 5. 2. 5. 1. 8. 7. 7. 2.]
[[ 8. 7.]
 [ 2. 1.]
 [ 5. 2.]
 [ 5. 1.]
 [ 8. 7.]
 [ 7. 2.]]
[[ 8. 2. 5. 5. 8. 7.]
 [ 7. 1. 2. 1. 7. 2.]]

hstack and vstack concatenate matrices (often used to join data)

a = np.floor(10*np.random.random((2,2)))
b = np.floor(10*np.random.random((2,2)))
print(a)
print(b)
print(np.hstack((a,b)))  # join side by side
print(np.vstack((a,b)))  # join one on top of the other

Output:

[[ 8. 6.]
 [ 7. 6.]]
[[ 3. 4.]
 [ 8. 1.]]
[[ 8. 6. 3. 4.]
 [ 7. 6. 8. 1.]]
[[ 8. 6.]
 [ 7. 6.]
 [ 3. 4.]
 [ 8. 1.]]

hsplit and vsplit split matrices

a = np.floor(10*np.random.random((2,12)))
print(a)
print(np.hsplit(a,3))      # split along the columns into 3 equal parts
print(np.hsplit(a,(3,4)))  # split at given positions: before column index 3 and before column index 4

Output:

[[ 7. 1. 4. 9. 8. 8. 5. 9. 6. 6. 9. 4.]
 [ 1. 9. 1. 2. 9. 9. 5. 0. 5. 4. 9. 6.]]
[array([[ 7., 1., 4., 9.],
       [ 1., 9., 1., 2.]]), array([[ 8., 8., 5., 9.],
       [ 9., 9., 5., 0.]]), array([[ 6., 6., 9., 4.],
       [ 5., 4., 9., 6.]])]
[array([[ 7., 1., 4.],
       [ 1., 9., 1.]]), array([[ 9.],
       [ 2.]]), array([[ 8., 8., 5., 9., 6., 6., 9., 4.],
       [ 9., 9., 5., 0., 5., 4., 9., 6.]])]

a = np.floor(10*np.random.random((12,2)))
print(a)
print(np.vsplit(a,3))  # split along the rows into 3 equal parts

Output:

[[ 6. 4.]
 [ 0. 1.]
 [ 9. 0.]
 [ 0. 0.]
 [ 0. 4.]
 [ 1. 1.]
 [ 0. 4.]
 [ 1. 6.]
 [ 9. 7.]
 [ 0. 9.]
 [ 6. 1.]
 [ 3. 0.]]
[array([[ 6., 4.],
       [ 0., 1.],
       [ 9., 0.],
       [ 0., 0.]]), array([[ 0., 4.],
       [ 1., 1.],
       [ 0., 4.],
       [ 1., 6.]]), array([[ 9., 7.],
       [ 0., 9.],
       [ 6., 1.],
       [ 3., 0.]])]

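Assigning one array to another name does not copy it: both names refer to the same memory, so an operation through one of them affects the other

a = np.arange(12)
b = a            # a and b are two names for the same array object
print(b is a)
b.shape = 3,4
print(a.shape)
print(id(a))     # id identifies the object in memory; equal ids mean a and b refer to the same data
print(id(b))

Output:

True
(3, 4)
4382560048
4382560048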

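The view method creates a new array object that shares its element data with the original

import numpy as np
a = np.arange(12)
c = a.view()
print(id(a))     # the ids differ
print(id(c))
print(c is a)
c.shape = 2,6
print(a.shape)   # changing the shape of c leaves the shape of a unchanged
c[0,4] = 1234    # change an element through c
print(a)         # the same element of a changes too

Output:

4382897216
4382897136
False
(12,)
[   0    1    2    3 1234    5    6    7    8    9   10   11]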

The copy method (deep copy) creates a complete copy of the array and its element values

d = a.copy()
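The three cases above (plain assignment, view, copy) can be compared side by side. This sketch uses np.shares_memory, which is not part of the original notes, to check whether two arrays use the same data buffer; the variable names are chosen for illustration.

```python
import numpy as np

a = np.arange(12)
alias = a              # same object under a second name
view = a.view()        # new object, same data buffer
copied = a.copy()      # new object, new data buffer

view[0] = 100          # visible through a, because the buffer is shared
copied[1] = 200        # not visible through a

print(alias is a)                     # True
print(view is a)                      # False
print(np.shares_memory(a, view))      # True
print(np.shares_memory(a, copied))    # False
print(a[:2])                          # [100   1]
```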

Finding the maximum values, and their indices, along the rows or columns of a matrix

import numpy as np
data = np.sin(np.arange(20)).reshape(5,4)
print(data)
ind = data.argmax(axis=0)  # row index of the maximum in each column
print(ind)
data_max = data[ind, range(data.shape[1])]  # pick the values by (row, column) index pairs
print(data_max)

Output:

[[ 0.          0.84147098  0.90929743  0.14112001]
 [-0.7568025  -0.95892427 -0.2794155   0.6569866 ]
 [ 0.98935825  0.41211849 -0.54402111 -0.99999021]
 [-0.53657292  0.42016704  0.99060736  0.65028784]
 [-0.28790332 -0.96139749 -0.75098725  0.14987721]]
[2 0 3 1]
[ 0.98935825  0.84147098  0.99060736  0.6569866 ]

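The tile method repeats the original array along rows and columns

import numpy as np
a = np.arange(0, 40, 10)
b = np.tile(a, (2, 3))  # repeat 2 times along the rows and 3 times along the columns
print(b)

Output:

[[ 0 10 20 30  0 10 20 30  0 10 20 30]
 [ 0 10 20 30  0 10 20 30  0 10 20 30]]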

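Two ways to sort

sort orders the values themselves. argsort returns the indices that would put the elements in ascending order, and indexing with them yields the sorted result.

a = np.array([[4, 3, 5], [1, 2, 1]])
b = np.sort(a, axis=1)  # sort each row of a in ascending order and assign the result to b
print(b)
a.sort(axis=1)          # sort each row of a in place
print(a)
a = np.array([4, 3, 1, 2])
j = np.argsort(a)       # indices of the elements in ascending order
print(j)
print(a[j])             # output a in the order given by the indices

Output:

[[3 4 5]
 [1 1 2]]
[[3 4 5]
 [1 1 2]]
[2 3 1 0]
[1 2 3 4]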
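Two common follow-ups to argsort are sorting in descending order and reordering whole rows of a table by one column. The sketch below shows one way to do each; the small `table` array is made up for illustration.

```python
import numpy as np

scores = np.array([4, 3, 1, 2])

# argsort returns ascending order; reversing it gives descending order.
descending = np.argsort(scores)[::-1]
print(descending)           # [0 1 3 2]
print(scores[descending])   # [4 3 2 1]

# Reordering whole rows of a 2-D array by the values in its first column.
table = np.array([[3, 30], [1, 10], [2, 20]])
by_first_col = table[np.argsort(table[:, 0])]
print(by_first_col)
```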

The data analysis library Pandas, built on NumPy

The read_csv function reads a CSV file

import pandas as pd
food_info = pd.read_csv("food_info.csv")
print(type(food_info))   # the DataFrame, pandas' central structure, can be treated as a matrix-like table
print(food_info.dtypes)  # dtypes lists the data type of each column

Output:

<class 'pandas.core.frame.DataFrame'>
NDB_No              int64
Shrt_Desc          object
Water_(g)         float64
Energ_Kcal          int64
......
Cholestrl_(mg)    float64
dtype: object

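Inspecting the file that was read

print(food_info.head(3))   # head() without an argument returns the first 5 rows
print(food_info.tail())    # tail() returns the last 5 rows
print(food_info.columns)   # columns holds all column names
print(food_info.shape)     # dimensions of the data: (8618, 36)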

Selecting specific rows

print(food_info.loc[0])  # the row labelled 0
food_info.loc[8620]      # a label beyond the largest index raises "KeyError: 'the label [8620] is not in the [index]'"
food_info.loc[3:6]       # rows 3 through 6: 3, 4, 5 and 6 (both ends included)
two_five_ten = [2,5,10]
food_info.loc[two_five_ten]  # rows 2, 5 and 10
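loc selects by index label, which only coincides with the row position when the index is the default 0, 1, 2, ... The sketch below uses a hypothetical toy DataFrame with labels 10 to 40 to show the difference from iloc, which selects by position.

```python
import pandas as pd

# Hypothetical toy frame; the labels 10, 20, 30, 40 are deliberately not 0..3.
df = pd.DataFrame({"name": ["a", "b", "c", "d"]}, index=[10, 20, 30, 40])

print(df.loc[10, "name"])     # by label 10 -> a
print(df.iloc[0]["name"])     # by position 0 -> a
print(len(df.loc[10:30]))     # a label slice includes both ends -> 3
print(len(df.iloc[0:2]))      # a position slice excludes the stop -> 2

try:
    df.loc[0]                 # no row is labelled 0
except KeyError as err:
    print("KeyError:", err)
```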

Selecting specific columns

ndb_col = food_info["NDB_No"]  # the data in the first column, NDB_No
print(ndb_col)
columns = ["Zinc_(mg)", "Copper_(mg)"]  # to select several columns, list their names
zinc_copper = food_info[columns]
print(zinc_copper)

Selecting the columns whose names end with (g)

col_names = food_info.columns.tolist()  # tolist() puts the column names in a list
gram_columns = []
for c in col_names:
    if c.endswith("(g)"):
        gram_columns.append(c)
gram_df = food_info[gram_columns]
print(gram_df.head())  # first 5 rows, matching the output below

Output:

   Water_(g)  Protein_(g)  Lipid_Tot_(g)  Ash_(g)  Carbohydrt_(g)  \
0      15.87         0.85          81.11     2.11            0.06
1      15.87         0.85          81.11     2.11            0.06
2       0.24         0.28          99.48     0.00            0.00
3      42.41        21.40          28.74     5.11            2.34
4      41.11        23.24          29.68     3.18            2.79

   Fiber_TD_(g)  Sugar_Tot_(g)  FA_Sat_(g)  FA_Mono_(g)  FA_Poly_(g)
0           0.0           0.06      51.368       21.021        3.043
1           0.0           0.06      50.489       23.426        3.012
2           0.0           0.00      61.924       28.732        3.694
3           0.0           0.50      18.669        7.778        0.800
4           0.0           0.51      18.764        8.598        0.784

Arithmetic on the data in a column

import pandas
food_info = pandas.read_csv("food_info.csv")
iron_grams = food_info["Iron_(mg)"] / 1000  # divide every value in the column by 1000
food_info["Iron_(g)"] = iron_grams          # add a new column Iron_(g) to hold the result
water_energy = food_info["Water_(g)"] * food_info["Energ_Kcal"]  # multiply two columns element by element

Maximum, minimum and mean of a column

max_calories = food_info["Energ_Kcal"].max()
print(max_calories)
min_calories = food_info["Energ_Kcal"].min()
print(min_calories)
mean_calories = food_info["Energ_Kcal"].mean()
print(mean_calories)

Output:

906.438616848

Sorting a column with sort_values()

food_info.sort_values("Sodium_(mg)", inplace=True)  # ascending by default; inplace=True sorts food_info itself instead of returning a new DataFrame
print(food_info["Sodium_(mg)"])
food_info.sort_values("Sodium_(mg)", inplace=True, ascending=False)  # ascending=False sorts from largest to smallest
print(food_info["Sodium_(mg)"])
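The effect of inplace is easiest to see on a tiny frame. The sketch below assumes a hypothetical four-row sodium column with one missing value; it shows that without inplace the original is untouched, and that with inplace=True the call returns None.

```python
import pandas as pd

# Hypothetical miniature of the sodium column, including one missing value.
df = pd.DataFrame({"Sodium_(mg)": [30.0, None, 10.0, 20.0]})

# Without inplace, sort_values returns a sorted copy and leaves df untouched.
sorted_copy = df.sort_values("Sodium_(mg)", ascending=False)
print(sorted_copy["Sodium_(mg)"].tolist())   # [30.0, 20.0, 10.0, nan]
print(df.index.tolist())                     # [0, 1, 2, 3]

# With inplace=True the frame itself is reordered and the call returns None.
result = df.sort_values("Sodium_(mg)", inplace=True)
print(result)                                # None
print(df.index.tolist())                     # [2, 3, 0, 1]
```

Missing values are placed last in both directions by default, which is why the nan row ends up at the bottom each time.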

Exercises on titanic_train.csv (including the pivot_table() method)

import pandas as pd
import numpy as np
titanic_survival = pd.read_csv("titanic_train.csv")
titanic_survival.head()
age = titanic_survival["Age"]
print(age.loc[0:20])  # print rows 0 through 20 of one column
age_is_null = pd.isnull(age)  # isnull() detects missing values: True if missing, False otherwise
print(age_is_null)
age_null_true = age[age_is_null]  # all rows of the column that are missing
print(age_null_true)
age_null_count = len(age_null_true)
print(age_null_count)  # number of missing rows

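# With missing values present, a naive mean cannot be computed
mean_age = sum(titanic_survival["Age"]) / len(titanic_survival["Age"])  # sum() adds up the elements of the column
print(mean_age)  # nan

# Missing values have to be removed before computing the mean
good_ages = titanic_survival["Age"][age_is_null == False]  # keep the values that are not missing
correct_mean_age = sum(good_ages) / len(good_ages)
print(correct_mean_age)  # 29.6991176471

# This does not have to be so laborious: missing values are common, and the pandas mean() method skips them automatically
correct_mean_age = titanic_survival["Age"].mean()
print(correct_mean_age)  # 29.6991176471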
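The same behaviour can be reproduced on a few hand-written values. This sketch assumes a hypothetical five-element age Series and adds the skipna parameter of mean(), which the original notes do not mention.

```python
import numpy as np
import pandas as pd

ages = pd.Series([22.0, np.nan, 38.0, np.nan, 30.0])

print(sum(ages) / len(ages))      # nan: NaN propagates through Python's sum
print(ages.mean())                # 30.0: pandas skips NaN by default
print(ages.mean(skipna=False))    # nan
print(ages.isnull().sum())        # 2 missing values
print(ages.dropna().tolist())     # [22.0, 38.0, 30.0]
```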

# Average ticket price for each passenger class
passenger_classes = [1, 2, 3]
fares_by_class = {}
for this_class in passenger_classes:
    pclass_rows = titanic_survival[titanic_survival["Pclass"] == this_class]
    pclass_fares = pclass_rows["Fare"]  # the fare column for passengers of this class
    fare_for_class = pclass_fares.mean()
    fares_by_class[this_class] = fare_for_class
print(fares_by_class)

Result:

{1: 84.154687499999994, 2: 20.662183152173913, 3: 13.675550101832993}

# pandas offers a more convenient tool for this kind of summary, the pivot_table() method
# index tells pivot_table which column to group by
# values names the column to compute on
# aggfunc names the computation to use
passenger_survival = titanic_survival.pivot_table(index="Pclass", values="Survived", aggfunc=np.mean)
print(passenger_survival)

Result:

        Survived
Pclass
1       0.629630
2       0.472826
3       0.242363

# Average age of the passengers in each class
passenger_age = titanic_survival.pivot_table(index="Pclass", values="Age")  # aggfunc defaults to the mean
print(passenger_age)

Result:

              Age
Pclass
1       38.233441
2       29.877630
3       25.140620

# index groups by one column
# values names several columns to compute on
port_stats = titanic_survival.pivot_table(index="Embarked", values=["Fare","Survived"], aggfunc=np.sum)
print(port_stats)

Result:

                Fare  Survived
Embarked
C         10072.2962        93
Q          1022.2543        30
S         17439.3988       217
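To see what pivot_table does under the hood, it can help to compare it with groupby on a frame small enough to check by hand. The six-row DataFrame below is made up for illustration (it is not the Titanic data), and the aggregation is passed by name as "mean" rather than as a NumPy function.

```python
import pandas as pd

# Hypothetical miniature with the same column names as the Titanic data.
df = pd.DataFrame({
    "Pclass":   [1, 1, 2, 3, 3, 3],
    "Survived": [1, 0, 1, 0, 0, 1],
    "Fare":     [80.0, 60.0, 20.0, 8.0, 7.0, 9.0],
})

# Group by Pclass and average the two value columns.
by_pivot = df.pivot_table(index="Pclass", values=["Fare", "Survived"], aggfunc="mean")

# The same numbers computed with groupby.
by_group = df.groupby("Pclass")[["Fare", "Survived"]].mean()

print(by_pivot)
print(by_group)
```

For a single grouping column and a single aggregation the two approaches give the same table; pivot_table becomes more convenient once a `columns` argument is added to spread a second key across the columns.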

# Drop the rows that contain missing values
new_titanic_survival = titanic_survival.dropna(axis=0, subset=["Age", "Cabin"])  # with subset, a row is dropped if either Age or Cabin is missing
print(new_titanic_survival)

# Locate an element by row and column and read its value
row_index_103_age = titanic_survival.loc[103, "Age"]
row_index_766_pclass = titanic_survival.loc[766, "Pclass"]
print(row_index_103_age)
print(row_index_766_pclass)

# sort_values() sorts, reset_index() renumbers the rows
new_titanic_survival = titanic_survival.sort_values("Age", ascending=False)  # ascending=False: largest first
print(new_titanic_survival[0:10])  # the row labels are still the original ones
titanic_reindexed = new_titanic_survival.reset_index(drop=True)  # reset_index(drop=True) assigns fresh row numbers
print(titanic_reindexed.iloc[0:10])  # iloc selects rows by position

# Wrap an operation in a function, then apply the function
def hundredth_row(column):  # returns the 100th item of the column it receives
    # Extract the hundredth item
    hundredth_item = column.iloc[99]
    return hundredth_item

hundredth_values = titanic_survival.apply(hundredth_row)  # apply() calls the function on every column
print(hundredth_values)

Result:

PassengerId                  100
Survived                       0
Pclass                         2
Name           Kantor, Mr. Sinai
Sex                         male
Age                           34
SibSp                          1
Parch                          0
Ticket                    244367
Fare                          26
Cabin                        NaN
Embarked                       S
dtype: object

# Count the missing values in every column
def null_count(column):
    column_null = pd.isnull(column)
    null = column[column_null]
    return len(null)

column_null_count = titanic_survival.apply(null_count)
print(column_null_count)

Output:

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

# Convert the passenger class to a label
def which_class(row):
    pclass = row['Pclass']
    if pd.isnull(pclass):
        return "Unknown"
    elif pclass == 1:
        return "First Class"
    elif pclass == 2:
        return "Second Class"
    elif pclass == 3:
        return "Third Class"

classes = titanic_survival.apply(which_class, axis=1)  # with axis=1, DataFrame.apply() iterates over rows instead of columns
print(classes)

# Using a custom function together with pivot_table, compute the survival rate for each age label
def generate_age_label(row):
    age = row["Age"]
    if pd.isnull(age):
        return "unknown"
    elif age < 18:
        return "minor"
    else:
        return "adult"

age_labels = titanic_survival.apply(generate_age_label, axis=1)
titanic_survival['age_labels'] = age_labels
age_group_survival = titanic_survival.pivot_table(index="age_labels", values="Survived", aggfunc=np.mean)
print(age_group_survival)

Result:

            Survived
age_labels
adult       0.381032
minor       0.539823
unknown     0.293785
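Row-wise apply is readable but calls a Python function once per row. As a hedged alternative, the sketch below labels the whole column at once with np.select, on a made-up five-row frame; this technique is not used in the original notes.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the Age and Survived columns.
df = pd.DataFrame({
    "Age":      [10.0, 40.0, np.nan, 17.0, 18.0],
    "Survived": [1, 0, 0, 1, 1],
})

# np.select picks the first matching condition for each row, so the labels
# are computed for the whole column at once instead of row by row.
conditions = [df["Age"].isnull(), df["Age"] < 18]
labels = ["unknown", "minor"]
df["age_labels"] = np.select(conditions, labels, default="adult")

print(df["age_labels"].tolist())
# ['minor', 'adult', 'unknown', 'minor', 'adult']
survival = df.groupby("age_labels")["Survived"].mean()
print(survival)
```

The order of the conditions matters: the missing-value test comes first so that a NaN age is labelled "unknown" rather than falling through to the default.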

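The Series structure

Series (collection of values): a single row or column of a DataFrame is a Series.

DataFrame (collection of Series objects): the matrix-like table returned by read_csv().

Panel (collection of DataFrame objects). Panel was removed from pandas in version 0.25, so it is not available in current releases.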

import pandas as pd
fandango = pd.read_csv('fandango_score_comparison.csv')  # read the film data into a DataFrame
series_film = fandango['FILM']  # select the "FILM" column
print(type(series_film))        # <class 'pandas.core.series.Series'>
print(series_film[0:5])         # slice by index
series_rt = fandango['RottenTomatoes']
print(series_rt[0:5])

from pandas import Series  # Import the Series object from pandas
film_names = series_film.values  # take the values out of the Series
print(type(film_names))          # <class 'numpy.ndarray'>: the values of a Series are stored in an ndarray
rt_scores = series_rt.values
series_custom = Series(rt_scores, index=film_names)  # create a Series of scores indexed by film name
series_custom[['Minions ()', 'Leviathan ()']]  # names really can be used as the index
fiveten = series_custom[5:10]  # positional slices work as well
print(fiveten)

Sorting a Series

original_index = series_custom.index.tolist()  # put the index values in a list
sorted_index = sorted(original_index)
sorted_by_index = series_custom.reindex(sorted_index)  # reindex reorders the Series to follow the sorted labels
print(sorted_by_index)
sc2 = series_custom.sort_index()   # sort by index
sc3 = series_custom.sort_values()  # sort by value
print(sc3)

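The values of a Series are stored in an ndarray, the core NumPy data type, so NumPy functions can be applied to a Series directly

import numpy as np
print(np.add(series_custom, series_custom))  # add two columns element by element
np.sin(series_custom)  # apply sin to every value
np.max(series_custom)  # maximum value of the column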

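Selecting the values of series_custom between 50 and 75

A comparison on a column is applied to every value and returns a boolean Series

criteria_one = series_custom > 50
criteria_two = series_custom < 75
both_criteria = series_custom[criteria_one & criteria_two]  # keep the values where both boolean Series are True
print(both_criteria)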

Arithmetic on two columns that share the same index

# data alignment, same index
rt_critics = Series(fandango['RottenTomatoes'].values, index=fandango['FILM'])
rt_users = Series(fandango['RottenTomatoes_User'].values, index=fandango['FILM'])
rt_mean = (rt_critics + rt_users)/2
print(rt_mean)
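Alignment means that values are paired by index label rather than by position. The sketch below uses two made-up Series with partly different labels (the film names and scores are hypothetical) to show what happens to labels that appear on only one side.

```python
import pandas as pd

# Hypothetical scores; the two Series list the films in different orders
# and each contains one film the other lacks.
critics = pd.Series([74, 85, 80], index=["Film A", "Film B", "Film C"])
users = pd.Series([90, 86, 70], index=["Film C", "Film A", "Film D"])

# Values are paired by index label, not by position.
mean_score = (critics + users) / 2
print(mean_score)
# Film A    80.0
# Film B     NaN
# Film C    85.0
# Film D     NaN
# dtype: float64
```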

Operations on a DataFrame

Setting 'FILM' as the index

fandango = pd.read_csv('fandango_score_comparison.csv')
print(type(fandango))  # <class 'pandas.core.frame.DataFrame'>
fandango_films = fandango.set_index('FILM', drop=False)  # returns a new DataFrame indexed by 'FILM'; drop=False keeps the original FILM column

Slicing a DataFrame

# Both [] and loc[] can be used for slicing
fandango_films["Avengers: Age of Ultron ()":"Hot Tub Time Machine 2 ()"]  # an index of string values can be sliced too
fandango_films.loc["Avengers: Age of Ultron ()":"Hot Tub Time Machine 2 ()"]
fandango_films[0:3]  # positional slicing still works
# Selecting specific rows
# movies = ['Kumiko, The Treasure Hunter ()', 'Do You Believe? ()', 'Ant-Man ()']

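The visualization library Matplotlib

Matplotlib is one of the most widely used visualization tools in Python. It makes it easy to create a wide range of 2D charts and some basic 3D charts.

2D charts: line charts

pyplot is the most basic module in Matplotlib. Start with the simplest point and line plots.

For more attributes, see the official site: /api/pyplot_api.html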

import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
unrate = pd.read_csv('unrate.csv')
unrate['DATE'] = pd.to_datetime(unrate['DATE'])  # pd.to_datetime normalizes the date format
first_twelve = unrate[0:12]  # the first 12 rows (rows 0 through 11)
plt.plot(first_twelve['DATE'], first_twelve['VALUE'])  # plot(x values, y values) draws the line
plt.xticks(rotation=45)  # rotate the x-axis tick labels
plt.xlabel('Month')  # x-axis label
plt.ylabel('Unemployment Rate')  # y-axis label
plt.title('Monthly Unemployment Trends, 1948')  # chart title
plt.show()  # show() displays the figure

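Working with subplots

Adding a subplot: add_subplot(first, second, index)

first is the number of rows and second the number of columns. index is the position of the subplot in that grid, counted from 1, left to right and top to bottom.

import matplotlib.pyplot as plt
fig = plt.figure()  # Creates a new figure.
ax1 = fig.add_subplot(3,2,1)  # first cell of a 3*2 grid of subplots
ax2 = fig.add_subplot(3,2,2)  # second cell of a 3*2 grid of subplots
ax3 = fig.add_subplot(3,2,6)  # sixth cell of a 3*2 grid of subplots
plt.show()

import numpy as np
# fig = plt.figure()
fig = plt.figure(figsize=(3, 6))  # size of the figure as (width, height) in inches
ax1 = fig.add_subplot(2,1,1)
ax2 = fig.add_subplot(2,1,2)
ax1.plot(np.random.randint(1,5,5), np.arange(5))  # draw in the first subplot
ax2.plot(np.arange(10)*3, np.arange(10))  # draw in the second subplot
plt.show()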
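An alternative to calling add_subplot once per cell is plt.subplots, which creates the figure and the whole grid of axes in one call. The sketch below is an illustration under the assumption that no display is available, which is why it selects the non-interactive Agg backend; in a notebook that line can be dropped.

```python
import matplotlib
matplotlib.use("Agg")   # assumption: no display available; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# plt.subplots creates the figure and a grid of axes in one call.
fig, axes = plt.subplots(nrows=2, ncols=1, figsize=(3, 6))

x = np.arange(10)
axes[0].plot(x, x * 3)
axes[0].set_title("3x")
axes[1].plot(x, x ** 2)
axes[1].set_title("x squared")

fig.tight_layout()      # keeps the titles from overlapping the neighbouring plot
print(axes.shape)       # (2,)
```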

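Drawing two lines in the same chart (call plot twice)

unrate['MONTH'] = unrate['DATE'].dt.month  # assumption: MONTH is derived from DATE; the notes never show how this column is created
fig = plt.figure(figsize=(6,3))
plt.plot(unrate[0:12]['MONTH'], unrate[0:12]['VALUE'], c='red')
plt.plot(unrate[12:24]['MONTH'], unrate[12:24]['VALUE'], c='blue')
plt.show()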

Labelling the plotted lines

fig = plt.figure(figsize=(10,6))
colors = ['red', 'blue', 'green', 'orange', 'black']
for i in range(5):
    start_index = i*12
    end_index = (i+1)*12
    subset = unrate[start_index:end_index]
    label = str(1948 + i)  # label text
    plt.plot(subset['MONTH'], subset['VALUE'], c=colors[i], label=label)  # x values, y values, color, label
plt.legend(loc='upper left')  # loc sets the position of the legend box: 'best', 'upper right', 'lower left' and so on; help(plt.legend) shows the usage
plt.xlabel('Month, Integer')
plt.ylabel('Unemployment Rate, Percent')
plt.title('Monthly Unemployment Trends, 1948-1952')
plt.show()

2D charts: bar charts and scatter plots

Bar chart

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
reviews = pd.read_csv('fandango_scores.csv')  # read the film ratings table
cols = ['FILM', 'RT_user_norm', 'Metacritic_user_nom', 'IMDB_norm', 'Fandango_Ratingvalue', 'Fandango_Stars']
norm_reviews = reviews[cols]
num_cols = ['RT_user_norm', 'Metacritic_user_nom', 'IMDB_norm', 'Fandango_Ratingvalue', 'Fandango_Stars']
bar_heights = norm_reviews.loc[0, num_cols].values  # bar heights (.loc replaces .ix, which has been removed from pandas)
bar_positions = np.arange(5) + 0.75  # distance of each bar from the left edge
tick_positions = range(1,6)  # x-axis tick positions [1,2,3,4,5]
fig, ax = plt.subplots()
ax.bar(bar_positions, bar_heights, 0.5)  # bar chart: bar positions, bar heights, bar width
ax.set_xticks(tick_positions)  # x-axis tick positions
ax.set_xticklabels(num_cols, rotation=45)  # x-axis tick labels
ax.set_xlabel('Rating Source')
ax.set_ylabel('Average Rating')
ax.set_title('Average User Rating For Avengers: Age of Ultron ()')
plt.show()

Scatter plot

fig, ax = plt.subplots()  # fig controls the figure as a whole, such as its size; ax does the actual drawing
ax.scatter(norm_reviews['Fandango_Ratingvalue'], norm_reviews['RT_user_norm'])  # scatter takes the x values and the y values
ax.set_xlabel('Fandango')
ax.set_ylabel('Rotten Tomatoes')
plt.show()

Scatter plots in subplots

fig = plt.figure(figsize=(8,3))
ax1 = fig.add_subplot(1,2,1)
ax2 = fig.add_subplot(1,2,2)
ax1.scatter(norm_reviews['Fandango_Ratingvalue'], norm_reviews['RT_user_norm'])
ax1.set_xlabel('Fandango')
ax1.set_ylabel('Rotten Tomatoes')
ax2.scatter(norm_reviews['RT_user_norm'], norm_reviews['Fandango_Ratingvalue'])
ax2.set_xlabel('Rotten Tomatoes')
ax2.set_ylabel('Fandango')
plt.show()
