Data source
/politics/
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
# Flight data: about 200,000 records covering half a month
link = '/Users/bennyrhys/Desktop/数据分析可视化-数据集/homework/usa_flights.csv'
df = pd.read_csv(link)
df.head()
df.tail()
# Number of rows and columns
df.shape
(201664, 14)
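Identifying delays: a flight counts as delayed when arr_delay > 0.
Sort the flights by arrival delay in descending order and take the top ten.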
df.sort_values('arr_delay', ascending=False)[:10]
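As an alternative to sorting the whole frame and slicing it, `DataFrame.nlargest` returns the top rows directly. The sketch below uses a small made-up frame named `toy` (hypothetical values, not taken from usa_flights.csv) to show that both approaches select the same rows.

```python
import pandas as pd

# Toy data standing in for usa_flights.csv; all values are invented.
toy = pd.DataFrame({
    "unique_carrier": ["AA", "DL", "WN", "UA", "AS"],
    "arr_delay": [12.0, -5.0, 140.0, 33.0, -2.0],
})

# Sort descending and slice, as in the walkthrough above.
top_sorted = toy.sort_values("arr_delay", ascending=False)[:3]

# Same selection with nlargest, which does not need a full sort.
top_nlargest = toy.nlargest(3, "arr_delay")

print(top_nlargest)
```

On a frame with a couple of hundred thousand rows either form is fast; `nlargest` mainly states the intent more directly.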
Compute the share of delayed and non-delayed flights.
df['cancelled'].value_counts()
0    196873
1      4791
Name: cancelled, dtype: int64
# Add a column that flags whether each flight was delayed
df['delayed'] = df['arr_delay'].apply(lambda x: x > 0)
df.head()
delay_data = df['delayed'].value_counts()
delay_data
False    103037
True      98627
Name: delayed, dtype: int64
type(delay_data)
pandas.core.series.Series
delay_data[1] / (delay_data[0] + delay_data[1])
0.4890659711202793
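The same ratio can be obtained without indexing the counts by position, which depends on the order `value_counts` happens to return. This is a minimal sketch on invented data (the `toy` frame and its values are assumptions for illustration): a vectorized comparison builds the flag, and the mean of a boolean column gives the fraction of `True` values.

```python
import pandas as pd

# Invented arrival delays in minutes; NaN stands for a flight with no recorded arrival.
toy = pd.DataFrame(
    {"arr_delay": [12.0, -5.0, 0.0, 33.0, float("nan"), -2.0, 7.0, -1.0]}
)

# Vectorized comparison; NaN > 0 evaluates to False, the same as the lambda version.
toy["delayed"] = toy["arr_delay"] > 0

# Share of each class, normalized so the values sum to 1.
shares = toy["delayed"].value_counts(normalize=True)

# The mean of a boolean column is the fraction of True values.
delayed_share = toy["delayed"].mean()

print(shares)
print(delayed_share)
```

Note that rows with a missing `arr_delay` are counted as not delayed under this definition, so cancelled flights end up in the `False` group.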
Delays for each airline
delay_group = df.groupby(['unique_carrier', 'delayed'])
delay_group
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x11ff50710>
# The result is a Series with a two-level index
delay_group.size()
unique_carrier  delayed
AA              False       8912
                True        9841
AS              False       3527
                True        2104
B6              False       4832
                True        4401
DL              False      17719
                True        9803
EV              False      10596
                True       11371
F9              False       1103
                True        1848
HA              False       1351
                True        1354
MQ              False       4692
                True        8060
NK              False       1550
                True        2133
OO              False       9977
                True       10804
UA              False       7885
                True        8624
US              False       7850
                True        6353
VX              False       1254
                True         781
WN              False      21789
                True       21150
dtype: int64
# Convert the multi-level Series into a DataFrame
df_delay = delay_group.size().unstack()
df_delay
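To make the group-then-unstack step concrete, here is a small sketch on invented data (the `toy` frame is hypothetical). It also shows two additions that are not in the walkthrough: `fill_value=0` for carriers that lack one of the two classes, and a per-carrier delay rate, which is easier to compare across airlines than raw counts.

```python
import pandas as pd

# Invented rows; only the column names mirror the flight data set.
toy = pd.DataFrame({
    "unique_carrier": ["AA", "AA", "AA", "DL", "DL", "WN", "WN", "WN", "WN"],
    "delayed": [True, False, True, False, False, True, True, False, True],
})

# Count rows per (carrier, delayed) pair, then pivot the inner level into columns.
# DL has no delayed flights here, so fill_value=0 avoids a NaN in its True column.
counts = toy.groupby(["unique_carrier", "delayed"]).size().unstack(fill_value=0)

# Fraction of delayed flights per carrier: the mean of a boolean column.
delay_rate = toy.groupby("unique_carrier")["delayed"].mean()

print(counts)
print(delay_rate)
```

Sorting `delay_rate` with `sort_values(ascending=False)` would rank the carriers from most to least frequently delayed.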
# Plotting
import matplotlib.pyplot as plt
df_delay.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x1210efb50>
plt.show()
# Blue marks flights that were not delayed
df_delay.plot(kind='barh', stacked=True, figsize=[16,6], colormap='winter')
<matplotlib.axes._subplots.AxesSubplot at 0x11c9e2290>
Pivot tables
flights_by_carrier = df.pivot_table(index='flight_date', columns='unique_carrier')
flights_by_carrier.head()
5 rows × 154 columns
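When `values` is omitted, `pivot_table` aggregates every remaining column it can, producing one block of carrier columns per measure, which is why the result is so wide. A narrower table comes from naming the measure and the aggregation explicitly. The sketch below assumes a made-up frame `toy` with invented dates and delays; only the column names follow the data set.

```python
import pandas as pd

# Invented rows; the dates and delays are illustrative only.
toy = pd.DataFrame({
    "flight_date": ["2015-01-02", "2015-01-02", "2015-01-02", "2015-01-03", "2015-01-03"],
    "unique_carrier": ["AA", "AA", "DL", "AA", "DL"],
    "arr_delay": [10.0, 30.0, -4.0, 8.0, 20.0],
})

# One row per date, one column per carrier, each cell the mean arrival delay.
mean_delay = toy.pivot_table(
    index="flight_date",
    columns="unique_carrier",
    values="arr_delay",
    aggfunc="mean",
)

print(mean_delay)
```

Swapping `aggfunc="mean"` for `"count"` or `"max"` gives the number of flights or the worst delay per carrier and day in the same layout.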