Hypothesis Testing and Interval Estimation in Python: Implementing Common Hypothesis Tests with Python

Date: 2019-01-15 04:16:17

Author: 求知鸟

Source: 知乎 (Zhihu)

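Let's get straight to the point.

This article shows how to implement common hypothesis tests in Python.

The distribution that a statistic follows determines which interval-estimation method to use, and therefore which test to use.

For example, for two normal populations the ratio of the two sample variances follows an F distribution, so the interval estimate uses critical values from the F distribution (which give the confidence interval), and the matching test is the F-test.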
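The F-distribution example above can be sketched as follows. This is a minimal illustration, not the author's code: the function name `variance_ratio_ci` and the synthetic samples are assumptions made for the example, and the method assumes both samples come from normal populations.

```python
import numpy as np
from scipy import stats

def variance_ratio_ci(x, y, confidence=0.95):
    """Confidence interval and two-sided F-test for var(x) / var(y).

    Hypothetical helper for illustration; assumes normal populations.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    d1, d2 = len(x) - 1, len(y) - 1               # degrees of freedom
    ratio = np.var(x, ddof=1) / np.var(y, ddof=1)  # ratio of sample variances
    alpha = 1 - confidence
    f_low = stats.f.ppf(alpha / 2, d1, d2)         # lower critical value
    f_high = stats.f.ppf(1 - alpha / 2, d1, d2)    # upper critical value
    ci = (ratio / f_high, ratio / f_low)           # interval for the population variance ratio
    # Two-sided p-value for H0: the two population variances are equal
    p = 2 * min(stats.f.cdf(ratio, d1, d2), stats.f.sf(ratio, d1, d2))
    return ratio, ci, min(p, 1.0)

# Synthetic data, for demonstration only
rng = np.random.default_rng(0)
sample_x = rng.normal(0, 1, 50)
sample_y = rng.normal(0, 2, 60)
ratio, (ci_low, ci_high), p_value = variance_ratio_ci(sample_x, sample_y)
print(ratio, ci_low, ci_high, p_value)
```

The same critical values serve both purposes: they bound the confidence interval and they define the rejection region of the F-test.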

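Basic steps of hypothesis testing:

Preface

Python packages used for hypothesis testing

Statsmodels is a Python package for statistical modeling and econometrics. It mainly covers descriptive statistics, statistical model estimation, and statistical inference.

SciPy is a Python package for mathematics, science, and engineering. It mainly includes modules for statistics, optimization, integration, linear algebra, and other scientific computing tasks.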

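Importing the data

from sklearn.datasets import load_iris
import numpy as np
import pandas as pd

# Load the Iris dataset
iris = load_iris()
# The column name 'petal_legth' (sic) is the identifier used by all the code below
iris = pd.DataFrame(iris.data, columns=['sepal_length','sepal_width','petal_legth','petal_width'])
print(iris)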

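One-sample z-test for a population mean

np.mean(iris['petal_legth'])

'''
Null hypothesis: the mean petal length of iris flowers is 4.2
Alternative hypothesis: the mean petal length of iris flowers is not 4.2
'''

import statsmodels.stats.weightstats

z, pval = statsmodels.stats.weightstats.ztest(iris['petal_legth'], value=4.2)
print(z, pval)

'''
P = 0.002 < 5%, so reject the null hypothesis in favor of the alternative hypothesis.
'''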
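To make the z-test less of a black box, here is a minimal sketch of the calculation done by hand with NumPy and SciPy. The helper `one_sample_z` and the synthetic sample are assumptions for illustration. As far as I understand, the formula matches what `ztest` computes for a single sample with default arguments, but treat that as an assumption rather than a guarantee.

```python
import numpy as np
from scipy import stats

def one_sample_z(sample, value):
    """Two-sided one-sample z-test (hypothetical helper for illustration)."""
    x = np.asarray(sample, dtype=float)
    se = x.std(ddof=1) / np.sqrt(len(x))   # standard error of the mean
    z = (x.mean() - value) / se            # standardized distance from the hypothesized mean
    p = 2 * stats.norm.sf(abs(z))          # two-sided p-value from the standard normal
    return z, p

# Synthetic data, for demonstration only (not the Iris measurements)
rng = np.random.default_rng(42)
demo = rng.normal(loc=3.8, scale=1.8, size=150)
z_demo, p_demo = one_sample_z(demo, 4.2)
print(z_demo, p_demo)
```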

One-sample t-test for a population mean

import scipy.stats

t, pval = scipy.stats.ttest_1samp(iris['petal_legth'], popmean=4.0)
print(t, pval)

'''
P = 0.0959 > 5%, so we cannot reject the null hypothesis that the mean petal length is 4.0.
'''
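The interval-estimation counterpart of the one-sample t-test is a confidence interval for the mean based on the t distribution. The sketch below uses a hypothetical helper `mean_confidence_interval` and a small made-up sample. A hypothesized mean that falls inside the 95% interval corresponds to a two-sided p-value above 0.05.

```python
import numpy as np
from scipy import stats

def mean_confidence_interval(sample, confidence=0.95):
    """t-based confidence interval for the mean (hypothetical helper)."""
    x = np.asarray(sample, dtype=float)
    mean = x.mean()
    sem = stats.sem(x)  # standard error of the mean, ddof=1 by default
    low, high = stats.t.interval(confidence, len(x) - 1, loc=mean, scale=sem)
    return mean, low, high

# Small made-up sample, for demonstration only
mean, low, high = mean_confidence_interval([1, 2, 3, 4, 5])
print(mean, low, high)
```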

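Simulated two-sample t-test

# Take two samples
iris_1 = iris[iris.petal_legth >= 2]
iris_2 = iris[iris.petal_legth < 2]
print(np.mean(iris_1['petal_legth']))
print(np.mean(iris_2['petal_legth']))

'''
H0: the two kinds of iris have the same mean petal length
H1: the two kinds of iris have different mean petal lengths
'''

import scipy.stats

t, pval = scipy.stats.ttest_ind(iris_1['petal_legth'], iris_2['petal_legth'])
print(t, pval)

'''
p < 0.05, so reject H0 and conclude that the two kinds of iris differ in mean petal length.
'''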
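By default `ttest_ind` assumes the two groups have equal variances. One common workflow, sketched here on synthetic data, is to check that assumption with Levene's test and fall back to Welch's t-test (`equal_var=False`) when it fails. This is an illustrative alternative rather than something the article itself does, and the group parameters below are invented.

```python
import numpy as np
from scipy import stats

# Synthetic groups with clearly different spreads, for demonstration only
rng = np.random.default_rng(0)
group_a = rng.normal(4.9, 0.8, 100)
group_b = rng.normal(1.5, 0.2, 50)

# Levene's test: H0 is that the groups have equal variances
lev_stat, lev_p = stats.levene(group_a, group_b)
equal_var = bool(lev_p > 0.05)

# Student's t-test if variances look equal, otherwise Welch's t-test
t_stat, p_val = stats.ttest_ind(group_a, group_b, equal_var=equal_var)
print(equal_var, t_stat, p_val)
```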

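Exercise

Data field descriptions:

Gender: sex, 1 = male, 2 = female

Temperature: body temperature

HeartRate: heart rate

130 rows and 3 columns in total

Link to the data used: /s/1t4SKF6

Questions to answer this week:

1. Is the population mean of human body temperature 98.6 degrees Fahrenheit?

2. Does human body temperature follow a normal distribution?

3. Which body temperature values are outliers?

4. Is there a significant difference between male and female body temperature?

5. How strong is the correlation between body temperature and heart rate (strong, weak, or moderate)?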

1.1 Exploring the data

import numpy as np
import pandas as pd
from scipy import stats

data = pd.read_csv("C:\\Users\\baihua\\Desktop\\test.csv")
print(data.head())
sample_size = data.size  # total number of cells: 130 rows * 3 columns = 390

Out:

   Temperature  Gender  HeartRate
0         96.3       1         70
1         96.7       1         71
2         96.9       1         74
3         97.0       1         80
4         97.1       1         73

print(data.describe())

Out:

       Temperature      Gender   HeartRate
count   130.000000  130.000000  130.000000
mean     98.249231    1.500000   73.761538
std       0.733183    0.501934    7.062077
min      96.300000    1.000000   57.000000
25%      97.800000    1.000000   69.000000
50%      98.300000    1.500000   74.000000
75%      98.700000    2.000000   79.000000
max     100.800000    2.000000   89.000000

The sample mean body temperature is 98.249231.
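Question 1 asks whether the population mean is 98.6, and the summary statistics above are enough to sketch a one-sample t-test. The numbers below are copied from the `describe()` output; the variable names are assumptions for illustration. With the CSV loaded, `stats.ttest_1samp(data['Temperature'], popmean=98.6)` should give the same statistic up to rounding.

```python
import math
from scipy import stats

# Summary statistics taken from the describe() output above
n = 130
sample_mean = 98.249231
sample_std = 0.733183   # pandas describe() reports the ddof=1 standard deviation
mu0 = 98.6              # hypothesized population mean

se = sample_std / math.sqrt(n)                    # standard error of the mean
t_stat = (sample_mean - mu0) / se                 # t statistic with n - 1 degrees of freedom
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)   # two-sided p-value
print(t_stat, p_value)
```

A p-value below 0.05 here would mean rejecting the hypothesis that the population mean is 98.6.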

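1.2 Does human body temperature follow a normal distribution?

'''
Does human body temperature follow a normal distribution?
First plot a histogram of the distribution, then use scipy.stats.kstest to check.
'''

%matplotlib inline
import seaborn as sns

# Note: distplot is deprecated in recent seaborn releases; histplot is its replacement
sns.distplot(data['Temperature'], color='b', bins=10, kde=True)

stats.kstest(data['Temperature'], 'norm')

Out:

KstestResult(statistic=1.0, pvalue=0.0)

'''
p < 0.05, so the data do not follow the standard normal distribution N(0, 1).
Caveat: with 'norm' and no args, kstest compares against N(0, 1), so statistic=1.0 is expected
for values near 98 and says nothing about normality in general. To test against a normal
distribution with the sample's own mean and standard deviation, pass them via args=(mean, std).
'''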
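The difference between testing against N(0, 1) and testing against a normal distribution with the sample's own parameters can be sketched on synthetic data. The sample below is randomly generated and is not the body temperature dataset. The Shapiro-Wilk test is included as a commonly used alternative for normality, not as something the article uses.

```python
import numpy as np
from scipy import stats

# Synthetic "temperatures", for demonstration only
rng = np.random.default_rng(1)
temps = rng.normal(loc=98.25, scale=0.73, size=130)

# Naive call: compares against the standard normal N(0, 1)
naive = stats.kstest(temps, 'norm')

# Compare against a normal with the sample's own mean and standard deviation
mu, sigma = temps.mean(), temps.std(ddof=1)
fitted = stats.kstest(temps, 'norm', args=(mu, sigma))

# Shapiro-Wilk test: H0 is that the sample comes from a normal distribution
shapiro_stat, shapiro_p = stats.shapiro(temps)

print(naive.statistic, naive.pvalue)
print(fitted.statistic, fitted.pvalue)
print(shapiro_stat, shapiro_p)
```

Because the parameters are estimated from the same sample, the p-value of the second `kstest` call is only approximate and tends to be too large.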

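Testing whether the data follow a t distribution

'''
Check whether the data follow a t distribution:
'''

np.random.seed(1)

# Fit a t distribution; fit returns (df, loc, scale)
ks = stats.t.fit(data['Temperature'])
df = ks[0]
loc = ks[1]
scale = ks[2]

# Draw a random sample from the fitted distribution and compare it with the data (two-sample KS test)
t_estm = stats.t.rvs(df=df, loc=loc, scale=scale, size=sample_size)
stats.ks_2samp(data['Temperature'], t_estm)

'''
pvalue = 0.4321464176976891 > 0.05, so we cannot reject the hypothesis that body temperature follows the fitted t distribution.
'''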

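Testing whether the data follow a chi-square distribution

'''
Check whether the data follow a chi-square distribution:
'''

np.random.seed(1)

# Fit a chi-square distribution; fit returns (df, loc, scale)
chi_square = stats.chi2.fit(data['Temperature'])
df = chi_square[0]
loc = chi_square[1]
scale = chi_square[2]

# Draw a random sample from the fitted distribution and compare it with the data (two-sample KS test)
chi_estm = stats.chi2.rvs(df=df, loc=loc, scale=scale, size=sample_size)
stats.ks_2samp(data['Temperature'], chi_estm)

'''
pvalue = 0.3956146564478842 > 0.05, so we cannot reject the hypothesis that body temperature follows the fitted chi-square distribution.
'''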
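The two checks above compare the data with a random sample drawn from the fitted distribution, so the result depends on the random seed. A variation, sketched here on synthetic data, is to run a one-sample KS test directly against the fitted distribution's CDF. The generated sample and its parameters are assumptions for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic heavy-tailed sample, for demonstration only
rng = np.random.default_rng(1)
temps = 98.25 + 0.6 * rng.standard_t(5, size=130)

# Fit a t distribution; params is (df, loc, scale)
params = stats.t.fit(temps)

# One-sample KS test against the fitted t distribution, with no random draw involved
result = stats.kstest(temps, 't', args=params)
print(params)
print(result.statistic, result.pvalue)
```

As with any KS test whose parameters were fitted on the same data, the p-value is approximate and tends to be too large.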

Plotting the fitted chi-square distribution

'''
Plot the chi-square distribution
'''

from matplotlib import pyplot as plt

plt.figure()
# Kernel density estimate of the observed temperatures
data['Temperature'].plot(kind='kde')
# Frozen chi-square distribution with the fitted (df, loc, scale)
chi2_distribution = stats.chi2(chi_square[0], chi_square[1], chi_square[2])
x = np.linspace(chi2_distribution.ppf(0.01), chi2_distribution.ppf(0.99), 100)
plt.plot(x, chi2_distribution.pdf(x), c='orange')
plt.xlabel('Human temperature')
plt.title('temperature on chi_square', size=20)
plt.legend(['test_data', 'chi_square'])

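1.3 Which body temperature values are outliers?

'''
Given that the temperature data follow the fitted chi-square distribution, we can use Python to compute
the distribution's values at P = 0.025 and P = 0.925 (ppf takes a one-sided cumulative probability).
Data falling outside these two values have low probability and are treated as outliers.
Note: 0.925 leaves 7.5% in the upper tail; a symmetric 95% range would use 0.975 as the upper cutoff.
'''

lower1 = chi2_distribution.ppf(0.025)  # lower cutoff
lower2 = chi2_distribution.ppf(0.925)  # upper cutoff (despite the variable name)
t = data['Temperature']
print(t[t<lower1])
print(t[t>lower2])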

Out:

0     96.3
1     96.7
65    96.4
66    96.7
67    96.8
Name: Temperature, dtype: float64
63      99.4
64      99.5
126     99.4
127     99.9
128    100.0
129    100.8
Name: Temperature, dtype: float64
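The cutoffs above depend on the chi-square fit being a good model. A different and widely used technique that needs no distribution assumption is Tukey's IQR rule, sketched below. This is a swapped-in alternative, not the article's method, and the ten values are made up for illustration.

```python
import numpy as np

# Made-up temperatures, for demonstration only
temps = np.array([96.3, 97.8, 98.0, 98.2, 98.3, 98.4, 98.6, 98.7, 98.8, 100.8])

# Tukey's rule: flag values more than 1.5 * IQR outside the quartiles
q1, q3 = np.percentile(temps, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = temps[(temps < lower_fence) | (temps > upper_fence)]
print(lower_fence, upper_fence)
print(outliers)
```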

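1.4 Is the difference between male and female body temperature significant?

'''
This is a hypothesis test on the difference between two population means. Because the question of
whether a difference exists has no direction, it is a two-sided test. The null and alternative hypotheses are:
H0: u1 - u2 = 0 (no significant difference)
H1: u1 - u2 != 0 (a significant difference)
'''

data.groupby(['Gender']).size()  # 65 samples in each group

male_df = data.loc[data['Gender'] == 1]
female_df = data.loc[data['Gender'] == 2]

'''
Use the library's built-in function; the p-value it returns is two-sided
'''

import scipy.stats

t, pval = scipy.stats.ttest_ind(male_df['Temperature'], female_df['Temperature'])
print(t, pval)

if pval > 0.05:
    print('Cannot reject the null hypothesis: no significant difference between male and female body temperature.')
else:
    print('Reject the null hypothesis: male and female body temperatures differ significantly.')

Out:

-2.2854345381654984 0.02393188312240236
Reject the null hypothesis: male and female body temperatures differ significantly.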

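1.5 How strong is the correlation between body temperature and heart rate (strong, weak, or moderate)?

'''
How strong is the correlation between body temperature and heart rate (strong, weak, or moderate)?
'''

heartrate_s = data['HeartRate']
temperature_s = data['Temperature']

from matplotlib import pyplot as plt

plt.scatter(heartrate_s, temperature_s)

# pearsonr returns (correlation coefficient, two-sided p-value), in that order
stat, p = stats.pearsonr(heartrate_s, temperature_s)
print('stat=%.3f, p=%.3f' % (stat, p))
print(stats.pearsonr(heartrate_s, temperature_s))

'''
The correlation coefficient is reported as 0.004, which would mean the two variables are uncorrelated.
Check that 0.004 is the first value returned by pearsonr (the coefficient) and not the second (the p-value):
a p-value of 0.004 would instead indicate a statistically significant correlation.
'''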
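Since the question asks for a verbal rating, here is a minimal sketch that keeps the coefficient and the p-value apart and maps the coefficient to a label. The helper `describe_strength`, its 0.3 and 0.7 thresholds (one common rule of thumb, not a standard), and the synthetic data are all assumptions for illustration.

```python
import numpy as np
from scipy import stats

def describe_strength(r):
    """Label |r| using assumed rule-of-thumb thresholds (hypothetical helper)."""
    a = abs(r)
    if a < 0.3:
        return 'weak'
    if a < 0.7:
        return 'moderate'
    return 'strong'

# Synthetic heart rate and temperature, for demonstration only
rng = np.random.default_rng(0)
heart = rng.normal(74, 7, 130)
temp = 98.25 + 0.03 * (heart - 74) + rng.normal(0, 0.7, 130)

r, p = stats.pearsonr(heart, temp)  # r: strength and direction, p: significance
print('r=%.3f (%s), p=%.3f' % (r, describe_strength(r), p))
```

The coefficient answers "how strong", while the p-value only answers "is it distinguishable from zero", so a weak correlation can still be statistically significant in a sample of 130.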
