Python CAPTCHA Recognition: Pillow + tesseract-ocr

Date: 2023-12-04 04:49:33

Introduction

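Recognizing an image CAPTCHA can be broken into several steps, usually implemented with the Pillow library or OpenCV. The steps are:

Grayscale conversion & binarization

Denoising

Character segmentation

Normalization

Recognition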

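Grayscale conversion: in the RGB model, a color with R = G = B is a shade of gray, and that common value is called the gray value. A grayscale image therefore needs only one byte per pixel to store the gray value (also called intensity or brightness), which ranges from 0 to 255.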
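To make the idea of a single gray value concrete, here is a minimal sketch of how one RGB pixel can be reduced to one byte. It approximates the ITU-R 601-2 luma weights that Pillow documents for converting to mode 'L'. The function name `rgb_to_gray` and the integer rounding are assumptions made for illustration, not Pillow's internal code.

```python
def rgb_to_gray(r, g, b):
    """Approximate the ITU-R 601-2 luma transform: L = 0.299R + 0.587G + 0.114B."""
    # Integer arithmetic with rounding; the three weights sum to 1000.
    return (r * 299 + g * 587 + b * 114 + 500) // 1000


# When R == G == B the result is that same value (a pure gray).
print(rgb_to_gray(100, 100, 100))
# A saturated red maps to a fairly dark gray because red has a low weight.
print(rgb_to_gray(255, 0, 0))
```

Green carries the largest weight because the eye is most sensitive to it, which is why a plain average of the three channels tends to look less natural.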

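Binarization: converts a grayscale image into a two-level image. Pixels whose gray value is above a chosen threshold are set to the maximum gray value, and pixels below it are set to the minimum.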
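The threshold does not have to be picked by hand. One common way to choose it automatically is Otsu's method, which is not used in this article; the sketch below is a plain-Python version given only as an illustration. It assumes a 256-entry histogram of gray levels, such as the list returned by Pillow's `Image.histogram()` for a mode 'L' image. The function name `otsu_threshold` is hypothetical.

```python
def otsu_threshold(histogram):
    """Return the gray level t that maximizes the between-class variance.

    `histogram` is a list of 256 pixel counts, one per gray level.
    Pixels with value <= t form the dark class, the rest the light class.
    """
    total = sum(histogram)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_t, best_var = 0, -1.0
    weight_dark = 0   # number of pixels in the dark class so far
    sum_dark = 0      # sum of gray values in the dark class so far
    for t, count in enumerate(histogram):
        weight_dark += count
        if weight_dark == 0:
            continue              # no dark pixels yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break                 # every pixel is already in the dark class
        sum_dark += t * count
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        between_var = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


# Synthetic histogram: 50 dark pixels at level 40, 50 light pixels at level 200.
hist = [0] * 256
hist[40] = 50
hist[200] = 50
print(otsu_threshold(hist))
```

Because the dark class here is defined as `<= t`, a binarization loop that tests `pixel < threshold` would use `t + 1` as its threshold.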

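Denoising: removes everything that is not needed, such as the background, interference lines and stray pixels, leaving only the characters to be recognized. The image becomes a binary dot matrix that is easy to feed into model training.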
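Character segmentation is listed among the steps but not shown in the article's code. One simple approach, assumed here for illustration, is a vertical projection over the binary dot matrix: a run of consecutive columns that contain black pixels is taken to be one character. The function name `split_columns` and the 0 = black, 1 = white convention are choices made for this sketch. It will not separate characters that touch or overlap.

```python
def split_columns(grid):
    """Return (start, end) column ranges that contain ink; `end` is exclusive.

    `grid` is a list of rows; 0 = black (ink), 1 = white (background).
    """
    width = len(grid[0])
    # A column "has ink" if any row is black in that column.
    has_ink = [any(row[x] == 0 for row in grid) for x in range(width)]

    ranges, start = [], None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x                    # a character begins
        elif not ink and start is not None:
            ranges.append((start, x))    # a character ends
            start = None
    if start is not None:                # character touching the right edge
        ranges.append((start, width))
    return ranges


grid = [
    [1, 0, 0, 1, 1, 0, 1],
    [1, 0, 1, 1, 1, 0, 1],
    [1, 0, 0, 1, 1, 0, 0],
]
print(split_columns(grid))  # [(1, 3), (5, 7)]
```

Each returned range can then be cropped out and resized to a common size, which is what the normalization step refers to.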

Grayscale conversion:

from PIL import Image  # used to open and process images


def img_to_gray(path):
    """
    Convert an image to grayscale
    :param path: path of the image file
    :return: the grayscale image
    """
    img = Image.open(path)
    img = img.convert('L')  # convert to grayscale
    img.show()  # display the image
    return img


path = '../files/verifyimg_edit_1.jpg'
im = img_to_gray(path)
path = path.replace('jpg', 'png')
im.save(path)  # save the image
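The `path.replace('jpg', 'png')` call above replaces every occurrence of the substring, so it would also rewrite a directory or file name that happens to contain "jpg". A small sketch of a safer alternative using the standard library's `pathlib` follows. The helper name `with_extension` and the example path are hypothetical.

```python
from pathlib import Path


def with_extension(path, ext):
    """Return `path` with only its file extension replaced by `ext` (e.g. '.png')."""
    return Path(path).with_suffix(ext).as_posix()


# str.replace also rewrites the directory name here; with_suffix does not.
print("jpg_files/captcha.jpg".replace("jpg", "png"))
print(with_extension("jpg_files/captcha.jpg", ".png"))
```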

Images before and after processing:

Input, the original image:

Output, the grayscale image:

Binarization:

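from PIL import Image  # used to open and process images


def processing_image(path):
    img = Image.open(path).convert('L')  # make sure each pixel is a single gray value
    pixdata = img.load()
    w, h = img.size
    threshold = 160  # this threshold does not suit every CAPTCHA; tune it for your images
    # Visit every pixel: below the threshold becomes black (0), the rest white (255)
    for y in range(h):
        for x in range(w):
            if pixdata[x, y] < threshold:
                pixdata[x, y] = 0
            else:
                pixdata[x, y] = 255
    return img


path = '../files/verifyimg_edit_1.png'  # the image already converted to grayscale
im = processing_image(path)
path = path.replace('png', 'jpeg')
im.save(path)  # note: JPEG is lossy, so saved values may not be exactly 0 or 255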
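The per-pixel loop above can also be written with Pillow's `Image.point`, which applies a function to every pixel value through a lookup table. The sketch below is one possible variant, not the article's method. The function name `binarize` and the in-memory demo image are assumptions chosen so the example runs without a file on disk.

```python
from PIL import Image


def binarize(img, threshold=160):
    """Return a black/white copy of `img`: values < threshold -> 0, otherwise 255."""
    gray = img.convert("L")
    return gray.point(lambda p: 0 if p < threshold else 255)


# Small in-memory image: one dark pixel and one light pixel.
demo = Image.new("L", (2, 1))
demo.putpixel((0, 0), 100)
demo.putpixel((1, 0), 200)
result = binarize(demo)
print(result.getpixel((0, 0)), result.getpixel((1, 0)))
```

Unlike the loop, this returns a new image and leaves the input untouched, which makes it easier to try several thresholds on the same source.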

Input, the grayscale image:

Output, the binarized image:

Denoising

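The script below works on a binary pixel map. For each interior pixel it counts how many of the 8 surrounding pixels have the same value; if fewer than N match, the pixel is treated as noise and set to white.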

from PIL import Image, ImageDraw

# Binary pixel map: (x, y) -> 1 (white) or 0 (black)
t2val = {}


def twoValue(image, G):
    """Fill t2val from a grayscale image. G: Integer, binarization threshold."""
    for y in range(image.size[1]):
        for x in range(image.size[0]):
            g = image.getpixel((x, y))
            t2val[(x, y)] = 1 if g > G else 0


# Compare each point A with its 8 neighbours. If fewer than N of them
# have the same value as A, A is treated as noise and set to white.
# N: Integer  noise tolerance, between 0 and 8
# Z: Integer  number of denoising passes
# t2val is modified in place; nothing is returned.
def clearNoise(image, N, Z):
    for i in range(Z):
        t2val[(0, 0)] = 1
        t2val[(image.size[0] - 1, image.size[1] - 1)] = 1
        for x in range(1, image.size[0] - 1):
            for y in range(1, image.size[1] - 1):
                L = t2val[(x, y)]
                # Count the neighbours that share this pixel's value
                nearDots = sum(
                    L == t2val[(x + dx, y + dy)]
                    for dx in (-1, 0, 1)
                    for dy in (-1, 0, 1)
                    if (dx, dy) != (0, 0)
                )
                if nearDots < N:
                    t2val[(x, y)] = 1


def saveImage(filename, size):
    """Draw t2val into a new 1-bit image and save it."""
    image = Image.new("1", size)
    draw = ImageDraw.Draw(image)
    for x in range(size[0]):
        for y in range(size[1]):
            draw.point((x, y), t2val[(x, y)])
    image.save(filename)


path = '../files/verifyimg_edit_二值化.jpg'  # the image already binarized
image = Image.open(path).convert('L')  # make sure getpixel returns a single gray value
twoValue(image, 100)
clearNoise(image, 2, 1)
path1 = '../files/verifyimg_edit_降噪11.jpg'
saveImage(path1, image.size)
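`clearNoise` updates the pixel map while it is still scanning it, so a pixel cleared early in the pass changes the neighbour counts of pixels visited later. The sketch below is an illustrative variant that reads from the original grid and writes to a copy, using a small list of rows instead of an image. The function name `remove_isolated` and the sample grid are assumptions made for this example.

```python
def remove_isolated(grid, min_same=2):
    """Return a copy of `grid` with isolated pixels turned white (1).

    `grid` is a list of rows; 0 = black, 1 = white. An interior pixel is
    kept only if at least `min_same` of its 8 neighbours share its value.
    Border pixels are copied unchanged.
    """
    height, width = len(grid), len(grid[0])
    cleaned = [row[:] for row in grid]   # write here, read from `grid`
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            same = sum(
                grid[y + dy][x + dx] == grid[y][x]
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
                if (dx, dy) != (0, 0)
            )
            if same < min_same:
                cleaned[y][x] = 1
    return cleaned


noisy = [
    [1, 1, 1, 1, 1],
    [1, 0, 1, 1, 1],   # lone black pixel: no matching neighbours
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
]
for row in remove_isolated(noisy):
    print(row)
```

The lone black pixel in the second row is cleared, while the 2x2 black block in the corner survives because each of its pixels has enough matching neighbours.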

输入二值化后的图片：

输出降噪后图片：

Text recognition

import pytesseract
from PIL import Image


def image_recognition(image):
    '''
    Text recognition
    :param image: PIL image to recognize
    :return: the recognized text
    '''
    pytesseract.pytesseract.tesseract_cmd = r"d:\Program Files\Tesseract-OCR\tesseract.exe"  # path to the Tesseract executable
    result = pytesseract.image_to_string(image)  # image to text
    print(result)
    return result


path = '../files/verifyimg_edit_降噪11.jpg'  # the image already denoised
image = Image.open(path)
image_recognition(image)

Output: 6032
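For a digits-only CAPTCHA like this one, it may help to tell Tesseract what to expect and to tidy up the returned string. The sketch below builds a config string for the `config` argument of `pytesseract.image_to_string` and filters the raw text. The helper names `build_tesseract_config` and `clean_ocr_text` are hypothetical, and the sample raw string is made up to show the cleanup. Whether the whitelist is honoured depends on the Tesseract version and engine in use.

```python
def build_tesseract_config(allowed="0123456789", psm=7):
    """Build a config string for pytesseract.image_to_string(..., config=...).

    --psm 7 asks Tesseract to treat the image as a single line of text;
    tessedit_char_whitelist restricts the characters it may output.
    """
    return f"--psm {psm} -c tessedit_char_whitelist={allowed}"


def clean_ocr_text(raw, allowed="0123456789"):
    """Drop whitespace and any character outside `allowed` from OCR output."""
    return "".join(ch for ch in raw if ch in allowed)


config = build_tesseract_config()
print(config)
# Hypothetical raw output, shown only to illustrate the cleanup step.
print(clean_ocr_text(" 60 32\n"))
# With Tesseract installed this could be used as:
#   text = pytesseract.image_to_string(image, config=config)
#   code = clean_ocr_text(text)
```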
