tesseract-ocr

2018-03-26

Chinese OCR with Python

https://github.com/tesseract-ocr/tesseract

https://digi.bib.uni-mannheim.de/tesseract/doc/

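The Tesseract OCR engine is an open-source project. It was originally hosted on Google Code, and its project home page is now at https://github.com/tesseract-ocr.
It supports Chinese OCR and provides a command-line tool. The corresponding Python package is pytesseract. With these tools we can recognize the text in an image.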

Installing Tesseract

https://github.com/tesseract-ocr/tesseract/wiki

https://github.com/UB-Mannheim/tesseract/wiki

https://jingyan.baidu.com/article/219f4bf788addfde442d38fe.html

After installation, the install directory needs to be added to the PATH environment variable.
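Whether the PATH setting took effect can be checked from Python. The following is a minimal sketch, assuming the executable is named tesseract (tesseract.exe on Windows); the helper name find_tesseract is hypothetical.

```python
import shutil


def find_tesseract(name="tesseract"):
    """Return the full path of the executable if it is on PATH, else None."""
    # shutil.which searches the directories listed in the PATH environment
    # variable (on Windows it also tries the extensions in PATHEXT, such as .exe).
    return shutil.which(name)


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("tesseract was not found on PATH")
    else:
        print("tesseract found at", path)
```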

Running Tesseract

Tesseract is a command-line program, so first open a terminal or a command prompt. The general usage is:

tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfile...]

Save the result to out.txt:

tesseract myscan.png out

Use several languages at once:

tesseract myscan.png out -l eng+deu
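The same commands can be driven from a script. The sketch below builds the argument list for the two invocations above and runs it with subprocess. The function names build_command and run_tesseract are hypothetical, and the sketch assumes tesseract is on PATH.

```python
import subprocess


def build_command(image, outputbase, langs=None):
    """Build the argument list for: tesseract imagename outputbase [-l lang]."""
    cmd = ["tesseract", image, outputbase]
    if langs:
        # Several languages are joined with "+", e.g. eng+deu
        cmd += ["-l", "+".join(langs)]
    return cmd


def run_tesseract(image, outputbase, langs=None):
    """Run tesseract and return the name of the text file it writes."""
    # check=True raises CalledProcessError if tesseract exits with an error
    subprocess.run(build_command(image, outputbase, langs), check=True)
    # Tesseract appends .txt to the output base name
    return outputbase + ".txt"


if __name__ == "__main__":
    print(build_command("myscan.png", "out"))
    print(build_command("myscan.png", "out", ["eng", "deu"]))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.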

Various other options

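1.jpg is the image 1.jpg in the current directory

1 is the output base name; Tesseract appends .txt to it, so the result is written to 1.txt

-l specifies the language packs to use

chi_sim is the Simplified Chinese pack, equ is the math equation pack, eng is the English pack

tesseract 1.jpg 1 -l chi_sim+equ+eng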

Other languages

Language data for other languages is available in the tessdata repository
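Each language corresponds to one .traineddata file in the tessdata directory, so the installed languages can be listed by scanning that directory. A minimal sketch follows; the function name installed_languages and the example directory are assumptions, and the real tessdata location depends on the installation.

```python
from pathlib import Path


def installed_languages(tessdata_dir):
    """Return the sorted language codes that have a .traineddata file."""
    root = Path(tessdata_dir)
    if not root.is_dir():
        return []
    # chi_sim.traineddata -> chi_sim
    return sorted(p.stem for p in root.glob("*.traineddata"))


if __name__ == "__main__":
    # Assumed location; adjust to your own installation
    print(installed_languages(r"C:\Program Files (x86)\Tesseract-OCR\tessdata"))
```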

Training a language

https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract

https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00

jTessBoxEditor

https://blog.csdn.net/cn_wk/article/details/52280567

http://www.bkjia.com/Pythonjc/1131343.html

https://github.com/tesseract-ocr/tesseract/wiki/AddOns

http://zdenop.github.io/qt-box-editor/

https://www.cnblogs.com/cnlian/p/5765871.html

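The overall workflow is: install jTessBoxEditor -> collect sample files -> merge the sample files -> generate the BOX file -> define the character configuration file -> correct the characters -> run the batch file -> put the generated traineddata into tessdata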
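The last step of this workflow, putting the generated traineddata into tessdata, is a plain file copy. A minimal sketch, assuming a hypothetical helper install_traineddata; the file and directory names passed to it are up to the reader.

```python
import shutil
from pathlib import Path


def install_traineddata(traineddata_file, tessdata_dir):
    """Copy a generated .traineddata file into tessdata and return the language name."""
    src = Path(traineddata_file)
    if src.suffix != ".traineddata":
        raise ValueError("expected a .traineddata file, got " + src.name)
    dest_dir = Path(tessdata_dir)
    if not dest_dir.is_dir():
        raise FileNotFoundError("tessdata directory not found: " + str(dest_dir))
    shutil.copy2(src, dest_dir / src.name)
    # The file name without the extension is the value to pass to -l
    return src.stem
```

The returned name is what is then passed to the -l option to select the newly trained language.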

https://jingyan.baidu.com/article/cdddd41c90544f53cb00e1c3.html

https://blog.csdn.net/woaipangruimao/article/details/78741022

python pytesseract

Pillow

https://pillow.readthedocs.org/

https://pypi.python.org/pypi/pytesseract

pip install pytesseract
pip install pillow

code:

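#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import pytesseract
from PIL import Image

# Path to the Tesseract executable (the Windows install location used here).
# This line is not needed if tesseract is already on the PATH.
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files (x86)\Tesseract-OCR\tesseract.exe'

# Open the image and recognize Simplified Chinese text
# (requires the chi_sim language data in tessdata)
image = Image.open('test.png')
text = pytesseract.image_to_string(image, lang='chi_sim')
print(text)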
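
As on the command line, several languages can be combined by joining their codes with +. The sketch below wraps the call above in a hypothetical helper ocr_image; lang_spec is also a hypothetical name. It assumes that the chi_sim and eng language data are installed and that tesseract is on PATH (otherwise set tesseract_cmd as above).

```python
def lang_spec(langs):
    """Join language codes into the form Tesseract expects, e.g. chi_sim+eng."""
    return "+".join(langs)


def ocr_image(path, langs=("chi_sim", "eng")):
    """Return the text recognized in the image at path."""
    # Imported here so that lang_spec can be used without the OCR packages installed
    import pytesseract
    from PIL import Image

    # The with block closes the image file once OCR has finished
    with Image.open(path) as image:
        return pytesseract.image_to_string(image, lang=lang_spec(langs))
```

For example, print(ocr_image('test.png')) prints the text recognized in the image used above, now with both Chinese and English enabled.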
