Trying out Tesseract-OCR training

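Locate characters → crop character images → recognize characters with kNN

That was the pipeline I had planned, but locating characters reliably turned out to be hard, so I'm giving Tesseract-OCR a try.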

OCR before training

f:id:reverent_f:20170110144043p:plain

Tesseract v3.04

$ tesseract number.png out
Tesseract Open Source OCR Engine v3.04.01 with Leptonica
Info in fopenReadFromMemory: work-around: writing to a temp file
$ cat out.txt 
5915§7WE €22

With the digits config specified

$ tesseract number.png out digits
Tesseract Open Source OCR Engine v3.04.01 with Leptonica
Info in fopenReadFromMemory: work-around: writing to a temp file
$ cat out.txt
. 957 3 522

Tesseract v4.00

$ tesseract number.png out
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Warning. Invalid resolution 0 dpi. Using 70 instead.
$ cat out.txt
oSsS6S?771e

Specifying tessedit_char_whitelist 0123456789 doesn't help either. Why?

Unsurprisingly, an untrained engine can barely read this. Specifying the OCR Engine mode gave the same result.
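For reference, here is a minimal sketch of the two usual ways to pass the whitelist mentioned above. It does not explain why the whitelist had no effect here; it only makes the invocation explicit. The file name digits_only and the helper write_digit_config are made up for this sketch, and I'm assuming a config file in the current directory is picked up when no file of that name exists under tessdata.

```shell
# Write a Tesseract config file that restricts output to digits.
# "digits_only" is a hypothetical file name chosen for this sketch.
write_digit_config() {
    printf 'tessedit_char_whitelist 0123456789\n' > "$1"
}

write_digit_config digits_only

# The config file goes last on the command line, like the stock "digits" config:
#   tesseract number.png out digits_only
# The same variable can also be set inline with -c:
#   tesseract number.png out -c tessedit_char_whitelist=0123456789
```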

Preparing the training data

Tesseract-OCR v4.00 is still a development version, so I'll try training with 3.04 instead.

Training Tesseract · tesseract-ocr/tesseract Wiki · GitHub

Tesseract-OCRの学習 (Tesseract-OCR training) - はだしの元さん

I can't get hold of the font itself, so for now I cropped the glyphs from a result image that was reasonably clean and undistorted.

f:id:reverent_f:20170110183150p:plain

The file naming convention is apparently

(3-letter language name).(font name (arbitrary)).exp(index number)

For now, I'll train with the language name vol and the font name digit.
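As a small sketch of that convention, the helper below assembles the base name from its three parts. training_basename is a hypothetical function of mine, not a Tesseract tool, and the 3-letter check simply follows the rule as stated above.

```shell
# Build the base name "<lang>.<font>.exp<index>" described above.
# training_basename is a hypothetical helper, not part of Tesseract.
training_basename() {
    # $1: 3-letter language name, $2: font name, $3: index number
    if [ "${#1}" -ne 3 ]; then
        echo "language name must be 3 letters: $1" >&2
        return 1
    fi
    printf '%s.%s.exp%s\n' "$1" "$2" "$3"
}

training_basename vol digit 0
```

With the names chosen in this post, this yields the base name used for vol.digit.exp0.png and vol.digit.exp0.box.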

Editing the .box file

Using jTessBoxEditor, I painstakingly define a box around each character.
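The .box file is plain text with one character per line, in the layout "symbol left bottom right top page", with the origin at the bottom-left of the image. Since it is easy to break a line while editing, a quick sanity check can help. The sketch below is my own; check_box_file is a hypothetical helper and only looks at the field count and the box orientation.

```shell
# check_box_file is a hypothetical helper: it flags lines that do not look like
# "<symbol> <left> <bottom> <right> <top> <page>".
check_box_file() {
    awk '
        NF != 6              { printf "line %d: expected 6 fields, got %d\n", NR, NF; bad = 1; next }
        $2 >= $4 || $3 >= $5 { printf "line %d: empty or inverted box\n", NR; bad = 1 }
        END                  { exit bad }
    ' "$1"
}

# Usage: check_box_file vol.digit.exp0.box
```

The function exits with a non-zero status if any line looks wrong, so it can be chained in front of the training commands.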

Creating the .tr file

Run:

$ tesseract vol.digit.exp0.png vol.digit.exp0 nobatch box.train.stderr

FAIL!
APPLY_BOXES: boxfile line 2/9 ((49,8),(83,44)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 3/6 ((89,8),(123,44)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 6/7 ((205,9),(231,39)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 7/1 ((235,9),(261,39)): FAILURE! Couldn't find a matching blob
APPLY_BOXES:
   Boxes read from boxfile:       8
   Boxes failed resegmentation:       4
   Found 4 good blobs.
   Leaving 3 unlabelled blobs in 0 words.

Apparently it fails when the characters are too close together. After several rounds of editing the image by trial and error:
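To check the spacing hypothesis, the horizontal gap between neighbouring boxes can be read straight off the .box file. Assuming the log prints each box as (left,bottom),(right,top), the failing boxes on lines 2 and 3 are 89 - 83 = 6 px apart, and those on lines 6 and 7 are 235 - 231 = 4 px apart. The sketch below prints that gap for every consecutive pair. box_gaps is a hypothetical helper, and it assumes all boxes sit on a single text line in left-to-right order.

```shell
# box_gaps is a hypothetical helper: for each pair of consecutive boxes it
# prints "<symbol> <next symbol> <gap in px>", where the gap is the next
# box's left edge minus the current box's right edge.
box_gaps() {
    awk 'NR > 1 { print prev_sym, $1, $2 - prev_right }
                { prev_sym = $1; prev_right = $4 }' "$1"
}

# Usage: box_gaps vol.digit.exp0.box
```

Note that these are gaps between the boxes as drawn, not between the glyph pixels themselves, so they are only a rough indicator.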

APPLY_BOXES:
   Boxes read from boxfile:       8
   Found 8 good blobs.
Generated training data for 1 words

Looks like it worked.

Creating the training data

Creating the unicharset file

$ unicharset_extractor vol.digit.exp0.box
-bash: unicharset_extractor: command not found

The training tools weren't installed, so I reinstall Tesseract with them.

$ brew uninstall tesseract
Uninstalling /usr/local/Cellar/tesseract/3.04.01_2... (77 files, 70.6M)
$ brew install --with-training-tools tesseract

$ unicharset_extractor vol.digit.exp0.box
Extracting unicharset from vol.digit.exp0.box
Wrote unicharset file ./unicharset.
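To avoid hitting "command not found" halfway through, the tools used in the remaining steps can be checked up front. This is just a sketch; require_tools is a hypothetical helper of mine.

```shell
# require_tools is a hypothetical helper: it reports every command that is
# missing from PATH and returns non-zero if any is missing.
require_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Tools used in this post's training steps:
#   require_tools tesseract unicharset_extractor mftraining cntraining combine_tessdata
```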

The font_properties file

$ echo "digit 0 0 0 0 0" > font_properties
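Each line of font_properties has the layout "fontname italic bold fixed serif fraktur", where every flag is 0 or 1, and the font name has to match the one in the training file name (digit here). The sketch below builds such a line with a bit of validation. font_properties_line is a hypothetical helper.

```shell
# font_properties_line is a hypothetical helper that prints one line in the
# "<fontname> <italic> <bold> <fixed> <serif> <fraktur>" layout (flags: 0 or 1).
font_properties_line() {
    name=$1
    shift
    if [ "$#" -ne 5 ]; then
        echo "expected 5 flags, got $#" >&2
        return 1
    fi
    for flag in "$@"; do
        case $flag in
            0|1) ;;
            *) echo "flag must be 0 or 1: $flag" >&2; return 1 ;;
        esac
    done
    printf '%s %s %s %s %s %s\n' "$name" "$@"
}

font_properties_line digit 0 0 0 0 0
```

Redirecting the output to font_properties gives the same file as the echo command above.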

Training

$ mftraining -F font_properties -U unicharset vol.digit.exp0.tr
Warning: No shape table file present: shapetable
Reading vol.digit.exp0.tr ...
Flat shape table summary: Number of shapes = 7 max unichars = 1 number with multiple unichars = 0
Warning: no protos/configs for Joined in CreateIntTemplates()
Warning: no protos/configs for |Broken|0|1 in CreateIntTemplates()
Done!

$ cntraining vol.digit.exp0.tr
Reading vol.digit.exp0.tr ...
Clustering ...

Writing normproto ...

At this point the required files seem to have been generated. Next, rename them so that each carries the language name as a prefix.

$ mv inttemp vol.inttemp
$ mv pffmtable vol.pffmtable 
$ mv shapetable vol.shapetable 
$ mv normproto vol.normproto
$ mv unicharset  vol.unicharset
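The five renames above can also be written as a loop, which is handy when repeating the training. A minimal sketch; prefix_training_files is a hypothetical helper and operates on the current directory.

```shell
# prefix_training_files is a hypothetical helper: it renames the training
# outputs in the current directory to "<lang>.<name>", as done by hand above.
prefix_training_files() {
    lang=$1
    for f in inttemp pffmtable shapetable normproto unicharset; do
        if [ -f "$f" ]; then
            mv "$f" "$lang.$f"
        else
            echo "not found, skipped: $f" >&2
        fi
    done
}

# Usage: prefix_training_files vol
```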

$ combine_tessdata vol.
Combining tessdata files
TessdataManager combined tesseract data files.
Offset for type  0 (vol.config                ) is -1
Offset for type  1 (vol.unicharset            ) is 140
Offset for type  2 (vol.unicharambigs         ) is -1
Offset for type  3 (vol.inttemp               ) is 641
Offset for type  4 (vol.pffmtable             ) is 128020
Offset for type  5 (vol.normproto             ) is 128115
Offset for type  6 (vol.punc-dawg             ) is -1
Offset for type  7 (vol.word-dawg             ) is -1
Offset for type  8 (vol.number-dawg           ) is -1
Offset for type  9 (vol.freq-dawg             ) is -1
Offset for type 10 (vol.fixed-length-dawgs    ) is -1
Offset for type 11 (vol.cube-unicharset       ) is -1
Offset for type 12 (vol.cube-word-dawg        ) is -1
Offset for type 13 (vol.shapetable            ) is 129137
Offset for type 14 (vol.bigram-dawg           ) is -1
Offset for type 15 (vol.unambig-dawg          ) is -1
Offset for type 16 (vol.params-model          ) is -1
Output vol.traineddata created successfully.

With that, vol.traineddata has finally been generated.

Move vol.traineddata into /usr/local/Cellar/tesseract/3.04.01_2/share/tessdata and try OCR.

Results

First, test on the image that was used for training: f:id:reverent_f:20170110183150p:plain

$ tesseract vol.digit.exp0.png -l vol output
Tesseract Open Source OCR Engine v3.04.01 with Leptonica
Info in fopenReadFromMemory: work-around: writing to a temp file
$ cat output.txt 
57712

The result is underwhelming: only 5 characters came out, although the box file defines 8.
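To put a number on "underwhelming", the OCR output can be compared against the labels in the .box file, since its first column holds the characters assigned in jTessBoxEditor. The sketch below is a rough check only. expected_text and count_matches are hypothetical helpers, the comparison is position by position rather than a proper edit distance, and it assumes all boxes are on one text line.

```shell
# expected_text is a hypothetical helper: it joins the first column of a .box
# file (the labels assigned in the box editor) into one string.
expected_text() {
    awk '{ printf "%s", $1 } END { print "" }' "$1"
}

# count_matches is a hypothetical helper: it counts the positions at which
# two strings agree, up to the length of the shorter one.
count_matches() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        n = (length(a) < length(b)) ? length(a) : length(b)
        for (i = 1; i <= n; i++)
            if (substr(a, i, 1) == substr(b, i, 1)) m++
        print m + 0
    }'
}

# Usage:
#   count_matches "$(expected_text vol.digit.exp0.box)" "$(cat output.txt)"
```

Because a dropped character shifts everything after it, a low count here can overstate the error; it is mainly useful for comparing one training run with the next.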

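Summary

I now understand the basic Tesseract training procedure, so I'll continue with the training.

If accuracy doesn't improve with more training, I'll probably end up either using v4.0 or implementing OCR myself.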
