Install Docker
$ unzip tesseract-ocr-compilation-master.zip
$ cd tesseract-ocr-compilation-master
To make debugging easier, set up a directory shared between the host and the container
$ vi scripts/3-run-new-container.sh
docker run -d -v debug_path:/home/tmp -p 4022:22 --name t4cmp tesseractshadow/tesseract4cmp
docker ps
After running step 3 to create the t4cmp container, install vi inside the container
$ docker exec -it t4cmp bash
# apt-get update
# apt-get install vim
Log in over ssh
# echo 'root:your_passwd' | chpasswd
# sed -i 's/#PermitRootLogin/PermitRootLogin/' /etc/ssh/sshd_config
# exit
$ docker stop t4cmp
$ docker start t4cmp
$ ssh root@localhost -p 4022
1. $ ./scripts/1-pull-container.sh Pull the Docker image tesseractshadow/tesseract4cmp
2. $ ./scripts/2-remove-container.sh Remove the t4cmp container
3. $ ./scripts/3-run-new-container.sh Create the t4cmp container
4. $ ./scripts/4-update-src.sh Update the Leptonica and Tesseract source
5. $ ./scripts/5-compile-src.sh Compile Leptonica and Tesseract
6. $ ./scripts/6-test-ocr.sh Run the OCR test; the files are in the ocr-files directory
7. $ ./scripts/7-build-pkg.sh Build the installation package in the pkg directory
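The numbered scripts can also be run back to back, stopping at the first one that fails. A minimal sketch, assuming the scripts are executable and sort correctly by their single-digit prefix; run_steps is a hypothetical helper and is not part of the repository:

```shell
#!/bin/sh
# run_steps DIR: run every N-*.sh script in DIR in numeric order,
# stopping at the first one that exits non-zero.
# (run_steps is a hypothetical helper, not part of the repository.)
run_steps() {
    dir=$1
    for script in "$dir"/[0-9]-*.sh; do
        [ -f "$script" ] || continue   # the glob matched nothing
        echo "running $script"
        "$script" || { echo "failed: $script" >&2; return 1; }
    done
}

# Usage (from the repository root):
#   run_steps ./scripts
```

Note that step 2 removes the existing t4cmp container, so running the whole sequence only makes sense when a fresh container is wanted.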
The Tesseract documentation, under "Training for Tesseract 4", describes how to train your own fonts
$ docker exec -it t4cmp bash
# cd /home/workspace
# apt update
# apt install ttf-mscorefonts-installer
The installer pauses at the following prompt
[More]
Progress: [ 54%] [###############################...........................]
Press Enter until the following appears
Do you accept the EULA license terms? [yes/no]
Type yes
# apt install fonts-dejavu
# fc-cache -vf
# cd
# mkdir ~/tesstutorial
# cd ~/tesstutorial
# mkdir langdata
# cd langdata
# wget https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master/radical-stroke.txt
# wget https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master/common.punc
# wget https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master/font_properties
# wget https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master/Latin.unicharset
# wget https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master/Latin.xheights
# mkdir eng
# cd eng
# wget https://raw.githubusercontent.com/tesseract-ocr/langdata/master/eng/eng.training_text
# wget https://raw.githubusercontent.com/tesseract-ocr/langdata/master/eng/eng.punc
# wget https://raw.githubusercontent.com/tesseract-ocr/langdata/master/eng/eng.numbers
# wget https://raw.githubusercontent.com/tesseract-ocr/langdata/master/eng/eng.wordlist
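The downloads above can be collapsed into a loop. A sketch, assuming a Latin-script language such as eng and the same file list as the wget commands above; langdata_urls and fetch_langdata are hypothetical helper names:

```shell
#!/bin/sh
# Base URLs taken from the wget commands above.
LSTM_BASE=https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master
LANG_BASE=https://raw.githubusercontent.com/tesseract-ocr/langdata/master

# langdata_urls LANG: print "relative-path URL" pairs, one per line.
langdata_urls() {
    lang=$1
    for f in radical-stroke.txt common.punc font_properties \
             Latin.unicharset Latin.xheights; do
        echo "$f $LSTM_BASE/$f"
    done
    for ext in training_text punc numbers wordlist; do
        echo "$lang/$lang.$ext $LANG_BASE/$lang/$lang.$ext"
    done
}

# fetch_langdata LANG DEST: download each file into DEST, keeping the layout.
fetch_langdata() {
    lang=$1
    dest=$2
    mkdir -p "$dest/$lang"
    langdata_urls "$lang" | while read -r path url; do
        wget -q -O "$dest/$path" "$url" || echo "failed: $url" >&2
    done
}

# Usage:
#   fetch_langdata eng ~/tesstutorial/langdata
```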
# cd ~/tesstutorial
# git clone --depth 1 https://github.com/tesseract-ocr/tesseract.git
# cd tesseract/tessdata
# wget https://github.com/tesseract-ocr/tessdata/raw/master/eng.traineddata
# mkdir best
# cd best
# wget https://github.com/tesseract-ocr/tessdata_best/raw/master/eng.traineddata
# wget https://github.com/tesseract-ocr/tessdata_best/raw/master/chi_tra.traineddata
# cd ~/tesstutorial/tesseract/
# src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
--noextract_font_properties --langdata_dir ../langdata \
--tessdata_dir ./tessdata --output_dir ~/tesstutorial/engtrain
The following error appears
ERROR: /tmp/eng-2021-03-26.coV/eng.Tex_Gyre_Bonum_Bold.exp0.box does not exist or is not readable
# vi src/training/language-specific.sh
Remove the Tex Gyre* fonts
Run the command again; the following two lines indicate success
Created starter traineddata for LSTM training of language 'eng'
Run 'lstmtraining' command to continue LSTM training for language 'eng'
Generate data with a different font, to be used for evaluation
# src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
--noextract_font_properties --langdata_dir ../langdata \
--tessdata_dir ./tessdata \
--fontlist "Impact Condensed" --output_dir ~/tesstutorial/engeval
Train from scratch
# mkdir -p ~/tesstutorial/engoutput
# /usr/local/bin/lstmtraining --debug_interval 0 \
--traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
--net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
--model_output ~/tesstutorial/engoutput/base --learning_rate 20e-4 \
--train_listfile ~/tesstutorial/engtrain/eng.training_files.txt \
--eval_listfile ~/tesstutorial/engeval/eng.training_files.txt \
--max_iterations 5000 &>~/tesstutorial/engoutput/basetrain.log
Because training runs inside Docker, --debug_interval is set to 0. To see the detailed training progress, set --debug_interval to 100
# make training
# make training-install
# make ScrollView.jar
O1c111 means there are 111 characters; the {eng}.unicharset file contains 111 lines
# /usr/local/bin/lstmeval --model ~/tesstutorial/engoutput/base_checkpoint \
--traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
--eval_listfile ~/tesstutorial/engeval/eng.training_files.txt
Training from scratch yourself gives far too high an error rate
At iteration 0, stage 0, Eval Char error rate=107.795608, Word error rate=97.578246
# /usr/local/bin/lstmeval --model ~/tesstutorial/tesseract/tessdata/best/eng.traineddata \
--eval_listfile ~/tesstutorial/engeval/eng.training_files.txt
The model trained by others does far better
At iteration 0, stage 0, Eval Char error rate=3.095477, Word error rate=9.465216
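To compare runs without reading the logs by eye, the two rates can be pulled out of lstmeval's output. A small sketch that only assumes the line format shown above; eval_rates is a hypothetical helper name:

```shell
#!/bin/sh
# eval_rates: read lstmeval output on stdin and print "char word" error
# rates for every line that matches the format shown above.
eval_rates() {
    sed -n 's/.*Eval Char error rate=\([0-9.]*\), Word error rate=\([0-9.]*\).*/\1 \2/p'
}

# Usage, with the lstmeval output saved to a (hypothetical) log file:
#   eval_rates < best_eval.log
```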
Extract a trainable LSTM model from the best model
# mkdir -p ~/tesstutorial/impact_from_full
# /usr/local/bin/combine_tessdata -e ~/tesstutorial/tesseract/tessdata/best/eng.traineddata \
~/tesstutorial/impact_from_full/eng.lstm
Fine-tune
# /usr/local/bin/lstmtraining --model_output ~/tesstutorial/impact_from_full/impact \
--continue_from ~/tesstutorial/impact_from_full/eng.lstm \
--traineddata ~/tesstutorial/tesseract/tessdata/best/eng.traineddata \
--train_listfile ~/tesstutorial/engeval/eng.training_files.txt \
--max_iterations 400
# /usr/local/bin/lstmeval --model ~/tesstutorial/impact_from_full/impact_checkpoint \
--traineddata ~/tesstutorial/tesseract/tessdata/best/eng.traineddata \
--eval_listfile ~/tesstutorial/engeval/eng.training_files.txt
The evaluation result is very good
At iteration 0, stage 0, Eval Char error rate=0.000000, Word error rate=0.000000
Export a usable model
# /usr/local/bin/lstmtraining --stop_training \
--continue_from ~/tesstutorial/impact_from_full/impact_0.271000_43_400.checkpoint \
--traineddata ~/tesstutorial/tesseract/tessdata/best/eng.traineddata \
--model_output ~/tesstutorial/impact_from_full/eng.traineddata
# cp ~/tesstutorial/impact_from_full/eng.traineddata /usr/local/share/tessdata/eng_a.traineddata
# tesseract phototest.tif phototest -l eng_a --psm 1 --oem 1
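The checkpoint passed to --continue_from above has to be picked by hand. A sketch that chooses the one with the lowest value in its name, on the assumption (not confirmed by these notes) that names follow PREFIX_<error>_<n>_<iteration>.checkpoint as in impact_0.271000_43_400.checkpoint; best_checkpoint is a hypothetical helper:

```shell
#!/bin/sh
# best_checkpoint PREFIX: print the PREFIX_*.checkpoint file whose name
# carries the lowest error value.
# ASSUMPTION: names look like PREFIX_<error>_<n>_<iteration>.checkpoint,
# as in impact_0.271000_43_400.checkpoint.
best_checkpoint() {
    prefix=$1
    for f in "$prefix"_*_*_*.checkpoint; do
        [ -f "$f" ] || continue   # the glob matched nothing
        base=${f%.checkpoint}
        # The third underscore-separated field from the end is the error value.
        err=$(echo "$base" | awk -F_ '{print $(NF-2)}')
        echo "$err $f"
    done | LC_ALL=C sort -n | head -n 1 | cut -d' ' -f2-
}

# Usage:
#   best_checkpoint ~/tesstutorial/impact_from_full/impact
```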
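Prepare your own tif and box files. Apart from the extension, the two file names must be identical, starting with lang. and ending with .exp0.
For example eng.AAA.exp0.tif and eng.AAA.exp0.box, placed in /home/tmp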
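Before training it can help to confirm that every tif has its box and the reverse. A minimal sketch of that check under the naming rule above; check_pairs is a hypothetical helper name:

```shell
#!/bin/sh
# check_pairs DIR LANG: report every LANG.*.exp0.tif in DIR that has no
# matching .box file, and every .box that has no matching .tif.
# Returns non-zero if any file is unpaired.
check_pairs() {
    dir=$1
    lang=$2
    rc=0
    for tif in "$dir/$lang".*.exp0.tif; do
        [ -f "$tif" ] || continue
        [ -f "${tif%.tif}.box" ] || { echo "missing box for $tif"; rc=1; }
    done
    for box in "$dir/$lang".*.exp0.box; do
        [ -f "$box" ] || continue
        [ -f "${box%.box}.tif" ] || { echo "missing tif for $box"; rc=1; }
    done
    return $rc
}

# Usage:
#   check_pairs /home/tmp eng
```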
Modify tesstrain.sh to remove the phase_I_generate_image 8 call; the command below runs the modified script as tesstrain_a.sh
# src/training/tesstrain_a.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
--my_boxtiff_dir /home/tmp \
--noextract_font_properties --langdata_dir ../langdata \
--tessdata_dir ./tessdata --output_dir ~/tesstutorial/platetrain
# /usr/local/bin/lstmtraining --model_output ~/tesstutorial/impact_from_full/impact \
--continue_from ~/tesstutorial/impact_from_full/eng.lstm \
--traineddata ~/tesstutorial/tesseract/tessdata/best/eng.traineddata \
--train_listfile ~/tesstutorial/platetrain/eng.training_files.txt \
--max_iterations 400