
Friday, March 26, 2021

Training Tesseract in Docker

The Compiling and Installation section of the Tesseract documentation describes how to use Docker.

Install Docker, then unpack the compilation scripts:
 $ unzip tesseract-ocr-compilation-master.zip
 $ cd tesseract-ocr-compilation-master

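To make debugging easier, mount a shared directory into the container (debug_path stands for a directory on the host):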
 $ vi scripts/3-run-new-container.sh
docker run -d -v debug_path:/home/tmp -p 4022:22 --name t4cmp tesseractshadow/tesseract4cmp
docker ps
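
The placeholder debug_path deserves some care: when the source given to -v is not an absolute path, Docker treats it as a named volume rather than a host directory. The following is a minimal sketch, not part of the original scripts, that prints the docker run line only for an absolute host path. The function name t4cmp_run_cmd and the example path are hypothetical.

```shell
# Print the "docker run" line for step 3 with a host directory mounted at /home/tmp.
# Assumption: host_dir is whichever host directory you want to share for debugging.
t4cmp_run_cmd() {
  local host_dir="$1"
  case "$host_dir" in
    /*) ;;  # absolute path: Docker bind-mounts this host directory
    *)
      echo "host directory must be an absolute path: $host_dir" >&2
      return 1
      ;;
  esac
  printf 'docker run -d -v %s:/home/tmp -p 4022:22 --name t4cmp tesseractshadow/tesseract4cmp\n' "$host_dir"
}
```

For example, t4cmp_run_cmd /home/me/t4debug prints the line to paste into scripts/3-run-new-container.sh.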

After running step 3 (see the list below) to create the t4cmp container, install vi inside the container:
$ docker exec -it t4cmp bash
# apt-get update
# apt-get install vim

To log in over SSH, set a root password and enable root login:
# echo 'root:your_passwd' | chpasswd
# sed -i 's/#PermitRootLogin/PermitRootLogin/' /etc/ssh/sshd_config
# exit
$ docker stop t4cmp
$ docker start t4cmp
$ ssh root@localhost -p 4022

1. $ ./scripts/1-pull-container.sh downloads the Docker image tesseractshadow/tesseract4cmp
2. $ ./scripts/2-remove-container.sh removes the t4cmp container
3. $ ./scripts/3-run-new-container.sh creates the t4cmp container
4. $ ./scripts/4-update-src.sh updates the Leptonica and Tesseract sources
5. $ ./scripts/5-compile-src.sh compiles Leptonica and Tesseract
6. $ ./scripts/6-test-ocr.sh runs the tests; the files are in the ocr-files directory
7. $ ./scripts/7-build-pkg.sh builds the installation packages in the pkg directory
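
The numbered scripts can also be chained so that a failure stops the sequence. This is only a sketch: the helper run_steps is hypothetical, and it assumes the scripts sit in one directory and are named N-*.sh as in the list above.

```shell
# Run the numbered helper scripts in the given order and stop at the first failure.
# Assumption: each step N is a file named N-*.sh inside the given directory.
run_steps() {
  local dir="$1"
  shift
  local step script candidate
  for step in "$@"; do
    script=""
    # Resolve e.g. "3" to <dir>/3-run-new-container.sh
    for candidate in "$dir/$step"-*.sh; do
      if [ -f "$candidate" ]; then
        script="$candidate"
        break
      fi
    done
    if [ -z "$script" ]; then
      echo "step $step: no script found in $dir" >&2
      return 1
    fi
    echo "== step $step: $script"
    if ! bash "$script"; then
      echo "step $step failed" >&2
      return 1
    fi
  done
}
```

Since vi and SSH are set up by hand after step 3, one way to use it is run_steps scripts 1 2 3, then the manual setup, then run_steps scripts 4 5 6 7.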

The Training for Tesseract 4 section of the Tesseract documentation describes how to train on your own characters.
$ docker exec -it t4cmp bash
# cd /home/workspace
# apt update
# apt install ttf-mscorefonts-installer
The installation pauses at the following prompt:
[More]
Progress: [ 54%] [###############################...........................]
Press Enter until the following appears:
Do you accept the EULA license terms? [yes/no]
Type yes.
# apt install fonts-dejavu
# fc-cache -vf
# cd
# mkdir ~/tesstutorial
# cd ~/tesstutorial
# mkdir langdata
# cd langdata
# wget https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master/radical-stroke.txt
# wget https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master/common.punc
# wget https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master/font_properties
# wget https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master/Latin.unicharset
# wget https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master/Latin.xheights
# mkdir eng
# cd eng
# wget https://raw.githubusercontent.com/tesseract-ocr/langdata/master/eng/eng.training_text
# wget https://raw.githubusercontent.com/tesseract-ocr/langdata/master/eng/eng.punc
# wget https://raw.githubusercontent.com/tesseract-ocr/langdata/master/eng/eng.numbers
# wget https://raw.githubusercontent.com/tesseract-ocr/langdata/master/eng/eng.wordlist
# cd ~/tesstutorial
# git clone --depth 1 https://github.com/tesseract-ocr/tesseract.git
# cd tesseract/tessdata
# wget https://github.com/tesseract-ocr/tessdata/raw/master/eng.traineddata
# mkdir best
# cd best
# wget https://github.com/tesseract-ocr/tessdata_best/raw/master/eng.traineddata
# wget https://github.com/tesseract-ocr/tessdata_best/raw/master/chi_tra.traineddata
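
The langdata downloads above follow a regular URL pattern, so the list can be generated rather than typed. A small sketch, assuming the repository layout is the same as in the wget commands above; the function name langdata_urls is hypothetical.

```shell
# Print the langdata download URLs used above, one per line.
# Assumption: the raw.githubusercontent.com layout matches the wget commands above.
langdata_urls() {
  local lang="$1"
  local base="https://raw.githubusercontent.com/tesseract-ocr"
  local f
  # Files shared by all languages (langdata_lstm repository)
  for f in radical-stroke.txt common.punc font_properties Latin.unicharset Latin.xheights; do
    echo "$base/langdata_lstm/master/$f"
  done
  # Language-specific files (langdata repository)
  for f in training_text punc numbers wordlist; do
    echo "$base/langdata/master/$lang/$lang.$f"
  done
}
```

The output can be piped to wget, for example langdata_urls eng | grep -v '/eng/' | wget -nc -P ~/tesstutorial/langdata -i - for the shared files, and the same with grep '/eng/' and -P ~/tesstutorial/langdata/eng for the language files.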

# cd ~/tesstutorial/tesseract/
# src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
  --noextract_font_properties --langdata_dir ../langdata \
  --tessdata_dir ./tessdata --output_dir ~/tesstutorial/engtrain
The following error appears:
ERROR: /tmp/eng-2021-03-26.coV/eng.Tex_Gyre_Bonum_Bold.exp0.box does not exist or is not readable
# vi src/training/language-specific.sh
Delete the Tex Gyre* fonts from the font list.
Run the command again; the following two lines indicate success:
Created starter traineddata for LSTM training of language 'eng'
Run 'lstmtraining' command to continue LSTM training for language 'eng'
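
The manual removal of the Tex Gyre fonts can be approximated with a line filter. This is a sketch under a loud assumption: each font name in language-specific.sh sits on its own line, so dropping whole lines is safe. The function name drop_tex_gyre is hypothetical; check the result before training.

```shell
# Drop every line that names a Tex Gyre font; all other lines pass through unchanged.
# Assumption: each font name occupies its own line in the font list.
drop_tex_gyre() {
  # grep exits with status 1 when no line is left over; treat that as success too.
  grep -v -i 'tex gyre' || true
}
```

Working from a backup keeps the original recoverable: cp src/training/language-specific.sh /tmp/language-specific.sh.orig, then drop_tex_gyre < /tmp/language-specific.sh.orig > src/training/language-specific.sh.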

Generate data with a different font to use for evaluation:
# src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
  --noextract_font_properties --langdata_dir ../langdata \
  --tessdata_dir ./tessdata \
  --fontlist "Impact Condensed" --output_dir ~/tesstutorial/engeval

Train from scratch:
# mkdir -p ~/tesstutorial/engoutput
# /usr/local/bin/lstmtraining --debug_interval 0 \
  --traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
  --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
  --model_output ~/tesstutorial/engoutput/base --learning_rate 20e-4 \
  --train_listfile ~/tesstutorial/engtrain/eng.training_files.txt \
  --eval_listfile ~/tesstutorial/engeval/eng.training_files.txt \
  --max_iterations 5000 &>~/tesstutorial/engoutput/basetrain.log

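Because this runs inside Docker, --debug_interval is set to 0. To watch the training process in detail, set --debug_interval to 100, which needs the training tools and ScrollView.jar to be built: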
# make training
# make training-install
# make ScrollView.jar

O1c111 means the output layer has 111 classes (characters); the {eng}.unicharset file contains 111 entries.
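
The relation between the O1c number and the unicharset can be checked mechanically. The sketch below assumes the unicharset file stores its entry count on the first line; the function names are hypothetical, and so is any path you pass in (for instance an eng.unicharset under ~/tesstutorial/engtrain).

```shell
# Extract N from the output layer "O1cN" of a --net_spec string.
net_spec_classes() {
  printf '%s\n' "$1" | sed -n 's/.*O1c\([0-9][0-9]*\).*/\1/p'
}

# Read the entry count from a unicharset file.
# Assumption: the count is stored on the first line of the file.
unicharset_size() {
  head -n 1 "$1" | tr -d '[:space:]'
}

# Compare the two numbers and report the result.
check_classes() {
  local spec="$1" file="$2" want have
  want=$(net_spec_classes "$spec")
  have=$(unicharset_size "$file")
  if [ -n "$want" ] && [ "$want" = "$have" ]; then
    echo "ok: $want classes"
  else
    echo "mismatch: net_spec=$want unicharset=$have" >&2
    return 1
  fi
}
```

Calling check_classes with the --net_spec string used above and the path to the unicharset file prints "ok: 111 classes" when the two agree.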

# /usr/local/bin/lstmeval --model ~/tesstutorial/engoutput/base_checkpoint \
  --traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
  --eval_listfile ~/tesstutorial/engeval/eng.training_files.txt
Training from scratch on my own gives far too high an error rate:
At iteration 0, stage 0, Eval Char error rate=107.795608, Word error rate=97.578246

# /usr/local/bin/lstmeval --model ~/tesstutorial/tesseract/tessdata/best/eng.traineddata \
  --eval_listfile ~/tesstutorial/engeval/eng.training_files.txt
The model trained by others is far better:
At iteration 0, stage 0, Eval Char error rate=3.095477, Word error rate=9.465216
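
To compare models without reading the numbers by eye, the two rates can be pulled out of the lstmeval result line. A minimal sketch that relies only on the line format shown above; the function names and the threshold are hypothetical.

```shell
# Print "<char error rate> <word error rate>" from an lstmeval result line on stdin.
eval_rates() {
  sed -n 's/.*Eval Char error rate=\([0-9.]*\), Word error rate=\([0-9.]*\).*/\1 \2/p'
}

# Succeed when the character error rate read from stdin is below the given threshold.
char_rate_below() {
  eval_rates | awk -v max="$1" 'NF { ok = ($1 + 0 < max + 0) } END { exit (ok ? 0 : 1) }'
}
```

The lstmeval output may go to stderr, so a typical use would redirect it first, for example lstmeval ... 2>&1 | eval_rates.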

Extract a trainable LSTM model from the best model:
# mkdir -p ~/tesstutorial/impact_from_full
# /usr/local/bin/combine_tessdata -e ~/tesstutorial/tesseract/tessdata/best/eng.traineddata \
  ~/tesstutorial/impact_from_full/eng.lstm

Fine-tune:
# /usr/local/bin/lstmtraining --model_output ~/tesstutorial/impact_from_full/impact \
  --continue_from ~/tesstutorial/impact_from_full/eng.lstm \
  --traineddata ~/tesstutorial/tesseract/tessdata/best/eng.traineddata \
  --train_listfile ~/tesstutorial/engeval/eng.training_files.txt \
  --max_iterations 400
# /usr/local/bin/lstmeval --model ~/tesstutorial/impact_from_full/impact_checkpoint \
  --traineddata ~/tesstutorial/tesseract/tessdata/best/eng.traineddata \
  --eval_listfile ~/tesstutorial/engeval/eng.training_files.txt
The evaluation result is very good (note that the evaluation list is the same one used for fine-tuning):
At iteration 0, stage 0, Eval Char error rate=0.000000, Word error rate=0.000000

Export a usable model:
# /usr/local/bin/lstmtraining --stop_training \
  --continue_from ~/tesstutorial/impact_from_full/impact_0.271000_43_400.checkpoint \
  --traineddata ~/tesstutorial/tesseract/tessdata/best/eng.traineddata \
  --model_output ~/tesstutorial/impact_from_full/eng.traineddata
# cp ~/tesstutorial/impact_from_full/eng.traineddata /usr/local/share/tessdata/eng_a.traineddata
# tesseract phototest.tif phototest -l eng_a --psm 1 --oem 1

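Prepare your own tif and box files. Apart from the extension, the two share the same file name, which starts with lang. and ends with .exp0
For example, eng.AAA.exp0.tif and eng.AAA.exp0.box, placed in /home/tmp
Modify a copy of tesstrain.sh (tesstrain_a.sh in the command below), removing the phase_I_generate_image 8 call: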
# src/training/tesstrain_a.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
  --my_boxtiff_dir /home/tmp \
  --noextract_font_properties --langdata_dir ../langdata \
  --tessdata_dir ./tessdata --output_dir ~/tesstutorial/platetrain
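
Before running the command above, the naming rule for the tif/box pairs can be checked automatically. A sketch with a hypothetical helper name, check_boxtiff_dir; it only tests file names and the presence of the matching .box file, not the contents.

```shell
# Report every .tif in a directory that breaks the naming rule or lacks a .box twin.
# Rule checked: <lang>.<name>.exp0.tif next to <lang>.<name>.exp0.box
check_boxtiff_dir() {
  local dir="$1" lang="$2" tif base bad=0 found=0
  for tif in "$dir"/*.tif; do
    [ -e "$tif" ] || continue   # the glob matched nothing
    found=1
    base=$(basename "$tif" .tif)
    case "$base" in
      "$lang".*.exp0) ;;        # name follows the rule
      *)
        echo "bad name: $tif" >&2
        bad=1
        continue
        ;;
    esac
    if [ ! -f "$dir/$base.box" ]; then
      echo "missing box: $dir/$base.box" >&2
      bad=1
    fi
  done
  # Succeed only if at least one tif was found and none was flagged.
  [ "$found" -eq 1 ] && [ "$bad" -eq 0 ]
}
```

For the layout described here the call would be check_boxtiff_dir /home/tmp eng; problems are listed on stderr and the exit status is non-zero when any are found.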

# /usr/local/bin/lstmtraining --model_output ~/tesstutorial/impact_from_full/impact \
  --continue_from ~/tesstutorial/impact_from_full/eng.lstm \
  --traineddata ~/tesstutorial/tesseract/tessdata/best/eng.traineddata \
  --train_listfile ~/tesstutorial/platetrain/eng.training_files.txt \
  --max_iterations 400

Thursday, March 11, 2021

pyinstaller

See the usage documentation.

(tensorflow) D:\PlateOcr> pyinstaller --add-binary d:\your_path\opencv_ffmpeg340_64.dll;opencv_ffmpeg340_64.dll --hidden-import opencv-python --hidden-import cv2 --hidden-import another --paths D:\your_path_to_opencv_lib:D:\your_path_to_cv2.xxx.pyd -F PlateOcrEval.py

The command that finally worked in testing:
(tensorflow) D:\PlateOcr> pyinstaller --hidden-import cv2 --paths D:\your_path_to_opencv_lib:D:\your_path_to_cv2.xxx.pyd -F PlateOcrEval.py