pkgsrc-Changes-HG archive


[pkgsrc/trunk]: pkgsrc/graphics/tesseract tesseract: updated to 4.0.0



details:   https://anonhg.NetBSD.org/pkgsrc/rev/c69a62c63054
branches:  trunk
changeset: 386916:c69a62c63054
user:      adam <adam%pkgsrc.org@localhost>
date:      Sat Nov 03 09:13:07 2018 +0000

description:
tesseract: updated to 4.0.0

V4.0.0:
New OCR engine
- Added a new OCR engine that uses a neural network system based on LSTMs, with major accuracy gains.
- This includes new training tools for the LSTM OCR engine. A new model can be trained from scratch or by fine-tuning an existing model.
- Added trained data that includes LSTM models for 123 languages.
- Added optional accelerated code paths for the LSTM recognizer:
  * Using OpenMP
  * Using SIMD: AVX2 / AVX / SSE4.1
- Added a new parameter lstm_choice_mode that allows alternative symbol choices to be included in the hOCR output.
- The new LSTM engine still does not support all features from the old legacy engine (see missing features).
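
As a rough illustration of the lstm_choice_mode item above, the following Python sketch assembles a tesseract command line that sets the parameter with -c and requests hOCR output through the hocr config file. The helper name hocr_with_choices_cmd, the file names, and the value 2 are illustrative assumptions, not taken from the release notes. The command is only built here, not executed.

```python
def hocr_with_choices_cmd(image, out_base, choice_mode=2, lang="eng"):
    """Build (but do not run) a tesseract command that writes hOCR output.

    choice_mode is passed through as the lstm_choice_mode config variable;
    the default of 2 is an assumption made for this sketch.
    """
    if choice_mode < 0:
        raise ValueError("choice_mode must be non-negative")
    return [
        "tesseract", image, out_base,
        "-l", lang,                               # language of the traineddata to load
        "-c", f"lstm_choice_mode={choice_mode}",  # set a config variable
        "hocr",                                   # config file that enables hOCR output
    ]


cmd = hocr_with_choices_cmd("page.png", "page")
print(" ".join(cmd))
```

The resulting list can be handed to subprocess.run() on a system where tesseract 4 and the matching traineddata are installed.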

Other OCR engines
- The pattern matching OCR engine that was the primary OCR engine in previous versions is still available in this version.
- Removed the 'Cube' OCR engine from the codebase. It was used for Hindi and for Arabic. The new LSTM engine performs much better, so the Cube engine is no longer needed.
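
Since both the legacy and the LSTM engine are available, the engine can be chosen per run. The sketch below, in Python, maps the --oem values of the tesseract 4 command line to names and builds a command for a chosen mode. The class and function names and the file names are hypothetical. It also assumes the installed traineddata contains a legacy model when the legacy engine is requested.

```python
from enum import IntEnum


class OcrEngineMode(IntEnum):
    """Values accepted by the --oem option of the tesseract 4 command line."""
    LEGACY_ONLY = 0      # pattern matching engine from earlier versions
    LSTM_ONLY = 1        # new neural network engine
    LEGACY_AND_LSTM = 2  # both engines combined
    DEFAULT = 3          # whatever the traineddata makes available


def engine_cmd(image, out_base, mode=OcrEngineMode.DEFAULT):
    """Build (but do not run) a tesseract command for the given engine mode."""
    mode = OcrEngineMode(mode)  # raises ValueError for values outside 0..3
    return ["tesseract", image, out_base, "--oem", str(int(mode))]


legacy = engine_cmd("page.png", "page", OcrEngineMode.LEGACY_ONLY)
lstm = engine_cmd("page.png", "page", OcrEngineMode.LSTM_ONLY)
print(" ".join(legacy))
print(" ".join(lstm))
```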

Updated build system
- Tesseract now uses semantic versioning.
- Tesseract now requires Leptonica 1.74.0 or a higher version.
- Building Tesseract from source code requires a compiler with good C++11 support. Upstream maintains a list of officially supported compilers.
- Added unit tests to the main repo. The unit tests require Git submodules and the code for training.
- Added an option to compile Tesseract without the code of the legacy OCR engine.
- Updated the minimum required autoconf version to 2.63.
- Updated the minimum required versions of the training tools dependencies: ICU 52.1, Pango 1.22.0.
- Reorganized Tesseract's source tree. Most sources are now below the src directory.
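
The minimum versions listed above can be checked mechanically before attempting a build. The following Python sketch compares dotted version strings component by component. The MINIMUMS table repeats the numbers from the list, while the function names and the sample installed versions are made up for illustration. It assumes purely numeric version components.

```python
# Minimum versions as stated in the release notes above.
MINIMUMS = {
    "leptonica": "1.74.0",
    "autoconf": "2.63",
    "icu": "52.1",
    "pango": "1.22.0",
}


def parse_version(text):
    """Turn '1.74.0' into (1, 74, 0); assumes numeric components only."""
    return tuple(int(part) for part in text.split("."))


def meets_minimum(name, installed):
    """Return True if the installed version is at least the required one."""
    have = parse_version(installed)
    need = parse_version(MINIMUMS[name])
    # Pad the shorter tuple with zeros so that '1.22' compares equal to '1.22.0'.
    width = max(len(have), len(need))
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    return have >= need


# Hypothetical installed versions, for illustration only.
for name, installed in [("leptonica", "1.76.0"), ("autoconf", "2.69"), ("pango", "1.20.5")]:
    status = "ok" if meets_minimum(name, installed) else "too old"
    print(f"{name} {installed}: {status}")
```

Comparing integer tuples rather than strings matters here: as plain strings, "2.9" would sort after "2.63".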

Bug fixes and enhancements
- Fixed many issues that triggered compiler warnings.
- Fixed many issues reported by Coverity Scan or LGTM.
- Fixes to trainingdata rendering.
- Fixed damage to binary images when processing PDFs.
- Don't trigger a deliberate segmentation fault for fatal errors in release code.
- Fixed some issues in OpenCL code. OpenCL now works for the legacy Tesseract OCR engine, but does not improve the performance. It is not implemented for the LSTM OCR engine.
- Improved multi-page TIFF handling.
- Improvements to PDF rendering.
- Added version information and improved help texts to the training tools.
- Added a faster version of log2().
- Documented in the tesseract man page the option to use an input text file that contains a list of images.
- Made 'osd' the default traineddata when psm 0 is requested (currently this feature is only implemented in the command line interface, but not in the API).
- Removed tessedit_pageseg_mode 1 from hocr, pdf, and tsv config files. The user should explicitly use --psm 1 if that is desired.
- The list of available languages and scripts is now sorted alphabetically.
- Changed the default value of the parameter unlv_tilde_crunching to false, because the previous default caused issues with unlv output in Tesseract 4.
- Removed obsolete code.
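
Two of the items above affect how the command line is invoked: the input may be a text file listing images, and page segmentation mode 1 is no longer implied by the hocr, pdf, and tsv config files. The Python sketch below writes such a list file, one image name per line, and builds a command that passes --psm 1 explicitly. The helper names and file names are hypothetical, and the command is only constructed, not run.

```python
import os
import tempfile


def write_image_list(paths, list_path):
    """Write one image path per line to the text file given as list_path."""
    with open(list_path, "w", encoding="utf-8") as fh:
        for path in paths:
            fh.write(path + "\n")
    return list_path


def batch_cmd(list_path, out_base, psm=None, config=None):
    """Build (but do not run) a tesseract command that reads an image list."""
    cmd = ["tesseract", list_path, out_base]
    if psm is not None:
        cmd += ["--psm", str(psm)]  # no longer set implicitly by the config file
    if config:
        cmd.append(config)          # for example hocr, pdf, or tsv
    return cmd


with tempfile.TemporaryDirectory() as tmp:
    list_file = write_image_list(
        ["scan-1.png", "scan-2.png"], os.path.join(tmp, "images.txt")
    )
    with open(list_file, encoding="utf-8") as fh:
        listed = fh.read().splitlines()
    cmd = batch_cmd(list_file, "out", psm=1, config="hocr")

print(listed)
print(" ".join(cmd[3:]))
```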

diffstat:

 graphics/tesseract/Makefile                            |    8 +-
 graphics/tesseract/PLIST                               |  105 +++++-----------
 graphics/tesseract/distinfo                            |   21 +-
 graphics/tesseract/patches/patch-tessdata_Makefile.am  |   10 +-
 graphics/tesseract/patches/patch-viewer_scrollview.cpp |   14 --
 5 files changed, 52 insertions(+), 106 deletions(-)

diffs (truncated from 359 to 300 lines):

diff -r a153fce22b9c -r c69a62c63054 graphics/tesseract/Makefile
--- a/graphics/tesseract/Makefile       Sat Nov 03 08:07:31 2018 +0000
+++ b/graphics/tesseract/Makefile       Sat Nov 03 09:13:07 2018 +0000
@@ -1,7 +1,6 @@
-# $NetBSD: Makefile,v 1.39 2018/07/20 03:34:16 ryoon Exp $
+# $NetBSD: Makefile,v 1.40 2018/11/03 09:13:07 adam Exp $
 
-DISTNAME=      tesseract-3.05.02
-PKGREVISION=   1
+DISTNAME=      tesseract-4.0.0
 CATEGORIES=    graphics
 MASTER_SITES=  ${MASTER_SITE_GITHUB:=tesseract-ocr/}
 DISTFILES=     ${DEFAULT_DISTFILES}
@@ -11,7 +10,7 @@
 COMMENT=       Open Source OCR Engine
 LICENSE=       apache-2.0
 
-LANGVER=       3.04.00
+LANGVER=       4.0.0
 DISTFILES+=    tessdata-${LANGVER}${EXTRACT_SUFX}
 SITES.tessdata-${LANGVER}.tar.gz=      -${MASTER_SITES:Q}tessdata/archive/${LANGVER}.tar.gz
 
@@ -22,7 +21,6 @@
 CONFIGURE_ENV+=                LIBLEPT_HEADERSDIR=${BUILDLINK_PREFIX.leptonica}/include
 
 INSTALL_TARGET=                install training-install
-INSTALLATION_DIRS=     libexec share/doc/tesseract share/tesseract
 
 post-extract:
        ${MV} ${WRKDIR}/tessdata-${LANGVER}/* ${WRKSRC}/tessdata
diff -r a153fce22b9c -r c69a62c63054 graphics/tesseract/PLIST
--- a/graphics/tesseract/PLIST  Sat Nov 03 08:07:31 2018 +0000
+++ b/graphics/tesseract/PLIST  Sat Nov 03 09:13:07 2018 +0000
@@ -1,65 +1,47 @@
-@comment $NetBSD: PLIST,v 1.9 2017/02/21 17:51:18 fhajny Exp $
+@comment $NetBSD: PLIST,v 1.10 2018/11/03 09:13:07 adam Exp $
 bin/ambiguous_words
 bin/classifier_tester
 bin/cntraining
+bin/combine_lang_model
 bin/combine_tessdata
 bin/dawg2wordlist
+bin/language-specific.sh
+bin/lstmeval
+bin/lstmtraining
+bin/merge_unicharsets
 bin/mftraining
 bin/set_unicharset_properties
 bin/shapeclustering
 bin/tesseract
+bin/tesstrain.sh
+bin/tesstrain_utils.sh
 bin/text2image
 bin/unicharset_extractor
 bin/wordlist2dawg
 include/tesseract/apitypes.h
 include/tesseract/baseapi.h
-include/tesseract/basedir.h
 include/tesseract/capi.h
-include/tesseract/errcode.h
-include/tesseract/fileerr.h
 include/tesseract/genericvector.h
 include/tesseract/helpers.h
 include/tesseract/host.h
 include/tesseract/ltrresultiterator.h
-include/tesseract/memry.h
-include/tesseract/ndminx.h
 include/tesseract/ocrclass.h
 include/tesseract/osdetect.h
 include/tesseract/pageiterator.h
-include/tesseract/params.h
 include/tesseract/platform.h
 include/tesseract/publictypes.h
 include/tesseract/renderer.h
 include/tesseract/resultiterator.h
 include/tesseract/serialis.h
 include/tesseract/strngs.h
+include/tesseract/tess_version.h
 include/tesseract/tesscallback.h
 include/tesseract/thresholder.h
 include/tesseract/unichar.h
-include/tesseract/unicharmap.h
-include/tesseract/unicharset.h
 lib/libtesseract.la
 lib/pkgconfig/tesseract.pc
-man/man1/ambiguous_words.1
-man/man1/cntraining.1
-man/man1/combine_tessdata.1
-man/man1/dawg2wordlist.1
-man/man1/mftraining.1
-man/man1/shapeclustering.1
-man/man1/tesseract.1
-man/man1/unicharset_extractor.1
-man/man1/wordlist2dawg.1
-man/man5/unicharambigs.5
-man/man5/unicharset.5
 share/tessdata/afr.traineddata
 share/tessdata/amh.traineddata
-share/tessdata/ara.cube.bigrams
-share/tessdata/ara.cube.fold
-share/tessdata/ara.cube.lm
-share/tessdata/ara.cube.nn
-share/tessdata/ara.cube.params
-share/tessdata/ara.cube.size
-share/tessdata/ara.cube.word-freq
 share/tessdata/ara.traineddata
 share/tessdata/asm.traineddata
 share/tessdata/aze.traineddata
@@ -68,12 +50,15 @@
 share/tessdata/ben.traineddata
 share/tessdata/bod.traineddata
 share/tessdata/bos.traineddata
+share/tessdata/bre.traineddata
 share/tessdata/bul.traineddata
 share/tessdata/cat.traineddata
 share/tessdata/ceb.traineddata
 share/tessdata/ces.traineddata
 share/tessdata/chi_sim.traineddata
+share/tessdata/chi_sim_vert.traineddata
 share/tessdata/chi_tra.traineddata
+share/tessdata/chi_tra_vert.traineddata
 share/tessdata/chr.traineddata
 share/tessdata/configs/ambigs.train
 share/tessdata/configs/api_config
@@ -86,6 +71,8 @@
 share/tessdata/configs/kannada
 share/tessdata/configs/linebox
 share/tessdata/configs/logfile
+share/tessdata/configs/lstm.train
+share/tessdata/configs/lstmdebug
 share/tessdata/configs/makebox
 share/tessdata/configs/pdf
 share/tessdata/configs/quiet
@@ -94,21 +81,15 @@
 share/tessdata/configs/tsv
 share/tessdata/configs/txt
 share/tessdata/configs/unlv
+share/tessdata/cos.traineddata
 share/tessdata/cym.traineddata
 share/tessdata/dan.traineddata
 share/tessdata/dan_frak.traineddata
 share/tessdata/deu.traineddata
 share/tessdata/deu_frak.traineddata
+share/tessdata/div.traineddata
 share/tessdata/dzo.traineddata
 share/tessdata/ell.traineddata
-share/tessdata/eng.cube.bigrams
-share/tessdata/eng.cube.fold
-share/tessdata/eng.cube.lm
-share/tessdata/eng.cube.nn
-share/tessdata/eng.cube.params
-share/tessdata/eng.cube.size
-share/tessdata/eng.cube.word-freq
-share/tessdata/eng.tesseract_cube.nn
 share/tessdata/eng.traineddata
 share/tessdata/eng.user-patterns
 share/tessdata/eng.user-words
@@ -117,50 +98,33 @@
 share/tessdata/equ.traineddata
 share/tessdata/est.traineddata
 share/tessdata/eus.traineddata
+share/tessdata/fao.traineddata
 share/tessdata/fas.traineddata
+share/tessdata/fil.traineddata
 share/tessdata/fin.traineddata
-share/tessdata/fra.cube.bigrams
-share/tessdata/fra.cube.fold
-share/tessdata/fra.cube.lm
-share/tessdata/fra.cube.nn
-share/tessdata/fra.cube.params
-share/tessdata/fra.cube.size
-share/tessdata/fra.cube.word-freq
-share/tessdata/fra.tesseract_cube.nn
 share/tessdata/fra.traineddata
 share/tessdata/frk.traineddata
 share/tessdata/frm.traineddata
+share/tessdata/fry.traineddata
+share/tessdata/gla.traineddata
 share/tessdata/gle.traineddata
 share/tessdata/glg.traineddata
 share/tessdata/grc.traineddata
 share/tessdata/guj.traineddata
 share/tessdata/hat.traineddata
 share/tessdata/heb.traineddata
-share/tessdata/hin.cube.bigrams
-share/tessdata/hin.cube.fold
-share/tessdata/hin.cube.lm
-share/tessdata/hin.cube.nn
-share/tessdata/hin.cube.params
-share/tessdata/hin.cube.word-freq
-share/tessdata/hin.tesseract_cube.nn
 share/tessdata/hin.traineddata
 share/tessdata/hrv.traineddata
 share/tessdata/hun.traineddata
+share/tessdata/hye.traineddata
 share/tessdata/iku.traineddata
 share/tessdata/ind.traineddata
 share/tessdata/isl.traineddata
-share/tessdata/ita.cube.bigrams
-share/tessdata/ita.cube.fold
-share/tessdata/ita.cube.lm
-share/tessdata/ita.cube.nn
-share/tessdata/ita.cube.params
-share/tessdata/ita.cube.size
-share/tessdata/ita.cube.word-freq
-share/tessdata/ita.tesseract_cube.nn
 share/tessdata/ita.traineddata
 share/tessdata/ita_old.traineddata
 share/tessdata/jav.traineddata
 share/tessdata/jpn.traineddata
+share/tessdata/jpn_vert.traineddata
 share/tessdata/kan.traineddata
 share/tessdata/kat.traineddata
 share/tessdata/kat_old.traineddata
@@ -168,20 +132,26 @@
 share/tessdata/khm.traineddata
 share/tessdata/kir.traineddata
 share/tessdata/kor.traineddata
+share/tessdata/kor_vert.traineddata
 share/tessdata/kur.traineddata
+share/tessdata/kur_ara.traineddata
 share/tessdata/lao.traineddata
 share/tessdata/lat.traineddata
 share/tessdata/lav.traineddata
 share/tessdata/lit.traineddata
+share/tessdata/ltz.traineddata
 share/tessdata/mal.traineddata
 share/tessdata/mar.traineddata
 share/tessdata/mkd.traineddata
 share/tessdata/mlt.traineddata
+share/tessdata/mon.traineddata
+share/tessdata/mri.traineddata
 share/tessdata/msa.traineddata
 share/tessdata/mya.traineddata
 share/tessdata/nep.traineddata
 share/tessdata/nld.traineddata
 share/tessdata/nor.traineddata
+share/tessdata/oci.traineddata
 share/tessdata/ori.traineddata
 share/tessdata/osd.traineddata
 share/tessdata/pan.traineddata
@@ -189,35 +159,26 @@
 share/tessdata/pol.traineddata
 share/tessdata/por.traineddata
 share/tessdata/pus.traineddata
+share/tessdata/que.traineddata
 share/tessdata/ron.traineddata
-share/tessdata/rus.cube.fold
-share/tessdata/rus.cube.lm
-share/tessdata/rus.cube.nn
-share/tessdata/rus.cube.params
-share/tessdata/rus.cube.size
-share/tessdata/rus.cube.word-freq
 share/tessdata/rus.traineddata
 share/tessdata/san.traineddata
 share/tessdata/sin.traineddata
 share/tessdata/slk.traineddata
 share/tessdata/slk_frak.traineddata
 share/tessdata/slv.traineddata
-share/tessdata/spa.cube.bigrams
-share/tessdata/spa.cube.fold
-share/tessdata/spa.cube.lm
-share/tessdata/spa.cube.nn
-share/tessdata/spa.cube.params
-share/tessdata/spa.cube.size
-share/tessdata/spa.cube.word-freq
+share/tessdata/snd.traineddata
 share/tessdata/spa.traineddata
 share/tessdata/spa_old.traineddata
 share/tessdata/sqi.traineddata
 share/tessdata/srp.traineddata
 share/tessdata/srp_latn.traineddata
+share/tessdata/sun.traineddata
 share/tessdata/swa.traineddata
 share/tessdata/swe.traineddata
 share/tessdata/syr.traineddata
 share/tessdata/tam.traineddata
+share/tessdata/tat.traineddata
 share/tessdata/tel.traineddata
 share/tessdata/tessconfigs/batch
 share/tessdata/tessconfigs/batch.nochop
@@ -229,6 +190,7 @@
 share/tessdata/tgl.traineddata
 share/tessdata/tha.traineddata
 share/tessdata/tir.traineddata
+share/tessdata/ton.traineddata
 share/tessdata/tur.traineddata
 share/tessdata/uig.traineddata
 share/tessdata/ukr.traineddata
@@ -237,3 +199,4 @@
 share/tessdata/uzb_cyrl.traineddata
 share/tessdata/vie.traineddata
 share/tessdata/yid.traineddata
+share/tessdata/yor.traineddata
diff -r a153fce22b9c -r c69a62c63054 graphics/tesseract/distinfo
--- a/graphics/tesseract/distinfo       Sat Nov 03 08:07:31 2018 +0000
+++ b/graphics/tesseract/distinfo       Sat Nov 03 09:13:07 2018 +0000
@@ -1,12 +1,11 @@
-$NetBSD: distinfo,v 1.18 2018/06/22 09:50:16 adam Exp $
+$NetBSD: distinfo,v 1.19 2018/11/03 09:13:07 adam Exp $


