pkgsrc-Changes archive


CVS commit: pkgsrc/graphics/tesseract



Module Name:    pkgsrc
Committed By:   adam
Date:           Sat Nov  3 09:13:07 UTC 2018

Modified Files:
        pkgsrc/graphics/tesseract: Makefile PLIST distinfo
        pkgsrc/graphics/tesseract/patches: patch-tessdata_Makefile.am
Removed Files:
        pkgsrc/graphics/tesseract/patches: patch-viewer_scrollview.cpp

Log Message:
tesseract: updated to 4.0.0

V4.0.0:
New OCR engine
- Added a new OCR engine that uses a neural network system based on LSTMs, with major accuracy gains.
- This includes new training tools for the LSTM OCR engine. A new model can be trained from scratch or by fine tuning an existing model.
- Added trained data that includes LSTM models for 123 languages.
- Added optional accelerated code paths for the LSTM recognizer:
  * Using OpenMP
  * Using SIMD: AVX2 / AVX / SSE4.1
- Added a new parameter lstm_choice_mode that allows alternative symbol choices to be included in the hOCR output.
- The new LSTM engine does not yet support all features of the legacy engine (see missing features).
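
A minimal sketch of how the options above might be combined on the command line. The helper name ocr_lstm_hocr, the file names, and the value lstm_choice_mode=1 are assumptions made for illustration; --oem 1 selects the LSTM engine only, and hocr is the stock config file that requests hOCR output.

```shell
#!/bin/sh
# Hypothetical helper (not part of Tesseract or pkgsrc): run the LSTM
# engine and request alternative symbol choices in the hOCR output.
ocr_lstm_hocr() {
    input=$1            # image to recognize
    outbase=$2          # output base name; tesseract appends .hocr
    lang=${3:-eng}      # traineddata to load, defaults to eng
    # lstm_choice_mode=1 is an assumed example value.
    tesseract "$input" "$outbase" --oem 1 -l "$lang" \
        -c lstm_choice_mode=1 hocr
}

# Run only if tesseract is installed and the sample image (an assumed
# file name) exists.
if command -v tesseract >/dev/null 2>&1 && [ -f page.png ]; then
    ocr_lstm_hocr page.png page eng
fi
```

Passing the parameter with -c keeps the change local to one invocation, so the installed config files stay untouched.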

Other OCR engines
- The pattern matching OCR engine that was the primary OCR engine in previous versions is still available in this version.
- Removed the 'Cube' OCR engine from the codebase. It was used for Hindi and for Arabic. The new LSTM engine performs much better, so the Cube engine is no longer needed.

Updated build system
- Tesseract now uses semantic versioning.
- Tesseract now requires Leptonica 1.74.0 or a higher version.
- Building Tesseract from source code requires a compiler with good C++11 support. See here for a list of officially supported compilers.
- Added unit tests to the main repo. The unit tests require Git submodules and the code for training.
- Added an option to compile Tesseract without the code of the legacy OCR engine.
- Updated the minimum required autoconf version to 2.63.
- Updated the minimum required versions of the training tools dependencies: ICU 52.1, Pango 1.22.0.
- Reorganized Tesseract's source tree. Most sources are now below the src directory.

Bug fixes and enhancements
- Fixed many issues that triggered compiler warnings.
- Fixed many issues reported by Coverity Scan or LGTM.
- Fixes to trainingdata rendering.
- Fixed damage to binary images when processing PDFs.
- Don't trigger a deliberate segmentation fault for fatal errors in release code.
- Fixed some issues in OpenCL code. OpenCL now works for the legacy Tesseract OCR engine, but does not improve the performance. It is not implemented for the LSTM OCR engine.
- Improved multi-page TIFF handling.
- Improvements to PDF rendering.
- Added version information and improved help texts to the training tools.
- Added a faster version of log2().
- Documented in the tesseract man page the option to use an input text file that contains a list of images.
- Made 'osd' the default traineddata when psm 0 is requested (currently this feature is only implemented in the command line interface, but not in the API).
- Removed tessedit_pageseg_mode 1 from hocr, pdf, and tsv config files. The user should explicitly use --psm 1 if that is desired.
- The list of available languages and scripts is now sorted alphabetically.
- Changed the default of parameter unlv_tilde_crunching to false, because the previous default caused issues with unlv output in Tesseract 4.
- Removed obsolete code.
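
The page segmentation changes in the list above could be exercised roughly as follows. This is a sketch only: the helper names detect_orientation and ocr_to_pdf_with_osd are hypothetical, and the file names passed to them are up to the caller. --psm 0 requests orientation and script detection only, and --psm 1 requests automatic page segmentation with OSD.

```shell
#!/bin/sh
# Hypothetical helpers (not shipped with Tesseract or pkgsrc).

# Orientation and script detection only. Per the notes above, the command
# line tool now defaults to the 'osd' traineddata for --psm 0, so no -l
# option is given. The output base "stdout" sends the result to standard
# output instead of a file.
detect_orientation() {
    tesseract "$1" stdout --psm 0
}

# Searchable PDF with OSD-assisted segmentation. --psm 1 is passed
# explicitly because the pdf config file no longer sets
# tessedit_pageseg_mode 1.
ocr_to_pdf_with_osd() {
    tesseract "$1" "$2" --psm 1 pdf
}
```

A call such as ocr_to_pdf_with_osd scan.png scan would write scan.pdf, assuming scan.png exists and the osd and eng traineddata files are installed.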


To generate a diff of this commit:
cvs rdiff -u -r1.39 -r1.40 pkgsrc/graphics/tesseract/Makefile
cvs rdiff -u -r1.9 -r1.10 pkgsrc/graphics/tesseract/PLIST
cvs rdiff -u -r1.18 -r1.19 pkgsrc/graphics/tesseract/distinfo
cvs rdiff -u -r1.1 -r1.2 \
    pkgsrc/graphics/tesseract/patches/patch-tessdata_Makefile.am
cvs rdiff -u -r1.2 -r0 \
    pkgsrc/graphics/tesseract/patches/patch-viewer_scrollview.cpp

Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.

Modified files:

Index: pkgsrc/graphics/tesseract/Makefile
diff -u pkgsrc/graphics/tesseract/Makefile:1.39 pkgsrc/graphics/tesseract/Makefile:1.40
--- pkgsrc/graphics/tesseract/Makefile:1.39     Fri Jul 20 03:34:16 2018
+++ pkgsrc/graphics/tesseract/Makefile  Sat Nov  3 09:13:07 2018
@@ -1,7 +1,6 @@
-# $NetBSD: Makefile,v 1.39 2018/07/20 03:34:16 ryoon Exp $
+# $NetBSD: Makefile,v 1.40 2018/11/03 09:13:07 adam Exp $
 
-DISTNAME=      tesseract-3.05.02
-PKGREVISION=   1
+DISTNAME=      tesseract-4.0.0
 CATEGORIES=    graphics
 MASTER_SITES=  ${MASTER_SITE_GITHUB:=tesseract-ocr/}
 DISTFILES=     ${DEFAULT_DISTFILES}
@@ -11,7 +10,7 @@ HOMEPAGE=     https://github.com/tesseract-o
 COMMENT=       Open Source OCR Engine
 LICENSE=       apache-2.0
 
-LANGVER=       3.04.00
+LANGVER=       4.0.0
 DISTFILES+=    tessdata-${LANGVER}${EXTRACT_SUFX}
 SITES.tessdata-${LANGVER}.tar.gz=      -${MASTER_SITES:Q}tessdata/archive/${LANGVER}.tar.gz
 
@@ -22,7 +21,6 @@ GNU_CONFIGURE=                yes
 CONFIGURE_ENV+=                LIBLEPT_HEADERSDIR=${BUILDLINK_PREFIX.leptonica}/include
 
 INSTALL_TARGET=                install training-install
-INSTALLATION_DIRS=     libexec share/doc/tesseract share/tesseract
 
 post-extract:
        ${MV} ${WRKDIR}/tessdata-${LANGVER}/* ${WRKSRC}/tessdata

Index: pkgsrc/graphics/tesseract/PLIST
diff -u pkgsrc/graphics/tesseract/PLIST:1.9 pkgsrc/graphics/tesseract/PLIST:1.10
--- pkgsrc/graphics/tesseract/PLIST:1.9 Tue Feb 21 17:51:18 2017
+++ pkgsrc/graphics/tesseract/PLIST     Sat Nov  3 09:13:07 2018
@@ -1,65 +1,47 @@
-@comment $NetBSD: PLIST,v 1.9 2017/02/21 17:51:18 fhajny Exp $
+@comment $NetBSD: PLIST,v 1.10 2018/11/03 09:13:07 adam Exp $
 bin/ambiguous_words
 bin/classifier_tester
 bin/cntraining
+bin/combine_lang_model
 bin/combine_tessdata
 bin/dawg2wordlist
+bin/language-specific.sh
+bin/lstmeval
+bin/lstmtraining
+bin/merge_unicharsets
 bin/mftraining
 bin/set_unicharset_properties
 bin/shapeclustering
 bin/tesseract
+bin/tesstrain.sh
+bin/tesstrain_utils.sh
 bin/text2image
 bin/unicharset_extractor
 bin/wordlist2dawg
 include/tesseract/apitypes.h
 include/tesseract/baseapi.h
-include/tesseract/basedir.h
 include/tesseract/capi.h
-include/tesseract/errcode.h
-include/tesseract/fileerr.h
 include/tesseract/genericvector.h
 include/tesseract/helpers.h
 include/tesseract/host.h
 include/tesseract/ltrresultiterator.h
-include/tesseract/memry.h
-include/tesseract/ndminx.h
 include/tesseract/ocrclass.h
 include/tesseract/osdetect.h
 include/tesseract/pageiterator.h
-include/tesseract/params.h
 include/tesseract/platform.h
 include/tesseract/publictypes.h
 include/tesseract/renderer.h
 include/tesseract/resultiterator.h
 include/tesseract/serialis.h
 include/tesseract/strngs.h
+include/tesseract/tess_version.h
 include/tesseract/tesscallback.h
 include/tesseract/thresholder.h
 include/tesseract/unichar.h
-include/tesseract/unicharmap.h
-include/tesseract/unicharset.h
 lib/libtesseract.la
 lib/pkgconfig/tesseract.pc
-man/man1/ambiguous_words.1
-man/man1/cntraining.1
-man/man1/combine_tessdata.1
-man/man1/dawg2wordlist.1
-man/man1/mftraining.1
-man/man1/shapeclustering.1
-man/man1/tesseract.1
-man/man1/unicharset_extractor.1
-man/man1/wordlist2dawg.1
-man/man5/unicharambigs.5
-man/man5/unicharset.5
 share/tessdata/afr.traineddata
 share/tessdata/amh.traineddata
-share/tessdata/ara.cube.bigrams
-share/tessdata/ara.cube.fold
-share/tessdata/ara.cube.lm
-share/tessdata/ara.cube.nn
-share/tessdata/ara.cube.params
-share/tessdata/ara.cube.size
-share/tessdata/ara.cube.word-freq
 share/tessdata/ara.traineddata
 share/tessdata/asm.traineddata
 share/tessdata/aze.traineddata
@@ -68,12 +50,15 @@ share/tessdata/bel.traineddata
 share/tessdata/ben.traineddata
 share/tessdata/bod.traineddata
 share/tessdata/bos.traineddata
+share/tessdata/bre.traineddata
 share/tessdata/bul.traineddata
 share/tessdata/cat.traineddata
 share/tessdata/ceb.traineddata
 share/tessdata/ces.traineddata
 share/tessdata/chi_sim.traineddata
+share/tessdata/chi_sim_vert.traineddata
 share/tessdata/chi_tra.traineddata
+share/tessdata/chi_tra_vert.traineddata
 share/tessdata/chr.traineddata
 share/tessdata/configs/ambigs.train
 share/tessdata/configs/api_config
@@ -86,6 +71,8 @@ share/tessdata/configs/inter
 share/tessdata/configs/kannada
 share/tessdata/configs/linebox
 share/tessdata/configs/logfile
+share/tessdata/configs/lstm.train
+share/tessdata/configs/lstmdebug
 share/tessdata/configs/makebox
 share/tessdata/configs/pdf
 share/tessdata/configs/quiet
@@ -94,21 +81,15 @@ share/tessdata/configs/strokewidth
 share/tessdata/configs/tsv
 share/tessdata/configs/txt
 share/tessdata/configs/unlv
+share/tessdata/cos.traineddata
 share/tessdata/cym.traineddata
 share/tessdata/dan.traineddata
 share/tessdata/dan_frak.traineddata
 share/tessdata/deu.traineddata
 share/tessdata/deu_frak.traineddata
+share/tessdata/div.traineddata
 share/tessdata/dzo.traineddata
 share/tessdata/ell.traineddata
-share/tessdata/eng.cube.bigrams
-share/tessdata/eng.cube.fold
-share/tessdata/eng.cube.lm
-share/tessdata/eng.cube.nn
-share/tessdata/eng.cube.params
-share/tessdata/eng.cube.size
-share/tessdata/eng.cube.word-freq
-share/tessdata/eng.tesseract_cube.nn
 share/tessdata/eng.traineddata
 share/tessdata/eng.user-patterns
 share/tessdata/eng.user-words
@@ -117,50 +98,33 @@ share/tessdata/epo.traineddata
 share/tessdata/equ.traineddata
 share/tessdata/est.traineddata
 share/tessdata/eus.traineddata
+share/tessdata/fao.traineddata
 share/tessdata/fas.traineddata
+share/tessdata/fil.traineddata
 share/tessdata/fin.traineddata
-share/tessdata/fra.cube.bigrams
-share/tessdata/fra.cube.fold
-share/tessdata/fra.cube.lm
-share/tessdata/fra.cube.nn
-share/tessdata/fra.cube.params
-share/tessdata/fra.cube.size
-share/tessdata/fra.cube.word-freq
-share/tessdata/fra.tesseract_cube.nn
 share/tessdata/fra.traineddata
 share/tessdata/frk.traineddata
 share/tessdata/frm.traineddata
+share/tessdata/fry.traineddata
+share/tessdata/gla.traineddata
 share/tessdata/gle.traineddata
 share/tessdata/glg.traineddata
 share/tessdata/grc.traineddata
 share/tessdata/guj.traineddata
 share/tessdata/hat.traineddata
 share/tessdata/heb.traineddata
-share/tessdata/hin.cube.bigrams
-share/tessdata/hin.cube.fold
-share/tessdata/hin.cube.lm
-share/tessdata/hin.cube.nn
-share/tessdata/hin.cube.params
-share/tessdata/hin.cube.word-freq
-share/tessdata/hin.tesseract_cube.nn
 share/tessdata/hin.traineddata
 share/tessdata/hrv.traineddata
 share/tessdata/hun.traineddata
+share/tessdata/hye.traineddata
 share/tessdata/iku.traineddata
 share/tessdata/ind.traineddata
 share/tessdata/isl.traineddata
-share/tessdata/ita.cube.bigrams
-share/tessdata/ita.cube.fold
-share/tessdata/ita.cube.lm
-share/tessdata/ita.cube.nn
-share/tessdata/ita.cube.params
-share/tessdata/ita.cube.size
-share/tessdata/ita.cube.word-freq
-share/tessdata/ita.tesseract_cube.nn
 share/tessdata/ita.traineddata
 share/tessdata/ita_old.traineddata
 share/tessdata/jav.traineddata
 share/tessdata/jpn.traineddata
+share/tessdata/jpn_vert.traineddata
 share/tessdata/kan.traineddata
 share/tessdata/kat.traineddata
 share/tessdata/kat_old.traineddata
@@ -168,20 +132,26 @@ share/tessdata/kaz.traineddata
 share/tessdata/khm.traineddata
 share/tessdata/kir.traineddata
 share/tessdata/kor.traineddata
+share/tessdata/kor_vert.traineddata
 share/tessdata/kur.traineddata
+share/tessdata/kur_ara.traineddata
 share/tessdata/lao.traineddata
 share/tessdata/lat.traineddata
 share/tessdata/lav.traineddata
 share/tessdata/lit.traineddata
+share/tessdata/ltz.traineddata
 share/tessdata/mal.traineddata
 share/tessdata/mar.traineddata
 share/tessdata/mkd.traineddata
 share/tessdata/mlt.traineddata
+share/tessdata/mon.traineddata
+share/tessdata/mri.traineddata
 share/tessdata/msa.traineddata
 share/tessdata/mya.traineddata
 share/tessdata/nep.traineddata
 share/tessdata/nld.traineddata
 share/tessdata/nor.traineddata
+share/tessdata/oci.traineddata
 share/tessdata/ori.traineddata
 share/tessdata/osd.traineddata
 share/tessdata/pan.traineddata
@@ -189,35 +159,26 @@ share/tessdata/pdf.ttf
 share/tessdata/pol.traineddata
 share/tessdata/por.traineddata
 share/tessdata/pus.traineddata
+share/tessdata/que.traineddata
 share/tessdata/ron.traineddata
-share/tessdata/rus.cube.fold
-share/tessdata/rus.cube.lm
-share/tessdata/rus.cube.nn
-share/tessdata/rus.cube.params
-share/tessdata/rus.cube.size
-share/tessdata/rus.cube.word-freq
 share/tessdata/rus.traineddata
 share/tessdata/san.traineddata
 share/tessdata/sin.traineddata
 share/tessdata/slk.traineddata
 share/tessdata/slk_frak.traineddata
 share/tessdata/slv.traineddata
-share/tessdata/spa.cube.bigrams
-share/tessdata/spa.cube.fold
-share/tessdata/spa.cube.lm
-share/tessdata/spa.cube.nn
-share/tessdata/spa.cube.params
-share/tessdata/spa.cube.size
-share/tessdata/spa.cube.word-freq
+share/tessdata/snd.traineddata
 share/tessdata/spa.traineddata
 share/tessdata/spa_old.traineddata
 share/tessdata/sqi.traineddata
 share/tessdata/srp.traineddata
 share/tessdata/srp_latn.traineddata
+share/tessdata/sun.traineddata
 share/tessdata/swa.traineddata
 share/tessdata/swe.traineddata
 share/tessdata/syr.traineddata
 share/tessdata/tam.traineddata
+share/tessdata/tat.traineddata
 share/tessdata/tel.traineddata
 share/tessdata/tessconfigs/batch
 share/tessdata/tessconfigs/batch.nochop
@@ -229,6 +190,7 @@ share/tessdata/tgk.traineddata
 share/tessdata/tgl.traineddata
 share/tessdata/tha.traineddata
 share/tessdata/tir.traineddata
+share/tessdata/ton.traineddata
 share/tessdata/tur.traineddata
 share/tessdata/uig.traineddata
 share/tessdata/ukr.traineddata
@@ -237,3 +199,4 @@ share/tessdata/uzb.traineddata
 share/tessdata/uzb_cyrl.traineddata
 share/tessdata/vie.traineddata
 share/tessdata/yid.traineddata
+share/tessdata/yor.traineddata

Index: pkgsrc/graphics/tesseract/distinfo
diff -u pkgsrc/graphics/tesseract/distinfo:1.18 pkgsrc/graphics/tesseract/distinfo:1.19
--- pkgsrc/graphics/tesseract/distinfo:1.18     Fri Jun 22 09:50:16 2018
+++ pkgsrc/graphics/tesseract/distinfo  Sat Nov  3 09:13:07 2018
@@ -1,12 +1,11 @@
-$NetBSD: distinfo,v 1.18 2018/06/22 09:50:16 adam Exp $
+$NetBSD: distinfo,v 1.19 2018/11/03 09:13:07 adam Exp $
 
-SHA1 (tessdata-3.04.00.tar.gz) = 6ea24cccf0e823da98589ccc75d51f0950618236
-RMD160 (tessdata-3.04.00.tar.gz) = 0a3c3b3c127b6031e2e037d78e3a6f159fb9e869
-SHA512 (tessdata-3.04.00.tar.gz) = 4fbb66137c729e16c7a9e35b09916a45c1bb5ec5a7002a22647e0b10975362cb44c6d6c0c997baf25866f78749ec2d4a86317ec3fb664bd963243e230516d162
-Size (tessdata-3.04.00.tar.gz) = 499088801 bytes
-SHA1 (tesseract-3.05.02.tar.gz) = 6d57403988a5c4eef80381c7a209d80d0391c833
-RMD160 (tesseract-3.05.02.tar.gz) = c31e95c288d9ecfa893bebe467dc5cc5f0edce5a
-SHA512 (tesseract-3.05.02.tar.gz) = 4cb23a6981dd5ec9eefea7b9674847ae88a411a7308ee6d946a920c76eefcf5fe7a90f6cb3ff00493a0e69b5c327d052fa8514d7f3ed506bccbe4b0163065793
-Size (tesseract-3.05.02.tar.gz) = 3571750 bytes
-SHA1 (patch-tessdata_Makefile.am) = 013c9b4bbf64a0948a362d334e6b86a240aa944f
-SHA1 (patch-viewer_scrollview.cpp) = 6df7672def32455c0a82283893320e69290980ca
+SHA1 (tessdata-4.0.0.tar.gz) = 94557a6ecdf8ff8bec131598759e7d3b0bca1911
+RMD160 (tessdata-4.0.0.tar.gz) = 2e826e866b56ff8b9cb2c6613f04d8c4a4ff98d7
+SHA512 (tessdata-4.0.0.tar.gz) = cd71bb99d44eefb53b359ba64b472c509fff773b2737a8d51e10d5d52d9a3a7ff870d470b1c72a7c78be3263b5ecfbb58a6eab13cf7128d8599681676cdcef6b
+Size (tessdata-4.0.0.tar.gz) = 669258747 bytes
+SHA1 (tesseract-4.0.0.tar.gz) = 243a4919d44bc64d1e7e4cac660c716c845a8d03
+RMD160 (tesseract-4.0.0.tar.gz) = 0e95d343639ab98c6d3fbc528053b627b6e12282
+SHA512 (tesseract-4.0.0.tar.gz) = 69e57d4ba1fc43d212fd0fff69a2b5d48a3b37cfee7054fdc083cbb7e04d92317609a32e457229661d70ce8d9b16c9d25e81bfc3861db660dd2c8f292202d447
+Size (tesseract-4.0.0.tar.gz) = 1961372 bytes
+SHA1 (patch-tessdata_Makefile.am) = 496926e629d3803165306c22a9c03ff71f5b774f

Index: pkgsrc/graphics/tesseract/patches/patch-tessdata_Makefile.am
diff -u pkgsrc/graphics/tesseract/patches/patch-tessdata_Makefile.am:1.1 pkgsrc/graphics/tesseract/patches/patch-tessdata_Makefile.am:1.2
--- pkgsrc/graphics/tesseract/patches/patch-tessdata_Makefile.am:1.1    Tue Feb 21 17:51:18 2017
+++ pkgsrc/graphics/tesseract/patches/patch-tessdata_Makefile.am        Sat Nov  3 09:13:07 2018
@@ -1,12 +1,12 @@
-$NetBSD: patch-tessdata_Makefile.am,v 1.1 2017/02/21 17:51:18 fhajny Exp $
+$NetBSD: patch-tessdata_Makefile.am,v 1.2 2018/11/03 09:13:07 adam Exp $
 
 Revert a trunk commit that broke install-lang for tesseract<4.
 
---- tessdata/Makefile.am.orig  2017-02-16 17:59:48.000000000 +0000
+--- tessdata/Makefile.am.orig  2018-10-29 08:53:12.000000000 +0000
 +++ tessdata/Makefile.am
-@@ -44,6 +44,27 @@ langdata = bul.traineddata mlt.trainedda
-       ita.cube.nn fra.cube.size eng.cube.bigrams ara.cube.lm \
-       rus.cube.nn spa.cube.nn hin.cube.bigrams
+@@ -29,6 +29,27 @@ langdata = bul.traineddata mlt.trainedda
+       chi_tra.traineddata ita.traineddata spa_old.traineddata \
+       deu-frak.traineddata aze.traineddata
  
 +.PHONY: install-langs
 +install-langs:


