Source-Changes-HG archive

[pkgsrc/trunk]: pkgsrc/textproc/split-thai Update to 0.8



details:   https://anonhg.NetBSD.org/pkgsrc/rev/3bd72dac6835
branches:  trunk
changeset: 437630:3bd72dac6835
user:      scole <scole%pkgsrc.org@localhost>
date:      Fri Aug 28 16:02:42 2020 +0000

description:
Update to 0.8
- add 'tgrep' perl script for grepping thai words

diffstat:

 textproc/split-thai/DESCR            |    3 +-
 textproc/split-thai/Makefile         |   10 +-
 textproc/split-thai/PLIST            |    3 +-
 textproc/split-thai/files/README.txt |   27 +++-
 textproc/split-thai/files/tgrep      |  208 +++++++++++++++++++++++++++++++++++
 5 files changed, 237 insertions(+), 14 deletions(-)

diffs (truncated from 352 to 300 lines):

diff -r 19e31de83ed4 -r 3bd72dac6835 textproc/split-thai/DESCR
--- a/textproc/split-thai/DESCR Fri Aug 28 14:55:18 2020 +0000
+++ b/textproc/split-thai/DESCR Fri Aug 28 16:02:42 2020 +0000
@@ -3,4 +3,5 @@
 swath, and a c++ icu-project program.  All use dictionary-based word
 splitting.
 
-Also included is merged dictionary file of thai words.
+Also included is a merged dictionary file of Thai words and a perl
+script to grep Thai UTF-8 words.
diff -r 19e31de83ed4 -r 3bd72dac6835 textproc/split-thai/Makefile
--- a/textproc/split-thai/Makefile      Fri Aug 28 14:55:18 2020 +0000
+++ b/textproc/split-thai/Makefile      Fri Aug 28 16:02:42 2020 +0000
@@ -1,6 +1,6 @@
-# $NetBSD: Makefile,v 1.7 2020/08/20 14:20:27 scole Exp $
+# $NetBSD: Makefile,v 1.8 2020/08/28 16:02:42 scole Exp $
 
-PKGNAME=       split-thai-0.7
+PKGNAME=       split-thai-0.8
 CATEGORIES=    textproc
 MAINTAINER=    pkgsrc-users%NetBSD.org@localhost
 COMMENT=       Utilities to split UTF-8 Thai text into words
@@ -15,10 +15,12 @@
 USE_LANGUAGES= c++11   # darwin needed 11?
 
 USE_TOOLS=     pkg-config mkdir cp sh:run env awk cat sort uniq grep wc echo
+USE_TOOLS+=    perl:run
 BUILD_DEPENDS+=        libdatrie-[0-9]*:../../devel/libdatrie
 DEPENDS+=      emacs-[0-9]*:../../editors/emacs
 DEPENDS+=      swath-[0-9]*:../../textproc/swath
 
+REPLACE_PERL=  tgrep
 REPLACE_SH=    st-swath
 
 UTF8_ENV=      env LC_ALL=C.UTF-8
@@ -47,7 +49,7 @@
 pre-extract:
        mkdir -p ${WRKSRC}
        cd files && cp README.txt st-emacs st-icu.cc st-swath \
-               thai-utility.el thaidict.abm ${WRKSRC}
+               tgrep thai-utility.el thaidict.abm ${WRKSRC}
 
 post-extract:
        cd ${WRKSRC} && ${UTF8_ENV} emacs --batch \
@@ -80,7 +82,7 @@
 
 do-install:
        ${INSTALL_SCRIPT} ${WRKSRC}/st-emacs ${WRKSRC}/st-swath \
-               ${DESTDIR}${PREFIX}/bin
+               ${WRKSRC}/tgrep ${DESTDIR}${PREFIX}/bin
        ${INSTALL_PROGRAM} ${WRKSRC}/st-icu ${DESTDIR}${PREFIX}/bin
 .for i in ${ST_SHARE_FILES}
        ${INSTALL_DATA} ${WRKSRC}/${i} ${DESTDIR}${PREFIX}/share/split-thai
diff -r 19e31de83ed4 -r 3bd72dac6835 textproc/split-thai/PLIST
--- a/textproc/split-thai/PLIST Fri Aug 28 14:55:18 2020 +0000
+++ b/textproc/split-thai/PLIST Fri Aug 28 16:02:42 2020 +0000
@@ -1,7 +1,8 @@
-@comment $NetBSD: PLIST,v 1.2 2020/08/14 17:31:34 scole Exp $
+@comment $NetBSD: PLIST,v 1.3 2020/08/28 16:02:42 scole Exp $
 bin/st-emacs
 bin/st-icu
 bin/st-swath
+bin/tgrep
 share/split-thai/README.txt
 share/split-thai/thai-dict.el
 share/split-thai/thai-dict.elc
diff -r 19e31de83ed4 -r 3bd72dac6835 textproc/split-thai/files/README.txt
--- a/textproc/split-thai/files/README.txt      Fri Aug 28 14:55:18 2020 +0000
+++ b/textproc/split-thai/files/README.txt      Fri Aug 28 16:02:42 2020 +0000
@@ -2,14 +2,16 @@
      st-emacs
      st-icu
      st-swath
+     tgrep
 
 SYNOPSIS
      st-emacs|st-icu|st-swath [filename|text1 text2 ...|'blank']
+     tgrep [options] FILE ...
 
 DESCRIPTION
      This package is a collection of utilities to separate Thai words
      by spaces (word tokenization).  They can separate stdin, files,
-     or text as arguments.  It includes 3 separate utilities:
+     or text as arguments.  It includes these utilities:
 
      st-emacs:  emacs-script using emacs lisp thai-word library
                 https://www.gnu.org/software/emacs/
@@ -18,30 +20,38 @@
      st-swath:  sh script wrapper to simplify args to the swath program
                 https://linux.thai.net/projects/swath
 
+     tgrep:     grep-like utility using perl, see "tgrep -h"
+
 EXAMPLES
-      split one or more text strings
+      split one or more text strings:
       # st-swath แมวและหมา
       # st-swath "แมวหมา" พ่อและแม่
       
-      read stdin
+      read stdin:
       # echo "แมวและหมา" | st-swath
 
-      read from a file
+      read from a file:
       # st-swath < thaifile.txt
       # st-swath somefile.txt
 
-      They can also read directly from stdin
+      They can also read directly from stdin:
       # st-icu
         แมวหมา   (typed in)
         แมว หมา  (output line by line)
 
+      grep for thai words:
+      # tgrep แมว thaifile.txt
+
 ENVIRONMENT
      You will most likely need to set the environment variables LC_ALL
      or LC_CTYPE for proper unicode handling, e.g., en_US.UTF-8 or
      C.UTF-8.  These tools are only setup to handle UTF-8 encodings.
 
+     A terminal capable of entering and displaying UTF-8, and some
+     actual UTF-8 fonts installed on the system will also be needed.
+     
 EXIT STATUS
-     0 for success, non zero otherwise
+     0 for success, non zero otherwise.  For tgrep, see "tgrep -h"
 
 NOTES
      Note that it is not possible to split Thai words 100% accurately
@@ -66,5 +76,6 @@
 
 BUGS
      st-icu should also use the combined dictionary words.
-     thai text mixed with other languages may not be handled well.
-     this file should be converted to a proper manpage.
+     thai text mixed with other languages may not be handled well when
+     splitting.
+     this file should be converted to proper manpages.
diff -r 19e31de83ed4 -r 3bd72dac6835 textproc/split-thai/files/tgrep
--- /dev/null   Thu Jan 01 00:00:00 1970 +0000
+++ b/textproc/split-thai/files/tgrep   Fri Aug 28 16:02:42 2020 +0000
@@ -0,0 +1,208 @@
+#!/bin/perl
+#
+# perl grep equivalent-wrapper supporting utf-8 and thai in particular
+#
+use warnings;
+use strict;
+use Encode;
+use Getopt::Std;
+
+use utf8;
+use open qw/:std :utf8/;
+
+our ( $opt_h, $opt_i, $opt_l, $opt_n, $opt_q, $opt_v );
+
+getopts('hilnqv');
+
+if ( $opt_h ) {
+    usage();
+    exit 0;
+} elsif ( ! defined $ARGV[0] ) {
+    # no pattern given
+    usage();
+    exit 1;
+}
+
+my $pattern = decode('UTF-8', $ARGV[0]);
+unless ( length( $pattern ) ) {
+    usage();
+    exit 1;
+}
+
+my $opt_filesonly = ( defined $opt_l ? 1 : 0 );
+my $opt_ignorecase = ( defined $opt_i ? 1 : 0 );
+my $opt_linenum = ( defined $opt_n ? 1 : 0 );
+my $opt_quiet = ( defined $opt_q ? 1 : 0 );
+my $opt_invert = ( defined $opt_v ? 1 : 0 );
+
+# rest of args should be filenames
+my @files = @ARGV;
+shift @files;
+@files = map { decode('UTF-8', $_ ) } @files;
+
+#
+# usage
+#
+sub usage {
+    print <<'EOF';
+
+NAME
+    tgrep - print lines matching a pattern, supports utf-8 characters
+    and some thai character classes using perl regexp matching.
+
+SYNOPSIS
+    tgrep [options] PATTERN [FILE] [FILE2]
+
+DESCRIPTION
+    tgrep (thai grep) is similar to grep, in that it searches files or
+    stdin for lines matching a pattern.  It uses perl to support utf-8
+    characters, and therefore the patterns are perl regexp patterns.
+    It supports a few simple homegrown character classes:
+
+    [:thai:]          match any thai unicode value
+    [:thaiconsonant:] match thai consonant including ฤ ฦ
+    [:thaidigit:]     match thai number ๐๑๒๓๔๕๖๗๘๙ 
+    [:thaitonemark:]  match thai tonemark ่้๊๋
+    [:thaivowel:]     match thai vowel symbols ะัา ำิีึืุูเแโใไๅ็
+                      does not include consonants that function as vowels
+    [:thaivowelfull:] same as [:thaivowel:] plus อรวยฤฦๅ used to form
+                      vowel diacritics and diphthongs
+    [:thaimisc:]      match misc thai symbols ฯๆฺ฿์ํ๎๏๚๛
+
+OPTIONS
+    -h  print help or usage
+
+    -i  ignore case
+
+    -l  suppress normal output, only print filenames that match
+
+    -n  prefix each line of output with the line number of the file
+
+    -q  quiet mode, don't print out matches
+
+    -v  invert match or print lines not matching pattern
+
+ENVIRONMENT
+     You may need to set LC_CTYPE, LC_ALL, or other LC_* to a utf-8
+     setting for this program to work, e.g. for csh-type shells:
+          setenv LC_CTYPE en_US.UTF-8
+         
+EXIT STATUS
+    Similar to grep, returns 0 when a matching line is found, 1 otherwise.
+    If an error occurs, exits with 2 unless the -q (quiet) option is
+    given and a match is found.
+
+EXAMPLES
+    search for 'ก' in a utf-8 text file
+    $ tgrep ก file.txt
+
+    use perl regexp to match any line with thai utf-8 characters
+    $ tgrep '\p{InThai}' somefile.txt
+
+    use perl regexp unicode values to match thai numbers
+    $ tgrep '^[\x{0e50}-\x{0e59}]+$' other.file
+
+    match lines with a thai number
+    $ tgrep '[:thaidigit:]' afile.txt
+
+NOTES
+    grep(1) also can be used to match thai characters with unicode
+    escapes, for example
+       egrep "["$'\u0e01'-$'\u0e5b'"]" file.txt
+    would match thai unicode chars in a bash-type shell.
+
+SEE ALSO
+    grep(1), perl(1), perlre(1), locale(1), ugrep(1)
+
+BUGS
+    Only utf-8 encodings are supported.
+    The character classes used by this program ([:thai*:]) are not
+    standard or supported by other programs.
+    Quoting perl regular expressions can sometimes be difficult from
+    within the shell.
+
+EOF
+}
+
+# handle convenience character classes
+if ( index($pattern, "[:thai:]") != -1 ) {
+    $pattern =~ s!\[\:thai\:\]!\\p\{InThai\}!g;
+}
+if ( index($pattern, "[:thaiconsonant:]") != -1 ) {
+    # chars between ก & ฮ inclusive
+    $pattern =~ s!\[\:thaiconsonant\:\]!\[\x{0e01}-\x{0e2e}\]!g;
+}
+if ( index($pattern, "[:thaidigit:]") != -1 ) {
+    $pattern =~ s!\[\:thaidigit\:\]![๐๑๒๓๔๕๖๗๘๙]!g;
+}
+if ( index($pattern, "[:thaitonemark:]") != -1 ) {
+    $pattern =~ s!\[\:thaitonemark\:\]![่้๊๋]!g;
+}
+if ( index($pattern, "[:thaivowel:]") != -1 ) {
+    $pattern =~ s!\[\:thaivowel\:\]![ะัา ำิีึืุูเแโใไๅ็]!g;
+}
+if ( index($pattern, "[:thaivowelfull:]") != -1 ) {
+    $pattern =~ s!\[\:thaivowelfull\:\]![ะัา ำิีึืุูเแโใไๅ็อรวยฤฦๅ]!g;
+}
+if ( index($pattern, "[:thaimisc:]") != -1 ) {
+    $pattern =~ s!\[\:thaimisc\:\]![ฯๆ฿์ํ๎๏ฺ๚๛]!g;
+}
+
+my $qpattern = ( $opt_ignorecase ? qr/$pattern/iou : qr/$pattern/ou );
+#print "pattern \"$pattern\"\n";
+#print "qpattern \"$qpattern\"\n";
+
+# if no file args or just "-", assume stdin
+push @files, "/dev/stdin" if ! @files;
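
The convenience character classes handled in the tgrep hunk are plain text substitutions applied to the pattern before it is compiled. A minimal sketch of that idea follows, written in Python rather than the script's Perl, so it is an illustration and not the committed code. The names THAI_CLASSES, expand_thai_classes and matching_lines are hypothetical. The code point ranges are assumptions taken from the ranges quoted in the diff (U+0E01 to U+0E2E for consonants, U+0E50 to U+0E59 for digits) plus the Thai Unicode block U+0E00 to U+0E7F standing in for Perl's \p{InThai}.

```python
import re

# Hypothetical Python mirror of a few tgrep shorthands; these names do not
# appear in the Perl script and only four of its classes are covered here.
THAI_CLASSES = {
    "[:thai:]": "[\u0e00-\u0e7f]",           # Thai Unicode block
    "[:thaiconsonant:]": "[\u0e01-\u0e2e]",  # ก through ฮ inclusive
    "[:thaidigit:]": "[\u0e50-\u0e59]",      # ๐ through ๙
    "[:thaitonemark:]": "[\u0e48-\u0e4b]",   # the four tone marks
}


def expand_thai_classes(pattern: str) -> str:
    """Replace each shorthand with an explicit character class."""
    for name, char_class in THAI_CLASSES.items():
        pattern = pattern.replace(name, char_class)
    return pattern


def matching_lines(pattern: str, lines: list[str], invert: bool = False) -> list[str]:
    """Return lines that match the expanded pattern, or the rest if invert is set."""
    regex = re.compile(expand_thai_classes(pattern))
    return [line for line in lines if bool(regex.search(line)) != invert]


if __name__ == "__main__":
    sample = ["แมว", "cat", "๑๒๓"]
    print(matching_lines("[:thai:]", sample))
    print(matching_lines("^[:thaidigit:]+$", sample))
    print(matching_lines("[:thai:]", sample, invert=True))
```

Because the shorthand is expanded textually, it can be combined with ordinary regular expression syntax such as anchors and quantifiers, which is how the '^[\x{0e50}-\x{0e59}]+$' example in the usage text and '[:thaidigit:]' relate to each other.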

