pkgsrc-Changes-HG archive


[pkgsrc/trunk]: pkgsrc/textproc/miller Updated miller to 5.0.0.



details:   https://anonhg.NetBSD.org/pkgsrc/rev/01ad47251b59
branches:  trunk
changeset: 359262:01ad47251b59
user:      wiz <wiz%pkgsrc.org@localhost>
date:      Sun Mar 05 12:37:30 2017 +0000

description:
Updated miller to 5.0.0.

Autodetected line-endings, in-place mode, user-defined functions, and more

This major release significantly expands the expressiveness of the DSL for mlr put and mlr filter. (The upcoming 5.1.0 release will add the ability to aggregate across all columns for non-DSL verbs 
such as mlr stats1 and mlr stats2. As well, a Windows port is underway.)

Please also see the Miller main docs.

Simple but impactful features:

    Line endings (CRLF vs. LF, Windows-style vs. Unix-style) are now autodetected. For example, files (including CSV) with LF input will lead to LF output unless you specify otherwise.
    There is now an in-place mode using mlr -I.

Major DSL features:

    You can now define your own functions and subroutines: e.g. func f(x, y) { return x**2 + y**2 }.
    New local variables are completely analogous to out-of-stream variables: sum retains its value for the duration of the expression it's defined in; @sum retains its value across all records in the 
record stream.
    Local variables, function parameters, and function return types may be defined untyped or typed as in x = 1 or int x = 1, respectively. There are also expression-inline type-assertions available. 
Type-checking is up to you: omit it if you want flexibility with heterogeneous data; use it if you want to help catch misspellings in your DSL code or unexpected irregularities in your input data.
    There are now four kinds of maps. Out-of-stream variables have always been scalars, maps, or multi-level maps: @a=1, @b[1]=2, @c[1][2]=3. The same is now true for local variables, which are new 
to 5.0.0. Stream records have always been single-level maps; $* is a map. And as of 5.0.0 there are now map literals, e.g. {"a":1, "b":2}, which can be defined using JSON-like syntax (with either 
string or integer keys) and which can be nested arbitrarily deeply.
    You can loop over maps -- $*, out-of-stream variables, local variables, map-literals, and map-valued function return values -- using for (k, v in ...) or the new for (k in ...) (discussed next). 
All flavors of map may also be used in emit and dump statements.
    User-defined functions and subroutines may take map-valued arguments, and may return map values.
    Some built-in functions now accept map-valued input: typeof, length, depth, leafcount, haskey. There are built-in functions producing map-valued output: mapsum and mapdiff. There are now 
string-to-map and map-to-string functions: splitnv, splitkv, splitnvx, splitkvx, joink, joinv, and joinkv.
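
The difference in lifetime between a local variable such as sum and an out-of-stream variable such as @sum can be sketched outside Miller. The following is a plain-Python emulation, not Miller code and not Miller's implementation; the function name process and the output fields row_sum and running_sum are hypothetical names chosen for illustration.

```python
# Plain-Python emulation of the scoping rule; this is not Miller code.
def process(records):
    oosvars = {"sum": 0}      # stands in for @sum: shared by every record
    output = []
    for record in records:
        local_sum = 0         # stands in for a local `sum`: fresh for each record
        for value in record.values():
            local_sum += value
        oosvars["sum"] += local_sum
        output.append({**record,
                       "row_sum": local_sum,
                       "running_sum": oosvars["sum"]})
    return output

result = process([{"x": 1, "y": 2}, {"x": 3, "y": 4}])
```

Here row_sum restarts from zero for each record, while running_sum keeps accumulating across the whole stream.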

Minor DSL features:

    For iterating over maps (namely, local variables, out-of-stream variables, stream records, map literals, or return values from map-valued functions) there is now a key-only for-loop syntax: e.g. 
for (k in $*) { ... }. This is in addition to the already-existing for (k, v in ...) syntax.
    There are now triple-statement for-loops (familiar from many other languages), e.g. for (int i = 0; i < 10; i += 1) { ... }.
    mlr put and mlr filter now accept multiple -f for script files, freely intermixable with -e for expressions. The suggested use case is putting user-defined functions in script files and 
one-liners calling them using -e. Example: myfuncs.mlr defines the function f(...), then mlr put -f myfuncs.mlr -e '$o = f($i)' myfile.dat. More information is here.
    mlr filter is now almost identical to mlr put: it can have multiple statements, it can use begin and/or end blocks, it can define and invoke functions. Its final expression must evaluate to 
boolean which is used as the filter criterion. More details are here.
    The min and max functions are now variadic: $o = max($a, $b, $c).
    There is now a substr function.
    While ENV has long provided read-access to environment variables on the right-hand side of assignments (as a getenv), it now can be at the left-hand side of assignments (as a putenv). This is 
useful for subsidiary processes created by tee, emit, dump, or print when writing to a pipe.
    Comments introduced by # are now recognized in the lexer, so you can now (correctly) include # in strings.
    Separators are now available as read-only variables in the DSL: IPS, IFS, IRS, OPS, OFS, ORS. These are particularly useful with the split and join functions: e.g. with mlr --ifs tab ..., the IFS 
variable within a DSL expression will evaluate to a string containing a tab character.
    Syntax errors in DSL expressions now have a little more context.
    DSL parsing and execution are a bit more transparent. There have long been -v and -t options to mlr put and mlr filter, which print the expression's abstract syntax tree and do a low-level parser 
trace, respectively. There are now additionally -a which traces stack-variable allocation and -T which traces statements line by line as they execute. While -v, -t, and -a are most useful for 
development of Miller, the -T option gives you more visibility into what your Miller scripts are doing. See also here.

Verbs:

    most-frequent and least-frequent as requested in #110.
    seqgen makes it easy to generate data from within Miller: please also see here for a usage example.
    unsparsify makes it easy to rectangularize data where not all records have the same fields.
    cat -n now takes a group-by (-g) option, making it easy to number records within categories.
    count-distinct, uniq, most-frequent, least-frequent, top, and histogram now take a -o option for specifying their output field names, as requested in #122.
    median is now a synonym for p50 in stats1.
    You can now start a then chain with an initial then, which is nice in backslashy/multiline-continuation contexts. This was requested in #130.
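
As a rough illustration of what "rectangularizing" means for unsparsify, here is a plain-Python sketch, not Miller's implementation. It assumes that missing fields are filled with an empty string and that fields are ordered by first appearance; both are assumptions made for this example, and the function signature is hypothetical.

```python
# Plain-Python sketch of rectangularizing records; not Miller's implementation.
def unsparsify(records, fill=""):
    # First pass: collect every field name in first-seen order.
    keys = []
    seen = set()
    for record in records:
        for key in record:
            if key not in seen:
                seen.add(key)
                keys.append(key)
    # Second pass: give each record every field, filling the gaps.
    return [{key: record.get(key, fill) for key in keys} for record in records]

rows = unsparsify([{"a": 1, "b": 2}, {"b": 3, "c": 4}])
```

After this step every record carries the same fields (a, b, c), which is what tabular output formats need.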

I/O options:

    The print statement may now be used with no arguments, which prints a newline, and a no-argument printn prints nothing but creates a zero-length file in redirected-output context.
    Pretty-print format now has a --pprint --barred option (for output only, not input). For an example, please see here.
    There are now keystroke-savers of the form --c2p which abbreviate --icsvlite --opprint, and so on.
    Miller's map literals are JSON-looking but allow integer keys which JSON doesn't. The --jknquoteint and --jvquoteall flags for mlr (when using JSON output) and mlr put (for dump) provide control over double-quoting behavior.

Documents new since the previous release:

    Miller in 10 minutes is a long-overdue addition: while Miller's detailed documentation is evident, there has been a lack of more succinct examples.
    The cookbook has likewise been expanded, and has been split out into three parts: part 1, part 2, part 3.
    A bit more background on C performance compared to other languages I experimented with, early on in the development of Miller, is here.

On-line help:

    Help for DSL built-in functions, DSL keywords, and verbs is accessible using mlr -f, mlr -k, and mlr -l respectively; name-only lists are available with mlr -F, mlr -K, and mlr -L.

Bugfixes:

    A corner-case bug causing a segmentation violation on two sub/gsub statements within a single put, the first one matching its pattern and the second one not matching its pattern, has been fixed.

Backward incompatibilities: This is Miller 5.0.0, not 4.6.0, due to the following (all relatively minor):

    The v variables bound in for-loops such as for (k, v in some_multi_level_map) { ... } can now be map-valued if the v specifies a non-terminal in the map.
    There are new keywords such as var, int, float, num, str, bool, map, IPS, IFS, IRS, OPS, OFS, ORS which can no longer be used as variable names. See mlr -k for the complete list.
    Unset of the last key in a map-valued variable's map level no longer removes the level: e.g. with @v[1][2]=3 and unset @v[1][2], the @v variable would previously have been empty. As of 5.0.0, @v has key 1 with an 
empty-map value.
    There is no longer type-inference on literals: "3"+4 no longer gives 7. (That was never a good idea.)
    The typeof function used to say things like MT_STRING; now it says things like string.
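
The change in unset behavior for the @v[1][2]=3 example can be mimicked with nested Python dicts. This is an emulation of the two behaviors as described in the list above, not Miller code; the function names are hypothetical.

```python
# Emulation with nested dicts; function names are hypothetical.
def unset_before_5_0_0(v, k1, k2):
    del v[k1][k2]
    if not v[k1]:       # old behavior: an emptied level is removed as well
        del v[k1]
    return v

def unset_as_of_5_0_0(v, k1, k2):
    del v[k1][k2]       # new behavior: the emptied level stays as an empty map
    return v

old = unset_before_5_0_0({1: {2: 3}}, 1, 2)
new = unset_as_of_5_0_0({1: {2: 3}}, 1, 2)
```

In the old behavior the variable ends up empty; in the new one it still has key 1, mapped to an empty map.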

diffstat:

 textproc/miller/Makefile |   4 ++--
 textproc/miller/distinfo |  10 +++++-----
 2 files changed, 7 insertions(+), 7 deletions(-)

diffs (27 lines):

diff -r 59876e9b2ff7 -r 01ad47251b59 textproc/miller/Makefile
--- a/textproc/miller/Makefile  Sun Mar 05 12:33:45 2017 +0000
+++ b/textproc/miller/Makefile  Sun Mar 05 12:37:30 2017 +0000
@@ -1,6 +1,6 @@
-# $NetBSD: Makefile,v 1.8 2016/09/01 16:25:51 wiz Exp $
+# $NetBSD: Makefile,v 1.9 2017/03/05 12:37:30 wiz Exp $
 
-DISTNAME=      mlr-4.5.0
+DISTNAME=      mlr-5.0.0
 PKGNAME=       ${DISTNAME:S/mlr/miller/}
 CATEGORIES=    devel
 MASTER_SITES=  ${MASTER_SITE_GITHUB:=johnkerl/}
diff -r 59876e9b2ff7 -r 01ad47251b59 textproc/miller/distinfo
--- a/textproc/miller/distinfo  Sun Mar 05 12:33:45 2017 +0000
+++ b/textproc/miller/distinfo  Sun Mar 05 12:37:30 2017 +0000
@@ -1,6 +1,6 @@
-$NetBSD: distinfo,v 1.9 2016/09/01 16:25:51 wiz Exp $
+$NetBSD: distinfo,v 1.10 2017/03/05 12:37:30 wiz Exp $
 
-SHA1 (mlr-4.5.0.tar.gz) = 8d1cb1c1b32790b92c404e893b2b66659238d0b6
-RMD160 (mlr-4.5.0.tar.gz) = c9f05de18c9f9ecb8004ef332ad995efbd5c6793
-SHA512 (mlr-4.5.0.tar.gz) = 31b1c44b03b36d9ed98986ab6d01afdf5d74e36917d40235bb00ed0294ab83c254081f81e7ed2ef74616549ea54cbd08cb513e91dbf24d22913dba4db43fce55
-Size (mlr-4.5.0.tar.gz) = 1010180 bytes
+SHA1 (mlr-5.0.0.tar.gz) = 8cf41235ac550d6a8ab82ad55f12f807a8eb30d0
+RMD160 (mlr-5.0.0.tar.gz) = bf83f414d892fdee33a1aa76d0e175a73a16c6f4
+SHA512 (mlr-5.0.0.tar.gz) = 3c0cae5447b2135cb9097ca80a726e3372391e50a974b0bbe90261a020d62d3b99f58405c480b411a88eee08e20e7ef30feb34eb9eb86a8ce3c9aee833660d8b
+Size (mlr-5.0.0.tar.gz) = 1143163 bytes


