[pkgsrc/trunk]: pkgsrc/textproc/miller miller: update to 5.6.2.
user: fcambus <fcambus%pkgsrc.org@localhost>
date: Fri Mar 06 08:18:31 2020 +0000
miller: update to 5.6.2.
#271 fixes a corner-case bug with more than 100 CSV/TSV files with
headers of varying lengths.
The new http://johnkerl.org/miller/doc/whyc-details.html is an
elaboration on http://johnkerl.org/miller/doc/whyc.html which answers
a question posed by @BurntSushi on Reddit a couple years ago which
I did not address in detail at the time.
The only change is that http://johnkerl.org/miller/doc is now
more mobile-friendly. All build artifacts are the same as at
The new system DSL function allows you to run arbitrary shell commands
and store their output in field values. Some example usages are documented
here. This is in response to issues #246 and #209.
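As a rough illustration of the behaviour described above (not Miller's
implementation), the following Python sketch runs a shell command and
stores its standard output, minus the trailing newline, in a field of a
record. The helper name system_value and the sample record are
hypothetical.

```python
import subprocess

def system_value(command):
    """Run a shell command and return its stdout without the trailing newline."""
    # shell=True mirrors "arbitrary shell commands"; never pass untrusted input.
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout.rstrip("\n")

# Hypothetical record: build the command from one field, store output in another.
record = {"a": "hello"}
record["b"] = system_value("echo " + record["a"] + " world")
print(record)
```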
There is now support for ASV and USV file formats. This is in response
to issue #245.
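For readers unfamiliar with these formats, here is a minimal Python
sketch of how such data can be split. It assumes ASV uses the ASCII unit
separator (0x1f) between fields and the ASCII record separator (0x1e)
between records, and that USV uses the Unicode symbols U+241F and U+241E
in the same roles. The function name parse_separated is hypothetical.

```python
# Assumed separators: ASCII control characters for ASV, Unicode symbols for USV.
ASV_FS, ASV_RS = "\x1f", "\x1e"
USV_FS, USV_RS = "\u241f", "\u241e"

def parse_separated(text, fs, rs):
    """Split text into records (lists of fields) using the given separators."""
    return [rec.split(fs) for rec in text.split(rs) if rec]

asv_sample = "a\x1fb\x1fc\x1e1\x1f2\x1f3\x1e"
usv_sample = "a\u241fb\u241e1\u241f2\u241e"
asv_rows = parse_separated(asv_sample, ASV_FS, ASV_RS)
usv_rows = parse_separated(usv_sample, USV_FS, USV_RS)
print(asv_rows, usv_rows)
```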
The new format-values verb allows you to apply numerical formatting
across all record values. This is in response to issue #252.
The new DKVP I/O in Python sample code now works for Python 2 as
well as Python 3.
There is a new cookbook entry on doing multiple joins. This is in
response to issue #235.
The toupper, tolower, and capitalize DSL functions
are now UTF-8 aware, thanks to @sheredom's marvelous
https://github.com/sheredom/utf8.h. The internationalization page
has also been expanded. This is in response to issue #254.
#250 fixes a bug using in-place mode in conjunction with verbs
(such as rename or sort) which take field-name lists as arguments.
#253 fixes a bug in the label verb when one or more names are common
between old and new.
#251 fixes a corner-case bug when (a) input is CSV; (b) the last
field ends with a comma and no newline; (c) input is from standard
input and/or --no-mmap is supplied.
The new positional-indexing feature resolves #236 from @aborruso. You
can now get the name of the 3rd field of each record via $[[3]], and
its value by $[[[3]]]. These are both usable on either the left-hand
or right-hand side of assignment statements, so you can more easily
do things like renaming fields programmatically within the DSL.
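The idea of positional indexing can be sketched in Python with an
insertion-ordered dict standing in for a record. This illustrates the
concept only and is not Miller DSL; the helper names name_at, value_at,
and rename_at are hypothetical, and positions are 1-based to match the
"3rd field" wording above.

```python
def name_at(rec, i):
    """Return the name of the i-th field (1-based)."""
    return list(rec)[i - 1]

def value_at(rec, i):
    """Return the value of the i-th field (1-based)."""
    return list(rec.values())[i - 1]

def rename_at(rec, i, new_name):
    """Return a copy of rec with the i-th field renamed, preserving field order."""
    old = name_at(rec, i)
    return {(new_name if k == old else k): v for k, v in rec.items()}

record = {"a": "1", "b": "2", "c": "3"}
renamed = rename_at(record, 3, "z")
print(name_at(record, 3), value_at(record, 3), renamed)
```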
There is a new capitalize DSL function, complementing the
already-existing toupper. This stems from #236.
There is a new skip-trivial-records verb, resolving #197. Similarly,
there is a new remove-empty-columns verb, resolving #206. Both are
useful for data-cleaning use-cases.
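A small Python sketch of what these two cleaning steps amount to,
assuming a "trivial" record is one with no fields or only empty values,
and an "empty" column is one whose value is empty in every record. The
function names mirror the verbs but are hypothetical stand-ins, not
Miller code.

```python
def skip_trivial_records(records):
    """Drop records that have no fields or whose values are all empty."""
    return [r for r in records if any(v != "" for v in r.values())]

def remove_empty_columns(records):
    """Drop fields whose value is empty in every record."""
    keep = {k for r in records for k, v in r.items() if v != ""}
    return [{k: v for k, v in r.items() if k in keep} for r in records]

records = [
    {"a": "1", "b": "", "c": "3"},
    {"a": "", "b": "", "c": ""},
    {"a": "4", "b": "", "c": ""},
]
cleaned = remove_empty_columns(skip_trivial_records(records))
print(cleaned)
```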
Another pair is #181 and #256. While Miller uses mmap internally
(and invisibly) to get approximately a 20% performance boost over
not using it, this can cause out-of-memory issues with reading either
large files, or too many small ones. Now, Miller automatically avoids
mmap in these cases. You can still use --mmap or --no-mmap if you
want manual control of this.
There is a new --ivar option for the nest verb which complements
the already-existing --evar. This is from #260 thanks to @jgreely.
There is a new keystroke-saving urandrange DSL function:
urandrange(low, high) is the same as low + (high - low) *
urand(). This arose from #243.
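The identity above can be written out directly. In this Python sketch,
random.random() stands in for urand() on the assumption that both return
a uniform value in [0, 1).

```python
import random

def urandrange(low, high):
    """Uniform value between low and high: low + (high - low) * urand()."""
    return low + (high - low) * random.random()

samples = [urandrange(10, 20) for _ in range(1000)]
print(min(samples), max(samples))
```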
There is a new -v option for the cat verb which writes a low-level
record-structure dump to standard error.
There is a new -N option for mlr which is a keystroke-saver for
The new FAQ entry
The new FAQ entry
#244 fixes a documentation issue while highlighting the need for #241.
There was a SEGV using nest within then-chains, fixed in response
Quotes and backslashes weren't being escaped in JSON output with
--jvquoteall; reported on #222.
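To show what correctly escaped output looks like for such values, here
is a short Python sketch using the standard json module. It only
illustrates the expected escaping; it is not Miller's JSON writer, and
the sample value is made up.

```python
import json

# A value containing both double quotes and a backslash.
value = 'say "hi" \\ bye'
encoded = json.dumps({"x": value})
print(encoded)

# Round-tripping recovers the original value only if escaping was done properly.
decoded = json.loads(encoded)["x"]
```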
The new clean-whitespace verb resolves #190 from @aborruso. Along with
the new functions strip, lstrip, rstrip, collapse_whitespace, and
clean_whitespace, there is now both coarse-grained and fine-grained
control over whitespace within field names and/or values. See the
linked-to documentation for examples.
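The relationship between these functions can be sketched in Python as
follows, assuming collapse_whitespace reduces each run of whitespace to
a single space and clean_whitespace is collapse_whitespace followed by
strip. These are illustrative re-implementations, not Miller's code.

```python
import re

def strip(s):
    """Remove leading and trailing whitespace."""
    return s.strip()

def lstrip(s):
    """Remove leading whitespace only."""
    return s.lstrip()

def rstrip(s):
    """Remove trailing whitespace only."""
    return s.rstrip()

def collapse_whitespace(s):
    """Replace each run of whitespace with a single space."""
    return re.sub(r"\s+", " ", s)

def clean_whitespace(s):
    """Collapse internal runs, then strip both ends."""
    return strip(collapse_whitespace(s))

print(repr(clean_whitespace("  a   b  ")))
```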
The new altkv verb resolves #184 which was originally opened via an
email request. This supports mapping value-lists such as a,b,c,d to
alternating key-value pairs such as a=b,c=d.
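The mapping from a,b,c,d to a=b,c=d can be sketched in Python as below.
This sketch only handles an even number of values; how an odd trailing
value is treated is not covered here, so the function raises in that
case.

```python
def altkv(values):
    """Pair up alternating values as key, value, key, value, ..."""
    if len(values) % 2 != 0:
        # Simplification of this sketch: odd-length input is not modelled.
        raise ValueError("this sketch only handles an even number of values")
    return dict(zip(values[0::2], values[1::2]))

pairs = altkv("a,b,c,d".split(","))
line = ",".join(f"{k}={v}" for k, v in pairs.items())
print(line)
```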
The new fill-down verb resolves #189 by @aborruso. See the linked-to
documentation for examples.
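As a rough idea of what filling down means, the Python sketch below
copies the most recent non-empty value of a field into later records
where that field is absent or empty. Treating absent and empty alike is
an assumption of this sketch; the verb's actual options may distinguish
the two.

```python
def fill_down(records, field):
    """Fill missing or empty values of `field` from the last non-empty value seen."""
    last = None
    out = []
    for rec in records:
        rec = dict(rec)  # copy so the input is left untouched
        current = rec.get(field, "")
        if current != "":
            last = current
        elif last is not None:
            rec[field] = last
        out.append(rec)
    return out

records = [
    {"a": "1", "b": "x"},
    {"a": "", "b": "y"},
    {"a": "3", "b": "z"},
    {"b": "w"},
]
filled = fill_down(records, "a")
print([r["a"] for r in filled])
```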
The uniq verb now has a uniq -a which resolves #168 from @sjackman.
The new regextract and regextract_or_else functions resolve #183
The new ssub function arises from #171 by @dohse, as a simplified way
to avoid escaping characters which are special to regular-expression
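The difference between a regex substitution and a plain-string
substitution can be seen in this Python sketch, which uses re.sub and
str.replace as stand-ins and replaces only the first occurrence in both
cases (an assumption about the semantics being compared).

```python
import re

text = "a.b.c"

# In a regex, "." matches any character, so the first character is replaced.
regex_result = re.sub(".", "X", text, count=1)

# With plain-string substitution no character is special: the literal dot is replaced.
literal_result = text.replace(".", "X", 1)

print(regex_result, literal_result)
```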
There are new localtime functions in response to #170 by
@sitaramc. However note that as discussed on #170 these do
not undo one another in all circumstances. This is a non-issue
for timezones which do not do DST. Otherwise, please use with
disclaimers: localdate, localtime2sec, sec2localdate, sec2localtime,
strftime_local, and strptime_local.
Windows build-artifacts are now available in Appveyor at
and will be attached to this and future releases. This resolves #167,
#148, and #109.
Travis builds at https://travis-ci.org/johnkerl/miller/builds now
run on OSX as well as Linux.
An Ubuntu 17 build issue was fixed by @singalen on #164.
put/filter documentation was confusing as reported by @NikosAlexandris
The new FAQ entry
resolves #193 by @aborruso.
The new cookbook entry
arises from #168 from @sjackman.
The unsparsify documentation had some words missing as reported by
@tst2005 on #194.
There was a typo in the cookbook page
as fixed by @tst2005 in #192.
There was a memory leak for TSV-format files only as reported by
@treynr on #181.
Dollar signs in regular expressions were not being escaped properly
as reported by @dohse on #171.
Comment strings in data files: mlr --skip-comments allows
you to filter out input lines starting with #, for all file
formats. Likewise, mlr --skip-comments-with X lets you specify
the comment-string X. Comments are only supported at the start of a data
line. mlr --pass-comments and mlr --pass-comments-with X allow you
to forward comments to program output as they are read.
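The two behaviours described above can be sketched in Python: one
function drops lines that begin with the comment string, the other
separates them so they can be forwarded as-is. The function names are
hypothetical, and only start-of-line comments are recognised, matching
the restriction stated above.

```python
def skip_comments(lines, prefix="#"):
    """Return only the data lines, dropping lines that start with the prefix."""
    return [ln for ln in lines if not ln.startswith(prefix)]

def pass_comments(lines, prefix="#"):
    """Split lines into (data, comments) so comments can be forwarded unchanged."""
    data = [ln for ln in lines if not ln.startswith(prefix)]
    comments = [ln for ln in lines if ln.startswith(prefix)]
    return data, comments

lines = ["# header note", "a=1,b=2", "a=3,b=4 # not a comment", "// other style"]
data_only = skip_comments(lines)
data, comments = pass_comments(lines, prefix="//")
print(data_only, comments)
```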
The count-similar verb lets you compute cluster sizes by cluster
While Miller DSL arithmetic gracefully overflows from 64-bit integer
to double-precision float (see also here), there are now the
integer-preserving arithmetic operators .+ .- .* ./ .// for those
times when you want integer overflow.
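The contrast between the two kinds of addition can be sketched in
Python, whose integers are unbounded, by emulating a signed 64-bit
range. The wraparound shown for the integer-preserving case assumes
two's-complement behaviour; the function names are hypothetical.

```python
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1

def add_auto(a, b):
    """Integer add that falls back to float when the result leaves the int64 range."""
    total = a + b
    return total if INT64_MIN <= total <= INT64_MAX else float(total)

def add_int_preserving(a, b):
    """Integer add that wraps around in 64-bit two's complement (assumed)."""
    total = (a + b) & (2**64 - 1)
    return total - 2**64 if total > INT64_MAX else total

print(add_auto(INT64_MAX, 1), add_int_preserving(INT64_MAX, 1))
```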
There is a new bitcount function: for example, echo x=0xf0000206 |
mlr put '$y=bitcount($x)' produces x=0xf0000206,y=7.
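The same count can be checked independently with a short Python sketch
that counts the one-bits of the value, masked to 64 bits (the mask is an
assumption that only matters for negative inputs).

```python
def bitcount(x):
    """Count the number of 1 bits in the low 64 bits of x."""
    return bin(x & (2**64 - 1)).count("1")

y = bitcount(0xf0000206)
print(y)
```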
Issue 158: mlr -T is an alias for --nidx --fs tab, and mlr -t is an
alias for mlr --tsvlite.
The mathematical constants π and e have been renamed from PI and
E to M_PI and M_E, respectively. (It's annoying to get a syntax
error when you try to define a variable named E in the DSL, when
A through D work just fine.) This is a backward incompatibility,
but not enough of one to justify calling this release Miller 6.0.0.
As noted here, while Miller has its own DSL there will always be
things better expressible in a general-purpose language. The new page
Sharing data with other languages shows how to seamlessly share data
back and forth between Miller, Ruby, and Python. SQL-input examples
and SQL-output examples contain detailed information on the interplay
between Miller and SQL.
Issue 150 raised a question about suppressing numeric conversion. This
resulted in a new FAQ entry How do I suppress numeric conversion?,
as well as the longer-term follow-on issue 151 which will make
numeric conversion happen on a just-in-time basis.
To my surprise, csvlite format options weren't listed in mlr --help
or the manpage. This has been fixed.
Documentation for auxiliary commands has been expanded, including
within the manpage.
Issue 159 fixes regex-match of literal dot.
Issue 160 fixes out-of-memory cases for huge files. This is an old
bug, as old as Miller, and is due to inadequate testing of huge-file
cases. The problem is simple: Miller prefers memory-mapped I/O
(using mmap) over stdio since mmap is fractionally faster. Yet as
any processing (even mlr cat) steps through an input file, more and
more pages are faulted in -- and, unfortunately, previous pages are
not paged out once memory pressure increases. (This despite gallant
attempts with madvise.) Once all processing is done, the memory is
released; there is no leak per se. But the Miller process can crash
before the entire file is read. The solution is equally simple: to
prefer stdio over mmap for files over 4GB in size. (This 4GB threshold
is tunable via the --mmap-below flag as described in the manpage.)
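The size-based choice described above can be sketched in Python. This is
an illustration of the policy only, not Miller's C code: the constant
MMAP_BELOW_BYTES and the function names are hypothetical, with the
default mirroring the 4GB threshold.

```python
import mmap
import os

MMAP_BELOW_BYTES = 4 * 1024**3  # hypothetical default mirroring the 4GB threshold

def choose_reader(size_bytes, threshold=MMAP_BELOW_BYTES):
    """Prefer mmap for files below the threshold, plain buffered reads otherwise."""
    return "mmap" if size_bytes < threshold else "stdio"

def read_all(path, threshold=MMAP_BELOW_BYTES):
    """Read a whole file using whichever method the size policy selects."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        # An empty file cannot be memory-mapped, so it always takes the plain path.
        if size > 0 and choose_reader(size, threshold) == "mmap":
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
                return m[:]
        return f.read()

print(choose_reader(1024), choose_reader(5 * 1024**3))
```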
Issue 161 fixes a CSV-parse error (with error message "unwrapped
double quote at line 0") when a CSV file starts with the UTF-8
byte-order-mark ("BOM") sequence 0xef 0xbb 0xbf and the header line
has double-quoted fields. (Release 5.2.0 introduced handling for
UTF-8 BOMs, but missed the case of double-quoted header line.)
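The case described above, a BOM immediately followed by a double-quoted
header, can be reproduced in a Python sketch that strips the three BOM
bytes before handing the data to a CSV parser. It uses Python's csv
module for illustration and is unrelated to Miller's own parser; the
function name read_csv_bytes is hypothetical.

```python
import csv
import io

UTF8_BOM = b"\xef\xbb\xbf"

def read_csv_bytes(data):
    """Strip a leading UTF-8 BOM, if present, then parse the rest as CSV."""
    if data.startswith(UTF8_BOM):
        data = data[len(UTF8_BOM):]
    return list(csv.reader(io.StringIO(data.decode("utf-8"), newline="")))

# BOM followed directly by a double-quoted header line.
rows = read_csv_bytes(b'\xef\xbb\xbf"a","b"\n1,2\n')
print(rows)
```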
Issue 162 fixes a corner case doing multi-emit of aggregate variables
when the first variable name is a typo.
The Miller JSON parser used to error with Unable to parse JSON data:
Line 1 column 0: Unexpected 0x00 when seeking value on empty input,
or input with trailing whitespace; this has been fixed.
textproc/miller/Makefile | 4 ++--
textproc/miller/distinfo | 10 +++++-----
2 files changed, 7 insertions(+), 7 deletions(-)
diffs (27 lines):
diff -r efb80305ae11 -r 4b45d5579461 textproc/miller/Makefile
--- a/textproc/miller/Makefile Thu Mar 05 20:49:22 2020 +0000
+++ b/textproc/miller/Makefile Fri Mar 06 08:18:31 2020 +0000
@@ -1,6 +1,6 @@
-# $NetBSD: Makefile,v 1.15 2019/03/28 23:52:09 leot Exp $
+# $NetBSD: Makefile,v 1.16 2020/03/06 08:18:31 fcambus Exp $
diff -r efb80305ae11 -r 4b45d5579461 textproc/miller/distinfo
--- a/textproc/miller/distinfo Thu Mar 05 20:49:22 2020 +0000
+++ b/textproc/miller/distinfo Fri Mar 06 08:18:31 2020 +0000
@@ -1,6 +1,6 @@
-$NetBSD: distinfo,v 1.14 2017/08/14 21:22:55 wiz Exp $
+$NetBSD: distinfo,v 1.15 2020/03/06 08:18:31 fcambus Exp $
-SHA1 (mlr-5.2.2.tar.gz) = 1b130238401ae30096d984961af0e1f88d583a1a
-RMD160 (mlr-5.2.2.tar.gz) = 8147e4ff0a7125ece80246b35e0b54c1c8c50833
-SHA512 (mlr-5.2.2.tar.gz) = 1f6843fb08e3e3c59912e673636fc7d52246ab9a49a0df25c4b11a17ed7576e0c27e10c06f164a9df8e4b30d8f1715088161187b8126fecc84ef50774dcf7b93
-Size (mlr-5.2.2.tar.gz) = 1191162 bytes
+SHA1 (mlr-5.6.2.tar.gz) = 4a3fb995a65a9960bb2e53bd565081d491aba8b1
+RMD160 (mlr-5.6.2.tar.gz) = 51e6d16ca6d012e47d8cad29d643c7da943a0535
+SHA512 (mlr-5.6.2.tar.gz) = d5c984c1db045219c79564251193ec4887582987cde980df6705e10e246d230d92fd9197e2c207545133f96e7cd292fc1eb494e8c57384d6ba0a90a83c4f1dd9
+Size (mlr-5.6.2.tar.gz) = 1280257 bytes