tech-userlevel archive


Proposed addition of strcodecs(3) library - review requested



This is a request for peer review of a new library, strcodecs(3),
proposed for addition to the repository.  It provides a consistent
framework for many of the string and data transformations we already
have, and adds many new ones.

The latest version can be found at:

        http://www.netbsd.org/~agc/strcodecs-20100917.tgz

For anyone who has used Python, this is similar to its codecs module.
Because it is a library, it can be used from other scripting languages,
as well as from the command line via strcodecs(1).

I've appended the manual pages to this email.

I've also included a minimal version of an atf test in here - it's my
first one, please be gentle - and the full tests are in the
strcodecs(1) Makefile.

Some examples of their use:

# randomise 653 bytes, and get the base64 representation of them
% strcodecs randomise 653 | strcodecs btoa 
Z8Zpc1H/Suwpzbqr8vvjRnzCVPgb6OeNdlouYzOfyZpmMg23MVijWiVdBRdY6V7Uq7LNxpu0
VBEOgnRBIT3ch3DpPqFB4fxnPgF+l+rca5aPOFwq7LA7+zKvPFTsGNtcAhr+Q/v6qjr7KdHm
BTx8lHXYvmGJ+Vy7qJkPlbHr8bMF7/cA6aE65coLy9BIR2S9HyMeqBx7ZMUUc1rFXkt5Yztw
ZCQRngncqtSs8hsQrzszzeNQSEcVXLtvIhm6m331C+EaHH8j+Cn4pBsTtcpO6JgyOOB5TT00
vF9Od/rLbAWshiErqhpVor5wtXM7BFzTNpSzr+Lw5J5PMhVJ/YJOqQhw1LKKKVRImgq81Q4Y
qESsW/OOTNctmwlC5QbEM6/No4R/La3UdkfeMhzsSsQw9iAjhWz7sgcE9OwLuSC6hsM+BfHs
2Wczt5lQo+MU09k0916g8hCo9gWUAb60vER4+klp5iPQGtppan5MflEls0iEUzqU+zGZkDJX
RO6bvOnlJc8I9eniXlNgqtKy0IX6VNg16NRmgmSY2aiHdWVwWoo/YoApRN58pYlOV1nTUa2s
hpWA7BfkhfGMDGbxfMB8uyL85GbaYQtjr2K8g7RpLzr/rycWk6wHH7htETQtje9PidS2YzXB
x+Qkg2fY7ZYS7EU5AtjlCviddwnRpZbB9B+VqoLKbEmukM0WaLqseqbytKjKmbLCNyrLCM9h
ycOAXm4DKNpM12oZ7dLTmUx5iwAiVprUGNH+5NnNRaORxgH/ySrZFQFDL+4VAodhfBNinmn8
coHNcWWmPqtJz3FLzjp1p0926n5k/4HrYf3+w5tnvw3pjH5OMr35fIxqx1ukPAL0su1yFuzz
AU3wABA=
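
For readers unfamiliar with base64, the 3-byte to 4-character grouping
can be sketched in C roughly as follows.  This is a standalone
illustration using the standard RFC 4648 alphabet, not the strcodecs
source; the name b64_encode_sketch is made up for the example, and the
sketch does not wrap its output into lines the way the command above
does.

```c
#include <assert.h>
#include <stddef.h>

/* Standard base64 alphabet (RFC 4648). */
static const char b64_alphabet[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/*
 * Hypothetical helper, not part of strcodecs: encode `insize' bytes
 * from `in' into `out' as base64 and NUL-terminate the result.
 * Returns the number of characters written (excluding the NUL),
 * or -1 if `outsize' is too small.
 */
int
b64_encode_sketch(const unsigned char *in, size_t insize,
    char *out, size_t outsize)
{
	size_t needed = 4 * ((insize + 2) / 3);
	size_t i, o = 0;
	unsigned long v;

	assert(in != NULL && out != NULL);
	if (outsize < needed + 1)
		return -1;
	/* full groups: 3 input bytes become 4 output characters */
	for (i = 0; i + 2 < insize; i += 3) {
		v = ((unsigned long)in[i] << 16) |
		    ((unsigned long)in[i + 1] << 8) | in[i + 2];
		out[o++] = b64_alphabet[(v >> 18) & 0x3f];
		out[o++] = b64_alphabet[(v >> 12) & 0x3f];
		out[o++] = b64_alphabet[(v >> 6) & 0x3f];
		out[o++] = b64_alphabet[v & 0x3f];
	}
	/* 1 or 2 leftover bytes: pad the final group with '=' */
	if (i < insize) {
		v = (unsigned long)in[i] << 16;
		if (i + 1 < insize)
			v |= (unsigned long)in[i + 1] << 8;
		out[o++] = b64_alphabet[(v >> 18) & 0x3f];
		out[o++] = b64_alphabet[(v >> 12) & 0x3f];
		out[o++] = (i + 1 < insize) ?
		    b64_alphabet[(v >> 6) & 0x3f] : '=';
		out[o++] = '=';
	}
	out[o] = '\0';
	return (int)o;
}
```

With 653 input bytes there are 217 full groups and 2 leftover bytes,
so the output is 872 characters long and ends in a single '='.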

# metaphone calculation
% echo crooks | strcodecs metaphone
KRKS
% echo crux | strcodecs metaphone
KRKS
%

# on-the-fly gzip compression and base85 encoding, then the reverse, to check the round trip
% strcodecs gzip Makefile | strcodecs base85encode | strcodecs base85decode | 
strcodecs -o m gunzip 
% diff Makefile m
%
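
The 4-byte to 5-character grouping behind base85 can be sketched in C
as below.  The base85 variant that strcodecs implements is not stated
here, so the sketch assumes the Adobe Ascii85 alphabet ('!' through
'u'); it handles one full group only, with no padding and no 'z'
shortcut, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: encode one 4-byte group as 5 Ascii85 digits.
 * The group is read as a big-endian 32-bit number and written in
 * base 85, most significant digit first.
 */
void
a85_encode_group(const unsigned char in[4], char out[5])
{
	uint32_t v;
	int i;

	assert(in != NULL && out != NULL);
	v = ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
	    ((uint32_t)in[2] << 8) | in[3];
	for (i = 4; i >= 0; i--) {
		out[i] = (char)('!' + v % 85);
		v /= 85;
	}
}

/*
 * Hypothetical helper: decode 5 Ascii85 digits back into 4 bytes.
 * Returns 0 on success, or -1 if a character is outside '!'..'u'
 * or the group does not fit in 32 bits.
 */
int
a85_decode_group(const char in[5], unsigned char out[4])
{
	uint64_t v = 0;
	int i;

	assert(in != NULL && out != NULL);
	for (i = 0; i < 5; i++) {
		if (in[i] < '!' || in[i] > 'u')
			return -1;
		v = v * 85 + (uint64_t)(in[i] - '!');
	}
	if (v > UINT32_MAX)
		return -1;
	out[0] = (unsigned char)(v >> 24);
	out[1] = (unsigned char)(v >> 16);
	out[2] = (unsigned char)(v >> 8);
	out[3] = (unsigned char)v;
	return 0;
}
```

Decoding what was just encoded gives back the original 4 bytes, which
is the same property the diff above checks for the whole pipeline.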

Many more transformations are available.

If I can answer any questions, please let me know.

Thanks,
Alistair
.\" $NetBSD: libnetpgp.3,v 1.14 2010/06/18 00:20:28 agc Exp $
.\"
.\" Copyright (c) 2010 Alistair Crooks <agc%NetBSD.org@localhost>
.\" All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
.\" IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
.\" OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
.\" IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
.\" INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
.\" NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
.\" DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
.\" THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
.\" (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
.\" THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
.\"
.Dd September 11, 2010
.Dt STRCODECS 3
.Os
.Sh NAME
.Nm strcodecs
.Nd string coding and decoding functions for transforming data
.Sh LIBRARY
.Lb libstrcodecs
.Sh SYNOPSIS
.In strcodecs.h
.Pp
.Ft int
.Fo strcodec
.Fa "strcodecs_t *codecs" "const char *in" "const size_t insize"
.Fa "const char *operation" "void *out" "size_t outsize"
.Fc
.Ft int
.Fo astrcodec
.Fa "strcodecs_t *codecs" "const char *in" "const size_t insize"
.Fa "const char *operation" "void *out" "size_t *outsize"
.Fc
.Ft int
.Fo ipstrcodec
.Fa "strcodecs_t *codecs" "void *input" "int size"
.Fa "const char *operation"
.Fc
.Ft int
.Fo strcodec_size
.Fa "strcodecs_t *codecs" "const char *operation" "const unsigned insize"
.Fc
.Ft int
.Fo strcodec_input_needed
.Fa "strcodecs_t *codecs" "const char *operation"
.Fc
.Ft int
.Fo strcodecs_begin
.Fa "strcodecs_t *codecs" "const char *subset" "..."
.Fc
.Ft int
.Fo strcodecs_lockdown
.Fa "strcodecs_t *codecs"
.Fc
.Ft int
.Fo strcodecs_add
.Fa "strcodecs_t *codecs" "const char *operation"
.Fa "int (*)(const char *, const size_t, const char *, void *, size_t)"
.Fa "const char *multiplier" "const int input_needed"
.Fc
.Sh DESCRIPTION
.Nm
is a library interface which implements various transformations from
input data to output data.
Text is transformed by the
.Nm
library, converting the input to the output format.
New transformations can be added to the table.
The table can also be locked to prevent further transformations being added.
Many of these transformations are already available at the system level.
However,
.Nm
provides a single, consistent interface to the transformations,
in a way that is easy to provide as an interface for scripting
languages and from the shell.
.Pp
The basic way of using the
.Nm
library is to call the
.Fn strcodec
function to transform the text.
Two alternative functions are also provided.
.Fn astrcodec
dynamically allocates the space for the output array
using
.Xr calloc 3 .
In-place transformations can be made using the
.Fn ipstrcodec
function.
An
.Dq in-place
transformation means that the transformation will be done
using temporary storage which is allocated, and then the
transformed text will be copied over the original input,
thereby making the operation appear to have transformed the
text in situ.
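.Pp
As a rough illustration of the in-place approach described above,
the following C sketch runs a transformation into temporary storage
and then copies the result back over the input.
It is not taken from the library source:
the names inplace_sketch, transform_fn and upper_transform are
hypothetical, and the sketch assumes that the output is never
larger than the input buffer.
.Bd -literal
```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical transform signature, used only for this sketch. */
typedef int (*transform_fn)(const char *in, size_t insize,
    char *out, size_t outsize);

/* Example transform: copy the input, upper-casing each byte. */
int
upper_transform(const char *in, size_t insize, char *out, size_t outsize)
{
	size_t i;

	if (outsize < insize)
		return -1;
	for (i = 0; i < insize; i++)
		out[i] = (char)toupper((unsigned char)in[i]);
	return (int)insize;
}

/*
 * Run the transform into calloc'ed temporary storage, then copy the
 * result back over the caller's buffer.
 * Returns the number of bytes written, or -1 on failure.
 */
int
inplace_sketch(char *buf, size_t size, transform_fn fn)
{
	char *tmp;
	int n;

	assert(buf != NULL && fn != NULL);
	if ((tmp = calloc(1, size ? size : 1)) == NULL)
		return -1;
	n = fn(buf, size, tmp, size);
	if (n >= 0 && (size_t)n <= size)
		memcpy(buf, tmp, (size_t)n);
	else
		n = -1;
	free(tmp);
	return n;
}
```
.Ed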
.Pp
The transformation table holding information on all the possible
transformations can be initialised using
the
.Fn strcodecs_begin
function.
The function can be used to limit the transformations which get
loaded into the transformation table.
At the present time, the following subsets of transformations are
defined:
.Bl -tag -width network
.It all
will load all the following subsets of transformations
.It charset
will load all the transformations relating to character sets,
including base64 and base85, EBCDIC,
RAD50, etc.
.It digest
will load all the transformations relating to message digests,
including md5, sha1, etc
.It fill
will load all the transformations relating to region fill,
including zero and randomise
.It format
will load all the transformations relating to formatting of output,
such as hexadecimal dumping, rotation, etc
.It edit
will load all the transformations relating to editing of output,
such as sed and edit functionality
.It hash
will load all the transformations relating to 32-bit hashing.
.It network
will load all the transformations relating to network name
resolution
.El
.Pp
It is not necessary to call this function prior to using any of
the functionality in the
.Nm
library -- if the table has not been initialised by the time of the first
transformation call, then
.Fn strcodecs_begin
will be called automatically.
.Pp
The internal transformation information carries information on the worst-case
size of the output array.
This size can be calculated using the
.Fn strcodec_size
function, passing into the function the size of the input buffer.
The
.Fn strcodec_input_needed
function will return an indication whether an input buffer is needed.
Please note that an input buffer is needed for the
.Fn ipstrcodec
.Dq in-place
transformation call.
The
.Fn strcodec_valid_op
function is used to verify that the current operation is a known
transformation.
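.Pp
The worst-case calculation can be pictured with a small C sketch.
How the library stores its size multiplier is not described here,
so the helper below is only an assumption-laden illustration:
worst_case_size is a hypothetical name, and the sketch covers only
transformations that turn each group of input bytes, or part of a
group, into a fixed number of output bytes.
.Bd -literal
```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: worst-case output size when every `group_in'
 * input bytes (or a final partial group) become `group_out' output
 * bytes, for example 3 -> 4 for base64 or 4 -> 5 for base85.
 */
size_t
worst_case_size(size_t insize, size_t group_in, size_t group_out)
{
	assert(group_in > 0);
	/* round the number of groups up, then scale */
	return ((insize + group_in - 1) / group_in) * group_out;
}
```
.Ed
.Pp
For a 653-byte input and base64 grouping this gives 872 bytes;
a caller would typically add room for any terminator or line breaks
that the chosen transformation emits.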
.Pp
There are a number of pre-defined transformations provided:
.Bl -tag -width rad50decode
.It asa
.Op format
perform Fortran control character transformations in the
form of the POSIX
.Xr asa 1
command.
.It ascii2ebcdic
.Op charset
convert the input from ASCII character encodings to
EBCDIC character encodings.
.It base64decode
.Op charset
perform atob, or base64, decoding.
Each sequence of 4 bytes is transformed back into a 3 byte
sequence.
.It base64encode
.Op charset
perform btoa, or base64, encoding.
Each sequence of 3 bytes is transformed into a 4 byte
sequence from the pre-defined 64-byte set.
.It base85decode
.Op charset
perform base85 decoding.
Each sequence of 5 bytes is transformed back into a 4 byte
sequence.
.It base85encode
.Op charset
perform base85 encoding.
Each sequence of 4 bytes is transformed into a 5 byte
sequence from the pre-defined 85-byte set.
.It bin2hex
.Op charset
encodes the input string as 4-character C-string style
hexadecimal constants.
.It bswap16
.Op format
perform a bytewise swap of the 16-bit entity
.It bswap32
.Op format
perform a bytewise swap of the 32-bit entity
.It bswap64
.Op format
perform a bytewise swap of the 64-bit entity
.It dos2unix
.Op format
DOS style line-endings are transformed into Unix style
line-endings.
.It ebcdic2ascii
.Op charset
convert the input from EBCDIC character encodings to
ASCII character encodings.
.It edit
.Op edit
edit the input text with the
.Dq EDITOR
or
.Dq VISUAL
editor, as defined in the environment.
.It from-uri
.Op charset
convert from a percent-encoded URI to ASCII text.
.It full-uuencode
.Op charset
convert the given text into uuencoded text (see also
the uuencode and uudecode transforms), adding a file header
and trailer.
.It gethostinfo
.Op network
attempt to reverse resolve the hostname,
given the IP address (either IPv4 or IPv6) as input.
.It getipaddress
.Op network
attempt to resolve the IP address (both IPv4 and IPv6)
given the hostname as input.
.It gunzip
.Op compress
decompress the input buffer using
.Xr zlib 3
.It gzip
.Op compress
compress the input buffer using
.Xr zlib 3
.It hex2bin
.Op charset
decodes the input string from 4-character C-string style
hexadecimal constants to binary output.
.It hexdump
.Op format
converts the input text to an ASCII-clean hexadecimal dump format,
including a printable representation of the input text.
.It md5
.Op digest
calculate the MD5 digest using
.Xr MD5Data 3
.It metaphone
.Op charset
calculate the metaphone phonetic value for the
input.
.It rad50decode
.Op charset
converts the input text from DEC RADIX-50 format to
the original text. Due to the limited range of the
RADIX-50 
character set, some of the original text may have been lost.
.It rad50encode
.Op charset
converts the input text to DEC RADIX-50 format from
the original text.
Due to the limited range of the RADIX-50
character set, some of the original text may be lost.
.It randomise
.Op fill
fill the output with random values.
.It rmd160
.Op digest
calculate the RMD160 digest using
.Xr RMD160Data 3
.It rot
.Op format
transform the input text with a circular rotation.
The most famous of these is the Caesar
.Xr rot13 6
transformation, but this transformation allows any length
of rotation to be used.
.It secs2str
.Op format
transforms the input value (as the ASCII-encoded decimal
value of seconds since the start of the epoch) to a colon-separated
value representing the date.
.It sed
.Op edit
performs a
.Xr sed 1
transformation on the input text using a regular expression.
Please note that full,
extended regular expressions, as defined in
.Xr re_format 7
are used to match.
.It size
.Op digest
returns the size of the input as a decimal string
.It sha1
.Op digest
calculate the SHA1 digest using
.Xr SHA1Data 3
.It sha256
.Op digest
calculate the SHA256 digest using
.Xr SHA256_Data 3
.It sha512
.Op digest
calculate the SHA512 digest using
.Xr SHA512_Data 3
.It soundex
.Op charset
calculate the soundex phonetic value for the
input.
.It str2secs
.Op format
transforms the input value (as the
colon-separated
value representing the date) to an
ASCII-encoded decimal
value representing seconds since the start of the epoch.
.It strunvis
.Op charset
uses the
.Xr strunvis 3
transformation on the input data.
.It strvis
.Op charset
uses the
.Xr strvis 3
transformation on the input data.
.It strvisc
.Op charset
uses the
.Xr strvisc 3
transformation on the input data.
.It substring
.Op edit
extract a substring of the input string, and place it in the output string.
.It to-uri
.Op charset
convert from ASCII text to a percent-encoded URI.
.It to-lower
.Op charset
change any uppercase letters in the input string to lowercase.
.It to-unicode
.Op charset
translate to unicode-16 from UTF-8
.It to-upper
.Op charset
change any lowercase letters in the input string to uppercase.
.It to-utf8
.Op charset
translate from unicode-16 to UTF-8
.It unix2dos
.Op format
the Unix-style line-endings are converted to DOS style line-endings.
.It uudecode
.Op charset
transform the input text from
.Xr uudecode 1
text to the original text.
.It uuencode
.Op charset
encode the input text as
.Xr uuencode 1
text.
.It zero
.Op fill
produce an area containing NUL bytes in the output.
.El
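.Pp
As an example of how small some of these transformations are,
the circular rotation described for rot can be sketched in C as follows.
This is an illustration only, not the library code:
rot_sketch is a hypothetical name, the sketch assumes ASCII,
and it rotates letters only, leaving every other byte unchanged.
.Bd -literal
```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: rotate each ASCII letter in `in' by `n'
 * places within its own case, writing `insize' bytes to `out'.
 * Negative and large values of `n' are reduced modulo 26.
 */
void
rot_sketch(const char *in, size_t insize, char *out, int n)
{
	size_t i;
	int shift;
	char c;

	assert(in != NULL && out != NULL);
	shift = ((n % 26) + 26) % 26;
	for (i = 0; i < insize; i++) {
		c = in[i];
		if (c >= 'a' && c <= 'z')
			out[i] = (char)('a' + (c - 'a' + shift) % 26);
		else if (c >= 'A' && c <= 'Z')
			out[i] = (char)('A' + (c - 'A' + shift) % 26);
		else
			out[i] = c;
	}
}
```
.Ed
.Pp
With a rotation of 13, applying the sketch twice returns the original text.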
.Pp
A number of hash functions have also been implemented, namely:
.Bl -tag -width bernsteinhash
.It dumbhash
.Op hash
implements a simple hashing scheme based on the addition of the value
of each character in the string.
.It dumbmulhash
.Op hash
implements a simple hashing scheme based on the addition of the value
of each character in the string multiplied by its position in the string.
.It lennart
.Op hash
implements a simple and fast generic string hasher based on Peter K. Pearson's
article in CACM 33-6, p. 677.
.It crchash
.Op hash
implements a hash used in CRC calculations
.It perlhash
.Op hash
implements the addition-based hash algorithm used internally in the perl
interpreter.
.It perlxorhash
.Op hash
implements the XOR-based hash algorithm used internally in the perl
interpreter.
.It pythonhash
.Op hash
implements the hash algorithm used internally in the python
interpreter.
.It mousehash
.Op hash
implements an XOR-based hash algorithm from der Mouse.
.It bernstein
.Op hash
implements a multiplicative-based hash algorithm from Daniel Bernstein.
.It honeyman
.Op hash
implements an XOR-based hash algorithm from Peter Honeyman.
.It pjwhash
.Op hash
implements the so called `hashpjw' function by P.J. Weinberger
from Aho/Sethi/Ullman, COMPILERS: Principles, Techniques and Tools,
1986, 1987 Bell Telephone Laboratories, Inc.
.It bobhash
.Op hash
implements another, more complex hash algorithm.
.It torekhash
.Op hash
implements a hash algorithm due to Chris Torek, and using Duff's device.
.It byacchash
.Op hash
implements the hash function found in the Berkeley
.Xr byacc 1
program.
.It tclhash
.Op hash
implements the hash algorithm used internally in the tcl interpreter.
.It gawkhash
.Op hash
implements the hash algorithm used internally in the gawk interpreter,
also using Duff's device.
.It gcc3_hash
.Op hash
implements one of the hash algorithms found in gcc3
.It gcc3_hash2
.Op hash
implements another of the hash algorithms found in gcc3
.It nemhash
.Op hash
implements another hash function
.El
.Sh SEE ALSO
.Xr asa 1 ,
.Xr sed 1 ,
.Xr uudecode 1 ,
.Xr uuencode 1 ,
.Xr calloc 3 ,
.Xr MD5Data 3 ,
.Xr RMD160Data 3 ,
.Xr SHA1Data 3 ,
.Xr SHA256_Data 3 ,
.Xr SHA512_Data 3 ,
.Xr strunvis 3 ,
.Xr strvis 3 ,
.Xr strvisc 3 ,
.Xr zlib 3 ,
.Xr rot13 6 ,
.Xr re_format 7
.Sh HISTORY
The
.Nm
library first appeared in
.Nx 6.0 .
.Sh AUTHORS
.An Alistair Crooks Aq agc%NetBSD.org@localhost
.\" $NetBSD: libnetpgp.3,v 1.14 2010/06/18 00:20:28 agc Exp $
.\"
.\" Copyright (c) 2010 Alistair Crooks <agc%NetBSD.org@localhost>
.\" All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
.\" IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
.\" OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
.\" IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
.\" INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
.\" NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
.\" DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
.\" THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
.\" (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
.\" THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
.\"
.Dd September 11, 2010
.Dt STRCODECS 1
.Os
.Sh NAME
.Nm strcodecs
.Nd coding and decoding utility to transform data strings
.Sh SYNOPSIS
.Nm
.Op Fl o output-file
.Ar transformation
.Op Ar file ...
.Sh DESCRIPTION
The
.Nm
command is used to transform input text to output text,
using the
.Xr strcodecs 3
library of transformations.
A wide range of transformations and encodings is available.
The
.Nm
name comes from the encoding and decoding of text to
provide the transformation of input to output.
If
no file is given for input, the input will be taken
from standard input.
If no
.Fl o
option is used to specify an output file,
the transformed text will be printed on the standard output.
.Sh SEE ALSO
.Xr strcodecs 3
.Sh HISTORY
The
.Nm
command first appeared in
.Nx 6.0 .
.Sh AUTHORS
.An Alistair Crooks Aq agc%NetBSD.org@localhost .

