tech-userlevel archive

Introducing ARFE



A few years ago, while I was debugging and tuning up NetBSD networking
code, I was interested in the rate of change of many statistics (netstat
-s, ifconfig -va).  I was producing lots of "before and after" stats

	ifconfig -va > before ; sleep 10 ; ifconfig -va > after

and comparing or subtracting them in my head.  A strong urge to
automate the subtraction without writing a one-off script (or scripts)
led me to write a universal statistics subtractor.  In this way,
DT---(d)ifferentiate (t)ext---was born.

DT reads two inputs and finds a longest common subsequence (LCS) of the
inputs where numbers are "wild": one number, consisting of an optional
sign followed by one or more decimal digits, can match any other.  Then
DT emits the LCS, replacing each matched pair of numbers with their
difference (input 2 minus input 1).  Here is an example.  DT input 1:

wm0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
	capabilities=2bf80<TSO4,IP4CSUM_Rx,IP4CSUM_Tx,TCP4CSUM_Rx,TCP4CSUM_Tx,UDP4CSUM_Rx,UDP4CSUM_Tx,TCP6CSUM_Tx,UDP6CSUM_Tx>
	enabled=2bf80<TSO4,IP4CSUM_Rx,IP4CSUM_Tx,TCP4CSUM_Rx,TCP4CSUM_Tx,UDP4CSUM_Rx,UDP4CSUM_Tx,TCP6CSUM_Tx,UDP6CSUM_Tx>
	address: 00:0a:0b:cd:01:ef
	media: Ethernet autoselect (1000baseT full-duplex)
	status: active
	input: 9348780 packets, 2659054914 bytes, 2853146 multicasts
	output: 5844547 packets, 1166873148 bytes, 2667 multicasts
	inet 10.0.1.17 netmask 0xffffff00 broadcast 10.0.1.255
	inet6 fe80::20a:bff:fecd:1ef%wm0 prefixlen 64 scopeid 0x1

Input 2:

wm0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
	capabilities=2bf80<TSO4,IP4CSUM_Rx,IP4CSUM_Tx,TCP4CSUM_Rx,TCP4CSUM_Tx,UDP4CSUM_Rx,UDP4CSUM_Tx,TCP6CSUM_Tx,UDP6CSUM_Tx>
	enabled=2bf80<TSO4,IP4CSUM_Rx,IP4CSUM_Tx,TCP4CSUM_Rx,TCP4CSUM_Tx,UDP4CSUM_Rx,UDP4CSUM_Tx,TCP6CSUM_Tx,UDP6CSUM_Tx>
	address: 00:0a:0b:cd:01:ef
	media: Ethernet autoselect (1000baseT full-duplex)
	status: active
	input: 9348892 packets, 2659066294 bytes, 2853157 multicasts
	output: 5844607 packets, 1166884588 bytes, 2667 multicasts
	inet 10.0.1.17 netmask 0xffffff00 broadcast 10.0.1.255
	inet6 fe80::20a:bff:fecd:1ef%wm0 prefixlen 64 scopeid 0x1

The DT output reveals the change in input/output statistics between
inputs 1 & 2:

wm0: flags=   0<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu    0
	capabilities=0bf 0<TSO0,IP0CSUM_Rx,IP0CSUM_Tx,TCP0CSUM_Rx,TCP0CSUM_Tx,UDP0CSUM_Rx,UDP0CSUM_Tx,TCP0CSUM_Tx,UDP0CSUM_Tx>
	enabled=0bf 0<TSO0,IP0CSUM_Rx,IP0CSUM_Tx,TCP0CSUM_Rx,TCP0CSUM_Tx,UDP0CSUM_Rx,UDP0CSUM_Tx,TCP0CSUM_Tx,UDP0CSUM_Tx>
	address:  0:0a:0b:cd: 0:ef
	media: Ethernet autoselect (   0baseT full-duplex)
	status: active
	input:     112 packets,      11380 bytes,      11 multicasts
	output:      60 packets,      11440 bytes,    0 multicasts
	inet  0.0.0. 0 netmask 0xffffff 0 broadcast  0.0.0.  0
	inet0 fe 0:: 0a:bff:fecd:0ef%wm0 prefixlen  0 scopeid 0x0

In the output, there are distracting artifacts, such as the mtu of 0.
I can clean that up with a second application of DT.  Using the output
above as input 2, I use this as input 1:

	input:     0 packets,      0 bytes,      0 multicasts
	output:      0 packets,      0 bytes,    0 multicasts

DT emits:

	input:     112 packets,      11380 bytes,      11 multicasts
	output:      60 packets,      11440 bytes,    0 multicasts
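
The matching described above can be sketched in a few lines of Python.
This is only an illustration under stated assumptions, not the DT
source: the names dt, tokens, and matches are hypothetical, every
non-number character is treated as its own token, and the sketch does
not try to reproduce DT's column padding.

```python
import re

# Assumption: a token is either a number (optional sign, then one or more
# decimal digits) or any other single character, newlines included.
TOKEN = re.compile(r"[+-]?\d+|.", re.DOTALL)
NUMBER = re.compile(r"[+-]?\d+")


def tokens(text):
    """Split text into number tokens and single-character tokens."""
    return TOKEN.findall(text)


def is_number(tok):
    return NUMBER.fullmatch(tok) is not None


def matches(a, b):
    """Numbers are "wild": any number matches any other number."""
    return a == b or (is_number(a) and is_number(b))


def dt(text1, text2, f=lambda x, y: y - x):
    """Emit an LCS of the two texts, applying f to each matched number pair."""
    a, b = tokens(text1), tokens(text2)
    n, m = len(a), len(b)

    # lcs[i][j] is the LCS length of a[i:] and b[j:].
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if matches(a[i], b[j]):
                lcs[i][j] = lcs[i + 1][j + 1] + 1
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])

    # Walk the table from the start, emitting matched tokens only.
    out = []
    i = j = 0
    while i < n and j < m:
        if matches(a[i], b[j]):
            if is_number(a[i]):
                out.append(str(f(int(a[i]), int(b[j]))))
            else:
                out.append(b[j])
            i += 1
            j += 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            i += 1  # drop an unmatched token of input 1
        else:
            j += 1  # drop an unmatched token of input 2
    return "".join(out)


if __name__ == "__main__":
    before = "input: 9348780 packets, 2659054914 bytes, 2853146 multicasts\n"
    after = "input: 9348892 packets, 2659066294 bytes, 2853157 multicasts\n"
    print(dt(before, after), end="")
    # prints: input: 112 packets, 11380 bytes, 11 multicasts
```

With an all-zeros exemplar as input 1, the same function acts as a
filter: dt("input: 0 packets\n", "mtu 0\ninput: 112 packets\n") keeps
only the line that the exemplar resembles.  The table is quadratic in
the number of tokens, which is fine for a screenful of statistics but
not for large inputs.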

Note that input 1 does not describe the expected input or the shape of
the output.  Rather, it exemplifies both.  Quick, concrete expressions
of text transformations using exemplars instead of match expressions
seem to have a lot of promise.  It appears that they can save a
programmer the time they would spend writing regexes and other code,
and make automation more accessible to a lay person.  So I have started
exploring ways that an entire suite of text-processing tools can derive
from DT.  Here are some modifications and extensions to DT that I am
already contemplating:

1) Make it detect more data types than decimal numbers: hexadecimal and
   floating-point numbers, IPv4 and IPv6 addresses, opaque strings.

2) Provide more functions than subtraction, f(x, y) = y - x.
   IT---(i)ntegrate (t)ext---is just DT with a change of function,
   f(x, y) = x + y.  Bitwise-AND and bitwise-OR are sometimes suitable
   functions, especially for hexadecimal numbers.  For IP addresses,
   generalization may be most appropriate: f(x, y) should take x and y
   to the subnet (prefix & mask length) that contains both.  Sometimes a
   "copy" function is best, f(x, y) = x or f(x, y) = y.

3) Treat the first input as an exemplar for a solitary record; treat the
   second input as if it consists of one or more records.  Using the
   exemplar, find record boundaries.  Process each individual record, or
   perform a function like sort or join on the record set.

4) Use separate input and output exemplars.  When an example datum
   in the input exemplar also appears in the output exemplar, let
   that bind the input to an output position.  In this way, this
   input/output-exemplar pair expresses 3x3 matrix transpose:

	input    	output
                 
	11 12 13 	11 21 31
	21 22 23 	12 22 32
	31 32 33 	13 23 33

5) Respect parentheses, braces, and other marks that open, close, and
   nest---for example, XML tags.
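
The functions listed in item 2 are small enough to write down.  The
following Python sketch is one possible reading, not a settled design:
the table name FUNCTIONS, its keys, and the function generalize are
hypothetical, and "the subnet that contains both" is taken to mean the
longest common prefix of the two addresses.

```python
import ipaddress

# Candidate binary functions f(x, y) for matched integers, where x comes
# from input 1 and y from input 2.  The keys are illustrative names only.
FUNCTIONS = {
    "subtract": lambda x, y: y - x,   # DT
    "add":      lambda x, y: x + y,   # IT
    "and":      lambda x, y: x & y,
    "or":       lambda x, y: x | y,
    "first":    lambda x, y: x,       # "copy" input 1
    "second":   lambda x, y: y,       # "copy" input 2
}


def generalize(x, y):
    """Return the smallest network that contains both addresses x and y."""
    a = ipaddress.ip_address(x)
    b = ipaddress.ip_address(y)
    if a.version != b.version:
        raise ValueError("cannot generalize across IPv4 and IPv6")
    # Bits at and below the highest differing bit become host bits.
    host_bits = (int(a) ^ int(b)).bit_length()
    prefixlen = a.max_prefixlen - host_bits
    # strict=False masks off the host bits of a for us.
    return ipaddress.ip_network(f"{a}/{prefixlen}", strict=False)


if __name__ == "__main__":
    print(FUNCTIONS["subtract"](9348780, 9348892))   # prints: 112
    print(generalize("10.0.1.17", "10.0.1.255"))     # prints: 10.0.1.0/24
```

Two identical addresses generalize to a host route (/32 or /128), so
applying generalize to unchanged fields leaves them recognizable.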

ARFE (pronounced "arf!") is what I call the tool suite that expands on
DT.  ARFE stands for (A)d Hoc (R)ecord and (F)ield (E)xtraction.

If you would like to know more about ARFE, or to help out with the
research and programming, let me know.

Dave

-- 
David Young
dyoung%pobox.com@localhost    Urbana, IL    (217) 721-9981

