tech-userlevel archive
Introducing ARFE
A few years ago, while I was debugging and tuning up NetBSD networking
code, I was interested in the rate of change of many statistics (netstat
-s, ifconfig -va). I was producing lots of "before and after" stats
	ifconfig -va > before ; sleep 10 ; ifconfig -va > after
and comparing or subtracting them in my head. A strong urge to
automate the subtraction without writing a one-off script (or scripts)
led me to write a universal statistics subtractor. In this way,
DT---(d)ifferentiate (t)ext---was born.
DT reads two inputs and finds a longest common subsequence (LCS) of the
inputs where numbers are "wild": one number, consisting of an optional
sign followed by one or more decimal digits, can match any other. Then
DT emits the LCS, printing the difference of all of the numbers in the
common sequence. I will give an example. DT input 1:
wm0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
capabilities=2bf80<TSO4,IP4CSUM_Rx,IP4CSUM_Tx,TCP4CSUM_Rx,TCP4CSUM_Tx,UDP4CSUM_Rx,UDP4CSUM_Tx,TCP6CSUM_Tx,UDP6CSUM_Tx>
enabled=2bf80<TSO4,IP4CSUM_Rx,IP4CSUM_Tx,TCP4CSUM_Rx,TCP4CSUM_Tx,UDP4CSUM_Rx,UDP4CSUM_Tx,TCP6CSUM_Tx,UDP6CSUM_Tx>
address: 00:0a:0b:cd:01:ef
media: Ethernet autoselect (1000baseT full-duplex)
status: active
input: 9348780 packets, 2659054914 bytes, 2853146 multicasts
output: 5844547 packets, 1166873148 bytes, 2667 multicasts
inet 10.0.1.17 netmask 0xffffff00 broadcast 10.0.1.255
inet6 fe80::20a:bff:fecd:1ef%wm0 prefixlen 64 scopeid 0x1
Input 2:
wm0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
capabilities=2bf80<TSO4,IP4CSUM_Rx,IP4CSUM_Tx,TCP4CSUM_Rx,TCP4CSUM_Tx,UDP4CSUM_Rx,UDP4CSUM_Tx,TCP6CSUM_Tx,UDP6CSUM_Tx>
enabled=2bf80<TSO4,IP4CSUM_Rx,IP4CSUM_Tx,TCP4CSUM_Rx,TCP4CSUM_Tx,UDP4CSUM_Rx,UDP4CSUM_Tx,TCP6CSUM_Tx,UDP6CSUM_Tx>
address: 00:0a:0b:cd:01:ef
media: Ethernet autoselect (1000baseT full-duplex)
status: active
input: 9348892 packets, 2659066294 bytes, 2853157 multicasts
output: 5844607 packets, 1166884588 bytes, 2667 multicasts
inet 10.0.1.17 netmask 0xffffff00 broadcast 10.0.1.255
inet6 fe80::20a:bff:fecd:1ef%wm0 prefixlen 64 scopeid 0x1
The DT output reveals the change in input/output statistics between
inputs 1 & 2:
wm0: flags= 0<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 0
capabilities=0bf 0<TSO0,IP0CSUM_Rx,IP0CSUM_Tx,TCP0CSUM_Rx,TCP0CSUM_Tx,UDP0CSUM_Rx,UDP0CSUM_Tx,TCP0CSUM_Tx,UDP0CSUM_Tx>
enabled=0bf 0<TSO0,IP0CSUM_Rx,IP0CSUM_Tx,TCP0CSUM_Rx,TCP0CSUM_Tx,UDP0CSUM_Rx,UDP0CSUM_Tx,TCP0CSUM_Tx,UDP0CSUM_Tx>
address: 0:0a:0b:cd: 0:ef
media: Ethernet autoselect ( 0baseT full-duplex)
status: active
input: 112 packets, 11380 bytes, 11 multicasts
output: 60 packets, 11440 bytes, 0 multicasts
inet 0.0.0. 0 netmask 0xffffff 0 broadcast 0.0.0. 0
inet0 fe 0:: 0a:bff:fecd:0ef%wm0 prefixlen 0 scopeid 0x0
In the output, there are distracting artifacts, such as the mtu of 0.
I can clean that up with a second application of DT. Using the output
above as input 2, I use this as input 1:
input: 0 packets, 0 bytes, 0 multicasts
output: 0 packets, 0 bytes, 0 multicasts
DT emits:
input: 112 packets, 11380 bytes, 11 multicasts
output: 60 packets, 11440 bytes, 0 multicasts
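The first DT pass above (an LCS in which numbers are wild, followed by subtraction) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not DT's actual implementation: the names `tokens`, `matches`, and `dt` are hypothetical, every non-number character is treated as its own token, and the result is not padded to the original field widths the way the DT output above is. It also makes no attempt to reproduce the second, exemplar-driven pass, where the choice among several equally long common subsequences matters.

```python
import re

# A number is an optional sign followed by decimal digits; anything else
# is a one-character token.  (Assumed tokenization, for illustration only.)
TOKEN = re.compile(r"[+-]?\d+|.", re.DOTALL)
NUMBER = re.compile(r"[+-]?\d+\Z")


def tokens(text):
    return TOKEN.findall(text)


def is_number(tok):
    return NUMBER.match(tok) is not None


def matches(a, b):
    # Numbers are "wild": any number matches any other number.
    return (is_number(a) and is_number(b)) or a == b


def dt(before, after, f=lambda x, y: y - x):
    a, b = tokens(before), tokens(after)
    n, m = len(a), len(b)
    # lcs[i][j] is the LCS length of a[i:] and b[j:].
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if matches(a[i], b[j]):
                lcs[i][j] = lcs[i + 1][j + 1] + 1
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])
    # Walk the table, emitting the common sequence and applying f
    # to each pair of matched numbers.
    out = []
    i = j = 0
    while i < n and j < m:
        if matches(a[i], b[j]):
            if is_number(a[i]):
                out.append(str(f(int(a[i]), int(b[j]))))
            else:
                out.append(a[i])
            i += 1
            j += 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            i += 1
        else:
            j += 1
    return "".join(out)


before = ("input: 9348780 packets, 2659054914 bytes, 2853146 multicasts\n"
          "output: 5844547 packets, 1166873148 bytes, 2667 multicasts\n")
after = ("input: 9348892 packets, 2659066294 bytes, 2853157 multicasts\n"
         "output: 5844607 packets, 1166884588 bytes, 2667 multicasts\n")
print(dt(before, after), end="")
```

The function applied to matched numbers is a parameter (`f`), so the same skeleton works with something other than subtraction. The table is quadratic in the number of tokens, which is fine for inputs the size of an ifconfig listing.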
Note that input 1 does not describe the expected input or the shape of
the output. Rather, it exemplifies both. Quick, concrete expressions
of text transformations using exemplars instead of match expressions
seem to have a lot of promise. It appears that they can save a
programmer the time they would spend writing regexes and other code,
and make automation more accessible to a lay person. So I have started
exploring ways that an entire suite of text-processing tools can derive
from DT. Here are some modifications and extensions to DT that I am
already contemplating:
1) Make it detect more data types than decimal numbers: hexadecimal and
floating-point numbers, IPv4 and IPv6 addresses, opaque strings.
2) Provide more functions than subtraction, f(x, y) = y - x.
IT---(i)ntegrate (t)ext---is just DT with a change of function,
f(x, y) = x + y. Bitwise-AND and bitwise-OR are sometimes suitable
functions, especially for hexadecimal numbers. For IP addresses,
generalization may be most appropriate: f(x, y) should take x and y
to the subnet (prefix & mask length) that contains both. Sometimes a
"copy" function is best, f(x, y) = x or f(x, y) = y.
3) Treat the first input as an exemplar for a solitary record; treat the
second input as if it consists of one or more records. Using the
exemplar, find record boundaries. Process each individual record, or
perform a function like sort or join on the record set.
4) Use separate input and output exemplars. When an example datum
in the input exemplar also appears in the output exemplar, let
that bind the input to an output position. In this way, this
input/output-exemplar pair expresses 3x3 matrix transpose:
     input          output
     11 12 13       11 21 31
     21 22 23       12 22 32
     31 32 33       13 23 33
5) Respect parentheses, braces, and other marks that open, close, and
nest---for example, XML tags.
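Two of the ideas in the list above are small enough to sketch in Python. Both are hedged illustrations, not part of DT or ARFE: the function names `common_subnet`, `bind`, and `rearrange` are hypothetical, the subnet function assumes both addresses are of the same IP version, and the exemplar binding assumes whitespace-separated fields and exemplar data that are all distinct.

```python
import ipaddress


def common_subnet(x, y):
    """Item 2: generalize two addresses to the smallest subnet holding both."""
    a = ipaddress.ip_address(x)
    b = ipaddress.ip_address(y)
    # The bits where the addresses differ determine how many trailing
    # bits must be left out of the prefix.
    differing = int(a) ^ int(b)
    prefixlen = a.max_prefixlen - differing.bit_length()
    return ipaddress.ip_network(f"{a}/{prefixlen}", strict=False)


def bind(input_exemplar, output_exemplar):
    """Item 4: for each output position, find the input position
    that holds the same example datum."""
    position = {tok: k for k, tok in enumerate(input_exemplar.split())}
    return [[position[tok] for tok in line.split()]
            for line in output_exemplar.splitlines()]


def rearrange(layout, data):
    """Lay out the fields of real data according to the binding."""
    fields = data.split()
    return "\n".join(" ".join(fields[k] for k in row) for row in layout)


print(common_subnet("10.0.1.17", "10.0.1.255"))

layout = bind("11 12 13\n21 22 23\n31 32 33",
              "11 21 31\n12 22 32\n13 23 33")
print(rearrange(layout, "1 2 3\n4 5 6\n7 8 9"))
```

The exemplar pair never mentions rows, columns, or transposition. The binding falls out of where each example datum reappears, which is the point of item 4.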
ARFE (pronounced "arf!") is what I call the toolsuite that expands on
DT. ARFE stands for (A)d Hoc (R)ecord and (F)ield (E)xtraction.
If you would like to know more about ARFE, or to help out with the
research and programming, let me know.
Dave
--
David Young
dyoung%pobox.com@localhost Urbana, IL (217) 721-9981