Subject: Re: can anybody explain this?
To: None <current-users@NetBSD.ORG>
From: Mike Long <mike.long@analog.com>
List: current-users
Date: 02/15/1996 11:11:51
>Date: Wed, 14 Feb 1996 15:28:13 -0500
>From: Ken Hornstein <kenh@cmf.nrl.navy.mil>

[John Kohl wrote:]
>>My backup script does some rsh's/rdumps and some local dumps.  Once in
>>about every 5 nights,  I get a timeout on one of the rshs in the middle
>>of the script.  Which one fails seems to change each time.

>I have seen random rsh failures on other systems - maybe it's a general
>BSD bug w.r.t. OOB data handling?  (Does rsh use OOB data?  I know rlogin
>does).

That may or may not be the cause here, but there do seem to be problems
with OOB handling.  I found the message below on comp.bugs.4bsd; I'm
reposting it for the sake of those who don't read news.  I don't know
whether *BSD has the same problem with OOB messages that the author
experienced.

------- Start of forwarded message -------
From: jik@annex-1-slip-jik.cam.ov.com (Jonathan Kamens)
Newsgroups: comp.protocols.kerberos,comp.bugs.4bsd
Subject: krlogin/krlogind usage of OOB data is broken
Date: 10 Feb 1996 13:32:09 -0500
Organization: jik's Linux box
NNTP-Posting-Host: jik.datasrv.co.il

(Comp.bugs.4bsd folks: This posting relates to a problem in how the Kerberos
rlogin and rlogind programs use out-of-band data.  However, I believe that
the problem I'm describing here is shared by the stock BSD rlogin and rlogind
programs, which is why I'm cross-posting to comp.bugs.4bsd.)

I understand that Sam Hartman has done a considerable amount of work on rlogin
and rlogind, trying to get their handling of out-of-band data to work
properly.  I've done similar work independently of Sam, and I suspect that my
changes are quite different from the ones he's implemented; nevertheless, I've
come to the conclusion that the way the protocol uses OOB data is broken in at
least one way that simply cannot be fixed without changing the protocol.

The basic problem I'm encountering is this: What happens if krlogind sends an
OOB message to krlogin, and then it sends a *second* OOB message before
krlogin has processed the first one?  This *can* and *does* happen.  For
example, when I krlogin from my Linux box at home to an AIX box at work over a
SLIP link, the AIX box sends three different OOB messages as part of the
initial setup of the connection, and network congestion can easily
cause all of them to get to my Linux box in consecutive packets, too quickly
for it to deal with each of them before the next one arrives.

Unfortunately, the way OOB data is implemented in the Linux kernel (and I
believe in many other UNIX kernels as well) is that only one OOB message is
allowed at a time.  If a second message is received while the first one is
still pending, the first one becomes part of the normal data stream, and the
OOB mark is moved to the second one.  This does appear to be legal, according
to the BSD documentation about OOB data.  Consider what occurs if this happens
with krlogin/krlogind -- if krlogind sends multiple OOB messages
consecutively, then krlogin will process one of them, but the rest will simply
be part of the data stream, thus causing one or more garbage characters to
appear on the user's screen.  If the connection is being encrypted, the
results are much worse -- the OOB messages that enter the normal data stream
corrupt it, which usually causes krlogin to complain and close the connection.
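
For those who haven't looked at this code recently: a client typically
finds the urgent byte by polling SIOCATMARK and then reading one byte
with recv(..., MSG_OOB), roughly as in the sketch below.  This is only
an illustration, not code from krlogin; note how a second urgent byte
arriving early would silently demote the first one into the normal
stream before this routine ever runs.

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    /*
     * Advance to the socket's urgent mark, then read the single urgent
     * byte.  There is only one mark per socket: if a second urgent byte
     * arrives before we get here, the first byte has already fallen
     * into the normal data stream and will never be seen as OOB.
     */
    int
    read_oob_byte(int sock, unsigned char *cmd)
    {
        int atmark;
        char buf[256];

        for (;;) {
            if (ioctl(sock, SIOCATMARK, &atmark) < 0)
                return (-1);
            if (atmark)
                break;
            /*
             * In-band data ahead of the mark; a real client would
             * display or buffer this instead of discarding it.
             */
            if (read(sock, buf, sizeof(buf)) <= 0)
                return (-1);
        }
        return (recv(sock, cmd, 1, MSG_OOB) == 1 ? 0 : -1);
    }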

I came up with three hacks to reduce the likelihood of this problem, but
they're all real hacks, and even all of them together don't work 100% of the
time.  First of all, I modified the protocol() function in krlogind so that
any single run through the protocol() loop only causes a single OOB byte to be
sent, with all the commands that need to be sent OR'd together in it.  This
appears to be OK since (a) krlogin treats the OOB byte as a mask, and checks
it to see which bits are set, and (b) the various commands sent as OOB bytes
are bit-wise exclusive of each other.  I confess that of the three hacks I came
up with, this is the one I'm least sure about, so if anyone can confirm or
deny that this is a reasonable thing to do, I'd love to hear it.  For the
Linux -> AIX case I mentioned above, this reduces the number of OOB bytes sent
by the AIX box from three to two.
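
In code, the first hack amounts to something like this (a sketch; the
CMD_* names and values are stand-ins I made up, not the real command
bits).  The caller just ORs bits into `pending' over one pass through
the loop and calls send_control() once at the end:

    #include <sys/types.h>
    #include <sys/socket.h>

    #define CMD_WINDOW  0x80    /* hypothetical: request window size */
    #define CMD_RAW     0x10    /* hypothetical: switch to raw mode */
    #define CMD_FLUSH   0x02    /* hypothetical: flush pending output */

    /*
     * Send every pending command bit in a single urgent byte.  Since
     * the client treats the byte as a bit mask and the command bits
     * don't overlap, OR'ing them together loses no information.
     */
    int
    send_control(int sock, unsigned char pending)
    {
        if (pending == 0)
            return (0);
        return (send(sock, &pending, 1, MSG_OOB) == 1 ? 0 : -1);
    }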

Second, I modified krlogind so that it never sends two OOB messages less than
five seconds apart.  In *most* cases, this gives the client time to process
the first OOB message before the second one is sent.  But of course, it
introduces delays when initiating some connections.
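
Sketched out (again with made-up names), the pacing looks like:

    #include <time.h>
    #include <unistd.h>

    #define OOB_MIN_INTERVAL 5      /* seconds between urgent sends */

    static time_t last_oob;

    /*
     * Never let two urgent sends happen less than OOB_MIN_INTERVAL
     * seconds apart; sleep off the remainder if necessary.
     */
    void
    send_control_paced(int sock, unsigned char pending)
    {
        time_t now = time((time_t *)0);

        if (now - last_oob < OOB_MIN_INTERVAL)
            sleep((unsigned)(OOB_MIN_INTERVAL - (now - last_oob)));
        (void)send_control(sock, pending);  /* see the sketch above */
        last_oob = time((time_t *)0);
    }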

Sometimes network congestion or whatever makes the five-second pause by
krlogind meaningless, and besides, sometimes krlogin will have to talk to a
krlogind which hasn't been modified in this way.  So I put a third hack
in des_read() in krlogin.  When des_read() reads the length of the next
encrypted data block off the net, and that length is absurd, it checks to see
if the first byte of the length contains a valid OOB message.  If it does, it
processes it as an OOB message, shifts the three remaining bytes of the length
up one, and then reads a new byte to replace the one that was treated as OOB. 
In the case of the Linux -> AIX connection I mentioned above, it ends up doing
this twice, since the Linux box gets three OOB messages in quick succession
and only ends up dealing with one of them as OOB data.  I figured that this
doesn't really pose a threat to the encrypted data stream, since if there's
really a problem with it the problem will turn up later anyway.
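
Here's the shape of that third hack; MAX_BLOCK, looks_like_oob_cmd(),
and process_oob_cmd() are hypothetical stand-ins for the corresponding
logic in krlogin, not its actual names:

    #include <unistd.h>

    #define MAX_BLOCK 1024      /* hypothetical sanity bound on a block */

    extern int  looks_like_oob_cmd(unsigned char);  /* hypothetical */
    extern void process_oob_cmd(unsigned char);     /* hypothetical */

    /*
     * Read the 4-byte big-endian length of the next encrypted block.
     * If the length is absurd and its first byte looks like a stray
     * control mask, process that byte as an OOB command, shift the
     * remaining three bytes up, and read one replacement byte.
     */
    long
    read_block_length(int sock)
    {
        unsigned char len[4];
        unsigned long n;

        if (read(sock, len, sizeof(len)) != sizeof(len))
            return (-1);
        for (;;) {
            n = ((unsigned long)len[0] << 24) |
                ((unsigned long)len[1] << 16) |
                ((unsigned long)len[2] << 8) | len[3];
            if (n <= MAX_BLOCK)
                return ((long)n);       /* plausible length */
            if (!looks_like_oob_cmd(len[0]))
                return (-1);            /* genuinely corrupt stream */
            process_oob_cmd(len[0]);
            len[0] = len[1]; len[1] = len[2]; len[2] = len[3];
            if (read(sock, &len[3], 1) != 1)
                return (-1);
        }
    }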

However, that third hack in krlogin will only work when an encrypted session
is being used.  Non-encrypted sessions will still end up with some OOB
messages not getting processed and ending up as garbage in the data stream. 
Furthermore, even with these hacks, I've still seen instances where des_read()
gets unexpected values when it tries to read the length off the net, or where
the encrypted data is not available for some reason when it tries to read it.

As far as I can tell, the only way to make this work reliably is to require
hand-shaking -- when krlogind sends OOB data to krlogin, krlogin needs to send
OOB data back to krlogind to tell it when it has processed the data, and
krlogind needs to wait for that ACK before sending any more OOB data.  This
is, I believe, how telnet/telnetd handle their OOB data.
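
The server side of such a handshake might look like this; CMD_ACK and
the queueing are my own invention, not anything the current protocol
supports:

    #include <sys/types.h>
    #include <sys/socket.h>

    #define CMD_ACK 0x01    /* hypothetical ACK bit, not in the protocol */

    static unsigned char queued_cmds;   /* bits waiting to be sent */
    static int awaiting_ack;            /* nonzero after an urgent send */

    /*
     * Send at most one unacknowledged urgent byte at a time; queue
     * any further command bits until the client's ACK arrives.
     */
    void
    queue_control(int sock, unsigned char cmd)
    {
        queued_cmds |= cmd;
        if (!awaiting_ack && queued_cmds) {
            (void)send(sock, &queued_cmds, 1, MSG_OOB);
            queued_cmds = 0;
            awaiting_ack = 1;
        }
    }

    /*
     * Called when an urgent byte arrives from the client.
     */
    void
    handle_client_oob(int sock, unsigned char cmd)
    {
        if (cmd & CMD_ACK) {
            awaiting_ack = 0;
            queue_control(sock, 0);     /* flush anything queued meanwhile */
        }
    }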

Unfortunately, this would require changing the krlogin/krlogind protocol (and
I realize that "protocol" is a strong word) in a way that would make the new
krlogin incompatible with the old krlogind and vice versa.  The closest thing
that I can come up with to modifying the protocol in a backward-compatible way
is to have krlogind set a bit in the first OOB byte it sends, to tell krlogin,
"I know how to deal with OOB ACK messages, so you should ACK every OOB message
you receive."  Unfortunately, I can't figure out a protocol-compatible way for
krlogin to tell krlogind that it knows how to deal with this bit, so after
sending this bit to krlogin, krlogind has no way of knowing whether it should
wait for the ACK from krlogin.

I would appreciate any input that people might have into this problem.  Am I
right that there's a problem?  Has it always been there?  Is there any way to
solve it, short of either (a) modifying the protocol in a way that isn't
backward-compatible, or (b) ditching krlogin/krlogind altogether and using
ktelnet/ktelnetd instead (yes, I'd love to do that, but first of all, some of
our customers demand krlogin/krlogind, and second, I've heard rumors that the
security negotiation in ktelnet/ktelnetd is vulnerable).

Thanks.
------- End of forwarded message -------
-- 
Mike Long <mike.long@analog.com>           http://www.shore.net/~mikel
VLSI Design Engineer         finger mikel@shore.net for PGP public key
Analog Devices, CPD Division          CCBF225E7D3F7ECB2C8F7ABB15D9BE7B
Norwood, MA 02062 USA       (eq (opinion 'ADI) (opinion 'mike)) -> nil