Re: New class of receive error

To: Roy Marples <roy%marples.name@localhost>
Subject: Re: New class of receive error
From: Robert Elz <kre%munnari.OZ.AU@localhost>
Date: Mon, 14 May 2018 04:51:07 +0700
    Date:        Sun, 13 May 2018 20:01:02 +0100
    From:        Roy Marples <roy%marples.name@localhost>
    Message-ID:  <7838c482-dba1-24e6-e310-7423ace0dbef%marples.name@localhost>

  | Sorry, the exact text is this:
  | /* should notify about lost packet */
  |
  | It appeared in quite a few places.

I see 4 of them.

None of them gives any indication of who the author of that comment
thought should be notified, or how - it might have meant a
printf() (or similar, perhaps to the log) or just about anything.

  | And how is this a bad thing?

I didn't say it was a bad thing - just that it ought to be optional
(which everyone seems to be telling you) so that applications that
don't care aren't bothered by errors that they cannot do anything
about, nor even really understand.

  | The outcome of this is that we now know syslogd receive buffers can 
  | overflow. This isn't noted anywhere in the man page and neither is there 
  | a configuration option to increase it anywhere I can see.

I think that was altered recently; there are now sysctls (if I read
the src-updates mail correctly).

  | But most worringly for me, is that we as a group don't care about this 
  | and are in favour of just concealing the error from the user.

It isn't concealing the error, it is that errors like this are inevitable,
any user writing any application like this should know that datagrams
are a lossy form of communication, and data can be lost, with no
notification.   The application needs to be designed to cope with that.

Occasionally telling the application that a packet was lost, and not
telling them other times, is of no real help to anyone.

  | Again, why is syslog special here? It's no more special than route(4). 

It isn't, it was just an example, as it was one that had been mentioned,
and it is less special than route(4) as the routing socket is quite special.

  | Lossing stuff is bad, concealing this fact is even worse.

I agree, but losing stuff is inevitable.  That's the design (it is one of the
reasons I said that I am not sure that designing the "routing socket" as
a socket interface was really a good idea ... but it is too late to change
that now).

I don't want to conceal the error, I just don't want to notify the wrong
party - if the buffers aren't big enough for the system, it is the sender
(if possible - they might be able to resend) and the system
manager who need to be told, not syslogd or rtadvd, or ...
Then more space can be allocated, so the problem is less likely, as
a first step.   And if this is really causing problems, whichever application
is involved can be fixed to make sure it can recover.

  | While I agree solving at the sender would be ideal, that can't be done 
  | without adding non standard interfaces as all the returns are void.

The return for send()?   Or are you talking about the internal kernel
sends to the routing socket in particular?   For the latter, I agree, and
always have, that getting that info into dhcpcd (and making the same
info available to other routing socket readers that can be updated to
deal with it) is not a bad idea - all that's needed is for them to do
(something like)

	int on = 1;

	setsockopt(rtskt, 0 /*or whatever*/, SO_NOTIFYLOSS, &on, sizeof on);

and for the "errors" to be treated the old way unless that is done.

Is that really so bad?

  | > Do what?
  |
  | Empower the receiver into resolving it.

How is it supposed to do that?   It has no idea what, or how much,
was lost.   Nor in general from where it came.   What is it really
supposed to do?   What can it possibly do, except complain?

What dhcpcd needs is special, as it knows that it can simply go fetch
all the info from the sources again (as expensive as that might be) and
so whatever was lost becomes irrelevant.
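
As a rough sketch of that recovery strategy (not dhcpcd's actual code): the
receiver applies incremental updates as they arrive, and when it is told
something was lost it throws its view away and reloads it in full.   The names
route_cache, cache_on_receive and fetch_all_sample are made up for
illustration, and routes are reduced to plain ints.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_ROUTES 8

/* Hypothetical local view of the routing state. */
struct route_cache {
	int routes[MAX_ROUTES];
	size_t count;
	unsigned resyncs;	/* how many full reloads were needed */
};

/* Stand-in for "go fetch all the info from the sources again". */
typedef size_t (*fetch_all_fn)(int *out, size_t max);

/* Hypothetical source of truth: pretends there are currently 3 routes. */
static size_t
fetch_all_sample(int *out, size_t max)
{
	size_t n = max < 3 ? max : 3;

	for (size_t i = 0; i < n; i++)
		out[i] = (int)i + 1;
	return n;
}

/*
 * Handle one receive outcome.  If "lost" is set, the incremental state
 * cannot be trusted, so discard it and reload everything.  Otherwise
 * append the update (silently ignored here if the cache is full).
 */
static void
cache_on_receive(struct route_cache *c, bool lost, int update,
    fetch_all_fn fetch)
{
	assert(c != NULL && fetch != NULL);
	if (lost) {
		c->count = fetch(c->routes, MAX_ROUTES);
		c->resyncs++;
		return;
	}
	if (c->count < MAX_ROUTES)
		c->routes[c->count++] = update;
}
```

The point of the sketch is that this only works because a complete copy of
the state can be fetched again on demand.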

But that's a very special case, other sockets (including unix domain
datagram sockets) cannot usually do that - how is syslogd supposed
to deal with a lost message other than by logging it had a lost message?
And that won't necessarily be in the correct log file - as it doesn't know
where the lost message was to be put.

That is, it was most likely a debug message from a mailer (because it was
lost we have no idea what severity or facility it was) that no-one cares
about really anyway, but which flow through at high rates on busy mailers.
So, best just to ignore it.

Unless it happened to be a rare important message about something
critical failing, that we really need to see.   Very rare.  Very unlikely
to be lost, so just forget about that...

There's nothing syslogd can do.

I have no idea what

  | Take this error from my NetBSD powered router:
  | cnmac2: reception error, packet dropped (error code = 13)

that one refers to, but what do you expect cnmac2 (whatever that is)
to do in that case?   What message was lost?   Where did it originate?
Was it important or just noise?

  | What am I expected to do about that?

Nothing.   That's the point.   If the application cannot recover, it is badly
designed.

  | So are suggesting we remove that error as well?

if there is nothing we can do about it, then yes, sure.  If the info
is useless, then, it is useless...

Of course, the kernel can count how many times this happens, and make that
info available along with all the other stats it maintains, like packets 
received with checksum errors (would you suggest we should notify the
receiver about those as well???)
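
To illustrate the counting approach, here is a minimal sketch of a bounded
datagram queue that bumps a statistics counter on overflow rather than
reporting anything to the reader.   It is not the kernel's implementation;
dgram_queue, dq_enqueue and QCAP are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define QCAP 4	/* deliberately tiny so overflow is easy to see */

struct dgram_queue {
	int slots[QCAP];
	size_t len;
	unsigned long dropped;	/* exposed as a statistic, not as an error */
};

/*
 * Queue one message.  When the queue is full the message is discarded
 * and only the drop counter changes; the reader never sees an error.
 */
static bool
dq_enqueue(struct dgram_queue *q, int msg)
{
	assert(q != NULL);
	if (q->len == QCAP) {
		q->dropped++;
		return false;
	}
	q->slots[q->len++] = msg;
	return true;
}
```

The boolean return is what a sender could use to decide whether to retry;
the counter is what the system manager would look at.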

  | Where do we draw the line here, or is syslog somehow special?

no, syslog is not special, and the line is that applications that want
to know that a packet was lost should be able to ask to be told that,
and the vast majority for which that info is useless should be allowed
to have lost packets in the local kernel be ignored in the same way
that lost packets anywhere else are.   Again, lost packets are to be
expected - they are part of the network design (TCP actually forces
it - that's how it knows it is filling the pipe and sending fast enough.)

  |
  | No, hiding the issue would be back to the old behaviour.

The old behaviour worked (at least for non routing socket applications).
The change just makes noise to no useful effect, and could cause
applications that are not expecting non-fatal receive errors to simply
log an error and exit.   The only normal receive errors relate to bad
fds (socket closed, etc), bad buffers (not big enough, or bad addr)
and similar (unless it is a non-blocking socket and we get EAGAIN, or
we are processing signals and get EINTR).   For apps that are
doing neither of those, there were *no* non-fatal receive errors before.
Now there is one.   That is a huge change.
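
A small sketch of what that means for a receive loop.   Which errno carries
the loss notification is not stated in this message, so it is passed in as
a parameter (loss_errno, 0 meaning "this app never asked for it"); the enum
and function names are invented for the example.

```c
#include <assert.h>
#include <errno.h>

enum recv_action {
	RECV_RETRY,	/* harmless: just call recv() again */
	RECV_NOTE_LOSS,	/* something was dropped, the socket is still fine */
	RECV_FATAL	/* bad fd, bad buffer, ...: give up */
};

/*
 * Decide what to do after recv() returned -1 with errno == err.
 * An app that has not opted in (loss_errno == 0) keeps the old rule:
 * anything other than EINTR or EAGAIN is fatal.
 */
static enum recv_action
classify_recv_errno(int err, int loss_errno)
{
	assert(err != 0);
	if (err == EINTR)
		return RECV_RETRY;
	if (err == EAGAIN || err == EWOULDBLOCK)
		return RECV_RETRY;
	if (loss_errno != 0 && err == loss_errno)
		return RECV_NOTE_LOSS;
	return RECV_FATAL;
}
```

An application written to the old rules behaves like the loss_errno == 0
case, which is why an unexpected new error ends up on the fatal path.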

  | My attempt at resizing the default buffers just makes it less likely to 
  | happen.

On that we agree, and once again, I have no problem with the default
size increases - and note default - apps have always been able to
	setsockopt(s, x, SO_RCVBUF, &bufsize, sizeof bufsize)
if they need a large receive buffer (or only need a small one.)
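
Spelled out a little more, as a sketch: the "x" above is taken to be
SOL_SOCKET, the level at which SO_RCVBUF lives, and set_rcvbuf is just a
name chosen for the example.   The kernel may clamp or round the request,
so the helper reads the value back instead of assuming it was honoured.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Ask for a receive buffer of "want" bytes on socket s.
 * Returns the size the kernel reports afterwards, or -1 on error.
 */
static int
set_rcvbuf(int s, int want)
{
	int got = 0;
	socklen_t len = sizeof got;

	assert(want > 0);
	if (setsockopt(s, SOL_SOCKET, SO_RCVBUF, &want, sizeof want) == -1)
		return -1;
	if (getsockopt(s, SOL_SOCKET, SO_RCVBUF, &got, &len) == -1)
		return -1;
	return got;
}
```

The value read back can differ from the one requested, depending on the
system's limits.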

And for the system manager, a count of the times this happens - which I
think is already there (I think I said netstat -m last message, I meant -s
of course), where I already see ...

rip6:
        0 messages dropped due to full socket buffers

(and the same for udp, even ddp).   I am not sure where in there
routing socket issues are included - perhaps nowhere, in which case
fixing that would be a far more useful project than arguing about this.

kre