Subject: Re: curernt kernel and scsi troubles...
To: Riccardo Mottola <rollei@tiscalinet.it>
From: Tim Kelly <hockey@dialectronics.com>
List: port-macppc
Date: 03/24/2005 17:23:33
At 10:54 PM +0100 3/24/05, Riccardo Mottola wrote:
>has there been any potential improvment in X and X exit to console/ in
>mc0 interrupt stuff?

As far as the mc0 interrupt stuff goes, I've got interrupts on transmission
down to only when Tx errors occur. That reduces the number of interrupts on
a 1.5M file from about 1000 to 0 if there are no errors (each ethernet frame
being 1514 bytes, give or take). Receiving still interrupts upon the end of
an ethernet frame, as before.
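
As a rough check on that figure, the frame count is just the file size
divided by the frame size, rounded up. A minimal sketch, assuming "1.5M"
means 1,500,000 bytes and that every frame carries a full 1514 bytes (the
function name is made up for illustration, and header overhead is ignored,
which is part of the "give or take"):

```c
#include <stdint.h>

/*
 * Hypothetical helper: number of frames needed to move `bytes` bytes
 * when each frame carries at most `frame_size` bytes (rounded up).
 */
static uint32_t
frames_needed(uint32_t bytes, uint32_t frame_size)
{
	return (bytes + frame_size - 1) / frame_size;
}
```

With one Tx interrupt per frame, frames_needed(1500000, 1514) is roughly
the number of interrupts that go away when no errors occur.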

The main problem remaining is that I can slam so much into the chip that Rx
requests get ignored and curio ends up shutting down the Tx channel. It has
taken me a long time to figure out how to overcome this, but last week I
did finally put the pieces together. In the dbdma channel bits, there are 8
status bits that the dbdma implementation currently ignores completely. It
turns out that these status bits are wired to the POLL register on the mace
chip, and a 0x80 in the status means the ethernet frame sent was valid, and
a 0x20 means that there is a receive requested by mace. When I see this I
need to halt Tx. If the receive request is ignored for two frames, the chip
issues a babble error. The documentation on babble indicates that this
error occurs when more than 1518 bytes are sent without an end-of-frame,
but that's only partially correct. A looser interpretation of "babble"
is that the Tx channel is talking too much and needs to be quiet. mace is
half-duplex, by the way.
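
The two status values described above can be sketched as a pair of bit
tests. The bit values (0x80, 0x20) come from the text; the macro and
function names are invented for this example and are not taken from
if_mc.c, and the sketch assumes the 8 status bits sit in the low byte of
the channel status value:

```c
#include <stdint.h>

/* Bit values from the description above; names are hypothetical. */
#define MC_ST_TX_VALID	0x80	/* ethernet frame sent was valid */
#define MC_ST_RX_REQ	0x20	/* mace is requesting a receive */

/* Assumption: the 8 device status bits are the low byte of the status. */
static uint8_t
mc_status_bits(uint32_t chan_status)
{
	return (uint8_t)(chan_status & 0xff);
}

/* Nonzero if the last transmitted frame was reported valid. */
static int
mc_tx_frame_valid(uint32_t chan_status)
{
	return (mc_status_bits(chan_status) & MC_ST_TX_VALID) != 0;
}

/* Nonzero if a receive request is pending, i.e. Tx should be halted. */
static int
mc_tx_should_halt(uint32_t chan_status)
{
	return (mc_status_bits(chan_status) & MC_ST_RX_REQ) != 0;
}
```

A status of 0xa0, for instance, would mean both "last frame valid" and
"receive requested", so Tx should back off before the next frame.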

So I am in the process of rewriting the dbdma engine in if_mc.c for about
the (twenty-) eighth time. Last time I looked I had over 600 compiles on
this. This time I think I'll get it right, though. What I have to do is use
the dbdma commands to wait the Tx channel when it sees the 0x20 status.
That part is easy. The problem is that the dbdma documentation is not
exactly correct, and I'm not sure what the problem is. I have been unable
to get the channel to unwait in hardware. Supposedly curio will monitor the
condition and automatically resume when it is cleared, but I appear to be
missing something. It is possible I need to take additional steps in the Rx
interrupt handler to ensure the status bits are cleared.
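
The intended wait behaviour can be modelled in plain C. This is only a
software model of the idea (stall Tx while the 0x20 bit is set, resume once
it clears), not register-level code. The mask/value comparison is an
assumption about how a status-based wait condition is evaluated, and all
names here are hypothetical:

```c
#include <stdint.h>

#define RX_REQ	0x20	/* receive-request status bit, from the text above */

/*
 * Assumed condition semantics: true when the status bits selected by
 * `mask` are equal to the same bits of `value`.
 */
static int
status_cond(uint8_t status, uint8_t mask, uint8_t value)
{
	return (status & mask) == (value & mask);
}

/*
 * Model of a Tx channel that waits while RX_REQ is set. `samples` holds
 * one status reading per attempt; the return value is how many attempts
 * were allowed to proceed rather than stall.
 */
static unsigned
tx_progress(const uint8_t *samples, unsigned n)
{
	unsigned done = 0, i;

	for (i = 0; i < n; i++) {
		if (!status_cond(samples[i], RX_REQ, RX_REQ))
			done++;		/* condition clear: command runs */
		/* else: channel stays waiting on this attempt */
	}
	return done;
}
```

In this model the channel only resumes if something actually clears the
bit, which is the same question raised above about the Rx interrupt
handler.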

One example of the documentation not being quite correct: I've never been
able to overwrite a STOP with a NOP, or with anything other than an
OUT_LAST, and get the channel to work if the channel was active at the time
of the overwrite, even when I do eieio. If the channel was stopped at the
time, no problem. If it was active and running, it never happens. I suspect
that the curio chip looks ahead a bit.

The other aspect that I've been wrestling with is what to do with Tx
errors. I've talked with Bill Studenmund about this, and I will probably
resume that discussion soon. The recommended practice in case of an
ethernet frame error is to discard the frame and let the higher protocol
handle resending when it is requested. However, what I have found is that,
because of how many buffers I can line up (currently I play with 8, but
I've done 32), a Tx error may occur early while the ip stack keeps sending
packets, since there's room in the DMA to do so. The chip continues to send
frames, but the other side has requested a resend of the earlier mucked
frame. After a while the other side stops requesting the frame, the Tx side
stops receiving acks, and the whole stream breaks down.

The workaround I will probably explore is one I did earlier, which is
utilizing scatter-gather DBDMA to its fullest by simply pointing a later
DBDMA command to the frame that got mucked, as it is still in memory. There
will still be some out-of-order issues, but I've implemented an error
recovery branch that gets the frame buffer and then branches back into
order so that as far as I can tell no more than 2 frames will transmit
before the error is detected and corrected, and then the remaining frames
get sent in order.
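
The resulting wire order can be illustrated with a small model. This is a
sketch of the ordering only, not of the DBDMA commands themselves: it
assumes a single failed frame, that the error is noticed `lag` frames
later (2 at most, per the description above), and that the failed frame
is then resent before the rest continue. The function name and parameters
are hypothetical:

```c
#include <stddef.h>

/*
 * Fill `out` with the order in which frame indices 0..n-1 go on the
 * wire when frame `bad` fails and is resent `lag` frames later (or at
 * the end of the queue, if fewer than `lag` frames follow it).
 * `out` must have room for n + 1 entries. Returns the entry count.
 */
static size_t
tx_order(size_t n, size_t bad, size_t lag, size_t *out)
{
	size_t k = 0, i;

	for (i = 0; i < n; i++) {
		out[k++] = i;
		if (bad < n &&
		    (i == bad + lag || (i == n - 1 && i < bad + lag)))
			out[k++] = bad;	/* recovery branch: resend */
	}
	return k;
}
```

For 8 frames with frame 2 failing and a lag of 2, the model gives
0 1 2 3 4 2 5 6 7: two frames go out of order, then the stream is back in
sequence.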

I'm pretty sure that ignoring the status bits in DBDMA is causing
problems in scsi, snapper, and serial, but that's just a hunch.

I've been quite pressed for time lately so I don't always get to work on
this as much as I'd like, but I'll post as soon as I have something for
wider testing.

tim