port-mac68k: Re: Quadra AV SCSI DMA Code

Subject: Re: Quadra AV SCSI DMA Code
To: Aaron Brown <aaron@results-computing.net>
From: Michael R.Zucca <mrz5149@acm.org>
List: port-mac68k
Date: 06/08/2003 20:32:01

On Sunday, June 8, 2003, at 08:33  PM, Aaron Brown wrote:

> Thanks for the code. I'll start looking at it asap. Do you have any 
> good sources of information about the Quadra SCSI/DMA controller?

Nothing formal, just what I've gleaned from staring at the ROM 
disassemblies and pieced together from the AV tech notes and such.

The SCSI is just your typical NCR SCSI controller that is used in every 
other Quadra. NetBSD has a very capable SCSI driver for that chip. The 
DMA engine, however, is a pure Apple creation.

I've got most of the DMA engine stuff laid out in the PSC code updates
that I made. The only thing I didn't really do was find out which
interrupt the SCSI DMA channel is on, but I'm pretty sure it has a DMA
interrupt. It's probably a low-priority interrupt, since I believe it
has the lowest interrupt priority listed in the technote. I'll bet it's
interrupt 0 on the PSC's level 4 interrupts, or it's somewhere in the
PSC's level 3 interrupts.

The DMA engine is pretty simple. Dave Huang first described it when he 
did DMA ethernet. I took what he had and looked at what the ROM code 
did and got a better handle on what's going on.

Basically, there are DMA channels set aside for each device that can do 
DMA. Some DMA channels have different width/alignment requirements. For
instance, the SCSI channel appears to DMA 2 bytes at a time (as shown
in the technote) and requires 16-byte alignment (as seen by
experience), while other channels like the Serial and Floppy channels
can do transactions 1 byte at a time and have different alignment
requirements. I've looked at the floppy ROM code and it appears that
there is no alignment restriction on that channel. I suspect that the
minimum size is device dependent, while the alignment restriction 
probably has something to do with the DMA engine and how it was 
programmed for a particular channel, or it has something to do with a 
cache line size (is 16 bytes a 040 cache line?). Though, the DMA engine 
is _NOT_ coherent, so why cache line size would matter I can't imagine. 
Since there is no snooping between the DMA engine and the CPU, I used 
the bus_dma infrastructure NetBSD provides. If you do any further work, 
I strongly recommend continuing to use it to avoid weird caching 
issues. Besides, it's a really well-thought-out and cool
interface :-) It also has all the necessary infrastructure to 
find/combine contiguous physical regions given a virtual address and 
length.
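
To make the constraints above concrete, here is a minimal sketch of a
check a driver might run before handing a buffer to the SCSI DMA
channel. The helper name and constants are hypothetical; the 16-byte
alignment and 2-byte width are only the observations described above,
not documented Apple behavior.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed constraints, taken from the observations in this message. */
#define SCSI_DMA_ALIGN 16u /* start-address alignment (seen by experience) */
#define SCSI_DMA_WIDTH 2u  /* bytes moved per DMA transaction (technote) */

/* Hypothetical helper: can this buffer go to the SCSI channel as-is? */
static bool
scsi_dma_ok(uintptr_t addr, size_t len)
{
	/* The mask test below only works for a power-of-two alignment. */
	assert((SCSI_DMA_ALIGN & (SCSI_DMA_ALIGN - 1)) == 0);

	if (len == 0)
		return false;
	if ((addr & (SCSI_DMA_ALIGN - 1)) != 0)
		return false;	/* start address not 16-byte aligned */
	if ((len % SCSI_DMA_WIDTH) != 0)
		return false;	/* odd length cannot be moved 2 bytes at a time */
	return true;
}
```

In a real driver the address checked would be the physical address of
each segment that bus_dma hands back, not the virtual address.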

In any case each DMA channel has what I call two DMA "streams" (or 
register sets in NetBSD/mac68k parlance). There is a channel 
control/status register and three registers for each stream: transfer 
address, transfer length, command/status. I think the DMA engine is
supposed to be set up so that you can have one DMA stream running and
another pending, though I haven't used that feature. Everything is
strictly one stream at a time right now, with one transfer. In the
future it might be nice to have one transfer "in flight" while queuing 
another transfer to go. This would be nice in a scenario where we have 
the DMA interrupt doing chaining. If the SCSI setup routine programmed 
the first two segments of a transfer, when the DMA engine interrupts 
looking for another segment, the other stream could be doing a transfer 
while the interrupt is being processed by the CPU! This might yield a 
really good latency/throughput win.
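
The two-stream idea above could be sketched roughly as follows. This is
only an in-memory model: the struct layout, the STREAM_GO bit, and all
function names are made up for illustration and do not reflect the real
PSC register offsets or bit definitions. The segment array stands in
for the segment list that bus_dma would provide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one channel: a control/status register plus
 * two streams (register sets) of address, length, command/status. */
struct dma_stream {
	uint32_t addr;		/* transfer address */
	uint32_t len;		/* transfer length */
	uint16_t cmdstat;	/* command/status */
};

struct dma_channel {
	uint16_t ctrl;			/* channel control/status */
	struct dma_stream set[2];	/* the two streams */
};

struct dma_seg { uint32_t addr, len; };	/* stand-in for a bus_dma segment */

struct dma_xfer {
	const struct dma_seg *segs;
	size_t nsegs;
	size_t next;	/* next segment not yet handed to the engine */
};

#define STREAM_GO 0x0001u	/* made-up "start" bit for this model */

/* Load the next pending segment into stream s; returns 0 if none left. */
static int
dma_load_next(struct dma_channel *ch, struct dma_xfer *x, int s)
{
	assert(s == 0 || s == 1);
	if (x->next >= x->nsegs)
		return 0;
	ch->set[s].addr = x->segs[x->next].addr;
	ch->set[s].len = x->segs[x->next].len;
	ch->set[s].cmdstat = STREAM_GO;
	x->next++;
	return 1;
}

/* Setup: prime both streams so one can run while the other is pending. */
static void
dma_start(struct dma_channel *ch, struct dma_xfer *x)
{
	x->next = 0;
	(void)dma_load_next(ch, x, 0);
	(void)dma_load_next(ch, x, 1);
}

/* Interrupt: stream `done` finished; refill it while the other runs. */
static int
dma_intr(struct dma_channel *ch, struct dma_xfer *x, int done)
{
	ch->set[done].cmdstat = 0;
	return dma_load_next(ch, x, done);
}
```

The point of the design is that dma_intr() only touches the stream that
just finished, so the other stream can keep moving data while the CPU
services the interrupt.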

Check out my routines to see how to control the DMA engine in general. 
It's a little magical right now. Perhaps in the future I'll describe it 
better.

I think the plan of attack for optimizations is:
1. Find the DMA interrupt and change the code so that the SCSI code 
just passes the DMA routines the bus_dma information about the 
transfer. Then, when the DMA interrupt fires, the code can just
take the next segment from the bus_dma information and slam it into the
DMA engine. This should improve interrupt latency significantly for
multisegment transfers since reading a value from a bus_dma structure
and slamming it into the DMA engine is much less work than fooling
around in the SCSI state machine to set up the next transfer.
2. Optimize the code so that we do overlapping transfers like I
described above (i.e. one stream running while another loads) for DMAs
that have two or more segments.
3. Do non-16-byte-aligned transfers under 4k by copying the data to/from
a pre-allocated and aligned transfer buffer. This will help solve the 
sync negotiation problem by boiling it down to figuring out how to do 
odd-sized transfers that appear to be completely DMA.
4. Figure out a way to do odd-sized reads/writes that will satisfy the 
sync negotiation code. This might be accomplished using a transfer pad 
(a feature of the SCSI chip) or by doing a PIO read/write of the last 
byte in the DMA interrupt. I tried doing the read/write of the last 
byte in the SCSI interrupt, but I think that it's too late by the time 
the SCSI interrupt fires. I think that if you've taken the SCSI
interrupt, you've already failed the sync negotiation. I also wonder if
a PIO read/write to the SCSI FIFO would blow the sync negotiation too.
In any case,
this is the most obnoxious problem to solve so I suggest you save it 
for last. :-)
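
Step 3 of the plan above (staging unaligned transfers through an
aligned buffer) might look something like this minimal sketch. All
names here are hypothetical, the 4096-byte limit is just the "under 4k"
figure from the list, and odd-sized lengths (step 4) are deliberately
not handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DMA_ALIGN   16u		/* assumed SCSI channel alignment */
#define BOUNCE_SIZE 4096u	/* "under 4k" limit from the plan */

/* Pre-allocated, 16-byte-aligned staging buffer (C11 _Alignas). */
static _Alignas(16) uint8_t bounce[BOUNCE_SIZE];

enum dma_path { DMA_DIRECT, DMA_BOUNCE, DMA_FALLBACK };

/*
 * Hypothetical helper: decide how to move len bytes at buf.
 * For a bounced write (memory to device) the data is staged into
 * `bounce` first. *dma_buf receives the address to give the engine.
 */
static enum dma_path
dma_prepare(void *buf, size_t len, int is_write, void **dma_buf)
{
	assert(buf != NULL && dma_buf != NULL);

	if (((uintptr_t)buf & (DMA_ALIGN - 1)) == 0) {
		*dma_buf = buf;		/* already aligned: DMA in place */
		return DMA_DIRECT;
	}
	if (len < BOUNCE_SIZE) {
		if (is_write)
			memcpy(bounce, buf, len);
		*dma_buf = bounce;	/* unaligned but small: stage it */
		return DMA_BOUNCE;
	}
	*dma_buf = NULL;		/* unaligned and large: caller decides */
	return DMA_FALLBACK;
}

/* After a bounced read completes, copy the data back to the caller. */
static void
dma_finish_read(void *buf, size_t len)
{
	assert(len <= BOUNCE_SIZE);
	memcpy(buf, bounce, len);
}
```

A single static buffer like this only works with one transfer
outstanding at a time, which matches the one-stream-at-a-time state of
the code described above.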

If you have any questions, just ask.

-- 
----------------------------------------------
  Michael Zucca - mrz5149@acm.org
----------------------------------------------
  "I'm too old to use Emacs." -- Rod MacDonald
----------------------------------------------