Port-vax archive


Re: I/O bus reset to fix CMD MSCP controllers (and probably others)



On Fri, Mar 28, 2025 at 05:27:48PM +0100, Johnny Billquist wrote:
> Here is the actual patch:
> 
> *** usr/src/sys/conf/boot/raboot.s.old  Mon Aug 17 21:41:34 2009
> --- usr/src/sys/conf/boot/raboot.s      Mon Aug 17 22:44:12 2009
> ***************
> *** 1,5 ****
> --- 1,9 ----
>   /*
>    *    SCCS id @(#)raboot.s    2.0 (2.11BSD)   4/13/91
> +  *
> +  * Code corrected as per the other primitive mscp drivers
> +  * to handles other mscp controllers than DECs.
> +  * /bqt - 20090817
>    */
>   #include "localopts.h"
> 
> ***************
> *** 59,65 ****
> 
>   MSCPSIZE =    64.     / One MSCP command packet is 64bytes long (need 2)
> 
> ! RASEMAP       =       140000  / RA controller owner semaphore
> 
>   RAERR =               100000  / error bit
>   RASTEP1 =     04000   / step1 has started
> --- 63,69 ----
> 
>   MSCPSIZE =    64.     / One MSCP command packet is 64bytes long (need 2)
> 
> ! RASEMAP       =       100000  / RA controller owner semaphore
> 
>   RAERR =               100000  / error bit
>   RASTEP1 =     04000   / step1 has started
> ***************
> *** 153,170 ****
>         mov     $RASEMAP,*$ra+RARSPH    / set mscp semaphores
>         mov     $RASEMAP,*$ra+RACMDH
>         mov     *_bootcsr,r0            / tap controllers shoulder
> !       mov     $ra+RACMDI,r0
>   1:
>         tst     (r0)
> !       beq     1b                      / Wait till command read
> !       clr     (r0)+                   / Tell controller we saw it, ok.
>   2:
>         tst     (r0)
> !       beq     2b                      / Wait till response written
>         clr     (r0)                    / Tell controller we got it
>         rts     pc
> 
> ! icons:        RAERR
>         ra+RARING
>         0
>         RAGO
> --- 157,176 ----
>         mov     $RASEMAP,*$ra+RARSPH    / set mscp semaphores
>         mov     $RASEMAP,*$ra+RACMDH
>         mov     *_bootcsr,r0            / tap controllers shoulder
> !       mov     $ra+RACMDH,r0
>   1:
>         tst     (r0)
> !       bmi     1b                      / Wait till command read
> !       mov     $ra+RARSPH,r0
>   2:
>         tst     (r0)
> !       bmi     2b                      / Wait till response written
> !       mov     $ra+RACMDI,r0
> !       clr     (r0)+                   / Tell controller we saw it, ok.
>         clr     (r0)                    / Tell controller we got it
>         rts     pc
> 
> ! icons:        RAERR + 033
>         ra+RARING
>         0
>         RAGO

So just out of curiosity, I took a look at the whole 2.11BSD rauboot.s
as I wanted to know what it is doing and what wisdom may be gleaned from
this patch. Not much, it seems, as it apparently fixes a different
problem.

But the initialization bits look similar:

RAERR =         100000  / error bit 
RASTEP1 =       04000   / step1 has started
RAGO =          01      / start operation, after init
...
RARING =        8.      / Ring base
...
/
/ RA initialize controller
/
        mov     $RASTEP1,r0
        mov     raip,r1
        clr     (r1)+                   / go through controller init seq.
        mov     $icons,r2
1:
        bit     r0,(r1)
        beq     1b
        mov     (r2)+,(r1)
        asl     r0
        bpl     1b
        ...

icons:  RAERR + 033
        ra+RARING
        0
        RAGO

So it writes 0 into IP just once, and loops until the step 1 bit is set
in SA. Once there, it writes the values beginning at icons, each
corresponding to an initialization value for SA for each step, and waits
for each step bit by shifting RASTEP1.
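
For anyone who doesn't read PDP-11 assembler fluently, the loop can be
rendered roughly in C as below. This is only a sketch: the port_ops
callbacks and the fake_port model are made up so the loop can run
without hardware, and like the boot code it has no timeout and no error
check.

```c
#include <stdint.h>

#define RAERR   0100000u  /* bit 15 */
#define RASTEP1 0004000u  /* step1 has started */
#define RAGO    0000001u  /* start operation, after init */

/* Hypothetical register accessors; real code would touch IP/SA directly. */
struct port_ops {
    void     (*write_ip)(void *ctx, uint16_t v);
    uint16_t (*read_sa)(void *ctx);
    void     (*write_sa)(void *ctx, uint16_t v);
    void      *ctx;
};

/* Same control flow as the assembler: no timeout, no error check. */
static void ra_init(const struct port_ops *p, const uint16_t icons[4])
{
    uint16_t step = RASTEP1;            /* mov $RASTEP1,r0 */

    p->write_ip(p->ctx, 0);             /* clr (r1)+ : start init */
    do {
        while ((p->read_sa(p->ctx) & step) == 0)
            ;                           /* bit r0,(r1) ; beq 1b */
        p->write_sa(p->ctx, *icons++);  /* mov (r2)+,(r1) */
        step = (uint16_t)(step << 1);   /* asl r0 */
    } while ((step & 0100000u) == 0);   /* bpl 1b: stop once bit 15 is set */
}

/* Made-up controller model: raises the next step bit after each SA write. */
struct fake_port {
    uint16_t sa;
    uint16_t written[4];
    int      nwritten;
};

static void fake_write_ip(void *ctx, uint16_t v)
{
    struct fake_port *f = ctx;
    (void)v;                            /* any value starts initialization */
    f->nwritten = 0;
    f->sa = RASTEP1;
}

static uint16_t fake_read_sa(void *ctx)
{
    return ((struct fake_port *)ctx)->sa;
}

static void fake_write_sa(void *ctx, uint16_t v)
{
    struct fake_port *f = ctx;
    f->written[f->nwritten++] = v;
    /* Steps 1-3 advance to the next step bit; after step 4 SA reads 0. */
    f->sa = (f->nwritten < 4) ? (uint16_t)(f->sa << 1) : 0;
}
```

The step mask is shifted left once per word written, so the loop ends
after exactly four writes, when the mask reaches bit 15.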

Step 1: RAERR + 033 (100033)
        Bit 15 needs to be 1, and RAERR does that, but it has nothing to
        do with an error here. 033 is the interrupt vector divided by 4,
        so it corresponds to vector 154 (octal), which is the default
        vector for the first MSCP controller. But IE is 0, so it
        shouldn't matter. Ring length is 0 for both commands and
        responses, corresponding to 2**0 == 1 entry each.

Step 2: ra + RARING
        ra is the base of the communications area, but the controller
        actually expects to be given the base of the response and
        command descriptor rings, which are at +8 in the comm area.
        That's the low 16 bits of the full Unibus or Qbus address.

Step 3: 0
        That's the high bits of the full Unibus or Qbus address of the
        comm area.

Step 4: RAGO
        Set DMA burst = 0 (1 longword), request no "last fail" message,
        and kick the controller into action.

So, 9 instructions of code plus 4 words of data to get the thing going.
Nice.
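
To double-check the Step 1 arithmetic, the word can be unpacked field by
field. The layout assumed here (vector/4 in bits 6:0, IE in bit 7,
log2 of the response ring length in bits 10:8, log2 of the command ring
length in bits 13:11) is how the Step 1 host word is usually described
for these ports; the struct and function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical container for the decoded Step 1 host-to-port word. */
struct step1_fields {
    unsigned vector;       /* interrupt vector as a byte address */
    unsigned ie;           /* interrupt enable during init */
    unsigned rsp_entries;  /* response ring entries (power of two) */
    unsigned cmd_entries;  /* command ring entries (power of two) */
};

static struct step1_fields decode_step1(uint16_t w)
{
    struct step1_fields f;

    f.vector      = (w & 0177u) << 2;        /* field holds vector / 4 */
    f.ie          = (w >> 7) & 1u;
    f.rsp_entries = 1u << ((w >> 8) & 7u);   /* field holds log2(length) */
    f.cmd_entries = 1u << ((w >> 11) & 7u);
    return f;
}
```

Under that layout, decode_step1(0100033) gives vector 0154, IE 0, and
one entry in each ring, matching the description above.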


Anyway, I've re-read most of the UDA50 programming manual this
morning and I'd like to share a few things from Section 9.2:
(https://bitsavers.org/pdf/dec/disc/uda50/AA-L621A-TK_UnibusPortDescription_1982.pdf)

  In the event of an initialization error, the port driver must retry
  the sequence at least once. It is suggested, however, that a second
  failure be considered as meaning that the port/controller is "down".

That's where the requirement for (at least) one retry comes from. We do
that only in udamatch(), assuming it won't ever be needed in udaattach().
I don't think that's necessarily a bad assumption, given that udamatch()
must have succeeded talking to the controller for us to ever reach
udaattach().
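
The manual's rule boils down to "try twice, then give up". A minimal
sketch of that policy, not the actual uda(4) code; init_fn,
init_with_retry() and the flaky_ctrl test double are invented for
illustration:

```c
/* Hypothetical: one full pass through Steps 1-4, returns 0 on success. */
typedef int (*init_fn)(void *ctx);

enum { INIT_OK = 0, INIT_DOWN = -1 };

/* Retry the whole sequence once; a second failure means "down". */
static int init_with_retry(init_fn try_init, void *ctx)
{
    int attempt;

    for (attempt = 0; attempt < 2; attempt++) {
        if (try_init(ctx) == 0)
            return INIT_OK;
    }
    return INIT_DOWN;
}

/* Test double: fails a configurable number of times, then succeeds. */
struct flaky_ctrl {
    int failures_left;
    int calls;
};

static int flaky_init(void *ctx)
{
    struct flaky_ctrl *f = ctx;

    f->calls++;
    if (f->failures_left > 0) {
        f->failures_left--;
        return -1;
    }
    return 0;
}
```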

  The host begins the initialization sequence either by issuing a bus
  INIT or by writing any value to the IP register. The port must
  guarantee that the host will read zeroes in SA on the next bus cycle.
  Initialization then sequences through Steps 1-4 as described on the
  following pages.

So we're kinda expected to read SA=0 once before we get to Step 1.

  From the host's viewpoint, Step n is deemed to have begun when reading
  SA shows the transition Sn 0-->1. Of course, Step n ends when Step
  n+1 begins as just defined. This transition from Step n to Step n+1
  may be accompanied by an interrupt, depending on whether interrupts
  are enabled.

Obviously the transition to Step 1 cannot cause an interrupt, but then
we're not using interrupts anyway despite enabling them.

  Steps 1-3 each are required to complete within 10 seconds. If any of
  these steps fails to complete within that period, this is to be
  treated as a host-detected fatal error.

This is where the 10s timeout in mscp_waitstep() comes from.

  During initialization, the host must wait 100 microseconds after any
  interrupt before reading the SA register to see if there was an error.
  This is because the port may use the SA register to deliver the vector
  address to the processor interrupt sequence. If it does, then time
  will be required by the port to set SA to the value to be read by the
  host initialization code.

We're probably good on that as mscp_waitstep() waits 10ms. Except for
the first read of SA, which is done with no delay. That's probably worth
fixing, just in case.
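
One possible shape for such a fix is to delay before every read of SA,
including the first. The sketch below is not mscp_waitstep(); the
wait_step() name, the callbacks, the 100 microsecond poll interval and
the fake_ctrl model are all assumptions made so the idea can be tried
without hardware.

```c
#include <stdint.h>

#define SA_ERR 0x8000u   /* bit 15 of SA: error reported by the port */

typedef uint16_t (*read_sa_fn)(void *ctx);
typedef void     (*delay_us_fn)(void *ctx, unsigned us);

/*
 * Poll SA for step_bit. Returns 1 if the bit appeared, 0 on timeout,
 * -1 if the port reported an error. The delay comes BEFORE each read,
 * so even the first read is at least 100 microseconds late.
 */
static int wait_step(read_sa_fn read_sa, delay_us_fn delay_us, void *ctx,
                     uint16_t step_bit, unsigned long timeout_us)
{
    const unsigned poll_us = 100;
    unsigned long waited = 0;

    for (;;) {
        uint16_t sa;

        delay_us(ctx, poll_us);
        waited += poll_us;
        sa = read_sa(ctx);
        if (sa & SA_ERR)
            return -1;
        if (sa & step_bit)
            return 1;
        if (waited >= timeout_us)
            return 0;
    }
}

/* Made-up controller: SA reads 0 until ready_at_us of delay has passed. */
struct fake_ctrl {
    unsigned long elapsed_us;
    unsigned long ready_at_us;
    uint16_t      ready_value;
    unsigned      reads;
    unsigned long first_read_at_us;
};

static void fake_delay(void *ctx, unsigned us)
{
    ((struct fake_ctrl *)ctx)->elapsed_us += us;
}

static uint16_t fake_read(void *ctx)
{
    struct fake_ctrl *f = ctx;

    if (f->reads++ == 0)
        f->first_read_at_us = f->elapsed_us;
    return (f->elapsed_us >= f->ready_at_us) ? f->ready_value : 0;
}
```

Passing the timeout as a parameter also leaves room for a much shorter
limit on the wait for the Step 1 bit than the 10 seconds allowed for
the steps themselves.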

  This pattern should appear within 100 microseconds after the
  hard-initialize.

This is about the Step 1 bit in SA appearing following a write to IP.
We're currently waiting the whole 10s if it doesn't appear, which
shouldn't do any harm but seems unnecessary. Also, this is where the CMD
controller is failing to react.

  Upon receipt of the above data the port/controller begins running its
  integrity check diagnostics. When finished, the port conditionally
  interrupts the host as described above. If enabled, the interrupt
  will take place whether the diagnostics succeeded or failed.

  Step 1 must complete within 10 seconds after the host writes to the SA
  register. The completion will result in an interrupt if IE was set to
  one in Step 1.

This is what we expect to have happened towards the end of udamatch()
before we return 1, or as the comment says: "should have interrupted by
now". Since we waited for SA to indicate the transition to Step 2, we
can be sure that the interrupt has already happened at that point.


So, I'm not sure this helps much with our problems with uda(4) on CMD
controllers, but I found it interesting nonetheless. The system with the
CMD controller will be offline and unreachable until Friday, so I won't
be able to conduct any more experiments until then.


Hans


-- 
%SYSTEM-F-ANARCHISM, The operating system has been overthrown

