Subject: Re: latest snapshot kernel locks up on PB520
To: Ken Nakata <kenn@synap.ne.jp>
From: Colin Wood <cwood@ichips.intel.com>
List: port-mac68k
Date: 08/21/1998 18:00:45
Ken Nakata wrote:
> Hiya,
> 
> I just did a clean install on a loaner PowerBook 520, but the latest
> snapshot kernel GENERIC 79 locks up after the "root on sd0a dumps on
> sd0b" line.  Command+Power works, so I took the following stack trace:
> 
> db> t
> _Debugger(12d5b2,3804,1dde44,35c0,0) + 6
> _nmihand(0,0,0,2714,0) + 26
> _lev7intr(?)
> _mi_switch(2714,2200,0,c,0) + 12
> _tsleep(46d000,11,110c3a,0) + 178
> _scsipi_execute_xs(46d000) + 9e
> _scsi_scsipi_cmd(43ee00,1ddeee,6,0,0) + a0
> _scsipi_prevent(43ee00,1,180) + 4e
> _sdopen(401,0,6000,0) + b0
> _sdsize(401,2704,1db184,1d2200,60000000) + 56
> _cpu_dumpconf(0,0) + 4c
> _main() + 328
> _main() + 328
> db> _
> 
> Continuing and later breaking into debugger shows the exact same stack
> trace.
> 
> Anyone have any idea what's going on?

If previous problem reports are correct, here's what is happening:  we're
still in single user mode, so there is only one process running.  We want
to configure dumps (I assume that's what cpu_dumpconf is doing), so we open
the SCSI disk (via sdopen() in dev/scsipi/sd.c).  This results in a call
to scsipi_prevent() (in dev/scsipi/scsipi_base.c) which appears to exist
solely to keep you from removing removable media whilst opening the drive.
Two flags are passed to scsipi_prevent():
SCSI_IGNORE_ILLEGAL_REQUEST and SCSI_IGNORE_MEDIA_CHANGE.  These flags are
passed on to scsi_scsipi_cmd() (in dev/scsipi/scsi_base.c), which
scsipi_prevent() calls via the scsipi_command() macro.  Next,
scsi_scsipi_cmd() creates a new scsipi_xfer structure and places the flags
into it.  It then passes this structure to the scsipi_execute_xs()
function (in dev/scsipi/scsipi_base.c) to do the transfer.
scsipi_execute_xs() uses the scsipi_command_direct() macro to execute the
command.  This macro expands to a call to the host adapter's scsi_cmd
routine (in this case, ncr5380_scsi_cmd()).  At this
point, ncr5380_scsi_cmd() checks to see if it is polling or not.  Since
the polling flag isn't set, it queues the command and returns.  Back in
scsipi_execute_xs(), we note the return value of SUCCESSFULLY_QUEUED in
the following code block:

==========================================================================

        case SUCCESSFULLY_QUEUED:
                if ((xs->flags & (SCSI_NOSLEEP | SCSI_POLL)) == SCSI_NOSLEEP)
                        return (EJUSTRETURN);
#ifdef DIAGNOSTIC
                if (xs->flags & SCSI_NOSLEEP)
                        panic("scsipi_execute_xs: NOSLEEP and POLL");
#endif
                s = splbio();
                while ((xs->flags & ITSDONE) == 0)
                        tsleep(xs, PRIBIO + 1, "scsipi_cmd", 0);
                splx(s);

==========================================================================
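
To make the two flag tests in that block easier to follow, here is a
small stand-alone sketch that classifies each combination of SCSI_NOSLEEP
and SCSI_POLL.  The bit values and the enum/function names are
placeholders I made up for illustration; only the two tests themselves
come from the code quoted above.

```c
#include <assert.h>

/* Placeholder bit values for illustration only; check the kernel
 * headers for the real definitions. */
#define SCSI_NOSLEEP    0x01
#define SCSI_POLL       0x02

/* Hypothetical names for the three outcomes of the quoted case. */
enum queued_outcome {
        OUT_JUSTRETURN,          /* return (EJUSTRETURN) */
        OUT_PANIC_IF_DIAGNOSTIC, /* panic() when built with DIAGNOSTIC */
        OUT_SLEEP                /* fall through to the tsleep() loop */
};

/* Mirror the two flag tests in the SUCCESSFULLY_QUEUED case. */
enum queued_outcome
classify_queued(int flags)
{
        if ((flags & (SCSI_NOSLEEP | SCSI_POLL)) == SCSI_NOSLEEP)
                return OUT_JUSTRETURN;
        if (flags & SCSI_NOSLEEP)       /* NOSLEEP and POLL both set */
                return OUT_PANIC_IF_DIAGNOSTIC;
        return OUT_SLEEP;
}

/* Usage: walk all four combinations of the two bits. */
void
check_classify_queued(void)
{
        assert(classify_queued(SCSI_NOSLEEP) == OUT_JUSTRETURN);
        assert(classify_queued(SCSI_NOSLEEP | SCSI_POLL) ==
            OUT_PANIC_IF_DIAGNOSTIC);
        assert(classify_queued(SCSI_POLL) == OUT_SLEEP);
        assert(classify_queued(0) == OUT_SLEEP);
}
```

With neither bit set, which is what the stack trace suggests, the request
ends up in the tsleep() loop.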

Basically, we sleep here until the command completes and ITSDONE is set.
I believe the problem is that splbio() blocks interrupts from the SCSI
controller, so the command will never complete.  The call to tsleep()
should switch to another process, but there is no other process.  We
should be polling at this point instead of relying on interrupt-driven
I/O.

So, why aren't we polling?  It depends. Is this using the sbc driver or
the ncrscsi driver?

In mac68k/dev/ncr5380.c, from ncr5380_scsi_cmd():

==========================================================================

/*
 * Carry out a request from the high level driver.
 */
static int
ncr5380_scsi_cmd(struct scsipi_xfer *xs)
{
        int     sps;
        SC_REQ  *reqp, *link, *tmp;
        int     flags = xs->flags;

==========================================================================


Note that the flags are only grabbed from the scsipi_xfer struct.
Likewise, in the MI ncr5380 driver (dev/ic/ncr5380sbc.c), you have the
following:

==========================================================================

/*
 * Enter a new SCSI command into the "issue" queue, and
 * if there is work to do, start it going.
 *
 * WARNING:  This can be called recursively!
 * (see comment in ncr5380_done)
 */
int
ncr5380_scsi_cmd(xs)
        struct scsipi_xfer *xs;
{
        struct  ncr5380_softc *sc;
        struct sci_req  *sr;
        int s, rv, i, flags;

        sc = xs->sc_link->adapter_softc;
        flags = xs->flags;

        if (sc->sc_flags & NCR5380_FORCE_POLLING)
                flags |= SCSI_POLL;


==========================================================================


The MI version additionally sets SCSI_POLL in its local copy of the flags
when NCR5380_FORCE_POLLING is set in the softc.  However, I can find
nowhere in the code that actually sets the SCSI_POLL flag during
autoconfiguration, so this is what seems to be the problem.

Of course, I'm not quite sure what the solution should be.  Perhaps we
need to find some way to set polling during autoconf?  This seems to be a
problem with the MI part of the esp driver as well.  Why aren't they
affected?
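
As a rough sketch of what "set polling during autoconf" could look like,
here is the flag computation from the MI driver pulled out into a
stand-alone function, with one extra test added.  The in_autoconf
parameter is purely hypothetical (I don't know what the right signal
would be), and the bit values are placeholders rather than the real
definitions.  This is an illustration of the idea, not a proposed patch.

```c
#include <assert.h>

/* Placeholder bit values for illustration only. */
#define SCSI_POLL               0x02
#define NCR5380_FORCE_POLLING   0x01

/* Compute the flags the driver would act on for one request. */
int
effective_flags(int xs_flags, int sc_flags, int in_autoconf)
{
        int flags = xs_flags;

        if (sc_flags & NCR5380_FORCE_POLLING)   /* as in the MI driver */
                flags |= SCSI_POLL;
        if (in_autoconf)                        /* hypothetical addition */
                flags |= SCSI_POLL;
        return flags;
}

/* Usage: polling is forced by either condition, otherwise untouched. */
void
check_effective_flags(void)
{
        assert(effective_flags(0, 0, 0) == 0);
        assert(effective_flags(0, NCR5380_FORCE_POLLING, 0) == SCSI_POLL);
        assert(effective_flags(0, 0, 1) == SCSI_POLL);
        assert(effective_flags(0x10, 0, 0) == 0x10);
}
```

Note that this only changes the driver's local copy of the flags; the
SUCCESSFULLY_QUEUED case in scsipi_execute_xs() tests xs->flags, so the
upper layer would presumably need to know about the polling as well.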

Later.

-- 
Colin Wood                                 cwood@ichips.intel.com
Component Design Engineer - PMD                 Intel Corporation
-----------------------------------------------------------------
I speak only on my own behalf, not for my employer.