macbsd-development: panic: biodone already (Re: MacLinux vs. MacBSD)

Subject: panic: biodone already (Re: MacLinux vs. MacBSD)
To: None <macbsd-general@NetBSD.ORG>
From: David Condon <david@trouble.wariat.org>
List: macbsd-development
Date: 01/23/1995 14:53:29
Dameon D. Welch writes:

>It'd sure be nice to get rid of the "panic: biodone already" things...
>;-)

That reminds me, I've been meaning to report that I hacked my kernel to get
rid of this. I recognise that what I did is basically treating the symptom
instead of the disease, sort of like putting a penny in the fuse box, and
for all I know could have hazardous side effects. But I have been torture
testing it for the past couple weeks and doing things like running X and
having a couple of different compilations going on at once along with an
ftp session and playing a game, with disc I/O and swapping going on more
or less constantly for long periods, and so far haven't noticed any problems.

I patched the scsi_done, sdstrategy and sdstart routines so that wherever
biodone() was called, I just put in:

        if (!(bp->b_flags & B_DONE))
                biodone(bp);

The panics didn't stop until this was done to scsi_done(), but I just left the
other ones that way as well.
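
To make the effect of that check concrete, here is a minimal userland sketch.
It is not the kernel code: the struct layout, the B_DONE value, the panics
counter and the helpers biodone_guarded() and complete_twice() are all
stand-ins invented for illustration. It only shows that a second completion
of the same buffer trips the "already done" test unless the caller checks
B_DONE first.

```c
/* Illustrative stand-ins, NOT the real <sys/buf.h> definitions. */
#define B_DONE 0x0200           /* assumed flag value, for the sketch only */

struct buf {
    long b_flags;
};

int panics;                     /* counts would-be "biodone already" panics */

/* Simplified biodone(): the real routine does much more than set a flag. */
void biodone(struct buf *bp)
{
    if (bp->b_flags & B_DONE) {
        panics++;               /* the kernel would panic here */
        return;
    }
    bp->b_flags |= B_DONE;
}

/* The workaround: skip the call if the buffer is already marked done. */
void biodone_guarded(struct buf *bp)
{
    if (!(bp->b_flags & B_DONE))
        biodone(bp);
}

/* Complete one buffer twice, with or without the guard, and return how
 * many times the panic condition was hit. */
int complete_twice(int guarded)
{
    struct buf b = { 0 };

    panics = 0;
    for (int i = 0; i < 2; i++) {
        if (guarded)
            biodone_guarded(&b);
        else
            biodone(&b);
    }
    return panics;
}
```

In this sketch complete_twice(0) hits the panic condition once and
complete_twice(1) never does. The duplicate completion still happens in the
guarded case; it is just ignored, which is why this hides the symptom rather
than fixing the cause.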

If somebody would explain to me step by step how to generate patch files using
diff, I would be happy to send the patches.

Presumably, there is some rational reason for the panic to happen -- it's
telling you that something is wrong. This just prevents it from ever getting
into the stage where that can happen. There are numerous bits in the scsi files
that are #ifdeffed out as if the routines didn't work properly the first
time, so it was just left for some other time to fix it. It's all too much
for my poor brain to cope with, but someone who actually understood what all
these functions do could probably fix it in a relatively short time.

I also wondered if this could possibly have anything to do with the
"non-working" ncr96scsi driver -- because that is used in the SE/30 (that's
what I have). It says "ncr96scsi not configured" during bootup.

Back in December, Paul Goyette posted a message that I found in the
macbsd-development archives, appended below, which seems to have some solid
information on the possible causes of the problem. My experience supports the
idea that it is related to a BUSY SCSI bus condition. I have a total of 4 hard
drives (only two of which have unix partitions) plus a tape drive. I found
that with the Beta-1 kernel, the biodone already panic occurred only
occasionally as long as only the internal drive, with the root netbsd
partition, was connected. It would
still happen, but rarely enough that I was able to go through building a
new kernel. However, the panic was guaranteed to happen in short order
whenever _any_ additional device was connected to the SCSI bus, _whether or
not another netbsd partition was mounted_.

>From finchm  Tue Dec  6 20:17:28 1994
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Tue, 6 Dec 1994 17:50:11 -0800
To: current-users@netbsd.org, macbsd-development@netbsd.org
From: goyettep@ccnet.com (Paul Goyette)
Subject: More on "biodone already" panic
Message-ID: <1425386285-37789@pgoyette.ccnet.com>

Just in case this helps, I've used the debugger to look around at things
when the biodone already panic occurs.  What I've found is that somewhere
along the way an xs (scsi transfer) structure has been deallocated twice to
the free list.  next_free_ns points to an entry whose link points to
itself!  And the sc_link structure's opennings field (essentially, an
"outstanding I/O credits" counter) contains a value of 3, even though it
was initialized at 2.

Further, the problem only seems to occur _after_ a transfer has completed
with an error of XS_BUSY (xs->error = 8).  And xs->retries field has been
decremented all the way to -1 (ie, no more retries allowed).

I put some extra debugging code into scsi_base.c and verified that the xs
has indeed been doubly-deallocated, and that the previous transfer
completed with XS_BUSY.  It seems that just doing a
"printf ("XS_BUSY!\n")" in routine sc_err1 causes enuf delay to avoid the
problem most of the time, but not always.

I'm running with a Toshiba 1.2GB internal hard drive with separate root,
swap and usr partitions, and an external Syquest 270MB cartridge drive for
the source tree, both of which claim to be SCSI-2 capable.  The XS_BUSY
error only occurs on the Toshiba.  So, now I'm rebuilding with the Syquest
as my Root&Usr drive, and I'll use the Toshiba for the sources.  This way,
at least I'll be able to keep things up long enuf to do something useful,
but I may not be able to build any new kernels for further testing.  :(

Hope this helps someone who's more familiar with scsi innards to solve the
problem.  This (and the adb scrolling bug) are the only things I have to
grouse/gripe/bitch about!

------------------------------------------------------------------
| Paul Goyette            | PGP Public key available on request  |
| Paul@pgoyette.ccnet.com | Fingerprint: 9D 3C 90 0E DA 46 10 59 |
| goyettep@ccnet.com      |              15 F2 87 D6 AA BD 90 D5 |
------------------------------------------------------------------

**** end Paul Goyette's message ****
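
The state Paul describes (a free-list entry whose link points to itself, and
a credit counter of 3 after starting at 2) is what you would expect if one
transfer structure is released twice. Here is a small userland sketch of
that, under my own assumptions: struct xfer, struct link, put_xfer() and
double_release() are hypothetical names, not the ones in scsi_base.c, and
the real allocator is more involved.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the scsi transfer and link structures. */
struct xfer {
    struct xfer *next;          /* free-list link */
};

struct link {
    int openings;               /* outstanding I/O credits */
};

struct xfer *free_list;         /* head of the free list */

/* Return a transfer to the free list and give back one credit.
 * Nothing here checks whether xs is already on the list. */
void put_xfer(struct link *l, struct xfer *xs)
{
    xs->next = free_list;
    free_list = xs;
    l->openings++;
}

/* Issue one transfer, then release it twice. Returns 1 if the head of the
 * free list ends up linked to itself; stores the final credit count. */
int double_release(int *openings_after)
{
    struct xfer xs = { NULL };
    struct link l = { 2 };      /* counter initialized at 2 */
    int self_linked;

    free_list = NULL;
    l.openings--;               /* transfer issued: one credit consumed */
    put_xfer(&l, &xs);          /* normal completion */
    put_xfer(&l, &xs);          /* erroneous second completion */

    *openings_after = l.openings;
    self_linked = (free_list == &xs && xs.next == &xs);
    free_list = NULL;           /* do not keep a pointer to a stack object */
    return self_linked;
}
```

After the second put_xfer() the entry is its own successor and the counter
reads 3, matching what Paul saw in the debugger. This only reproduces the
resulting state; it says nothing about which code path in the driver does
the second release.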

My debugger trace with Beta-1 kernel (copied down by hand and then typed --
not guaranteed against any typos). It was the same as this every time.

panic: biodone already
Stopped at           _Debugger+0x6:    unlk  a6
db> trace
_Debugger(15dcc,259b8,ffffc18,612d60,fffffc1c) + 6
_panic(259b8,c08e80,fffffc40,6ca6e,612d60) + 34
_biodone(612d60,c08e80,c0afa0,1,3) + 18
_scsi_done(c08e80,c08e80,612d60,10,c08e80) + 96
_ae_ring_to_mbuf(c08e80) + 53e
_scsi_done(c08e80,c08e80,612d60,10,c08e80) + 7a
_ae_ring_to_mbuf(c08e80) + 53e
_scsi_done(c08e80,c08e80,612d60,10,c08e80) + 7a
_ae_ring_to_mbuf(c08e80) + 53e
_scsi_done(c08e80,c08e80,a,401,c08eda) + 7a
_ae_ring_to_mbuf(c08e80) + 53e
_scsi_scsi_cmd(c0afa0,fffffd6a,a,e5c000,2000,4,2710,612d60,401) + 90
_sdstart(2,c051ae,612d60,53e000,29e) + 10c
_sdstrategy(612d60,fffffddc,53dbc,fffffe14,53e000) + a0
_spec_strategy(fffffe14,53e000,612d60,c45480,0) + 24
_ufs_strategy(fffffe14) + d2
_cluster_read(c45480,0,101cb23,29d,2000) + 394
_ffs_read(fffffed0,0,8000,13588,0) + 1c8
_vn_read(c51b80,ffffff28,c40200) + 82
_read(c09400,ffffff84,ffffff7c) + a2

--
      "The Net interprets censorship as damage and routes around it."
                                          -- John Gilmore
david@trouble.wariat.org