port-macppc: Re: Boot failures with MESH SCSI bus

Subject: Re: Boot failures with MESH SCSI bus
To: M L Riechers <mlr@rse.com>
From: Monroe Williams <monroe@pobox.com>
List: port-macppc
Date: 10/18/2000 03:25:32

on 10/16/00 3:53 PM, M L Riechers at mlr@rse.com wrote:

>> Thanks for providing the stripped kernels.  Neither work with my 7500 (even
>> with real-base set to F00000)....
>> ....I still get the
>> same problems as my last post about it.
>> 
>> Any other suggestions?
>> 
>> -- MW
> 
> You're welcome.
> 
> Ummmmm...
> 
> You don't mention any hard drives.  What devices do you have on the
> two SCSI busses -- powered up or not?  What are their actual SCSI
> addresses?  Did they come from apple?  Are they new, or did they come
> from other systems? Do they work on other systems, or do they work
> with the Mac OS on your 7500?

I believe I'm having a very similar problem to the ones Michael Wolfson and
Jake Luck have reported.  On my 7500, I get more or less the following with
anything later than the 20000205-current snapshot:

...
mesh0 at obio0 offset 0x18000 irq 13: 50 MHz, SCSI ID 7
scsibus1 at mesh0: 8 targets, 8 luns per target
...
scsibus0: waiting 2 seconds for devices to settle...
scsibus1: waiting 2 seconds for devices to settle...
probe(mesh0:0:0) Sense Error Code 0x0
mesh: timeout state=3
mesh: resetting dma

and it stops there.  (I don't see the timeout message repeat.)

I just now tried both the kernels you posted links to on Oct. 12:

netbsd.EASTERN_ZB doesn't display the "probe" message, and repeats the
"timeout/resetting" message twice.

netbsd.EASTERN-1.5ALPHA2 gives the following:

...
scsibus0: waiting 2 seconds for devices to settle...
scsibus1: waiting 2 seconds for devices to settle...
probe(mesh0:0:0) Sense Error Code 0x0
sd0 at scsibus1 target 0 lun 0: <, , > SCSI0 0/direct fixed
sd0: 1691 MB, 8387 cyl, 10 head, 41 sec, 512 bytes/sect x 3464353 sectors
sd1 at scsibus1 target 1 lun 0: <, , > SCSI0 0/direct fixed
sd1: 1691 MB, 6810 cyl, 2 head, 254 sec, 512 bytes/sect x 3464353 sectors
probe(mesh0:3:0) Sense Error Code 0x0
sd2 at scsibus1 target 3 lun 0: <, , > SCSI0 0/direct fixed
mesh: timeout state=3
mesh: resetting dma
mesh: timeout state=3
mesh: resetting dma
mesh: timeout state=3
mesh: resetting dma
mesh: timeout state=3
mesh: resetting dma
mesh: timeout state=3
mesh: resetting dma
sd2: drive offline
boot device: sd1
root on sd1a dumps on sd1b
mesh: timeout state=3
mesh: resetting dma
mesh: timeout state=3
mesh: resetting dma
mesh: timeout state=3
mesh: resetting dma

at which point I stopped it because I couldn't stand to watch.  The drive
geometries given are wildly inaccurate.  (See below for what's really on the
chain.)
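
To put numbers on "wildly inaccurate", here is a small Python sketch that
redoes the arithmetic on the sd lines quoted in this message. It is not
NetBSD code: the helper names (size_mb, chs_sectors) are made up for
illustration, and treating "MB" as 2**20 bytes is an assumption that happens
to reproduce the printed sizes.

```python
SECTOR_BYTES = 512  # "512 bytes/sect" in every sd line quoted here


def size_mb(sectors, sector_bytes=SECTOR_BYTES):
    """Capacity in whole megabytes, assuming 1 MB = 2**20 bytes."""
    return sectors * sector_bytes // (1024 * 1024)


def chs_sectors(cyl, heads, sec):
    """Sector count implied by a cylinder/head/sector geometry."""
    return cyl * heads * sec


# (label, cyl, heads, sec/track, total sectors), copied from the dmesg
# excerpts in this message: "good" is the 20000205 kernel, "bad" is
# netbsd.EASTERN-1.5ALPHA2.
drives = [
    ("sd0 good", 8387, 10, 212, 17850000),
    ("sd0 bad", 8387, 10, 41, 3464353),
    ("sd1 good", 6810, 2, 183, 2503872),
    ("sd1 bad", 6810, 2, 254, 3464353),
]

for label, cyl, heads, sec, total in drives:
    print(label, size_mb(total), "MB from sector count,",
          chs_sectors(cyl, heads, sec), "sectors from C/H/S")
```

The sizes come out as 8715, 1691, 1222 and 1691 MB, matching what each
kernel printed, so the size arithmetic itself looks fine. What differs is the
input: the cylinder and head counts agree between kernels, but the
sectors-per-track value and the total sector count do not, and the failing
kernel reports the same total (3464353) for two different drives.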

With the 20000205-current snapshot, everything works fine and runs for days
without any SCSI errors.  The relevant parts of dmesg (when booted with my
one working kernel) are as follows:

...
esp0 at obio0 offset 0x10000 irq 12: NCR53C94, 25MHz, SCSI ID 7
scsibus0 at esp0: 8 targets, 8 luns per target
...
mesh0 at obio0 offset 0x18000 irq 13: 50MHz, SCSI ID 7
scsibus1 at mesh0: 8 targets, 8 luns per target
...
sd0 at scsibus1 targ 0 lun 0: <IBM, DDRS-39130, S97B> SCSI2 0/direct fixed
sd0: 8715 MB, 8387 cyl, 10 head, 212 sec, 512 bytes/sect x 17850000 sectors
sd1 at scsibus1 targ 1 lun 0: <QUANTUM, FIREBALL_TM1280S, 300N> SCSI2 0/direct fixed
sd1: 1222 MB, 6810 cyl, 2 head, 183 sec, 512 bytes/sect x 2503872 sectors
cd0 at scsibus1 targ 3 lun 0: <MATSHITA, CD-ROM CR-8005, 1.0m> SCSI2 5/cdrom removable
boot device: sd1
root on sd1a dumps on sd1b
...

> Did you modify or replace the 50 pin internal SCSI cable, or are you
> using the original 7500 cable?  What are you using for an external
> cable?

I'm using the original cable.  I have no external SCSI cable plugged in.

> And what processor did you replace the original 601 processor with?

Newer Tech 300 MHz G3 card.

I've had more than my share of SCSI problems in the past, but I have _never_
seen a SCSI cabling or termination issue that was absolutely rock-solid with
one kernel version and _repeatably and completely non-functional_ with
another.  Intermittent problems that got worse, yes, but not like this.

Something specific _must_ have changed in the code sometime after the
20000205-current snapshot to cause this problem, but I'm damned if I can
figure out what it was.

Sorry if I sound frustrated, but I've posted about this problem a number of
times since MARCH (that's 7 MONTHS now) and haven't so much as heard one of
the core developers acknowledge the problem.

I made some attempts to troubleshoot it myself back then, but the complexity
of the SCSI code and my complete ignorance of its workings meant that I
didn't get very far.  What little I did find out is here:

http://mail-index.netbsd.org/port-macppc/2000/03/14/0004.html
http://mail-index.netbsd.org/port-macppc/2000/03/15/0002.html

If I understood what I was seeing at the time (which is by no means a
certainty), it appeared that the data buffer containing the probe responses
was coming back all 0's.  This made me wonder if perhaps the problem was not
in SCSI at all, but I was never able to get much further than speculation.
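
As a rough illustration of why an all-zero probe buffer would fit the
"<, , > SCSI0 0/direct fixed" lines above, here is a Python sketch that
decodes the fixed fields of a standard SCSI INQUIRY response. The field
offsets are the standard INQUIRY layout; everything else is an assumption:
decode_inquiry and describe are hypothetical helpers, the output format is
just modeled on the dmesg lines in this message (not taken from the kernel's
print code), and the "good" buffer is hand-built rather than captured from a
drive.

```python
DEVICE_TYPES = {0: "direct", 5: "cdrom"}  # only the types seen in this message


def decode_inquiry(buf):
    """Decode the first 36 bytes of a standard SCSI INQUIRY response."""
    if len(buf) < 36:
        raise ValueError("standard INQUIRY data is at least 36 bytes")

    def text(raw):
        # ID fields are space-padded ASCII; also drop NULs from an empty buffer
        return raw.decode("ascii", errors="replace").strip(" \x00")

    dtype = buf[0] & 0x1F                      # peripheral device type
    return {
        "type": dtype,
        "type_name": DEVICE_TYPES.get(dtype, "unknown"),
        "removable": bool(buf[1] & 0x80),      # RMB bit
        "version": buf[2] & 0x07,              # ANSI version
        "vendor": text(buf[8:16]),
        "product": text(buf[16:32]),
        "revision": text(buf[32:36]),
    }


def describe(info):
    """Format roughly like the probe lines quoted in this message."""
    return "<%s, %s, %s> SCSI%d %d/%s %s" % (
        info["vendor"], info["product"], info["revision"],
        info["version"], info["type"], info["type_name"],
        "removable" if info["removable"] else "fixed",
    )


# Hand-built buffer resembling the IBM drive's identity (not captured data).
good = (bytes([0x00, 0x00, 0x02, 0x02, 31, 0, 0, 0])
        + b"IBM".ljust(8) + b"DDRS-39130".ljust(16) + b"S97B")
zeros = bytes(36)  # what a probe buffer full of 0's would look like

print(describe(decode_inquiry(good)))
print(describe(decode_inquiry(zeros)))
```

The zeroed buffer decodes to empty ID strings, version 0, device type 0 and
"fixed", which is the same shape as the sd0/sd1/sd2 lines from the failing
kernel. That is consistent with the probe data arriving as zeros, but it
says nothing about where the zeros come from.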

Michael Wolfson's recent reports of all three SCSI busses in his 7500
failing in spectacular ways would also seem to point towards this being a
7500 problem as opposed to a MESH problem.  Hmm...

> Sorry to sound so intrusive, but I don't know that the macppc netbsd
> port has had experience with lightly loaded multiple SCSI busses,
> where SCSI address 0 might not be filled.

Please, be as intrusive as you like.  I'll give you any configuration
information you want.  I'll try test kernels.  I'll apply source patches.
I'll give you a shell account on my personal machine if it will help the
problem get fixed. 

Why do I care?

I'm a part-time sysadmin and full-time programmer for a software development
company of about 40 people.  I've got 1.4.2 deployed on a 9600 acting as our
primary mail and DNS server (as well as internal web, CVS, and other
miscellaneous tasks), and I'd just love to upgrade it to a kernel that can
figure out its root device so that it can reboot unattended after a power
failure.  (It appears that -current learned this trick sometime between
1.4.2 and the 20000205 snapshot.)  The first step in doing this is to get a
release kernel that can boot my 7500 crashbox.

BTW, the mail server had an uptime of 122 days before we shut it down
yesterday to move it to our newly constructed server room.  You gotta love
that...

-- monroe
------------------------------------------------------------------------
Monroe Williams                                         monroe@pobox.com