Subject: Re: Boot failures with MESH SCSI bus
To: Monroe Williams <monroe@pobox.com>
From: M L Riechers <mlr@rse.com>
List: port-macppc
Date: 10/19/2000 18:42:05
> BTW, the mail server had an uptime of 122 days before we shut it down
> yesterday to move it to our newly constructed server room.  You gotta love
> that...
> 

Yes, I do.

> ....I'll give you a shell account on my personal machine if it will help the
> problem get fixed. 

Yes, that would probably be expeditious.  Mail me your machine name,
account name (mlr would be good, if you don't mind), and password, by
private e-mail to mlr@rse.com, and I promise I'll change the password
ASAP.

> Why do I care?
> 
> ... and I'd just love to upgrade it to a kernel that can
> figure out its root device so that it can reboot unattended after a power
> failure....

I feel your pain(tm).

For my part, if we could just get mozilla running on macppc, I think
we could confidently say that anything (tm) would run on macppc, and
bless the macppc as a true production machine (X11/server/frame buffer
problems notwithstanding).  However, there seems to be at least one
unresolved (elf/shared lib/gcc) issue in the way, and it's extremely
frustrating.

However, that isn't your problem at the moment, and I think your
problem is important.  We've got to get more people on all flavors of
macppc, and showstoppers like this don't help.  I admit that when I
read the reported mesh failure problem back last spring, I was
skeptical, thinking that this has to be some concomitant hardware
change. But too many people have reported getting the same problem by
merely upgrading. So why the darn blazes does the exact same kernel
work fine for me, but not for you?

Discounting some magic in OF, the difference _has_ to lie in hardware
somewhere, and there is somewhere a NetBSD kernel change that can't
handle that difference.  So what's different?

The things that leap out are:

    you have a Newer Tech 300 MHz G3 card, I have a 604 (Revision 303).

    you have an IBM, DDRS-39130, S97B drive at SCSI target 0,
      I have a WDIGTL, WDE9100, drive at target 0, and there are _no_
      terminators on this drive.  Both claim to be SCSI-2.

    you have a QUANTUM, FIREBALL_TM1280S drive at SCSI target 1,
      claiming to be SCSI-2, which I assume was original with your
      7500;  I have no target 1.

    you have a MATSHITA, CD-ROM CR-8005 at SCSI target 3, which I
      assume was original with your 7500; I have a SONY, CD-ROM
      CDU-8005 at SCSI target 3, which was original; both claim to be
      SCSI-2.

    you're using the original SCSI cable;  I've replaced mine, it's
      much longer (and it's to spec, I might add).

    I'm presuming that the physical order of devices on your SCSI bus
      is: mesh, CD-ROM, IBM DDRS-39130, Fireball;  ours is: mesh,
      CD-ROM, (open connector or two), WDE9100, terminator.

    I'm presuming that (you're hoping that?) your Quantum Fireball, in
      Apple's finest tradition, provides some sort of termination; my
      terminator is a separate, active terminator all by its lonesome
      on the end of the cable, and requires TERMPWR from the mesh end
      of the bus.  (Our Fireball is long gone.  OK, actually, that's a
      lie.  Our Fireball _is_ on the bus, in one of those mid-cable
      SCSI sockets I'm claiming is empty, addressed as target 0, but
      it's powered down.  It's also Mac OS 8.6.)

(You see how suspicious I am of proper termination?  Anyway.)

Now, my guess is that, whatever changed in the kernel to bring you to
grief, it's not in the MI SCSI code;  too many people use that.  I
suspect we'll find it in arch/macppc.

So, last night I did some large-scale diffs between src.2000.01.15 (15
Jan 2000), src.2000.02.12 (12 Feb 2000), and src.2000.08.27 (27 August
2000).  (Why those?  Because those are all the past sources I haven't
deleted.)  I was looking for a change along the lines of "look, you
stupid OF (or mesh), turn the silly TERMPWR and terminators on".
That I didn't find.  I did find a slew of changes between
Jan and Feb, and not so many between Feb and Aug.
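
For anyone who wants to repeat that kind of tree-to-tree comparison,
here is a minimal Python sketch of the idea.  It is not the procedure I
actually used; the function name changed_files() is made up for
illustration, and the assumption is that you point it at the same
subdirectory (say, the arch/macppc part) of two unpacked source trees.

```python
import filecmp
import os


def changed_files(old_root, new_root):
    """List (status, relative_path) pairs for two directory trees.

    status is "added", "removed" or "changed".  This is a hypothetical
    helper for illustration, not part of any NetBSD tooling.
    """
    changed = []

    def walk(rel):
        # dircmp skips RCS, CVS and tags entries by default.
        cmp = filecmp.dircmp(os.path.join(old_root, rel),
                             os.path.join(new_root, rel))
        for name in cmp.left_only:
            changed.append(("removed", os.path.join(rel, name)))
        for name in cmp.right_only:
            changed.append(("added", os.path.join(rel, name)))
        # Compare file contents byte for byte rather than trusting
        # size and timestamp alone.
        _, mismatch, errors = filecmp.cmpfiles(
            cmp.left, cmp.right, cmp.common_files, shallow=False)
        for name in mismatch + errors:
            changed.append(("changed", os.path.join(rel, name)))
        # Recurse into directories present on both sides.
        for name in cmp.common_dirs:
            walk(os.path.join(rel, name))

    walk("")
    return sorted(changed)
```

Calling changed_files() on the macppc subdirectories of src.2000.02.12
and src.2000.08.27 would give a list of files to read by hand; it only
says which files differ, not what the differences are.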

I also came to suspect the processor difference (yours, a Newer Tech
300 MHz G3 card; mine, a 604), because the only solid change that
stood out in a quick scan of the Feb-Aug diffs was a change in the way
SCSI time-outs are implemented.

So, Michael Wolfson, Jake, (anyone else?) what CPU do you have in your
7500s?  In particular, are they G3s or G4s, or 604e's (particularly
the faster kind) rather than my stuck-in-the-mud 604?

Could any of you swap your current processor back for some form of 604
(possibly a 604e) and see whether that works with my 20000827
(netbsd.EASTERN-1.5ALPHA2) kernel?  I'd spring for a G3, but that's
not in the cards right now.

By popular acclaim, the 20000205 kernel works.  That means that the
20000115 kernel I have sources for almost surely works, the 20000212
kernel probably works, and the 20000827 kernel doesn't.  If the
20000212 kernel works for you, then we can ignore the diffs from
20000115 to 20000212.  Monroe, can you test the 20000212 kernel for me
and confirm that it works?  You'll find it at:

ftp://ftp.rse.com/pub/NetBSD/arch/macppc/snapshot/20000212/binary/kernel/netbsd.EASTERN-1.4S

The earlier one, if you care to test it (and I see no point in doing
that right now), is at:

ftp://ftp.rse.com/pub/NetBSD/arch/macppc/snapshot/20000115/binary/kernel/netbsd.EASTERN-1.4P
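
The narrowing-down logic here is simple enough to sketch in a few lines
of Python.  The function name suspect_window() is made up for
illustration, and the sketch assumes that once the breakage went in it
stayed in, so every kernel after the first bad one is also bad.

```python
def suspect_window(results):
    """Return (last_good, first_bad) snapshot dates, or None for a missing end.

    results maps a snapshot date string "YYYYMMDD" to True (boots) or
    False (fails).  YYYYMMDD strings sort in date order, so plain string
    comparison is enough.  Hypothetical helper, for illustration only.
    """
    bad = [d for d in results if not results[d]]
    first_bad = min(bad) if bad else None
    # Only good kernels older than the first failure bound the window.
    good = [d for d in results
            if results[d] and (first_bad is None or d < first_bad)]
    last_good = max(good) if good else None
    return last_good, first_bad


# What is known today: 20000205 boots, 20000827 does not.
today = suspect_window({"20000205": True, "20000827": False})

# If the 20000212 kernel turns out to boot as well, the window shrinks
# and the 20000115-to-20000212 diffs drop out of consideration.
after_test = suspect_window(
    {"20000205": True, "20000212": True, "20000827": False})
```

With the results listed above, the window goes from 20000205 through
20000827 down to 20000212 through 20000827, which is why the 20000212
test is the one worth running first.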

Meanwhile, I'll see if I can find some G3 documentation.

-Mike