Subject: Hard drive problems - possibly tied to presence of PCI NIC?
To: None <port-sgimips@netbsd.org>
From: Tillman Hodgson <tillman@seekingfire.com>
List: port-sgimips
Date: 03/26/2004 11:46:51
Howdy,

A friend and I (Hi Brad!) have identical O2s with identical hard
drives (I bought a lot of 3 IBM 76H5817 4.51GB SCA SCSI drives off
ebay.ca). We're both running into reproducible hard drive problems and
Brad came up with an unlikely-seeming possibility through
guess-and-testing. Since Brad isn't subscribed to this list I've cc'ed
him.

This is a long-ish email because I've included some detailed information
on the trouble-shooting steps we've taken. The short version is that we
believe that the PCI NIC is causing problems with the SCSI bus.


=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
First, some details:

* We both installed off of the Jan 05/04 install image
* We're both running IBM 76H5817 4.51GB SCA SCSI drives in the drive
  bay closest to the system board
* We're both running 3Com 905B-TX PCI NICs
* We're both encountering the same two varieties of hard drive problems

* For reference, I'm running the GENERIC kernel (named "random") that was
  recently posted to this mailing list. Brad was running the install
  kernel from Jan 5/04. The relevant portion of my dmesg looks like this
  (Brad's should be the same):

  pci_addr_fixup: 000:03:0 0x10b7 0x9055 new address 0x00003000 (size 0x80)
  pci_addr_fixup: 000:03:0 0x10b7 0x9055 new address 0x80300000 (size 0x80)
  mace: established interrupt 7 (level 0)
  pci0 at macepci0 bus 0
  pci0: i/o space, memory space enabled, rd/line, rd/mult, wr/inv ok
  ahc0 at pci0 dev 1 function 0: Adaptec aic7880 Ultra SCSI adapter
  ahc0: interrupting at crime interrupt 8
  ahc0: Using left over BIOS settings
  ahc0: aic7880: Wide Channel A, SCSI Id=0, 16/253 SCBs
  scsibus0 at ahc0: 16 targets, 8 luns per target
  ahc1 at pci0 dev 2 function 0: Adaptec aic7880 Ultra SCSI adapter
  ahc1: interrupting at crime interrupt 9
  ahc1: Using left over BIOS settings
  ahc1: aic7880: Wide Channel A, SCSI Id=0, 16/253 SCBs
  scsibus1 at ahc1: 16 targets, 8 luns per target
  ex0 at pci0 dev 3 function 0: 3Com 3c905B-TX 10/100 Ethernet (rev. 0x0)
  ex0: interrupting at crime interrupt 10
  ex0: MAC address 00:10:4b:69:2a:86
  exphy0 at ex0 phy 24: 3Com internal media interface
  exphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
  biomask 07 netmask 07 ttymask 07 clockmask 87


=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Next, the two hard drive problems:

* The hard drive isn't even detected by ARCS. 'hinv' doesn't show it
  and it doesn't spin up. Here are Brad's comments from IRC last night:

  22:05:54 <@Smitty> > hinv
  22:05:54 <@Smitty>                    System: IP32
  22:05:54 <@Smitty>                 Processor: 200 Mhz R5000, with FPU
  22:05:54 <@Smitty>      Primary I-cache size: 32 Kbytes
  22:05:54 <@Smitty>      Primary D-cache size: 32 Kbytes
  22:05:56 <@Smitty>      Secondary cache size: 1024 Kbytes
  22:05:59 <@Smitty>               Memory size: 320 Mbytes
  22:06:01 <@Smitty>                  Graphics: CRM, Rev C
  22:06:04 <@Smitty>                     Audio: A3 version 1
  22:06:07 <@Smitty> Still no drive showing.

  I only encountered this problem twice, back when I was trying to get
  things installed. It seemed to mysteriously clear itself up both
  times. In Brad's case, once it started it never cleared up. He tried
  re-seating it several times.

* The hard drive gives errors constantly. My dmesg contains a lot
  of lines like this:

  sd0(ahc0:0:1:0):  Check Condition on CDB: 0x28 00 00 61 1a 50 00 00 02 00
      SENSE KEY:  Media Error
     INFO FIELD:  6363728
       ASC/ASCQ:  Uncorrected Read Error - Recommend Reassignment
           SKSV:  Actual Retry Count: 47

  Naturally, this leads to fsck problems. fsck can't fix them -- even
  when I get a clean filesystem, if I re-run fsck it'll instantly have
  a bunch of new problems.
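
  As a cross-check on that sense data, here is a minimal sketch that
  decodes the CDB bytes from the error above. It assumes the standard
  SCSI READ(10) layout (opcode 0x28, a 4-byte big-endian LBA in bytes
  2-5, a 2-byte transfer length in bytes 7-8); decode_read10 is a
  hypothetical helper name, not part of any NetBSD tool.

```python
def decode_read10(cdb_hex: str) -> dict:
    """Decode a SCSI READ(10) CDB given as space-separated hex bytes."""
    # int(..., 16) accepts both "0x28" and "28" style tokens.
    cdb = bytes(int(tok, 16) for tok in cdb_hex.split())
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a 10-byte READ(10) CDB")
    return {
        "opcode": cdb[0],
        # Bytes 2-5: logical block address, big-endian.
        "lba": int.from_bytes(cdb[2:6], "big"),
        # Bytes 7-8: number of blocks requested, big-endian.
        "blocks": int.from_bytes(cdb[7:9], "big"),
    }


# CDB copied from the dmesg line above.
result = decode_read10("0x28 00 00 61 1a 50 00 00 02 00")
print(result["lba"], result["blocks"])  # 6363728 2
```

  The decoded LBA equals the INFO FIELD value (6363728), so the read
  failed on the first of the two blocks requested. That by itself
  doesn't say whether the drive or the bus is at fault.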


=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
A timeline of events while troubleshooting:

Last night Brad decided that he was going to attempt to install the Jan
5/04 image on his O2 based on my success at doing so (despite my hard
drive problems, the operating system more-or-less works, which is
great).

After newfs hung during install (back when ARCS /did/ see the drive) the
drive stopped showing up in 'hinv' in ARCS (as shown above). As I was
also getting drive errors, our thought at the time was that I had bought
a bad batch of drives. That stumped Brad's install for a while but then
he came up with the idea of swapping drives with his sparcstation (which
had a couple of 2G SCA drives).

Before swapping drives, though, Brad tried moving his existing drive to
the other drive bay on the O2. That didn't help, the drive was still not
detected.

First he put one of the two 2G drives from the sparcstation into the O2
and it wasn't detected.  Moved it to the second slot. No go. Took the
termination off (just to rule this out) and still nothing.

Then he put the 4.5G drive from the O2 (the one that wasn't being
detected) into the sparcstation. It was detected there just fine and he
installed NetBSD 1.6.2 onto it.

While that install ran on the sparcstation, he put the other 2GB drive
from the sparcstation into the O2, which showed the same "no detection"
problem.

Back on the sparcstation the 4.5G drive from the O2 is still running
great and has had no problems detected by fsck (though we've tried
several times) and reports no SCSI errors in dmesg.

For comparison, here's what a fsck looks like on my O2 with one of the
4.5G IBM drives:

 [root@lapislazuli ~]# fsck -f /
 ** /dev/rsd0a (NO WRITE)
 ** Last Mounted on /
 ** Root file system
 ** Phase 1 - Check Blocks and Sizes
 ** Phase 2 - Check Pathnames
 CANNOT READ: BLK 6360830
 CONTINUE? [yn] y
 THE FOLLOWING DISK SECTORS COULD NOT BE READ: 6360830,
 BAD INODE NUMBER FOR '.'  I=770145  OWNER=root MODE=40755
 SIZE=512 MTIME=Feb  9 02:39 2004
 DIR=?
 FIX? no
 ? IS AN EXTRANEOUS HARD LINK TO DIRECTORY
 /usr/X11R6/share/locale/tg/LC_MESSAGES
 REMOVE? no
 CANNOT READ: BLK 6360830
 CONTINUE? [yn] y
 THE FOLLOWING DISK SECTORS COULD NOT BE READ: 6360830,
 BAD INODE NUMBER FOR '..'  I=770145  OWNER=root MODE=40755
 SIZE=512 MTIME=Feb  9 02:39 2004
 DIR=/usr/pkgsrc.old/www/firefox/work/mozilla/extensions/xmlterm/build
 FIX? no
 BAD INODE NUMBER FOR '..'  I=777216  OWNER=root MODE=40755
 SIZE=512 MTIME=Mar 20 11:08 2004
 CANNOT READ: BLK 6360830
 CONTINUE? [yn] y
 THE FOLLOWING DISK SECTORS COULD NOT BE READ: 6360830,
 DIR=?
 FIX? no
 ** Phase 3 - Check Connectivity
 ** Phase 4 - Check Reference Counts
 LINK COUNT DIR I=709698  OWNER=root MODE=40755
 SIZE=1536 MTIME=Mar 20 11:08 2004  COUNT 95 SHOULD BE 96
 ADJUST? no
 LINK COUNT DIR I=770105  OWNER=root MODE=40755
 SIZE=512 MTIME=Mar 22 07:34 2004  COUNT 12 SHOULD BE 11
 ADJUST? no
 LINK COUNT DIR I=777216  OWNER=root MODE=40755
 SIZE=512 MTIME=Mar 20 11:08 2004  COUNT 2 SHOULD BE 3
 ADJUST? no
 ** Phase 5 - Check Cyl groups
 SUMMARY INFORMATION BAD
 SALVAGE? no
 BLK(S) MISSING IN BIT MAPS
 SALVAGE? no
 230467 files, 1664244 used, 2256009 free (11401 frags, 280576 blocks,
 0.3% fragmentation)

Note that this was done from a mounted file system. When I do it from
single-user mode it attempts to repair the information, but a fresh fsck
performed right afterwards finds more problems, ad nauseam.

So we now had a "bad" drive that was miraculously "fixed" when running
in the sparcstation ... odd.

 23:07:49 <@Smitty> Now I am wondering about the card in these SGI's
 23:08:08 <@Smitty> Wonder if there is a conflict with the added nic.
 23:17:52 <@Smitty> Taking the 3com out just for giggles... :)

I was skeptical, but it seemed easy enough to test.

 23:32:16 <@Smitty> Well, it worked.
 23:32:25 <@Smitty> Installed the FS just fine.
 23:32:32 <@Smitty> Course now it can't see the network...

So it looks like /somehow/ the PCI NIC is interfering with the SCSI bus,
which seems really surprising to me (though I don't know SGI
architecture very well).

The next step in trouble-shooting is for Brad to put the 3Com NIC back
in (to see if it was just a seating issue) and for one of us to try a
different brand of NIC (to see if it's only a 3c905 issue).


=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Help please :-)

Is there a possibility that the PCI slot is causing problems with the
SCSI bus? If so, would it be related to this specific type of NIC or a
more general problem? What kinds of NICs are other folks running
successfully?

And, to avoid the whole issue, what's the status of getting the onboard
NIC working? We're both willing to volunteer time and our O2s as test
boxes to the task :-)

Thanks muchly for your time and assistance. SGI workstations are
beautiful things, and we're having a blast with ours.

- Tillman
  (and Brad in spirit)


-- 
I managed to out-cool even the disgustingly cool people normally found
at the cafe I went to, without trying. I'm assuming it was the IETF
draft I was reading, because nothing else really accounts for it.
    - A.S.R. quote (Kirrily 'Skud' Robert)