Re: RAIDframe trouble lately?

To: current-users%netbsd.org@localhost
Subject: Re: RAIDframe trouble lately?
From: Tom Ivar Helbekkmo <tih%hamartun.priv.no@localhost>
Date: Thu, 20 Mar 2008 14:52:34 +0100

Sarton O'Brien <bsd-xen%roguewrt.org@localhost> writes:

> You don't mention if the heavy disk activity is after you are
> successfully logged into the system and whether you are able to check
> dmesg or any logs.

It could happen any time, and if I don't do anything that incurs heavy
disk use, the system can stay up for a long time.  The surest way to
kill it seems to be to read or write large amounts of data on the RAID 5
set: I can cause the hang at will by copying large files onto it
(especially if I run two copies in parallel), or by letting Bacula run
a full backup of it over the net.
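
For anyone who wants to try the same load pattern, here is a minimal
sketch of the "two copies in parallel" case.  It is not what I actually
ran (that was plain file copies); the function names, file names and
sizes are made up for illustration, and the target directory is assumed
to live on the RAID 5 set.

```python
import os
import threading


def write_file(path, total_bytes, chunk_bytes=1 << 20):
    """Write total_bytes of zeros to path in chunks, then force them to disk."""
    chunk = b"\0" * chunk_bytes
    remaining = total_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            n = min(remaining, chunk_bytes)
            f.write(chunk[:n])
            remaining -= n
        f.flush()
        os.fsync(f.fileno())  # make sure the data really reaches the disks


def parallel_write(directory, writers=2, total_bytes=1 << 30):
    """Run `writers` threads at once, each writing its own large file.

    The file names (stress.0, stress.1, ...) are hypothetical.
    Returns the list of paths written.
    """
    paths = [os.path.join(directory, "stress.%d" % i) for i in range(writers)]
    threads = [
        threading.Thread(target=write_file, args=(p, total_bytes))
        for p in paths
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return paths
```

Calling something like parallel_write("/u/tmp") would put two 1 GB
writers on a raid4 file system at the same time; the directory is only
an example, and the files need to be removed by hand afterwards.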

> It also might be worth mentioning what types of filesystems you are
> using. Also any distributed scheme like nfs, nis ...  might be useful
> to know about.

Filesystem        Size       Used      Avail %Cap Mounted on
/dev/raid0a       247M       132M       103M  56% /
/dev/raid0e       2.0G       513M       1.4G  26% /var
/dev/raid2e       4.9G       2.2G       2.5G  46% /usr
kernfs            1.0K       1.0K         0B 100% /kern
procfs            4.0K       4.0K         0B 100% /proc
ptyfs             1.0K       1.0K         0B 100% /dev/pts
portal:185        1.0K       1.0K         0B 100% /p
mfs:182           248M       2.0K       236M   0% /tmp
/dev/raid3e       8.3G       988M       7.0G  12% /var/pgsql/data
/dev/raid4e        17G       6.9G       9.0G  43% /usr/local
/dev/raid4f        17G       9.0G       6.9G  56% /u
/dev/sd0e          67G        54G       9.8G  84% /store

All file systems are ffs.  Swap is on raid1.  raid0 through raid3 are
mirror pairs, while raid4 is a RAID 5 set of five disks:

raid0: RAID Level 1
raid0: Components: /dev/sd1a /dev/sd2a
raid0: Total Sectors: 4718592 (2304 MB)
raid1: RAID Level 1
raid1: Components: /dev/sd1e /dev/sd2e
raid1: Total Sectors: 2578176 (1258 MB)
raid2: RAID Level 1
raid2: Components: /dev/sd1f /dev/sd2f
raid2: Total Sectors: 10485760 (5120 MB)
raid3: RAID Level 1
raid3: Components: /dev/sd3a /dev/sd4a
raid3: Total Sectors: 17782656 (8682 MB)
raid4: RAID Level 5
raid4: Components: /dev/sd5a /dev/sd6a /dev/sd7a /dev/sd8a /dev/sd9a
raid4: Total Sectors: 71130624 (34731 MB)
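
As a sanity check on those numbers, here is a small sketch of the
arithmetic.  It assumes 512-byte sectors (as reported for the disks
below) and that the MB figures are sector counts converted to units of
1024*1024 bytes and truncated; that is inferred from the output above,
not taken from the RAIDframe source.  The helper names are my own.

```python
SECTOR_BYTES = 512  # per the "512 bytes/sect" lines in dmesg


def sectors_to_mb(sectors):
    """Convert a sector count to whole MB (1024*1024 bytes), truncating."""
    return sectors * SECTOR_BYTES // (1024 * 1024)


def raid5_sectors_per_component(total_sectors, n_components):
    """RAID 5 keeps one component's worth of parity, so the usable total
    is spread over (n_components - 1) components' worth of data."""
    return total_sectors // (n_components - 1)


# Total sectors as reported for each set above.
totals = {
    "raid0": 4718592,
    "raid1": 2578176,
    "raid2": 10485760,
    "raid3": 17782656,
    "raid4": 71130624,
}

for name, sectors in sorted(totals.items()):
    print(name, sectors_to_mb(sectors), "MB")

# Data sectors contributed by each of the five raid4 components.
print(raid5_sectors_per_component(totals["raid4"], 5))
```

With five components, raid4 holds four components' worth of data, which
works out to 17782656 sectors per component, the same as the total size
of the raid3 mirror.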

The underlying disks are spread over several SCSI controllers:

ahc0 at pci0 dev 6 function 0: Adaptec aic7890/91 Ultra2 SCSI adapter
ahc0: interrupting at ioapic0 pin 19 (irq 5)
ahc0: aic7890/91: Ultra2 Wide Channel A, SCSI Id=7, 32/253 SCBs
scsibus0 at ahc0: 16 targets, 8 luns per target
ahc1 at pci0 dev 9 function 0: Adaptec 2940 Ultra SCSI adapter
ahc1: interrupting at ioapic0 pin 19 (irq 5)
ahc1: aic7880: Ultra Wide Channel A, SCSI Id=7, 16/253 SCBs
scsibus1 at ahc1: 16 targets, 8 luns per target
ahc2 at pci0 dev 10 function 0: Adaptec 2940 Ultra SCSI adapter
ahc2: interrupting at ioapic0 pin 18 (irq 12)
ahc2: aic7880: Ultra Wide Channel A, SCSI Id=7, 16/253 SCBs
scsibus2 at ahc2: 16 targets, 8 luns per target
ahc3 at pci0 dev 11 function 0: Adaptec 2940 Ultra SCSI adapter
ahc3: interrupting at ioapic0 pin 17 (irq 10)
ahc3: aic7880: Ultra Wide Channel A, SCSI Id=7, 16/253 SCBs
scsibus3 at ahc3: 16 targets, 8 luns per target
[...]
sd0 at scsibus0 target 0 lun 0: <SEAGATE, ST373405LW, 0001> disk fixed
sd0: 70007 MB, 29550 cyl, 8 head, 606 sec, 512 bytes/sect x 143374741 sectors
sd0: sync (25.00ns offset 63), 16-bit (80.000MB/s) transfers, tagged queueing
sd1 at scsibus1 target 0 lun 0: <IBM, DNES-309170W, SAH0> disk fixed
sd1: 8748 MB, 11474 cyl, 5 head, 312 sec, 512 bytes/sect x 17916240 sectors
sd1: sync (100.00ns offset 8), 16-bit (20.000MB/s) transfers, tagged queueing
sd2 at scsibus1 target 1 lun 0: <IBM, DNES-309170W, SAH0> disk fixed
sd2: 8748 MB, 11474 cyl, 5 head, 312 sec, 512 bytes/sect x 17916240 sectors
sd2: sync (100.00ns offset 8), 16-bit (20.000MB/s) transfers, tagged queueing
sd3 at scsibus1 target 2 lun 0: <IBM, DMVS09V, 0100> disk fixed
sd3: 8748 MB, 11727 cyl, 5 head, 305 sec, 512 bytes/sect x 17916240 sectors
sd3: sync (100.00ns offset 8), 16-bit (20.000MB/s) transfers, tagged queueing
sd4 at scsibus1 target 3 lun 0: <IBM, DMVS09V, 0250> disk fixed
sd4: 8748 MB, 11727 cyl, 5 head, 305 sec, 512 bytes/sect x 17916240 sectors
sd4: sync (100.00ns offset 8), 16-bit (20.000MB/s) transfers, tagged queueing
sd5 at scsibus2 target 0 lun 0: <SEAGATE, ST39102LW, 0005> disk fixed
sd5: 8683 MB, 6962 cyl, 12 head, 212 sec, 512 bytes/sect x 17783240 sectors
sd5: sync (100.00ns offset 8), 16-bit (20.000MB/s) transfers, tagged queueing
sd6 at scsibus2 target 1 lun 0: <SEAGATE, ST39173W, 5958> disk fixed
sd6: 8683 MB, 7501 cyl, 10 head, 237 sec, 512 bytes/sect x 17783240 sectors
sd6: sync (100.00ns offset 8), 16-bit (20.000MB/s) transfers, tagged queueing
sd7 at scsibus2 target 2 lun 0: <SEAGATE, ST39102LW, 0005> disk fixed
sd7: 8683 MB, 6962 cyl, 12 head, 212 sec, 512 bytes/sect x 17783240 sectors
sd7: sync (100.00ns offset 8), 16-bit (20.000MB/s) transfers, tagged queueing
sd8 at scsibus2 target 3 lun 0: <SEAGATE, ST39173W, 4290> disk fixed
sd8: 8683 MB, 7501 cyl, 10 head, 237 sec, 512 bytes/sect x 17783240 sectors
sd8: sync (100.00ns offset 8), 16-bit (20.000MB/s) transfers, tagged queueing
sd9 at scsibus2 target 4 lun 0: <SEAGATE, ST39173LW, 6246> disk fixed
sd9: 8683 MB, 7501 cyl, 10 head, 237 sec, 512 bytes/sect x 17783240 sectors
sd9: sync (100.00ns offset 8), 16-bit (20.000MB/s) transfers, tagged queueing
st0 at scsibus3 target 5 lun 0: <Quantum, DLT4000, D991> tape removable
st0: density code 25, variable blocks, write-enabled
st0: sync (100.00ns offset 15), 8-bit (10.000MB/s) transfers
st1 at scsibus3 target 6 lun 0: <SONY, SDX-300C, 0400> tape removable
st1: density code 48, 512-byte blocks, write-enabled
st1: sync (100.00ns offset 8), 16-bit (20.000MB/s) transfers

> I'd suggest finding a way of triggering the problem and attach to
> any relevant log to see if you can catch something before the system
> becomes unusable ... assuming any possible info is already being
> synced/logged.

Yup - except nothing is logged, anywhere.  It just freezes.  :(

-tih
-- 
Self documenting code isn't. User application constraints don't. --Ed Prochak
