Subject: Re: panic while building a raid-1 set one component at a time
To: None <current-users@NetBSD.org>
From: Jeff Rizzo <riz@boogers.sf.ca.us>
List: current-users
Date: 10/05/2003 16:29:05
After noodling on this for a while, it occurs to me that the machine I'm
building this RAID set on before deploying it has only 16MB of memory...
Is that little enough to cause this particular issue?  I know that
raidframe is somewhat memory intensive... If so, is there anything I can
do kernelwise to strip down the rest of the memory needs so I can get this
set built?  It's not going to live here permanently, but I'd
sure like to get it built before moving it to its final destination...
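
As a rough check on the memory question, the hex arguments in the quoted ddb
backtrace can be decoded. The sketch below assumes that the first argument ddb
prints for malloc() is the requested size in bytes, that the second and fourth
arguments of rf_MakeReconMap() are sectors per reconstruction unit and sectors
per disk, and that pointers on i386 are 4 bytes. None of that is stated in the
trace itself.

```python
# Values copied from the quoted ddb backtrace (ddb prints arguments in hex).
MALLOC_SIZE_HEX = "e8e2c4"     # first argument of malloc()
RU_SECTORS_HEX = "80"          # second argument of rf_MakeReconMap()
DISK_SECTORS_HEX = "1d1c5880"  # fourth argument of rf_MakeReconMap()

POINTER_BYTES = 4                      # assumption: 32-bit pointers on i386
PHYSICAL_RAM_BYTES = 16 * 1024 * 1024  # the 16MB mentioned above


def recon_map_request():
    """Decode the trace values and relate the request to physical RAM."""
    requested = int(MALLOC_SIZE_HEX, 16)
    ru_sectors = int(RU_SECTORS_HEX, 16)
    disk_sectors = int(DISK_SECTORS_HEX, 16)
    recon_units = disk_sectors // ru_sectors
    return {
        "requested_bytes": requested,
        "recon_units": recon_units,
        # Assumption: the map holds one pointer per reconstruction unit.
        "pointer_array_bytes": recon_units * POINTER_BYTES,
        "fraction_of_ram": requested / PHYSICAL_RAM_BYTES,
    }


if __name__ == "__main__":
    for key, value in recon_map_request().items():
        print(key, value)
```

Under those assumptions the request is 15262404 bytes, exactly 4 bytes for each
of 3815601 reconstruction units, or about 91% of 16MB in a single kernel
allocation. That would be consistent with the machine simply being too small
for a component of this size, though it is not proof.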

Thanks,
+j

On Sun, Oct 05, 2003 at 12:12:07PM -0700, Jeff Rizzo wrote:
> I've done this before, but not for about a year, so I'm not sure
> if I'm doing something wrong here, or what.  I'm working with a GENERIC
> kernel circa September 28 on i386 (from that day's releng.netbsd.org
> snapshot).
> 
> I've got two identical disks, and constructed half a raid-1 on one (I
> needed the other to bootstrap from sysinst) as it says to do in the
> raidctl man page; it seems to be working fine in degraded mode.
> 
> The two disks are wd1 and wd2;  wd2 is the working component of the raid
> set; I'm trying to add wd1.  I copied the disklabel from wd2 onto wd1,
> did a 'raidctl -a /dev/wd1a raid0', and then when I try to do the
> 'raidctl -F component0 raid0', it panics:
> 
> # raidctl -a /dev/wd1a raid0
> Warning: truncating spare disk /dev/wd1a to 488396928 blocks
> # Oct  5 10:26:49  /netbsd: Warning: truncating spare disk /dev/wd1a to 488396928 blocks
> raidctl -F component0 raid0
> RECON: initiating reconstruction on row 0 col 0 -> spare at row 0 col 2
> raid0: Quiescence reached..
> panic: malloc: out of space in kmem_map
> Stopped in pid 399.1 (raid_recon) at    netbsd:cpu_Debugger+0x4:        leave
> db> 
> 
> Now, I'm wondering about the "Warning: truncating spare disk" message;
> I can't see anything different about the labels of wd1 and wd2, and I
> didn't get that message when I built wd2.
> 
> One interesting point:  I can't seem to change the info on wd2c in the
> disklabel;  it always returns to
> 
>  c:        15         0     unused      0     0        # (Cyl.      0 -      0*)
> 
> no matter how I edit it with "disklabel", even though the edits always seem
> to take.
> 
> Anyway, here's the entire sequence.  I hope there's some clue in here
> somewhere...
> 
> # disklabel wd1
> # /dev/rwd1d:
> type: ESDI
> disk: WDC WD2500JB-32F
> label: fictitious
> flags:
> bytes/sector: 512
> sectors/track: 63
> tracks/cylinder: 16
> sectors/cylinder: 1008
> cylinders: 484521
> total sectors: 488397168
> rpm: 3600
> interleave: 1
> trackskew: 0
> cylinderskew: 0
> headswitch: 0           # microseconds
> track-to-track seek: 0  # microseconds
> drivedata: 0 
> 
> 4 partitions:
> #        size    offset     fstype [fsize bsize cpg/sgs]
>  a: 488397105        63       RAID                     # (Cyl.      0*- 484520)
>  c: 488397105        63     unused      0     0        # (Cyl.      0*- 484520)
>  d: 488397168         0     unused      0     0        # (Cyl.      0 - 484520)
> # disklabel wd2
> # /dev/rwd2d:
> type: ESDI
> disk: WDC WD2500JB-32F
> label: fictitious
> flags:
> bytes/sector: 512
> sectors/track: 63
> tracks/cylinder: 16
> sectors/cylinder: 1008
> cylinders: 484521
> total sectors: 488397168
> rpm: 3600
> interleave: 1
> trackskew: 0
> cylinderskew: 0
> headswitch: 0           # microseconds
> track-to-track seek: 0  # microseconds
> drivedata: 0 
> 
> 4 partitions:
> #        size    offset     fstype [fsize bsize cpg/sgs]
>  a: 488397105        63       RAID                     # (Cyl.      0*- 484520)
>  c:        15         0     unused      0     0        # (Cyl.      0 -      0*)
>  d: 488397168         0     unused      0     0        # (Cyl.      0 - 484520)
> # raidctl -a /dev/wd1a raid0
> Warning: truncating spare disk /dev/wd1a to 488396928 blocks
> # Oct  5 11:07:15  /netbsd: Warning: truncating spare disk /dev/wd1a to 488396928 blocks
> raidctl -s raid0
> Components:
>           component0: failed
>            /dev/wd2a: optimal
> Spares:
>            /dev/wd1a: spare
> component0 status is: failed.  Skipping label.
> Component label for /dev/wd2a:
>    Row: 0, Column: 1, Num Rows: 1, Num Columns: 2
>    Version: 2, Serial Number: 20031005, Mod Counter: 101
>    Clean: No, Status: 0
>    sectPerSU: 128, SUsPerPU: 1, SUsPerRU: 1
>    Queue size: 100, blocksize: 512, numBlocks: 488396928
>    RAID Level: 1
>    Autoconfig: Yes
>    Root partition: Yes
>    Last configured as: raid0
> /dev/wd1a status is: spare.  Skipping label.
> Parity status: DIRTY
> Reconstruction is 100% complete.
> Parity Re-write is 100% complete.
> Copyback is 100% complete.
> # raidctl -F component0 raid0
> RECON: initiating reconstruction on row 0 col 0 -> spare at row 0 col 2
> raid0: Quiescence reached..
> panic: malloc: out of space in kmem_map
> Stopped in pid 398.1 (raid_recon) at    netbsd:cpu_Debugger+0x4:        leave
> db> bt
> cpu_Debugger(0,e8f000,c087c000,0,e8f000) at netbsd:cpu_Debugger+0x4
> panic(c0695840,0,e8f000,0,3a38b1) at netbsd:panic+0x11d
> malloc(e8e2c4,c06cad40,0,0,3a38b1) at netbsd:malloc+0x167
> rf_MakeReconMap(c08d5000,80,0,1d1c5880,0) at netbsd:rf_MakeReconMap+0xc2
> rf_MakeReconControl(c0974900,0,0,0,2) at netbsd:rf_MakeReconControl+0x171
> rf_ContinueReconstructFailedDisk(c0974900,0,2,0,c20ac4e0) at netbsd:rf_ContinueReconstructFailedDisk+0xc1
> rf_ReconstructFailedDiskBasic(c08d5000,0,0,c08d5000,c088fe60) at netbsd:rf_ReconstructFailedDiskBasic+0xb9
> rf_ReconstructFailedDisk(c08d5000,0,0,1,c0100d22) at netbsd:rf_ReconstructFailedDisk+0x60
> rf_FailDisk(c08d5000,0,0,1,c42bd1b8) at netbsd:rf_FailDisk+0xc7
> rf_ReconThread(c0924ec0,7e0000,7e9000,0,c010030c) at netbsd:rf_ReconThread+0x43
> db> 
> db> ps
>  PID           PPID     PGRP        UID S   FLAGS LWPS          COMMAND    WAIT
> >398              0        0          0 2 0x20200    1       raid_recon
>  351            332      351          0 2  0x4002    1          raidctl
>  349              1        1          0 2  0x4000    1            getty nanosle
>  333              1        1          0 2  0x4000    1            getty nanosle
>  343              1        1          0 2  0x4000    1            getty nanosle
>  332              1      332          0 2  0x4003    1              csh   pause
>  337              1      337          0 2       0    1             cron nanosle
>  330              1      330          0 2       0    1            inetd  kqread
>  281              1      281          0 2       0    1             sshd  select
>  171              1      171          0 2       0    1          rpcbind  select
>  150              1      150          0 2       0    1          syslogd
>  120              1      120          0 2       0    1         dhclient  select
>  8                0        0          0 2 0x20200    1         aiodoned aiodone
>  7                0        0          0 2 0x20200    1          ioflush  syncer
>  6                0        0          0 2 0x20200    1           reaper  reaper
>  5                0        0          0 2 0x20200    1       pagedaemon pgdaemo
>  4                0        0          0 2 0x20200    1       lfs_writer lfswrit
>  3                0        0          0 2 0x20200    1          raidio0 raidiow
>  2                0        0          0 2 0x20200    1            raid0 rfwcond
>  1                0        1          0 2  0x4000    1             init    wait
>  0               -1        0          0 2 0x20200    1          swapper
> db> 
> 
> Thanks in advance for any clues anyone can provide...
> 
> +j
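
On the "truncating spare disk" warning in the quoted message: the figure
488396928 is the same numBlocks value that the component label for /dev/wd2a
reports. A minimal sketch of arithmetic that reproduces it from the 488397105
sector partition is below. It assumes RAIDframe sets aside 64 sectors at the
start of each component and rounds the remainder down to a whole number of
stripe units (sectPerSU: 128). The 64 is an assumption, not something shown in
the output above, and usable_blocks() is a hypothetical helper, not a
RAIDframe function.

```python
RESERVED_SECTORS = 64  # assumption: sectors reserved per component


def usable_blocks(partition_sectors, sect_per_su, reserved=RESERVED_SECTORS):
    """Round the space after the reserved area down to whole stripe units."""
    return (partition_sectors - reserved) // sect_per_su * sect_per_su


if __name__ == "__main__":
    # Partition size of wd1a/wd2a and sectPerSU, both from the quoted output.
    print(usable_blocks(488397105, 128))
```

With those inputs the result is 488396928, matching both the warning and the
component label. If the assumption holds, the warning only says that the spare
partition is slightly larger than the space the set actually uses, which would
explain why the two disklabels look identical.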

-- 
Jeff Rizzo                                         http://boogers.sf.ca.us/~riz