Subject: Re: Testing of sgivol etc.
To: Havard Eidnes <he@netbsd.org>
From: Scott G. Akmentins-Taylor <staylor@mrynet.com>
List: port-sgimips
Date: 11/10/2001 17:15:27

Hi, Håvard

I ran into this as well.  What I've found is that using
"sgivol -i" is a Very Bad thing.  With the disklabel mods made
to the kernel and libutil, it is completely unnecessary.

I just do the following for initialisation:

	# disklabel -i sd1
	  (create the disklabel.  it will write both the BSD and
	   the SGI disklabels)
	# ./sgivol -w boot boot sd1

After this, the drive is completely writable, and bootable once
it has been populated.

I have ALSO discovered that it is nearly impossible to repair the
drive once ONLY the SGI disklabel has been written -- i.e. when
no BSD disklabel exists.  I had to hack together a kernel
that would not try to use the SGI disklabel if no BSD label was
found (the alternative is to boot a pre-disklabel-mod kernel and
use dd(1) to overwrite the disk blocks from /dev/zero).

Let me know if this isn't clear, or if I can help you get around
the problem.

Please let me know if you have further insight or better means
for dealing with this.

-scott

> Hi,
> 
> I just took some time to install a new disk on my SGI Indigo2,
> together with some more memory.  I upgraded to today's kernel, and
> decided to test the new "sgivol" utility, and the two first steps
> worked as advertized:
> 
>    viola# ./sgivol sd1
>    No SGI volumn header found, magic=6c6c6c6c
>    viola# ./sgivol -i sd1
>    disklabel shows 35843670 sectors
>    checksum: 00000000
>    root part: 0
>    swap part: 1
>    bootfile:
> 
>    Volume header files:
> 
>    SGI partitions:
>     0:a blocks 35840535 first      3135 type  7 (EFS)
>     8:i blocks     3135 first         0 type  0 (Volume Header)
>    10:k blocks 35843670 first         0 type  6 (Volume)
> 
>    Do you want to update volume (y/n)? y
>    viola#
> 
> However, this one didn't:
> 
>    viola# ./sgivol sd1
> 
> On the console appeared:
> 
>    sd1: no disk label
>    sd1: no disk label
>    Stopped in pid 1245 (sgivol) at 0x8811bd4c:     mfhi    v0
>    db> 
> 
> At this point the machine is still running diskless, of course.  Some
> digging resulted in identification of where it crashed; here's the DDB
> trace with the subroutines identified by running gdb on the image
> afterwards:
> 
> db> trace
> 8811bc68+e4 (200,5072df,a12,8ba75fd8) ra 880211a4 sz 32
>   sdstrategy
> 	0x8811bd4c <sdstrategy+228>:    mfhi    $v0
> 88020f40+264 (8811bc68,5072df,a12,100000) ra 8811c408 sz 80
>   physio
> 8811c3d8+30 (8811bc68,d2d27e78,a12,100000) ra 88069ed4 sz 32
>   sdread
> 88069dac+128 (8811bc68,d2d27e78,a12,100000) ra 880bcb20 sz 96
>   spec_read
> 880bcaa4+7c (8811bc68,d2d27e78,a12,100000) ra 88060910 sz 24
>   nfsspec_read
> 88060804+10c (8811bc68,d2d27e78,d2d27e78,100000) ra 88036f44 sz 64
>   vn_read
> 88036e80+c4 (8811bc68,d2d27e78,d2d27e78,100003e0) ra 88036e64 sz 96
>   dofileread
> 88036dc0+a4 (8811bc68,d2d27e78,d2d27e78,100003e0) ra 880f9b74 sz 56
>   sys_read
> 880f9964+210 (8811bc68,d2d27e78,d2d27e78,100003e0) ra 8800305c sz 80
>   syscall_plain
> mips3_SystemCall+b0 (8811bc68,d2d27e78,d2d27e78,100003e0) ra 3010c080 sz 0
> PC 0x3010c080: not in kernel space
> 0+3010c080 (8811bc68,d2d27e78,d2d27e78,100003e0) ra 0 sz 0
> User-level: pid 1245
> db> 
> 
> The offending line of code appears to be
> 
>         if (lp->d_secsize == DEV_BSIZE) {
>                 sector_aligned = (bp->b_bcount & (DEV_BSIZE - 1)) == 0;
>         } else {
> >>>             sector_aligned = (bp->b_bcount % lp->d_secsize) == 0;  <<<
>         }
> 
> I *think* lp->d_secsize is either initialized to 0 or read from the
> disk.
> 
> The section of code for the marked line above appears to be:
> 
> 0x8811bd34 <sdstrategy+204>:    lw      $a0,48($s0)
> 0x8811bd38 <sdstrategy+208>:    nop
> 0x8811bd3c <sdstrategy+212>:    divu    $zero,$a0,$v1
> 0x8811bd40 <sdstrategy+216>:    bnez    $v1,0x8811bd4c <sdstrategy+228>
> 0x8811bd44 <sdstrategy+220>:    nop
> 0x8811bd48 <sdstrategy+224>:    break   0x7
> 0x8811bd4c <sdstrategy+228>:    mfhi    $v0
> 
> and sure enough, v1 is zero:
> 
> db> show reg
> ...
> v1                   0
> a0               0x200
> ...
> 
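The register dump above shows a zero divisor reaching the divu.  As a
rough illustration only (this is not the actual sdstrategy() code, and
the names toy_sector_aligned and TOY_DEV_BSIZE are made up for the
example), the alignment test could refuse a zero sector size before it
is ever used as a divisor:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: stand-in for the kernel's DEV_BSIZE. */
#define TOY_DEV_BSIZE 512u

/*
 * Hypothetical helper, not the real driver code.
 * Returns true only when secsize is non-zero and bcount is a
 * multiple of it.  A zero secsize (e.g. from an uninitialised
 * label) is reported as "not aligned" instead of being divided by.
 */
bool
toy_sector_aligned(uint32_t bcount, uint32_t secsize)
{
	if (secsize == 0)
		return false;	/* refuse; never divide by zero */
	if (secsize == TOY_DEV_BSIZE)
		return (bcount & (TOY_DEV_BSIZE - 1)) == 0;
	return (bcount % secsize) == 0;
}
```

With the values from the dump (a transfer count of 0x200 and a sector
size of 0), this sketch returns false, so the caller could fail the
request rather than trap.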
> 
> I decided that the problem was the missing or uninitialized disk
> label, and after some failed attempts I managed to wedge one in place.
> This could not be done through an operation which would try to read
> the missing disklabel, as that would hit the above problem as well, so
> I ended up modifying a proto-file from one of my other systems and
> doing
> 
> # disklabel -R -r sd1 new-label.sd1
> 
> whereafter the label became sufficiently initialized that I could
> proceed with tuning the contents of the disk label.
> 
> The root cause for the problem may be insufficient provision of
> default values for the in-core disklabel when the label on the disk is
> missing.
> 
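To make the "default values" idea concrete, here is a minimal sketch
under stated assumptions.  struct toy_label is a cut-down stand-in and
NOT the real struct disklabel, and toy_label_fill_defaults and the
512-byte fallback are hypothetical choices for the example.  It only
shows the shape of a fix-up that leaves no zero in a field that is
later used as a divisor:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for the in-core label; NOT the real struct disklabel. */
struct toy_label {
	uint32_t d_secsize;	/* bytes per sector */
	uint32_t d_secperunit;	/* sectors on the whole unit */
};

/* Assumption: a sane fallback sector size. */
#define TOY_DEFAULT_SECSIZE 512u

/*
 * Hypothetical fix-up for the case where no label could be read:
 * fill in any field that is still zero.
 * Returns 1 if something had to be filled in, 0 otherwise.
 */
int
toy_label_fill_defaults(struct toy_label *lp, uint32_t disk_sectors)
{
	int changed = 0;

	assert(lp != NULL);
	if (lp->d_secsize == 0) {
		lp->d_secsize = TOY_DEFAULT_SECSIZE;
		changed = 1;
	}
	if (lp->d_secperunit == 0) {
		lp->d_secperunit = disk_sectors;
		changed = 1;
	}
	return changed;
}
```

Values already present in the label are left alone, so a label that
was read successfully from the disk would pass through unchanged.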
> 
> Regards,
> 
> - Håvard