Subject: Testing of sgivol etc.
To: None <port-sgimips@netbsd.org>
From: Havard Eidnes <he@netbsd.org>
List: port-sgimips
Date: 11/10/2001 23:58:24
Hi,

I just took some time to install a new disk in my SGI Indigo2,
together with some more memory.  I upgraded to today's kernel and
decided to test the new "sgivol" utility.  The first two steps
worked as advertised:

   viola# ./sgivol sd1
   No SGI volumn header found, magic=6c6c6c6c
   viola# ./sgivol -i sd1
   disklabel shows 35843670 sectors
   checksum: 00000000
   root part: 0
   swap part: 1
   bootfile:

   Volume header files:

   SGI partitions:
    0:a blocks 35840535 first      3135 type  7 (EFS)
    8:i blocks     3135 first         0 type  0 (Volume Header)
   10:k blocks 35843670 first         0 type  6 (Volume)

   Do you want to update volume (y/n)? y
   viola#
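For reference, here is a minimal sketch of the kind of check behind the "No SGI volumn header found" message.  It assumes the IRIX dvh layout: a 512-byte header that starts with the big-endian magic 0x0be5a941, and whose checksum word is chosen so that all 128 big-endian 32-bit words sum to zero.  SGI_VH_MAGIC and sgi_vh_valid() are hypothetical names, not sgivol's actual code.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed values, after the IRIX dvh header; names are hypothetical. */
#define SGI_VH_MAGIC	0x0be5a941u
#define SGI_VH_SIZE	512

/* Read a big-endian 32-bit word from a byte buffer. */
static uint32_t
be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	    ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Hypothetical helper: returns 1 if the sector looks like a valid SGI
 * volume header (magic matches and all words sum to zero), else 0.
 */
static int
sgi_vh_valid(const unsigned char hdr[SGI_VH_SIZE])
{
	uint32_t sum = 0;
	size_t i;

	if (be32(hdr) != SGI_VH_MAGIC)
		return 0;
	for (i = 0; i < SGI_VH_SIZE; i += 4)
		sum += be32(hdr + i);	/* unsigned wraparound is intended */
	return sum == 0;
}
```

A fresh disk full of 0x6c bytes fails the magic test, which matches the magic=6c6c6c6c reported above.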

However, this one didn't:

   viola# ./sgivol sd1

The following appeared on the console:

   sd1: no disk label
   sd1: no disk label
   Stopped in pid 1245 (sgivol) at 0x8811bd4c:     mfhi    v0
   db>

At this point the machine was still running diskless, of course.  Some
digging identified where it crashed; here is the DDB trace, with the
subroutines identified by running gdb on the kernel image afterwards:

db> trace
8811bc68+e4 (200,5072df,a12,8ba75fd8) ra 880211a4 sz 32
  sdstrategy
	0x8811bd4c <sdstrategy+228>:    mfhi    $v0
88020f40+264 (8811bc68,5072df,a12,100000) ra 8811c408 sz 80
  physio
8811c3d8+30 (8811bc68,d2d27e78,a12,100000) ra 88069ed4 sz 32
  sdread
88069dac+128 (8811bc68,d2d27e78,a12,100000) ra 880bcb20 sz 96
  spec_read
880bcaa4+7c (8811bc68,d2d27e78,a12,100000) ra 88060910 sz 24
  nfsspec_read
88060804+10c (8811bc68,d2d27e78,d2d27e78,100000) ra 88036f44 sz 64
  vn_read
88036e80+c4 (8811bc68,d2d27e78,d2d27e78,100003e0) ra 88036e64 sz 96
  dofileread
88036dc0+a4 (8811bc68,d2d27e78,d2d27e78,100003e0) ra 880f9b74 sz 56
  sys_read
880f9964+210 (8811bc68,d2d27e78,d2d27e78,100003e0) ra 8800305c sz 80
  syscall_plain
mips3_SystemCall+b0 (8811bc68,d2d27e78,d2d27e78,100003e0) ra 3010c080 sz 0
PC 0x3010c080: not in kernel space
0+3010c080 (8811bc68,d2d27e78,d2d27e78,100003e0) ra 0 sz 0
User-level: pid 1245
db>

The offending line of code, in sdstrategy(), appears to be the marked one:

        if (lp->d_secsize == DEV_BSIZE) {
                sector_aligned = (bp->b_bcount & (DEV_BSIZE - 1)) == 0;
        } else {
>>>             sector_aligned = (bp->b_bcount % lp->d_secsize) == 0;  <<<
        }

I *think* lp->d_secsize is either initialized to 0 or read from the
disk, so with no label on the disk it would stay 0.
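A minimal sketch of how the same alignment test could refuse to divide when the in-core label has a zero sector size.  sector_aligned_checked() is a hypothetical helper, not the NetBSD fix; it mirrors the two branches of the sdstrategy() code and reports an unusable sector size instead of trapping, leaving the caller to fail the I/O.

```c
#include <stdint.h>

#define DEV_BSIZE	512	/* the usual NetBSD value (1 << DEV_BSHIFT) */

/*
 * Hypothetical helper: 1 if bcount is a whole number of sectors,
 * 0 if not, -1 if the label's sector size is 0 (missing label).
 */
static int
sector_aligned_checked(uint32_t bcount, uint32_t secsize)
{
	if (secsize == 0)
		return -1;		/* would otherwise divide by zero */
	if (secsize == DEV_BSIZE)
		return (bcount & (DEV_BSIZE - 1)) == 0;	/* power-of-two fast path */
	return (bcount % secsize) == 0;
}
```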

The machine code for the marked line appears to be:

0x8811bd34 <sdstrategy+204>:    lw      $a0,48($s0)
0x8811bd38 <sdstrategy+208>:    nop
0x8811bd3c <sdstrategy+212>:    divu    $zero,$a0,$v1
0x8811bd40 <sdstrategy+216>:    bnez    $v1,0x8811bd4c <sdstrategy+228>
0x8811bd44 <sdstrategy+220>:    nop
0x8811bd48 <sdstrategy+224>:    break   0x7
0x8811bd4c <sdstrategy+228>:    mfhi    $v0

and, sure enough, v1 (the divisor, lp->d_secsize) is zero:

db> show reg
...
v1                   0
a0               0x200
...
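For readers less used to MIPS: divu itself does not trap on a zero divisor, so the compiler emits an explicit test that falls into "break 0x7" (the conventional divide-by-zero break code) when the divisor is zero.  The model below is a sketch of that sequence in C; emulate_divu() and struct divu_result are hypothetical names used only for illustration.

```c
#include <stdint.h>

/* Outcome of the compiler-generated sequence shown above. */
struct divu_result {
	int	 trapped;	/* 1 if the "break 0x7" path is taken */
	uint32_t lo;		/* quotient (LO), valid only if !trapped */
	uint32_t hi;		/* remainder (HI), valid only if !trapped */
};

/*
 * Models: divu $zero,$a0,$v1; bnez $v1,1f; nop; break 0x7; 1: mfhi $v0
 * with a0 as the dividend and v1 as the divisor.
 */
static struct divu_result
emulate_divu(uint32_t a0, uint32_t v1)
{
	struct divu_result r = { 0, 0, 0 };

	if (v1 == 0) {		/* bnez not taken: fall into break 0x7 */
		r.trapped = 1;
		return r;
	}
	r.lo = a0 / v1;
	r.hi = a0 % v1;		/* what mfhi would load into v0 */
	return r;
}
```

With the register values above (a0 = 0x200, v1 = 0) the model takes the trap path.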


I concluded that the problem was the missing or uninitialized disk
label, and after some failed attempts I managed to wedge one in place.
This could not be done with any operation that first tries to read
the missing disklabel, since that hits the same division by zero, so
I ended up modifying a proto-file from one of my other systems and
running

# disklabel -R -r sd1 new-label.sd1

after which the label was sufficiently initialized that I could
proceed with tuning its contents.

The root cause of the problem may be that the in-core disklabel is
not given sufficient default values (in particular a nonzero
d_secsize) when the label on the disk is missing.
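A minimal sketch of what "sufficient default values" could mean, assuming the driver knows the device's sector size and capacity (possibly zero if it does not).  struct mini_label and mini_label_defaults() are hypothetical stand-ins that borrow only the d_secsize and d_secperunit field names from struct disklabel; this is not the NetBSD code.

```c
#include <stdint.h>

#define DEV_BSIZE	512

/* Hypothetical, trimmed-down stand-in for struct disklabel. */
struct mini_label {
	uint32_t d_secsize;	/* bytes per sector */
	uint32_t d_secperunit;	/* total sectors on the unit */
	uint32_t d_npartitions;	/* number of partitions in use */
	uint32_t whole_offset;	/* whole-disk partition start, sectors */
	uint32_t whole_size;	/* whole-disk partition size, sectors */
};

/*
 * Fill in conservative defaults when no label was found on the disk,
 * so later code never sees d_secsize == 0.
 */
static void
mini_label_defaults(struct mini_label *lp, uint32_t secsize,
    uint32_t nsectors)
{
	lp->d_secsize = (secsize != 0) ? secsize : DEV_BSIZE;
	lp->d_secperunit = nsectors;
	lp->d_npartitions = 1;		/* one partition covering the disk */
	lp->whole_offset = 0;
	lp->whole_size = nsectors;
}
```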


Regards,

- Håvard