Subject: Bootability eludes me once again
To: None <port-i386@netbsd.org>
From: Anne Bennett <anne@alcor.concordia.ca>
List: port-i386
Date: 04/30/2002 17:18:14
[Apologies for the long post; I've been at this for hours so I have a
 lot of test and reading results to report.]

I swear I'm not stupid.  Really.  What's crazy is that back in November,
I went through something like this, Wolfgang Solfrank explained to me
how the i386 boot sequence works, and I was quite sure I understood.
Certainly, I was able to fix the system last time.

Now I have another system that won't boot properly, and I'm *sure* I
did everything right!  Well, obviously not???

My understanding
----------------

(Boot process for the case where the NetBSD partition is *not* the entire
disk, and does *not* line up on sector zero of the disk:)

  - The boot ROM of the hardware reads sector zero of the disk, and
    jumps to it.  This sector should contain "master boot code",
    placed there by "fdisk", as well as the "DOS partition table".
    This code can be the regular boot code or the bootselect code.
    I'll call it the "part zero" boot code.

  - The above part zero bootcode reads the DOS partition table, finds the
    active partition (in this case, the NetBSD partition) (or, if
    bootselect, it may allow the user to select an alternative partition
    using one of the function keys), then reads the first sector of that
    partition and jumps to it.  This first sector of the NetBSD part of
    the disk should contain part 1 of the NetBSD boot code (the
    "countdown"); the next sector contains the NetBSD disk label
    (including the NetBSD partition table).

  - Finally, the "countdown" code reads the NetBSD disk label, based
    on which it finds the rest (part 2) of the NetBSD boot code, which
    in turn starts the kernel.
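
The first two steps above can be sketched in Python.  This is only an
illustration of the conventional MBR layout (partition table at byte
446, four 16-byte entries, 0x55 0xAA magic in the last two bytes), not
the actual NetBSD boot code.  The function names and the example sector
are hypothetical; the example values mirror the "fdisk sd0" output
quoted later in this message.

```python
import struct

SECTOR_SIZE = 512      # assumed sector size
TABLE_OFFSET = 446     # the DOS partition table starts here in sector zero
MAGIC_OFFSET = 510     # the last two bytes hold the 0x55 0xAA magic number
SYSID_NETBSD = 169     # fdisk prints this as "sysid 169 (NetBSD)"


def parse_partition_table(sector):
    """Return the four MBR entries as dicts; raise if the magic is wrong."""
    if len(sector) != SECTOR_SIZE:
        raise ValueError("need exactly one 512-byte sector")
    if sector[MAGIC_OFFSET:MAGIC_OFFSET + 2] != b"\x55\xaa":
        raise ValueError("bad magic number")
    entries = []
    for i in range(4):
        raw = sector[TABLE_OFFSET + 16 * i:TABLE_OFFSET + 16 * (i + 1)]
        # flag, 3 CHS bytes, sysid, 3 CHS bytes, start (LBA), size (sectors)
        flag, sysid, start, size = struct.unpack("<B3xB3xII", raw)
        entries.append({"index": i, "flag": flag, "sysid": sysid,
                        "start": start, "size": size})
    return entries


def find_active(entries):
    """Return the first entry flagged active (0x80), or None."""
    for entry in entries:
        if entry["flag"] == 0x80:
            return entry
    return None


def build_example_sector():
    """Hypothetical sector zero: slot 3, sysid 169, start 64, size 64."""
    sector = bytearray(SECTOR_SIZE)
    entry = struct.pack("<B3xB3xII", 0x80, SYSID_NETBSD, 64, 64)
    sector[TABLE_OFFSET + 16 * 3:TABLE_OFFSET + 16 * 4] = entry
    sector[MAGIC_OFFSET:] = b"\x55\xaa"
    return bytes(sector)


active = find_active(parse_partition_table(build_example_sector()))
print(active["index"], active["sysid"], active["start"])
```

The "part zero" code would then load the sector at the active entry's
start and apply the same magic number test to it before jumping.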

Parts 1 and 2 of the NetBSD boot code are written by "installboot".
Part 1 by definition goes into the first sector of the NetBSD part of the
disk, or so we hope: "installboot" seems to place it in the first sector
of the partition specified on the command line, so we must make sure
that the partition we specify to "installboot" starts at the beginning
of the NetBSD part of the disk as per the MBR partition table.  Part 2
of the boot code is placed in the filesystem on the given partition.

The list of part 2 block locations appears to be hard-coded into part
1, which suggests that part 1 does *not* read the filesystem to find
the "/boot" file, but rather knows where the "rest of itself" is.  On
the other hand, I can load part 2 from sd0a using a part 1 obtained from
fd0a, which suggests that part 1 *can* read the filesystem (it can
read it enough to do "ls", anyway!).  Confusion still reigns. :-(


Symptom
-------

When coming up from a "reset", this Pentium III with IDE and SCSI
controllers reports its devices like so:

|  Auto-Detecting Pri Master.. IDE Hard Disk
|  Auto-Detecting Sec Master.. ATAPI CDROM
|  
|  Pri Master: AR1.0400 MAXTOR 6L040J2
|              Ultra DMA Mode-4, S.M.A.R.T. Capable but Disabled
|  Sec Master: 1.09 Compaq CRD-8320B
|  
|  Adaptec AHA-2940 Ultra/Ultra W BIOS v1.23
|  (c) 1996 Adaptec, Inc. All Rights Reserved
|  
|  <<< Press <Ctrl><A> for SCSISelect (TM) Utility! >>>
|  
|    SCSI ID:LUN NUMBER #:# 0:0 - SGI      SEAGATE ST51080N - Drive D: (81h)
|    SCSI ID:LUN NUMBER #:# 1:0 - QUANTUM  FIREBALL ST3.2S - Drive 82h
|    SCSI ID:LUN NUMBER #:# 2:0 - QUANTUM  FIREBALL ST4.3S - Drive 83h
|    SCSI ID:LUN NUMBER #:# 4:0 - TANDBERG TDC 3800
|    SCSI ID:LUN NUMBER #:# 6:0 - HP       C1790A
|  
|  BIOS Installed Successfully!

It is set to try to boot from floppy first, then SCSI, then IDE; the
O/S is on the first SCSI disk (sd0).  If there is a bootable floppy
in the drive, the system reports:

|  Searching for Boot Record from Floppy... OK

... and starts the "countdown".  At that point, I can interrupt, and
make it boot from sd0a, which is actually "hd1a" (more below).

In the absence of a bootable floppy diskette, with the regular mbr
installed on the first SCSI disk, we get:

|  Searching for Boot Record from Floppy... Not Found
|  Searching for Boot Record from SCSI... OK
|  No operating system


What I did
----------

(a) I used "fdisk -iua" to put an MBR on sd0.  "fdisk sd0" reports:

|  NetBSD disklabel disk geometry:
|  cylinders: 4826 heads: 4 sectors/track: 107 (428 sectors/cylinder)
|  
|  BIOS disk geometry:
|  cylinders: 1023 heads: 255 sectors/track: 63 (16065 sectors/cylinder)
|  
|  Partition table:
|  0: <UNUSED>
|  1: <UNUSED>
|  2: <UNUSED>
|  3: sysid 169 (NetBSD)
|      start 64, size 64 (0 MB), flag 0x80
|          beg: cylinder    0, head   1, sector  2
|          end: cylinder    0, head   2, sector  2
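
As a cross-check of the table above, the "beg" and "end" cylinder/head/
sector values can be converted back to absolute sector numbers using
the BIOS geometry that fdisk reports (255 heads, 63 sectors/track).
The sketch below uses the standard CHS-to-LBA formula; the function
name is hypothetical.

```python
def chs_to_lba(cylinder, head, sector, heads_per_cyl, sectors_per_track):
    """Standard CHS -> LBA conversion; sectors are numbered from 1."""
    return (cylinder * heads_per_cyl + head) * sectors_per_track + (sector - 1)


BIOS_HEADS = 255     # from "BIOS disk geometry" in the fdisk output
BIOS_SECTORS = 63

beg = chs_to_lba(0, 1, 2, BIOS_HEADS, BIOS_SECTORS)
end = chs_to_lba(0, 2, 2, BIOS_HEADS, BIOS_SECTORS)
print(beg, end, end - beg + 1)   # prints "64 127 64"
```

So the CHS fields agree with "start 64, size 64" in the same entry.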


(b) I used "disklabel" to partition sd0.  "disklabel sd0" reports:

|  # /dev/rsd0d:
|  type: SCSI
|  disk: ST51080N
|  [...]
|  8 partitions:
|  #        size   offset     fstype   [fsize bsize cpg/sgs]
|    a:   102720       64     4.2BSD     1024  8192    16   # (Cyl.    0*- 240*)
|    b:   221211   102784       swap                        # (Cyl.  240*- 756*)
|    c:  2070171       64     unused        0     0         # (Cyl.    0*- 4836*)
|    d:  2070235        0     unused        0     0         # (Cyl.    0 - 4836*)
|    e:   513600   323995     4.2BSD     1024  8192    16   # (Cyl.  756*- 1956*)
|    f:   616320   837595       RAID                        # (Cyl. 1956*- 3396*)
|    g:   205440  1453915     4.2BSD     1024  8192    16   # (Cyl. 3396*- 3876*)
|    h:   410880  1659355     unused        0     0         # (Cyl. 3876*- 4836*)
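
The label above can be checked for gaps and overlaps with a few lines
of Python.  This is a sketch only: the sizes and offsets are copied
from the disklabel output, and it assumes the usual i386 convention
that "c" is the NetBSD part of the disk and "d" is the whole disk.
The function name is hypothetical.

```python
# (size, offset) in sectors, copied from the "disklabel sd0" output above
PARTITIONS = {
    "a": (102720, 64),
    "b": (221211, 102784),
    "c": (2070171, 64),
    "d": (2070235, 0),
    "e": (513600, 323995),
    "f": (616320, 837595),
    "g": (205440, 1453915),
    "h": (410880, 1659355),
}


def check_layout(parts, data=("a", "b", "e", "f", "g", "h"),
                 netbsd="c", whole="d"):
    """Return a list of problems; an empty list means the label is tidy."""
    problems = []
    expected = parts[netbsd][1]          # data should start where "c" starts
    for name in data:
        size, offset = parts[name]
        if offset != expected:
            problems.append("%s starts at %d, expected %d"
                            % (name, offset, expected))
        expected = offset + size
    end_of_c = parts[netbsd][0] + parts[netbsd][1]
    if expected != end_of_c:
        problems.append("data partitions end at %d, c ends at %d"
                        % (expected, end_of_c))
    if end_of_c != parts[whole][0] + parts[whole][1]:
        problems.append("c and d do not end at the same sector")
    return problems


print(check_layout(PARTITIONS))   # prints "[]"
```

In particular, partition "a" starts at sector 64, the same sector that
the MBR partition table gives as the start of the NetBSD partition.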


(c) I used "/usr/mdec/installboot -v /usr/mdec/biosboot.sym /dev/rsd0a"; it
    reported success, and the mod time on the "boot" file on that partition
    changed.


An earlier mistake
------------------

Above, I report the last three things I did.  Before that, I was working
while I probably should have been sleeping, and I made a few mistakes.
The easy-to-fix mistakes involved forgetting a step, and having to boot
from floppy to do that step.  One destructive mistake I made was to
call fdisk on "sd0a" instead of "sd0", which trashed my root partition,
which I had to restore from tape backup.

The reason I mention this is that part of that mistake seems to not have
been overwritten since then; "fdisk sd0a" reports:

|  NetBSD disklabel disk geometry:
|  cylinders: 4826 heads: 4 sectors/track: 107 (428 sectors/cylinder)
|  
|  BIOS disk geometry:
|  cylinders: 1023 heads: 255 sectors/track: 63 (16065 sectors/cylinder)
|  
|  Partition table:
|  0: <UNUSED>
|  1: <UNUSED>
|  2: <UNUSED>
|  3: sysid 169 (NetBSD)
|      start 0, size 16 (0 MB), flag 0x80
|          beg: cylinder    0, head   0, sector  1
|          end: cylinder    0, head   0, sector 16

Size 16, offset zero is how I know that's my earlier attempt; I
eventually settled on size 64, offset 64, as is reported by
"fdisk sd0" or "fdisk sd0d".  What I don't understand is why this
bogus and wrongly-placed MBR data was not overwritten by the
subsequent disklabel and/or installboot commands.

(By the way, it would be good if fdisk would warn and confirm before
writing the MBR anywhere other than sector zero!)


More details and tests
----------------------

From the "countdown code" loaded from floppy, I can determine, using
"ls", which "hd number" corresponds to each disk, because I have placed
a file on partition "a" of each disk that reveals the disk's "NetBSD
name".  I recommend this technique to anyone having trouble figuring
out how the bootcode has numbered their disks!

|  ls hd0a:    z.This_is_wd0a  (has "boot" but nothing else)
|  ls hd1a:    z.This_is_sd0a  (has "boot" and is root partition of O/S)
|  ls hd2a:    z.This_is_sd1a  (backup O/S, not tested)
|  ls hd3a:    z.This_is_sd2a  (empty fs)

So my O/S is on "hd1a" (as numbered by the floppy's mbr).

I ran a few tests with the bootselect mbr...

I just now tried "fdisk -B sd0", marking "DOS partition 3" as "NetBSD".
For the "default boot option", I tried "10: The first active partition",
and then I tried "3: NetBSD".  In either case, during the boot sequence,
I see:

|  Searching for Boot Record from SCSI... OK
|  F4: NetBSD
|  3

This suggests that the boot selector (written by fdisk on sector zero of
the first SCSI disk, sd0) *is* being executed by the boot ROM, but for
some unknown reason, it is jumping to code that makes no sense.  Based on
reading the assembler source code to mbr_bootsel (with gratitude for the
comments!!!), it looks as though "F4: NetBSD" is the result of walking
through the selector (name) table (i.e. the MBR partition table) and
printing the entries, while the "3" corresponds to ERR_NOOS ("Magic no.
check failed for part.") -- i.e., the same error as was reported by the
regular MBR boot code ("No operating system").

Hitting <F6> before the bootselector timeout results in a sort of
successful boot from my backup O/S (problems are my fault; my backup
procedure is still under construction), which really confirms to me
that fdisk has written usable "part zero boot code", and that the
problem is with "part 1" written (or not?) by "installboot".

Actually, it turns out that the disks are numbered differently when
booting via the sd0 MBR than when booting via the floppy MBR.  (I
suspect this holds for both the regular and the bootselect MBR; I
confirmed it using "ls" from the countdown code loaded from my backup
O/S disk after pressing F6 at the sd0 bootselect prompt -- are you
confused yet?)  With the sd0 MBR, we have:

  hd0 = sd0
  hd1 = sd1
  hd2 = sd2
  hd3 = wd0

where the floppy's MBR had wd0 first.  But that's probably not important.

Pressing <F5> ("boot from hd0 = sd0") at the sd0 bootselect MBR just
repeats "F4: NetBSD", which I guess is not unexpected.  The question
is: why is the magic number bad on my sd0 "active" partition?  Why is
there still my mistaken "mbr" there, instead of it having been
overwritten by "part 1" boot code?


Here is the output from
"/usr/mdec/installboot -v /usr/mdec/biosboot.sym /dev/rsd0a":

|  /usr/mdec/biosboot.sym: entry point 0x805d000
|  proto bootblock size 48640
|  room for 10 filesystem blocks at 0x578
|  renamed //boot -> //boot.bak
|  Will load 80 blocks.
|  dblk: 54160, num: 16
|  dblk: 54176, num: 16
|  dblk: 54192, num: 16
|  dblk: 54208, num: 16
|  dblk: 54224, num: 16
|  BSD partition starts at sector 64
|  deleting //boot.bak

My BSD partition does indeed start at sector 64, according to fdisk.
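
The numbers in that output are at least internally consistent, as the
sketch below shows.  It parses the relevant lines and checks that the
"dblk" runs add up to the announced total, that they fit in the
available table, and that they are contiguous.  Reading "room for 10
filesystem blocks" as the number of table slots is an assumption here,
and the function name is hypothetical.

```python
import re

# Relevant lines copied from the installboot -v output above
INSTALLBOOT_OUTPUT = """\
room for 10 filesystem blocks at 0x578
Will load 80 blocks.
dblk: 54160, num: 16
dblk: 54176, num: 16
dblk: 54192, num: 16
dblk: 54208, num: 16
dblk: 54224, num: 16
"""


def summarize(text):
    """Extract the table capacity, announced total, and (dblk, num) runs."""
    room = int(re.search(r"room for (\d+) filesystem blocks", text).group(1))
    total = int(re.search(r"Will load (\d+) blocks", text).group(1))
    runs = [(int(d), int(n))
            for d, n in re.findall(r"dblk: (\d+), num: (\d+)", text)]
    return room, total, runs


room, total, runs = summarize(INSTALLBOOT_OUTPUT)
fits = len(runs) <= room                       # assumed meaning of "room for"
adds_up = sum(num for _, num in runs) == total
contiguous = all(runs[i][0] + runs[i][1] == runs[i + 1][0]
                 for i in range(len(runs) - 1))
print(fits, adds_up, contiguous)   # prints "True True True"
```

None of this says where part 1 itself was written; it only confirms
that the part 2 block list reported by installboot hangs together.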

Files /dev/rsd0a and /boot (/ is on sd0a) had their mod times changed.

Certainly "installboot" *thinks* it has installed parts 1 and 2 boot
code correctly, and the bootability of my O/S (once we get a "part 1"
from another location) confirms that "part 2" of the bootcode *is*
installed correctly.

So I have narrowed it down to: installboot has not installed "part 1"
of the boot code (the countdown code) correctly on sd0a.

My only thought at this point is that somehow, "installboot" has put the
"part 1" blocks at the wrong place (presumably somewhere that has not,
so far, trashed anything else on that disk), but I can't imagine how
or why, unless it has to do somehow with the BIOS vs NetBSD geometry,
in which case I'm even further out of my depth than I appear to be.

I hope for a blinding flash of insight from one of the usual sources,
if they would be so kind.  I really hoped I'd find the problem myself
while trying to describe it to you, but no such luck this time!


Anne.
-- 
Ms. Anne Bennett, Senior Analyst, IITS, Concordia University, Montreal H3G 1M8
anne@alcor.concordia.ca                                        +1 514 848-7606