Subject: Re: Bootability eludes me once again
To: <>
From: David Laight <david@l8s.co.uk>
List: port-i386
Date: 05/01/2002 12:08:27
> Now I have another system that won't boot properly, and I'm *sure* I
> did everything right!  Well, obviously not???
> 
> My understanding
> ----------------
> 
> (Boot process for the case where the NetBSD partition is *not* the entire
> disk, and does *not* line up on sector zero of the disk:)
> 
>   - The boot ROM of the hardware reads sector zero of the disk, and
>     jumps to it.  This sector should contain "master boot code",
>     placed there by "fdisk", as well as the "DOS partition table".
>     This code can be the regular boot code or the bootselect code.
>     I'll call it the "part zero" boot code.
> 
>   - The above part zero bootcode reads the DOS partition table, finds the
>     active partition (in this case, the NetBSD partition) (or, if
>     bootselect, it may allow the user to select an alternative partition
>     using one of the function keys), then reads the first sector of that
>     and jumps to it.  This first sector of the NetBSD part of the disk
>     should contain part 1 of the NetBSD boot code (the "countdown");
>     the next sector contains the NetBSD disk label (including the NetBSD
>     partition table).

You've missed out 2 stages here!

The code in the first sector of the partition (pbr) re-reads sector
zero to find the netbsd partition (it isn't passed its own sector
number).  It then reads the first 15 sectors of the partition
and jumps to the start of the third sector read (the 2nd is the
netbsd disklabel).

This code enters protected mode and starts running C code.  It uses
a list of sector numbers and sizes to load in the rest of the
bootloader from filestore.  This is the code that does the 'countdown'...

(Note that the pbr sector 0, sectors 2+ and the bootloader are all
linked into one object file and installed with installboot.)
> 
>   - Finally, the "countdown" code reads the NetBSD disk label, based
>     on which it finds the rest (part 2) of the NetBSD boot code, which
>     in turn starts the kernel.

> 
> Parts 1 and 2 of the NetBSD boot code are written by "installboot".
> Part 1 by definition goes into the first sector of the NetBSD part of the
> disk, or so we hope: "installboot" seems to place it in the first sector
> of the partition specified on the command line, so we must make sure
> that the partition we specify to "installboot" starts at the beginning
> of the NetBSD part of the disk as per the MBR partition table.  Part 2
> of the boot code is placed in the filesystem on the given partition.
> 
> The list of part 2 block locations appears to be hard-coded into part
> 1, which suggests that part 1 does *not* read the filesystem to find
> the "/boot" file, but rather knows where the "rest of itself" is.  On
> the other hand, I can use a part 2 on sd0a from a part1 obtained from
> fd0a, which suggests that part 1 *can* read the filesystem (it can
> read it enough to do "ls", anyway!).  Confusion still reigns. :-(

This is because it is 'part 2' that is doing the countdown.
This larger code has (some) support for reading filesystems.

> 
> 
> Symptom
> -------
> 
> |  Searching for Boot Record from Floppy... Not Found
> |  Searching for Boot Record from SCSI... OK
> |  No operating system

A bit of hacking
(find . -name '*.[cSh]' | xargs grep "No operating system")
will show that this message is from sbin/fdisk/mbr/mbr.S and
indicates that the sector read (expecting a pbr) didn't end
with the magic 0xaa55...
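
The same test can be repeated by hand on any sector.  A rough sketch,
assuming a POSIX shell with dd and od; check_magic is a made-up helper
name, not a NetBSD tool:

```shell
# check_magic FILE SECTOR  (hypothetical helper, not part of NetBSD)
# Reads one 512-byte sector and reports whether its last two bytes are
# 0x55 0xaa, i.e. the little-endian magic 0xaa55 mentioned above.
check_magic() {
    sig=$(dd if="$1" bs=512 skip="$2" count=1 2>/dev/null |
          od -An -tx1 -j510 -N2 2>/dev/null | tr -d ' \n')
    if [ "$sig" = "55aa" ]; then
        echo "sector $2: magic ok"
    else
        echo "sector $2: bad magic ($sig)"
    fi
}
```

Run it as, say, "check_magic /dev/rsd0d 64" (the raw whole-disk device
name is an assumption about this setup).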

> 
> 
> What I did
> ----------
> 
> (a) I used "fdisk -iua" to put an MBR on sd0.  "fdisk sd0" reports:
> 
> |  NetBSD disklabel disk geometry:
> |  cylinders: 4826 heads: 4 sectors/track: 107 (428 sectors/cylinder)
> |  
> |  BIOS disk geometry:
> |  cylinders: 1023 heads: 255 sectors/track: 63 (16065 sectors/cylinder)
> |  
> |  Partition table:
> |  0: <UNUSED>
> |  1: <UNUSED>
> |  2: <UNUSED>
> |  3: sysid 169 (NetBSD)
> |      start 64, size 64 (0 MB), flag 0x80
> |          beg: cylinder    0, head   1, sector  2
> |          end: cylinder    0, head   2, sector  2

Unusual - partitions usually start at sector 63 - but it should not be fatal!
 
> (b) I used "disklabel" to partition sd0.  "disklabel sd0" reports:

Not getting as far as anything that looks at the disklabel....

> (c) I used "/usr/mdec/installboot -v /usr/mdec/biosboot.sym /dev/rsd0a"; it
>     reported success, and the mod time on the "boot" file on that partition
>     changed.

Might be worth checking what is in sector 64... eg:
	dd if=/dev/rsd0d skip=64 count=1 2>/dev/null | hexdump -C


> One destructive mistake I made was to
> call fdisk on "sd0a" instead of "sd0", which trashed my root partition,
> which I had to restore from tape backup.

It shouldn't be THAT drastic; fsck ought to find one of the alternate
superblocks....

> 
> The reason I mention this is that part of that mistake seems to not have
> been overwritten since then; "fdisk sd0a" reports:
> 
> |  NetBSD disklabel disk geometry:
> |  cylinders: 4826 heads: 4 sectors/track: 107 (428 sectors/cylinder)
> |  
> |  BIOS disk geometry:
> |  cylinders: 1023 heads: 255 sectors/track: 63 (16065 sectors/cylinder)
> |  
> |  Partition table:
> |  0: <UNUSED>
> |  1: <UNUSED>
> |  2: <UNUSED>
> |  3: sysid 169 (NetBSD)
> |      start 0, size 16 (0 MB), flag 0x80
> |          beg: cylinder    0, head   0, sector  1
> |          end: cylinder    0, head   0, sector 16
> 
> Size 16, offset zero is how I know that's my earlier attempt; I
> eventually settled on size 64, offset 64, as is reported by
> "fdisk sd0" or "fdisk sd0d".  What I don't understand is why this
> bogus and wrongly-placed MBR data was not overwritten by the
> subsequent disklabel and/or installboot commands.

I'd dump the first sector of every slice and partition to see where
that is hiding!
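
Something along these lines would do for the dumping.  It is only a
sketch: dump_tables is a made-up name, and it prints just the part of
each first sector where an MBR keeps its partition table (byte offset
446, four 16-byte entries) and the 2-byte magic at offset 510:

```shell
# dump_tables DEVICE...  (hypothetical helper)
# For each device or image, print bytes 446-511 of its first sector:
# the four MBR partition entries followed by the magic number.
dump_tables() {
    for dev in "$@"; do
        echo "== $dev"
        dd if="$dev" bs=512 count=1 2>/dev/null |
            od -Ad -v -tx1 -j446 -N66
    done
}
```

For example "dump_tables /dev/rsd0a /dev/rsd0d", adding whichever other
raw partitions exist on the disk.  A stray table shows up as non-zero
entries ending in "55 aa".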

> So my O/S is on "hd1a" (as numbered by the floppy's mbr).

No - the order is determined by the system BIOS.  All the boot
code (until you get into the kernel itself) reads disks using
the BIOS calls and BIOS disk numbers (start at 0x80 for hard
disks).

> I ran a few tests with the bootselect mbr...
> 
> |  Searching for Boot Record from SCSI... OK
> |  F4: NetBSD
> |  3
> 
> This suggests that the boot selector (written by fdisk on sector zero of
> the first SCSI disk, sd0) *is* being executed by the boot ROM, but for
> some unknown reason, it is jumping to code that makes no sense.  Based on
> reading the assembler source code to mbr_bootsel (with gratitude for the
> comments!!!), it looks as though "F4: NetBSD" is the result of walking
> through the selector (name) table (i.e. the MBR partition table) and
> printing the entries, while the "3" corresponds to ERR_NOOS ("Magic no.
> check failed for part.") -- i.e., the same error as was reported by the
> floppy's MBR boot code.

Certainly looks that way....

> 
> Hitting <F6> before the bootselector timeout results in a sort of
> successful boot from my backup O/S (problems are my fault; my backup
> procedure is still under construction), which really confirms to me
> that fdisk has written usable "part zero boot code", and that the
> problem is with "part 1" written (or not?) by "installboot".
> 
> Actually, it turns out that the sd0 MBR numbers the disks differently
> from the floppy MBR, so with the sd0 MBR (regular or bootselect, I
> suspect, but confirmed using "ls" from the countdown code loaded from
> my backup O/S disk after pressing F6 from sd0 bootselect -- are you
> confused yet?), we have:
> 
>   hd0 = sd0
>   hd1 = sd1
>   hd2 = sd2
>   hd3 = wd0
> 
> where the floppy's MBR had wd0 first.  But that's probably not important.

Interesting that the BIOS gives the disks different numbers when
booting from floppy and hard disk :-)

> Certainly "installboot" *thinks* it had installed parts 1 and 2 boot
> code correctly, and the bootability of my O/S (once we get a "part 1"
> from another location) confirms that "part 2" of the bootcode *is*
> installed correctly.
> 
> So I have narrowed it down to: installboot has not installed "part 1"
> of the boot code (the countdown code) correctly on sd0a.

Check whether the code is actually there: with dd/hexdump and some
guesswork you should see the strings "Read err" and "No NetBSD part"
lurking near the end of the sector...
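
To save squinting at a hexdump, a sketch like this looks for the two
strings directly.  find_pbr_strings is a made-up name, and the exact
message text may differ between versions of the boot code:

```shell
# find_pbr_strings FILE SECTOR  (hypothetical helper)
# Reports whether each expected pbr message appears in the given
# 512-byte sector.  Non-printable bytes are turned into newlines so
# that grep sees plain text.
find_pbr_strings() {
    for s in "Read err" "No NetBSD part"; do
        if dd if="$1" bs=512 skip="$2" count=1 2>/dev/null |
           LC_ALL=C tr -c '[:print:]' '\n' | grep -F "$s" >/dev/null
        then
            echo "found: $s"
        else
            echo "missing: $s"
        fi
    done
}
```

Try it on the sector the partition table points at, for instance
"find_pbr_strings /dev/rsd0d 64" (device name assumed).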

> My only thought at this point is that somehow, "installboot" has put the
> "part 1" blocks at the wrong place (presumably somewhere that has not,
> so far, trashed anything else on that disk), but I can't imagine how
> or why, unless it has to do somehow with the BIOS vs NetBSD geometry,
> in which case I'm even further out of my depth than I appear to be.

There is another possibility:
The SCSI BIOS may have a different view of the disk geometry
(one you aren't being told about), so is reading the wrong sector.
You might be able to force the mbr_bootsel code to do an LBA
read (instead of a chs one).
Set the 'flag' byte (offset 0x195) of the MBR to 0x3.
(fdisk will probably do this - and never clear it - if you
put a selected partition above the chs limit)
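
To see how much the geometry matters, here is the usual chs to sector
number arithmetic as a small sketch (chs2lba is a made-up name).  The
second geometry used below is just the disklabel one from the fdisk
output, picked for illustration; what the SCSI BIOS really uses is
unknown:

```shell
# chs2lba CYL HEAD SECTOR HEADS SECTORS_PER_TRACK  (hypothetical helper)
# Standard conversion: lba = (cyl * heads + head) * spt + sector - 1
chs2lba() {
    echo $(( ($1 * $4 + $2) * $5 + $3 - 1 ))
}
```

"chs2lba 0 1 2 255 63" gives 64, which matches the partition start that
fdisk reports.  The same chs values with 4 heads and 107 sectors/track,
"chs2lba 0 1 2 4 107", give 108, so a BIOS with a different idea of the
geometry would fetch a different sector entirely.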

I don't think there is space in the mbr_bootsel code to
dump the incorrect partition code.  Not without temporarily
removing some of the features.

	David

-- 
David Laight: david@l8s.co.uk