NetBSD-Users archive


Re: NetBSD 9.1 upgrade and file system crash - reboot fails



Hi Martin,

On 2020-10-30 14:51:28 +0000 Martin Husemann <martin%duskware.de@localhost> wrote:

On Fri, Oct 30, 2020 at 03:41:55PM +0100, Riccardo Mottola wrote:
A lot of errors.... and the system is not bootable anymore! I get:
NetBSD MBR boot....
Non-System disk or disk error

This is very early MBR boot sector failure, it should not be related
to the fsck issue - but maybe your disk starts to act up?

Could be... the boot code should not be affected by a kernel or file system error, right? (Unless something very bad happened, such as an out-of-partition access.)
The disk should be pretty new, but read below.

I would start checking fdisk output for the disk - is it still as
expected? Does it show a NetBSD partition with expected size?


Disk: /dev/wd0
NetBSD disklabel disk geometry:
cylinders: 155061, heads: 16, sectors/track: 63 (1008 sectors/cylinder)
total sectors: 156301488, bytes/sector: 512

BIOS disk geometry:
cylinders: 1022, heads: 240, sectors/track: 63 (15120 sectors/cylinder)
total sectors: 156301488

Partitions aligned to 15120 sector boundaries, offset 63

Partition table:
0: NetBSD (sysid 169)
    start 64, size 156301424 (76319 MB, Cyls 0/1/2-10337/95/63), Active
1: <UNUSED>
2: <UNUSED>
3: <UNUSED>
Bootselector disabled.
First active partition: 0
Drive serial number: 0 (0x00000000)


disklabel:
4 partitions:
#        size    offset     fstype [fsize bsize cpg/sgs]
 a: 151173728        64     4.2BSD      0     0     0  # (Cyl.      0*- 149973)
 b:   5127696 151173792       swap                     # (Cyl. 149974 - 155060)
 c: 156301424        64     unused      0     0        # (Cyl.      0*- 155060)
 d: 156301488         0     unused      0     0        # (Cyl.      0 - 155060)


The offset and size of c match the partition table. Is that good enough?
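
The comparison can also be done mechanically. The following is only a minimal sketch in Python, with the sector numbers copied by hand from the fdisk and disklabel output above; the names (MBR_PART, LABEL, check_layout) are made up for illustration and are not part of any NetBSD tool. It assumes, as the output suggests, that c describes the NetBSD MBR partition and d the whole disk.

```python
# Sector counts copied from the fdisk and disklabel output (512-byte sectors).
TOTAL_SECTORS = 156301488
MBR_PART = {"offset": 64, "size": 156301424}

# Hypothetical hand-made copy of the disklabel entries.
LABEL = {
    "a": {"offset": 64, "size": 151173728},
    "b": {"offset": 151173792, "size": 5127696},
    "c": {"offset": 64, "size": 156301424},
    "d": {"offset": 0, "size": 156301488},
}


def end(part):
    """First sector after the partition."""
    return part["offset"] + part["size"]


def check_layout():
    """Return a mapping of check name -> True/False."""
    a, b, c, d = (LABEL[k] for k in "abcd")
    return {
        "c matches the MBR partition": c == MBR_PART,
        "d covers the whole disk": d["offset"] == 0 and d["size"] == TOTAL_SECTORS,
        "a starts where c starts": a["offset"] == c["offset"],
        "b follows a without a gap": end(a) == b["offset"],
        "b ends where c ends": end(b) == end(c),
        "c ends at the end of the disk": end(c) == TOTAL_SECTORS,
    }


for name, ok in check_layout().items():
    print("ok  " if ok else "FAIL", name)
```

With the numbers above every check passes, so the label and the MBR agree with each other; this says nothing about whether the sectors themselves are readable.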

Then compare the disklabel output, does it match?

If that is ok, install bootloader again.

I installed it anyway and got the machine booting again, then did all the checks. All important data is backed up; the only inconvenience would be the usual reinstall and setup.

Also use atactl to check the smart status of the disk.

How reliable is that data?

I checked SMART status, it looks a little worrying:
SMART supported, SMART enabled
id value thresh crit collect reliability description                 raw
  1  58   34     yes online  positive    Raw read error rate         27218486
  3  96    0     yes online  positive    Spin-up time                0
  4  95   20     no  online  positive    Start/stop count            6082
  5 100   36     yes online  positive    Reallocated sector count    13
  7  81   30     yes online  positive    Seek error rate             125626383
  9  95    0     no  online  positive    Power-on hours count        4752
 10 100   34     yes online  positive    Spin retry count            0
 12  98   20     no  online  positive    Device power cycle count    2790
192  99    0     no  online  positive    Power-off retract count     2791
193  18    0     no  online  positive    Load cycle count            165436
194  37    0     no  online  positive    Temperature                 37 Lifetime min/max 0/11
195  58    0     no  online  positive    Hardware ECC Recovered      27218486
197 100    0     no  online  positive    Current pending sector      0
198 100    0     no  offline positive    Offline uncorrectable       0
199 200    0     no  online  positive    Ultra DMA CRC error count   0
200 100    0     no  offline positive    Write error rate            0
202 100    0     no  online  positive    Data address mark errors    0

13 reallocated sectors; if one of them is in the MBR, who knows? The number of cycles and power-on hours is also high, but reasonable. The read and seek error rates look incredibly high. So I thought of writing this to a file and checking again the next day and today, just to see what increases.
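
One way to read the first table is to compare the normalized "value" column with "thresh" rather than looking at the raw numbers. The sketch below assumes the usual SMART convention that an attribute counts as failing once its normalized value drops to or below its threshold; the rows are the "crit yes" attributes copied by hand from the first snapshot, and the names SNAPSHOT, margins and failing are purely illustrative.

```python
# (id, value, thresh, crit, description), copied from the first snapshot.
SNAPSHOT = [
    (1, 58, 34, True, "Raw read error rate"),
    (3, 96, 0, True, "Spin-up time"),
    (5, 100, 36, True, "Reallocated sector count"),
    (7, 81, 30, True, "Seek error rate"),
    (10, 100, 34, True, "Spin retry count"),
]


def margins(rows):
    """Distance of each normalized value from its threshold."""
    return {attr_id: value - thresh for attr_id, value, thresh, _crit, _desc in rows}


def failing(rows):
    """Ids of critical attributes at or below threshold (assumed convention)."""
    return [attr_id for attr_id, value, thresh, crit, _desc in rows
            if crit and value <= thresh]


for attr_id, margin in margins(SNAPSHOT).items():
    print(f"attribute {attr_id:3d}: {margin} above threshold")
print("failing:", failing(SNAPSHOT))
```

By that reading none of the critical attributes is at its threshold yet, although attribute 1 has the smallest margin.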

The day after:
SMART supported, SMART enabled
id value thresh crit collect reliability description                 raw
  1  59   34     yes online  positive    Raw read error rate         232650323
  3  96    0     yes online  positive    Spin-up time                0
  4  95   20     no  online  positive    Start/stop count            6088
  5 100   36     yes online  positive    Reallocated sector count    13
  7  81   30     yes online  positive    Seek error rate             126691967
  9  95    0     no  online  positive    Power-on hours count        4762
 10 100   34     yes online  positive    Spin retry count            0
 12  98   20     no  online  positive    Device power cycle count    2793
192  99    0     no  online  positive    Power-off retract count     2794
193  17    0     no  online  positive    Load cycle count            166041
194  29    0     no  online  positive    Temperature                 29 Lifetime min/max 0/11
195  59    0     no  online  positive    Hardware ECC Recovered      232650323
197 100    0     no  online  positive    Current pending sector      0
198 100    0     no  offline positive    Offline uncorrectable       0
199 200    0     no  online  positive    Ultra DMA CRC error count   0
200 100    0     no  offline positive    Write error rate            0
202 100    0     no  online  positive    Data address mark errors    0

Some of it makes sense, like 10 more hours and a few more start/stop counts. But, e.g., the number of hardware ECC errors recovered is nearly 10 times higher? The same for the raw read error rate, wow...

Then this is the data for the third day (each time I did a power-off reboot, so it is not continuous operation; I shut down the laptop at night):

SMART supported, SMART enabled
id value thresh crit collect reliability description                 raw
  1  60   34     yes online  positive    Raw read error rate         73875073
  3  96    0     yes online  positive    Spin-up time                0
  4  95   20     no  online  positive    Start/stop count            6088
  5 100   36     yes online  positive    Reallocated sector count    13
  7  81   30     yes online  positive    Seek error rate             127050561
  9  95    0     no  online  positive    Power-on hours count        4771
 10 100   34     yes online  positive    Spin retry count            0
 12  98   20     no  online  positive    Device power cycle count    2793
192  99    0     no  online  positive    Power-off retract count     2794
193  17    0     no  online  positive    Load cycle count            166675
194  28    0     no  online  positive    Temperature                 28 Lifetime min/max 0/11
195  60    0     no  online  positive    Hardware ECC Recovered      73875073
197 100    0     no  online  positive    Current pending sector      0
198 100    0     no  offline positive    Offline uncorrectable       0
199 200    0     no  online  positive    Ultra DMA CRC error count   0
200 100    0     no  offline positive    Write error rate            0
202 100    0     no  online  positive    Data address mark errors    0

The number of read errors skyrocketed!

The number of reallocated sectors remains the same, and this is the only... reassuring thing.
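
Comparing the saved snapshots by eye is error-prone, so here is a rough sketch of how the day-to-day differences could be computed. It assumes the saved text has the column layout shown in the tables above (id, value, thresh, three flag columns, description, raw) and that descriptions contain no digits; parse_smart and raw_deltas are made-up helper names, and only a few rows of each day are repeated here to keep it short.

```python
import re

# One attribute row: id, value, thresh, three flag words, description, raw.
ROW = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<value>\d+)\s+(?P<thresh>\d+)\s+"
    r"\S+\s+\S+\s+\S+\s+(?P<desc>\D+?)\s+(?P<raw>\d+)"
)


def parse_smart(text):
    """Map attribute id -> (description, normalized value, raw value)."""
    table = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:  # header and status lines do not match and are skipped
            table[int(m["id"])] = (m["desc"], int(m["value"]), int(m["raw"]))
    return table


def raw_deltas(older, newer):
    """Raw-value change per attribute, listing only attributes that changed."""
    return {i: newer[i][2] - older[i][2]
            for i in newer if i in older and newer[i][2] != older[i][2]}


DAY1 = """
  1  58   34     yes online  positive    Raw read error rate         27218486
  5 100   36     yes online  positive    Reallocated sector count    13
  9  95    0     no  online  positive    Power-on hours count        4752
193  18    0     no  online  positive    Load cycle count            165436
"""
DAY2 = """
  1  59   34     yes online  positive    Raw read error rate         232650323
  5 100   36     yes online  positive    Reallocated sector count    13
  9  95    0     no  online  positive    Power-on hours count        4762
193  17    0     no  online  positive    Load cycle count            166041
"""
DAY3 = """
  1  60   34     yes online  positive    Raw read error rate         73875073
  5 100   36     yes online  positive    Reallocated sector count    13
  9  95    0     no  online  positive    Power-on hours count        4771
193  17    0     no  online  positive    Load cycle count            166675
"""

days = [parse_smart(t) for t in (DAY1, DAY2, DAY3)]
for older, newer in zip(days, days[1:]):
    print(raw_deltas(older, newer))
```

In these three snapshots the raw value of attribute 1 goes up and then down again while its normalized value creeps from 58 to 60, so the raw field is probably not a plain cumulative error count. The load cycle count, on the other hand, grows by roughly 600 per day.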

Riccardo


