netbsd-bugs: kern/26411: wd.c adding too many blocks to the bad block list

Subject: kern/26411: wd.c adding too many blocks to the bad block list
To: None <gnats-bugs@gnats.NetBSD.org>
From: None <raeburn@raeburn.org>
List: netbsd-bugs
Date: 07/23/2004 03:45:18
>Number:         26411
>Category:       kern
>Synopsis:       wd.c adding too many blocks to the bad block list
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Fri Jul 23 06:10:00 UTC 2004
>Closed-Date:
>Last-Modified:
>Originator:     Ken Raeburn
>Release:        2.0 branch as of July 5 or thereabouts
>Organization:
MIT
>Environment:
NetBSD thud 2.0_BETA NetBSD 2.0_BETA (THUD) #0: Mon Jul  5 12:50:20 EDT 2004  root@thud:/usr/obj/sys/arch/i386/compile/THUD i386

>Description:
I'm trying to recover what data I can from a dying disk, using dd, reading one block at a time.
Every so often the kernel reports a bad block, and if it fails to recover the data after a few retries, dd reports an error on that block.  However, dd then also fails on a number of blocks following the bad one, with no additional reports from the kernel.  (Most of my tests were run on the console, so all I know is that it spewed errors for a while and five I/O error reports were on the screen at the end.)

Looking with dkctl, I find:

thud# dkctl /dev/rwd2d badsector list
/dev/rwd2d: blocks 688204 - 688715 failed at Fri Jul 23 00:54:18 2004
/dev/rwd2d: blocks 684426 - 684937 failed at Fri Jul 23 00:53:19 2004

In both cases the last block of the range is exactly 511 more than the first, so each entry covers 512 blocks, which looks suspicious.

In dev/ata/wd.c, line 780, I find:

			dbs->dbs_min = bp->b_rawblkno;
			dbs->dbs_max = dbs->dbs_min + bp->b_bcount - 1;

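But isn't b_bcount a byte count rather than a block count?  If so, adding it to a block number makes the recorded range far too long.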
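
As a sanity check, here is a minimal sketch of the arithmetic in those two lines.  The helper name is hypothetical and this is not the wd.c code; it also assumes each dd read was a single 512-byte block, which I have not confirmed from the kernel side.

```c
#include <stdint.h>

/*
 * Hypothetical helper mirroring the two quoted lines: the byte count
 * is added to the starting block number as if it were a block count.
 */
static int64_t
range_max_as_written(int64_t rawblkno, int64_t bcount)
{
	return rawblkno + bcount - 1;
}

/* Assumption: dd issued one 512-byte read per request. */
#define ASSUMED_READ_BYTES 512
```

Under that assumption, range_max_as_written(688204, ASSUMED_READ_BYTES) gives 688715 and range_max_as_written(684426, ASSUMED_READ_BYTES) gives 684937, which are exactly the upper bounds dkctl printed.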


On a distantly related note, there's an odd pattern in the logged messages.

Jul 22 03:02:35 thud /netbsd: wd2d: error reading fsbn 232679563 (wd2 bn 232679563; cn 230832 tn 14 sn 25), retrying
Jul 22 03:02:35 thud /netbsd: wd2: (obsolete (address mark not found))
Jul 22 03:02:36 thud /netbsd: wd2d: error reading fsbn 232679563 (wd2 bn 232679563; cn 230832 tn 14 sn 25)wd2: (uncorrectable data error)
Jul 22 03:02:36 thud /netbsd: 

If "retrying" is logged, the device name and error are logged on the next line.  Without "retrying", they are reported on the same line, and a blank line is emitted afterwards.  Lines 756-766 of wd.c seem to do this.  Is it intentional, or should the newline emitted at line 766 instead be emitted earlier, once it's decided that the "retrying" message isn't going to be displayed?
>How-To-Repeat:
Try to copy the contents off a slowly failing disk.  Watch more dd messages spew forth than the kernel messages can account for.  Look at the bad sector list.
>Fix:
I don't know if dbs_max should be the same as dbs_min, or dbs_min + b_bcount/512 - 1, or dbs_min + b_bcount/some_size - 1, or what, but I suspect it's one of those.  I'm going to try just making them equal, I think.
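
To make the candidates concrete, here is a rough sketch of the "divide by some size" variant.  The function name and the secsize parameter are hypothetical, and I have not checked which sector size wd.c should actually use here, so treat this as an illustration rather than a proposed patch.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not the wd.c code: last block covered by a
 * transfer of `bcount` bytes starting at block `rawblkno`, assuming
 * `secsize`-byte sectors.  The division rounds up so that a partial
 * trailing sector still counts, and at least one block is recorded.
 */
static int64_t
bad_range_max(int64_t rawblkno, int64_t bcount, int64_t secsize)
{
	int64_t nblks = (bcount + secsize - 1) / secsize;

	if (nblks < 1)
		nblks = 1;
	return rawblkno + nblks - 1;
}
```

For a single 512-byte read with 512-byte sectors this returns rawblkno itself, i.e. the same result as simply setting dbs_max equal to dbs_min.  For a larger transfer it marks every block in the request, which may still be more than the one sector that really failed.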

I've started doing another copy, and now my bad sector list shows 684426, 688204, 688205, 688208, 917574, 917575, 917580, 917581, 917582 ...  The entries are stored a lot less efficiently (which can't really be helped if the failure times are to be tracked), but I believe I'm actually recovering more of the original data.
>Release-Note:
>Audit-Trail:
>Unformatted: