Subject: Re: kern/9857: wddone() omits block numbers from soft errors
To: Manuel Bouyer <bouyer@antioche.lip6.fr>
From: John Hawkinson <jhawk@MIT.EDU>
List: netbsd-bugs
Date: 04/11/2000 11:29:41
[ added netbsd-bugs, left some quoting in because of it ]

>> >>A soft error is a hard error which has been corrected by a retry.

>> Hmm. Thinking about this once more, I don't think it is clear enough.
>> Either the printf() should be made more clear, or the documentation
>> should be updated. I would sort of favor the first, but I suppose
>> an argument can be made for ata(4), especially since it wouldn't
>> require futzing with the code to save blkdone.
>> 
>> What do you think?
>
>I think that the 'soft error' printf should just go away. This is just
>redundant with the message saying 'retrying'. If the kernel says it's
>retrying and we don't hear anything after that it's implicit that the
>retry worked, isn't it?

I don't think this is sufficient by any means. Here's an example of
some stuff lying around in my message buffer ;-):

wd0e:  uncorrectable data error reading fsbn 7490056 of 7490056-7490057 (wd0 bn 12934201; cn 13686 tn 14 sn 49), retrying
wd0e:  uncorrectable data error reading fsbn 7490056 of 7490056-7490057 (wd0 bn 12934201; cn 13686 tn 14 sn 49), retrying
wd0e:  uncorrectable data error reading fsbn 7490056 of 7490056-7490057 (wd0 bn 12934201; cn 13686 tn 14 sn 49), retrying
wd0e:  uncorrectable data error reading fsbn 7490056 of 7490056-7490057 (wd0 bn 12934201; cn 13686 tn 14 sn 49), retrying
wd0e:  uncorrectable data error reading fsbn 7490056 of 7490056-7490057 (wd0 bn 12934201; cn 13686 tn 14 sn 49), retrying
wd0e:  uncorrectable data error reading fsbn 7490056 of 7490056-7490057 (wd0 bn 12934201; cn 13686 tn 14 sn 49)
wd0e:  uncorrectable data error reading fsbn 7509888 of 7509824-7509903 (wd0 bn 12954033; cn 13707 tn 14 sn 36), retrying
wd0e:  uncorrectable data error reading fsbn 7509888 of 7509824-7509903 (wd0 bn 12954033; cn 13707 tn 14 sn 36), retrying
wd0e:  uncorrectable data error reading fsbn 7509888 of 7509824-7509903 (wd0 bn 12954033; cn 13707 tn 14 sn 36), retrying
wd0e:  uncorrectable data error reading fsbn 7509888 of 7509824-7509903 (wd0 bn 12954033; cn 13707 tn 14 sn 36), retrying
wd0e:  uncorrectable data error reading fsbn 7509894 of 7509824-7509903 (wd0 bn 12954039; cn 13707 tn 14 sn 42), retrying
wd0e:  uncorrectable data error reading fsbn 7509894 of 7509824-7509903 (wd0 bn 12954039; cn 13707 tn 14 sn 42)
wd0e:  uncorrectable data error reading fsbn 7509888 of 7509872-7509919 (wd0 bn 12954033; cn 13707 tn 14 sn 36), retrying
wd0e:  uncorrectable data error reading fsbn 7509888 of 7509872-7509919 (wd0 bn 12954033; cn 13707 tn 14 sn 36), retrying
wd0e:  uncorrectable data error reading fsbn 7509888 of 7509872-7509919 (wd0 bn 12954033; cn 13707 tn 14 sn 36), retrying
wd0e:  uncorrectable data error reading fsbn 7509888 of 7509872-7509919 (wd0 bn 12954033; cn 13707 tn 14 sn 36), retrying
wd0: soft error (corrected)
wd0e:  uncorrectable data error reading fsbn 7510128 of 7510128-7510143 (wd0 bn 12954273; cn 13708 tn 3 sn 24), retrying
wd0e:  uncorrectable data error reading fsbn 7510128 of 7510128-7510143 (wd0 bn 12954273; cn 13708 tn 3 sn 24), retrying
wd0e:  (obsolete) reading fsbn 7510128 of 7510128-7510143 (wd0 bn 12954273; cn 13708 tn 3 sn 24), retrying
wd0e:  uncorrectable data error reading fsbn 7510128 of 7510128-7510143 (wd0 bn 12954273; cn 13708 tn 3 sn 24), retrying
wd0e:  uncorrectable data error reading fsbn 7510134 of 7510128-7510143 (wd0 bn 12954279; cn 13708 tn 3 sn 30), retrying
wd0e:  (obsolete) reading fsbn 7510134 of 7510128-7510143 (wd0 bn 12954279; cn 13708 tn 3 sn 30)
wd0e:  uncorrectable data error reading fsbn 7510368 of 7510368-7510415 (wd0 bn 12954513; cn 13708 tn 7 sn 12), retrying
wd0e:  uncorrectable data error reading fsbn 7510368 of 7510368-7510415 (wd0 bn 12954513; cn 13708 tn 7 sn 12), retrying
wd0e:  uncorrectable data error reading fsbn 7510368 of 7510368-7510415 (wd0 bn 12954513; cn 13708 tn 7 sn 12), retrying
wd0e:  uncorrectable data error reading fsbn 7510368 of 7510368-7510415 (wd0 bn 12954513; cn 13708 tn 7 sn 12), retrying
wd0e:  uncorrectable data error reading fsbn 7510374 of 7510368-7510415 (wd0 bn 12954519; cn 13708 tn 7 sn 18), retrying
wd0e:  uncorrectable data error reading fsbn 7510374 of 7510368-7510415 (wd0 bn 12954519; cn 13708 tn 7 sn 18)

Now, the absence of the "soft error" output there would make block
7509888 indistinguishable from 7490056 or 7510128.
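
As a rough illustration of how such a log has to be read, here is a
minimal Python sketch (userland only, nothing from the tree). Because
the soft error line carries no block number, the sketch can only
attribute it by position, assuming it refers to the most recent line
that ended in "retrying". The regular expressions, the classify() name
and the outcome labels are made up for illustration and are derived
only from the sample lines above, not from the driver source.

```python
import re

# Pattern for the error lines quoted above. This is an assumption based
# on those sample lines only, not on the wd driver's printf formats.
ERR_RE = re.compile(
    r"^(?P<part>wd\d+[a-z]):\s+(?P<what>.*?) reading fsbn (?P<fsbn>\d+) "
    r"of (?P<first>\d+)-(?P<last>\d+) \(.*?\)(?P<retry>, retrying)?$"
)
SOFT_RE = re.compile(r"^(?P<disk>wd\d+): soft error \(corrected\)$")


def classify(lines):
    """Return a list of (fsbn, outcome) pairs in log order.

    outcome is "failed" when an error line lacks the ", retrying" suffix,
    "corrected" when a soft error line follows a retrying line, and
    "unresolved" when retries are followed by neither.
    """
    results = []
    pending = None  # fsbn of the most recent line that ended in ", retrying"
    for line in lines:
        line = line.rstrip()
        m = ERR_RE.match(line)
        if m:
            fsbn = int(m.group("fsbn"))
            if pending is not None and pending != fsbn:
                # The log moved on to another block without saying how
                # the previous one ended.
                results.append((pending, "unresolved"))
            if m.group("retry"):
                pending = fsbn
            else:
                results.append((fsbn, "failed"))
                pending = None
            continue
        if SOFT_RE.match(line) and pending is not None:
            # No block number in this message: attribute it by position.
            results.append((pending, "corrected"))
            pending = None
    if pending is not None:
        results.append((pending, "unresolved"))
    return results
```

Fed the excerpt above, this reports 7509888 (in the 7509872-7509919
transfer) as corrected only because the soft error line happens to
follow it; with that line removed, the same block comes out as
unresolved.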

Also, I think it would be nice if the wd driver kept stats on these
errors -- but this is part of an overall stats desire that's probably
better addressed separately.

--jhawk