Subject: Re: kern/9857: wddone() omits block numbers from soft errors
To: John Hawkinson <jhawk@MIT.EDU>
From: Manuel Bouyer <bouyer@antioche.lip6.fr>
List: netbsd-bugs
Date: 04/11/2000 17:48:28

On Tue, Apr 11, 2000 at 11:29:41AM -0400, John Hawkinson wrote:
> I don't think this is sufficient by any means. Here's an example of
> some stuff lying around in my message buffer ;-):
> 
> wd0e:  uncorrectable data error reading fsbn 7490056 of 7490056-7490057 (wd0 bn 12934201; cn 13686 tn 14 sn 49), retrying
> wd0e:  uncorrectable data error reading fsbn 7490056 of 7490056-7490057 (wd0 bn 12934201; cn 13686 tn 14 sn 49), retrying
> wd0e:  uncorrectable data error reading fsbn 7490056 of 7490056-7490057 (wd0 bn 12934201; cn 13686 tn 14 sn 49), retrying
> wd0e:  uncorrectable data error reading fsbn 7490056 of 7490056-7490057 (wd0 bn 12934201; cn 13686 tn 14 sn 49), retrying
> wd0e:  uncorrectable data error reading fsbn 7490056 of 7490056-7490057 (wd0 bn 12934201; cn 13686 tn 14 sn 49), retrying
> wd0e:  uncorrectable data error reading fsbn 7490056 of 7490056-7490057 (wd0 bn 12934201; cn 13686 tn 14 sn 49)
> wd0e:  uncorrectable data error reading fsbn 7509888 of 7509824-7509903 (wd0 bn 12954033; cn 13707 tn 14 sn 36), retrying
> wd0e:  uncorrectable data error reading fsbn 7509888 of 7509824-7509903 (wd0 bn 12954033; cn 13707 tn 14 sn 36), retrying
> wd0e:  uncorrectable data error reading fsbn 7509888 of 7509824-7509903 (wd0 bn 12954033; cn 13707 tn 14 sn 36), retrying
> wd0e:  uncorrectable data error reading fsbn 7509888 of 7509824-7509903 (wd0 bn 12954033; cn 13707 tn 14 sn 36), retrying
> wd0e:  uncorrectable data error reading fsbn 7509894 of 7509824-7509903 (wd0 bn 12954039; cn 13707 tn 14 sn 42), retrying
> wd0e:  uncorrectable data error reading fsbn 7509894 of 7509824-7509903 (wd0 bn 12954039; cn 13707 tn 14 sn 42)
> wd0e:  uncorrectable data error reading fsbn 7509888 of 7509872-7509919 (wd0 bn 12954033; cn 13707 tn 14 sn 36), retrying
> wd0e:  uncorrectable data error reading fsbn 7509888 of 7509872-7509919 (wd0 bn 12954033; cn 13707 tn 14 sn 36), retrying
> wd0e:  uncorrectable data error reading fsbn 7509888 of 7509872-7509919 (wd0 bn 12954033; cn 13707 tn 14 sn 36), retrying
> wd0e:  uncorrectable data error reading fsbn 7509888 of 7509872-7509919 (wd0 bn 12954033; cn 13707 tn 14 sn 36), retrying
> wd0: soft error (corrected)
> wd0e:  uncorrectable data error reading fsbn 7510128 of 7510128-7510143 (wd0 bn 12954273; cn 13708 tn 3 sn 24), retrying
> wd0e:  uncorrectable data error reading fsbn 7510128 of 7510128-7510143 (wd0 bn 12954273; cn 13708 tn 3 sn 24), retrying
> wd0e:  (obsolete) reading fsbn 7510128 of 7510128-7510143 (wd0 bn 12954273; cn 13708 tn 3 sn 24), retrying
> wd0e:  uncorrectable data error reading fsbn 7510128 of 7510128-7510143 (wd0 bn 12954273; cn 13708 tn 3 sn 24), retrying
> wd0e:  uncorrectable data error reading fsbn 7510134 of 7510128-7510143 (wd0 bn 12954279; cn 13708 tn 3 sn 30), retrying
> wd0e:  (obsolete) reading fsbn 7510134 of 7510128-7510143 (wd0 bn 12954279; cn 13708 tn 3 sn 30)
> wd0e:  uncorrectable data error reading fsbn 7510368 of 7510368-7510415 (wd0 bn 12954513; cn 13708 tn 7 sn 12), retrying
> wd0e:  uncorrectable data error reading fsbn 7510368 of 7510368-7510415 (wd0 bn 12954513; cn 13708 tn 7 sn 12), retrying
> wd0e:  uncorrectable data error reading fsbn 7510368 of 7510368-7510415 (wd0 bn 12954513; cn 13708 tn 7 sn 12), retrying
> wd0e:  uncorrectable data error reading fsbn 7510368 of 7510368-7510415 (wd0 bn 12954513; cn 13708 tn 7 sn 12), retrying
> wd0e:  uncorrectable data error reading fsbn 7510374 of 7510368-7510415 (wd0 bn 12954519; cn 13708 tn 7 sn 18), retrying
> wd0e:  uncorrectable data error reading fsbn 7510374 of 7510368-7510415 (wd0 bn 12954519; cn 13708 tn 7 sn 18)
> 
> Now, the absence of the "soft error" output there would make block
> 7509888 indistinguishable from 7490056 or 7510128.

OK, so maybe the text should be changed (to 'retry succeeded', or something
like that). But I don't think it's worth repeating the block number.
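
For illustration only, here is a minimal sketch of the two message variants
under discussion: the reworded text alone, and the reworded text with the
block number repeated. The helper name format_retry_ok() and the exact
wording are assumptions made for this example; nothing like it exists in
the wd driver.

```c
#include <stdio.h>

/*
 * Hypothetical helper, not part of the wd driver.
 * Formats the "retry succeeded" message for device `dev`.
 * A negative blkno means "do not repeat the block number".
 */
static int
format_retry_ok(char *buf, size_t len, const char *dev, long blkno)
{
	if (blkno < 0)
		return snprintf(buf, len, "%s: retry succeeded", dev);
	return snprintf(buf, len, "%s: retry succeeded (bn %ld)", dev, blkno);
}
```

With the block number included, the message can be matched to the earlier
"retrying" lines even when other output is interleaved; without it, the
reader has to rely on the message order in the buffer.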

> 
> Also, I think it would be nice if the wd driver kept stats on these
> errors -- but this is part of an overall stats desire that's probably
> better addressed separately.

Yes. Disconnect/reselect and tagged command queueing would make this even
more attractive.
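
As a rough sketch of what per-drive error statistics could look like,
assuming a simple set of counters updated once per completed transfer
(the struct and function names below are hypothetical and are not taken
from the wd driver):

```c
#include <string.h>

/* Hypothetical per-drive error counters; not part of the wd driver. */
struct disk_err_stats {
	unsigned long soft_errors;	/* transfers that succeeded after a retry */
	unsigned long hard_errors;	/* transfers that failed after all retries */
	unsigned long retries;		/* total retry attempts */
	long last_bad_blkno;		/* last block that needed a retry, -1 if none */
};

static void
disk_err_stats_init(struct disk_err_stats *st)
{
	memset(st, 0, sizeof(*st));
	st->last_bad_blkno = -1;
}

/* Record the outcome of one transfer that started at block blkno. */
static void
disk_err_stats_record(struct disk_err_stats *st, long blkno,
    unsigned int nretries, int failed)
{
	if (nretries == 0 && !failed)
		return;			/* clean transfer, nothing to count */
	st->retries += nretries;
	st->last_bad_blkno = blkno;
	if (failed)
		st->hard_errors++;
	else
		st->soft_errors++;
}
```

Counters like these would let a tool report how often a drive needed
retries without scanning the message buffer.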

--
Manuel Bouyer, LIP6, Universite Paris VI.           Manuel.Bouyer@lip6.fr
--