Subject: Re: faulty hard disk driver patch in pcmcia patches
To: John Kohl <jtk@kolvir.arlington.ma.us>
From: Chris G Demetriou <Chris_G_Demetriou@auchentoshan.pdl.cs.cmu.edu>
List: port-i386
Date: 11/19/1996 00:44:45
> >>>>> "enami" == enami tsugutomo <enami@ba2.so-net.or.jp> writes:
> 
> enami> John Kohl <jtk@kolvir.arlington.ma.us> writes:
> >> I've excised the bad wd.c patch from the patch kits on ftp.netbsd.org and
> >> sipb.mit.edu.
> 
> enami> Then there is no special patch for wd.c, right?  But my laptop (DEC
> enami> HiNote Ultra II) reports "hard error reading ..." on disk access after
> enami> resume if using plain wd.c.
> 
> I'm not totally certain of why this is (I don't know enough about the
> hardware involved), but I'm told that with this patch some drives will
> lose their geometry translation after a reset/unwedge.  After that, the
> driver will be using one set of cylinder/head/sector geometries and the
> disk will be using another, leading to file system data corruption.
> 
> If it works for you, then you're lucky/fine.  It does cause problems for
> other systems.
> 
> Charles Hannum has mentioned in conversation that he'd like to see a
> proper fix for the problem--any takers?

My patch for the problem, which i sent to Frank and Charles after i
returned from OSDI (actually, i think i sent it about a week and a
half ago), is included below.

I've not received any "good, bad, or ugly" comments about it that i
recall.  (if there were any, the folks who sent them should feel free
to remind me... 8-)

It's not particularly pretty, but it should get the job done.  Without
it, the ThinkPad 701C that i am borrowing displays
apparently-infinitely repeating "hard errors" after return from
suspend mode.


Note that the patch as given -- which simply invokes wdcunwedge() at a
useful time (after seeing a few errors) -- _should_ deal properly with
the geometry issue (at least, if my naive understanding of the issue
is correct).  This is because wdcunwedge() transitions the wdc state
machine into a state from which it will try to reload all of the
drives' geometries, etc.  At least, that's how I read it.  (I'm not a
big PC hardware goo-roo, i just needed to get something working in my
hotel room far from home, and it looked like the 'correct' solution
given the previous complaints i'd heard about not reloading the
drives' geometries.

Anyway, below is the mail i sent to Charles and Frank and a couple of
others...  If the patch works for you, great.  If it eats your hard
drive, well, i'm sorry, don't try and blame me for it, and that
probably isn't me laughing in the background...  8-)



chris
=============================================================================
To: mycroft@netbsd.org, fvdl@netbsd.org
cc: cgd@ux2.sp.cs.cmu.edu, perry@netbsd.org, thorpej@netbsd.org
Subject: the wd.c patch that i mentioned.
Date: Fri, 08 Nov 1996 19:44:35 -0500
Message-ID: <27599.847500275@ux2.sp.cs.cmu.edu>
From: Chris G Demetriou <Chris_G_Demetriou@ux2.sp.cs.cmu.edu>

So, as previously mentioned, while I was at OSDI (and previous) i was
seeing bad thing happening on my notebook's wd disk after resume from
suspend.  The errors that I was seeing were:

	(1) hard drive would get infinite errors on resume, when
	    trying to read blocks, until power cycle.  you couldn't
	    do _anything_ that wasn't already in core.

	(2) same for writes.

	(3) (2) could occasionally cause corruption, presumably
	    because important data hadn't been written out
	    completely before the suspend.

I didn't notice anything else, but that doesn't mean other things
weren't happening.


Below is the patch that fixed it for me.  After this patch was in
place, I didn't see any of the problems mentioned above.

Also included is a change which renames the "OPEN" state to "READY,"
because as far as i can tell from looking at the driver, "OPEN" has no
real meaning or relation to a device being opened, it confused me for
a bit, and could confuse others.


Anyway, the basic change that the patch makes is:

	If we see half of the number of allowed errors, rounded up,
	(i.e. (WDIORETRIES + 1) / 2) when trying a command, then
	try to unwedge the controller in case it got spooged, but
	keep the error count the same so we don't do try to reset
	it again for the same reason.

Rather simple; there could be a seperate counter or #define which does
something similar, but i figured that this was "good enough" at least
for a first cut to see what you folks thought of it and to fix my
immediate problem.

Note that this resets the controller, recalibrates the drives, and
sets up their geometry and multimode settings as appropriate, so the
concerns about trashing the drives' geometries should be satisfied.
I understand that resetting the controller may not be the best thing
to do, but it seems safe, and in the case where real errors are being
noticed it shouldn't hurt, should it?

The "wedgie" error message was actually observed, when it was enabled,
in the cases where the driver would go into "infinite error mode"
before the patch was applied.  I #if'd it out since it's a common
message, and hitting that case isn't necessarily an error (and
therefore shouldn't be indicated as one).


thoughts?  If this patch is inappropriate, i'd appreciate knowing why,
and being given enough information so that I might be able to fix it.
Without this patch, my laptop was unusable after suspends...


cgd
===================================================================
Index: wd.c
===================================================================
RCS file: /cvsroot/src/sys/dev/isa/wd.c,v
retrieving revision 1.154
diff -c -r1.154 wd.c
*** wd.c	1996/11/07 05:23:07	1.154
--- wd.c	1996/11/09 00:29:27
***************
*** 96,102 ****
  #define	GEOMETRY_WAIT	3		/* done uploading geometry */
  #define	MULTIMODE	4		/* set multiple mode */
  #define	MULTIMODE_WAIT	5		/* done setting multiple mode */
! #define	OPEN		6		/* done with open */
  	int sc_mode;			/* transfer mode */
  #define	WDM_PIOSINGLE	0		/* single-sector PIO */
  #define	WDM_PIOMULTI	1		/* multi-sector PIO */
--- 96,102 ----
  #define	GEOMETRY_WAIT	3		/* done uploading geometry */
  #define	MULTIMODE	4		/* set multiple mode */
  #define	MULTIMODE_WAIT	5		/* done setting multiple mode */
! #define	READY		6		/* ready for use */
  	int sc_mode;			/* transfer mode */
  #define	WDM_PIOSINGLE	0		/* single-sector PIO */
  #define	WDM_PIOMULTI	1		/* multi-sector PIO */
***************
*** 547,553 ****
  	bp = wd->sc_q.b_actf;
      
  	if (wdc->sc_errors >= WDIORETRIES) {
! 		wderror(wd, bp, "hard error");
  		bp->b_error = EIO;
  		bp->b_flags |= B_ERROR;
  		wdfinish(wd, bp);
--- 547,553 ----
  	bp = wd->sc_q.b_actf;
      
  	if (wdc->sc_errors >= WDIORETRIES) {
! 		wderror(wd, bp, "wdcstart hard error");
  		bp->b_error = EIO;
  		bp->b_flags |= B_ERROR;
  		wdfinish(wd, bp);
***************
*** 555,561 ****
  	}
  
  	/* Do control operations specially. */
! 	if (wd->sc_state < OPEN) {
  		/*
  		 * Actually, we want to be careful not to mess with the control
  		 * state if the device is currently busy, but we can assume
--- 555,561 ----
  	}
  
  	/* Do control operations specially. */
! 	if (wd->sc_state < READY) {
  		/*
  		 * Actually, we want to be careful not to mess with the control
  		 * state if the device is currently busy, but we can assume
***************
*** 783,789 ****
  	}
      
  	/* Is it not a transfer, but a control operation? */
! 	if (wd->sc_state < OPEN) {
  		if (wdcontrol(wd) == 0) {
  			/* The drive is busy.  Wait. */
  			return 1;
--- 783,789 ----
  	}
      
  	/* Is it not a transfer, but a control operation? */
! 	if (wd->sc_state < READY) {
  		if (wdcontrol(wd) == 0) {
  			/* The drive is busy.  Wait. */
  			return 1;
***************
*** 811,821 ****
  		if (bp->b_flags & B_FORMAT)
  			goto bad;
  #endif
- 	
- 		if (++wdc->sc_errors < WDIORETRIES)
- 			goto restart;
- 		wderror(wd, bp, "hard error");
  
  #ifdef B_FORMAT
  	bad:
  #endif
--- 811,828 ----
  		if (bp->b_flags & B_FORMAT)
  			goto bad;
  #endif
  
+ 		if (++wdc->sc_errors < WDIORETRIES) {
+ 			if (wdc->sc_errors == (WDIORETRIES + 1) / 2) {
+ #if 0
+ 				wderror(wd, NULL, "wedgie");
+ #endif
+ 				wdcunwedge(wdc);
+ 				return 1;
+ 			}
+ 			goto restart;
+ 		}
+ 		wderror(wd, bp, "wdcintr hard error");
  #ifdef B_FORMAT
  	bad:
  #endif
***************
*** 1129,1135 ****
  	case MULTIMODE:
  	multimode:
  		if (wd->sc_mode != WDM_PIOMULTI)
! 			goto open;
  		outb(wdc->sc_iobase+wd_seccnt, wd->sc_multiple);
  		if (wdcommandshort(wdc, wd->sc_drive, WDCC_SETMULTI) != 0) {
  			wderror(wd, NULL, "wdcontrol: setmulti failed (1)");
--- 1136,1142 ----
  	case MULTIMODE:
  	multimode:
  		if (wd->sc_mode != WDM_PIOMULTI)
! 			goto ready;
  		outb(wdc->sc_iobase+wd_seccnt, wd->sc_multiple);
  		if (wdcommandshort(wdc, wd->sc_drive, WDCC_SETMULTI) != 0) {
  			wderror(wd, NULL, "wdcontrol: setmulti failed (1)");
***************
*** 1144,1153 ****
  			goto bad;
  		}
  		/* fall through */
! 	case OPEN:
! 	open:
  		wdc->sc_errors = 0;
! 		wd->sc_state = OPEN;
  		/*
  		 * The rest of the initialization can be done by normal means.
  		 */
--- 1151,1160 ----
  			goto bad;
  		}
  		/* fall through */
! 	case READY:
! 	ready:
  		wdc->sc_errors = 0;
! 		wd->sc_state = READY;
  		/*
  		 * The rest of the initialization can be done by normal means.
  		 */
***************
*** 1500,1506 ****
  	part = WDPART(dev);
  
  	/* Make sure it was initialized. */
! 	if (wd->sc_state < OPEN)
  		return ENXIO;
  
  	wdc = (void *)wd->sc_dev.dv_parent;
--- 1507,1513 ----
  	part = WDPART(dev);
  
  	/* Make sure it was initialized. */
! 	if (wd->sc_state < READY)
  		return ENXIO;
  
  	wdc = (void *)wd->sc_dev.dv_parent;