Subject: Re: kern/32717: alpha 3.0 install kernel doesn't see scsi disks
To: None <kern-bug-people@netbsd.org, gnats-admin@netbsd.org,>
From: Ken Raeburn <raeburn@MIT.EDU>
List: netbsd-bugs
Date: 05/17/2006 03:35:01
The following reply was made to PR kern/32717; it has been noted by GNATS.

From: Ken Raeburn <raeburn@MIT.EDU>
To: gnats-bugs@netbsd.org
Cc: kern-bug-people@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: kern/32717: alpha 3.0 install kernel doesn't see scsi disks
Date: Tue, 16 May 2006 23:33:43 -0400

 So, I've been stumbling around trying to figure out how to debug this  
 one.
 
 I tried tweaking uvm_pagefree to use insert-at-head initially, then  
 after N calls, switch to inserting at the tail.  I found that for  
 some values of N the system would boot okay, and for others it  
 wouldn't.  I also tried switching back to head insertions after some  
 other number of calls was reached.  At this point, it appears that if  
 I use tail insertions for calls 128712 through 160000, it works; for
 130000:160000, it reports SCSI errors, as I originally reported; for  
 128716:160000, and at least up through 128500:160000, it recognizes  
 the disks but not the disk label.
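 
 To make the experiment concrete, here is a rough user-space sketch of
 the insertion policy described above. It is only a model, not the
 actual uvm_pagefree patch: the names make_free_list and page_free are
 made up for illustration, and treating the call range as inclusive on
 both ends is an assumption.

```python
from collections import deque


def make_free_list(tail_start, tail_end):
    """Model a free list whose insertion end depends on the call count.

    Hypothetical helper, not kernel code. Calls numbered
    tail_start..tail_end (assumed inclusive) insert at the tail;
    every other call inserts at the head.
    """
    free_list = deque()
    calls = 0

    def page_free(page):
        nonlocal calls
        calls += 1
        if tail_start <= calls <= tail_end:
            free_list.append(page)      # tail insertion
        else:
            free_list.appendleft(page)  # head insertion

    return free_list, page_free


# Small demonstration: calls 3 and 4 go to the tail, the rest to the head.
free_list, page_free = make_free_list(3, 4)
for page in range(1, 7):
    page_free(page)
print(list(free_list))
```

 Moving the start of the tail range by a few calls changes which pages
 end up near the head of the list, which is the knob being turned in
 the tests above.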
 
 I also hacked uvm_pagefree to scribble the pattern 7d,5d over the  
 page (and clear PG_ZERO) before putting it on the free list.  I also  
 enabled DEBUG and DIAGNOSTIC, but they don't seem to have found  
 anything interesting.
 
 When I start the tail insertions at 128712, the storage for the
 first disk label is at kernel address 0xfffffc003fffc040 and gets
 filled with a reasonable disk label.  When I start the tail
 insertions at 128716, the disk label is supposed to be at
 0xfffffc0040004040 and is filled with the 7d,5d pattern I used.  The  
 low 32 bits of that address (0x40004040) are just past the 1G mark.
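 
 The arithmetic behind that observation can be checked with a short
 script. This is just a sketch; low32 and offset_from_1g are
 illustrative helper names, and it looks only at the low 32 bits of
 the two kernel addresses quoted above.

```python
ONE_GB = 1 << 30  # 0x40000000


def low32(addr):
    """Return the low 32 bits of an address."""
    return addr & 0xFFFFFFFF


def offset_from_1g(addr):
    """Signed distance of the low 32 bits from the 1G mark."""
    return low32(addr) - ONE_GB


good_label = 0xFFFFFC003FFFC040  # label buffer that was filled correctly
bad_label = 0xFFFFFC0040004040   # label buffer left holding the 7d,5d pattern

for name, addr in (("good", good_label), ("bad", bad_label)):
    print(name, hex(low32(addr)), offset_from_1g(addr))
```

 The working buffer sits 0x3fc0 bytes below the 1G mark and the
 failing one 0x4040 bytes above it; the two addresses are 0x8000
 bytes apart.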
 
 This machine has 1280M of memory, 256M in bank A and 1024M in bank  
 B.  I suppose it's possible that some of the memory chips are bad in  
 a way that doesn't show up writing 7d,5d from the kernel but causes  
 writes from the SCSI controller to fail completely, consistently, and  
 quietly; I haven't figured out how to run the console memory tester  
 yet.  But could there be a problem in telling the PCI SCSI controller  
 how to access some of the memory?
 
 I'm also looking at possible hardware issues.  Pulling the 1G memory  
 makes everything work fine, so it appears that the 256M is not bad,  
 or at least not completely broken; running with just the 1G in bank A  
 also works fine.  Swapping the banks (A=1024, B=256) leaves it
 breaking in the same way as it does now.  Reseating the SCSI  
 controller card also makes no difference.
 
 With just the 1G bank installed, I can boot and run the install CD.   
 I assume once it finishes, I'll either have to install a custom  
 kernel using tail insertion (which would worry me, since I don't know  
 what the actual problem is or whether it might bite me in some other  
 way), or keep running with an empty memory bank when I seem to have  
 two banks' worth of working memory...  Any other suggestions for
 things I can try?
 
 Ken