Subject: Re: kern/32717: alpha 3.0 install kernel doesn't see scsi disks
To: None <gnats-bugs@netbsd.org>
From: Ken Raeburn <raeburn@MIT.EDU>
List: netbsd-bugs
Date: 05/16/2006 23:33:43
So, I've been stumbling around trying to figure out how to debug this  
one.

I tried tweaking uvm_pagefree to use insert-at-head initially, then  
after N calls, switch to inserting at the tail.  I found that for  
some values of N the system would boot okay, and for others it  
wouldn't.  I also tried switching back to head insertions after some  
other number of calls was reached.  At this point, it appears that if  
I use tail insertions for calls 128712 through 160000, it works; for
130000:160000, it reports SCSI errors, as I originally reported; for  
128716:160000, and at least up through 128500:160000, it recognizes  
the disks but not the disk label.
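
For anyone wanting to reproduce the experiment, the switching logic is roughly the following. This is only a user-space sketch, not the actual patch: struct page, struct freelist, page_free, tail_start and tail_end are made-up names, and the real uvm_pagefree does far more than this.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these names come from the kernel. */
struct page {
    struct page *next;
    struct page *prev;
};

struct freelist {
    struct page *head;
    struct page *tail;
};

static unsigned long free_calls;  /* number of frees seen so far */
static unsigned long tail_start;  /* first call using tail insertion, e.g. 130000 */
static unsigned long tail_end;    /* last call using tail insertion, e.g. 160000 */

/* True if the given call number (1-based) is inside the tail window. */
static bool use_tail(unsigned long call)
{
    return call >= tail_start && call <= tail_end;
}

static void insert_head(struct freelist *fl, struct page *pg)
{
    pg->prev = NULL;
    pg->next = fl->head;
    if (fl->head != NULL)
        fl->head->prev = pg;
    else
        fl->tail = pg;
    fl->head = pg;
}

static void insert_tail(struct freelist *fl, struct page *pg)
{
    pg->next = NULL;
    pg->prev = fl->tail;
    if (fl->tail != NULL)
        fl->tail->next = pg;
    else
        fl->head = pg;
    fl->tail = pg;
}

/* Free a page, picking head or tail insertion by call count. */
static void page_free(struct freelist *fl, struct page *pg)
{
    free_calls++;
    if (use_tail(free_calls))
        insert_tail(fl, pg);
    else
        insert_head(fl, pg);
}
```

Moving tail_start by a few calls changes which physical pages sit at the front of the free list when the drivers allocate their buffers, which is presumably why the outcome is so sensitive to it.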

I also hacked uvm_pagefree to scribble the pattern 7d,5d over the  
page (and clear PG_ZERO) before putting it on the free list.  I also  
enabled DEBUG and DIAGNOSTIC, but they don't seem to have found  
anything interesting.
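
The scribbling amounts to something like the sketch below. It assumes "7d,5d" means alternating 0x7d and 0x5d bytes starting at offset 0; the function names are invented for illustration, and the PG_ZERO handling is left out since it depends on kernel structures.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAT_EVEN 0x7d  /* assumed byte at even offsets */
#define PAT_ODD  0x5d  /* assumed byte at odd offsets */

/* Fill a buffer with the alternating marker pattern. */
static void scribble(uint8_t *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        buf[i] = (i & 1) ? PAT_ODD : PAT_EVEN;
}

/* True if every byte still holds the marker, i.e. nothing wrote here. */
static bool still_scribbled(const uint8_t *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (buf[i] != ((i & 1) ? PAT_ODD : PAT_EVEN))
            return false;
    return true;
}
```

A buffer that still passes still_scribbled after a supposedly successful read was never written by the device at all, which is different from being written with bad data.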

When I start the tail insertions at 128712, the storage for the
first disk label is at kernel address 0xfffffc003fffc040 and gets
filled with a reasonable disk label.  When I start the tail
insertions at 128716, the disk label is supposed to be at
0xfffffc0040004040 and is filled with the 7d,5d pattern I used.  The
low 32 bits of that address are just past the 1G mark.
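
The arithmetic behind that observation can be checked with a few lines of C. The sketch assumes the addresses are in the alpha direct-mapped segment starting at 0xfffffc0000000000 and that pages are 8K; the macro and function names are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define K0SEG_BASE 0xfffffc0000000000ULL  /* assumed direct-map base */
#define ONE_GIB    0x40000000ULL          /* 1G boundary */
#define PAGE_BYTES 8192ULL                /* assumed page size */

/* Strip the direct-map base to get the physical address. */
static uint64_t k0seg_to_phys(uint64_t va)
{
    return va - K0SEG_BASE;
}

/* True if the buffer's physical address is at or above 1G. */
static bool above_1g(uint64_t va)
{
    return k0seg_to_phys(va) >= ONE_GIB;
}

/* Distance between two addresses in whole pages (b must be >= a). */
static uint64_t pages_apart(uint64_t a, uint64_t b)
{
    return (b - a) / PAGE_BYTES;
}
```

Under those assumptions the two label buffers are four pages apart, matching the four-call difference between the two starting points, and only the second one lies above 1G.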

This machine has 1280M of memory, 256M in bank A and 1024M in bank  
B.  I suppose it's possible that some of the memory chips are bad in  
a way that doesn't show up writing 7d,5d from the kernel but causes  
writes from the SCSI controller to fail completely, consistently, and  
quietly; I haven't figured out how to run the console memory tester  
yet.  But could there be a problem in telling the PCI SCSI controller  
how to access some of the memory?

I'm also looking at possible hardware issues.  Pulling the 1G memory  
makes everything work fine, so it appears that the 256M is not bad,  
or at least not completely broken; running with just the 1G in bank A  
also works fine.  Swapping the banks (A=1024, B=256) leaves it
breaking in the same way as it does now.  Reseating the SCSI  
controller card also makes no difference.

With just the 1G bank installed, I can boot and run the install CD.   
I assume once it finishes, I'll either have to install a custom  
kernel using tail insertion (which would worry me, since I don't know  
what the actual problem is or whether it might bite me in some other  
way), or keep running with an empty memory bank when I seem to have  
two banks' worth of working memory...  Any other suggestions for
things I can try?

Ken