Subject: re: crash dump failing on machine with 4GB
To: Chris Ross <cross+netbsd@distal.com>
From: matthew green <mrg@eterna.com.au>
List: port-sparc64
Date: 09/28/2007 10:04:09
   
   On Sep 26, 2007, at 14:22, Chris Ross wrote:
   > On Sep 26, 2007, at 13:51, matthew green wrote:
   >> can you get a stack trace with symbols?  or use gdb to
   >> find them out from these values?
   >
   >   Of course.  Here's a backtrace after the failed "reboot 0x104"  
   > used to cause the dump attempt.
   >
   > dumping to dev 7,1 offset 4310231
   > dump 4096 esiop0: unable to load cmd DMA map: -1i/o error
   > sd0(esiop0:0:0:0): polling command not done
   > panic: scsipi_execute_xs
   > cpu0: kdb breakpoint at 13f3e80
   > Stopped in pid 0.2 (system) at  netbsd:cpu_Debugger+0x4:        nop
   > db> bt
   > scsipi_execute_xs(5f89c00, e0016d96, a, 0, 0, 4) at netbsd:scsipi_execute_xs+0x318
   > sd_flush(746fc00, 103, 0, 0, 0, 8000000000001034) at netbsd:sd_flush+0x84
   > sd_shutdown(746fc00, 5, 0, 0, e0016fb8, 0) at netbsd:sd_shutdown+0x18
   > doshutdownhooks(161eaa8, 5, 0, 10, 1857800, f) at netbsd:doshutdownhooks+0x30
   
      So, does anyone have any suggestions on where I should go from  
   here?  I looked into the "unable to load cmd DMA map" error, which is  
   returning an EIO from a call to bus_dmamap_load().  Should I try to  
   track down into that function (via the macro, etc) and figure out if  
   it's returning an EIO for some reason relating to the physical memory  
   address it's given?  Or, can someone look at the code in doshutdown()  
   to see if the physical memory mapping calls "look right"?  I was  
   looking at amd64, figuring that it would be more likely to have this  
   functionality working, and I notice that the pmap_* call(s) it uses  
   are different, but that may not be unusual...

you mean its bus_dmamap_load() is different?  yeah, that is to
be expected.

hmm, i don't see how sparc64 bus_dmamap_load() could return EIO?
see machdep.c:_bus_dmamap_load().  oh, the message above says it
returns -1... which also seems not possible... 

is the above text exactly what it says?  i don't see where the
"i/o error" comes from?  there should be a newline after the -1.
(perhaps you changed this?)
   
      Thanks.  I know not everyone has a 4GB sparc64 to play with, so  
   I'm happy to work on this, but I will need to get this machine into  
   production in the not-too-distant future, so need to keep moving.
   
   ps,
      Is the last argument to sd_flush(), quoted in the backtrace above,  
   indicative of a problem?  Just looks "odd" compared to the rest of  
   the parameters.
   

it is just garbage on the stack.  looking at sd.c:

	static int sd_flush(struct sd_softc *, int);

so only the first two arguments are relevant.
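
as a rough illustration of reading a trace line against the prototype:
every frame in the backtrace above shows six values, whatever the
function actually declares.  the sketch below keeps only the slots a
prototype declares and ignores the rest.  the names frame_args,
format_declared_args and NSLOTS are made up for this example and are
not from ddb or sd.c.

```c
#include <assert.h>
#include <stdio.h>

/* assumption: six values are shown per frame, as in the backtrace above */
#define NSLOTS 6

/* hypothetical stand-in for the six values printed for one frame */
struct frame_args {
	unsigned long long slot[NSLOTS];
};

/*
 * Format only the first nargs slots, i.e. the ones the callee's
 * prototype declares.  Returns the number of characters written,
 * or -1 if buf is too small.
 */
static int
format_declared_args(const struct frame_args *fa, int nargs,
    char *buf, size_t len)
{
	size_t off = 0;
	int i, n;

	assert(nargs >= 0 && nargs <= NSLOTS);
	if (len == 0)
		return -1;
	buf[0] = '\0';
	for (i = 0; i < nargs; i++) {
		/* separate values with ", " after the first one */
		n = snprintf(buf + off, len - off, "%s%llx",
		    i == 0 ? "" : ", ", fa->slot[i]);
		if (n < 0 || (size_t)n >= len - off)
			return -1;
		off += (size_t)n;
	}
	return (int)off;
}
```

fed the six sd_flush() values from the trace with nargs set to 2, this
keeps just the softc pointer and the flags word; the trailing
8000000000001034 never enters the picture.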