Subject: re: crash dump failing on machine with 4GB
To: Chris Ross <cross+netbsd@distal.com>
From: matthew green <mrg@eterna.com.au>
List: port-sparc64
Date: 09/27/2007 03:51:54
   
   On Sep 26, 2007, at 12:07, Chris Ross wrote:
   >   Is this a known issue?  I have a sparc64 machine with 4GB of memory.
   
      Not unexpectedly, this appears to be an int overflow issue.   
   Making the following change:
   
   --- sys/arch/sparc64/sparc64/machdep.c  11 Sep 2007 16:00:06 -0000      1.202
   +++ sys/arch/sparc64/sparc64/machdep.c  26 Sep 2007 17:24:50 -0000
   @@ -759,7 +759,7 @@
            for (mp = &phys_installed[0], j = 0; j < phys_installed_size;
                            j++, mp = &phys_installed[j]) {
   -               unsigned i = 0, n;
   +               unsigned long i = 0, n;
                    paddr_t maddr = mp->start;
   #if 0
   @@ -781,8 +781,7 @@
                                    printf("%ld ", todo / (1024*1024));
                            pmap_kenter_pa(dumpspace, maddr, VM_PROT_READ);
                            pmap_update(pmap_kernel());
   -                       error = (*dump)(dumpdev, blkno,
   -                                       (void *)dumpspace, (int)n);
   +                       error = (*dump)(dumpdev, blkno, (void *)dumpspace, n);
                            pmap_kremove(dumpspace, n);
                            pmap_update(pmap_kernel());
                            if (error)

i guess i was expecting something like this.  you may be the first
person to truly try crashdumps on a 4GB machine :-)
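
A minimal sketch of the truncation being discussed, not the kernel code itself. It assumes, purely for illustration, one physical memory segment of exactly 4 GiB; the helper names remaining_32 and remaining_64 are hypothetical. On sparc64 (LP64) `unsigned` is 32 bits wide and `unsigned long` is 64 bits wide, so fixed-width types stand in for them here.

```c
#include <stdint.h>

/*
 * Hypothetical helper: bytes still to dump, kept in a 32-bit counter,
 * as with the original `unsigned i = 0, n;` declaration.
 * The conversion to uint32_t discards the upper 32 bits.
 */
static uint32_t remaining_32(uint64_t seg_size, uint64_t done)
{
    return (uint32_t)(seg_size - done);
}

/*
 * Hypothetical helper: the same computation in a 64-bit counter,
 * as with the patched `unsigned long i = 0, n;` declaration.
 */
static uint64_t remaining_64(uint64_t seg_size, uint64_t done)
{
    return seg_size - done;
}
```

With a segment size of 4ULL << 30 (2^32 bytes), remaining_32() yields 0 while remaining_64() yields the full 4294967296, which matches the "n was 0" observation in the quoted text.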
   
      causes it to produce a new error.  n is capped at 8192 by other  
   code, so the latter segment above is probably not even an issue.  I  
   don't know enough about the lower-level device code to know what I'm  
   hitting, so I thought I'd ask.  This wasn't getting hit before  
   because n was 0, due to the overflow.

the 8192 cap is almost certainly there because that is the sparc64 page size.
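
A rough sketch of why a per-call cap of 8192 makes the (int) cast in the second hunk harmless: once each chunk is clamped to one page, the value passed to the dump routine always fits in an int. The constant DUMP_CHUNK and the helpers next_chunk and count_chunks are hypothetical names for illustration, and the loop is a simplification, not the actual dump loop in machdep.c.

```c
#include <stdint.h>

/* Assumed from the thread: chunks are capped at the 8192-byte page size. */
#define DUMP_CHUNK 8192u

/* Hypothetical helper: size of the next chunk, clamped to one page. */
static uint64_t next_chunk(uint64_t seg_size, uint64_t done)
{
    uint64_t n = seg_size - done;
    return n > DUMP_CHUNK ? DUMP_CHUNK : n;
}

/* Hypothetical helper: number of dump calls needed to cover a segment. */
static uint64_t count_chunks(uint64_t seg_size)
{
    uint64_t done = 0, calls = 0;

    while (done < seg_size) {
        done += next_chunk(seg_size, done);  /* never exceeds DUMP_CHUNK */
        calls++;
    }
    return calls;
}
```

Under these assumptions a 4 GiB segment takes 2^32 / 2^13 = 524288 calls, each with a length of at most 8192, so only the running counters need to be 64 bits wide.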
   
      I'm seeing now:
   
   db> reboot 0x104
   Frame pointer is at 0xe0016651
   Call traceback:
   13ea690(1, d, 0, e00171e0, ffffffffffffffff, 0, e0016731) fp = e0016731
   10be120(104, 0, e00170a8, 1860800, 1860b88, 188c7a8, e00167f1) fp = e00167f1
   10bd658(1, 0, 4, e0017170, e0017298, 188c7a8, e00168c1) fp = e00168c1
   10bdc88(180f2c8, 4, 0, 0, e0017388, 0, e0016a11) fp = e0016a11
   10c163c(13f3f08, 0, 2, 1898819, 0, 0, e0016b01) fp = e0016b01
   13f5264(0, 0, 0, 0, 4, 1000000, e0016bd1) fp = e0016bd1
   13f2dd8(101, e0017b60, 98b31e1fa, 957d95e00000000, 1d00000000, 18a4800, e0017131) fp = e0017131
   1008c1c(e0017b60, 101, 13f3f00, 1d0006, 400, 187a998, e00172b1) fp = e00172b1
   13c234c(189b950, 187f3e0, ffffffff, 0, 1818c00, 1d, e0017491) fp = e0017491
   13c29a8(61c4800, e0017e0c, a847c1a, 7477, ffff, 40, e0017551) fp = e0017551
   100911c(0, 0, e0017ed0, 1877998, 13c2960, 1000000, e0017621) fp = e0017621
   1288640(0, 0, 4, 6, 187a800, 1000000, ffbd561) fp = ffbd561
   
   dumping to dev 7,1 offset 4310231
   dump 4096 esiop0: unable to load cmd DMA map: -1i/o error
   sd0(esiop0:0:0:0): polling command not done
   panic: scsipi_execute_xs
   cpu0: kdb breakpoint at 13f3f00
   Stopped in pid 0.2 (system) at  netbsd:cpu_Debugger+0x4:        nop
   db>


can you get a stack trace with symbols?  or use gdb to look up
the symbols for the addresses in this traceback?


.mrg.