Subject: Re: kern/10016: ddb can get stuck in infinite uvm_fault()ing
To: None <gnats-bugs@gnats.netbsd.org>
From: John Hawkinson <jhawk@mit.edu>
List: netbsd-bugs
Date: 05/04/2000 15:06:59
OK, I feel like I'm missing something with respect to how this is supposed
to work.

uvm_fault() is faulting inside db_read_bytes(), because the
code segment is invalid:

_db_read_bytes+0x10:    movb    0(%ecx),%al
db> t
_db_read_bytes(0,4,c0548cf8,c04dab64,c0548d38) at _db_read_bytes+0x10
_db_get_value(0,4,0,5,0) at _db_get_value+0x18
_db_stop_at_pc(c04dab64,c0548d38) at _db_stop_at_pc+0x1c2
_db_trap(5,0,1,ffffffff,c0548e30) at _db_trap+0x39
_kdb_trap(5,0,c0548dac) at _kdb_trap+0xb4
_trap() at _trap+0x168
--- trap (number 5) ---
(null)(ff2) at 0
_pnpbios_getnode(1,c0548e30,c06b9600,d2) at _pnpbios_getnode+0x6a
_pnpbios_attach(c06bdfc0,c06afe80,c0548ed4,c06afe80,c0548ed4) at _pnpbios_attach
+0x223
_config_attach(c06bdfc0,c0458fb8,c0548ed4,c031d7b8,c06bdfc0) at _config_attach+0
x30a
_config_found_sm(c06bdfc0,c0548ed4,c031d7b8,0) at _config_found_sm+0x29
_mainbus_attach(0,c06bdfc0,0,c06bdfc0,0) at _mainbus_attach+0x41
_config_attach(0,c0458164,0,0,c04b9694) at _config_attach+0x30a
_config_rootfound(c02f86c9,0) at _config_rootfound+0x37
_cpu_configure(bfeff000,c0548fa8,c01a202b,c0546010,546000) at _cpu_configure+0x1
e
_configure(c0546010,546000,54d000,90,500007ff) at _configure+0x5a
_main(0,0,0,0,0) at _main+0x317
db> show reg
es            0x160010
ds                0x10
edi                0x5
esi                0x4
ebp         0xc0548cdc  _end+0x6bdbc
ebx         0xc0548cf9  _end+0x6bdd9
edx                0x2
ecx                0x1
eax                0x3
eip         0xc02fbb14  _db_read_bytes+0x10
cs                 0x8
eflags         0x10006
esp         0xc0548cd8  _end+0x6bdb8
ss          0xc04d0010  _playstats+0x6270


So one bug appears to be that ddb needs to do some checking before
blindly assuming it can deal here. 32/16 etc./ugh. Ignoring that
for the moment...

A page fault is incurred, handled by the i386 trap(), which calls
uvm_fault() which fails, and so we print the message, and loop to the
"we_re_toast:" label. Said label special-cases DDB and KGDB (for
breakpoint handling?) and in the ddb case, invokes kdb_trap() and then
returns, rather than panic()ing.

[      aside: I find  it ironic that if you've  built with DDB enabled
    you don't see the trap data that you'd otherwise get printf()d out
    before the panic("trap") [i.e.  arch/i386/i386/trap.c ll 301-307],
    but if you haven't built  with DEBUG you cannot enable the display
    of this  upon entry to trap()  via setting "trapdebug".  So if you
    build without DDB and without DEBUG,  you get it, but if you build
    with DDB and without debug, you don't, but if you build with DEBUG
    and set trapdebug, you do get it. It seems odd, since one presumes
    the normal  case for  people trying to  debug things is  they have
    kernels  build with DDB  (for ocassional  debugging) but  not with
    DEBUG set.
    
       I don't   really  understand   why  the  "if   (trapdebug)"  is
    conditionalized on  DEBUG -- it does  not seem like  one cmpl upon
    every  trap()  invokation is  particularly  expensive,  so I'd  be
    curious  if there's  objection to  removing the  #ifdef  DEBUG (ll
    262).
]

In any event, kdb_trap() (in db_interface.c) prints the trap type
information and then calls the MI db_trap(). db_trap() calls
db_stop_at_pc() to see if there's a ddb breakpoint in effect (or an
"until", etc.), and if not, calls db_restart_at_pc().

For the case of a pagefault, this causes an infinite loop since
the pagefault didn't happen on a breakpoint, and db_restart_at_pc() just
results in trying to re-execute the instruction which triggers the
pagefault. There's no opportunity for user interaction.

I'm unclear on whether I'm misunderstanding the sequence of events
of it they're just that broken :-(. 

The following workaround seems to be helpful:

Index: db_trap.c
===================================================================
RCS file: /cvsroot/syssrc/sys/ddb/db_trap.c,v
retrieving revision 1.14
diff -c -3 -2 -p -r1.14 db_trap.c
*** db_trap.c	1999/04/12 20:38:21	1.14
--- db_trap.c	2000/05/04 18:54:23
***************
*** 47,82 ****
--- 47,85 ----
  void
  db_trap(type, code)
  	int	type, code;
  {
  	boolean_t	bkpt;
  	boolean_t	watchpt;
  
  	bkpt = IS_BREAKPOINT_TRAP(type, code);
  	watchpt = IS_WATCHPOINT_TRAP(type, code);
  
  	if (db_stop_at_pc(DDB_REGS, &bkpt)) {
  	    if (db_inst_count) {
  		db_printf("After %d instructions (%d loads, %d stores),\n",
  			  db_inst_count, db_load_count, db_store_count);
  	    }
  	    if (curproc != NULL) {
  		if (bkpt)
  		    db_printf("Breakpoint in %s at\t", curproc->p_comm);
  		else if (watchpt)
  		    db_printf("Watchpoint in %s at\t", curproc->p_comm);
  		else
  		    db_printf("Stopped in %s at\t", curproc->p_comm);
  	    } else if (bkpt)
  		db_printf("Breakpoint at\t");
  	    else if (watchpt)
  		db_printf("Watchpoint at\t");
  	    else
  		db_printf("Stopped at\t");
  	    db_dot = PC_REGS(DDB_REGS);
  	    db_print_loc_and_inst(db_dot);
  
  	    db_command_loop();
+ 	} else if (type==T_PAGEFLT) {
+ 	  db_printf("pagefault stop\t");
+ 	  db_command_loop();
  	}
  
  	db_restart_at_pc(DDB_REGS, watchpt);
  }

it certainly prevents the endless looping of uvm_fault()s
and allowed me to get the above stack trace.

I guess it's probably not legit to use "T_PAGEFLT" in MI code.
I suppose I could define IS_PAGEFAULT_TRAP() in the MD db_machdep.h
and then #ifdef the db_trap() code on IS_PAGEFAULT_TRAP().

But I'm not convinced that the code is in-principal correct (i.e.
ignoring the prev. paragraph's issue). I suppose it's certainly better
than the status quo.

Feedback would be appreciated.

--jhawk