Thanks for the reply. Unfortunately the system is in another timezone so no easy way for me to get at the console. That would have been first prize. I was hoping for a way to do a one-shot query to get that kind of information.

I have a problem where a process is getting stuck in a biowait and I was wondering if there is any way to find out where in the code (or in the kernel) this is happening. Once it is in this state it is unkillable so it doesn't process signal 11 to get a core dump. I am 90% sure it is the access to an external disk array which is causing it but I can't find a way to verify this. I have control of the source code so can compile it with symbols etc.

It's happening in the kernel. Do you have access to the console? If so, when it gets stuck, you can enter DDB and get a process listing with both the wait message ("biowait"--labelled "WCHAN" in the userland ps output) and the "wait channel". The wait channel is a kernel virtual address of the object being waited on. In this case, a "struct buf", and you can enter "show buf <addr>" to see the contents of that buf.

The process is in src/sys/vfs_bio.c:biowait(), but the question is why isn't it getting woken up--or if it's getting woken up, why aren't B_DONE or B_DELWRI set?

It would be useful to know what version you're running, and what disk drivers you're using. A dmesg (or /var/run/dmesg.boot) would show both, although it won't tell us which device "owns" the buf that gets stuck. Another option is to get a kernel core dump (if that's possible in this state).
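For the remote case without console access, a rough sketch of a one-shot check from userland is to filter the ps output on the WCHAN column mentioned above. It only confirms which processes are sleeping on "biowait"; it does not give the buf address that DDB would. The helper names (parse_ps, find_stuck) are made up for illustration, and the ps column keywords (pid, stat, wchan, command) are assumed to be accepted by the ps on the affected system.

```python
import subprocess


def parse_ps(ps_output, wchan="biowait"):
    """Return (pid, stat, command) for each row whose WCHAN column equals wchan.

    Expects the four-column layout requested in find_stuck() below.
    Assumes ps prints a placeholder such as "-" when a process has no
    wait channel, so the columns stay aligned.
    """
    stuck = []
    for line in ps_output.splitlines()[1:]:  # first line is the header
        fields = line.split(None, 3)  # command may itself contain spaces
        if len(fields) < 4 or not fields[0].isdigit():
            continue
        pid, stat, chan, command = fields
        if chan == wchan:
            stuck.append((int(pid), stat, command))
    return stuck


def find_stuck(wchan="biowait"):
    """Run ps once and report the processes sleeping on the given wait message."""
    result = subprocess.run(
        ["ps", "-ax", "-o", "pid,stat,wchan,command"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_ps(result.stdout, wchan)
```

Calling find_stuck() over a remote login (or from a cron job shortly after the daily script starts) would list any process parked in biowait at that moment, which at least timestamps when the hang begins.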
I'm running NetBSD 3 with an Adaptec SCSI controller attached to an easyRaid device (dmesg.boot attached). I've been thinking that the easyRaid might be misbehaving when it is under heavy load, but I'm not sure how I can prove that. The interesting thing is that the last 2 times this problem has occurred, it's been within a minute or so of the daily script running (circa 03h15). I'm not sure this is a coincidence...
Stuart
Attachment:
dmesg.boot.gz
Description: application/gzip