NetBSD-Users archive


Re: Substantial COMPAT_LINUX changes in netbsd-5?



Hi,

A slight update to this one: with the suggested change applied
(LK_CANRECURSE in nfs_root()), we are now in a situation where we have
these two backup processes:

mail-server% ps axl | egrep bpbkar
   0  5421     1 1332  85  0 18048  7260 uvn_fp2 D    ?     2:28.47 bpbkar -r 1
   0 17026     1  129 117  0 17860  4876 nfsrcv  D    ?     0:00.03 bpbkar -r 1
mail-server% 

The lock which causes subsequent attempts at running "df" to hang is
already held at this point -- most probably by the process which is now
stuck in uvn_fp2 and which is not making any progress.  Running a
Linux-emulated "df" results in it hanging in "tstile" (a native "df"
probably would as well, though I did not test that):

mail-server% ps axlww | egrep "bpbkar|df"
   0  5421     1 1332  85  0 18048  7260 uvn_fp2 D    ?      2:28.47 bpbkar -r 1209600 -ru root -dt 1428531 -to 0 -clnt mail-server.nordu.net -class NetBSD -sched Cumulative-Inc -st CINC -bpstart_to 300 -bpend_to 300 -read_to 1800 -ckpt_time 900 -blks_per_buffer 2048 -use_otm -nfsok -b mail-server.nordu.net_1264048175 -kl 28 -use_ofb
   0 17026     1  129 117  0 17860  4876 nfsrcv  D    ?      0:00.03 bpbkar -r 1209600 -ru root -dt 1432799 -to 0 -clnt mail-server.nordu.net -class NetBSD -sched Cumulative-Inc -st CINC -bpstart_to 300 -bpend_to 300 -read_to 1800 -ckpt_time 900 -blks_per_buffer 2048 -use_otm -nfsok -b mail-server.nordu.net_1264048175 -kl 28 -use_ofb
1612  4586  3404    0  43  0   152    32 -       R+   ttyp2  0:00.00 egrep bpbkar|df
1042  2519     1    0 127  0  1504   828 tstile  D    ttyp3- 0:00.00 /emul/linux/bin/df
1042  3006     1    0 127  0  1504   824 tstile  D    ttyp3- 0:00.01 /emul/linux/bin/df
mail-server% 

The previous message contained a stack trace for the process then stuck
in uvn_fp2; I repeat the relevant part here:

---
I also did a backtrace of the other bpbkar processes which in "ps axl"
output had these wait channels:

   0  5177     1  302 117  0 17860  4876 nfsrcv  D    ?      0:00.03 bpbkar -r 
   0  7179     1  915  85  0 18048  7260 uvn_fp2 D    ?      2:52.22 bpbkar -r 

db{0}> trace/t 0t7179
trace: pid 7179 lid 1 at 0xdb7ae3cc
sleepq_block(0,0,c0aaba51,c0b27c80,0,c150add8,62,c3ede230,de64667c,0) at netbsd:sleepq_block+0xeb
mtsleep(c3ede230,204,c0aaba51,0,de64667c,de64667c,10,6,0,0) at netbsd:mtsleep+0x12d
uvn_findpage(db7ae5ac,0,db7ae4ac,c05343fa,0,0,2,0,994000,db7ae5cc) at netbsd:uvn_findpage+0x92
uvn_findpages(de64667c,97c40000,3,db7ae5ec,db7ae5ac,0,994000,20,2,0) at netbsd:uvn_findpages+0x73
genfs_getpages(db7ae6b0,0,0,0,0,97cb0000,0,0,2,db7ae65c) at netbsd:genfs_getpages+0x743
nfs_getpages(db7ae6b0,4,97c42000,3,0,10000,97cc0000,c089d600,de64667c,97c40000) at netbsd:nfs_getpages+0xbb
VOP_GETPAGES(de64667c,97c40000,3,db7ae750,db7ae7c8,0,1,0,1802,0) at netbsd:VOP_GETPAGES+0x65
uvn_get(de64667c,97c40000,3,db7ae750,db7ae7c8,0,1,0,1802,e41be780) at netbsd:uvn_get+0x117
ubc_fault(db7ae8e0,d3a75000,db7ae8a0,1,0,1,42,c085d206,cee38540,ce3a4d00) at netbsd:ubc_fault+0x170
uvm_fault_internal(c0bc21c0,d3a75000,1,0,c4ec6482,c0000,0,c05a6cfa,6,6) at netbsd:uvm_fault_internal+0x3a9
trap() at netbsd:trap+0x797
--- trap (number 6) ---
copyout(e390a0e4,d3a75000,8249400,2000,e390a0e4,0,d3a75000,97c40000,3,d3a75000) at netbsd:copyout+0x33
uiomove(d3a75000,2000,db7aec8c,db7aeadc,0,101,deaddead,0,1829b58,0) at netbsd:uiomove+0x62
ubc_uiomove(de64667c,db7aec8c,10000,0,101,eee4221c,db7aeb2c,c085d206,de615800,de64671c) at netbsd:ubc_uiomove+0xeb
nfs_bioread(de64667c,db7aec8c,0,ce3a6f00,0,de64667c,db7aec2c,c053d6f4,db7aec14,de64667c) at netbsd:nfs_bioread+0x312
nfs_read(db7aec14,de64667c,c089d3c0,de64667c,1,20001,db7aec2c,c0534d58,c089ce80,de64667c) at netbsd:nfs_read+0x43
VOP_READ(de64667c,db7aec8c,0,ce3a6f00,d4728580,0,7aec6c,16,10000,8249400) at netbsd:VOP_READ+0x44
vn_read(e4408600,e4408600,db7aec8c,ce3a6f00,1,0,0,0,e41be780,db7aed48) at netbsd:vn_read+0x93
dofileread(9,e4408600,8249400,10000,e4408600,1,db7aed28,db7aed48,db7aed48,e41be780) at netbsd:dofileread+0x75
sys_read(e41be780,db7aed10,db7aed28,7aed20,96,10,c0b4a744,9,8249400,10000) at netbsd:sys_read+0x6f
linux_syscall(db7aed48,2b,2b,2b,2b,610,8259300,bfbeec08,9,10000) at netbsd:linux_syscall+0x9b
db{0}>

---

Why this process is stuck in uvn_fp2 and makes no progress from that
point, I do not know.  My gut feeling is that this process is probably
holding the lock which the other processes are waiting for in their
"tstile" waits.

The part in uvn_findpage() which waits on uvn_fp2 appears to be this
section of code:

                /* page is there, see if we need to wait on it */
                if ((pg->flags & PG_BUSY) != 0) {
                        if (flags & UFP_NOWAIT) {
                                UVMHIST_LOG(ubchist, "nowait",0,0,0,0);
                                return 0;
                        }
                        pg->flags |= PG_WANTED;
                        UVMHIST_LOG(ubchist, "wait %p", pg,0,0,0);
                        UVM_UNLOCK_AND_WAIT(pg, &uobj->vmobjlock, 0,
                                            "uvn_fp2", 0);
                        mutex_enter(&uobj->vmobjlock);
                        continue;
                }

However, as stated, it seems that the process never wakes up from
waiting here.  Could there be a race condition in the setting/testing of
PG_WANTED in pg->flags?  Or is that supposed to be covered by
&uobj->vmobjlock?
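
To make the question concrete, here is a user-space sketch of how I
understand the busy/wanted handshake is meant to work if the flags are
indeed covered by the object lock.  Everything in it is a stand-in of my
own (struct obj_sketch, page_wait_sketch(), page_unbusy_sketch(), the flag
values); it uses pthreads and is not the kernel code.  The point is that
the waiter sets PG_WANTED and goes to sleep without an unlocked window in
between, and that whoever clears PG_BUSY must check PG_WANTED under the
same lock and issue the wakeup:

```c
#include <assert.h>
#include <pthread.h>

#define PG_BUSY   0x1   /* illustrative values, not the kernel's */
#define PG_WANTED 0x2

struct obj_sketch {
	pthread_mutex_t lock;    /* plays the role of uobj->vmobjlock */
	pthread_cond_t  wakeup;  /* plays the role of the sleep channel */
	int             pg_flags;/* plays the role of pg->flags */
};

void
obj_init_sketch(struct obj_sketch *o)
{
	pthread_mutex_init(&o->lock, NULL);
	pthread_cond_init(&o->wakeup, NULL);
	o->pg_flags = 0;
}

/*
 * Waiter side.  Call with o->lock held; returns with it held and the
 * page no longer busy.  pthread_cond_wait() releases the lock and
 * sleeps atomically, so a wakeup cannot slip in between setting
 * PG_WANTED and going to sleep.
 */
void
page_wait_sketch(struct obj_sketch *o)
{
	while (o->pg_flags & PG_BUSY) {
		o->pg_flags |= PG_WANTED;
		pthread_cond_wait(&o->wakeup, &o->lock);
	}
}

/*
 * Owner side.  Call with o->lock held on a busy page.  Clears both
 * flags and wakes any waiters; returns 1 if a wakeup was issued.
 * If this step is skipped, or done without the lock, the waiter
 * sleeps forever.
 */
int
page_unbusy_sketch(struct obj_sketch *o)
{
	int wanted;

	assert(o->pg_flags & PG_BUSY);
	wanted = (o->pg_flags & PG_WANTED) != 0;
	o->pg_flags &= ~(PG_BUSY | PG_WANTED);
	if (wanted)
		pthread_cond_broadcast(&o->wakeup);
	return wanted;
}
```

If the real code follows this pattern, then a process sleeping forever in
uvn_fp2 would suggest that whoever marked the page busy never got around
to un-busying it, rather than a lost wakeup.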

I see the comment on uvn_findpages() says uobj must be locked, but
there's no diag-assert to verify that's actually the case.  How would
such a diag-assert look?
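
My guess is that in the kernel it would be a one-liner along the lines
of KASSERT(mutex_owned(&uobj->vmobjlock)); at the top of the function,
assuming mutex_owned() from mutex(9) may be used on this lock.  I have not
tried that.  The idea, sketched in user space with names I made up
(struct owned_mutex, om_enter(), om_owned(), findpages_sketch()), would
be roughly this:

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for a kernel mutex that remembers its owner. */
struct owned_mutex {
	pthread_mutex_t mtx;
	pthread_t       owner;  /* valid only while held is true */
	bool            held;
};

void
om_init(struct owned_mutex *m)
{
	pthread_mutex_init(&m->mtx, NULL);
	m->held = false;
}

void
om_enter(struct owned_mutex *m)
{
	pthread_mutex_lock(&m->mtx);
	m->owner = pthread_self();
	m->held = true;
}

void
om_exit(struct owned_mutex *m)
{
	m->held = false;
	pthread_mutex_unlock(&m->mtx);
}

/*
 * True if the calling thread holds the lock.  As with an ownership
 * check in a kernel, the answer is only dependable when it is "yes";
 * it is meant for assertions, not for making locking decisions.
 */
bool
om_owned(struct owned_mutex *m)
{
	return m->held && pthread_equal(m->owner, pthread_self());
}

/* Stand-in for a function documented as "object must be locked". */
int
findpages_sketch(struct owned_mutex *objlock)
{
	assert(om_owned(objlock));  /* the diag-assert in question */
	/* ... the page lookup would go here ... */
	return 0;
}
```

Calling findpages_sketch() without first calling om_enter() trips the
assertion, which is the behaviour one would want from a DIAGNOSTIC
kernel here.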

If you still want us to dig out which lock the "tstile"d processes are
hanging on, I think we still need some instructions to do that.


Regards,

- Håvard

