Subject: Re: parallel make locking up (on amd64)
To: None <current-users@netbsd.org>
From: Christos Zoulas <christos@astron.com>
List: current-users
Date: 07/12/2006 19:53:13
In article <20060712144944.GE16050@sb1001.name>,
Kurt Schreiner  <ks@ub.uni-mainz.de> wrote:
>Hi,
>
>While "torturing" my shiny new Sun Ultra 40, I tried a few "build.sh -j7" runs,
>which ran for a while, but eventually the make processes lock up WAITing on
>vnlock. The lockup can (more or less) be reproduced by "reboot; login; build.sh -j7"...
>The filesystems are set up as follows:
>
>/dev/wd1g on /u type ffs (noatime, soft dependencies, local)
>mfs:698 on /tmp type mfs (synchronous, nosuid, nodev, noatime, local)
><above>:/u/NetBSD/lsrc on /u/NetBSD/src.060711 type union (nosuid,
>nodev, local, mounted by ks)
>
>The parameters to build.sh are:
>
>./build.sh -N 1 -j 7 -x -U -m amd64 -O /u/NetBSD/arch/amd64/obj \
> -D /u/NetBSD/arch/amd64/dest -T /u/NetBSD/arch/amd64/TOOLS
>
>DDB (on serial console ;-) shows:
>
>db{0}> ps
> PID           PPID     PGRP        UID S   FLAGS LWPS          COMMAND    WAIT
> 8675             1     8675          0 2  0x4002    1            getty   ttyin
> 3366          3573     7364         77 2  0x4002    1             less   ttyin
> 3573          7364     7364         77 2  0x4002    1               sh    wait
> 7364          3355     7364         77 2  0x4002    1              man    wait
> 12189            1     7611         77 2  0x4002    1           nbmake  vnlock
> 7935             1     7217         77 2  0x4002    1           nbmake  vnlock
> 5841             1     3746         77 2  0x4002    1           nbmake  vnlock
> 4683             1     3379         77 2  0x4002    1           nbmake  vnlock
> 3095             1     1997         77 2  0x4002    1           nbmake  vnlock
> 7294             1     5283         77 2  0x4002    1           nbmake  vnlock
> 3355          2895     3355         77 2  0x4002    1             tcsh   pause
> 2895          3517     3517         77 2   0x100    1             sshd  select
> 3517           636     3517          0 2  0x4101    1             sshd   netio
> 1207           918     1207         77 2  0x4002    1             tcsh   ttyin
> 918            244      244         77 2   0x100    1             sshd  select
> 244            636      244          0 2  0x4101    1             sshd   netio
> 243              1      243          0 2  0x4002    1            getty   ttyin
> 242              1      242          0 2  0x4002    1            getty   ttyin
> 241              1      241          0 2  0x4002    1            getty   ttyin
> 235              1      235          0 2       0    1             cron nanosle
> 233              1      233          0 2       0    1            inetd  kqread
>
>
>db{0}> trace/t 0t5841
>trace: pid 5841  at 0xffff800057a326a0
>ltsleep() at netbsd:ltsleep+0x3df
>acquire() at netbsd:acquire+0x17d
>lockmgr() at netbsd:lockmgr+0x367
>VOP_LOCK() at netbsd:VOP_LOCK+0x25
>vn_lock() at netbsd:vn_lock+0x99
>cache_lookup() at netbsd:cache_lookup+0x2f9
>ufs_lookup() at netbsd:ufs_lookup+0xdc
>VOP_LOOKUP() at netbsd:VOP_LOOKUP+0x27
>union_lookup1() at netbsd:union_lookup1+0x42
>union_lookup() at netbsd:union_lookup+0xd9
>VOP_LOOKUP() at netbsd:VOP_LOOKUP+0x27
>lookup() at netbsd:lookup+0x296
>namei() at netbsd:namei+0x16a
>vn_open() at netbsd:vn_open+0x164
>sys_open() at netbsd:sys_open+0xdd
>syscall_plain() at netbsd:syscall_plain+0x122
>kernel: page fault trap, code=0
>Faulted in DDB; continuing...
>
>db{0}> trace/t 0t7294
>trace: pid 7294  at 0xffff8000581b9b60
>ltsleep() at netbsd:ltsleep+0x3df
>acquire() at netbsd:acquire+0x17d
>lockmgr() at netbsd:lockmgr+0x680
>VOP_LOCK() at netbsd:VOP_LOCK+0x25
>vn_lock() at netbsd:vn_lock+0x99
>union_lock() at netbsd:union_lock+0x7f
>VOP_LOCK() at netbsd:VOP_LOCK+0x25
>vn_lock() at netbsd:vn_lock+0x99
>vn_readdir() at netbsd:vn_readdir+0xcb
>sys___getdents30() at netbsd:sys___getdents30+0xaa
>syscall_plain() at netbsd:syscall_plain+0x122
>kernel: page fault trap, code=0
>Faulted in DDB; continuing...
>
>Is there anything I can do to help debug this? Should I file a PR with send-pr?

Yes, try again with /u mounted without softdeps (soft dependencies).

christos