Current-Users archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

Re: random lockups (now suspecting zfs)



Stephen Borrill <netbsd%precedence.co.uk@localhost> writes:

> On Sat, 4 Nov 2023, Simon Burge wrote:
>> Greg Troxel wrote:
>>
>>> So to me this feels like a locking botch in a rare path in zfs.
>>
>> This appears to be the case.  Chuck Silvers has some understanding of
>> the problem and I'm helping test, but at this stage there isn't a fix
>> available. :/
>
> It's interesting that you see the lockups during pkgsrc builds, i.e. a
> period where there is lots of file creation. We use zfs on backup
> systems that pull in data with rsync. During the initial runs (where
> every file is new) we usually get a couple of lockups, but during day
> to day operation (few changes) it is reliable. These are on physical
> and virtual machines running NetBSD 9 with the rule of thumb of 1GB
> RAM per TB of storage obeyed, but no patches besides setting MAXPHYS
> in the module to 32k for Xen.

I just had another problem, on the non-xen 32GB machine (which has 3.5T
in the pool, only half full).

The machine wasn't really doing much; X running with xfce, a few xterms,
ssh client, pidgin, and idling firefox with I think 24 tabs.

I found it mostly normal and was using an ssh session, and then switched
to the firefox virtual desktop which failed to redraw.  I tried to kill
firefox (because firefox hanging is not so strange :-() and found that a
few of the tabs appeared to be stuck in flt_noram and zio_buf.  I think
there might have been a different wchan earlier that was zfs but not
zio_buf.

I think it got in this state due to firefox leaking memory (in SIZE but
not RES?).

(So it might be a missing wakeup on flt_noram, but lock not released
seems plausible also.  Totally guessing here.)

(As I was composing this message (in tmux on another machine), the
firefox lockup deteriorated to more things hanging and then a total
lockup.  I was unable to ctrl-alt-f1 to get back to the text console.
It is responding to mdns queries and pings and sshd answers but I see
"local version string" and the "remote protocol" line does not appear.

I should try LOCKDEBUG on the package building box (where if it doesn't
work right that's much more ok!).



10853   129  9817      0  85  0  2762264 180704 uvnfp2   DEl  ttyp5    7:12.20 (firefox)
10853   994  9817      0   0  0        0      0 -        Z    ttyp5    0:00.00 (firefox)
10853  1867  9817      0  85  0  3423848 723944 uvnfp2   DEl  ttyp5  146:52.32 (firefox)
10853  7407  9817      0  85  0 20169184 355160 flt_nora DEl  ttyp5   63:48.76 (firefox)
10853  7630  9817      0   0  0        0      0 -        Z    ttyp5    0:00.00 (firefox)
10853  8451  9817      0  85  0  2712376 126588 uvnfp2   DEl  ttyp5    7:09.93 (firefox)
10853  8504  9817      0  85  0  2744608 143008 uvnfp2   DEl  ttyp5    9:56.45 (firefox)
10853  9817     1     21 117  0 12939188 948252 zio_buf_ DEl  ttyp5  303:41.53 (firefox)
10853 11066  9817      0  85  0  2821832 225664 &db->db_ DEl  ttyp5    1:34.01 (firefox)
10853 11769  9817      0  85  0  2849780 232172 uvnfp2   DEl  ttyp5    9:19.27 (firefox)
10853 12055  9817      0  85  0  2832852 144304 uvnfp2   DEl  ttyp5    8:49.22 (firefox)
10853 13075  9817      0  85  0  2782516 193652 plpg     DEl  ttyp5    9:00.21 (firefox)
10853 15399  9817      0  85  0  2822236 249496 uvnfp2   DEl  ttyp5   10:12.41 (firefox)
10853 15991  9817      0  85  0  2775316 187104 uvnfp2   DEl  ttyp5    7:13.63 (firefox)
10853 16033  9817      0   0  0        0      0 -        Z    ttyp5    0:00.00 (firefox)
10853 16877  9817      0  85  0  2731156 148896 uvnfp2   DEl  ttyp5    1:59.22 (firefox)
10853 17275  9817      0   0  0        0      0 -        Z    ttyp5    0:00.00 (firefox)
10853 19768  9817      0  85  0  2760188 152880 uvnfp2   DEl  ttyp5    7:11.17 (firefox)
10853 21618  9817      0   0  0        0      0 -        Z    ttyp5    0:00.00 (firefox)
10853 24342  9817      0  85  0  2737588 148452 uvnfp2   DEl  ttyp5   11:51.61 (firefox)
10853 24956  9817      0  85  0  2981764 336852 uvnfp2   DEl  ttyp5   20:20.13 (firefox)
10853 26368  9817      0  85  0  3164560 240992 uvnfp2   DEl  ttyp5   19:28.72 (firefox)
10853 26981  9817   1123  85  0  3659088 770432 flt_nora DEl  ttyp5   84:09.22 (firefox)
10853 27139  9817      0   0  0        0      0 -        Z    ttyp5    0:00.00 (firefox)
10853 29076  9817   2270  85  0  2975552 261064 flt_nora DEl  ttyp5   88:44.15 (firefox)

top says

Memory: 14G Act, 6989M Inact, 88M Wired, 549M Exec, 13G File, 228M Free
Swap: 40G Total, 348M Used, 40G Free / Pools: 11G Used

so it did get into paging

vmstat -s:
     4096 bytes per page
       16 page colors
  8079588 pages managed
    58419 pages free
  3733123 pages active
  1789074 pages inactive
        1 pages paging
    22427 pages wired
        1 reserve pagedaemon pages
       40 reserve kernel pages
   252503 boot kernel pages
  2817259 kernel pool pages
  2027769 anonymous pages
  3376469 cached file pages
   140490 cached executable pages
     2048 minimum free pages
     2730 target free pages
  2693196 maximum wired pages
        1 swap devices
 10485759 swap pages
    89099 swap pages in use
   534237 swap allocations
6412584353 total faults taken
6413831611 traps
 78980016 device interrupts
1337982749 CPU context switches
312160043 software interrupts
11693270083 system calls
   519672 pagein requests
    84753 pageout requests
        0 pages swapped in
  1355436 pages swapped out
 11042006 forks total
  7107987 forks blocked parent
  7107987 forks shared address space with parent
5114776798 pagealloc desired color avail
328102950 pagealloc desired color not avail
4442734240 pagealloc local cpu avail
1000145508 pagealloc local cpu not avail
    15518 faults with no memory
        0 faults with no anons
     1780 faults had to wait on pages
        0 faults found released page
 18115368 faults relock (18042145 ok)
301349249 anon page faults
   510693 anon retry faults
588841483 amap copy faults
238859694 neighbour anon page faults
6051925265 neighbour object page faults
1842970725 locked pager get faults
 17593932 unlocked pager get faults
193768531 anon faults
107569755 anon copy on write faults
1381042165 object faults
461851046 promote copy faults
4244253805 promote zero fill faults
  4295634 faults upgraded lock
    15025 faults couldn't upgrade lock
     5042 times daemon wokeup
     2631 revolutions of the clock hand
 29898211 pages freed by daemon
 45830406 pages scanned by daemon
   992050 anonymous pages scanned by daemon
 28906451 object pages scanned by daemon
   321731 pages reactivated
        0 pages found busy by daemon
  1270683 total pending pageouts
 49696575 pages deactivated
272335878 per-cpu stats synced
   686222 anon pages possibly dirty
  1341547 anon pages dirty
        0 anon pages clean
    70660 file pages possibly dirty
       56 file pages dirty
  3446243 file pages clean
20979779248 total name lookups
19150722941 good hits
1317343206 negative hits
 10803620 bad hits
        0 false hits
500909481 miss
        0 too long
   968276 pass2 hits
  1088078 2passes
 17136909 reverse hits
    10934 reverse miss
        0 access denied
          cache hits (91% pos + 6% neg) system 0% per-process
          deletions 0%, falsehits 0%, toolong 0%


Home | Main Index | Thread Index | Old Index