tech-kern archive
Unexpected out of memory kills when running parallel find instances over millions of files
Running 20 find(1) instances, each over its own "private" tree with
a million files, runs into trouble with the kernel killing them (and
other processes):
[ 785.194378] UVM: pid 1998.1998 (find), uid 0 killed: out of swap
[ 785.194378] UVM: pid 2010.2010 (find), uid 0 killed: out of swap
[ 785.224675] UVM: pid 1771.1771 (top), uid 0 killed: out of swap
[ 785.285291] UVM: pid 1960.1960 (zsh), uid 0 killed: out of swap
[ 785.376172] UVM: pid 2013.2013 (find), uid 0 killed: out of swap
[ 785.416572] UVM: pid 1760.1760 (find), uid 0 killed: out of swap
[ 785.416572] UVM: pid 1683.1683 (tmux), uid 0 killed: out of swap
This should not be happening -- there is tons of reusable RAM as
virtually all of the vnodes getting here are immediately recyclable.
$elsewhere I got a report of a workload with hundreds of millions of
files which get walked in parallel -- a number high enough that it
does not fit in RAM on the boxes which run it. Out of curiosity I figured
I'd check how others are doing on this front, but the key point is that
this is not a made-up problem.
I'm running NetBSD 10, kernel built from this commit at top of the tree:
Author: andvar <andvar%NetBSD.org@localhost>
Date: Sat Oct 14 08:05:25 2023 +0000
fix various typos in comments and documentation, mainly in word "between".
Specs are 24 cores, 24G of RAM and ufs2 with noatime. Swap is *not* configured.
Test generates 20 separate trees, each has 1000 directories with 1000
files (or 20 million files in total + some dirs).
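For readers who do not want to unpack the tarball, the layout described above can be sketched roughly as follows. This is NOT the script from fstree.tgz; the function name make_trees and the treeN/dirN/fileN naming are made up for illustration, and the sizes are parameters so it can be tried on a small scale first.

```sh
#!/bin/sh
# Hypothetical layout generator; not the script shipped in fstree.tgz.
# Usage: make_trees ROOT NTREES NDIRS NFILES
# Creates ROOT/tree<t>/dir<d>/file<f>, one background job per tree.
make_trees() {
    root=$1 ntrees=$2 ndirs=$3 nfiles=$4
    t=0
    while [ "$t" -lt "$ntrees" ]; do
        (
            # Each tree is populated by its own subshell, in parallel.
            d=0
            while [ "$d" -lt "$ndirs" ]; do
                dir="$root/tree$t/dir$d"
                mkdir -p "$dir"
                f=0
                while [ "$f" -lt "$nfiles" ]; do
                    : > "$dir/file$f"    # create an empty file
                    f=$((f + 1))
                done
                d=$((d + 1))
            done
        ) &
        t=$((t + 1))
    done
    wait    # block until every tree has been populated
}
```

With the numbers from this mail the call would be something like "make_trees /mnt/test 20 1000 1000" (the mount point is an assumption), i.e. 20 * 1000 * 1000 = 20,000,000 files plus 20,020 directories.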
Repro instructions are here:
https://people.freebsd.org/~mjg/.junk/fstree.tgz
Note that parallel creation of these trees is dog slow; it took over
40 minutes for me.
I had to pass extra flags to newfs for the target fs to even fit
this inode count:
newfs -n 220000000 -O 2 /dev/wd1e
So the expected outcome is that this finishes (extra points for
reasonable time) instead of having userspace processes get killed.
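A minimal sketch of the walk itself and of checking the expected outcome, assuming one directory per tree named tree* under a common root (that naming and the function name run_finds are assumptions, not taken from the tarball). A find that gets killed by the kernel shows up as a non-zero wait status, so the function prints how many instances did not finish cleanly.

```sh
#!/bin/sh
# Hypothetical runner; not the script shipped in fstree.tgz.
# Usage: run_finds ROOT
# Starts one find(1) per ROOT/tree* in parallel, waits for all of them,
# and prints the number of instances that exited non-zero (including
# those terminated by a signal).
run_finds() {
    root=$1
    pids=""
    for tree in "$root"/tree*; do
        find "$tree" > /dev/null &
        pids="$pids $!"
    done
    failed=0
    for pid in $pids; do
        wait "$pid" || failed=$((failed + 1))
    done
    echo "$failed"
}
```

The expected result on a healthy system is a printed count of 0; other processes getting killed (top, zsh, tmux above) would still only be visible in dmesg.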
I don't know what kind of diagnostic info would be best here, but
given the repro steps above I don't think I need to look for anything.
Have fun. :)
--
Mateusz Guzik <mjguzik gmail.com>