kern/53124: FFS is slow
>Synopsis: FFS is slow because pmap_update doesn't scale
>Arrival-Date: Sat Mar 24 08:55:01 +0000 2018
>Release: NetBSD 8.99.12
System: NetBSD slowpoke 8.99.12 NetBSD 8.99.12 (SLOWPOKE) #27: Tue Mar 20 02:21:41 CET 2018 mlelstv@gossam:/home/netbsd-current/obj.amd64/home/netbsd-current/src/sys/arch/amd64/compile/SLOWPOKE amd64
Filesystem I/O is slowed down significantly on systems with many cores.
Setup: 32 Core (16 Core + HT) Ryzen, 64GB RAM, NVME disk.
You can read from the raw NVME disk at about 3GByte/s with 'dd'
as in 'dd if=/dev/rdk0 of=/dev/null bs=1024k'.
However, reading from an FFS filesystem (well aligned, etc.) on
that disk is limited to about 140MB/s.
Reading the file a second time, when everything is cached in
memory, isn't any faster, so the problem is not related to the
physical disk system.
A check against a Haswell system with only 4 cores running netbsd-7
with a backported NVME driver yields 2.2GB/s raw, 1.2GB/s through
the filesystem and 1.7GB/s from cache.
A laptop with an old i5 (dual core + HT) reads a cached file at
Disabling HT on the Ryzen system doubles the speed (filesystem
or cache) to about 330MB/s.
I've started crash(8) to sample kernel stack traces from the dd
process and at least 90% of the time it is working in pmap_update().
trace: pid 2073 lid 1 at 0xffff8004953cfbc0
pmap_update() at pmap_update+0x26
ubc_alloc() at ubc_alloc+0x51e
ubc_uiomove() at ubc_uiomove+0x8e
ffs_read() at ffs_read+0xd3
VOP_READ() at VOP_READ+0x37
vn_read() at vn_read+0x94
dofileread() at dofileread+0x90
sys_read() at sys_read+0x5f
syscall() at syscall+0x1bc
--- syscall (number 3) ---
From systat(1) you can see that reading from the filesystem needs
about one TLB shootdown per 8kB read. Broadcasting each shootdown
to all cores is apparently slow, and it takes more time the more
cores you have.
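
As a rough back-of-the-envelope check (an estimate, not a measurement): at about 140MB/s and one shootdown per 8kB read, the kernel issues on the order of 17,500 shootdowns per second, which leaves at most about 57 microseconds for each 8kB read. The sketch below does that arithmetic. The function names are made up for illustration, decimal units (1MB = 1000kB) are assumed, and the per-shootdown figure is only an upper bound that treats the whole read time as shootdown cost.

```sh
#!/bin/sh
# Hypothetical helpers, not part of any NetBSD tool.

# Shootdowns per second, given throughput in MB/s and kB read per shootdown.
# Assumes decimal units: 1 MB = 1000 kB.
shootdowns_per_sec() {
    echo $(( $1 * 1000 / $2 ))
}

# Upper bound on microseconds spent per shootdown, assuming the whole
# read time is attributed to shootdowns (integer division, rounds down).
usec_per_shootdown() {
    echo $(( 1000000 / $(shootdowns_per_sec "$1" "$2") ))
}

shootdowns_per_sec 140 8    # 140MB/s at 8kB per shootdown: prints 17500
usec_per_shootdown 140 8    # prints 57
```

The same helpers applied to the 330MB/s seen with HT disabled give about 41,250 shootdowns per second, or roughly 24 microseconds each.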
The TLB shootdown process is optimized to skip CPUs that have
mapped a different address space. This can be easily verified
by running N-1 infinite loops while doing the I/O test.
The result is that reading from cache speeds up to 1.8GB/s.
But when the other N-1 cores are idle, they all wait in the idle
loop with the kernel address space mapped, so none of them can be
skipped.
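
The busy-loop experiment can be sketched roughly as follows. This is a minimal illustration, not the exact commands used for the numbers above: the function names and the SPINNERS variable are made up, and the file path in the usage comment is a placeholder.

```sh
#!/bin/sh
# Hypothetical helpers for keeping the other CPUs busy in user processes.

# Start $1 busy-looping background subshells; remember their PIDs in SPINNERS.
start_spinners() {
    SPINNERS=""
    i=0
    while [ "$i" -lt "$1" ]; do
        ( while :; do :; done ) &
        SPINNERS="$SPINNERS $!"
        i=$((i + 1))
    done
}

# Kill the busy loops started by start_spinners and reap them.
stop_spinners() {
    # SPINNERS is left unquoted on purpose so it splits into separate PIDs.
    kill $SPINNERS 2>/dev/null || true
    wait $SPINNERS 2>/dev/null || true
    SPINNERS=""
}

# Usage sketch (placeholder path; run on the machine under test):
#   ncpu=$(sysctl -n hw.ncpu)
#   start_spinners $((ncpu - 1))
#   dd if=/path/to/cached/file of=/dev/null bs=1024k
#   stop_spinners
```

Each spinner is an ordinary user process with its own address space, which is what keeps the other CPUs out of the idle loop during the test.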
>How-To-Repeat:
Do filesystem I/O on a machine with many cores.