misc/52184: Recent qemu performs badly on NetBSD hosts under load

To: misc-bug-people%netbsd.org@localhost,gnats-admin%netbsd.org@localhost,netbsd-bugs%netbsd.org@localhost
Subject: misc/52184: Recent qemu performs badly on NetBSD hosts under load
From: gson%gson.org@localhost (Andreas Gustafsson)
Date: Fri, 21 Apr 2017 09:40:00 +0000 (UTC)

>Number:         52184
>Category:       misc
>Synopsis:       Recent qemu performs badly on NetBSD hosts under load
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    misc-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Fri Apr 21 09:40:00 +0000 2017
>Originator:     Andreas Gustafsson
>Release:        NetBSD 7.1
>Organization:

>Environment:
System: NetBSD 7.1
Architecture: x86_64
Machine: amd64
>Description:

The NetBSD testbed babylon5.netbsd.org runs automated full-system
tests using qemu.  After the qemu on babylon5 was upgraded from
version 0.15.1 to 2.8.0 in early March, the tests have been taking
much longer to run than they did with 0.15.1, at least twice as long
on average.

The exact reason for the slowdown is still unknown, and could be a
qemu issue, a NetBSD issue, or a combination of the two; hence the
"misc" category of this PR.

I have run a number of tests to try to track down the circumstances in
which the problem occurs.  As far as I can tell, they are as follows:

  1. The version of qemu must include commit
     05e514b1d4d5bd4209e2c8bbc76ff05c85a235f3
     (found by bisection).

  2. The host system that qemu is executed on must be NetBSD
     rather than Linux.

  3. Some minor slowdown may occur under just the two above
     conditions, but for the slowdown to be dramatic, the host system
     must be under some CPU load.  On babylon5, this load typically
     consists of a parallel build of NetBSD (started with "nice 10")
     and other qemu processes.  In my own tests, I have simulated it
     by running multiple infinite loop processes in the background
     (again with "nice 10").
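
For readers who want to generate a similar background load without
the full test script, the following is a minimal sketch and not the
code used in test.sh.  It assumes a Unix host with a "nice" command
that accepts "-n" and a Python interpreter; the names start_load and
stop_load are made up for illustration.

```python
import subprocess
import sys

def start_load(count, niceness=10):
    """Start `count` niced busy-loop processes and return their handles."""
    # Each child is a Python interpreter spinning in an infinite loop,
    # launched through nice(1) so it runs at reduced priority.
    cmd = ["nice", "-n", str(niceness), sys.executable, "-c", "while True: pass"]
    return [subprocess.Popen(cmd) for _ in range(count)]

def stop_load(procs):
    """Terminate the busy-loop processes and wait for them to exit."""
    for p in procs:
        p.terminate()
    for p in procs:
        p.wait()
```

A typical use would be to call start_load() with roughly one process
per host CPU, run the qemu guest being timed, and then call
stop_load() on the returned list.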

A shell script to automate reproducing the problem is available at

   http://www.gson.org/netbsd/bugs/qemu-slow/test.sh

This script builds two versions of qemu from git, from before and
after the above commit, downloads a NetBSD 6.0.1 image pre-configured
to run the ATF tests, starts infinite loop processes in the background
to generate CPU load, and runs the test image using each qemu version
in turn.  It needs about 2.5 GB of disk and several hours to run, and
due to the high CPU load it generates, I would not recommend running
it on a production system.

When run under NetBSD 7.1/amd64 on a 12-core HP DL360 G7 server, the
script reported the following execution times for qemu git revision
21a03d17f2edb1e63f7137d97ba355cc6f19d79f:

   real      3871.12
   user      2934.44
   sys         17.25

And for qemu git revision 05e514b1d4d5bd4209e2c8bbc76ff05c85a235f3:

   real     16813.29
   user     17774.38
   sys       5884.99

Note that the difference is especially dramatic for the system time,
which increased by a factor of more than 300.  When the script was run
on a Linux system, there was no significant difference in execution
time between the two qemu versions.
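
The slowdown factors can be checked directly from the figures above.
The following short sketch only does that arithmetic; the helper name
slowdown is made up for illustration.  It gives factors of roughly
4.3 for real time, 6.1 for user time, and 341 for system time.

```python
# Timings in seconds, copied from the two runs reported above.
before = {"real": 3871.12, "user": 2934.44, "sys": 17.25}     # 21a03d17f2ed...
after = {"real": 16813.29, "user": 17774.38, "sys": 5884.99}  # 05e514b1d4d5...

def slowdown(before, after):
    """Return the after/before ratio for each timing category."""
    return {name: after[name] / before[name] for name in before}

ratios = slowdown(before, after)
for name, ratio in ratios.items():
    print(f"{name:4s} {ratio:7.1f}x")
```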

The commit message for qemu commit 05e514b1d4d5bd4209e2c8bbc76ff05c85a235f3
reads:

    AioContext: optimize clearing the EventNotifier
    
    It is pretty rare for aio_notify to actually set the EventNotifier.  It
    can happen with worker threads such as thread-pool.c's, but otherwise it
    should never be set thanks to the ctx->notify_me optimization.  The
    previous patch, unfortunately, added an unconditional call to
    event_notifier_test_and_clear; now add a userspace fast path that
    avoids the call.
    
    Note that it is not possible to do the same with event_notifier_set;
    it would break, as proved (again) by the included formal model.
    
    This patch survived over 3000 reboots on aarch64 KVM.

    Signed-off-by: Paolo Bonzini <pbonzini%redhat.com@localhost>
    Reviewed-by: Fam Zheng <famz%redhat.com@localhost>
    Tested-by: Richard W.M. Jones <rjones%redhat.com@localhost>
    Message-id: 1437487673-23740-7-git-send-email-pbonzini%redhat.com@localhost
    Signed-off-by: Stefan Hajnoczi <stefanha%redhat.com@localhost>

Reading the diff of the commit did not make it immediately clear to
me why it would cause a performance regression on NetBSD.  Any help
in explaining this would be appreciated.

>How-To-Repeat:

Run the test script.

>Fix:
