Re: kern/50094

To: kern-bug-people%netbsd.org@localhost,gnats-admin%netbsd.org@localhost,netbsd-bugs%netbsd.org@localhost,cmeerw%cmeerw.org@localhost
Subject: Re: kern/50094
From: Michael Pratt <mpratt%google.com@localhost>
Date: Thu, 17 Dec 2020 17:30:02 +0000 (UTC)

The following reply was made to PR kern/50094; it has been noted by GNATS.

From: Michael Pratt <mpratt%google.com@localhost>
To: gnats-bugs%netbsd.org@localhost
Cc: 
Subject: Re: kern/50094
Date: Thu, 17 Dec 2020 12:27:40 -0500

 This problem is affecting the Go language runtime. See
 https://github.com/golang/go/issues/42515 for complete details.

 In particular, see
 https://github.com/golang/go/issues/42515#issuecomment-747097912,
 where I came to the same conclusion as Christof: the kqueue is
 unlocked while the knote is removed from it, so a racing kevent call
 may miss the event. I've also captured custom ktrace traces to
 demonstrate the problem.

 However, while I believe Christof's proposed fix would work for
 blocking kevent calls, it would still allow non-blocking kevent calls
 to miss the event and return without it, in violation of the API. It is
 not clear to me how this should be fixed as I don't know exactly what
 constraints require unlocking in the first place. For reference,
 FreeBSD seems to handle this issue by noticing that the kqueue is "in
 flux" and waiting for it to settle:
 https://github.com/freebsd/freebsd/blob/master/sys/kern/kern_event.c#L1859-L1869.
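
The "wait for it to settle" idea can be illustrated with a small userspace
analog. This is only a sketch built on pthreads, loosely modeled on the
description above; it is not FreeBSD's or NetBSD's kernel code, and every
name in it (struct fluxq, fluxq_detach, fluxq_reattach, fluxq_poll) is
hypothetical. A scanner detaches an event and drops the lock; a racing
non-blocking caller that sees an empty queue with an event still in flux
waits for the scanner to finish instead of reporting zero events.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for a kqueue holding ready events. */
struct fluxq {
  pthread_mutex_t mu;
  pthread_cond_t settled; /* signalled whenever an in-flux event settles */
  int ready;              /* events currently visible on the queue */
  int in_flux;            /* events temporarily detached by a scanner */
};

void fluxq_init(struct fluxq *q, int ready) {
  pthread_mutex_init(&q->mu, NULL);
  pthread_cond_init(&q->settled, NULL);
  q->ready = ready;
  q->in_flux = 0;
}

/* Detach one event for inspection. The lock is released on return, which
 * opens the window in which a racing caller could observe an empty queue.
 * Returns 1 if an event was detached, 0 if the queue was empty. */
int fluxq_detach(struct fluxq *q) {
  pthread_mutex_lock(&q->mu);
  int got = q->ready > 0;
  if (got) {
    q->ready--;
    q->in_flux++;
  }
  pthread_mutex_unlock(&q->mu);
  return got;
}

/* Finish inspection: requeue the event if it is still active, then wake
 * any callers waiting for the queue to settle. */
void fluxq_reattach(struct fluxq *q, int still_active) {
  pthread_mutex_lock(&q->mu);
  assert(q->in_flux > 0);
  q->in_flux--;
  if (still_active)
    q->ready++;
  pthread_cond_broadcast(&q->settled);
  pthread_mutex_unlock(&q->mu);
}

/* "Non-blocking" count of ready events: it never sleeps waiting for new
 * events, but it does wait while the queue looks empty only because an
 * event is in flux. */
int fluxq_poll(struct fluxq *q) {
  pthread_mutex_lock(&q->mu);
  while (q->ready == 0 && q->in_flux > 0)
    pthread_cond_wait(&q->settled, &q->mu);
  int n = q->ready;
  pthread_mutex_unlock(&q->mu);
  return n;
}

/* Thread entry point for a racing caller; returns the observed count. */
void *fluxq_poll_thread(void *arg) {
  return (void *)(intptr_t)fluxq_poll(arg);
}
```

Without the wait loop in fluxq_poll, a caller racing between fluxq_detach
and fluxq_reattach would see ready == 0 and return no events, which is the
userspace equivalent of the missed event described in this PR.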

 I have a C reproducer similar to Go's usage at
 https://gist.github.com/prattmic/8b5bc6c87437bd4496d5f546fb3226fc,
 which is copied at the end of this email for convenience.

 The C reproducer should be sufficient to see the issue, but here's
 some additional context on Go's usage for reference:

 Go uses a single kqueue to track networking file descriptors
 ("netpoll"). Typically there is one blocking kevent caller to capture
 events when the process is otherwise idle, plus multiple non-blocking
 callers from threads looking for work (so we don't need complete
 synchronization with the blocking thread).

 The blocking thread serves a second purpose: implementing Go timers.
 The timeout on the kevent is set to the earliest expiring timer. When
 the kevent times out, we notice and handle the now-expired timer. If
 user code sets a timer expiring earlier than any other, then we need
 to break out of the blocking kevent to restart with a shorter timeout.
 We do this by writing to a special purpose pipe used just for
 generating events to break out of kevent. It is these "netpoll break"
 events that we are occasionally missing due to this bug.
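
The wake-up pipe pattern described above can be sketched as follows. This
is a minimal illustration, not Go's runtime code: it uses poll(2) instead
of kevent so that it builds on any POSIX system, and the names (struct
breaker, breaker_init, breaker_wake, breaker_wait) are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical wake-up pipe: rfd is watched by the blocking waiter,
 * wfd is written by any thread that wants to interrupt the wait. */
struct breaker {
  int rfd;
  int wfd;
};

/* Create the pipe and make both ends non-blocking. Returns 0 or -1. */
int breaker_init(struct breaker *b) {
  int p[2];
  if (pipe(p) < 0)
    return -1;
  for (int i = 0; i < 2; i++) {
    int fl = fcntl(p[i], F_GETFL);
    if (fl < 0 || fcntl(p[i], F_SETFL, fl | O_NONBLOCK) < 0) {
      close(p[0]);
      close(p[1]);
      return -1;
    }
  }
  b->rfd = p[0];
  b->wfd = p[1];
  return 0;
}

/* Interrupt a thread blocked in breaker_wait. Returns 0 or -1. */
int breaker_wake(struct breaker *b) {
  char c = 0;
  ssize_t n;
  do {
    n = write(b->wfd, &c, 1);
  } while (n < 0 && errno == EINTR);
  /* EAGAIN means the pipe is full, so a wake-up is already pending. */
  if (n == 1 || (n < 0 && errno == EAGAIN))
    return 0;
  return -1;
}

/* Block for up to timeout_ms milliseconds (-1 waits forever, 0 polls).
 * Returns 1 if woken by breaker_wake, 0 on timeout, -1 on error.
 * For simplicity, an EINTR restarts the wait with the full timeout. */
int breaker_wait(struct breaker *b, int timeout_ms) {
  assert(timeout_ms >= -1);
  struct pollfd pfd = {.fd = b->rfd, .events = POLLIN};
  int n;
  do {
    n = poll(&pfd, 1, timeout_ms);
  } while (n < 0 && errno == EINTR);
  if (n <= 0)
    return n;
  /* Drain pending wake-up bytes so the next wait starts clean. */
  char buf[64];
  while (read(b->rfd, buf, sizeof buf) > 0) {
  }
  return 1;
}
```

In this sketch, the waiter would pass the time until the earliest timer as
timeout_ms, and a thread that installs an earlier timer would call
breaker_wake so the waiter restarts with a shorter timeout. The property
this PR is about is that a wait issued after a completed wake must report
the event rather than time out.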

 In Go 1.15 and prior, there was an additional background thread
 running periodically that could also notice overrun timers and handle
 them (at the expense of latency, since this only runs periodically).
 For Go 1.16, various timer latency improvements have made
 that periodic check unnecessary, so it has been removed. Now this bug
 is loudly exposed as we can completely miss timers for an unbounded
 period, hanging programs and causing test timeouts.

 As a workaround we will add back the periodic check for NetBSD, but as
 mentioned above, it is at the expense of timer latency, so we'd really
 like to see this bug fixed.

 Regards,
 Michael Pratt

 ---

 // $ gcc -pthread kqueue_race.c
 // $ ./a.out
 // kevent = 0
 // Incorrect number of events after 477 calls!

 #include <pthread.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <sys/event.h>
 #include <sys/time.h>
 #include <unistd.h>

 int kq;

 void* read_kevent(void* arg) {
   int count = 0;
   while (1) {
     count++;

     // Zero timeout: kevent polls and returns immediately (non-blocking).
     struct timespec ts = {0};
     struct kevent ev;
     int n = kevent(kq, NULL, 0, &ev, 1, &ts);
     if (n < 0) {
       perror("kevent");
       exit(1);
     } else if (n != 1) {
       printf("kevent = %d\n", n);
       printf("Incorrect number of events after %d calls!\n", count);
       exit(2);
     }
   }
 }

 int main(int argc, char** argv) {
   int p[2];
   int ret = pipe(p);
   if (ret < 0) {
     perror("pipe");
     return 1;
   }

   kq = kqueue();
   if (kq < 0) {
     perror("kqueue");
     return 1;
   }

   struct kevent ev = {0};
   EV_SET(&ev, p[0], EVFILT_READ, EV_ADD, 0, 0, 0);
   ret = kevent(kq, &ev, 1, NULL, 0, NULL);
   if (ret < 0) {
     perror("kevent register");
     return 1;
   }

   // Write to pipe, kevent now ready for read end.
   char c = 0;
   ret = write(p[1], &c, 1);
   if (ret != 1) {
     perror("write");
     return 1;
   }

   pthread_t t1, t2;
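   // pthread_create returns an error number on failure; it does not
   // return -1 or set errno, so perror would be misleading here.
   ret = pthread_create(&t1, NULL, read_kevent, NULL);
   if (ret != 0) {
     fprintf(stderr, "pthread_create: error %d\n", ret);
     return 1;
   }
   ret = pthread_create(&t2, NULL, read_kevent, NULL);
   if (ret != 0) {
     fprintf(stderr, "pthread_create: error %d\n", ret);
     return 1;
   }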

   pthread_join(t1, NULL);
   pthread_join(t2, NULL);

   return 0;
 }
