kern/56535: Memory corruption in multi-threaded Go parent process following fork() on AMD CPUs

To: kern-bug-people%netbsd.org@localhost,gnats-admin%netbsd.org@localhost,netbsd-bugs%netbsd.org@localhost
Subject: kern/56535: Memory corruption in multi-threaded Go parent process following fork() on AMD CPUs
From: mpratt%google.com@localhost
Date: Fri, 3 Dec 2021 23:20:00 +0000 (UTC)

>Number:         56535
>Category:       kern
>Synopsis:       Memory corruption in multi-threaded Go parent process following fork() on AMD CPUs
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Fri Dec 03 23:20:00 +0000 2021
>Originator:     Michael Pratt
>Release:        9.0, 9.99.92
>Organization:
Google / Go Programming Language
>Environment:
NetBSD buildlet-netbsd-amd64-9-0-n2d-rne414d8e.c.symbolic-datum-552.internal 9.0 NetBSD 9.0 (GENERIC) #0: Fri Feb 14 00:06:28 UTC 2020  mkrepro%mkrepro.NetBSD.org@localhost:/usr/src/sys/arch/amd64/compile/GENERIC amd64

>Description:
The Go project CI infrastructure recently switched to using VMs with a mix of Intel and AMD-based instances (from Intel-only previously). With this change we saw a big uptick in flakiness.

Upon further investigation, we discovered that Go build/test is extremely flaky on AMD CPUs in particular, to the point that we can almost never complete the tests successfully on AMD CPUs (amd64 or 386).

This problem is being investigated in https://golang.org/issue/49209 and https://golang.org/issue/34988. (FWIW, OpenBSD/386 seems to have a similar problem). As far as I know, Intel CPUs are never affected.

In https://github.com/golang/go/issues/49209#issuecomment-984360815, Maya tested the following CPU models:

AMD 10h: OK (Turion II Neo N40L)
AMD 15h: OK
AMD 17h: NOT OK (Zen 1950X, Zen2 3600)
AMD 19h: NOT OK (Zen3 5950X)

We see a wide variety of different crashes, but the common theme is that they all appear to be due to memory corruption. In the cases where we can tell the precise value, I believe we are always seeing reads of 0 that should not be 0, but there are plenty of crashes where we can't precisely tell the problematic value.

Investigation this week has narrowed down the problem to some relationship with fork(). The corruption always occurs (in the parent process) shortly after a fork() system call. It may occur on the thread that called fork(), or on another thread.

The simplest reproducer I have currently is at https://github.com/golang/go/issues/34988#issuecomment-985874750 (available as a repo at https://github.com/prattmic/go-bsd-corruption-issue49209).

The behavior of this reproducer is:

* One goroutine calling fork() and wait4() in a loop (child immediately exits).
* One goroutine making a dummy getpid() system call in a loop.
* N.B. since these are infinite loops, each goroutine will generally be running on a dedicated OS thread, but could theoretically be moved between threads by the Go scheduler. (Adding runtime.LockOSThread() calls would force dedicated threads; the crashes still reproduce this way.)
* The syscall.Syscall functions do not just make a raw system call, they also call runtime.entersyscall and runtime.exitsyscall (https://cs.opensource.google/go/go/+/master:src/runtime/proc.go;l=3840;drc=9b0de0854d5a5655890ef0b2b9052da2541182a3) which do some synchronization with the Go scheduler. Perhaps notably, this fiddles around with TLS variables (the "g" returned by getg(), which is ultimately in FS_BASE).
* If either goroutine uses syscall.RawSyscall, which literally just makes direct system calls, then the crashes disappear.
* If Go is running single-threaded (set GOMAXPROCS=1 and apply the patch in https://github.com/golang/go/issues/34988#issuecomment-985729313), then the crashes disappear. (The Go scheduler will time-share the two goroutines in this case).

I have unfortunately not yet managed to create a reproducer in C, but I want to get a bug started here because it seems to me that there must be some interaction beyond Go at work: this reproducer makes a direct fork() system call and then memory in the parent process is corrupted. That should simply not be possible, barring some restriction on using fork() on NetBSD that I am unaware of. Even better, the child exits immediately, and the _only_ memory it should touch is the executable page holding the few instructions it runs (via instruction fetches).

I have a bunch of pet theories (missing/misbehaving TLB shootdowns when pages are zapped at fork() for later CoW? Bug in CoW that causes the parent to get a new zero page instead of a copy?), but I am not familiar with NetBSD's mm code so I've not had much luck looking into this beyond some basic dtrace'ing.

My latest analysis is in https://github.com/golang/go/issues/34988#issuecomment-985874750.
>How-To-Repeat:
You must use a machine with an AMD CPU (Zen or newer).

I have been working primarily with GCE N2D instance types (https://cloud.google.com/compute/docs/general-purpose-machines#n2d_machines), but this issue has been reproduced on AWS m5a instances, and bare-metal AMD machines.

This should reproduce with the latest Go release, Go 1.17.4. I haven't directly tested older versions, but our builders have experienced crashes back to the Go 1.10 toolchain (which is used for bootstrapping).

$ git clone https://github.com/prattmic/go-bsd-corruption-issue49209
$ GOOS=netbsd go build
$ ./fork-bug

If you do not see a crash within 60s, edit loop.go to replace the GETPID system call with the call to runtime.Gosched(). This should cause a crash much more quickly.

The most likely crash you will see is something like

entersyscall inconsistent 0xc00003a778 [0xc00003a000,0xc00003a800]
fatal error: entersyscall
... followed by a stack trace ...

But a variety of different fatal crashes are possible. This program never crashes when run on an Intel CPU.
>Fix:
