NetBSD-Bugs archive


kern/53817: Random panics in vfs_mountroot()



>Number:         53817
>Category:       kern
>Synopsis:       Random panics in vfs_mountroot()
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sat Dec 29 21:35:00 +0000 2018
>Originator:     Andreas Gustafsson
>Release:        NetBSD-current
>Organization:

>Environment:
System: NetBSD
Architecture: aarch64
Machine: evbarm
>Description:

The automated evbarm-aarch64 test runs on lyta.netbsd.org have ended
with a panic in vfs_mountroot in 36 out of 98 runs:

  lyta /bracket/evbarm-aarch64/results/2018 $ zgrep vfs_mountroot */test.log.gz
  2018.10.26.22.10.15/test.log.gz:[   2.1539321] fp ffffffc000979e00 vfs_mountroot() at ffffffc00043b524 netbsd:vfs_mountroot+0x214
  2018.10.27.13.20.21/test.log.gz:[   2.3229166] fp ffffffc000979e00 vfs_mountroot() at ffffffc00043b4f4 netbsd:vfs_mountroot+0x214
  2018.10.28.00.44.37/test.log.gz:[   2.1098840] fp ffffffc000979e00 vfs_mountroot() at ffffffc00043b4f4 netbsd:vfs_mountroot+0x214
  [...]
  2018.12.24.08.40.33/test.log.gz:[   2.4557637] fp ffffffc000b8adf0 vfs_mountroot() at ffffffc0004581a4 netbsd:vfs_mountroot+0x214
  2018.12.24.22.47.19/test.log.gz:[   2.3375370] fp ffffffc000b8bdf0 vfs_mountroot() at ffffffc00045a0c4 netbsd:vfs_mountroot+0x214
  2018.12.25.11.56.14/test.log.gz:[   2.4641909] fp ffffffc000b8bdf0 vfs_mountroot() at ffffffc00045a0c4 netbsd:vfs_mountroot+0x214
  2018.12.27.04.54.03/test.log.gz:[   2.4717805] fp ffffffc000b8bdf0 vfs_mountroot() at ffffffc00045a054 netbsd:vfs_mountroot+0x214

Console output from the first recorded failure:

  [   2.0451998] mountroot: trying ffs...
  [   2.0561633] root file system type: ffs
  [   2.0561633] panic: Trap: Data Abort (EL1): Translation Fault L1 with read access for 0000000000000070: pc ffffffc00043b524: opcode f9403800: ldr x0, [x0,#112]

  [   2.0672672] cpu0: Begin traceback...
  [   2.0672672] trace fp ffffffc000979b50
  [   2.0773923] fp ffffffc000979b70 vpanic() at ffffffc0003efd00 netbsd:vpanic+0x190
  [   2.0773923] fp ffffffc000979bd0 panic() at ffffffc0003efdcc netbsd:panic+0x44
  [   2.0889589] fp ffffffc000979c60 data_abort_handler() at ffffffc00005fcf8 netbsd:data_abort_handler+0x480
  [   2.0889589] tf ffffffc000979cd0 el1_trap() at ffffffc00005d3f8 netbsd:el1_trap
  [   2.1014251] ---- trapframe 0xffffffc000979cd0 (304 bytes) ----
  [   2.1014251]	   pc=ffffffc00043b524,	  spsr=0000000060000005
  [   2.1014251]	  esr=0000000096000005,	   far=0000000000000070
  [   2.1114037]	   x0=0000000000000000,	    x1=ffff00005fab2bd0
  [   2.1114037]	   x2=0000000000000004,	    x3=00000000000000f0
  [   2.1114037]	   x4=ffffffc000820558,	    x5=0000000000000000
  [   2.1114037]	   x6=0000000000000004,	    x7=0000000000000001
  [   2.1225739]	   x8=0000000000000004,	    x9=ffffffc000979da0
  [   2.1225739]	  x10=ffff00005f9f9f58,	   x11=000000000000003f
  [   2.1225739]	  x12=fffffc00017e7e7c,	   x13=fffffc00017e7e80
  [   2.1225739]	  x14=ffffffffffffffe8,	   x15=ffff00005f9fa000
  [   2.1338631]	  x16=0000000000000021,	   x17=0000000000000083
  [   2.1338631]	  x18=0000000000001000,	   x19=0000000000000000
  [   2.1338631]	  x20=ffffffc000a1c000,	   x21=ffffffc000a8bce8
  [   2.1338631]	  x22=ffffffc000854a00,	   x23=ffffffc0006ef800
  [   2.1449961]	  x24=ffffffc000856000,	   x25=0000000000000000
  [   2.1449961]	  x26=ffffffc000a1c000,	   x27=ffffffc000971000
  [   2.1449961]	  x28=0000000000000000, fp=x29=ffffffc000979e00
  [   2.1449961] lr=x30=ffffffc00043b518,	    sp=ffffffc000979e00
  [   2.1449961] ------------------------------------------------
  [   2.1539321] fp ffffffc000979e00 vfs_mountroot() at ffffffc00043b524 netbsd:vfs_mountroot+0x214
  [   2.1539321] fp ffffffc000979e60 main() at ffffffc00050f3e4 netbsd:main+0x44c
  [   2.1539321] fp 0000000000000000 aarch64_start() at ffffffc00000183c netbsd:aarch64_start+0x103c
  [   2.1675298] cpu0: End traceback...
  Stopped in pid 0.1 (system) at	netbsd:cpu_Debugger+0x4:	ret

Disassembling around the faulting instruction (0x214 is decimal 532):

   0xffffffc000443824 <vfs_mountroot+508>:      bl  0xffffffc000451210 <vref>
   0xffffffc000443828 <vfs_mountroot+512>:      ldr     x0, [x21, #80]
   0xffffffc00044382c <vfs_mountroot+516>:      bl  0xffffffc00045d388 <VOP_UNLOCK>
   0xffffffc000443830 <vfs_mountroot+520>:      str     xzr, [x22, #8]
   0xffffffc000443834 <vfs_mountroot+524>:      ldr     x0, [x20, #2376]
   0xffffffc000443838 <vfs_mountroot+528>:      ldr     x1, [x21, #80]
   0xffffffc00044383c <vfs_mountroot+532>:      ldr     x0, [x0, #112]
   0xffffffc000443840 <vfs_mountroot+536>:      str     x1, [x0]
   0xffffffc000443844 <vfs_mountroot+540>:      ldr     x0, [x20, #2376]
   0xffffffc000443848 <vfs_mountroot+544>:      ldr     x0, [x0, #112]
   0xffffffc00044384c <vfs_mountroot+548>:      ldr     x0, [x0]
   0xffffffc000443850 <vfs_mountroot+552>:      bl  0xffffffc000451210 <vref>

This looks like the source line

                /*
                 * Now that root is mounted, we can fixup initproc's CWD
                 * info.  All other processes are kthreads, which merely
                 * share proc0's CWD info.
                 */
                initproc->p_cwdi->cwdi_cdir = rootvnode;

and indeed the offset #112 in the faulting instruction matches the offset
of the p_cwdi field in struct proc.  Also, x0 is zero in the trap frame
and the faulting address is 0x70 (decimal 112), which is exactly the
address a load of p_cwdi through a null struct proc pointer would touch.
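
As a sanity check of that arithmetic, here is a minimal user-space sketch.
The struct proc_model layout and the helper p_cwdi_load_address() are
hypothetical stand-ins, not the real struct proc; only the offset of 112
bytes is taken from the disassembly above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for struct proc.  Only the offset of p_cwdi
 * (112 bytes) comes from the report; the padding replaces whatever
 * fields the real structure has before it.
 */
struct proc_model {
	unsigned char pad[112];	/* placeholder for the preceding fields */
	void *p_cwdi;
};

/* Compile-time check that the model really places p_cwdi at offset 112. */
_Static_assert(offsetof(struct proc_model, p_cwdi) == 112,
    "model must place p_cwdi at offset 112");

/*
 * Address that a load of base->p_cwdi would touch.  This is plain
 * integer arithmetic, so nothing is ever dereferenced.
 */
static uintptr_t
p_cwdi_load_address(uintptr_t base)
{
	/* Guard against wrap-around for very large base addresses. */
	assert(base <= UINTPTR_MAX - offsetof(struct proc_model, p_cwdi));
	return base + offsetof(struct proc_model, p_cwdi);
}
```

With a base of 0, p_cwdi_load_address() yields 0x70, the same value as
the far register in the trap frame.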

This looks like a race condition to me: main() calls fork1() to create
process 1 and then calls vfs_mountroot() which uses the global
variable initproc, but initproc is initialized in start_init() which
is executed by process 1, so it may or may not have been initialized
by the time it is used by vfs_mountroot().
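
To make the suspected ordering problem concrete, here is a deterministic
user-space model of it.  It is only a sketch of my reading of the code,
not kernel code: all the *_model names and boot_order_survives() are
made up for illustration, and the scheduler is reduced to a single flag
that says whether process 1 gets to run before main() reaches
vfs_mountroot().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins; the real kernel structures have many more fields. */
struct vnode_model { int v_usecount; };
struct cwdinfo_model { struct vnode_model *cwdi_cdir; };
struct proc_model { struct cwdinfo_model *p_cwdi; };

/* Global that stays NULL until the model of start_init() has run. */
static struct proc_model *initproc_model;

/* Models process 1 executing start_init(): publish the global. */
static void
start_init_model(struct proc_model *self)
{
	initproc_model = self;
}

/*
 * Models the faulting statement in vfs_mountroot().  Returns false
 * where the kernel would have taken the data abort, that is, when the
 * global is still NULL.
 */
static bool
fixup_init_cwd_model(struct vnode_model *rootvnode)
{
	if (initproc_model == NULL)
		return false;
	initproc_model->p_cwdi->cwdi_cdir = rootvnode;
	return true;
}

/*
 * Replays one of the two possible orderings after fork1() and reports
 * whether the CWD fixup would have survived.
 */
static bool
boot_order_survives(bool init_runs_first)
{
	struct vnode_model root = { 1 };
	struct cwdinfo_model cwdi = { NULL };
	struct proc_model proc1 = { &cwdi };
	bool ok;

	initproc_model = NULL;			/* state right after fork1() */
	if (init_runs_first)
		start_init_model(&proc1);	/* process 1 scheduled first */
	ok = fixup_init_cwd_model(&root);	/* main() in vfs_mountroot() */
	if (!init_runs_first)
		start_init_model(&proc1);	/* process 1 scheduled too late */

	/* When the fixup ran, the CWD must now point at the root vnode. */
	assert(!ok || cwdi.cwdi_cdir == &root);
	initproc_model = NULL;			/* do not keep a stack pointer */
	return ok;
}
```

In this model boot_order_survives(true) succeeds and
boot_order_survives(false) reports the failure, which would fit a panic
that shows up in some runs but not in others.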

I suspect the bug may have been introduced by:

  revision 1.496
  date: 2018-04-16 17:18:16 +0300;  author: kamil;  state: Exp;  lines: +5 -3;  commitid: YY8XhMArR8bEOFyA;
  Set initproc inside start_init()

  This allows us to stop using the rnewprocp argument in fork1(9).

  The rnewprocp argument will be removed soon from the API, as it can cause
  use-after-free scenarios.

  No functional change intended.

  Noted by <Mateusz Guzik>
  Reference: http://mail-index.netbsd.org/tech-kern/2017/09/08/msg022267.html

  Sponsored by <The NetBSD Foundation>

Although I have only seen this happen on evbarm-aarch64 under qemu,
the bug looks machine independent, so I'm filing it in category kern.

>How-To-Repeat:

>Fix:


