port-arm/55737: Apparent bug in evbarm64 DMA code causes filesystem corruption

To: port-arm-maintainer%netbsd.org@localhost,gnats-admin%netbsd.org@localhost,netbsd-bugs%netbsd.org@localhost
Subject: port-arm/55737: Apparent bug in evbarm64 DMA code causes filesystem corruption
From: Richard Todd <rmtodd%servalan.servalan.com@localhost>
Date: Tue, 20 Oct 2020 03:35:00 +0000 (UTC)

>Number:         55737
>Category:       port-arm
>Synopsis:       Apparent bug in evbarm64 DMA code causes filesystem corruption
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    port-arm-maintainer
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Tue Oct 20 03:35:00 +0000 2020
>Originator:     Richard Todd
>Release:        NetBSD 9.99.72
>Organization:
	
>Environment:
System: NetBSD netbsdarm64test.servalan.com 9.99.72 NetBSD 9.99.72 (RMTGENERIC_VM_a15d1b7af98f) #3: Thu Oct 8 00:00:34 CDT 2020 sysbuild%materia.servalan.com@localhost:/amd/ichotolot/u2/netbsd-vmexport/sysbuild/Build/evbarm64/obj/amd/ichotolot/u2/netbsd-vmexport/sysbuild/src/sys/arch/evbarm/compile/RMTGENERIC_VM evbarm
Architecture: aarch64
Machine: evbarm
>Description:
  As per recent discussion on port-arm
http://mail-index.netbsd.org/port-arm/2020/10/13/msg006997.html
various people, including me, have been seeing filesystem corruption with
recent builds of NetBSD-current on evbarm64.  With a good deal of work, I
managed to get a reproducible test case and narrow the apparent culprit down
to this recent commit:

changeset:   936632:a15d1b7af98f
branch:      trunk
user:        skrll <skrll%NetBSD.org@localhost>
date:        Tue Sep 08 10:30:17 2020 +0000
files:       sys/arch/arm/arm32/bus_dma.c
extra:       branch=trunk
description:
A few bus_dmatag_subregion fixes

- return EOPNOTSUPP if min_addr isn't less than max_addr
- fix the subset check to ensure that all the ranges in the parent tag are
  within the {min,max}_addr range.  If so we can just continue to use the
  parent tag.
- when building the new ranges read the parent tag range rather than un-
  initialised memory.
- remove the max_addr != 0xffffffff check - the overflow should be handled
  by the unsigned arithmetic for arm32.
- add a KASSERT
- add comments
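
To illustrate the subset check described in the second item of that commit
message, here is a minimal user-space sketch in C.  It is not the code from
sys/arch/arm/arm32/bus_dma.c: struct dma_range, the function names and the
enum are hypothetical, assert() stands in for the kernel's KASSERT, and
treating max_addr as an inclusive bound is my assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one DMA range of a parent tag. */
struct dma_range {
	uint64_t base;	/* first bus address covered by the range */
	uint64_t len;	/* length in bytes, must be non-zero */
};

/* Hypothetical outcomes; per the commit message the kernel returns EOPNOTSUPP in the last case. */
enum subregion_result {
	SUBREGION_REUSE_PARENT,	/* every parent range already fits */
	SUBREGION_NEW_TAG,	/* a restricted tag would have to be built */
	SUBREGION_UNSUPPORTED	/* min_addr is not less than max_addr */
};

/* True when every range lies inside [min_addr, max_addr] (inclusive, by assumption). */
static bool
ranges_within(const struct dma_range *r, size_t n,
    uint64_t min_addr, uint64_t max_addr)
{
	for (size_t i = 0; i < n; i++) {
		assert(r[i].len > 0);
		/* Compute the last address as base + (len - 1) so a range
		 * ending at the top of the address space does not wrap. */
		uint64_t last = r[i].base + (r[i].len - 1);
		if (r[i].base < min_addr || last > max_addr)
			return false;
	}
	return true;
}

static enum subregion_result
subregion_decision(const struct dma_range *r, size_t n,
    uint64_t min_addr, uint64_t max_addr)
{
	if (min_addr >= max_addr)
		return SUBREGION_UNSUPPORTED;
	if (ranges_within(r, n, min_addr, max_addr))
		return SUBREGION_REUSE_PARENT;
	return SUBREGION_NEW_TAG;
}
```

With a single parent range of 0x40000000..0x4fffffff, a request for
[0, 0xffffffff] can keep using the parent tag, while a request for
[0, 0x47ffffff] cannot, because the range extends past max_addr.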

I've got a console log here ( https://pastebin.com/eHRG8jGy ) exhibiting an
example reproduction of the problem in my usual qemu testing VM.  The first
part has qemu booting with two virtual disks attached, one the "usual"
NetBSD-current disk image I've been using with that VM, and the second a
copy of said image which was going to get its / newfsed and recreated by
copying from the master.  (This was done to rule out the possibility that
these crashes were caused by preexisting corruption on the filesystem,
which was the previous theory on the port-arm mailing list regarding the
source of these crashes).  The second part is qemu running off of the "copy"
disk image, and going through a typical attempt to upgrade to an already-built
NetBSD-current release (built elsewhere, stashed on NFS on another machine)
-- first sysupgrade fetches everything, installs the new kernel, then we
boot the new kernel.  Next we try to use sysupgrade to install world, but
we hit a

 panic: ffs_blkfree: bad size: dev = 0x5c20, bno = 2432893633 bsize = 16384, size = 16384, fs = /

panic in the middle of extracting the tests tar file, and then fail
automatic fsck on reboot.


>How-To-Repeat:
      Install a kernel built from -current after the a15d1b7af98f commit,
      reboot, and attempt to install world with sysupgrade.  
>Fix:
      Revert to a kernel built from -current prior to that commit.
      Alternatively, I just tested a build of -current sources from
      Saturday (rev 1b3515410312, date: Sat Oct 17 23:23:06 2020 +0000) with
      two patches reversed: the suspect bus_dma.c patch, and the later
      uvm_bio.c patch of October 5 mentioned recently on the netbsd-current
      list (which has since been reverted in -current source).  With those
      sources, the install of kernel and world works just fine.
      

>Unformatted:

Follow-Ups:
- re: port-arm/55737: Apparent bug in evbarm64 DMA code causes filesystem corruption
  - From: matthew green
