NetBSD-Bugs archive


toolchain/51094: repeatable file-system corruption: 7.99.2{7,8} i386/athlon gcc-5.3.0



>Number:         51094
>Category:       toolchain
>Synopsis:       repeatable file-system corruption: 7.99.2{7,8} i386/athlon gcc-5.3.0
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    toolchain-manager
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Tue Apr 26 02:25:00 +0000 2016
>Originator:     Jim Bernard
>Release:        NetBSD 7.99.28
>Organization:
me
>Environment:
System: NetBSD 7.99.28 (201604241117Z): Sun Apr 24 20:33:53 MDT 2016 i386
Architecture: i386
Machine: i386
>Description:

	I'm seeing corruption of newly created directories whenever
	lots of directories are created.  It first occurred during a
	recent upgrade of an i386 system from 7.99.26 (built from
	sources updated 201603050025Z) to 7.99.27 (built from sources
	updated 201604152325Z), while running the 7.99.27 kernel and
	copying the userland files.  I successfully upgraded userland
	while running an older system on another disk, after which
	the upgraded 7.99.27 system ran normally for about a week
	with light use before another large transfer of files
	corrupted and panicked it again.  Nightly fsck's of the
	mounted file systems showed no corruption during the
	intervening week.

	What I know so far:

	* It's triggered by copying a large directory tree to some
	  (ffs) file system.  Fairly early in the transfer there is
	  directory corruption at the destination and the kernel
	  panics, complaining of directory corruption.  Both pax and
	  rsync transfers have triggered it.

	* It occurs with perfect repeatability under recent 7.99.27
	  and 7.99.28 kernels (GENERIC or custom) that are optimized
	  for the AMD athlon thunderbird cpu
	  (makeoptions  CPUFLAGS="-march=athlon-tbird").

	* It doesn't occur with otherwise-identical unoptimized
	  kernels, and it's never happened with optimized kernels
	  from the 201603050025Z build (7.99.26) or earlier (going
	  back many years).

	* I've tested on two different destination disks attached to
	  two different pata channels on the system board.  The
	  results are the same for both.
	
	* Removing a large directory tree causes no trouble.  I think
	  minor updates to large trees (with cvs, git, or pax) also
	  cause no trouble, but I haven't tested that explicitly.

	* memtest86 finds no memory problems.

	I'm pointing the finger at gcc 5.3.0, hence my classification
	of this pr under "toolchain", though I suppose the tbird
	optimization could be revealing a bug elsewhere.

	Since the eventual panic is a result of prior directory
	corruption, I imagine there's nothing very useful in a crash
	dump, but just in case, here's bt and partial ps info from
	one triggered by a pax copy:

# crash -M netbsd.12.core -N netbsd.12 
Crash version 7.99.27, image version 7.99.28.
WARNING: versions differ, you may not be able to examine this image.
System panicked: /: bad dir ino 46339 at offset 60: NUL in name [d] i=1, namlen=5

Backtrace from time of crash is available.
crash> bt
_KERNEL_OPT_NARCNET(0,104,c0113eab,8,c0feee21,0,104,c100806c,db16bbdc,db16bbc0) at 0
_end(104,0,c100806c,db16bbdc,3c,db9d903c,c3f2bed4,db16bbd0,c0939e88,c100806c) at db16bbdc
vpanic(c100806c,db16bbdc,db16bc9c,c089aa8e,c100806c,c309f100,b503,0,3c,c12dafc0) at vpanic+0x134
panic(c100806c,c309f100,b503,0,3c,c12dafc0,db16bd0c,1ff,0,0) at panic+0x18
ufs_lookup(db16bcbc,c09aa61b,c3f260a4,c3f26000,db16bcd4,c09a0799,c0ef6600,c3f26000,db16bd0c,db16be7c) at ufs_lookup+0x59e
VOP_LOOKUP(c3f26000,db16bd0c,db16be7c,db16bd1c,db16bcfc,c09aa61b,c3f260a4,db16be54,db16bdf4,c30c5800) at VOP_LOOKUP+0x33
lookup_once(db16bd90,20002,4,ffffffff,db16bd40,c01528da,0,db16bd68,c08f03a0,c30c59d4) at lookup_once+0x194
namei_tryemulroot(0,1,0,c30a5000,c354d618,db16bf68,db16be54,db16be7c,20,0) at namei_tryemulroot+0x473
namei(db16be54,c30a5000,400,0,db16beb4,db16bf68,db16be98,c098fbf0,80757c4,db16be50) at namei+0x27
fd_nameiat.isra.0(80757c4,db16be50,0,f2ebb8,c3f2ebb8,1,c354d618,0,c354d618,c30a5000) at fd_nameiat.isra.0+0x58
do_sys_statat(c30c5800,ffffff9c,80757c4,0,db16beb4,c30cba8c,5,c30cb880,c30cb880,c37c8808) at do_sys_statat+0x70
sys___lstat50(c30c5800,db16bf68,db16bf60,c306d9f0,4199a000,db16bf60,db16bf68,1b9,0,0) at sys___lstat50+0x3d
syscall() at syscall+0x142
--- syscall (number 441) ---
bbaf5127:
crash> ps
PID    LID S CPU     FLAGS       STRUCT LWP *               NAME WAIT
854  >   1 7   0         0           c30c5800                pax
636      1 3   0        80           c30c52c0               bash wait
691      1 3   0        80           c30c5020                csh pause
774      1 3   0        80           c30c5d40              getty ttyraw
...
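
	For readers unfamiliar with the panic string above, the
	following is a minimal Python sketch of the kind of name
	check that would yield "NUL in name [d] i=1, namlen=5".  It
	is an illustration only, not the kernel's code: the function
	name check_dirent_name and the sample entry bytes are
	hypothetical, chosen to be consistent with the message (the
	actual on-disk bytes are not shown in this report).

```python
def check_dirent_name(d_name: bytes, namlen: int):
    """Return None if the name looks sane, else a diagnostic string.

    A directory entry records a name length (namlen); none of the first
    namlen bytes of the name may be NUL. This mirrors the idea behind the
    panic message, under the assumptions stated in the lead-in.
    """
    for i in range(namlen):
        # A byte missing from the buffer is treated like a NUL here.
        if i >= len(d_name) or d_name[i] == 0:
            # Show the name the way a C "%s" would: up to the first NUL.
            shown = d_name.split(b"\x00", 1)[0].decode("ascii", "replace")
            return f"NUL in name [{shown}] i={i}, namlen={namlen}"
    return None


# Hypothetical corrupted entry: namlen claims 5 bytes, but only "d" is set.
corrupt = check_dirent_name(b"d\x00\x00\x00\x00", 5)
# A well-formed entry for comparison.
healthy = check_dirent_name(b"hello", 5)
```

	Read this way, the message says the entry claimed a
	five-byte name but only the first byte ("d") was present
	before a NUL at index 1.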

	dmesg is below.  The disks I've demonstrated this on are wd0
	and wd1, attached to the on-board pata channels (one device
	per channel).  wd0 has some reallocated sectors, but their
	number has not increased since the first time corruption
	occurred.  wd1 has no reallocations.  There is a third disk
	attached to a sata channel on an add-on pci controller, but I
	haven't been willing to risk damage to the data on that disk.
	The cpu is clocked at its rated speed, 1.4 GHz.  No recent
	hardware changes.

NetBSD 7.99.28 (201604241117Z) #0: Sun Apr 24 20:33:53 MDT 2016
total memory = 1279 MB
avail memory = 1250 MB
timecounter: Timecounters tick every 10.000 msec
timecounter: Timecounter "i8254" frequency 1193182 Hz quality 100
System Manufacturer System Name (System Version)
mainbus0 (root)
cpu0 at mainbus0
cpu0: AMD Athlon(tm) Processor, id 0x644
pci0 at mainbus0 bus 0: configuration mode 1
pci0: This pci host supports neither MSI nor MSI-X.
pci0: i/o space, memory space enabled, rd/line, rd/mult, wr/inv ok
pchb0 at pci0 dev 0 function 0: vendor 1022 product 700e (rev. 0x13)
agp0 at pchb0: aperture at 0xf8000000, size 0x4000000
ppb0 at pci0 dev 1 function 0: vendor 1022 product 700f (rev. 0x00)
pci1 at ppb0 bus 1
pci1: This pci host supports neither MSI nor MSI-X.
pci1: i/o space, memory space enabled
vga0 at pci1 dev 5 function 0: vendor 1002 product 5159 (rev. 0x00)
wsdisplay0 at vga0 kbdmux 1: console (80x25, vt100 emulation)
wsmux1: connecting to wsdisplay0
drm at vga0 not configured
pcib0 at pci0 dev 4 function 0: vendor 1106 product 0686 (rev. 0x40)
viaide0 at pci0 dev 4 function 1
viaide0: VIA Technologies VT82C686A (Apollo KX133) ATA100 controller
viaide0: bus-master DMA support present
viaide0: primary channel configured to compatibility mode
viaide0: primary channel interrupting at irq 14
atabus0 at viaide0 channel 0
viaide0: secondary channel configured to compatibility mode
viaide0: secondary channel interrupting at irq 15
atabus1 at viaide0 channel 1
uhci0 at pci0 dev 4 function 2: vendor 1106 product 3038 (rev. 0x16)
uhci0: interrupting at irq 5
usb0 at uhci0: USB revision 1.0
uhci1 at pci0 dev 4 function 3: vendor 1106 product 3038 (rev. 0x16)
uhci1: interrupting at irq 5
usb1 at uhci1: USB revision 1.0
viaenv0 at pci0 dev 4 function 4: VIA Technologies VT82C686A Hardware Monitor
viaenv0: Hardware Monitor disabled
timecounter: Timecounter "viaenv0" frequency 3579545 Hz quality 1000
viaenv0: 24-bit timer
cmpci0 at pci0 dev 5 function 0: vendor 13f6 product 0111 (rev. 0x10)
cmpci0: interrupting at irq 10
audio0 at cmpci0: full duplex, playback, capture, mmap, independent
opl at cmpci0 not configured
mpu at cmpci0 not configured
wm0 at pci0 dev 9 function 0: Intel i82541PI 1000BASE-T Ethernet (rev. 0x05)
wm0: interrupting at irq 5
wm0: 32-bit 33MHz PCI bus
wm0: 64 words (6 address bits) MicroWire EEPROM
wm0: Ethernet address 90:e2:ba:9c:9e:7c
igphy0 at wm0 phy 1: Intel IGP01E1000 Gigabit PHY, rev. 0
igphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto
pdcsata0 at pci0 dev 11 function 0: Promise PDC20775 SATA300 controller (rev. 0x02)
pdcsata0: interrupting at irq 10
pdcsata0: bus-master DMA support present
atabus2 at pdcsata0 channel 0
atabus3 at pdcsata0 channel 1
atabus4 at pdcsata0 channel 2
ohci0 at pci0 dev 13 function 0: vendor 1033 product 0035 (rev. 0x43)
ohci0: interrupting at irq 5
ohci0: OHCI version 1.0
usb2 at ohci0: USB revision 1.0
ohci1 at pci0 dev 13 function 1: vendor 1033 product 0035 (rev. 0x43)
ohci1: interrupting at irq 11
ohci1: OHCI version 1.0
usb3 at ohci1: USB revision 1.0
ehci0 at pci0 dev 13 function 2: vendor 1033 product 00e0 (rev. 0x04)
ehci0: interrupting at irq 10
ehci0: EHCI version 1.0
ehci0: companion controllers, 3 ports each: ohci0 ohci1
usb4 at ehci0: USB revision 2.0
isa0 at pcib0
lpt0 at isa0 port 0x378-0x37b irq 7
com0 at isa0 port 0x3f8-0x3ff irq 4: ns16550a, working fifo
com1 at isa0 port 0x2f8-0x2ff irq 3: ns16550a, working fifo
pckbc0 at isa0 port 0x60-0x64
pckbd0 at pckbc0 (kbd slot)
pckbc0: using irq 1 for kbd slot
wskbd0 at pckbd0: console keyboard, using wsdisplay0
pms0 at pckbc0 (aux slot)
pckbc0: using irq 12 for aux slot
wsmouse0 at pms0 mux 0
attimer0 at isa0 port 0x40-0x43
pcppi0 at isa0 port 0x61
midi0 at pcppi0: PC speaker
sysbeep0 at pcppi0
fdc0 at isa0 port 0x3f0-0x3f7 irq 6 drq 2
attimer0: attached to pcppi0
timecounter: Timecounter "clockinterrupt" frequency 100 Hz quality 0
fd0 at fdc0 drive 0: 1.44MB, 80 cyl, 2 head, 18 sec
IPsec: Initialized Security Association Processing.
uhub0 at usb0: vendor 1106 UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub0: 2 ports with 2 removable, self powered
uhub1 at usb1: vendor 1106 UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub1: 2 ports with 2 removable, self powered
uhub2 at usb2: vendor 1033 OHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub2: 3 ports with 3 removable, self powered
uhub3 at usb3: vendor 1033 OHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub3: 2 ports with 2 removable, self powered
uhub4 at usb4: vendor 1033 EHCI root hub, class 9/0, rev 2.00/1.00, addr 1
uhub4: 5 ports with 5 removable, self powered
pdcsata0 port 0: device present, speed: 3.0Gb/s
wd0 at atabus0 drive 0
wd0: <WDC WD1600JB-00GVA0>
wd0: drive supports 16-sector PIO transfers, LBA48 addressing
wd0: 149 GB, 310101 cyl, 16 head, 63 sec, 512 bytes/sect x 312581808 sectors
wd0: 32-bit data port
wd0: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 5 (Ultra/100)
wd0(viaide0:0:0): using PIO mode 4, Ultra-DMA mode 5 (Ultra/100) (using DMA)
wd1 at atabus1 drive 0
wd1: <HDS722516VLAT80>
wd1: drive supports 16-sector PIO transfers, LBA48 addressing
wd1: 153 GB, 319120 cyl, 16 head, 63 sec, 512 bytes/sect x 321672960 sectors
wd1: 32-bit data port
wd1: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 5 (Ultra/100)
wd1(viaide0:1:0): using PIO mode 4, Ultra-DMA mode 5 (Ultra/100) (using DMA)
wd2 at atabus2 drive 0
pdcsata0:0:0: lost interrupt
	type: ata tc_bcount: 512 tc_skip: 0
wd2: <ST2000DM001-1CH164>
wd2: drive supports 16-sector PIO transfers, LBA48 addressing
wd2: 1863 GB, 3876021 cyl, 16 head, 63 sec, 512 bytes/sect x 3907029168 sectors
wd2: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133)
wd2(pdcsata0:0:0): using PIO mode 4, Ultra-DMA mode 6 (Ultra/133) (using DMA)
atapibus0 at atabus4: 2 targets
cd0 at atapibus0 drive 0: <SONY CD-RW CRX1611, , TYS3> cdrom removable
cd0: drive supports PIO mode 4, DMA mode 2
cd0(pdcsata0:2:0): using PIO mode 4, DMA mode 2 (using DMA)
boot device: wd0
root on wd0a dumps on wd0b
root file system type: ffs
kern.module.path=/stand/i386/7.99.28/modules
pdcsata0:2:0: lost interrupt
	type: atapi tc_bcount: 32 tc_skip: 0
cd0: transfer error, downgrading to PIO mode 4
cd0(pdcsata0:2:0): using PIO mode 4
wsdisplay0: screen 1 added (80x25, vt100 emulation)
wsdisplay0: screen 2 added (80x25, vt100 emulation)
wsdisplay0: screen 3 added (80x25, vt100 emulation)
wsdisplay0: screen 4 added (80x25, vt100 emulation)
wsdisplay0: screen 5 added (80x25, vt100 emulation)
wsdisplay0: screen 6 added (80x25, vt100 emulation)
wsdisplay0: screen 7 added (80x25, vt100 emulation)


>How-To-Repeat:
	Build an i386 kernel (GENERIC will do) optimized for an AMD athlon
	thunderbird cpu:

	makeoptions    CPUFLAGS="-march=athlon-tbird"

	Boot on such a machine.

	  mkdir /junk
	  pax -rw -pe <some_big_directory> /junk
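
	If no suitable large tree is at hand, a synthetic one can
	stand in for <some_big_directory>.  The sketch below is an
	assumption-laden helper, not part of the original report:
	the function name make_tree, the directory and file naming,
	and the sizes are all hypothetical, and the report does not
	say how large a tree is needed to trigger the corruption.

```python
import os


def make_tree(root, depth, fanout, files_per_dir=2):
    """Create a nested directory tree under root.

    Each level gets `fanout` subdirectories, down to `depth` levels, and
    every created directory receives `files_per_dir` small files.
    Returns the number of directories created (root itself not counted).
    """
    os.makedirs(root, exist_ok=True)
    if depth == 0:
        return 0
    created = 0
    for i in range(fanout):
        sub = os.path.join(root, f"d{i:03d}")
        os.mkdir(sub)
        created += 1
        for j in range(files_per_dir):
            # Small files; the point is many directory entries, not bulk data.
            with open(os.path.join(sub, f"f{j:03d}"), "wb") as fh:
                fh.write(b"x" * 64)
        created += make_tree(sub, depth - 1, fanout, files_per_dir)
    return created
```

	For example, make_tree("/var/tmp/bigtree", depth=3, fanout=20)
	(the path is hypothetical) creates 20 + 400 + 8000 = 8420
	directories, which can then be copied with the pax command
	above.  Run it on a file system you can afford to lose.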

>Fix:
	So far, I only have the workaround of not optimizing for the athlon
	thunderbird.  Suggestions are most welcome.


