kern/41024: wapbl causes file system corruption

To: kern-bug-people%netbsd.org@localhost,gnats-admin%netbsd.org@localhost,netbsd-bugs%netbsd.org@localhost
Subject: kern/41024: wapbl causes file system corruption
From: apb%cequrux.com@localhost
Date: Mon, 16 Mar 2009 07:10:00 +0000 (UTC)

>Number:         41024
>Category:       kern
>Synopsis:       wapbl causes file system corruption
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Mon Mar 16 07:10:00 +0000 2009
>Originator:     Alan Barrett
>Release:        NetBSD 5.99.8
>Organization:
Not much
>Environment:
System: NetBSD 5.99.8 i386
>Description:
I have an external USB disk that I use for backups.  Very frequently,
while attempting to make a backup, the system panics, usually with a
message like this:

    /mnt: bad dir ino 16170501 at offset 0: mangled entry
    panic: bad dir

The file system is ffs+wapbl on cgd.  The kernel includes the recent
change to make cgd pass the DIOCCACHESYNC ioctl through to the
underlying disk (see PR 41016).

Backups are made using rsync, but I can replicate the panics simply
using find(1) to read the file system.

The disk and its parents are attached as follows:

    pci0 at mainbus0 bus 0: configuration mode 1
    pci0: i/o space, memory space enabled, rd/line, rd/mult, wr/inv ok
    ehci0 at pci0 dev 29 function 7: vendor 0x8086 product 0x27cc (rev. 0x01)
    ehci0: interrupting at ioapic0 pin 20
    ehci0: EHCI version 1.0
    ehci0: companion controllers, 2 ports each: uhci0 uhci1 uhci2 uhci3
    usb4 at ehci0: USB revision 2.0
    uhub2 at usb4: vendor 0x8086 EHCI root hub, class 9/0, rev 2.00/1.00, addr 1
    uhub2: 8 ports with 8 removable, self powered
    umass0 at uhub2 port 6 configuration 1 interface 0
    umass0: Western Digital External HDD, rev 2.00/2.06, addr 5
    umass0: using SCSI over Bulk-Only
    scsibus0 at umass0: 2 targets, 1 lun per target
    sd0 at scsibus0 target 0 lun 0: <WD, 3200JB External, 0107> disk fixed
    sd0: fabricating a geometry
    sd0: 298 GB, 305245 cyl, 64 head, 32 sec, 512 bytes/sect x 625142448 sectors
    sd0: fabricating a geometry

Most of the disk space is allocated to the sd0e partition, which is
configured as a cgd device (cgd2 in the following description).

The whole of the cgd2 device is allocated to the cgd2a file system,
which is formatted as ffs+wapbl.

>How-To-Repeat:
Ensure that the file system is clean:

    $ fsck -f -y -P /dev/rcgd2a 
        [fsck fixes several problems from a previous crash]
    $ fsck -f -y -P /dev/rcgd2a
        [no problems]

Verify that mounting without wapbl does not cause problems:

    $ mount -o nolog /dev/cgd2a /mnt
    $ find /mnt -type d -print | tail
        [no problems]
    $ umount /mnt
    $ fsck -f -y -P /dev/rcgd2a
        [no problems]

Verify that mounting with wapbl + noatime does not cause problems:

    $ mount -o log,noatime /dev/cgd2a /mnt
    $ find /mnt -type d -print | tail
        [no problems]
    $ umount /mnt
    $ fsck -f -y -P /dev/rcgd2a
        [no problems]

Verify that mounting with wapbl causes a crash:

    $ mount -o log /dev/cgd2a /mnt
    $ find /mnt -type d -print | tail

    /mnt: bad dir ino 16170501 at offset 0: mangled entry
    panic: bad dir
    [...]
    Stopped in pid 14915.1 (find) [...]
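
The three trials above differ only in their mount options, so they can
be driven by one small helper.  What follows is a sketch and was not run
as part of this report: the function name repro_cmds is made up for
illustration, and it only prints the commands (all taken from the steps
above) so that they can be reviewed before being piped to sh as root.

```shell
#!/bin/sh
# Hypothetical helper: print the command sequence for one trial.
# $1 = mount options (e.g. "nolog"), $2 = block device,
# $3 = raw device, $4 = mount point.
# Nothing is executed here; pipe the output to sh to actually run it.
repro_cmds() {
    opts=$1 dev=$2 rdev=$3 mnt=$4
    printf 'fsck -f -y -P %s\n' "$rdev"
    printf 'mount -o %s %s %s\n' "$opts" "$dev" "$mnt"
    printf 'find %s -type d -print | tail\n' "$mnt"
    printf 'umount %s\n' "$mnt"
    printf 'fsck -f -y -P %s\n' "$rdev"
}

# The three option sets tried in this report, in the same order.
for opts in nolog log,noatime log; do
    repro_cmds "$opts" /dev/cgd2a /dev/rcgd2a /mnt
    echo
done
```

With the last option set (plain "log") the sequence is expected to stop
at the find step, because that is where the panic occurs.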

Reboot and examine the crash dump:

$ crash -M netbsd.203.core -N netbsd.203
crash> bt
[...]
panic(c0ad6b07,ce09d0f8,f6be05,0,0,c0a7f762,200,d771967c,0,dbcaf000) at 0xc06f25fa
ufs_dirbad(dadc3e84,0,c0a7f762,0,cca07aa8,0,0,0,0,c4335218) at 0xc076dc1a
ufs_lookup(cca07ad8,1,cca07adc,c07e372d,d771967c,c0a51140,d771967c,cca07c14,cca07c28,d771967c) at 0xc076e60b
VOP_LOOKUP(d771967c,cca07c14,cca07c28,c07f4890,cca07b1c,1000,1,0,20,0) at 0xc07f56ec
lookup(cca07c00,20002,400,cca07c1c,1,c3a88000,cca07c1c,c0771adf,c0bab0a0,c0b3f5c0) at 0xc07d850c
namei(cca07c00,e00,cca07bfc,c07e372d,d771967c,ce09d000,cca07c1c,bb916090,0,0) at 0xc07d8c0d
do_sys_stat(bb916090,0,cca07c68,cca07c98,0,0,0,ceb57760,cca07c90,1) at 0xc07dff47
sys___lstat50(cf027000,cca07d00,cca07d28,bb916090,bb9160ac,bfbfeb98,bbb2a66d,1,bb268c80,804ee6c) at 0xc07dffac
syscall(cca07d48,bb9200b3,bb9000ab,bb90001f,bfbf001f,bb9160ac,bb916040,bfbfebf8,bbbc21dc,bb916040) at 0xc0711e7d
crash> quit

Examine the file system:

$ fsdb -f /dev/rcgd2a
fsdb (inum: 2)> inode 16170501
current inode: directory
I=16170501 MODE=40755 SIZE=512
        MTIME=Mar 30 23:23:50 2005 [0 nsec]
        CTIME=Feb 27 15:21:12 2009 [658332227 nsec]
        ATIME=Mar 15 21:09:23 2009 [866735678 nsec]
OWNER=apb GRP=apb LINKCNT=2 FLAGS=0x0 BLKCNT=0x4 GEN=0x72b4af9f
fsdb (inum: 16170501)> ls
fsdb (inum: 16170501)> blks
I=16170501 4 blocks
Direct blocks:
0: 65573431
fsdb (inum: 16170501)> quit
*** FILE SYSTEM MARKED DIRTY
*** BE SURE TO RUN FSCK TO CLEAN UP ANY DAMAGE
*** IF IT WAS MOUNTED, RE-MOUNT WITH -u -o reload

$ dd if=/dev/rcgd2a bs=512 skip=65573431 count=1 | hexdump -C
1+0 records in
1+0 records out
512 bytes transferred in 0.010 secs (51200 bytes/sec)
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000200
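
When reading a block with dd, the units matter.  The sketch below
assumes (this is not verified anywhere in the report) that the block
number printed by the fsdb "blks" command is an FFS fragment address
rather than a 512-byte sector number.  Under that assumption, the
matching dd skip value is the fragment number shifted left by the
fsbtodb value that dumpfs reports (2 on this file system).  The function
name frag_to_sector is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: convert an FFS fragment address to a 512-byte
# sector number, given the superblock's fsbtodb shift.
# Assumption: the input really is a fragment address.
frag_to_sector() {
    echo $(( $1 << $2 ))
}

frag=65573431   # direct block 0 of inode 16170501, from fsdb
shift_bits=2    # "fsbtodb 2" from dumpfs -s
sector=$(frag_to_sector "$frag" "$shift_bits")

# Print the dd command rather than running it, so it can be checked first.
# count=4 covers one 2048-byte fragment (fsize 2048 / 512).
echo "dd if=/dev/rcgd2a bs=512 skip=$sector count=4 | hexdump -C"
```

If the assumption holds, comparing the output of the printed command
with the all-zero sector shown above would show whether the directory
block itself is zeroed.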

$ tunefs -N /dev/rcgd2a
tunefs: tuning /dev/rcgd2a
tunefs: current settings of /dev/rcgd2a
        maximum contiguous block count 4
        maximum blocks per file in a cylinder group 4096
        minimum percentage of free space 5%
        optimization preference: time
        average file size: 16384
        expected number of files per directory: 64
        journal log file location: in filesystem
        journal log file size: 64MB (67108864 bytes)
        journal log flags:
tunefs: no changes made

$ dumpfs -s /dev/rcgd2a
file system: /dev/rcgd2a
endian  little-endian
magic   11954 (UFS1)    time    Sun Mar 15 22:40:18 2009
superblock location     8192    id      [ 4960ee56 5d043c18 ]
cylgrp  dynamic inodes  4.4BSD  sblock  FFSv2   fslevel 4
nbfree  11412090        ndir    642379  nifree  30237094        nffree  947460
ncg     1391    size    131285595       blocks  129238024
bsize   16384   shift   14      mask    0xffffc000
fsize   2048    shift   11      mask    0xfffff800
frag    8       shift   3       fsbtodb 2
bpg     11798   fpg     94384   ipg     23296
minfree 5%      optim   time    maxcontig 4     maxbpg  4096
symlinklen 60   contigsumsize 4
maxfilesize 0x000400400402ffff
nindir  4096    inopb   128
avgfilesize 16384       avgfpdir 64
sblkno  8       cblkno  16      iblkno  24      dblkno  1480
sbsize  2048    cgsize  16384
csaddr  1480    cssize  22528
cgrotor 0       fmod    0       ronly   0       clean   0x00
wapbl version 0x1       location 2      flags 0x0
wapbl loc0 262179520    loc1 131072     loc2 512        loc3 3
flags   none
fsmnt   /mnt
volname         swuid   0
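
The journal and size figures in the tunefs and dumpfs output above can
be cross-checked with a little arithmetic.  The sketch below assumes
that loc1 is the journal length in journal blocks, that loc2 is the
journal block size in bytes, and that "size" is counted in fragments of
fsize bytes; the report itself does not state the meaning of these
fields.

```shell
#!/bin/sh
# Values copied from the dumpfs -s output above.
loc1=131072       # assumed: journal length, in journal blocks
loc2=512          # assumed: journal block size, in bytes
size=131285595    # file system size, assumed to be in fragments
fsize=2048        # fragment size, in bytes

# Journal size in bytes, for comparison with the tunefs -N output.
journal_bytes=$(( loc1 * loc2 ))

# File system size in 512-byte sectors, for comparison with the
# sector count that the kernel reported for sd0.
fs_sectors=$(( size * (fsize / 512) ))

echo "journal: $journal_bytes bytes ($(( journal_bytes / 1048576 )) MB)"
echo "file system: $fs_sectors sectors of 512 bytes"
```

Under these assumptions the journal size agrees with the "64MB
(67108864 bytes)" printed by tunefs -N, and the file system is smaller
than the 625142448 sectors reported for sd0, as expected for a partition
that uses most but not all of the disk.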

>Fix:
Unknown.  I will keep the crash dump for some weeks or months, and I
will keep the file system unmodified for a few days in case anybody
needs information from them.
