NetBSD-Bugs archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

kern/38643: [dM] st tape drive loses data



>Number:         38643
>Category:       kern
>Synopsis:       [dM] st tape drive loses data
>Confidential:   no
>Severity:       critical
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Mon May 12 20:40:02 +0000 2008
>Originator:     der Mouse
>Release:        NetBSD 4.0
>Organization:
        Dis-
>Environment:
System: not booted at the moment; see below
Architecture: i386
Machine: i386
>Description:
        At work, we have a machine that's been running NetBSD/i386 3.0
        for some time, and it's been working fine.  One of the things
        it does is run amanda, dumping to "virtual tapes" (files on
        disk) on a RAIDed filesystem (firmware RAID, not RAIDframe - a
        3ware).  Another thing it does is to dump some of the resulting
        files onto real tapes.

        Recently, I put a 4.0 kernel and userland on it.  Last night,
        it did a dump-to-tape run, which failed oddly.  On
        investigating, I find there's a very peculiar problem with the
        tape drive, which I can only ascribe to the switch to 4.0.  (It
        showed up when testing manually, and disappeared upon switching
        back to the 3.0 kernel.)

        Specifically, on a tape written with 10KB records, an attempt
        to read 1048577 bytes reads one (10KB) record, returns it to
        userland, and throws away the next 15 records.  (Attempts to
        read the tape with blocksize 10240 work fine; the data actually
        are on the tape.)

        The tape drive is

        ahc1 at pci4 dev 2 function 0: Adaptec 2940 Pro Ultra SCSI adapter
        ahc1: interrupting at ioapic0 pin 19 (irq 5)
        ahc1: aic7880: Ultra Wide Channel A, SCSI Id=7, 16/253 SCBs
        scsibus0 at ahc1: 16 targets, 8 luns per target
        ...
        st0 at scsibus0 target 5 lun 0: <QUANTUM, SDLT600, 1E1E> tape removable
        st0: drive empty
        st0: sync (50.00ns offset 8), 16-bit (40.000MB/s) transfers

        and it works fine as far as I can tell under 3.0 (in particular
        it does not exhibit this misbehaviour).  Full dmesg included
        below.

        The kernel is the 4.0 GENERIC kernel,
        
builds@wb34:/home/builds/ab/netbsd-4-0-RELEASE/i386/200712160005Z-obj/home/builds/ab/netbsd-4-0-RELEASE/src/sys/arch/i386/compile/GENERIC
        MD5 f201f3213ba5886a9b6f05a5492c6172.

        See how-to-repeat - or my mail to tech-kern, 
        Message-Id: 
<200804292303.TAA20392%Sparkle.Rodents.Montreal.QC.CA@localhost>,
        for more.

        Here's a full dmesg from a 4.0 boot on the hardware in
        question:

        Copyright (c) 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 
2005,
            2006, 2007
            The NetBSD Foundation, Inc.  All rights reserved.
        Copyright (c) 1982, 1986, 1989, 1991, 1993
            The Regents of the University of California.  All rights reserved.
        
        NetBSD 4.0 (GENERIC) #0: Sun Dec 16 00:20:10 PST 2007
                
builds@wb34:/home/builds/ab/netbsd-4-0-RELEASE/i386/200712160005Z-obj/home/builds/ab/netbsd-4-0-RELEASE/src/sys/arch/i386/compile/GENERIC
        total memory = 1023 MB
        rbus: rbus_min_start set to 0x40000000
        avail memory = 995 MB
        timecounter: Timecounters tick every 10.000 msec
        timecounter: Timecounter "i8254" frequency 1193182 Hz quality 100
        BIOS32 rev. 0 found at 0xfd6e0
        mainbus0 (root)
        cpu0 at mainbus0: apid 0 (boot processor)
        cpu0: Intel Xeon (686-class), 2790.83 MHz, id 0xf25
        cpu0: features 
bfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR>
        cpu0: features bfebfbff<PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX>
        cpu0: features bfebfbff<FXSR,SSE,SSE2,SS,HTT,TM,SBF>
        cpu0: features2 4400<CID,xTPR>
        cpu0: "Intel(R) Xeon(TM) CPU 2.80GHz"
        cpu0: I-cache 12K uOp cache 8-way, D-cache 8 KB 64B/line 4-way
        cpu0: L2 cache 512 KB 64B/line 8-way
        cpu0: ITLB 4K/4M: 64 entries
        cpu0: DTLB 4K/4M: 64 entries
        cpu0: enabling thermal monitor 1 ... enabled.
        cpu0: calibrating local timer
        cpu0: apic clock running at 132 MHz
        cpu0: 16 page colors
        cpu1 at mainbus0: apid 1 (application processor)
        cpu1: not started
        ioapic0 at mainbus0 apid 2 (I/O APIC)
        ioapic0: pa 0xfec00000, version 20, 24 pins
        ioapic1 at mainbus0 apid 3 (I/O APIC)
        ioapic1: pa 0xfec80000, version 20, 24 pins
        ioapic2 at mainbus0 apid 4 (I/O APIC)
        ioapic2: pa 0xfec80400, version 20, 24 pins
        acpi0 at mainbus0: Advanced Configuration and Power Interface
        acpi0: using Intel ACPI CA subsystem version 20060217
        acpi0: X/RSDT: OemId <PTLTD ,  RSDT  ,06040000>, AslId < LTP,00000000>
        acpi0: SCI interrupting at int 9
        acpi0: fixed-feature power button present
        timecounter: Timecounter "ACPI-Fast" frequency 3579545 Hz quality 1000
        ACPI-Fast 24-bit timer
        ACPI Object Type 'Processor' (0x0c) at acpi0 not configured
        ACPI Object Type 'Processor' (0x0c) at acpi0 not configured
        ACPI Object Type 'Processor' (0x0c) at acpi0 not configured
        ACPI Object Type 'Processor' (0x0c) at acpi0 not configured
        ACPI Object Type 'Processor' (0x0c) at acpi0 not configured
        ACPI Object Type 'Processor' (0x0c) at acpi0 not configured
        ACPI Object Type 'Processor' (0x0c) at acpi0 not configured
        ACPI Object Type 'Processor' (0x0c) at acpi0 not configured
        PNP0A03 [PCI/PCI-X Host Bridge] at acpi0 not configured
        PNP0C0F [PCI interrupt link device] at acpi0 not configured
        PNP0C0F [PCI interrupt link device] at acpi0 not configured
        PNP0C0F [PCI interrupt link device] at acpi0 not configured
        PNP0C0F [PCI interrupt link device] at acpi0 not configured
        PNP0C0F [PCI interrupt link device] at acpi0 not configured
        PNP0C0F [PCI interrupt link device] at acpi0 not configured
        INT0800 [Intel FWH Random Number Generator] at acpi0 not configured
        PNP0C02 [Plug and Play motherboard register resources] at acpi0 not 
configured
        PNP0200 [AT DMA Controller] at acpi0 not configured
        npx1 at acpi0 (PNP0C04)
        npx1: io 0xf0-0xfe irq 13
        npx1: reported by CPUID; using exception 16
        PNP0000 [AT Interrupt Controller] at acpi0 not configured
        PNP0B00 [AT Real-Time Clock] at acpi0 not configured
        pcppi1 at acpi0 (PNP0800)
        pcppi1: io 0x61
        pcppi1: children must have an explicit unit
        midi0 at pcppi1: PC speaker (CPU-intensive output)
        sysbeep0 at pcppi1
        attimer1 at acpi0 (PNP0100): AT Timer
        attimer1: io 0x40-0x43 irq 0
        PNP0A06 [Generic Container Device] at acpi0 not configured
        pckbc1 at acpi0 (PNP0303): kbd port
        pckbc1: io 0x60,0x64 irq 1
        PNP0700 [PC standard floppy disk controller] at acpi0 not configured
        PNP0501 [16550A-compatible COM port] at acpi0 not configured
        PNP0501 [16550A-compatible COM port] at acpi0 not configured
        PNP0401 [ECP printer port] at acpi0 not configured
        acpibut0 at acpi0 (PNP0C0C): ACPI Power Button
        pcppi1: attached to attimer1
        pckbd0 at pckbc1 (kbd slot)
        pckbc1: using irq 1 for kbd slot
        wskbd0 at pckbd0: console keyboard
        pci0 at mainbus0 bus 0: configuration mode 1
        pci0: i/o space, memory space enabled, rd/line, rd/mult, wr/inv ok
        pchb0 at pci0 dev 0 function 0
        pchb0: Intel E7501 MCH Host (rev. 0x01)
        Intel E7500 MCH DRAM Controller (undefined subclass 0x00, revision 
0x01) at pci0 dev 0 function 1 not configured
        ppb0 at pci0 dev 2 function 0: Intel E7500 MCH HI_B vppb 1 (rev. 0x01)
        pci1 at ppb0 bus 1
        pci1: i/o space, memory space enabled
        Intel 82870P2 P64H2 IOxAPIC (interrupt system, interface 0x20, revision 
0x04) at pci1 dev 28 function 0 not configured
        ppb1 at pci1 dev 29 function 0: Intel 82870P2 P64H2 PCI-PCI Bridge 
(rev. 0x04)
        pci2 at ppb1 bus 2
        pci2: i/o space, memory space enabled
        Intel 82870P2 P64H2 IOxAPIC (interrupt system, interface 0x20, revision 
0x04) at pci1 dev 30 function 0 not configured
        ppb2 at pci1 dev 31 function 0: Intel 82870P2 P64H2 PCI-PCI Bridge 
(rev. 0x04)
        pci3 at ppb2 bus 3
        pci3: i/o space, memory space enabled
        twe0 at pci3 dev 1 function 0: 3ware Escalade
        twe0: interrupting at ioapic2 pin 0 (irq 10)
        twe0: 12 ports, Firmware FE7S 1.05.00.065, BIOS BE7X 1.08.00.048
        twe0: Monitor ME7X 1.01.00.038, PCB Rev5    , Achip 3.20    , Pchip 
1.30-66 
        twe0: port 0: ST3300831AS                              286168 MB
        twe0: port 1: ST3300831AS                              286168 MB
        twe0: port 2: ST3300831AS                              286168 MB
        twe0: port 3: ST3300831AS                              286168 MB
        twe0: port 4: ST3300831AS                              286168 MB
        twe0: port 5: ST3300831AS                              286168 MB
        twe0: port 6: ST3300831AS                              286168 MB
        twe0: port 7: ST3300831AS                              286168 MB
        twe0: port 8: ST3300831AS                              286168 MB
        twe0: port 9: ST3300831AS                              286168 MB
        twe0: port 10: ST3300831AS                              286168 MB
        twe0: port 11: ST3300831AS                              286168 MB
        ld0 at twe0 unit 0: 64K stripe RAID5, status: Normal
        ld0: 1956 GB, 255368 cyl, 255 head, 63 sec, 512 bytes/sect x 4102491904 
sectors
        ld1 at twe0 unit 2: 1024K stripe RAID0, status: Normal
        ld1: 838 GB, 109443 cyl, 255 head, 63 sec, 512 bytes/sect x 1758210048 
sectors
        ld2 at twe0 unit 9: JBOD, status: Normal
        ld2: 279 GB, 36481 cyl, 255 head, 63 sec, 512 bytes/sect x 586072368 
sectors
        Intel E7500 MCH HI_B vppb 2 (undefined subclass 0x00, revision 0x01) at 
pci0 dev 2 function 1 not configured
        uhci0 at pci0 dev 29 function 0: Intel 82801CA USB Controller (rev. 
0x02)
        uhci0: interrupting at ioapic0 pin 16 (irq 10)
        usb0 at uhci0: USB revision 1.0
        uhub0 at usb0
        uhub0: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
        uhub0: 2 ports with 2 removable, self powered
        uhci1 at pci0 dev 29 function 1: Intel 82801CA USB Controller (rev. 
0x02)
        uhci1: interrupting at ioapic0 pin 19 (irq 5)
        usb1 at uhci1: USB revision 1.0
        uhub1 at usb1
        uhub1: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
        uhub1: 2 ports with 2 removable, self powered
        uhci2 at pci0 dev 29 function 2: Intel 82801CA USB Controller (rev. 
0x02)
        uhci2: interrupting at ioapic0 pin 18 (irq 10)
        usb2 at uhci2: USB revision 1.0
        uhub2 at usb2
        uhub2: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
        uhub2: 2 ports with 2 removable, self powered
        ppb3 at pci0 dev 30 function 0: Intel 82801BA Hub-PCI Bridge (rev. 0x42)
        pci4 at ppb3 bus 4
        pci4: i/o space, memory space enabled
        ahc1 at pci4 dev 2 function 0: Adaptec 2940 Pro Ultra SCSI adapter
        ahc1: interrupting at ioapic0 pin 19 (irq 5)
        ahc1: aic7880: Ultra Wide Channel A, SCSI Id=7, 16/253 SCBs
        scsibus0 at ahc1: 16 targets, 8 luns per target
        vga1 at pci4 dev 3 function 0: ATI Technologies Rage XL (rev. 0x27)
        wsdisplay0 at vga1 kbdmux 1: console (80x25, vt100 emulation), using 
wskbd0
        wsmux1: connecting to wsdisplay0
        fxp0 at pci4 dev 4 function 0: i82550 Ethernet, rev 13
        fxp0: interrupting at ioapic0 pin 20 (irq 11)
        fxp0: Ethernet address 00:0e:0c:4d:ec:5b
        inphy0 at fxp0 phy 1: i82555 10/100 media interface, rev. 4
        inphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
        wm0 at pci4 dev 5 function 0: Intel i82540EM 1000BASE-T Ethernet, rev. 2
        wm0: interrupting at ioapic0 pin 23 (irq 10)
        wm0: 32-bit 33MHz PCI bus
        wm0: 64 word (6 address bits) MicroWire EEPROM
        wm0: Ethernet address 00:0e:0c:4d:ec:11
        makphy0 at wm0 phy 1: Marvell 88E1011 Gigabit PHY, rev. 3
        makphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 
1000baseT-FDX, auto
        pcib0 at pci0 dev 31 function 0
        pcib0: Intel 82801CA LPC Interface (rev. 0x02)
        piixide0 at pci0 dev 31 function 1
        piixide0: Intel 82801CA IDE Controller (ICH3) (rev. 0x02)
        piixide0: bus-master DMA support present
        piixide0: primary channel configured to compatibility mode
        piixide0: primary channel interrupting at ioapic0 pin 14 (irq 14)
        atabus0 at piixide0 channel 0
        piixide0: secondary channel configured to compatibility mode
        piixide0: secondary channel interrupting at ioapic0 pin 15 (irq 15)
        atabus1 at piixide0 channel 1
        Intel 82801CA SMBus Controller (SMBus serial bus, revision 0x02) at 
pci0 dev 31 function 3 not configured
        isa0 at pcib0
        lpt0 at isa0 port 0x378-0x37b irq 7
        com0 at isa0 port 0x3f8-0x3ff irq 4: ns16550a, working fifo
        com1 at isa0 port 0x2f8-0x2ff irq 3: ns16550a, working fifo
        isapnp0 at isa0 port 0x279: ISA Plug 'n Play device support
        fdc0 at isa0 port 0x3f0-0x3f7 irq 6 drq 2
        isapnp0: no ISA Plug 'n Play devices found
        ioapic0: enabling
        ioapic1: enabling
        ioapic2: enabling
        timecounter: Timecounter "clockinterrupt" frequency 100 Hz quality 0
        fd0 at fdc0 drive 0: 1.44MB, 80 cyl, 2 head, 18 sec
        Kernelized RAIDframe activated
        scsibus0: waiting 2 seconds for devices to settle...
        wd0 at atabus0 drive 0: <WDC WD400JB-00JJA0>
        wd0: drive supports 16-sector PIO transfers, LBA addressing
        wd0: 38166 MB, 77545 cyl, 16 head, 63 sec, 512 bytes/sect x 78165360 
sectors
        wd0: 32-bit data port
        wd0: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 5 (Ultra/100)
        wd1 at atabus0 drive 1: <FUJITSU MPD3043AT>
        wd1: drive supports 16-sector PIO transfers, LBA addressing
        wd1: 4125 MB, 8940 cyl, 15 head, 63 sec, 512 bytes/sect x 8448300 
sectors
        wd1: 32-bit data port
        wd1: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 4 (Ultra/66)
        wd0(piixide0:0:0): using PIO mode 4, Ultra-DMA mode 5 (Ultra/100) 
(using DMA)
        wd1(piixide0:0:1): using PIO mode 4, Ultra-DMA mode 2 (Ultra/33) (using 
DMA)
        st0 at scsibus0 target 5 lun 0: <QUANTUM, SDLT600, 1E1E> tape removable
        st0: drive empty
        st0: sync (50.00ns offset 8), 16-bit (40.000MB/s) transfers
        boot device: wd1
        root on wd1a dumps on wd1b
        root file system type: ffs
        wsdisplay0: screen 1 added (80x25, vt100 emulation)
        wsdisplay0: screen 2 added (80x25, vt100 emulation)
        wsdisplay0: screen 3 added (80x25, vt100 emulation)
        wsdisplay0: screen 4 added (80x25, vt100 emulation)
>How-To-Repeat:
        While I initially noticed the problem with other tools, I did
        some tests with a fresh tape and very simple tools.  I created
        a ten-megabyte file full of distinctive data (given any
        512-byte chunk, I could tell exactly where in the original file
        it came from).  I wrote this to the tape with

        # dd bs=10240 if=file of=/dev/nrst0

        which (reassuringly) reported 1024 records written.  I then
        rewound the tape and checked with

        # dd bs=10240 if=/dev/nrst0 | cmp - file

        to verify that it was in fact on the tape; except for dd's
        report, this produced no output.  Then I rewound again and
        tried

        # sh -c 'exec 3</dev/nrst0; dd bs=1048577 count=1 0<&3; dd bs=10240 
0<&3' > file2

        which is basically a command-line version of what I saw when I
        ktraced the program that produced the odd behaviour that put me
        onto this.  This said

        0+1 records in
        0+1 records out
        10240 bytes transferred in 0.063 secs (162539 bytes/sec)
        1008+0 records in
        1008+0 records out
        10321920 bytes transferred in 0.482 secs (21414771 bytes/sec)

        which is what I'd expect, except that the 1008 should be 1023
        (and the total byte count, of course, should match).  Looking
        at the resulting file2, the first 20 blocks - the first 10K -
        are exactly what they should be, but the next block is the one
        from block offset 320, not 20, and it goes on from there;
        something lost the next 15 tape records, as if the first dd had
        actually read 16 10240-byte records, returned the first, and
        thrown away the other 15.

        Is the bug here with 4.0 or with my expectations?  Certainly my
        historical experience has been that reading a tape record with
        a buffer far bigger than the record just reads the record into
        the beginning of the buffer, not reads the record into the
        beginning of the buffer and then reads and discards some
        unobvious number of following records!  (16 records is 320K,
        less than a third of the 1048577-byte buffer size; I have no
        idea why that many.)

        My message to tech-kern appears to have drawn no response,
        tending to argue against stupid pilot-error mistakes as the
        cause.

        I'm classing this as tech-kern, because I suspect a kernel bug
        (the userland side of the above test is pretty much too simple
        to break); serious, because it means most likely no 4.0 on this
        machine until/unless it's fixed; medium, because it doesn't
        seem to affect a lot of people or it would have been noticed
        long ago.  Adjust if you think it should be otherwise.
>Fix:
        Unknown.  I had a quick look but nothing was obvious.

        I can test proposed fixes, but, depending on the time of day,
        it may take a little while; the machine is in live production
        and only practical window for test boots is weekday afternoons
        (North American Eastern timezone, ie, roughly 16:00-22:00 UTC).

/~\ The ASCII                           der Mouse
\ / Ribbon Campaign
 X  Against HTML               mouse%rodents.montreal.qc.ca@localhost
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B



Home | Main Index | Thread Index | Old Index