NetBSD-Bugs archive
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]
kern/38643: [dM] st tape drive loses data
>Number: 38643
>Category: kern
>Synopsis: [dM] st tape drive loses data
>Confidential: no
>Severity: critical
>Priority: medium
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Mon May 12 20:40:02 +0000 2008
>Originator: der Mouse
>Release: NetBSD 4.0
>Organization:
Dis-
>Environment:
System: not booted at the moment; see below
Architecture: i386
Machine: i386
>Description:
At work, we have a machine that's been running NetBSD/i386 3.0
for some time, and it's been working fine. One of the things
it does is run amanda, dumping to "virtual tapes" (files on
disk) on a RAIDed filesystem (firmware RAID, not RAIDframe - a
3ware). Another thing it does is to dump some of the resulting
files onto real tapes.
Recently, I put a 4.0 kernel and userland on it. Last night,
it did a dump-to-tape run, which failed oddly. On
investigating, I find there's a very peculiar problem with the
tape drive, which I can only ascribe to the switch to 4.0. (It
showed up when testing manually, and disappeared upon switching
back to the 3.0 kernel.)
Specifically, on a tape written with 10KB records, an attempt
to read 1048577 bytes reads one (10KB) record, returns it to
userland, and throws away the next 15 records. (Attempts to
read the tape with blocksize 10240 work fine; the data actually
are on the tape.)
The tape drive is
ahc1 at pci4 dev 2 function 0: Adaptec 2940 Pro Ultra SCSI adapter
ahc1: interrupting at ioapic0 pin 19 (irq 5)
ahc1: aic7880: Ultra Wide Channel A, SCSI Id=7, 16/253 SCBs
scsibus0 at ahc1: 16 targets, 8 luns per target
...
st0 at scsibus0 target 5 lun 0: <QUANTUM, SDLT600, 1E1E> tape removable
st0: drive empty
st0: sync (50.00ns offset 8), 16-bit (40.000MB/s) transfers
and it works fine as far as I can tell under 3.0 (in particular
it does not exhibit this misbehaviour). Full dmesg included
below.
The kernel is the 4.0 GENERIC kernel,
builds@wb34:/home/builds/ab/netbsd-4-0-RELEASE/i386/200712160005Z-obj/home/builds/ab/netbsd-4-0-RELEASE/src/sys/arch/i386/compile/GENERIC
MD5 f201f3213ba5886a9b6f05a5492c6172.
See how-to-repeat - or my mail to tech-kern,
Message-Id:
<200804292303.TAA20392%Sparkle.Rodents.Montreal.QC.CA@localhost>,
for more.
Here's a full dmesg from a 4.0 boot on the hardware in
question:
Copyright (c) 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004,
2005,
2006, 2007
The NetBSD Foundation, Inc. All rights reserved.
Copyright (c) 1982, 1986, 1989, 1991, 1993
The Regents of the University of California. All rights reserved.
NetBSD 4.0 (GENERIC) #0: Sun Dec 16 00:20:10 PST 2007
builds@wb34:/home/builds/ab/netbsd-4-0-RELEASE/i386/200712160005Z-obj/home/builds/ab/netbsd-4-0-RELEASE/src/sys/arch/i386/compile/GENERIC
total memory = 1023 MB
rbus: rbus_min_start set to 0x40000000
avail memory = 995 MB
timecounter: Timecounters tick every 10.000 msec
timecounter: Timecounter "i8254" frequency 1193182 Hz quality 100
BIOS32 rev. 0 found at 0xfd6e0
mainbus0 (root)
cpu0 at mainbus0: apid 0 (boot processor)
cpu0: Intel Xeon (686-class), 2790.83 MHz, id 0xf25
cpu0: features
bfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR>
cpu0: features bfebfbff<PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX>
cpu0: features bfebfbff<FXSR,SSE,SSE2,SS,HTT,TM,SBF>
cpu0: features2 4400<CID,xTPR>
cpu0: "Intel(R) Xeon(TM) CPU 2.80GHz"
cpu0: I-cache 12K uOp cache 8-way, D-cache 8 KB 64B/line 4-way
cpu0: L2 cache 512 KB 64B/line 8-way
cpu0: ITLB 4K/4M: 64 entries
cpu0: DTLB 4K/4M: 64 entries
cpu0: enabling thermal monitor 1 ... enabled.
cpu0: calibrating local timer
cpu0: apic clock running at 132 MHz
cpu0: 16 page colors
cpu1 at mainbus0: apid 1 (application processor)
cpu1: not started
ioapic0 at mainbus0 apid 2 (I/O APIC)
ioapic0: pa 0xfec00000, version 20, 24 pins
ioapic1 at mainbus0 apid 3 (I/O APIC)
ioapic1: pa 0xfec80000, version 20, 24 pins
ioapic2 at mainbus0 apid 4 (I/O APIC)
ioapic2: pa 0xfec80400, version 20, 24 pins
acpi0 at mainbus0: Advanced Configuration and Power Interface
acpi0: using Intel ACPI CA subsystem version 20060217
acpi0: X/RSDT: OemId <PTLTD , RSDT ,06040000>, AslId < LTP,00000000>
acpi0: SCI interrupting at int 9
acpi0: fixed-feature power button present
timecounter: Timecounter "ACPI-Fast" frequency 3579545 Hz quality 1000
ACPI-Fast 24-bit timer
ACPI Object Type 'Processor' (0x0c) at acpi0 not configured
ACPI Object Type 'Processor' (0x0c) at acpi0 not configured
ACPI Object Type 'Processor' (0x0c) at acpi0 not configured
ACPI Object Type 'Processor' (0x0c) at acpi0 not configured
ACPI Object Type 'Processor' (0x0c) at acpi0 not configured
ACPI Object Type 'Processor' (0x0c) at acpi0 not configured
ACPI Object Type 'Processor' (0x0c) at acpi0 not configured
ACPI Object Type 'Processor' (0x0c) at acpi0 not configured
PNP0A03 [PCI/PCI-X Host Bridge] at acpi0 not configured
PNP0C0F [PCI interrupt link device] at acpi0 not configured
PNP0C0F [PCI interrupt link device] at acpi0 not configured
PNP0C0F [PCI interrupt link device] at acpi0 not configured
PNP0C0F [PCI interrupt link device] at acpi0 not configured
PNP0C0F [PCI interrupt link device] at acpi0 not configured
PNP0C0F [PCI interrupt link device] at acpi0 not configured
INT0800 [Intel FWH Random Number Generator] at acpi0 not configured
PNP0C02 [Plug and Play motherboard register resources] at acpi0 not
configured
PNP0200 [AT DMA Controller] at acpi0 not configured
npx1 at acpi0 (PNP0C04)
npx1: io 0xf0-0xfe irq 13
npx1: reported by CPUID; using exception 16
PNP0000 [AT Interrupt Controller] at acpi0 not configured
PNP0B00 [AT Real-Time Clock] at acpi0 not configured
pcppi1 at acpi0 (PNP0800)
pcppi1: io 0x61
pcppi1: children must have an explicit unit
midi0 at pcppi1: PC speaker (CPU-intensive output)
sysbeep0 at pcppi1
attimer1 at acpi0 (PNP0100): AT Timer
attimer1: io 0x40-0x43 irq 0
PNP0A06 [Generic Container Device] at acpi0 not configured
pckbc1 at acpi0 (PNP0303): kbd port
pckbc1: io 0x60,0x64 irq 1
PNP0700 [PC standard floppy disk controller] at acpi0 not configured
PNP0501 [16550A-compatible COM port] at acpi0 not configured
PNP0501 [16550A-compatible COM port] at acpi0 not configured
PNP0401 [ECP printer port] at acpi0 not configured
acpibut0 at acpi0 (PNP0C0C): ACPI Power Button
pcppi1: attached to attimer1
pckbd0 at pckbc1 (kbd slot)
pckbc1: using irq 1 for kbd slot
wskbd0 at pckbd0: console keyboard
pci0 at mainbus0 bus 0: configuration mode 1
pci0: i/o space, memory space enabled, rd/line, rd/mult, wr/inv ok
pchb0 at pci0 dev 0 function 0
pchb0: Intel E7501 MCH Host (rev. 0x01)
Intel E7500 MCH DRAM Controller (undefined subclass 0x00, revision
0x01) at pci0 dev 0 function 1 not configured
ppb0 at pci0 dev 2 function 0: Intel E7500 MCH HI_B vppb 1 (rev. 0x01)
pci1 at ppb0 bus 1
pci1: i/o space, memory space enabled
Intel 82870P2 P64H2 IOxAPIC (interrupt system, interface 0x20, revision
0x04) at pci1 dev 28 function 0 not configured
ppb1 at pci1 dev 29 function 0: Intel 82870P2 P64H2 PCI-PCI Bridge
(rev. 0x04)
pci2 at ppb1 bus 2
pci2: i/o space, memory space enabled
Intel 82870P2 P64H2 IOxAPIC (interrupt system, interface 0x20, revision
0x04) at pci1 dev 30 function 0 not configured
ppb2 at pci1 dev 31 function 0: Intel 82870P2 P64H2 PCI-PCI Bridge
(rev. 0x04)
pci3 at ppb2 bus 3
pci3: i/o space, memory space enabled
twe0 at pci3 dev 1 function 0: 3ware Escalade
twe0: interrupting at ioapic2 pin 0 (irq 10)
twe0: 12 ports, Firmware FE7S 1.05.00.065, BIOS BE7X 1.08.00.048
twe0: Monitor ME7X 1.01.00.038, PCB Rev5 , Achip 3.20 , Pchip
1.30-66
twe0: port 0: ST3300831AS 286168 MB
twe0: port 1: ST3300831AS 286168 MB
twe0: port 2: ST3300831AS 286168 MB
twe0: port 3: ST3300831AS 286168 MB
twe0: port 4: ST3300831AS 286168 MB
twe0: port 5: ST3300831AS 286168 MB
twe0: port 6: ST3300831AS 286168 MB
twe0: port 7: ST3300831AS 286168 MB
twe0: port 8: ST3300831AS 286168 MB
twe0: port 9: ST3300831AS 286168 MB
twe0: port 10: ST3300831AS 286168 MB
twe0: port 11: ST3300831AS 286168 MB
ld0 at twe0 unit 0: 64K stripe RAID5, status: Normal
ld0: 1956 GB, 255368 cyl, 255 head, 63 sec, 512 bytes/sect x 4102491904
sectors
ld1 at twe0 unit 2: 1024K stripe RAID0, status: Normal
ld1: 838 GB, 109443 cyl, 255 head, 63 sec, 512 bytes/sect x 1758210048
sectors
ld2 at twe0 unit 9: JBOD, status: Normal
ld2: 279 GB, 36481 cyl, 255 head, 63 sec, 512 bytes/sect x 586072368
sectors
Intel E7500 MCH HI_B vppb 2 (undefined subclass 0x00, revision 0x01) at
pci0 dev 2 function 1 not configured
uhci0 at pci0 dev 29 function 0: Intel 82801CA USB Controller (rev.
0x02)
uhci0: interrupting at ioapic0 pin 16 (irq 10)
usb0 at uhci0: USB revision 1.0
uhub0 at usb0
uhub0: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub0: 2 ports with 2 removable, self powered
uhci1 at pci0 dev 29 function 1: Intel 82801CA USB Controller (rev.
0x02)
uhci1: interrupting at ioapic0 pin 19 (irq 5)
usb1 at uhci1: USB revision 1.0
uhub1 at usb1
uhub1: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub1: 2 ports with 2 removable, self powered
uhci2 at pci0 dev 29 function 2: Intel 82801CA USB Controller (rev.
0x02)
uhci2: interrupting at ioapic0 pin 18 (irq 10)
usb2 at uhci2: USB revision 1.0
uhub2 at usb2
uhub2: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub2: 2 ports with 2 removable, self powered
ppb3 at pci0 dev 30 function 0: Intel 82801BA Hub-PCI Bridge (rev. 0x42)
pci4 at ppb3 bus 4
pci4: i/o space, memory space enabled
ahc1 at pci4 dev 2 function 0: Adaptec 2940 Pro Ultra SCSI adapter
ahc1: interrupting at ioapic0 pin 19 (irq 5)
ahc1: aic7880: Ultra Wide Channel A, SCSI Id=7, 16/253 SCBs
scsibus0 at ahc1: 16 targets, 8 luns per target
vga1 at pci4 dev 3 function 0: ATI Technologies Rage XL (rev. 0x27)
wsdisplay0 at vga1 kbdmux 1: console (80x25, vt100 emulation), using
wskbd0
wsmux1: connecting to wsdisplay0
fxp0 at pci4 dev 4 function 0: i82550 Ethernet, rev 13
fxp0: interrupting at ioapic0 pin 20 (irq 11)
fxp0: Ethernet address 00:0e:0c:4d:ec:5b
inphy0 at fxp0 phy 1: i82555 10/100 media interface, rev. 4
inphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
wm0 at pci4 dev 5 function 0: Intel i82540EM 1000BASE-T Ethernet, rev. 2
wm0: interrupting at ioapic0 pin 23 (irq 10)
wm0: 32-bit 33MHz PCI bus
wm0: 64 word (6 address bits) MicroWire EEPROM
wm0: Ethernet address 00:0e:0c:4d:ec:11
makphy0 at wm0 phy 1: Marvell 88E1011 Gigabit PHY, rev. 3
makphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT,
1000baseT-FDX, auto
pcib0 at pci0 dev 31 function 0
pcib0: Intel 82801CA LPC Interface (rev. 0x02)
piixide0 at pci0 dev 31 function 1
piixide0: Intel 82801CA IDE Controller (ICH3) (rev. 0x02)
piixide0: bus-master DMA support present
piixide0: primary channel configured to compatibility mode
piixide0: primary channel interrupting at ioapic0 pin 14 (irq 14)
atabus0 at piixide0 channel 0
piixide0: secondary channel configured to compatibility mode
piixide0: secondary channel interrupting at ioapic0 pin 15 (irq 15)
atabus1 at piixide0 channel 1
Intel 82801CA SMBus Controller (SMBus serial bus, revision 0x02) at
pci0 dev 31 function 3 not configured
isa0 at pcib0
lpt0 at isa0 port 0x378-0x37b irq 7
com0 at isa0 port 0x3f8-0x3ff irq 4: ns16550a, working fifo
com1 at isa0 port 0x2f8-0x2ff irq 3: ns16550a, working fifo
isapnp0 at isa0 port 0x279: ISA Plug 'n Play device support
fdc0 at isa0 port 0x3f0-0x3f7 irq 6 drq 2
isapnp0: no ISA Plug 'n Play devices found
ioapic0: enabling
ioapic1: enabling
ioapic2: enabling
timecounter: Timecounter "clockinterrupt" frequency 100 Hz quality 0
fd0 at fdc0 drive 0: 1.44MB, 80 cyl, 2 head, 18 sec
Kernelized RAIDframe activated
scsibus0: waiting 2 seconds for devices to settle...
wd0 at atabus0 drive 0: <WDC WD400JB-00JJA0>
wd0: drive supports 16-sector PIO transfers, LBA addressing
wd0: 38166 MB, 77545 cyl, 16 head, 63 sec, 512 bytes/sect x 78165360
sectors
wd0: 32-bit data port
wd0: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 5 (Ultra/100)
wd1 at atabus0 drive 1: <FUJITSU MPD3043AT>
wd1: drive supports 16-sector PIO transfers, LBA addressing
wd1: 4125 MB, 8940 cyl, 15 head, 63 sec, 512 bytes/sect x 8448300
sectors
wd1: 32-bit data port
wd1: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 4 (Ultra/66)
wd0(piixide0:0:0): using PIO mode 4, Ultra-DMA mode 5 (Ultra/100)
(using DMA)
wd1(piixide0:0:1): using PIO mode 4, Ultra-DMA mode 2 (Ultra/33) (using
DMA)
st0 at scsibus0 target 5 lun 0: <QUANTUM, SDLT600, 1E1E> tape removable
st0: drive empty
st0: sync (50.00ns offset 8), 16-bit (40.000MB/s) transfers
boot device: wd1
root on wd1a dumps on wd1b
root file system type: ffs
wsdisplay0: screen 1 added (80x25, vt100 emulation)
wsdisplay0: screen 2 added (80x25, vt100 emulation)
wsdisplay0: screen 3 added (80x25, vt100 emulation)
wsdisplay0: screen 4 added (80x25, vt100 emulation)
>How-To-Repeat:
While I initially noticed the problem with other tools, I did
some tests with a fresh tape and very simple tools. I created
a ten-megabyte file full of distinctive data (given any
512-byte chunk, I could tell exactly where in the original file
it came from). I wrote this to the tape with
# dd bs=10240 if=file of=/dev/nrst0
which (reassuringly) reported 1024 records written. I then
rewound the tape and checked with
# dd bs=10240 if=/dev/nrst0 | cmp - file
to verify that it was in fact on the tape; except for dd's
report, this produced no output. Then I rewound again and
tried
# sh -c 'exec 3</dev/nrst0; dd bs=1048577 count=1 0<&3; dd bs=10240
0<&3' > file2
which is basically a command-line version of what I saw when I
ktraced the program that produced the odd behaviour that put me
onto this. This said
0+1 records in
0+1 records out
10240 bytes transferred in 0.063 secs (162539 bytes/sec)
1008+0 records in
1008+0 records out
10321920 bytes transferred in 0.482 secs (21414771 bytes/sec)
which is what I'd expect, except that the 1008 should be 1023
(and the total byte count, of course, should match). Looking
at the resulting file2, the first 20 blocks - the first 10K -
are exactly what they should be, but the next block is the one
from block offset 320, not 20, and it goes on from there;
something lost the next 15 tape records, as if the first dd had
actually read 16 10240-byte records, returned the first, and
thrown away the other 15.
Is the bug here with 4.0 or with my expectations? Certainly my
historical experience has been that reading a tape record with
a buffer far bigger than the record just reads the record into
the beginning of the buffer, not reads the record into the
beginning of the buffer and then reads and discards some
unobvious number of following records! (16 records is 320K,
less than a third of the 1048577-byte buffer size; I have no
idea why that many.)
My message to tech-kern appears to have drawn no response,
tending to argue against stupid pilot-error mistakes as the
cause.
I'm classing this as tech-kern, because I suspect a kernel bug
(the userland side of the above test is pretty much too simple
to break); serious, because it means most likely no 4.0 on this
machine until/unless it's fixed; medium, because it doesn't
seem to affect a lot of people or it would have been noticed
long ago. Adjust if you think it should be otherwise.
>Fix:
Unknown. I had a quick look but nothing was obvious.
I can test proposed fixes, but, depending on the time of day,
it may take a little while; the machine is in live production
and only practical window for test boots is weekday afternoons
(North American Eastern timezone, ie, roughly 16:00-22:00 UTC).
/~\ The ASCII der Mouse
\ / Ribbon Campaign
X Against HTML mouse%rodents.montreal.qc.ca@localhost
/ \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Home |
Main Index |
Thread Index |
Old Index