tech-kern archive


4.0: peculiar tape-drive issue

At work, we have a machine that's been running NetBSD/i386 3.0 for some
time, and it's been working fine.  One of the things it does is run
amanda, dumping to "virtual tapes" (files on disk) on a RAIDed
filesystem (firmware RAID, not RAIDframe - a 3ware).  Another thing it
does is to dump some of the resulting files onto real tapes.

Recently, I put a 4.0 kernel and userland on it.  Last night, it did a
dump-to-tape run, which failed oddly.  On investigating, I find there's
a very peculiar problem with the tape drive, which I can only ascribe
to the switch to 4.0.

The drive is

ahc1 at pci4 dev 2 function 0: Adaptec 2940 Pro Ultra SCSI adapter
ahc1: interrupting at ioapic0 pin 19 (irq 5)
ahc1: aic7880: Ultra Wide Channel A, SCSI Id=7, 16/253 SCBs
scsibus0 at ahc1: 16 targets, 8 luns per target
st0 at scsibus0 target 5 lun 0: <QUANTUM, SDLT600, 1E1E> tape removable
st0: drive empty
st0: sync (50.00ns offset 8), 16-bit (40.000MB/s) transfers

The kernel is the 4.0 GENERIC kernel,
MD5 f201f3213ba5886a9b6f05a5492c6172.

While I initially noticed the problem with other tools, I did some
tests with a fresh tape and very simple tools.  I created a
ten-megabyte file full of distinctive data (given any 512-byte chunk, I
could tell exactly where in the original file it came from).  I wrote
this to the tape with
# dd bs=10240 if=file of=/dev/nrst0
which (reassuringly) reported 1024 records written.  I then rewound the
tape and checked with
# dd bs=10240 if=/dev/nrst0 | cmp - file
to verify that it was in fact on the tape; except for dd's report, this
produced no output.  Then I rewound again and tried
# sh -c 'exec 3</dev/nrst0; dd bs=1048577 count=1 0<&3; dd bs=10240 0<&3' > 
which is basically a command-line version of what I saw when I ktraced
the program that produced the odd behaviour that put me onto this.
This said

0+1 records in
0+1 records out
10240 bytes transferred in 0.063 secs (162539 bytes/sec)
1008+0 records in
1008+0 records out
10321920 bytes transferred in 0.482 secs (21414771 bytes/sec)

which is what I'd expect, except that the 1008 should be 1023 (and the
total byte count, of course, should match).  Looking at the resulting
file2, the first 20 blocks - the first 10K - are exactly what they
should be, but the next block is the one from block offset 320, not 20;
something lost the next 15 tape records, as if the first dd had
actually read 16 10240-byte records, returned the first, and thrown
away the other 15.
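
For anyone wanting to reproduce this kind of check, here is a minimal sketch of one way to build a self-identifying test file and find the first discontinuity in a read-back copy. It is not the method actually used above: the label format (a block number padded out to 512 bytes), the temporary file names, and the use of awk are all assumptions made for illustration. The loss is faked on plain files with dd's skip=, since a regular file has no record boundaries to trip over.

```shell
#!/bin/sh
# Sketch only: labelled test data plus a gap finder (names and format are hypothetical).
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT

# 20480 blocks of 512 bytes = 10485760 bytes, i.e. 1024 records of 10240 bytes.
# Each block is one line: its block number, space-padded to 511 bytes, plus a newline.
awk 'BEGIN { for (i = 0; i < 20480; i++) printf "%-511d\n", i }' > "$dir/file"

# Imitate the symptom without a tape: keep record 0, drop records 1 through 15.
{
    dd if="$dir/file" bs=10240 count=1 2>/dev/null
    dd if="$dir/file" bs=10240 skip=16 2>/dev/null
} > "$dir/file2"

# Walk the copy block by block; print the first position whose label is wrong,
# followed by the label actually found there.
gap=$(awk '$1 != NR - 1 { print NR - 1, $1; exit }' "$dir/file2")
echo "first mismatch (position label): $gap"
```

With the loss faked this way the script reports position 20 holding label 320, which is the same pattern as in the file read back from the tape.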

Is the bug here with 4.0 or with my expectations?  Certainly my
historical experience has been that reading a tape record with a buffer
far bigger than the record just reads the record into the beginning of
the buffer, not reads the record into the beginning of the buffer and
then reads and discards some unobvious number of following records!
(16 records is 160K, less than a sixth of the 1048577-byte buffer size;
I have no idea why that many.)
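
The record accounting can be worked through from the dd reports alone. The sketch below uses only figures quoted in this message; the variable names are illustrative.

```shell
#!/bin/sh
# Arithmetic on the numbers reported by dd; nothing here touches the tape.
rec=10240          # tape record size in bytes (dd bs=10240)
blk=512            # chunk size used to identify data in the file
written=1024       # records reported by the write
got=$((1 + 1008))  # records returned by the two reads
missing=$((written - got))
consumed=$((missing + 1))           # records apparently consumed by the first read
consumed_bytes=$((consumed * rec))
echo "missing records: $missing"
echo "consumed: $consumed records = $consumed_bytes bytes = $((consumed_bytes / 1024))K"
echo "next data at 512-byte block offset: $((consumed_bytes / blk))"
echo "share of 1048577-byte buffer: about 1/$((1048577 / consumed_bytes))"
```

This gives 15 missing records, 16 records (163840 bytes) consumed by the first read, and a resumption point at 512-byte block 320, matching what shows up in the read-back file.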

Should I just send-pr, or is there some trivial fix I should install,
or what?  If I can't get this fixed (for us, I mean, not necessarily
in-tree) within a week or so, it'll have to be back to 3.0 for this
machine, I think.  I may have time to do some debugging on it if nobody
else comes up with anything, but probably not for at least the next
couple of days.

/~\ The ASCII                           der Mouse
\ / Ribbon Campaign
 X  Against HTML     
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B
