Current-Users archive
Mysterious bit errors on healthy media in i386/4.99.72
I am trying to run -current off of a USB stick on my eeepc900, in order
to build some confidence in it before wiping the SSD, and I am seeing
weird behaviour that I can't explain:
After a kernel panic[0] in the middle of a bulk-install of large amounts
of bloat[1], I thought the state of my pkg tree seemed a bit too messed up
for having been caused by that panic alone, so I pkg_admin check'd my
entire tree -- and it complained about a whole lot of MD5 mismatches.
I decided to wipe the entire pkg tree, and simply copy it from the build
host over NFS, /usr/pkg and /var/db/pkg* and all. While this also induced
another kernel panic, I got enough to verify my suspicion: I get a LOT of
mysterious bit errors on the usb stick when I install the packages (or
just cp -pR them) over NFS. A binary diff shows that the files indeed
differ from those on the build host. The build box was spewing out a
few (from memory) "ex0: transmit underrun (20) @9000" messages during the
copy, but the transfers succeeded, and with all the checksumming going on in
the network layers it seems unlikely that the bit errors came from there.
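
For anyone wanting to repeat the comparison described above, here is a minimal
sketch of checking a copied tree against the original. It is not what
pkg_admin does; the function names are made up for illustration, and it uses
SHA-256 rather than MD5 simply because that is always available in Python's
hashlib.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def mismatched_files(reference_root, copy_root):
    """List regular files under reference_root that are missing from,
    or differ in content from, the same relative path under copy_root."""
    bad = []
    for dirpath, _dirnames, filenames in os.walk(reference_root):
        for name in filenames:
            ref_path = os.path.join(dirpath, name)
            # Skip symlinks and special files; only file contents are compared.
            if os.path.islink(ref_path) or not os.path.isfile(ref_path):
                continue
            rel = os.path.relpath(ref_path, reference_root)
            copy_path = os.path.join(copy_root, rel)
            if not os.path.isfile(copy_path):
                bad.append((rel, "missing"))
            elif file_digest(ref_path) != file_digest(copy_path):
                bad.append((rel, "differs"))
    return sorted(bad)
```

Called with the NFS-mounted original as the first argument and the copy on the
stick as the second, it returns the relative paths that do not match.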
Seeing as these USB sticks are managed NAND that *should* report errors,
I tried dd:ing the stick full of zeroes on an OS where I know umass itself
works[2], dd:ing it back and comparing it to an image of zero bits only --
no problems. I couldn't reproduce the bit errors or cause any IO failures
in OtherOS(tm) at all. Then again, the usb stick *seemed* to work fine with
the eeepc's usb controller in -current, too.
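
A rough sketch of the fill-and-read-back test described above, for cases where
dd and cmp are not handy. The function names and the default fill byte are
illustrative assumptions, and a read-back done right after the write may be
served from the buffer cache instead of the media, so unplugging and
re-inserting the stick between the two steps is probably wise.

```python
import os


def fill_file(path, total_bytes, fill=0x00, block_size=1 << 20):
    """Write total_bytes copies of the fill byte to path and sync to disk."""
    block = bytes([fill]) * block_size
    written = 0
    with open(path, "wb") as f:
        while written < total_bytes:
            n = min(block_size, total_bytes - written)
            f.write(block[:n])
            written += n
        f.flush()
        os.fsync(f.fileno())


def find_bad_bytes(path, total_bytes, fill=0x00, block_size=1 << 20):
    """Read back the first total_bytes of path and return the offsets
    of every byte that is not equal to the fill byte."""
    bad_offsets = []
    offset = 0
    with open(path, "rb") as f:
        while offset < total_bytes:
            data = f.read(min(block_size, total_bytes - offset))
            if not data:
                break  # shorter than expected; stop at end of file
            # Only scan byte by byte when the block is not uniformly 'fill'.
            if data.count(fill) != len(data):
                for i, b in enumerate(data):
                    if b != fill:
                        bad_offsets.append(offset + i)
            offset += len(data)
    return bad_offsets
```

An empty list from find_bad_bytes means every byte read back matched the
pattern that was written.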
Last night I tried to re-install the entire system onto the usb stick on
the build host instead (which has an Intel usb controller that seemed to
work as well), just to see what happens if I eliminate the network from the
equation, but that failed early on with infinite loops of reset failures
after stalled bulk transfers (this rings a bell) -- so I guess umass is
fucked there too, after all. :-/ I never saw these failures on the eeepc,
however.
What's bugging me is that I am not seeing random mayhem, completely
broken files, wildly inconsistent meta-data or any of the other epic
failures that I would expect from completely broken umass or a sneaky
-current file system bug. I only get small bit errors. I diffed a hexdump
of /usr/pkg/bin/irssi -- it had a single bit error in the middle of it.
And these bit errors don't seem reproducible in OtherOS. The eeepc is
brand spanking new and doesn't exhibit stability problems in Linux, so
while it can't be ruled out until tested, I would be surprised if it has
faulty RAM.
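
To pin down exactly which bits differ between a good and a bad copy of a file,
something like the following sketch could replace eyeballing a hexdump diff.
The function name is hypothetical; it XORs the two files byte by byte and
reports (byte offset, bit number) pairs, with bit 0 being the least
significant bit. If the files differ in length, only the common prefix is
compared.

```python
def bit_differences(path_a, path_b, chunk_size=1 << 16):
    """Return a list of (byte_offset, bit_index) for every differing bit
    in the common prefix of two files. Bit 0 is the least significant."""
    diffs = []
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if not a or not b:
                break  # one of the files has ended
            if a != b:
                for i, (x, y) in enumerate(zip(a, b)):
                    delta = x ^ y  # set bits mark the positions that differ
                    bit = 0
                    while delta:
                        if delta & 1:
                            diffs.append((offset + i, bit))
                        delta >>= 1
                        bit += 1
            offset += min(len(a), len(b))
    return diffs
```

Whether the flipped bits cluster on one bit position or on particular offsets
might hint at where the corruption is introduced, though that is speculation
until there is data from more than one file.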
Can anyone think of anything in -current that can cause small bit errors
like this? Has anyone experienced similar problems? The USB stick is a
4GB Kingston DataTraveler 2.0, if that means anything...
[0] Actually this whole process repeated twice
[1] aka GNOME ;o)
[2] I have never seen usb2 umass work on any NetBSD machine, so I don't
really trust it even when it seemed to work.
Best regards,
ali:)