Current-Users archive


Mysterious bit errors on healthy media in i386/4.99.72



I am trying to run -current off of a USB stick on my eeepc900, in order to build some confidence in it before wiping the SSD, and I am seeing weird behaviour that I can't explain:

After a kernel panic[0] in the middle of a bulk-install of large amounts of bloat[1], I thought the state of my pkg tree seemed a bit too messed up for having been caused by that panic alone, so I pkg_admin check'd my entire tree -- and it complained about a whole lot of MD5 mismatches.

I decided to wipe the entire pkg tree and simply copy it from the build host over NFS, /usr/pkg and /var/db/pkg* and all. While this induced another kernel panic, I got far enough to verify my suspicion: I get a LOT of mysterious bit errors on the usb stick when I install the packages (or just cp -pR them) over NFS. A binary diff shows that the files indeed differ from those on the build host. The build box was spewing out a few (from memory) "ex0: transmit underrun (20) @9000" messages during the copy, but the copy succeeded, and with all the checksumming going on in the network layers it seems unlikely the bit errors would have come from there.
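For anyone wanting to repeat the tree comparison, here is a minimal Python sketch of one way to do it. It is not the command that was actually run (that was a plain binary diff); the function names `file_digest` and `mismatched_files` and the example paths are made up for illustration. It walks a reference tree and reports every regular file whose SHA-256 digest differs from, or is missing in, the copy.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def mismatched_files(reference_root, copy_root):
    """Yield paths (relative to reference_root) that differ or are missing in copy_root."""
    for dirpath, _dirnames, filenames in os.walk(reference_root):
        for name in filenames:
            ref_path = os.path.join(dirpath, name)
            # Only compare regular files; symlinks and special files are skipped.
            if os.path.islink(ref_path) or not os.path.isfile(ref_path):
                continue
            rel = os.path.relpath(ref_path, reference_root)
            copy_path = os.path.join(copy_root, rel)
            if not os.path.isfile(copy_path):
                yield rel
            elif file_digest(ref_path) != file_digest(copy_path):
                yield rel


if __name__ == "__main__":
    # Hypothetical mount points: the NFS-mounted build host tree and the local copy.
    for rel in mismatched_files("/mnt/buildhost/usr/pkg", "/usr/pkg"):
        print(rel)
```

One caveat with any check of this kind: files read back right after being written may be served from the buffer cache rather than from the stick, so unmounting and remounting the file system before comparing gives a more honest result.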

Seeing as these USB sticks are managed NAND that *should* report errors, I tried dd:ing the stick full of zeroes on an OS where I know umass itself works[2], dd:ing it back and comparing it to an image of zero bits only -- no problems. I couldn't reproduce the bit errors or cause any IO failures in OtherOS(tm) at all. However, the usb stick *seemed* to work fine with the eeepc's usb controller.
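The zero-fill test above can also be checked without keeping a second all-zero image around. The following is a rough Python sketch under that assumption, not the dd-and-compare procedure that was actually used; `first_nonzero_offsets` is a hypothetical helper name, and the device path in the usage example is only a placeholder. It reads a device or image back and reports the first few bytes that are not zero.

```python
def first_nonzero_offsets(path, limit=10, chunk_size=1 << 20):
    """Return up to `limit` (offset, byte_value) pairs for nonzero bytes in `path`."""
    found = []
    offset = 0
    with open(path, "rb") as f:
        while len(found) < limit:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            # Fast path: skip the per-byte scan when the whole chunk is zero.
            if chunk.count(0) != len(chunk):
                for i, value in enumerate(chunk):
                    if value:
                        found.append((offset + i, value))
                        if len(found) == limit:
                            break
            offset += len(chunk)
    return found


if __name__ == "__main__":
    # Placeholder path; substitute the raw device or a dd image of the stick.
    for off, value in first_nonzero_offsets("/path/to/stick.img"):
        print(f"offset {off:#x}: {value:#04x}")
```

An empty result means every byte read back as zero. A value with a single bit set (0x01, 0x02, 0x04, ...) would point at the same kind of single-bit corruption described here.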

Last night I tried to re-install the entire system onto the usb stick on the build host instead (which has an Intel usb controller that seemed to work as well) just to see what would happen if I eliminated the network from the equation, but that failed early on with infinite loops of reset failure after stalled bulk transfers (this rings a bell) -- so I guess umass is fucked there too, after all. :-/ I never saw these failures on the eeepc, however.

What's bugging me is that I am not seeing random mayhem, completely broken files, wildly inconsistent meta-data or any of the other epic failures that I would expect from a completely broken umass or a sneaky -current file system bug. I only get small bit errors. I diffed a hexdump of /usr/pkg/bin/irssi -- it had a single bit error in the middle of it. And these bit errors don't seem reproducible in OtherOS. The eeepc is brand spanking new and doesn't exhibit stability problems in Linux, so while it can't be ruled out until tested, I would be surprised if it has faulty RAM.
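To put numbers on "small bit errors", the hexdump comparison can be automated. The sketch below is one possible way to do that in Python, not the method used for the irssi comparison above; `bit_differences` and `count_flipped_bits` are hypothetical names and the paths in the usage example are placeholders. It XORs two files byte by byte, so each reported mask shows exactly which bits flipped at which offset.

```python
def bit_differences(path_a, path_b, chunk_size=1 << 16):
    """Return (byte_offset, xor_mask) for each differing byte.

    Only the common prefix is compared if the files differ in length.
    """
    diffs = []
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if not a or not b:
                break
            n = min(len(a), len(b))
            if a[:n] != b[:n]:
                for i in range(n):
                    mask = a[i] ^ b[i]  # set bits mark the flipped positions
                    if mask:
                        diffs.append((offset + i, mask))
            offset += n
    return diffs


def count_flipped_bits(diffs):
    """Total number of differing bits across all reported bytes."""
    return sum(bin(mask).count("1") for _offset, mask in diffs)


if __name__ == "__main__":
    # Placeholder paths: the build host's copy and the copy on the stick.
    diffs = bit_differences("/mnt/buildhost/usr/pkg/bin/irssi", "/usr/pkg/bin/irssi")
    for off, mask in diffs:
        print(f"offset {off:#x}: xor mask {mask:#04x}")
    print(count_flipped_bits(diffs), "bit(s) differ")
```

If the offsets or the flipped bit positions show a pattern across many files (always the same bit, or offsets at a fixed stride), that might help narrow down whether the corruption comes from RAM, the controller, or the transfer path.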

Can anyone think of anything in -current that can cause small bit errors like this? Has anyone experienced similar problems? The USB stick is a 4GB Kingston DataTraveler 2.0, if that means anything...

[0] Actually this whole process repeated twice
[1] aka GNOME ;o)
[2] I have never seen usb2 umass work on any NetBSD machine, so I don't
    really trust it even when it seemed to work.

Best regards,
ali:)



