NetBSD-Users archive


file corruption problems (!) in a xen domU backed by a zvol



I'm not sure which part of my "stack" is the culprit, but here's the setup where I noticed the problem just now:


The host has a NetBSD 10.1 kernel and a 9.0 userland that's in the process of being updated to 10.1 - unpacking `base.tar.xz` is where I first noticed problems. Here's a quick summary of the problem in example form:

ansible:riz  ~/sets> md5 xfont.tar.xz
MD5 (xfont.tar.xz) = f044efd355a3a8fbee8988200aa526d5
ansible:riz  ~/sets> md5 xfont.tar.xz
MD5 (xfont.tar.xz) = 07e28b41bb982b2b0a1e0b731599a246
ansible:riz  ~/sets> md5 xfont.tar.xz
MD5 (xfont.tar.xz) = b0d14f58745706db4dcf4a30d5b75175
ansible:riz  ~/sets>


...for a file which should most definitely NOT be changing.
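One way to narrow this down might be to hash the file in fixed-size blocks over several reads and see which offsets are unstable. The following is only a minimal sketch: the 1 MiB block size, the number of passes, and the function names `block_digests` and `unstable_blocks` are assumptions chosen for illustration, and repeated reads may be served from a cache rather than from the disk.

```python
import hashlib
import sys

BLOCK_SIZE = 1 << 20  # 1 MiB; an arbitrary choice for this sketch


def block_digests(path, block_size=BLOCK_SIZE):
    """Return one MD5 hex digest per fixed-size block of the file."""
    digests = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            digests.append(hashlib.md5(chunk).hexdigest())
    return digests


def unstable_blocks(path, passes=3, block_size=BLOCK_SIZE):
    """Read the file `passes` times and return the sorted indices of
    blocks whose digest differed from the first read."""
    baseline = block_digests(path, block_size)
    changed = set()
    for _ in range(passes - 1):
        current = block_digests(path, block_size)
        # Compare block by block; a change in block count is reported too.
        for i in range(max(len(baseline), len(current))):
            a = baseline[i] if i < len(baseline) else None
            b = current[i] if i < len(current) else None
            if a != b:
                changed.add(i)
    return sorted(changed)


if __name__ == "__main__":
    if len(sys.argv) > 1:
        for idx in unstable_blocks(sys.argv[1]):
            offset = idx * BLOCK_SIZE
            print(f"block {idx} (byte offset {offset}) differs between reads")
```

Run as, say, `python3 blockcheck.py xfont.tar.xz` (the script name is hypothetical). If the same few blocks differ on every run, that would point at specific regions of the virtual disk; if the unstable blocks move around, the problem is more likely somewhere in the read path.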


The host ("ansible") is a Xen domU running in PVH mode on a NetBSD-10.1 dom0; the virtual disk is backed by a ZFS zvol.  I made a ZFS snapshot of the zvol just before starting the upgrade.  One of the two disks in the zpool is reporting soft errors (I just noticed these!), but so far the zpool itself is not showing any errors (I started a scrub about 10 minutes ago to see if that catches anything):

xenserver1:riz  ~> sudo zpool status
  pool: tank
 state: ONLINE
  scan: scrub in progress since Sun Nov  9 17:54:02 2025
        54.5G scanned out of 106G at 70.8M/s, 0h12m to go
        0 repaired, 51.23% done
config:

    NAME                  STATE     READ WRITE CKSUM
    tank                  ONLINE       0     0     0
      wedges/zfs-xs1-wd2  ONLINE       0     0     0
      wedges/zfs-xs1-wd3  ONLINE       0     0     0

errors: No known data errors


This has me really freaked out, because while I have a backup of this particular virtual host, having bits arbitrarily change under a VM is pretty freaky.  The VM doesn't show anything unusual in dmesg, but the dom0 does show some errors which currently seem to be getting corrected:

Nov  9 18:06:50 xenserver1 /netbsd: [ 1250512.6259872] wd3d: channel reset reading fsbn 4084303096 of 4084303096-4084303223 (wd3 bn 4084303096; cn 4051887 tn 15 sn 55), xfer 220, retry 0
Nov  9 18:06:50 xenserver1 /netbsd: [ 1250512.6259872] wd3d: channel reset reading fsbn 4084303224 of 4084303224-4084303351 (wd3 bn 4084303224; cn 4051888 tn 1 sn 57), xfer 2b8, retry 0
Nov  9 18:06:51 xenserver1 /netbsd: [ 1250513.1459781] wd3: soft error (corrected) xfer 58
Nov  9 18:06:51 xenserver1 /netbsd: [ 1250513.1459781] wd3: soft error (corrected) xfer f0
Nov  9 18:06:51 xenserver1 /netbsd: [ 1250513.1459781] wd3: soft error (corrected) xfer 220
Nov  9 18:06:51 xenserver1 /netbsd: [ 1250513.1459781] wd3: soft error (corrected) xfer 2b8


I would love to track this down - anyone have a next step suggestion for figuring it out?


+j




