tech-kern archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

Re: Xen FreeBSD domU block I/O problem on -current only affects reads > 1024 bytes



With the gracious help of RVP <rvp%SDF.ORG@localhost> I have been able to identify
better what is actually going wrong with FreeBSD's access to NetBSD dom0
xbdback(4) storage.

It seems that in certain circumstances (e.g. in newfs and the test
program) whenever FreeBSD issues a read of more than 1024 bytes only the
first 1024 bytes are correct -- the rest of the bytes returned come from
somewhere else on the disk, which appears to be starting at six(6)
sectors after where they were supposed to have come from.  Note that
this corresponds to exactly 4096 bytes offset from the beginning of the
read.

I.e. it looks like reads over 1024 bytes in size result in something
like the following:

	lseek(fd, offset, SEEK_SET);
	read(fd, &buf[0], 1024);
	lseek(fd, offset+4096, SEEK_SET); // or: lseek(fd, 4096-1024, SEEK_CUR);
	read(fd, &buf[1024], N-1024);

Note that on FreeBSD hw.pagesize is 4096 and its blkfront device driver
seems to use PAGE_SIZE as for its units of request.

The mystery remains in why all is OK on NetBSD-8.99.32, but broken on
NetBSD-9.99.81.

So, how did I come to this observation?

RVP supplied me with a little program to do random seeks and reads of
sectors, create a transportable data file, and then to verify these
reads again using the data file.  I modified it to do random-sized reads
(in sizes of small powers of two blocks), and then I ran it first on the
dom0 from NetBSD to generate the initial data file, and then on a
FreeBSD domU to check against the data file.  I used a device which I
had initialised with sectors of data where each byte in the sector is
the sector's offset, modulo 256.  The program also currently randomly
chooses between lseek()+read() and pread() when doing the verification
step, though I don't know if that is important, except see the note
about dd(1) below.

As a recap:

- This test program effectively reproduces a failure similar to that of
  "newfs && fsck" on the FreeBSD domU, but shows us how the data looks
  when and where there are any problems.

- This problem does not happen in NetBSD domUs (including with the same
  LVM LV backed device and the same blkchk program).

- This problem does not happen if the dom0 is running NetBSD 8.99.32,
  (but does happen with 9.99.81).

- quite surprisingly this problem does not seem to happen with dd(1):

	# dd if=/dev/da0 bs=512 skip=32 count=32 | od -c  > /var/tmp/512
	32+0 records in
	32+0 records out
	16384 bytes transferred in 0.046450 secs (352727 bytes/sec)
	# dd if=/dev/da0 bs=16384 skip=1 count=1 | od -c  > /var/tmp/16384
	1+0 records in
	1+0 records out
	16384 bytes transferred in 0.002174 secs (7536189 bytes/sec)
	# diff /var/tmp/512 /var/tmp/16384
	#

  see also a matching example from blkchk below

  ktrace of FreeBSD dd shows it does lseek() and read()s.

  Note it is impossible to test every example from the random data with
  dd(1) due to the fact that it is impossible to make dd(1) skip in
  units other than its (input) block size.


The awk program to initialize the device is here:

    https://github.com/robohack/experiments/blob/master/tblocks.awk

The block checker is here:

    https://github.com/robohack/experiments/blob/master/blkchk.c


So this program does something a little more easily observable than
newfs does, but it's still not quite obvious what all the preconditions
are, given that a one-off example tests with dd do not reproduce the
same symptoms.


An example of its output in the failure scenario on FreeBSD, where it is
showing that for every read of a block larger than 1024, 1024th (and on)
byte does not match what was read from the same offset on the NetBSD dom0:

# ./blkchk check /dev/da0 ckfile.txt
blkchk: read 65536 bytes @ 19355076608: mismatch: /dev/da0[+1024] \x8c != ckfile.txt[ln#1][1024] \x86
blkchk: pread 32768 bytes @ 10718762496: mismatch: /dev/da0[+1024] \xb3 != ckfile.txt[ln#2][1024] \xad
blkchk: read 32768 bytes @ 5913347072: mismatch: /dev/da0[+1024] \x4a != ckfile.txt[ln#9][1024] \x44
blkchk: read 32768 bytes @ 25366386688: mismatch: /dev/da0[+1024] \x34 != ckfile.txt[ln#11][1024] \x2e
blkchk: pread 32768 bytes @ 25685901824: mismatch: /dev/da0[+1024] \xe9 != ckfile.txt[ln#14][1024] \xe3
blkchk: read 8192 bytes @ 23109372416: mismatch: /dev/da0[+1024] \x8d != ckfile.txt[ln#15][1024] \x87
blkchk: read 32768 bytes @ 20043252224: mismatch: /dev/da0[+1024] \xe9 != ckfile.txt[ln#18][1024] \xe3
blkchk: pread 8192 bytes @ 28141568000: mismatch: /dev/da0[+1024] \x28 != /var/tmp/ckfile.bigones[ln#17][1024] \x22


Here's effectively the same test as last line, but done with dd(1) (it
has an offset that is evenly divisible by the read size, making it
suitable for dd) and we don't see any problem:

# echo 28141568000 / 8192 | bc -l
3435250.00000000000000000000
# dd if=/dev/da0 bs=8192 iseek=3435250 count=1 2>/dev/null | od -h
0000000      2020    2020    2020    2020    2020    2020    2020    2020
*
0001000      2121    2121    2121    2121    2121    2121    2121    2121
*
0002000      2222    2222    2222    2222    2222    2222    2222    2222
*
0003000      2323    2323    2323    2323    2323    2323    2323    2323
*
0004000      2424    2424    2424    2424    2424    2424    2424    2424
*
0005000      2525    2525    2525    2525    2525    2525    2525    2525
*
0006000      2626    2626    2626    2626    2626    2626    2626    2626
*
0007000      2727    2727    2727    2727    2727    2727    2727    2727
*
0010000      2828    2828    2828    2828    2828    2828    2828    2828
*
0011000      2929    2929    2929    2929    2929    2929    2929    2929
*
0012000      2a2a    2a2a    2a2a    2a2a    2a2a    2a2a    2a2a    2a2a
*
0013000      2b2b    2b2b    2b2b    2b2b    2b2b    2b2b    2b2b    2b2b
*
0014000      2c2c    2c2c    2c2c    2c2c    2c2c    2c2c    2c2c    2c2c
*
0015000      2d2d    2d2d    2d2d    2d2d    2d2d    2d2d    2d2d    2d2d
*
0016000      2e2e    2e2e    2e2e    2e2e    2e2e    2e2e    2e2e    2e2e
*
0017000      2f2f    2f2f    2f2f    2f2f    2f2f    2f2f    2f2f    2f2f
*
0020000


Let's try that again with just the one sample data line and blkchk:


# grep 28141568000 /var/tmp/ckfile.txt > /var/tmp/ckfile.1
# /var/tmp/blkchk check /dev/da0 /var/tmp/ckfile.1
blkchk: pread 8192 bytes @ 28141568000: mismatch: /dev/da0[+1024] \x28 != /var/tmp/ckfile.1[ln#0][1024] \x22
#

Every byte after 1024 is different, but I'll cut it off at 10:

# /var/tmp/blkchk check -v /dev/da0 /var/tmp/ckfile.1 2>&1 | head
blkchk: pread 8192 bytes @ 28141568000: mismatch: /dev/da0[+1024] \x28 != /var/tmp/ckfile.1[ln#0][1024] \x22
blkchk: pread 8192 bytes @ 28141568000: mismatch: /dev/da0[+1025] \x28 != /var/tmp/ckfile.1[ln#0][1025] \x22
blkchk: pread 8192 bytes @ 28141568000: mismatch: /dev/da0[+1026] \x28 != /var/tmp/ckfile.1[ln#0][1026] \x22
blkchk: pread 8192 bytes @ 28141568000: mismatch: /dev/da0[+1027] \x28 != /var/tmp/ckfile.1[ln#0][1027] \x22
blkchk: pread 8192 bytes @ 28141568000: mismatch: /dev/da0[+1028] \x28 != /var/tmp/ckfile.1[ln#0][1028] \x22
blkchk: pread 8192 bytes @ 28141568000: mismatch: /dev/da0[+1029] \x28 != /var/tmp/ckfile.1[ln#0][1029] \x22
blkchk: pread 8192 bytes @ 28141568000: mismatch: /dev/da0[+1030] \x28 != /var/tmp/ckfile.1[ln#0][1030] \x22
blkchk: pread 8192 bytes @ 28141568000: mismatch: /dev/da0[+1031] \x28 != /var/tmp/ckfile.1[ln#0][1031] \x22
blkchk: pread 8192 bytes @ 28141568000: mismatch: /dev/da0[+1032] \x28 != /var/tmp/ckfile.1[ln#0][1032] \x22
blkchk: pread 8192 bytes @ 28141568000: mismatch: /dev/da0[+1033] \x28 != /var/tmp/ckfile.1[ln#0][1033] \x22


--
					Greg A. Woods <gwoods%acm.org@localhost>

Kelowna, BC     +1 250 762-7675           RoboHack <woods%robohack.ca@localhost>
Planix, Inc. <woods%planix.com@localhost>     Avoncote Farms <woods%avoncote.ca@localhost>

Attachment: pgpvnM5_8wSh2.pgp
Description: OpenPGP Digital Signature



Home | Main Index | Thread Index | Old Index