NetBSD-Bugs archive


Re: kern/55362: bad inodes created after dump|restore



The following reply was made to PR kern/55362; it has been noted by GNATS.

From: Patrick Welche <prlw1%cam.ac.uk@localhost>
To: gnats-bugs%netbsd.org@localhost
Cc: 
Subject: Re: kern/55362: bad inodes created after dump|restore
Date: Thu, 19 Aug 2021 16:16:00 +0100

 A big thank you to chs@ and oster@ for working so hard on this bug.
 
 The summary, now that we know the solution, is as follows (cf. kre@'s
 comment above):
 
  Further, since the two drives have different data, and shouldn't, this
  cannot be a filesystem or above problem, it has to be either the drive
  omitting to write sometimes, or the driver forgetting to perform the write,
  or raidframe not requesting one of the writes.
 
 To catch which of the above, chs@ suggested:
 
 fbt::raidstrategy:entry,fbt::bdev_strategy:entry
 / args[0]->b_dev == 0x0023 || args[0]->b_dev == 0x0043 ||
   args[0]->b_dev == 0xa807 || args[0]->b_dev == 0xa80c ||
   args[0]->b_dev == 0x1203 || args[0]->b_dev == 0x3e03 /
 {
   /* rw is 1 when the 0x00100000 (B_READ) flag is set, 0 for a write */
   printf("func %s rw %d dev 0x%x blkno %d len %d", probefunc,
       (args[0]->b_flags & 0x00100000) != 0, args[0]->b_dev,
       args[0]->b_blkno, args[0]->b_bcount);
 }
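 
 The six b_dev values in the predicate are packed device numbers. As a
 rough aid to reading the trace, the sketch below splits them into major
 and minor numbers. It assumes the bit layout of NetBSD's major() and
 minor() macros (major in bits 8-19, minor in bits 0-7 plus bits 20-31);
 the helper names dev_major and dev_minor are made up for this example.

```python
# Sketch: split packed dev_t values into (major, minor).
# Assumption: same bit layout as NetBSD's major()/minor() macros.

def dev_major(dev: int) -> int:
    """Major number: bits 8-19 of the packed value."""
    return (dev & 0x000FFF00) >> 8

def dev_minor(dev: int) -> int:
    """Minor number: bits 0-7, plus bits 20-31 shifted down by 12."""
    return ((dev & 0xFFF00000) >> 12) | (dev & 0x000000FF)

# The device numbers matched by the DTrace predicate above.
DEVS = [0x0023, 0x0043, 0xA807, 0xA80C, 0x1203, 0x3E03]

for dev in DEVS:
    print(f"0x{dev:04x}: major {dev_major(dev)} minor {dev_minor(dev)}")
```

 Under that assumption 0xa807 and 0xa80c share a major number and have
 minors 7 and 12, which lines up with the dk7 and dk12 labels used in
 the trace excerpts in this message.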
 
 Reminder of set up:
 
   wd2 -> dk7  (backup1) --+-> raid0 -> dk19  -> cgd. -> dk..
   wd4 -> dk12 (backup2) -/
 
 Another experiment, set up as earlier in this PR, shows that cmp finds
 the two components differing between offsets
 
 3412087771137   -> 3412087775232  (from cmp)
 6664233928 b+1  -> 6664233936 b   (in blocks)
 
 inclusive. The associated bits of dtrace are:
 
 wd4d: func bdev_strategy rw 0 dev 0x43   blkno 6664233920 len 40960 =   80b
 wd2d: func bdev_strategy rw 0 dev 0x23   blkno 6664233920 len 40960 =   80b
 dk12: func bdev_strategy rw 0 dev 0xa80c blkno 6664233920 len  4096 =    8b
 dk7:  func bdev_strategy rw 0 dev 0xa807 blkno 6664233920 len  4096 =    8b
 dk12: func bdev_strategy rw 0 dev 0xa80c blkno 6664233936 len 57344 =  112b
 dk7:  func bdev_strategy rw 0 dev 0xa807 blkno 6664233936 len 57344 =  112b
 raid0:func bdev_strategy rw 0 dev 0x1203 blkno 6664233992 len  4096 =    8b
 raid0:func raidstrategy  rw 0 dev 0x1203 blkno 6664233992 len  4096 =    8b
 dk12: func bdev_strategy rw 0 dev 0xa80c blkno 6664233928 len  4096 =    8b
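 
 The conversion from the cmp byte offsets to block numbers can be
 checked with a few lines of arithmetic. This is only a sketch: it
 assumes 512-byte blocks for b_blkno and that cmp reports 1-based byte
 offsets, both of which are consistent with the figures quoted above.

```python
# Sketch: convert the inclusive, 1-based cmp byte range into 512-byte blocks.
SECTOR = 512  # assumed block size for b_blkno

first_byte = 3412087771137  # first differing byte (1-based, from cmp)
last_byte = 3412087775232   # last differing byte (1-based, inclusive)

# Zero-based offset of the first differing byte, as block + remainder.
start_blk, start_rem = divmod(first_byte - 1, SECTOR)
# Zero-based offset just past the last differing byte, as block + remainder.
end_blk, end_rem = divmod(last_byte, SECTOR)
# Size of the mismatching range in bytes.
length = last_byte - first_byte + 1

print(f"start: block {start_blk} + {start_rem} bytes")
print(f"end:   block {end_blk} + {end_rem} bytes")
print(f"size:  {length} bytes = {length // SECTOR} blocks")
```

 The range comes out as exactly 4096 bytes (8 blocks) starting at block
 6664233928, the same blkno and len as the single dk12 write in the
 last line of the excerpt.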
 
 Just to check, "grep -C3 6664233928 dtrace2.log" has only one hit:
 
   5  36066   raidstrategy:entry func raidstrategy rw 0 dev 0x1203 blkno 6628949584 len 32768
   9   4107  bdev_strategy:entry func bdev_strategy rw 0 dev 0x1203 blkno 6628949712 len 32768
   9  36066   raidstrategy:entry func raidstrategy rw 0 dev 0x1203 blkno 6628949712 len 32768
   5   4107  bdev_strategy:entry func bdev_strategy rw 0 dev 0xa80c blkno 6664233928 len 4096
   5   4107  bdev_strategy:entry func bdev_strategy rw 0 dev 0x43 blkno 6731606984 len 4096
  12   4107  bdev_strategy:entry func bdev_strategy rw 0 dev 0x1203  blkno 6628888960 len 4096
  12  36066   raidstrategy:entry func raidstrategy rw 0 dev 0x1203 blkno 6628888960 len 4096
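 
 The same check can be scripted rather than done by eye with grep. The
 sketch below is not part of the original investigation: it takes the
 dk7/dk12 lines from the excerpt further up (with the "= Nb"
 annotations dropped) and reports every write, keyed by (blkno, len),
 that reached only one of the two mirror components. The function name
 and the regular expression are assumptions made for this example, and
 it does not count repeated writes to the same block.

```python
import re
from collections import defaultdict

# Sample records copied from the dtrace excerpt in this message.
LOG = """\
func bdev_strategy rw 0 dev 0xa80c blkno 6664233920 len 4096
func bdev_strategy rw 0 dev 0xa807 blkno 6664233920 len 4096
func bdev_strategy rw 0 dev 0xa80c blkno 6664233936 len 57344
func bdev_strategy rw 0 dev 0xa807 blkno 6664233936 len 57344
func bdev_strategy rw 0 dev 0xa80c blkno 6664233928 len 4096
"""

RECORD = re.compile(r"rw (\d) dev (0x[0-9a-f]+)\s+blkno (\d+) len\s+(\d+)")
MIRROR = {0xA807, 0xA80C}  # dk7 and dk12, the two raid0 components

def one_sided_writes(text):
    """Map (blkno, len) -> set of components written, for unmirrored writes."""
    seen = defaultdict(set)
    for m in RECORD.finditer(text):
        rw, dev = int(m[1]), int(m[2], 16)
        blkno, length = int(m[3]), int(m[4])
        if rw == 0 and dev in MIRROR:  # rw 0 means a write
            seen[(blkno, length)].add(dev)
    return {key: devs for key, devs in seen.items() if devs != MIRROR}

for (blkno, length), devs in one_sided_writes(LOG).items():
    missing = ", ".join(hex(d) for d in sorted(MIRROR - devs))
    print(f"blkno {blkno} len {length}: no write seen on {missing}")
```

 On this sample the only one-sided write is blkno 6664233928 len 4096,
 seen on 0xa80c (dk12) but not on 0xa807 (dk7).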
 
 
 
 So the first component, dk7, was never written to at this block and
 still contains zeros there.
 
 
 
 chs@ & oster@ traced the issue to a PR_NOWAIT allocation from a raidframe
 pool failing.
 
 oster@ patched raidframe so that getiobuf() can't run out of memory,
 which means this PR was fixed in:
 
 https://mail-index.netbsd.org/source-changes/2021/07/23/msg131062.html
 

