NetBSD-Bugs archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

kern/56686: wd(4) device timeouts



>Number:         56686
>Category:       kern
>Synopsis:       wd(4) device timeouts
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Fri Feb 04 11:05:00 +0000 2022
>Originator:     Andreas Gustafsson
>Release:        NetBSD 9.2
>Organization:
  
>Environment:
System: NetBSD
Architecture: x86_64
Machine: amd64
>Description:

With a workload involving multiple emulator processes running NetBSD
guests running ATF tests on a ST5000LM000-2AN170 disk in a Dell
PowerEdge R630 server, I am seeing bursts of "device timeout" errors
several times a day.  This PR is about these device timeout errors
themselves; for the panics they sometimes trigger, see PR 56479.

Here is the console log from one such burst:

[ 45042.1012910] wd3d: device timeout writing fsbn 4431519296 of 4431519296-4431519359 (wd3 bn 4431519296; cn 4396348 tn 8 sn 8), xfer 218, retry 0
[ 45042.2542704] wd3d: device timeout writing fsbn 3779132800 of 3779132800-3779132927 (wd3 bn 3779132800; cn 3749139 tn 10 sn 58), xfer 50, retry 0
[ 45042.4102613] wd3d: device timeout writing fsbn 4418502912 of 4418502912-4418502935 (wd3 bn 4418502912; cn 4383435 tn 6 sn 54), xfer e8, retry 0
[ 45042.5652114] wd3d: device timeout writing fsbn 3778503360 of 3778503360-3778503423 (wd3 bn 3778503360; cn 3748515 tn 3 sn 51), xfer 180, retry 0
[ 45042.7212016] wd3d: device timeout writing fsbn 4654435520 of 4654435520-4654435583 (wd3 bn 4654435520; cn 4617495 tn 8 sn 56), xfer 2b0, retry 0
[ 45042.8771911] wd3d: device timeout writing fsbn 4417615040 of 4417615040-4417615103 (wd3 bn 4417615040; cn 4382554 tn 9 sn 41), xfer 348, retry 0
[ 45043.0331818] wd3d: device timeout writing fsbn 3957636800 of 3957636800-3957636863 (wd3 bn 3957636800; cn 3926226 tn 15 sn 47), xfer 3e0, retry 0
[ 45043.1902111] wd3d: device timeout writing fsbn 4030504640 of 4030504640-4030504703 (wd3 bn 4030504640; cn 3998516 tn 8 sn 8), xfer 478, retry 0
[ 45043.3451615] wd3d: device timeout writing fsbn 3945492160 of 3945492160-3945492223 (wd3 bn 3945492160; cn 3914178 tn 11 sn 43), xfer 510, retry 0
[ 45043.5021917] wd3d: device timeout writing fsbn 3763322560 of 3763322560-3763322623 (wd3 bn 3763322560; cn 3733454 tn 14 sn 46), xfer 5a8, retry 0
[ 45043.6592219] wd3d: device timeout writing fsbn 3881732800 of 3881732800-3881732863 (wd3 bn 3881732800; cn 3850925 tn 6 sn 22), xfer 640, retry 0
[ 45043.8152122] wd3d: device timeout writing fsbn 3778503296 of 3778503296-3778503359 (wd3 bn 3778503296; cn 3748515 tn 2 sn 50), xfer 6d8, retry 0
[ 45043.9712037] wd3d: device timeout writing fsbn 3779132928 of 3779132928-3779133055 (wd3 bn 3779132928; cn 3749139 tn 12 sn 60), xfer 770, retry 0
[ 45044.1282335] wd3d: device timeout writing fsbn 4031996800 of 4031996800-4031996927 (wd3 bn 4031996800; cn 3999996 tn 13 sn 13), xfer 808, retry 0
[ 45044.2852623] wd3d: device timeout writing fsbn 3950007936 of 3950007936-3950007999 (wd3 bn 3950007936; cn 3918658 tn 10 sn 42), xfer 8a0, retry 0
[ 45044.4422926] wd3d: device timeout writing fsbn 3764548672 of 3764548672-3764548735 (wd3 bn 3764548672; cn 3734671 tn 4 sn 52), xfer 938, retry 0
[ 45044.5982845] wd3d: device timeout writing fsbn 3764544800 of 3764544800-3764544831 (wd3 bn 3764544800; cn 3734667 tn 7 sn 23), xfer 9d0, retry 0
[ 45044.7542730] wd3d: device timeout writing fsbn 7124681152 of 7124681152-7124681279 (wd3 bn 7124681152; cn 7068136 tn 1 sn 1), xfer a68, retry 0
[ 45044.9092242] wd3d: device timeout writing fsbn 7124681088 of 7124681088-7124681151 (wd3 bn 7124681088; cn 7068136 tn 0 sn 0), xfer b00, retry 0
[ 45045.0641735] wd3d: device timeout writing fsbn 3951300736 of 3951300736-3951300799 (wd3 bn 3951300736; cn 3919941 tn 3 sn 19), xfer b98, retry 0
[ 45045.2201647] wd3d: device timeout writing fsbn 3958537280 of 3958537280-3958537343 (wd3 bn 3958537280; cn 3927120 tn 5 sn 5), xfer c30, retry 0
[ 45045.3751142] wd3d: device timeout writing fsbn 3883039872 of 3883039872-3883039935 (wd3 bn 3883039872; cn 3852222 tn 1 sn 33), xfer cc8, retry 0
[ 45045.5311042] wd3d: device timeout writing fsbn 3883032192 of 3883032192-3883032255 (wd3 bn 3883032192; cn 3852214 tn 7 sn 39), xfer d60, retry 0
[ 45045.6870949] wd3d: device timeout writing fsbn 4770050496 of 4770050496-4770050559 (wd3 bn 4770050496; cn 4732192 tn 15 sn 15), xfer df8, retry 0
[ 45046.3329589] wd3: soft error (corrected) xfer df8
[ 45046.3904871] wd3: soft error (corrected) xfer d60
[ 45046.4476828] wd3: soft error (corrected) xfer cc8
[ 45046.5048796] wd3: soft error (corrected) xfer c30
[ 45046.5620769] wd3: soft error (corrected) xfer b98
[ 45046.6192729] wd3: soft error (corrected) xfer b00
[ 45046.6764685] wd3: soft error (corrected) xfer a68
[ 45046.7336655] wd3: soft error (corrected) xfer 9d0
[ 45046.7908619] wd3: soft error (corrected) xfer 938
[ 45046.8480580] wd3: soft error (corrected) xfer 8a0
[ 45046.9052551] wd3: soft error (corrected) xfer 808
[ 45046.9624508] wd3: soft error (corrected) xfer 770
[ 45047.0196475] wd3: soft error (corrected) xfer 6d8
[ 45047.0768438] wd3: soft error (corrected) xfer 640
[ 45047.1340403] wd3: soft error (corrected) xfer 5a8
[ 45047.1912370] wd3: soft error (corrected) xfer 510
[ 45047.2484331] wd3: soft error (corrected) xfer 478
[ 45047.3056288] wd3: soft error (corrected) xfer 3e0
[ 45047.3628253] wd3: soft error (corrected) xfer 348
[ 45047.4200222] wd3: soft error (corrected) xfer 2b0
[ 45047.4772193] wd3: soft error (corrected) xfer 180
[ 45047.5344148] wd3: soft error (corrected) xfer e8
[ 45047.5905719] wd3: soft error (corrected) xfer 50
[ 45047.6467280] wd3: soft error (corrected) xfer 218

Here's another one, which interestingly includes an "uncorrectable
data error" followed by a "soft error (corrected)" for the same
transfer:

[ 56845.1934106] wd3d: device timeout reading fsbn 4190850944 of 4190850944-4190850967 (wd3 bn 4190850944; cn 4157590 tn 3 sn 35), xfer 490, retry 0
[ 56845.3474313] wd3d: device timeout reading fsbn 7217102720 of 7217102720-7217102783 (wd3 bn 7217102720; cn 7159824 tn 2 sn 2), xfer 68, retry 0
[ 56845.5013421] wd3d: device timeout writing fsbn 4492000896 of 4492000896-4492000959 (wd3 bn 4492000896; cn 4456350 tn 1 sn 33), xfer 198, retry 0
[ 56845.6573313] wd3d: device timeout writing fsbn 3945492160 of 3945492160-3945492223 (wd3 bn 3945492160; cn 3914178 tn 11 sn 43), xfer 230, retry 0
[ 56845.8143624] wd3d: device timeout writing fsbn 3881732800 of 3881732800-3881732863 (wd3 bn 3881732800; cn 3850925 tn 6 sn 22), xfer 2c8, retry 0
[ 56845.9703522] wd3d: device timeout writing fsbn 4030504640 of 4030504640-4030504703 (wd3 bn 4030504640; cn 3998516 tn 8 sn 8), xfer 360, retry 0
[ 56846.1253022] wd3d: device timeout writing fsbn 3951298944 of 3951298944-3951299007 (wd3 bn 3951298944; cn 3919939 tn 6 sn 54), xfer 100, retry 0
[ 56846.2812920] wd3d: device timeout writing fsbn 3951359872 of 3951359872-3951359935 (wd3 bn 3951359872; cn 3919999 tn 13 sn 61), xfer 3f8, retry 0
[ 56846.4383222] wd3d: device timeout writing fsbn 4031996864 of 4031996864-4031996927 (wd3 bn 4031996864; cn 3999996 tn 14 sn 14), xfer 528, retry 0
[ 56846.5953523] wd3d: device timeout writing fsbn 4492654656 of 4492654656-4492654783 (wd3 bn 4492654656; cn 4456998 tn 10 sn 42), xfer 5c0, retry 0
[ 56846.7523839] wd3d: device timeout reading fsbn 4669633728 of 4669633728-4669633791 (wd3 bn 4669633728; cn 4632573 tn 2 sn 18), xfer 658, retry 0
[ 56846.9083727] wd3d: device timeout writing fsbn 3883017088 of 3883017088-3883017151 (wd3 bn 3883017088; cn 3852199 tn 7 sn 55), xfer 6f0, retry 0
[ 56914.8308581] wd3d: error writing fsbn 3883017088 of 3883017088-3883017151 (wd3 bn 3883017088; cn 3852199 tn 7 sn 55), xfer 6f0, retry 1
[ 56914.9709129] wd3: (aborted command, id not found, uncorrectable data error)
[ 56915.0509450] wd3d: requeue reading fsbn 4669633728 of 4669633728-4669633791 (wd3 bn 4669633728; cn 4632573 tn 2 sn 18), xfer 658, retry 1
[ 56915.2010038] wd3d: requeue writing fsbn 4492654656 of 4492654656-4492654783 (wd3 bn 4492654656; cn 4456998 tn 10 sn 42), xfer 5c0, retry 1
[ 56915.3510630] wd3d: requeue writing fsbn 4031996864 of 4031996864-4031996927 (wd3 bn 4031996864; cn 3999996 tn 14 sn 14), xfer 528, retry 1
[ 56915.5011219] wd3d: requeue writing fsbn 3951359872 of 3951359872-3951359935 (wd3 bn 3951359872; cn 3919999 tn 13 sn 61), xfer 3f8, retry 1
[ 56915.6511814] wd3d: requeue writing fsbn 3951298944 of 3951298944-3951299007 (wd3 bn 3951298944; cn 3919939 tn 6 sn 54), xfer 100, retry 1
[ 56915.8012403] wd3d: requeue writing fsbn 4030504640 of 4030504640-4030504703 (wd3 bn 4030504640; cn 3998516 tn 8 sn 8), xfer 360, retry 1
[ 56915.9512994] wd3d: requeue writing fsbn 3881732800 of 3881732800-3881732863 (wd3 bn 3881732800; cn 3850925 tn 6 sn 22), xfer 2c8, retry 1
[ 56916.1013587] wd3d: requeue writing fsbn 3945492160 of 3945492160-3945492223 (wd3 bn 3945492160; cn 3914178 tn 11 sn 43), xfer 230, retry 1
[ 56916.2514185] wd3d: requeue writing fsbn 4492000896 of 4492000896-4492000959 (wd3 bn 4492000896; cn 4456350 tn 1 sn 33), xfer 198, retry 1
[ 56916.4014769] wd3d: requeue reading fsbn 7217102720 of 7217102720-7217102783 (wd3 bn 7217102720; cn 7159824 tn 2 sn 2), xfer 68, retry 1
[ 56916.5415318] wd3d: requeue reading fsbn 4190850944 of 4190850944-4190850967 (wd3 bn 4190850944; cn 4157590 tn 3 sn 35), xfer 490, retry 1
[ 56916.6915911] wd3: soft error (corrected) xfer 6f0
[ 56916.7548500] wd3: soft error (corrected) xfer 198
[ 56916.8121630] wd3: soft error (corrected) xfer 230
[ 56916.8693600] wd3: soft error (corrected) xfer 2c8
[ 56916.9265564] wd3: soft error (corrected) xfer 360
[ 56916.9837519] wd3: soft error (corrected) xfer 100
[ 56917.0409486] wd3: soft error (corrected) xfer 3f8
[ 56917.0981443] wd3: soft error (corrected) xfer 528
[ 56917.1553414] wd3: soft error (corrected) xfer 5c0
[ 56917.3418506] wd3: soft error (corrected) xfer 490
[ 56918.7023839] wd3: soft error (corrected) xfer 658

Here's the dmesg output for the controller and disk again:
 
 [     1.092336] ahcisata1 at pci0 dev 31 function 2: vendor 8086 product 8d02 (rev. 0x05)
 [     1.092336] ahcisata1: 64-bit DMA
 [     1.092336] ahcisata1: AHCI revision 1.30, 6 ports, 32 slots, CAP 0xcb30ff45<EMS,PSC,SSC,PMD,ISS=0x3=Gen3,SCLO,SAL,SSS,SNCQ,S64A>
 [     1.092336] ahcisata1: interrupting at msi5 vec 0
 [     1.092336] atabus4 at ahcisata1 channel 0
 [     1.092336] atabus5 at ahcisata1 channel 1
 [     1.092336] atabus6 at ahcisata1 channel 2
 [     1.092336] atabus7 at ahcisata1 channel 3
 [     1.092336] atabus8 at ahcisata1 channel 4
 [     1.092336] atabus9 at ahcisata1 channel 5
 [     3.251001] ahcisata1 port 1: device present, speed: 6.0Gb/s
 [...]
 [     5.921953] wd3: <ST5000LM000-2AN170>
 [     5.961967] wd3: drive supports 1-sector PIO transfers, LBA48 addressing
 [     5.961967] wd3: 4657 GB, 9690021 cyl, 16 head, 63 sec, 512 bytes/sect x 9767541168 sectors (0 bytes/physsect; first aligned sector: 8)
 [     6.422131] wd3: GPT GUID: 31a830f6-dace-4062-ab89-f6911c261385
 [     6.422131] dk2 at wd3: "65183669-9719-47da-9e2a-09a1f9a7bf6d", 9767538688 blocks at 2048, type: ffs
 [     6.532170] wd3: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133), WRITE DMA FUA, NCQ (32 tags)
 [     6.532170] wd3(ahcisata1:1:0): using PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133) (using DMA), NCQ (31 tags)

The SMART log of the disk shows no errors.

>How-To-Repeat:

>Fix:



Home | Main Index | Thread Index | Old Index