NetBSD-Bugs archive


Re: kern/57133



The following reply was made to PR kern/57133; it has been noted by GNATS.

From: Brian Buhrow <buhrow%nfbcal.org@localhost>
To: gnats-bugs%netbsd.org@localhost
Cc: buhrow%nfbcal.org@localhost
Subject: Re: kern/57133
Date: Fri, 29 Sep 2023 11:47:07 -0700

 	Hello.  While playing with a new system, I ran into this bug and am able to reliably
 reproduce the problem.  So, I did a little digging and found some additional details:
 
 1.  mpii.c V1.25 with NetBSD 9.99.77 from January 2021 demonstrates the problem.
 NetBSD-9.1_STABLE with mpii.c V1.22.4.1 works just fine.
 
 
 2.  The only differences between the two versions are a bunch of changes that make most of the
 functions in mpii.c static and that call malloc with M_WAITOK set, so error checking
 for memory shortages can be dropped from mpii.c.
 
 3.  That makes the panic a symptom of the trouble, rather than its cause.  To that end, I
 modified mpii.c so that when xs->resid != xs->datalen, it prints an error message with the two
 values, rather than panicking.  Now, I can boot the system and do things with it.  In the
 excerpted dmesg output below, the disks which demonstrate the problem during the probe are
 Western Digital 4TB SATA3 disks, while the 10TB Seagate SAS3 disks don't appear to demonstrate
 the problem.  It seems that the problem here is that something changed in the scsipi subsystem
 and the mpii.c driver makes an assumption about what should be in the xs structure that no
 longer holds.  I did a cursory search down the source tree to see if I could find any other drivers
 that check to see if xs->resid == xs->datalen, but I didn't find any obvious examples.
 	With that said, I'm not familiar enough with the scsipi system to say this check should be
 removed from the mpii.c driver, but does it make sense to have the system panic when the values
 don't match?  What is this check guarding against?
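 
 	For illustration only, the kind of change described in point 3 can be sketched in
 userland C roughly as follows.  This is not the actual mpii.c code: struct xfer_sketch and
 report_resid_mismatch() are hypothetical stand-ins, and the real driver works on the scsipi
 xs structure inside the kernel.  The sketch only shows the idea of reporting the two values
 when they differ instead of panicking.

```c
#include <stdio.h>

/* Hypothetical stand-in for the two xs fields discussed above. */
struct xfer_sketch {
	int resid;	/* residual byte count */
	int datalen;	/* requested transfer length */
};

/*
 * Report a mismatch instead of panicking.  Returns 1 and fills buf with a
 * diagnostic line when resid != datalen; returns 0 and empties buf otherwise.
 * (Assumed helper, not part of any NetBSD API.)
 */
static int
report_resid_mismatch(const char *dev, const struct xfer_sketch *xs,
    char *buf, size_t buflen)
{
	if (xs->resid != xs->datalen) {
		snprintf(buf, buflen, "%s: resid = %d, datalen = %d",
		    dev, xs->resid, xs->datalen);
		return 1;
	}
	if (buflen > 0)
		buf[0] = '\0';
	return 0;
}
```

 	The message format mirrors the "mpii0: resid = ..., datalen = ..." lines in the dmesg
 excerpt; in a kernel driver the text would be printed directly rather than returned in a buffer.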
 
 	Here is the excerpted dmesg output from a successful boot with my diagnostic messages
 instead of the panic.  If anyone has ideas on other things to try, or ideas on what changed in
 the scsipi subsystem to break this check, I'd be interested to know.
 -thanks
 -Brian
 
 
 [   1.0000000] Copyright (c) 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005,
 [   1.0000000]     2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017,
 [   1.0000000]     2018, 2019, 2020, 2021 The NetBSD Foundation, Inc.  All rights reserved.
 [   1.0000000] Copyright (c) 1982, 1986, 1989, 1991, 1993
 [   1.0000000]     The Regents of the University of California.  All rights reserved.
 
 [   1.0000000] NetBSD 9.99.77 (INSTALL) #0: Fri Sep 29 03:37:19 PDT 2023
 [   1.0000000] 	buhrow%loth-9.nfbcal.org@localhost:/usr/local/netbsd/obj-9977-64/sys/arch/amd64/compile/INSTALL
 [   1.0000000] total memory = 511 GB
 [   1.0000000] avail memory = 496 GB
 [   1.0311173] mpii0 at pci9 dev 0 function 0: Symbios Logic SAS3008 (rev. 0x02)
 [   1.0311173] mpii0: interrupting at msix9 vec 0
 [   1.0311173] mpii0: LSI3008-IT, firmware 16.0.1.0, MPI 2.5
 [   1.0311173] mpii0: physical device inserted in slot 0
 [   1.0311173] mpii0: physical device inserted in slot 1
 [   1.0311173] mpii0: physical device inserted in slot 2
 [   1.0311173] mpii0: physical device inserted in slot 5
 [   1.0311173] mpii0: physical device inserted in slot 6
 [   1.0311173] mpii0: physical device inserted in slot 7
 [   1.0311173] mpii0: physical device inserted in slot 8
 [   1.0311173] mpii0: physical device inserted in slot 9
 [   1.0311173] mpii0: physical device inserted in slot 10
 [   1.0311173] mpii0: physical device inserted in slot 11
 [   1.0311173] mpii0: physical device inserted in slot 28
 [   1.0311173] scsibus0 at mpii0: 256 targets, 8 luns per target
 [   8.5704323] scsibus0: waiting 2 seconds for devices to settle...
 [  10.5804346] sd0 at scsibus0 target 0 lun 0: <ATA, WDC WD4003FFBX-6, 0A83> disk fixed
 [  10.5804346] sd0: 3726 GB, 3815448 cyl, 16 head, 127 sec, 512 bytes/sect x 7814037168 sectors
 [  10.5911436] dk0 at sd0: "1adc589e-4f32-11ee-b97c-00259036fd2e", 7814033005 blocks at 34, type: raidframe
 [  10.6004341] sd0: tagged queueing
 [  10.7404347] mpii0: resid = 0, datalen = 16384
 [  10.7404347] sd1 at scsibus0 target 1 lun 0: <ATA, WDC WD4003FFBX-6, 0A83> disk fixed
 [  10.7504342] sd1: 3726 GB, 3815448 cyl, 16 head, 127 sec, 512 bytes/sect x 7814037168 sectors
 [  10.7904348] sd1: 3907019088 trailing sectors not covered by disklabel
 [  10.7904348] sd1: tagged queueing
 [  10.9904351] mpii0: resid = 0, datalen = 16384
 [  10.9904351] sd2 at scsibus0 target 2 lun 0: <ATA, WDC WD4003FFBX-6, 0A83> disk fixed
 [  11.0004344] sd2: 3726 GB, 3815448 cyl, 16 head, 127 sec, 512 bytes/sect x 7814037168 sectors
 [  11.0504351] sd2: 3907019088 trailing sectors not covered by disklabel
 [  11.0604345] sd2: tagged queueing
 [  11.2404351] mpii0: resid = 0, datalen = 16384
 [  11.2404351] sd3 at scsibus0 target 5 lun 0: <ATA, WDC WD4003FFBX-6, 0A83> disk fixed
 [  11.2504347] sd3: 3726 GB, 3815448 cyl, 16 head, 127 sec, 512 bytes/sect x 7814037168 sectors
 [  11.2704351] dk1 at sd3: "549e5298-5113-11ee-910e-00259036fd2e", 7814033005 blocks at 34, type: raidframe
 [  11.2804346] sd3: tagged queueing
 [  11.2904346] probe(mpii0:0:6:0): Sense Error Code 0x72
 [  11.2904346] probe(mpii0:0:6:0): Sense Error Code 0x72
 [  11.3004345] sd4 at scsibus0 target 6 lun 0: <SEAGATE, ST12000NM0158, RSL2> disk fixed
 [  11.3204350] sd4: 10949 GB, 501605 cyl, 15 head, 3051 sec, 512 bytes/sect x 22961717248 sectors
 [  11.3304345] sd4: tagged queueing
 [  11.3404346] probe(mpii0:0:7:0): Sense Error Code 0x72
 [  11.3504346] probe(mpii0:0:7:0): Sense Error Code 0x72
 [  11.3604346] sd5 at scsibus0 target 7 lun 0: <SEAGATE, ST12000NM0158, RSL2> disk fixed
 [  11.3804351] sd5: 10949 GB, 477414 cyl, 15 head, 3206 sec, 512 bytes/sect x 22961717248 sectors
 [  11.3904347] sd5: tagged queueing
 [  11.4004351] probe(mpii0:0:8:0): Sense Error Code 0x72
 [  11.4104345] probe(mpii0:0:8:0): Sense Error Code 0x72
 [  11.4104345] sd6 at scsibus0 target 8 lun 0: <SEAGATE, ST12000NM0158, RSL2> disk fixed
 [  11.4304351] sd6: 10949 GB, 510249 cyl, 15 head, 3000 sec, 512 bytes/sect x 22961717248 sectors
 [  11.4404351] sd6: tagged queueing
 [  11.4404351] sd7 at scsibus0 target 9 lun 0: <SEAGATE, ST12000NM0158, RSL2> disk fixed
 [  11.4704351] sd7: 10949 GB, 467635 cyl, 16 head, 3068 sec, 512 bytes/sect x 22961717248 sectors
 [  11.5004352] sd7: tagged queueing
 [  11.5004352] probe(mpii0:0:10:0): Sense Error Code 0x72
 [  11.5104346] probe(mpii0:0:10:0): Sense Error Code 0x72
 [  11.5104346] sd8 at scsibus0 target 10 lun 0: <SEAGATE, ST12000NM0158, RSL2> disk fixed
 [  11.5304352] sd8: 10949 GB, 501985 cyl, 15 head, 3049 sec, 512 bytes/sect x 22961717248 sectors
 [  11.5504352] sd8: tagged queueing
 [  11.5604376] probe(mpii0:0:11:0): Sense Error Code 0x72
 [  11.5604376] probe(mpii0:0:11:0): Sense Error Code 0x72
 [  11.5704346] sd9 at scsibus0 target 11 lun 0: <SEAGATE, ST12000NM0158, RSL2> disk fixed
 [  11.5904354] sd9: 10949 GB, 487183 cyl, 16 head, 2945 sec, 512 bytes/sect x 22961717248 sectors
 [  11.6004348] dk2 at sd9: "EFI System Partition", 1048576 blocks at 2048, type: msdos
 [  11.6104348] dk3 at sd9: "2c159d0f-f486-4170-9784-8ec1391fbc00", 22960664576 blocks at 1050624, type: ext2fs
 [  11.6204347] sd9: tagged queueing
 [  11.6304347] ses0 at scsibus0 target 28 lun 0: <LSI, SAS3x28, 0501> enclosure services fixed
 [  11.6404348] ses0: SCSI-3 SES Device
 [  11.6404348] ses0: tagged queueing
 [  15.1704381] sd1: 3907019088 trailing sectors not covered by disklabel
 [  15.1804383] sd1: 3907019088 trailing sectors not covered by disklabel
 [  15.1904427] sd2: 3907019088 trailing sectors not covered by disklabel
 [  15.1904427] sd2: 3907019088 trailing sectors not covered by disklabel
 [  15.2104381] raid1: RAID Level 1
 [  15.2104381] raid1: Components: /dev/dk0 /dev/dk1
 [  15.2104381] raid1: Total Sectors: 7814032896 (3815445 MB)
 [  15.2204383] dk4 at raid1: "2569cdee-4f34-11ee-84a1-00259036fd2e", 7814032829 blocks at 34, type: ffs
 [  15.2304381] raid0: RAID Level 1
 [  15.2404378] raid0: Components: /dev/sd2a /dev/sd1a
 [  15.2404378] raid0: Total Sectors: 3907017920 (1907723 MB)
 [  15.2604380] WARNING: 2 errors while detecting hardware; check system log.
 [  15.2804376] boot device: raid0
 [  15.2804376] root on md0a dumps on md0b
 [  15.2904383] root file system type: ffs
 [  15.2904383] kern.module.path=/stand/amd64/9.99.77/modules
 [  15.3004381] WARNING: clock lost 455 days
 [  15.3004381] WARNING: using filesystem time
 [  15.3116721] WARNING: CHECK AND RESET THE DATE!
 Created tmpfs /dev (1818624 byte, 3520 inodes)
 erase ^?, werase ^W, kill ^U, intr ^C
 
 
 --- End of forwarded message from "Brian Buhrow" <buhrow%nfbcal.org@localhost>
 
 
 
 
 
 

