NetBSD-Bugs archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

Re: kern/43986: ataraid(4) doesn't seem to handle weird array configs very gracefully



The following reply was made to PR kern/43986; it has been noted by GNATS.

From: Taylor R Campbell <riastradh%NetBSD.org@localhost>
To: "Greg A. Woods" <woods%planix.com@localhost>
Cc: gnats-bugs%NetBSD.org@localhost, netbsd-bugs%NetBSD.org@localhost
Subject: Re: kern/43986: ataraid(4) doesn't seem to handle weird array configs very gracefully
Date: Tue, 26 Mar 2024 02:12:00 +0000

 > uvm_fault(0xc0ab2c40, 0, 1) -> 0xe
 > kernel: supervisor trap page fault, code=3D0
 > Stopped in pid 0.1 (swapper) at netbsd:strncmp+0x23:    movzbl  0(%ebx),%=
 eax
 > db{0}> trace
 > strncmp(c095397a,1c,6,282,c0a9564c) at netbsd:strncmp+0x23
 
 This is an attempt to access the page zero.  Not exactly a null
 pointer dererence, but a near-null pointer dereference -- the second
 argument is 0x1c.
 
 > devsw_name2blk(1c,0,0,0,0) at netbsd:devsw_name2blk+0x7e
 
 It probably happened here:
 
     535 		if (strncmp(conv->d_name, name, len) !=3D 0)
 
 https://nxr.netbsd.org/xref/src/sys/kern/subr_devsw.c?r=3D1.28#535
 
 > ld_ataraid_attach(c3d9c940,c3dede00,c3da0600,c37d4000,c37f4000) at netbsd=
 :ld_ataraid_attach+0x1a8
 
 The devsw_name2blk call is probaby this one:
 
      78 	bmajor =3D devsw_name2blk(device_xname(adi->adi_dev), NULL, 0);
 
 https://nxr.netbsd.org/xref/src/sys/dev/ata/ata_raid_subr.c?r=3D1.2#78
 
 0x1c is offsetof(struct device, dv_xname) on i386 as of around 2010:
 
 https://nxr.netbsd.org/xref/src/sys/sys/device.h?r=3D1.137#141
 
 So adi->adi_dev is probably null here.
 
 ata_raid_disk_vnode_find assumes that adi->adi_dev is nonnull.  This
 is probably a bug on its own: the caller handles a null return, so
 either if the input is null, the output should be null, or the caller
 should avoid calling ata_raid_disk_vnode_find with a null input.
 
     225 		adi =3D &aai->aai_disks[i];
     226 		vp =3D ata_raid_disk_vnode_find(adi);
     227 		if (vp =3D=3D NULL) {
     228 			/*
     229 			 * XXX This is bogus.  We should just mark the
     230 			 * XXX component as FAILED, and write-back new
     231 			 * XXX config blocks.
     232 			 */
     233 			break;
     234 		}
 
 https://nxr.netbsd.org/xref/src/sys/dev/ata/ld_ataraid.c?r=3D1.37#225
 
 How did adi->adi_dev come to be null?
 
 Well, the struct ataraid_disk_info *adi structure is one of the
 aai->aai_ndisks entries in a struct ataraid_array_info *aai structure
 which is created by ataraid_get_array_info:
 
     270 	aai =3D malloc(sizeof(*aai), M_DEVBUF, M_WAITOK | M_ZERO);
 ...
     289 	TAILQ_INSERT_TAIL(&ataraid_array_info_list, aai, aai_list);
 
 https://nxr.netbsd.org/xref/src/sys/dev/ata/ata_raid.c?r=3D1.34#258
 
 This is called by the various ata_raid_*.c drivers.  From the dmesg
 you shared, it looks like this was an Adaptec adapter, so aai_ndisks
 is filled in from the first component here:
 
     156 		aai->aai_ndisks =3D be16toh(info->configs[0].total_disks);
 
 https://nxr.netbsd.org/xref/src/sys/dev/ata/ata_raid_adaptec.c?r=3D1.9#156
 
 And aai->aai_disks[drive].adi_dev is filled in for each component
 found here:
 
     182 	adi =3D &aai->aai_disks[drive];
     183 	adi->adi_dev =3D sc->sc_dev;
 
 https://nxr.netbsd.org/xref/src/sys/dev/ata/ata_raid_adaptec.c?r=3D1.9#182
 
 If any components are missing, there will be some number `drive' for
 which aai->aai_disks[drive].adi_dev remains null, and the logic will
 crash where you saw this.  That seems to be consistent with your
 explanation of the state of the disks.
 
 None of this code has changed much since this PR was filed, so I
 suspect the bug is still there.  I'm inclined to say that
 ld_ataraid_attach should just check adi->adi_dev =3D=3D NULL and handle it
 like when ata_raid_disk_vnode_find returns null -- in both cases, it
 needs to deal with a missing component.
 
 --- ld_ataraid.c
 +++ ld_ataraid.c
 @@ -226,8 +226,8 @@
  	 */
  	for (i =3D 0; i < aai->aai_ndisks; i++) {
  		adi =3D &aai->aai_disks[i];
 -		vp =3D ata_raid_disk_vnode_find(adi);
 -		if (vp =3D=3D NULL) {
 +		if (adi =3D=3D NULL ||
 +		    (vp =3D ata_raid_disk_vnode_find(adi)) =3D=3D NULL) {
  			/*
  			 * XXX This is bogus.  We should just mark the
  			 * XXX component as FAILED, and write-back new
 
 Now, the failure branch logic here may be wrong -- it just gives up
 instead of trying to deal with a missing component -- but that's a
 separate issue which, with any luck, should lead to a more graceful
 failure than crash, even if the more graceful failure isn't ideal.
 


Home | Main Index | Thread Index | Old Index