NetBSD-Bugs archive
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]
Re: kern/43986: ataraid(4) doesn't seem to handle weird array configs very gracefully
The following reply was made to PR kern/43986; it has been noted by GNATS.
From: Taylor R Campbell <riastradh%NetBSD.org@localhost>
To: "Greg A. Woods" <woods%planix.com@localhost>
Cc: gnats-bugs%NetBSD.org@localhost, netbsd-bugs%NetBSD.org@localhost
Subject: Re: kern/43986: ataraid(4) doesn't seem to handle weird array configs very gracefully
Date: Tue, 26 Mar 2024 02:12:00 +0000
> uvm_fault(0xc0ab2c40, 0, 1) -> 0xe
> kernel: supervisor trap page fault, code=3D0
> Stopped in pid 0.1 (swapper) at netbsd:strncmp+0x23: movzbl 0(%ebx),%=
eax
> db{0}> trace
> strncmp(c095397a,1c,6,282,c0a9564c) at netbsd:strncmp+0x23
This is an attempt to access the page zero. Not exactly a null
pointer dererence, but a near-null pointer dereference -- the second
argument is 0x1c.
> devsw_name2blk(1c,0,0,0,0) at netbsd:devsw_name2blk+0x7e
It probably happened here:
535 if (strncmp(conv->d_name, name, len) !=3D 0)
https://nxr.netbsd.org/xref/src/sys/kern/subr_devsw.c?r=3D1.28#535
> ld_ataraid_attach(c3d9c940,c3dede00,c3da0600,c37d4000,c37f4000) at netbsd=
:ld_ataraid_attach+0x1a8
The devsw_name2blk call is probaby this one:
78 bmajor =3D devsw_name2blk(device_xname(adi->adi_dev), NULL, 0);
https://nxr.netbsd.org/xref/src/sys/dev/ata/ata_raid_subr.c?r=3D1.2#78
0x1c is offsetof(struct device, dv_xname) on i386 as of around 2010:
https://nxr.netbsd.org/xref/src/sys/sys/device.h?r=3D1.137#141
So adi->adi_dev is probably null here.
ata_raid_disk_vnode_find assumes that adi->adi_dev is nonnull. This
is probably a bug on its own: the caller handles a null return, so
either if the input is null, the output should be null, or the caller
should avoid calling ata_raid_disk_vnode_find with a null input.
225 adi =3D &aai->aai_disks[i];
226 vp =3D ata_raid_disk_vnode_find(adi);
227 if (vp =3D=3D NULL) {
228 /*
229 * XXX This is bogus. We should just mark the
230 * XXX component as FAILED, and write-back new
231 * XXX config blocks.
232 */
233 break;
234 }
https://nxr.netbsd.org/xref/src/sys/dev/ata/ld_ataraid.c?r=3D1.37#225
How did adi->adi_dev come to be null?
Well, the struct ataraid_disk_info *adi structure is one of the
aai->aai_ndisks entries in a struct ataraid_array_info *aai structure
which is created by ataraid_get_array_info:
270 aai =3D malloc(sizeof(*aai), M_DEVBUF, M_WAITOK | M_ZERO);
...
289 TAILQ_INSERT_TAIL(&ataraid_array_info_list, aai, aai_list);
https://nxr.netbsd.org/xref/src/sys/dev/ata/ata_raid.c?r=3D1.34#258
This is called by the various ata_raid_*.c drivers. From the dmesg
you shared, it looks like this was an Adaptec adapter, so aai_ndisks
is filled in from the first component here:
156 aai->aai_ndisks =3D be16toh(info->configs[0].total_disks);
https://nxr.netbsd.org/xref/src/sys/dev/ata/ata_raid_adaptec.c?r=3D1.9#156
And aai->aai_disks[drive].adi_dev is filled in for each component
found here:
182 adi =3D &aai->aai_disks[drive];
183 adi->adi_dev =3D sc->sc_dev;
https://nxr.netbsd.org/xref/src/sys/dev/ata/ata_raid_adaptec.c?r=3D1.9#182
If any components are missing, there will be some number `drive' for
which aai->aai_disks[drive].adi_dev remains null, and the logic will
crash where you saw this. That seems to be consistent with your
explanation of the state of the disks.
None of this code has changed much since this PR was filed, so I
suspect the bug is still there. I'm inclined to say that
ld_ataraid_attach should just check adi->adi_dev =3D=3D NULL and handle it
like when ata_raid_disk_vnode_find returns null -- in both cases, it
needs to deal with a missing component.
--- ld_ataraid.c
+++ ld_ataraid.c
@@ -226,8 +226,8 @@
*/
for (i =3D 0; i < aai->aai_ndisks; i++) {
adi =3D &aai->aai_disks[i];
- vp =3D ata_raid_disk_vnode_find(adi);
- if (vp =3D=3D NULL) {
+ if (adi =3D=3D NULL ||
+ (vp =3D ata_raid_disk_vnode_find(adi)) =3D=3D NULL) {
/*
* XXX This is bogus. We should just mark the
* XXX component as FAILED, and write-back new
Now, the failure branch logic here may be wrong -- it just gives up
instead of trying to deal with a missing component -- but that's a
separate issue which, with any luck, should lead to a more graceful
failure than crash, even if the more graceful failure isn't ideal.
Home |
Main Index |
Thread Index |
Old Index