tech-kern archive
ataraid(4) missing disk handling
Hi,
For the last week I have been working on a fix for ataraid(4) related
to the bug reported in kern/43986 and, partially, to kern/59130 (which
hits the same issue due to a different bug).
Taylor has a good analysis and explanation of the issue at
https://mail-index.netbsd.org/netbsd-bugs/2024/03/26/msg082202.html,
which I also noticed while testing an ATA RAID setup on VIA controllers.
For short context, ataraid(4) configures the RAID array and the per-disk
information in the vendor-specific ata_raid_<vendor>.c components, one
connected drive at a time, using information from the RAID config
blocks. The problem is that the code assumes that all initially
configured RAID drives exist and are attached. However, if one drive is
missing (removed/faulty/code bug), configuration of that drive is
skipped, leading to a failure at
https://nxr.netbsd.org/xref/src/sys/dev/ata/ld_ataraid.c#229 due to
adi->adi_dev being NULL (or, more specifically, in
device_xname(adi->adi_dev) at
https://nxr.netbsd.org/xref/src/sys/dev/ata/ata_raid_subr.c#71). After
some discussion I ended up with the following patch:
https://netbsd.org/~andvar/ata_raid_fix.diff. It checks that the disk
status is online (adi->adi_status has the ADI_S_ONLINE status flag set);
otherwise it treats the disk as if vnode_find returned NULL. That would
handle the described situation a bit more gracefully and avoid the crash.
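As a rough illustration of that check (not the actual diff), the idea can
be sketched in standalone C. The struct, the helper names and the flag
values below are simplified stand-ins made up for this example; the real
definitions live in ata_raidvar.h, and the real code works on a device_t
rather than a string.

```c
#include <stdbool.h>
#include <stddef.h>

/* Placeholder values for illustration only; see ata_raidvar.h for the
 * real definitions. */
#define ADI_S_ONLINE   0x01
#define ADI_S_ASSIGNED 0x02
#define ADI_S_SPARE    0x04

/* Hypothetical, simplified stand-in for the per-disk information. */
struct disk_info {
	const char *adi_dev;	/* stand-in for device_t; NULL if never attached */
	int adi_status;		/* ADI_S_* flags */
};

/* True if the disk has the online flag set. */
static bool
disk_is_online(const struct disk_info *adi)
{
	return (adi->adi_status & ADI_S_ONLINE) != 0;
}

/*
 * Return the disk's name, or NULL if the disk is not online.  The caller
 * handles NULL the same way it would handle a failed lookup, so
 * adi_dev is never touched for a disk that was skipped during
 * configuration.
 */
static const char *
disk_name_or_null(const struct disk_info *adi)
{
	if (!disk_is_online(adi))
		return NULL;
	return adi->adi_dev;
}
```

With this shape, a drive whose configuration was skipped (status 0,
adi_dev NULL) takes the early return instead of reaching the NULL
dereference.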
Initially it looked OK and I successfully tested the patch on VIA
machines (by setting up RAID, removing one of the RAID components
before next reboot, also deleting ). However, after analyzing the various
RAID components I noticed it may not work for Promise and Intel RAIDs.
Promise (https://nxr.netbsd.org/xref/src/sys/dev/ata/ata_raid_promise.c#194)
and Intel (https://nxr.netbsd.org/xref/src/sys/dev/ata/ata_raid_intel.c#278)
RAIDs may have the ADI_S_SPARE status, which removes the online flag. I
don't have these controllers, but I assume my patch would incorrectly
treat these drives as missing.
Other RAID types use only ADI_S_ONLINE | ADI_S_ASSIGNED, thus the patch
would work for them.
Given that three statuses are defined for adi_status
(https://nxr.netbsd.org/xref/src/sys/dev/ata/ata_raidvar.h#75), I
probably need to check whether any of the flags is set, i.e.
(adi_status & (ADI_S_ONLINE | ADI_S_ASSIGNED | ADI_S_SPARE)), instead
(https://netbsd.org/~andvar/ata_raid_fix2.diff).
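To make the difference between the two checks concrete, here is a small
standalone sketch in C. It is not taken from either diff: the function
names are invented for the example, the flag values are placeholders for
the definitions in ata_raidvar.h, and it assumes (as described above)
that a spare carries ADI_S_SPARE without ADI_S_ONLINE.

```c
#include <stdbool.h>

/* Placeholder values for illustration only; see ata_raidvar.h for the
 * real definitions. */
#define ADI_S_ONLINE   0x01
#define ADI_S_ASSIGNED 0x02
#define ADI_S_SPARE    0x04

/* First variant: a disk counts as present only if it is online. */
static bool
present_online_only(int adi_status)
{
	return (adi_status & ADI_S_ONLINE) != 0;
}

/* Second variant: a disk counts as present if any status flag is set. */
static bool
present_any_flag(int adi_status)
{
	return (adi_status &
	    (ADI_S_ONLINE | ADI_S_ASSIGNED | ADI_S_SPARE)) != 0;
}
```

Under these assumptions both variants agree for ADI_S_ONLINE |
ADI_S_ASSIGNED and for a status of 0, but only the second one accepts a
disk whose status is just ADI_S_SPARE.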
Another alternative is to check that adi->adi_dev is NULL, as Taylor
proposed in his analysis thread.
Please advise whether either of these two proposals would be good enough
to solve the issue, or whether something else should be considered.
Thank you.
Regards,
Andrius Varanavicius