Changing the return value of xxx_attach() from void to int. (take 2)

To: tech-kern%NetBSD.org@localhost
Subject: Changing the return value of xxx_attach() from void to int. (take 2)
From: Masanobu SAITOH <msaitoh%execsw.org@localhost>
Date: Wed, 20 Jul 2016 16:50:51 +0900

 Hello, all.

 Thanks all people who replied to my proposal. One of my fault is that
I didn't show any example code. I made patches. As jnemeth@ said, my
change is straightforward and the change should not cause any catastrophic
problem.



 In my last mail,

    c) We can add the code to fallback to lower priority device driver
      like ukphy(4), ugen(4) or others.


 This change is not easy (though it also not difficult.), so in this change
doesn't include it. The fallback code will be a future plan.


Current patches:

0) Caller side (sys/device.h, kern/subr_autoconf.c and config(1))

	http://www.netbsd.org/~msaitoh/attach-caller-20160720.dif

1) Device drivers

	http://www.netbsd.org/~msaitoh/attach-ddrv-20160720.dif

2) detach related changes of wm(4) for an example

	http://www.netbsd.org/~msaitoh/attach-wm-20160720.dif

 "./build.sh release" finishes without any error on x86 and the kernel works.

Details:

===================================================================
RCS file: /cvsroot/src/sys/sys/device.h,v
retrieving revision 1.149
diff -u -p -r1.149 device.h
--- sys/sys/device.h	19 Jun 2016 09:35:06 -0000	1.149
+++ sys/sys/device.h	20 Jul 2016 05:13:49 -0000
@@ -313,7 +313,7 @@ struct cfattach {
 	size_t	  ca_devsize;		/* size of dev data (for alloc) */
 	int	  ca_flags;		/* flags for driver allocation etc */
 	int	(*ca_match)(device_t, cfdata_t, void *);
-	void	(*ca_attach)(device_t, device_t, void *);
+	int	(*ca_attach)(device_t, device_t, void *);
 	int	(*ca_detach)(device_t, int);
 	int	(*ca_activate)(device_t, devact_t);
 	/* technically, the next 2 belong into "struct cfdriver" */
@@ -401,7 +401,7 @@ typedef int (*cfprint_t)(void *, const c
  * Pseudo-device attach information (function + number of pseudo-devs).
  */
 struct pdevinit {
-	void	(*pdev_attach)(int);
+	int	(*pdev_attach)(int);
 	int	pdev_count;
 };


 It changes the return value from void to int for both nomal and pseudo
device driver.

Index: sys/kern/subr_autoconf.c
===================================================================
RCS file: /cvsroot/src/sys/kern/subr_autoconf.c,v
retrieving revision 1.247
diff -u -p -r1.247 subr_autoconf.c
--- sys/kern/subr_autoconf.c	19 Jul 2016 07:44:03 -0000	1.247
+++ sys/kern/subr_autoconf.c	20 Jul 2016 05:13:48 -0000
@@ -1545,6 +1545,7 @@ config_attach_loc(device_t parent, cfdat
 	device_t dev;
 	struct cftable *ct;
 	const char *drvname;
+	int rv;

 	dev = config_devalloc(parent, cf, locs);
 	if (!dev)
@@ -1595,10 +1596,15 @@ config_attach_loc(device_t parent, cfdat
 	}
 	device_register(dev, aux);

-	/* Let userland know */
-	devmon_report_device(dev, true);
-
-	(*dev->dv_cfattach->ca_attach)(parent, dev, aux);
+	rv = (*dev->dv_cfattach->ca_attach)(parent, dev, aux);
+	if (rv == 0) {
+		/* Let userland know */
+		devmon_report_device(dev, true);
+	} else {
+		aprint_error("%s: %s: ERROR: failed in attach (rv = %d)\n",
+		    __func__, device_xname(dev), rv);
+		dev->dv_flags &= ~DVF_ACTIVE;
+	}

 	if (!device_pmf_is_registered(dev))
 		aprint_debug_dev(dev, "WARNING: power management not "


 This part is the main change. We can get three benefit from this change.

	1) We can call devmon_report_device() only when an ca_attach() call
	   succeeded.

	2) We can show a common error message when ca_attach() failed.
	   This change has some benefits.

		Some drivers don't call aprint_error*() and just return.
		We couldn't know fails from such type of drivers. With
		this change, we can know it for sure.

		Some drivers' error messages aren't clear. It's difficult
		for us to know that a ca_attach() call failed in
		(the middle of) the attach function. With this change,
		we can know it for sure.

		mailaprint_error*() increments aprint_error_count. It's
		the only difference between aprint_normal*() and
		aprint_error*(). We couldn't know what driver's attach
		function failed from the caller side. And, some drivers'
		attach function call aprint_error*() more than one time,
		so the value of aprint_error_count is not the same as
		the number of ca_attach()'s fail. With this change,
		we can know what driver's ca_attach() failed. And,
		if we add a new counter like aprint_error_count, we
		can know the exact number of ca_attach()'s fail. It
		would reduce the existence value of aprint_error*().
		Note that small number of device driver call
		aprint_error(), finish the attach function and the
		drivers work.

	3) This change make possible not to call ca_detach() for
	   devices that the ca_attach() failed. In this patch,
	   the DVF_ACTIVE flags is unset and config_detach() checks
	   it (see below).

@@ -1630,6 +1636,7 @@ device_t
 config_attach_pseudo(cfdata_t cf)
 {
 	device_t dev;
+	int rv;

 	dev = config_devalloc(ROOT, cf, NULL);
 	if (!dev)
@@ -1648,10 +1655,15 @@ config_attach_pseudo(cfdata_t cf)
 	device_register(dev, NULL);	/* like a root node */
 #endif

-	/* Let userland know */
-	devmon_report_device(dev, true);
-
-	(*dev->dv_cfattach->ca_attach)(ROOT, dev, NULL);
+	rv = (*dev->dv_cfattach->ca_attach)(ROOT, dev, NULL);
+	if (rv == 0) {
+		/* Let userland know */
+		devmon_report_device(dev, true);
+	} else {
+		aprint_normal("%s: %s: ERROR: failed in attach (rv = %d)\n",
+		    __func__, device_xname(dev), rv);
+		dev->dv_flags &= ~DVF_ACTIVE;
+	}

 	config_process_deferred(&deferred_config_queue, dev);
 	return dev;


 This is for pseudo device drivers.

@@ -1746,7 +1758,14 @@ config_detach(device_t dev, int flags)
 	    (dev->dv_flags & DVF_DETACH_SHUTDOWN) == 0) {
 		rv = EOPNOTSUPP;
 	} else if (ca->ca_detach != NULL) {
-		rv = (*ca->ca_detach)(dev, flags);
+		if ((dev->dv_flags & DVF_ACTIVE) != 0)
+			rv = (*ca->ca_detach)(dev, flags);
+		else {
+			aprint_verbose(
+				"%s: DVF_ACTIVE not set in dv_flags (%x)\n",
+				device_xname(dev), dev->dv_flags);
+			rv = EOPNOTSUPP;
+		}
 	} else
 		rv = EOPNOTSUPP;


 This is the change to not to call ca_detach() for devices that
the ca_attach() failed.


 Please see the following files for each device drivers:

	http://www.netbsd.org/~msaitoh/attach-ddrv-20160720.dif

For almost all device drivers, it's easy to change from "return;"
to "return 0;" or "return 1;" because almost all drivers have
aprint_error*() before calling "return;" We can check the reason
from the code and the string in aprint_error*().

 Note that some errors return with 0 instead of -1. I noticed
some drivers don't regard error of pmf_device_register() and some
function calls to add sysctl as error. I think it's reasonable,
so I modified drivers like as it does.


 Change for wm(4), please see:

	http://www.netbsd.org/~msaitoh/attach-wm-20160720.dif

As I wrote:

    b) The shutdown sequence calls xxx_shutdown() even if
       it's not really attached. It may causes panic. We have met
       this bugs many times.


 Usually, author of a device driver doesn't expect to call ca_deach()
when ca_attach() failed. They (include me) don't think ca_attach()
really fails. Our experience showed that it occurs. Some examples
are in bge_release_resources(it's called from both bge_attach() and
bge_detach()) and wm_detach():

	https://nxr.netbsd.org/xref/src/sys/dev/pci/if_bge.c#4132
	https://nxr.netbsd.org/xref/src/sys/dev/pci/if_wm.c#2595

Some detach function expects that a pointer is initialized. On some
cases the pointer is not initialized and it occurred NULL pointer
reference and panicked. bge(4) and wm(4) are not only the case of
this problem. And then we added NULL pointer check. For wm(4), it
occurred multiple times. To prevent from such type of unknown
problem, I added WM_F_ATTACHED flag. It's set at the end of the
attach function and it's checked at the begging of the detach
function. I think this type of flag should be done in the upper
layer to avoid code duplication and to avoid unknown bugs in
any drivers.

 It's OK to do this work on a new branch, but the lifetime of
the branch will be 1 week or less. Even if this change cause
a catastrophic problem (though it should not), it's easy to
prevent problem by stopping checks inconfig_attach_loc.
If I mistook "return -1" instead of "return 0", we can easily
know the problem from dmesg.

 It was discussion about the return value to be 0/other than 0,
true/false, Exxx or incremental number. At this time, I think
it's OK to be 0/1. When a new features is added, it'll become
the time to reconsider the return value. After committing this
change, we will modify some driver's attach and detach function
to release resources better than before. It's important for
drivers to make it kernel module. And also we will add error
checks that have not done before. At least almost all MII PHY
driver don't check errors and I'll modify to do it.

 Any comment and advice are welcomed.

--
-----------------------------------------------
                SAITOH Masanobu (msaitoh%execsw.org@localhost
                                 msaitoh%netbsd.org@localhost)

Follow-Ups:
- Re: Changing the return value of xxx_attach() from void to int. (take 2)
  - From: Michael van Elst

Prev by Date: Re: CVS commit: src/sys
Next by Date: Re: Changing the return value of xxx_attach() from void to int. (take 2)
Previous by Thread: An anomaly in config(1)
Next by Thread: Re: Changing the return value of xxx_attach() from void to int. (take 2)
Indexes:

Home | Main Index | Thread Index | Old Index