tech-misc archive


alx-0008r2 - Standardize strtoi(3) and strtou(3) from NetBSD



Hi!

Message-ID: <ywcuccsuwvssekftw6hlsv6umxbd6qoxdiwxb54vcyeb65vkfk@mpxoyhjuri22>

Here's revision r2.  The main changes are the addition of the wchar_t
variants and allowing status to be NULL.  I've also added mentions of
the two bug tickets currently open in NetBSD against these APIs.

To address comments saying that the error handling of this API is
suboptimal: I disagree.  I've already replied to those comments with
specific details.  An innovative error handling system would be worse
than this.  And the existing strtol(3) is certainly worse.


Cheers,
Alex

---
Name
	alx-0008r2 - Standardize strtoi(3) and strtou(3) from NetBSD

Principles
	-  Codify existing practice to address evident deficiencies.
	-  Enable secure programming

Category
	Standardize existing libc APIs

Author
	Alejandro Colomar <alx%kernel.org@localhost>

	Cc: <liba2i%lists.linux.dev@localhost>
	Cc: <libbsd%lists.freedesktop.org@localhost>
	Cc: <sc22wg14%open-std.org@localhost>
	Cc: <tech-misc%netbsd.org@localhost>
	Cc: Bruno Haible <bruno%clisp.org@localhost>
	Cc: christos <christos%netbsd.org@localhost>
	Cc: Đoàn Trần Công Danh <congdanhqx%gmail.com@localhost>
	Cc: Paul Eggert <eggert%cs.ucla.edu@localhost>
	Cc: Eli Schwartz <eschwartz93%gmail.com@localhost>
	Cc: Guillem Jover <guillem%hadrons.org@localhost>
	Cc: Iker Pedrosa <ipedrosa%redhat.com@localhost>
	Cc: Joseph Myers <josmyers%redhat.com@localhost>
	Cc: Michael Vetter <jubalh%iodoru.org@localhost>
	Cc: Robert Elz <kre%netbsd.org@localhost>
	Cc: <riastradh%NetBSD.org@localhost>
	Cc: Sam James <sam%gentoo.org@localhost>
	Cc: "Serge E. Hallyn" <serge%hallyn.com@localhost>
	Cc: наб <nabijaczleweli%nabijaczleweli.xyz@localhost>

History
	<https://www.alejandro-colomar.es/src/alx/alx/wg14/alx-0008.git/>

	r0 (2025-03-18):
	-  Initial draft.

	r1 (2025-03-18):
	-  Add 'Future directions' section.
	-  Fix typos.
	-  Move to <inttypes.h> (7.8 instead of 7.24).
	-  Add links to more NetBSD bug reports in 'See also'.
	-  Add link to n3183 (discussed in Strasbourg) in 'See also'.
	-  Specify the possible implementation-defined behaviors when
	   the base is a value not specified here.
	-  Specify that the range coercion is done with saturation.
	-  Specify that if min>max, these functions return an
	   unspecified value.
	-  Add ECANCELED, EINVAL, ENOTSUP to <errno.h> (7.5).
	-  Note that in the future we'll want to make this
	   const-generic.
	-  Add example.
	-  Add implementation.

	r2 (2025-03-20):
	-  Add Caveats section.
	-  Rename rstatus => status.
	-  Allow 'status' to be NULL.
	-  Add links to 'See also'.
	-  Add wchar_t variants.

Description
	The strtol(3) family of functions is too damn hard to use
	correctly.  Only a handful of programmers in the world really
	know how to use it correctly in all the corner cases, and even
	those need to be really careful to not make mistakes.
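
	To illustrate the point above, here is a minimal sketch of the
	checks a caller needs for one simple strtol(3) use case: parsing
	a whole string as a base-10 long.  The helper name parse_long()
	and its return convention are hypothetical, chosen only for this
	illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse the whole string as a base-10 long.
 * Returns 0 on success and -1 on failure; *out is written only on
 * success.  The caller's errno is preserved.
 */
int
parse_long(const char *s, long *out)
{
	int   saved, range_err;
	char  *end;
	long  n;

	assert(s != NULL && out != NULL);

	saved = errno;
	errno = 0;		/* overflow is reported only through errno */
	n = strtol(s, &end, 10);
	range_err = (errno == ERANGE);
	errno = saved;

	if (end == s)		/* no digits were consumed */
		return -1;
	if (range_err)		/* the value did not fit in a long */
		return -1;
	if (*end != '\0')	/* trailing characters, e.g. "12abc" */
		return -1;

	*out = n;
	return 0;
}
```

	Forgetting any one of the three checks, or forgetting to reset
	errno before the call, silently accepts bad input.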

	Several projects have tried to develop successor APIs, of which
	the only one that is generic enough to supersede them is
	strtoi/u(3) from NetBSD.

	Other APIs include OpenBSD's strtonum(3), but that API isn't
	generic, and cannot replace every use of strtol(3).  gnulib also
	has some attempts to improve the situation, but they're also
	not suitable for standardization.

	strtoi/u(3) originally had a bug, which shows how difficult it
	is to correctly wrap strto{i,u}max(3) (from the strtol(3)
	family).  That bug has been fixed, and after two years of
	research into string-to-numeric APIs, I can conclude that it is
	a net improvement over the existing APIs, and doesn't have any
	significant flaws.

	It is still not the ideal API in terms of type safety, and I'm
	working on a library that provides safer wrappers.  However,
	such a library would still benefit from having strtoi/u(3) in
	the standard library, by being able to wrap around it.  And user
	programs would immediately benefit from being able to replace
	strtol(3) et al. by strtoi/u(3).

	I have audited several projects which use strtol(3) et al., and
	they're full of bugs.  It's an API that we should really
	deprecate some day.

Prior art
	NetBSD provides strto{i,u}(3), which were introduced in
	NetBSD 7.

	libbsd ports these APIs to other POSIX systems.

	shadow-utils has its own implementation for internal use.

	Here's a possible implementation of strtoi(3):

		intmax_t
		strtoi(const char *s, char **restrict endp, int base,
		    intmax_t min, intmax_t max, int *restrict status)
		{
			int        e, st;
			char       *end;
			intmax_t   n;

			if (endp == NULL)
				endp = &end;
			if (status == NULL)
				status = &st;

			if (base != 0 && (base < 2 || base > 36)) {
				*endp = (char *) s;
				*status = EINVAL;
				return MAX(min, MIN(max, 0));
			}

			e = errno;
			errno = 0;

			n = strtoimax(s, endp, base);

			if (*endp == s)
				*status = ECANCELED;
			else if (errno == ERANGE || n < min || n > max)
				*status = ERANGE;
			else if (**endp != '\0')
				*status = ENOTSUP;
			else
				*status = 0;

			errno = e;

			return MAX(min, MIN(max, n));
		}

	strtou(3) can be implemented with the same exact code, replacing
	s/intmax_t/uintmax_t/, and s/strtoimax/strtoumax/.

    wchar_t
	NetBSD doesn't provide a wchar_t variant of these functions.

Caveats
    strtou_nn()
	strtou(3) leaves one issue of strtoul(3) unfixed: negative
	values are converted to huge positive values by modulo
	arithmetic, before performing range checks.

	For this, I personally use a wrapper, strtou_noneg(), which
	rejects any negative values.  It might be interesting to add it
	to the standard too, since most callers of strtou() really want
	to avoid negative values.  We could call it strtou_nn().  Here's
	a possible implementation:

		uintmax_t
		strtou_nn(const char *s, char **restrict endp, int base,
		    uintmax_t min, uintmax_t max, int *restrict status)
		{
			int  st;

			if (status == NULL)
				status = &st;
			if (strtoi(s, endp, base, 0, 1, status) == 0
			    && *status == ERANGE)
			{
				return min;
			}

			return strtou(s, endp, base, min, max, status);
		}

	Another possibility is to change strtou(3) to reject negative
	numbers.  I've proposed this possibility to NetBSD:
	<https://gnats.netbsd.org/cgi-bin/query-pr-single.pl?number=59198>
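
	The wrap-around described in this subsection comes straight from
	strtoumax(3), which negates the value in the return type when the
	subject sequence starts with a minus sign.  A minimal sketch that
	shows it (parse_umax() is a hypothetical helper name):

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>

/* Hypothetical helper: return exactly what strtoumax(3) returns. */
uintmax_t
parse_umax(const char *s)
{
	assert(s != NULL);

	/* A leading '-' is accepted; the result is negated modulo 2^N. */
	return strtoumax(s, NULL, 10);
}
```

	With this helper, parse_umax("-1") yields UINTMAX_MAX and errno
	is left untouched, so a caller that only checks for ERANGE
	cannot tell that the input was negative.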

    status
	The status parameter can be NULL in the NetBSD implementation,
	but from what I could find, there are no users of this feature.
	It would make sense to have a narrower contract where it cannot
	be NULL.  I've proposed this to NetBSD:
	<https://gnats.netbsd.org/cgi-bin/query-pr-single.pl?number=59199>

Future directions
    atoi(3), scanf(3)
	The atoi(3) family of functions has unnecessary UB.  It could be
	removed by redefining it in terms of this API:

		int
		atoi(const char *s)
		{
			int  n, e;

			n = strtoi(s, NULL, 10, _Minof(n), _Maxof(n), &e);
			errno = e ?: errno;

			return n;
		}

	Which would make atoi(3) behave just like one would expect.
	Then we could define scanf(3)'s %d et al. in terms of atoi(3).

    wchar_t
	It could be interesting to add a wchar-based variant of these
	APIs.

    locale_t
	It could be interesting to add a variant of these APIs that
	accepts a locale_t parameter instead of using the current
	locale.  Those APIs exist in NetBSD as strtoi_l(), strtou_l().

    _Generic
	Once something like Chris's n3510 (2025-02-27, "Enhanced type
	variance (v2)") is accepted into C2y, we could transform these
	functions to use QChar, thus transforming them into
	const-generic functions, as with the strtol(3) family of
	functions.

See also
	<https://gnats.netbsd.org/cgi-bin/query-pr-single.pl?number=57828>
	<https://gnats.netbsd.org/cgi-bin/query-pr-single.pl?number=58453>
	<https://gnats.netbsd.org/cgi-bin/query-pr-single.pl?number=58461>
	<https://gnats.netbsd.org/cgi-bin/query-pr-single.pl?number=59198>
	<https://gnats.netbsd.org/cgi-bin/query-pr-single.pl?number=59199>
	<https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3183.pdf>

Proposed wording
	Based on N3467.

    7.5  Errors <errno.h>
	@@ p2
	 The macros are
	+	ECANCELED
		EDOM
	+	EINVAL
		EILSEQ
	+	ENOTSUP
		ERANGE

    7.8.3  Functions for greatest-width integer types
	New section _before_ 7.8.3.3 (The strtoimax and strtoumax functions).

	While all this section is new, some text is pasted verbatim from
	7.24.2.8.  I'll write that text as if it was already existing
	in the diff below.

	I also renamed the parameters of strtol(3):
	nptr => s	Because it's a string, not a pointer to a number.
	endptr => endp	It's shorter and just as readable (if not more).

	@@
	+7.8.2.*  The <b>strtoi</b> and <b>strtou</b> functions
	+
	+Synopsis
	+1	#include <inttypes.h>
	+	intmax_t strtoi(const char *restrict s, char **restrict endp,
	+	    int base, intmax_t min, intmax_t max, int *status);
	+	uintmax_t strtou(const char *restrict s, char **restrict endp,
	+	    int base, uintmax_t min, uintmax_t max, int *status);
	+
	+Description
	+2	The <b>strtoi</b> and <b>strtou</b> functions
		convert the initial portion of
		the string pointed to by <tt>s</tt>
	+	to <b>intmax_t</b> and <b>uintmax_t</b>,
		respectively.
		First,
		they decompose the input string into three parts:
		an initial, possibly empty, sequence of white-space characters,
		a subject sequence resembling an integer
		represented in some radix determined by the value of <tt>base</tt>,
		and a final string of one or more unrecognized characters,
		including the terminating null character of the input string.
	+	Then,
		they attempt to convert the subject sequence to an integer.
	+	Then,
	+	they coerce with saturation
	+	the integer into the range [min, max].
	+	Finally,
		they return the result.

	Paste p3, p4, p5, p6 from 7.24.2.8, replacing the function and
	type names as appropriate.

	@@
	+7	If the value of <tt>base</tt> is different from
	+	the values specified in the preceding paragraphs,
	+	it is implementation-defined
	+	whether these functions successfully convert the value
	+	and in which manner.

	The above paragraph ensures that this function has no
	input-controlled UB.  strtol(s, NULL, base) with a
	user-controlled base can result in UB, and thus vulnerabilities.
	It is trivial to report an error, so let's do it.  This function
	is heavy enough that optimizing this is not worth it.  Even
	POSIX does this for strtol(3).

	@@
	 8	If the subject sequence is empty
		or does not have the expected form,
	+	or the value of <tt>base</tt> is not supported,
		no conversion is performed;
		the value of <tt>s</tt>
		is stored in the object pointed to by <tt>endp</tt>,
		provided that <tt>endp</tt> is not a null pointer.

	The above paragraph ensures that *endp can be read after a call
	to these functions.  strtol(3) doesn't provide enough guarantees
	to be able to reliably read it, even in POSIX, and it's hard to
	portably write code that calls it and can inspect *endp after
	the call without UB.

	@@
	 Returns
	+10	The <b>strtoi</b> and <b>strtou</b> functions
		return the converted value, if any.
		If no conversion could be performed,
	+	zero is coerced with saturation into the range,
	+	and then returned.

	The paragraph above doesn't mention the range of representable
	values (unlike 7.24.2.8) because that's already covered by the
	range coercion specified in p2 above.
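
	As a minimal sketch of what "coerced with saturation" means in
	the wording above (saturate() is a hypothetical helper name, not
	part of the proposed API):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: coerce n into [min, max] with saturation. */
intmax_t
saturate(intmax_t n, intmax_t min, intmax_t max)
{
	/* The proposal leaves the result unspecified when min > max. */
	assert(min <= max);

	if (n < min)
		return min;
	if (n > max)
		return max;
	return n;
}
```

	When no conversion is performed, the returned value is what this
	helper gives for n = 0: for the range [3, 7] that is 3, and for
	the range [-5, -2] it is -2.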

	@@
	+11	If <tt>min > max</tt>,
	+	these functions return an unspecified value.

	The above paragraph covers the case where min>max, where the
	conversion with saturation into the range cannot do anything
	meaningful.  The error is still specified as ERANGE.

	@@
	+Errors
	+12	These functions do not set <b>errno</b>.
	+	Instead,
	+	and provided that <tt>status</tt> is not a null pointer,
	+	they set the object pointed to by <tt>status</tt>
	+	to an error code,
	+	or to zero on success.
	+
	+13	-- EINVAL	The value in <tt>base</tt> is not supported.
	+	-- ECANCELED	The given string did not contain
	+			any characters that were converted.
	+	-- ERANGE	The converted value was out of range
	+			and has been coerced,
	+			or the range was invalid (e.g., min > max).
	+	-- ENOTSUP	The given string contained characters
	+			that did not get converted.
	+
	+14	If various errors happen in the same call,
	+	the first one listed here is reported.

	The paragraph above is important to differentiate the following:
	strtoi("7z", &end, 0, 3, 7, &status);
	strtoi("42z", &end, 0, 3, 7, &status);
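
	The difference between the two calls above can be checked with a
	self-contained sketch that follows the implementation shown in
	'Prior art'.  It is named sketch_strtoi() (a hypothetical name)
	to avoid clashing with a libc that already provides strtoi(3),
	and it assumes that <errno.h> defines ECANCELED and ENOTSUP, as
	POSIX systems do.

```c
#include <assert.h>
#include <errno.h>
#include <inttypes.h>
#include <stddef.h>

/* Sketch only; mirrors the 'Prior art' implementation. */
intmax_t
sketch_strtoi(const char *s, char **endp, int base,
    intmax_t min, intmax_t max, int *status)
{
	int       e, st;
	char      *end;
	intmax_t  n;

	assert(s != NULL);

	if (endp == NULL)
		endp = &end;
	if (status == NULL)
		status = &st;

	if (base != 0 && (base < 2 || base > 36)) {
		*endp = (char *) s;
		*status = EINVAL;
		n = 0;
	} else {
		e = errno;	/* preserve the caller's errno */
		errno = 0;
		n = strtoimax(s, endp, base);

		/* Errors are tested in the order of the 'Errors' list. */
		if (*endp == s)
			*status = ECANCELED;
		else if (errno == ERANGE || n < min || n > max)
			*status = ERANGE;
		else if (**endp != '\0')
			*status = ENOTSUP;
		else
			*status = 0;

		errno = e;
	}

	/* Coerce into [min, max] with saturation. */
	if (n < min)
		return min;
	if (n > max)
		return max;
	return n;
}
```

	Both calls return 7 and leave end pointing at "z", but the first
	one sets status to ENOTSUP (7 was in range, followed by trailing
	text), while the second sets it to ERANGE (42 was clamped to 7).
	Without the ordering rule, the caller could not tell whether the
	returned 7 was the parsed value or the saturated one.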

	@@
	+15	EXAMPLE 1
	+	The following is an example of
	+	using these functions to parse a number
	+	and the string that follows.
	+
	+		int       err;
	+		char      *end;
	+		intmax_t  n, min = 5, max = 50;
	+
	+		n = strtoi(" 42 kg", &end, 10, min, max, &err);
	+		if (err != 0) {
	+			if (err == EINVAL || err == ECANCELED) {
	+				fprintf(stderr, "%s\n", strerror(err));
	+				exit(EXIT_FAILURE);
	+			}
	+			if (err == ERANGE && n == min)
	+				puts("Too light");
	+			if (err == ERANGE && n == max)
	+				puts("Too heavy");
	+		}
	+		printf("Quantity: %jd\n", n);
	+		if (err == ENOTSUP)
	+			printf("Units: %s\n", end + strspn(end, " "));
	+		else
	+			puts("Unitless?");

    7.32.4.2.1  Wide string numeric conversion functions :: General
	@@ p1
	 This subclause describes
	 wide string analogs of
	-the <b>strtod</b> family of functions (...).
	+the <b>strtoi</b> and <b>strtod</b> families of functions (...).

	Note to the editor: make sure to update the (...) correctly in
	the text above.

    7.32.4.2  Wide string numeric conversion functions
	New section after 7.32.4.2.1 (General).

	+7.32.4.2.1  The <b>wcstoi</b> and <b>wcstou</b> functions
	+
	+Synopsis
	+1	#include <wchar.h>
	+	intmax_t wcstoi(const wchar_t *restrict s, wchar_t **restrict endp,
	+	    int base, intmax_t min, intmax_t max, int *status);
	+	uintmax_t wcstou(const wchar_t *restrict s, wchar_t **restrict endp,
	+	    int base, uintmax_t min, uintmax_t max, int *status);
	+
	+Description
	+2	The <b>wcstoi</b> and <b>wcstou</b> functions
	+	are equivalent to
	+	<b>strtoi</b> and <b>strtou</b>,
	+	except that these handle wide strings.


-- 
<https://www.alejandro-colomar.es/>
