Re: alx-0008 - Standardize strtoi(3) and strtou(3) from NetBSD

To: Bruno Haible <bruno%clisp.org@localhost>
Subject: Re: alx-0008 - Standardize strtoi(3) and strtou(3) from NetBSD
From: Alejandro Colomar <alx%kernel.org@localhost>
Date: Tue, 18 Mar 2025 23:43:46 +0100

Hi Bruno,

On Tue, Mar 18, 2025 at 10:53:09PM +0100, Bruno Haible wrote:
> > Below is a draft of a proposal for standardization of strtoi/u(3) from
> > NetBSD in ISO C2y.
> 
> First of all: I like your initiative, and I moderately like this proposal.

Thanks!

> > 	The strtol(3) family of functions is do damn hard to use
> > 	correctly.  Only a handful of programmers in the world really
> > 	know how to use it correctly in all the corner cases, and even
> > 	those need to be really careful to not make mistakes.
> 
> It would be useful to list the mistakes that are being made most frequently;
> so as to verify that the proposed strtoi / strtou functions don't tend
> to provoke the same mistakes.

Ughhh, I don't think I can come up with a comprehensive list; I will
probably forget some.  They are spread on the mailing lists, for anyone
very interested.  I noted some of them intermingled in the proposal, as
comments below the proposed wording paragraphs.

The groff@ mailing list also has a large number of bugs I found in calls
to strtol(3), some of which probably still remain there, because it
still uses strtol(3).

I think covering the API in "Police, do not trespass" bands would be
easier.  :-)

> (I'd guess that one of the frequent mistakes
> is that when the number is not expected to occupy the entire string,

Do you mean "not expected to occupy the entire string", or "expected to
not occupy the entire string"?  They are different cases.

> the success test after (errno = 0, strtol (...)) is
>     endptr > nptr && errno == 0

You need to dereference *endptr.  And the test actually depends on your
response to the above.

> and programmers tend to forget one of the two conditions.)

Assuming that you know that the base is valid beforehand, and that your
response yo my question is the former, and that you dereference endptr,
yes that's the correct check, AFAICS (it is painful to validate these
things; I may make mistakes now).  And yes, that's one of the common
mistakes.

Another is forgetting to validate the base, which makes your test UB,
because *endptr might be uninitialized.  It ain't fun.  :|

> > 	+Synopsis
> > 	+1	#include <stdlib.h>
> > 	+	intmax_t strtoi(const char *restrict s, char **restrict endp, int base,
> > 	+	    intmax_t min, intmax_t max, int *rstatus);
> > 	+	uintmax_t strtou(const char *restrict s, char **restrict endp, int base,
> > 	+	    uintmax_t min, uintmax_t max, int *rstatus);
> 
> Probably it will be an impediment to adoption that these functions work
> on [u]intmax_t, which is 64-bits or 128-bits integers, which seems overkill
> when people want to parse, say, a port number in the range 0..65535.
> 
> To address this adoption problem, how about changing these function to
> generic functions (in the sense of <tgmath.h>)? In such a way that
>     strtoi (n, &end, base, LONG_MIN, LONG_MAX, &status)
> is known to return a 'long' rather than 'intmax_t', and
>     strtoi (n, &end, base, INT_MIN, INT_MAX, &status)

The first parameter should be a string, not a number.  I think the name
nptr is a historic mistake.  The number is returned.

> is known to return an 'int' rather than 'intmax_t'.

How would you decide when the type of min and max is different?
The problem of this API is not having the number as an argument.

I have designed a better API, through a type-generic macro:

	int
	a2i(typename T, T *restrict n, QChar *s,
	    QChar *_Optional *restrict endp, int base,
	    T min, T max);

to be used as:

	if (a2i(int, &n, "42 c", &end, 0, -1000, 1000) == -1 && errno != ENOTSUP)
		err(1, "a2i");

However, I was wondering lately about changing it slightly:

	QChar *
	funcname(typename T, T *restrict n, QChar *restrict s, int base,
	    T min, T max);

to be used as:

	errno = 0;
	end = funcname(int, &n, "42 c", 0, -1000, 1000);
	if (errno != 0 && errno != ENOTSUP)
		err(1, "funcname");

It has the benefit of being simpler regarding the 'restrict' pseudo-
qualifier, and it also means that the implementation is probably
simpler.  This one would need to guarantee not clobbering errno, though,
since there's no error code.  I'm still undecided.  For now, these APIs
are under experimentation in shadow-utils.  (But feedback is welcome!)

> If the standard does NOT say that these functions are generic, it would
> be harder for an implementation to optimize invocations of these
> functions for narrower types: I don't see how it could be done without
> explicit compiler support.

I think a function that parses numbers is going to be slow enough that
these micro-optimizations should be unimportant.

When/if we have a2i() in the future, we could get better optimizations.

> > 	+	Instead, they set the object pointed to by <tt>rstatus</tt>
> > 	+	to an error code,
> > 	+	or to zero on success.
> > 	+
> > 	+12	-- EINVAL	The value in <tt>base</tt> is not supported.
> > 	+	-- ECANCELED	The given string did not contain
> > 	+			any characters that were converted.
> > 	+	-- ERANGE	The converted value was out of range
> > 	+			and has been coerced,
> > 	+			or the range was invalid (e.g., min > max).
> > 	+	-- ENOTSUP	The given string contained characters
> > 	+			that did not get converted.
> > 	+
> > 	+13	If various errors happen in the same call,
> > 	+	the first one listed here is reported.
> 
> It would be useful to show how a success test looks like, after
>     strtoi (s, &end, base, min, max, &status)
> for each of the four frequent use-cases:
>   -a. expect to parse the initial portion of the string, no coercion,
>   -b. expect to parse the initial portion of the string, silent coercion,
>   -c. expect to parse the entire string, no coercion,
>   -d. expect to parse the entire string, silent coercion.
> 
> AFAICS, the success tests are:
>   -a. status == 0 || status == ENOTSUP

Correct.

>   -b. status == 0 || status == ENOTSUP || status == ERANGE

Correct (but most likely a bug).

>   -c. status == 0

Correct.

>   -d. status == 0 || (status == ERANGE && end > s && *end == '\0')

You don't need end>s, because that would preclude ERANGE.

	status == 0 || (status == ERANGE && end == '\0')

Aaand, most likely a bug.

BTW, remember that 'endp' is the name of the parameter because you pass
strtoi(..., &end, ...).  The name of your local variable should be end.

> 
> The success test in case d. is so complicated that, for my feeling, the goal
> to avoid programmer mistakes is not being met.

Cases b and d are not real, IMO.  I have never seen code where that is
wanted, AFAIR, and I analyzed the entire Debian and NetBSD code bases
looking precisely for that usage.  That's what allowed us to fix the bug
in strtoi/u(3) back then.

> I would therefore propose to change the status value to a bit mask, so that
> the error conditions "The converted value was out of range and has been
> coerced" and "The given string contains characters that did not get converted"
> can be both returned together, without conflicting.

Because it is theoretical conditions that a real program never wants,
let's not do that.

> And, while at it, the error condition "min > max" is an error that is
> independent of the given string contents; I would better see it mapped to
> EINVAL rather than ERANGE.

I think it is ERANGE because it should never happen in good programs.
The base is something that depends on the system supporting a certain
base or not, so it makes sense to query it.  But the range is something
specified by the user, and which would never make any sense to pass
twisted like that, so the API just does the least harmful thing.

Consider the following call:

	strtoi(s, NULL, 42, min, max, &err)

which will return EINVAL.  Does the system support the base or not?  If
the only condition that will trigger EINVAL is an invalid base, then I
have the answer.  Otherwise, I need to check for the validity of
min,max.  I think it's better if the function reports ERANGE for an
invalid range, and then it's my fault if I don't validate my range.

TL;DR:  The base can only be validated by the system, while the range
can (and should) be validated by the caller.

Have a lovely day!
Alex

-- 
<https://www.alejandro-colomar.es/>

Attachment: signature.asc
Description: PGP signature

Follow-Ups:
- Re: alx-0008 - Standardize strtoi(3) and strtou(3) from NetBSD
  - From: Thorsten Glaser
- Re: alx-0008 - Standardize strtoi(3) and strtou(3) from NetBSD
  - From: Bruno Haible

References:
- alx-0008 - Standardize strtoi(3) and strtou(3) from NetBSD
  - From: Alejandro Colomar
- Re: alx-0008 - Standardize strtoi(3) and strtou(3) from NetBSD
  - From: Bruno Haible

Prev by Date: Re: [SC22WG14.29912] alx-0008 - Standardize strtoi(3) and strtou(3) from NetBSD
Next by Date: Re: [SC22WG14.29912] alx-0008 - Standardize strtoi(3) and strtou(3) from NetBSD
Previous by Thread: Re: alx-0008 - Standardize strtoi(3) and strtou(3) from NetBSD
Next by Thread: Re: alx-0008 - Standardize strtoi(3) and strtou(3) from NetBSD
Indexes:

Home | Main Index | Thread Index | Old Index