Hi Bruno, On Tue, Mar 18, 2025 at 10:53:09PM +0100, Bruno Haible wrote: > > Below is a draft of a proposal for standardization of strtoi/u(3) from > > NetBSD in ISO C2y. > > First of all: I like your initiative, and I moderately like this proposal. Thanks! > > The strtol(3) family of functions is do damn hard to use > > correctly. Only a handful of programmers in the world really > > know how to use it correctly in all the corner cases, and even > > those need to be really careful to not make mistakes. > > It would be useful to list the mistakes that are being made most frequently; > so as to verify that the proposed strtoi / strtou functions don't tend > to provoke the same mistakes. Ughhh, I don't think I can come up with a comprehensive list; I will probably forget some. They are spread on the mailing lists, for anyone very interested. I noted some of them intermingled in the proposal, as comments below the proposed wording paragraphs. The groff@ mailing list also has a large number of bugs I found in calls to strtol(3), some of which probably still remain there, because it still uses strtol(3). I think covering the API in "Police, do not trespass" bands would be easier. :-) > (I'd guess that one of the frequent mistakes > is that when the number is not expected to occupy the entire string, Do you mean "not expected to occupy the entire string", or "expected to not occupy the entire string"? They are different cases. > the success test after (errno = 0, strtol (...)) is > endptr > nptr && errno == 0 You need to dereference *endptr. And the test actually depends on your response to the above. > and programmers tend to forget one of the two conditions.) Assuming that you know that the base is valid beforehand, and that your response yo my question is the former, and that you dereference endptr, yes that's the correct check, AFAICS (it is painful to validate these things; I may make mistakes now). And yes, that's one of the common mistakes. Another is forgetting to validate the base, which makes your test UB, because *endptr might be uninitialized. It ain't fun. :| > > +Synopsis > > +1 #include <stdlib.h> > > + intmax_t strtoi(const char *restrict s, char **restrict endp, int base, > > + intmax_t min, intmax_t max, int *rstatus); > > + uintmax_t strtou(const char *restrict s, char **restrict endp, int base, > > + uintmax_t min, uintmax_t max, int *rstatus); > > Probably it will be an impediment to adoption that these functions work > on [u]intmax_t, which is 64-bits or 128-bits integers, which seems overkill > when people want to parse, say, a port number in the range 0..65535. > > To address this adoption problem, how about changing these function to > generic functions (in the sense of <tgmath.h>)? In such a way that > strtoi (n, &end, base, LONG_MIN, LONG_MAX, &status) > is known to return a 'long' rather than 'intmax_t', and > strtoi (n, &end, base, INT_MIN, INT_MAX, &status) The first parameter should be a string, not a number. I think the name nptr is a historic mistake. The number is returned. > is known to return an 'int' rather than 'intmax_t'. How would you decide when the type of min and max is different? The problem of this API is not having the number as an argument. I have designed a better API, through a type-generic macro: int a2i(typename T, T *restrict n, QChar *s, QChar *_Optional *restrict endp, int base, T min, T max); to be used as: if (a2i(int, &n, "42 c", &end, 0, -1000, 1000) == -1 && errno != ENOTSUP) err(1, "a2i"); However, I was wondering lately about changing it slightly: QChar * funcname(typename T, T *restrict n, QChar *restrict s, int base, T min, T max); to be used as: errno = 0; end = funcname(int, &n, "42 c", 0, -1000, 1000); if (errno != 0 && errno != ENOTSUP) err(1, "funcname"); It has the benefit of being simpler regarding the 'restrict' pseudo- qualifier, and it also means that the implementation is probably simpler. This one would need to guarantee not clobbering errno, though, since there's no error code. I'm still undecided. For now, these APIs are under experimentation in shadow-utils. (But feedback is welcome!) > If the standard does NOT say that these functions are generic, it would > be harder for an implementation to optimize invocations of these > functions for narrower types: I don't see how it could be done without > explicit compiler support. I think a function that parses numbers is going to be slow enough that these micro-optimizations should be unimportant. When/if we have a2i() in the future, we could get better optimizations. > > + Instead, they set the object pointed to by <tt>rstatus</tt> > > + to an error code, > > + or to zero on success. > > + > > +12 -- EINVAL The value in <tt>base</tt> is not supported. > > + -- ECANCELED The given string did not contain > > + any characters that were converted. > > + -- ERANGE The converted value was out of range > > + and has been coerced, > > + or the range was invalid (e.g., min > max). > > + -- ENOTSUP The given string contained characters > > + that did not get converted. > > + > > +13 If various errors happen in the same call, > > + the first one listed here is reported. > > It would be useful to show how a success test looks like, after > strtoi (s, &end, base, min, max, &status) > for each of the four frequent use-cases: > -a. expect to parse the initial portion of the string, no coercion, > -b. expect to parse the initial portion of the string, silent coercion, > -c. expect to parse the entire string, no coercion, > -d. expect to parse the entire string, silent coercion. > > AFAICS, the success tests are: > -a. status == 0 || status == ENOTSUP Correct. > -b. status == 0 || status == ENOTSUP || status == ERANGE Correct (but most likely a bug). > -c. status == 0 Correct. > -d. status == 0 || (status == ERANGE && end > s && *end == '\0') You don't need end>s, because that would preclude ERANGE. status == 0 || (status == ERANGE && end == '\0') Aaand, most likely a bug. BTW, remember that 'endp' is the name of the parameter because you pass strtoi(..., &end, ...). The name of your local variable should be end. > > The success test in case d. is so complicated that, for my feeling, the goal > to avoid programmer mistakes is not being met. Cases b and d are not real, IMO. I have never seen code where that is wanted, AFAIR, and I analyzed the entire Debian and NetBSD code bases looking precisely for that usage. That's what allowed us to fix the bug in strtoi/u(3) back then. > I would therefore propose to change the status value to a bit mask, so that > the error conditions "The converted value was out of range and has been > coerced" and "The given string contains characters that did not get converted" > can be both returned together, without conflicting. Because it is theoretical conditions that a real program never wants, let's not do that. > And, while at it, the error condition "min > max" is an error that is > independent of the given string contents; I would better see it mapped to > EINVAL rather than ERANGE. I think it is ERANGE because it should never happen in good programs. The base is something that depends on the system supporting a certain base or not, so it makes sense to query it. But the range is something specified by the user, and which would never make any sense to pass twisted like that, so the API just does the least harmful thing. Consider the following call: strtoi(s, NULL, 42, min, max, &err) which will return EINVAL. Does the system support the base or not? If the only condition that will trigger EINVAL is an invalid base, then I have the answer. Otherwise, I need to check for the validity of min,max. I think it's better if the function reports ERANGE for an invalid range, and then it's my fault if I don't validate my range. TL;DR: The base can only be validated by the system, while the range can (and should) be validated by the caller. Have a lovely day! Alex -- <https://www.alejandro-colomar.es/>
Attachment:
signature.asc
Description: PGP signature