tech-misc archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

Re: alx-0008 - Standardize strtoi(3) and strtou(3) from NetBSD



On Wed, Mar 19, 2025 at 07:48:59PM +0100, Alejandro Colomar wrote:
> On Wed, Mar 19, 2025 at 04:26:11PM +0100, Alejandro Colomar wrote:
> > Hi Bruno,
> > 
> > On Wed, Mar 19, 2025 at 01:15:30AM +0100, Bruno Haible wrote:
> > > Alejandro Colomar wrote:
> > > > > It would be useful to show how a success test looks like, after
> > > > >     strtoi (s, &end, base, min, max, &status)
> > > > > for each of the four frequent use-cases:
> > > > >   -a. expect to parse the initial portion of the string, no coercion,
> > > > >   -b. expect to parse the initial portion of the string, silent coercion,
> > > > >   -c. expect to parse the entire string, no coercion,
> > > > >   -d. expect to parse the entire string, silent coercion.
> > > > > 
> > > > > AFAICS, the success tests are:
> > > > >   -a. status == 0 || status == ENOTSUP
> > > > 
> > > > Correct.
> > > > 
> > > > >   -b. status == 0 || status == ENOTSUP || status == ERANGE
> > > > 
> > > > Correct (but most likely a bug).
> > 
> > Actually, now I remember that status can be NULL, in which case it's not
> > reported.  This is a case where you could check for errors with a
> > simpler expression:
> > 
> > 	end != str
> > 
> > but (status == 0 || status == ENOTSUP || status == ERANGE) is still a
> > reasnoable one.
> > 
> > like you can do with strtol(3), but with portability guarantess
> > regarding EINVAL, because strtoi(3bsd) always writes *endp (if nonnull).
> > 
> > I need to update the specification to mention that status can be NULL.
> > 
> > > > 
> > > > >   -c. status == 0
> > > > 
> > > > Correct.
> > > > 
> > > > >   -d. status == 0 || (status == ERANGE && end > s && *end == '\0')
> > > > 
> > > > You don't need end>s, because that would preclude ERANGE.
> > > > 
> > > > 	status == 0 || (status == ERANGE && end == '\0')
> > > > 
> > > > Aaand, most likely a bug.
> > > 
> > > Cases b. and d. are not bugs. Often, the programmer knows that treating
> > > a value > ULONG_MAX is equivalent to treating the value ULONG_MAX. These
> > > are *normal* uses of strto[u]l[l]. Often it is the programmer's intent
> > > that the values "4294967297" and "4294967295" produce the same behaviour
> > > (the same error message, for example).
> > 
> > If you want ULONG_MAX + 1 to be treated like ULONG_MAX, and both
> > result in an error, then you should probably clamp at ULONG_MAX - 1,
> > and consider anything above an error.
> > 
> > > It is for these cases that your specification contains the clamping /
> > > coercion behaviour.
> > > 
> > > Now, when you look at the table of success tests:
> > > 
> > >    -a. status == 0 || status == ENOTSUP
> > >    -b. status == 0 || status == ENOTSUP || status == ERANGE
> > >    -c. status == 0
> > >    -d. status == 0 || (status == ERANGE && *end == '\0')
> > > 
> > > it is immediately clear that the status return convention is ill-designed,
> > > because the returned 'status' is not the only thing a programmer has to test
> > > after calling the function.
> > > 
> > > > Cases b and d are not real, IMO.  I have never seen code where that is
> > > > wanted, AFAIR, and I analyzed the entire Debian and NetBSD code bases
> > > > looking precisely for that usage.
> > > 
> > > I disagree.
> > 
> > I didn't find any occurence of 'd' in calls to strtoi(3)/strtou(3).
> > I didn't analyze calls to strtol(3) et al.
> > 
> > > Any use of strtoul that does not test errno wants overflow
> > > to be mapped to ULONG_MAX, that is, is in case b. or d.
> > > Just looking in gnulib and gettext, I find already 6 occurrences:
> > >   gnulib/lib/getaddrinfo.c:299
> > 
> > lib/getaddrinfo.c-297-          if (!(*servname >= '0' && *servname <= '9'))
> > lib/getaddrinfo.c-298-            return EAI_NONAME;
> > lib/getaddrinfo.c:299:          port = strtoul (servname, &c, 10);
> > lib/getaddrinfo.c-300-          if (*c || port > 0xffff)
> > lib/getaddrinfo.c-301-            return EAI_NONAME;
> > lib/getaddrinfo.c-302-          port = htons (port);
> > 
> > You could remove the preceding conditional if you don't want to avoid
> > leading whitespace.  You could merge that into the strtou(3) call, which
> > would report ECANCELED for non-numeric input).  Except that a negative
> > number is silently converted to a positive large value.  This is why I
> > use a wrapper function strtou_noneg() that rejects negative numbers.
> > 
> > You could rewrite it as:
> > 
> > 	port = strtou_noneg(servname, NULL, 10, 0, UINT16_MAX, &status);
> > 	if (status != 0)
> > 		return EAI_NONAME;
> > 	port = htons(port);
> > 
> > where strtou_noneg() is:
> > 
> > 	uintmax_t
> > 	strtou_noneg(const char *s, char **restrict endp, int base,
> > 	    uintmax_t min, uintmax_t max, int *restrict status)
> > 	{
> > 		int  st;
> > 
> > 		if (status == NULL)
> > 			status = &st;
> > 		if (strtoi(s, endp, base, 0, 1, status) == 0 && *status == ERANGE)
> > 			return min;
> > 
> > 		return strtou(s, endp, base, min, max, status);
> > 	}
> > 
> > I think this is not one case where you want silent saturation.  You're
> > indeed doing range checks [0, UINT16_MAX].
> > 
> > >   gnulib/lib/nproc.c:402
> > 
> > lib/nproc.c-383-/* Parse OMP environment variables without dependence on OMP.
> > lib/nproc.c-384-   Return 0 for invalid values.  */
> > lib/nproc.c-385-static unsigned long int
> > lib/nproc.c:386:parse_omp_threads (char const* threads)
> > lib/nproc.c-387-{
> > 
> > ...
> > 
> > lib/nproc.c-398-  /* Convert it from positive decimal to 'unsigned long'.  */
> > lib/nproc.c-399-  if (c_isdigit (*threads))
> > lib/nproc.c-400-    {
> > lib/nproc.c-401-      char *endptr = NULL;
> > lib/nproc.c:402:      unsigned long int value = strtoul (threads, &endptr, 10);
> > lib/nproc.c-403-
> > lib/nproc.c-404-      if (endptr != NULL)
> > lib/nproc.c-405-        {
> > lib/nproc.c-406-          while (*endptr != '\0' && c_isspace (*endptr))
> > lib/nproc.c-407-            endptr++;
> > lib/nproc.c-408-          if (*endptr == '\0')
> > lib/nproc.c-409-            return value;
> > lib/nproc.c-410-          /* Also accept the first value in a nesting level,
> > lib/nproc.c-411-             since we can't determine the nesting level from env vars.  */
> > lib/nproc.c-412-          else if (*endptr == ',')
> > lib/nproc.c-413-            return value;
> > lib/nproc.c-414-        }
> > lib/nproc.c-415-    }
> > 
> > First of all, the endptr!=NULL test seems misplaced.  The only way that
> > could be true is if the base is unsupported, and 10 is necessarily
> 
> s/true/equal/
> 
> > supported.  You should remove the initialization '= NULL', and the
> > check, since both are dead code, IIRC.  That's one of the things you
> > don't need to care with strtoi(3), because it _always_ sets *endp.
> > 
> > And you could probably remove the isdigit test by calling
> > strtou_noneg().
> > 
> > This could be something like this (fixing the bugs reported above):
> > 
> > 	char    *end;
> > 	u_long  value;
> > 
> > 	value = strtou_noneg(threads, &end, 10, 0, ULONG_MAX, NULL);
> > 	if (end != threads) {
> > 		end += strspn(end, " \t\n");
> > 		if (streq(end, "")
> > 			return value;
> > 		if (strprefix(end, ","))
> > 			return value;
> > 	}
> > 
> > 
> > This is one case where you seem to silently ignore saturation.  Why
> > don't you have any diagnostic message?
> > 
> > >   gnulib/lib/omp-init.c:48
> > 
> > lib/omp-init.c-47-      char *endptr = NULL;
> > lib/omp-init.c:48:      unsigned long int value = strtoul (threads, &endptr, 10);
> > lib/omp-init.c-49-
> > lib/omp-init.c-50-      if (endptr != NULL)
> > lib/omp-init.c-51-        {
> > lib/omp-init.c-52-          while (*endptr != '\0' && c_isspace (*endptr))
> > lib/omp-init.c-53-            endptr++;
> > lib/omp-init.c-54-          if (*endptr == '\0')
> > lib/omp-init.c-55-            return value;
> > lib/omp-init.c-56-          /* Also accept the first value in a nesting level,
> > lib/omp-init.c-57-             since we can't determine the nesting level from env vars.  */
> > lib/omp-init.c-58-          else if (*endptr == ',')
> > lib/omp-init.c-59-            return value;
> > lib/omp-init.c-60-        }
> > 
> > This seems identical to the previous case.
> > 
> > >   gettext/gettext-tools/src/msgfmt.c:287
> > 
> > gettext-tools/src/msgfmt.c-286-          char *endp;
> > gettext-tools/src/msgfmt.c:287:          size_t new_align = strtoul (optarg, &endp, 0);
> > gettext-tools/src/msgfmt.c-288-
> > gettext-tools/src/msgfmt.c-289-          if (endp != optarg)
> > gettext-tools/src/msgfmt.c-290-            alignment = new_align;
> > 
> > This code will misbehave badly on platforms where size_t is narrower
> > than u_long.  Consider the case where you parse a high u_long, let's say
> > SIZE_MAX + 1ul.  It will be converted to 1.  There's a bug due to a
> > missing range check (but orthogonal to saturation).
> > 
> > You also don't reject negative numbers, which I expect to be a bug,
> > connected with the one from above.
> > 
> > This could be rewritten to (fixing the bugs reported above):
> > 
> > 	char    *end;
> > 	size_t  new_align;
> > 
> > 	new_align = strtou_noneg(optarg, &end, 0, 0, SIZE_MAX, NULL);
> > 	if (optarg != end)
> > 		alignment = new_align;
> > 
> > This seems another case where you silently saturate.  Why don't you have
> > a diagnostic message for invalid input?
> > 
> > >   gettext/gettext-tools/src/msgl-check.c:379
> > 
> > gettext-tools/src/msgl-check.c-374-          while (*nplurals != '\0' && c_isspace ((unsigned char) *nplurals))
> > gettext-tools/src/msgl-check.c-375-            ++nplurals;
> > gettext-tools/src/msgl-check.c-376-          endp = nplurals;
> > gettext-tools/src/msgl-check.c-377-          nplurals_value = 0;
> > gettext-tools/src/msgl-check.c-378-          if (*nplurals >= '0' && *nplurals <= '9')
> > gettext-tools/src/msgl-check.c:379:            nplurals_value = strtoul (nplurals, (char **) &endp, 10);
> > gettext-tools/src/msgl-check.c-380-          if (nplurals == endp)
> > gettext-tools/src/msgl-check.c-381-            {
> > gettext-tools/src/msgl-check.c-382-              const char *msg = _("invalid nplurals value");
> > gettext-tools/src/msgl-check.c-383-              char *help = plural_help (nullentry);
> > gettext-tools/src/msgl-check.c-384-
> > gettext-tools/src/msgl-check.c-385-              if (help != NULL)
> > gettext-tools/src/msgl-check.c-386-                {
> > gettext-tools/src/msgl-check.c-387-                  char *msgext = xasprintf ("%s\n%s", msg, help);
> > gettext-tools/src/msgl-check.c-388-                  xeh->xerror (CAT_SEVERITY_ERROR, header, NULL, 0, 0, true,
> > gettext-tools/src/msgl-check.c-389-                               msgext);
> > gettext-tools/src/msgl-check.c-390-                  free (msgext);
> > gettext-tools/src/msgl-check.c-391-                  free (help);
> > gettext-tools/src/msgl-check.c-392-                }
> > gettext-tools/src/msgl-check.c-393-              else
> > gettext-tools/src/msgl-check.c-394-                xeh->xerror (CAT_SEVERITY_ERROR, header, NULL, 0, 0, false,
> > gettext-tools/src/msgl-check.c-395-                             msg);
> > gettext-tools/src/msgl-check.c-396-
> > gettext-tools/src/msgl-check.c-397-              seen_errors++;
> > gettext-tools/src/msgl-check.c-398-            }
> > 
> > You could get rid of a lot of code preceding the strtoul(3) call by
> > calling strtou_noneg() instead.  And strspn(3) would also help.
> > 
> > 	nplurals += strspn(nplurals, " \t\n");

(Actually, strtou(3) skips leading space, so we don't really need this
 at all.)

> > 	nplurals_value = strtou_noneg(nplurals, (char **) &end, 10, 0, ULONG_MAX);
> > 	if (nplurals == end)
> > 		...
> > 
> > On the other hand, I also wonder why you don't diagnose invalid input.
> > Why is -10 an "invalid nplurals value", but ULONG_MAX+10 is a valid
> > (albeit clamped) one?
> > 
> > All of these cases look like missing error handling, IMO.
> > 
> > >   gettext/gettext-tools/src/read-stringtable.c:561
> > 
> > gettext-tools/src/read-stringtable.c-553-    {
> > gettext-tools/src/read-stringtable.c-554-      char *last_colon;
> > gettext-tools/src/read-stringtable.c-555-      unsigned long number;
> > gettext-tools/src/read-stringtable.c-556-      char *endp;
> > gettext-tools/src/read-stringtable.c-557-
> > gettext-tools/src/read-stringtable.c-558-      if (strlen (line) >= 6 && memcmp (line, "File: ", 6) == 0
> > gettext-tools/src/read-stringtable.c-559-          && (last_colon = strrchr (line + 6, ':')) != NULL
> > gettext-tools/src/read-stringtable.c-560-          && *(last_colon + 1) != '\0'
> > gettext-tools/src/read-stringtable.c:561:          && (number = strtoul (last_colon + 1, &endp, 10), *endp == '\0'))
> > gettext-tools/src/read-stringtable.c-562-        {
> > gettext-tools/src/read-stringtable.c-563-          /* A "File: <filename>:<number>" type comment.  */
> > gettext-tools/src/read-stringtable.c-564-          *last_colon = '\0';
> > gettext-tools/src/read-stringtable.c-565-          catalog_reader_seen_comment_filepos (catr, line + 6, number);
> > gettext-tools/src/read-stringtable.c-566-        }
> > gettext-tools/src/read-stringtable.c-567-      else
> > gettext-tools/src/read-stringtable.c-568-        catalog_reader_seen_comment (catr, line);
> > gettext-tools/src/read-stringtable.c-569-    }
> > 
> > Let me try to rewrite it for readability first.
> > 
> > 	char        *filepos, *last_colon;
> > 	u_long      n;
> > 	const char  *numstr, *end;
> > 
> > 	filepos = strprefix(line, "File: ") ?: (char []){""};
> > 	last_colon = strrchr(filepos, ':') ?: (char []){":"};
> > 	numstr = strprefix(last_colon, ":");
> > 
> > 	n = strtoul(numstr, (char **) &end, 10);
> > 	if (numstr != end && streq(end, "")) {
> > 		/* A "File: <filename>:<number>" type comment.  */
> > 		strcpy(last_colon, "");
> > 		catalog_reader_seen_comment_filepos(catr, filepos, n);
> > 
> > 	} else {
> > 		catalog_reader_seen_comment(catr, line);
> > 	}
> > 
> > You're forgetting about negative numbers?  Or are you certain that they
> > can't happen?  How about huge values?  Assuming you want compatible code
> > calling strtou():
> > 
> > 	n = strtou(numstr, (char **) &end, 10, 0, ULONG_MAX, NULL);
> > 	if (status == 0 || status == ENOTSUP || status == ERANGE) {
> > 		...
> > 	} else {
> > 		...
> > 	}
> > 
> > But again, I wonder why you don't do range checks.
> > 
> > > > > I would therefore propose to change the status value to a bit mask, so that
> > > > > the error conditions "The converted value was out of range and has been
> > > > > coerced" and "The given string contains characters that did not get converted"
> > > > > can be both returned together, without conflicting.
> > > > 
> > > > Because it is theoretical conditions that a real program never wants,
> > > > let's not do that.
> > > 
> > > If you don't want to do that, I can only repeat what I said in the previous
> > > mail: The proposal *does not achieve the goal* of avoiding the most common
> > > programmer mistakes. For a robust API, the success test should *only* involve
> > > testing the returned 'status', nothing else.
> > 
> > Let's discuss this after your responses to the above.
> > 
> > > 
> > > Bruno
> > > 
> > > 
> > > 
> > > 
> > 
> > -- 
> > <https://www.alejandro-colomar.es/>
> 
> 
> 
> -- 
> <https://www.alejandro-colomar.es/>



-- 
<https://www.alejandro-colomar.es/>

Attachment: signature.asc
Description: PGP signature



Home | Main Index | Thread Index | Old Index