tech-userlevel archive


Next steps for /bin/sh



There doesn't appear to have been any fallout from the last set of
changes, which were (I think) the end of the easy no-brainer changes
(easy in that there was no question about whether to make them or not.)

Now we get to more difficult issues.   I plan on sending a series
of messages (over time) to discuss what I see as needing doing and
seek input from the rest of the community.

There is a bunch of stuff, but I am going to start with PR bin/35423
for no particular reason other than that it is an old open PR (9 years
old).   It is related to PR bin/19832, also still open, and 13 years old.

These relate to the way that the shell parser works, and the intermediate
form into which it converts the script (commands, whatever it is executing)
before executing it.

Part of the reported bugs is trivial, there is a reference in 19832 to
UPEOF (an internal shell magic character) - that one is silly (not the
complaint, UPEOF) it doesn't need to exist at all (and is gone from FreeBSD's
sh) and in time, I will remove it from NetBSD's sh.   You won't (shouldn't)
see any side effects from that.

However, the rest of it is more complicated.  Internally, the script is
represented by a tree (for loops, case, ... the tree represents the
structure of the commands), then each simple command is represented by
a string.   In that string, all the normal magic shell syntax chars
(things like () ; & | and redirects) have been removed, and are instead
represented in the tree.  However, variable expansions $... in all
their forms, and quoting (to some extent) remain.

The internal form of this is a simple C string (\0 terminated.) - \0's that
appear in the input, if any, are silently removed ... this practice may
just be pragmatic, or it may date way back to the very early days, which
would have overlapped use of paper tape as an input method (not sure if
paper tape was ever used in any serious way on a unix system, but it would
have been in use on other systems of the time for sure) - there a \0 just
represents an area of the tape that is (was) not punched, usually because
someone wound the tape forward to see what had been "typed", and either
did not wind it back at all, or not far enough, leaving a gap in the tape,
which would read as \0 characters, so ignoring \0's when seen (usually at
a very low level) was normal practice.    That made it the perfect char to
use as a string terminator for C, as it wasn't a character that it was
possible to enter.

Anyway, the internal form needs to represent a few special characters.
The two characters that don't work in filenames (0x81 and 0x88) are the
internal representations of \ and ' (or ", by the time we get this far
it no longer matters what kind of quoting, just that text is quoted).
That is, approximately.
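
To illustrate how an in-band marker byte can collide with real data, here
is a minimal sketch.  It is an illustration only, not a diagnosis of the
actual bug: MARK_ESC and strip_marks() are hypothetical names, and the
real shell code is more involved.  The marker 0x81 means "the next byte
is quoted"; a raw 0x81 from the input survives only if it was itself
preceded by a marker.

```c
#include <stddef.h>

/* Hypothetical name for the internal "next byte is quoted" marker. */
#define MARK_ESC 0x81

/*
 * Copy src to dst, dropping each marker byte and keeping the byte that
 * follows it literally.  dst must be at least as large as src.
 * Returns the number of bytes written (excluding the terminating \0).
 */
static size_t
strip_marks(const unsigned char *src, unsigned char *dst)
{
	size_t n = 0;

	while (*src != '\0') {
		if (*src == MARK_ESC && src[1] != '\0')
			src++;		/* skip the marker itself */
		dst[n++] = *src++;	/* copy the (possibly quoted) byte */
	}
	dst[n] = '\0';
	return n;
}
```

With this scheme, a filename byte 0x81 that reaches the string without
its own marker is mistaken for a marker and vanishes, while one that was
escaped (0x81 0x81) comes through intact.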

I don't see the internal representation changing in any dramatic way,
that would be a huge job, which would affect almost everything in the
shell (well a lot of it anyway) - it would probably be easier to simply
start again.

We could do what FreeBSD have done, and alter the internal characters
used to be ones that never appear (anywhere) in UTF-8 input.   That
isn't to say that the shell will suddenly become UTF-8 friendly in any
real sense, but at least any valid characters could be used for filenames,
strings, and the like, provided they're represented in UTF-8.   It turns
out that the nature of its encoding leaves just a few bit patterns of the
256 available totally unused - FreeBSD uses (some of) those for its
internal representation.

That's not really a fix, just a band-aid, but it is really easy to do.

Alternatively, I could look hard at the parser, and all its uses, and see
if I can find exactly why those two chars are not being properly quoted
(internally) when they are intended to represent themselves.   I suspect
that is what I should do, but I can't guarantee that it will be successful.

For reference, the chars that can't occur in UTF-8 that we could pick
are 0xC0, 0xC1, and 0xF5..0xFF - we really need two of those if nothing
else changes to replace the two chars that cause problems.
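
As a quick check of which byte values are candidates, here is a small
sketch based on RFC 3629, which lists 0xC0, 0xC1 and 0xF5..0xFF as never
appearing in UTF-8 (0xF4 is still a valid lead byte, for U+100000 through
U+10FFFF).  The function name is made up for illustration.

```c
#include <stdbool.h>

/*
 * True if byte b can never appear anywhere in well-formed UTF-8
 * (per RFC 3629): 0xC0 and 0xC1 would only start overlong 2-byte
 * forms, and 0xF5..0xFF would encode values beyond U+10FFFF.
 */
static bool
never_in_utf8(unsigned char b)
{
	return b == 0xC0 || b == 0xC1 || b >= 0xF5;
}
```

Note that 0x81 and 0x88 are ordinary UTF-8 continuation bytes, so this
test rejects them as marker candidates.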

The first question is would just doing that be enough, or do people need
to be able to pass through all available 255 chars?  (sorry, \0 is just
not going to happen!)

In a similar vein, but a bit more internal (though with some external 
visibility) back in 1996, in response to PR bin/2808 a whole bunch of
patches from FreeBSD were incorporated into the NetBSD shell.

One of those changes was to stop using the shell's own private isalpha()
macros (they have different names - and "stop using" meant to redefine them
in terms of <ctype.h> and isalpha() etc.)

In 2010, FreeBSD undid that change, with a commit log entry that
reads ...

   sh: Do not use locale for determining if something is a name.

   This makes it impossible to use locale-specific characters in variable
   names.

   Names containing locale-specific characters make scripts only work with the
   correct locale setting. Also, they did not even work in many practical cases
   because multibyte character sets such as utf-8 are not supported.

   This also avoids weirdness if LC_CTYPE is changed in the middle of a script.

I think we should do the same, with a minor side-effect of a small speed
improvement while parsing: <ctype.h> doesn't consider '_' to be
alphabetic (hardly surprising) but for the shell it mostly is, leading
currently to a bunch of things that expand into expressions like

	(c == '_' || isalpha(c))

which would be just an "isalpha()" (lookalike) macro call using the internal
forms.
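
A minimal sketch of what locale-independent name checks could look like,
assuming an ASCII-compatible character set.  The function names are made
up for illustration and are not the shell's actual macros, which are named
and implemented differently.

```c
#include <stdbool.h>

/* First character of a name: ASCII letter or '_', whatever the locale. */
static bool
is_name_start(int c)
{
	return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
}

/* Later characters of a name may also be ASCII digits. */
static bool
is_in_name(int c)
{
	return is_name_start(c) || (c >= '0' && c <= '9');
}

/* True if s is a non-empty string made up only of name characters. */
static bool
is_valid_name(const char *s)
{
	if (!is_name_start((unsigned char)*s))
		return false;
	while (*++s != '\0')
		if (!is_in_name((unsigned char)*s))
			return false;
	return true;
}
```

Because nothing here consults LC_CTYPE, the answer for a given string is
the same in every locale, and bytes above 0x7F are never name characters.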

Any objections to that change ?

kre


