Source-Changes archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

CVS commit: othersrc/external/bsd/elex



Module Name:    othersrc
Committed By:   agc
Date:           Thu Dec  9 04:15:26 UTC 2021

Added Files:
        othersrc/external/bsd/elex: Makefile README TODO
        othersrc/external/bsd/elex/bin: Makefile
        othersrc/external/bsd/elex/dist: Makefile agcre.c agcre.h elex.c elex.h
            main.c striter.c striter.h
        othersrc/external/bsd/elex/dist/tests: 1.expected 1.in 1.lex
            10.expected 10.in 11.expected 11.in 12.expected 12.lex 13.expected
            13.lex 14.expected 14.in 14.lex 15.expected 15.in 15.lex
            16.expected 16.in 16.lex 17.expected 17.in 17.lex 18.expected 18.in
            18.lex 19.expected 19.in 19.lex 2.expected 2.in 20.expected 20.in
            20.lex 21.expected 21.in 21.lex 22.expected 22.lex 23.expected
            23.lex 24.expected 24.lex 25.expected 25.lex 26.expected 26.lex
            27.expected 27.lex 28.expected 29.expected 29.in 29.lex 3.expected
            3.in 3.lex 30.expected 30.lex 4.expected 4.lex 5.expected 5.in
            5.lex 6.expected 6.in 6.lex 7.expected 7.in 7.lex 8.expected 8.in
            8.lex 9.expected 9.in 9.lex
        othersrc/external/bsd/elex/lib: Makefile shlib_version

Log Message:
Elex - an embeddable regexp-based lexer
=======================================

I have found myself fairly often needing a lexer utility to tokenise
input (for configuration files, for various file-scanning utilities,
and for other applciations), but using full-blown lex(1) program to do
this is overkill, or designed for a separate process, which doesn't fit
well with the design - syntax-coloring editors, for example.

This utility, elex, is a regexp-based tokenizer, an embedded lexer,
which can be used for various uses.  It usually takes a lexer file
(similar to lex input files), although the API allows lexers to be
built on the fly by just issuing the calls to make new rules.

Implementation
==============

Normal lex(1) is implemented (usually) as a conglomeration of all the
regular expressions for a start state - if multiple matches are found,
the largest match is the one used.  I've found that, in practice, this
constrains the way a number of things are done.  So I've implemented
elex, using a multiple, prioritised multiple regexp matching scheme.
This supercedes the usual way of distinguishing reserved words and
identifiers in the lexer - recognising all "words" first, and
searching for each word through a number of tables; if a match is not
found, then the word recognized is an identifier.  elex works around
this by using a regexp to match reserved words first, and then to
recognise the word as an identifier after that.  Since normal regular
expressions usually progress through the input trying to find a match,
the regular expressions used in elex are constrained by anchoring the
search, not allowing progression through the input.  In practice, this
makes for more efficient matching.

another side effect is the ability to use more modern regexp features,
such as perl escapes, UTF-8 matching, in-subexpression ignore case, etc.

elex implements start states, similar to flex.  These are useful for
recognising multiline comments (almost any language), or multi-line
strings (perl, python, lua etc).

elex dynamically sizes the regmatch arrays used to accommodate the
largest regexp in the input, and matching subexpressions can be
returned to the caller.  The 0'th subexpression is the whole matching
expression, and is the same as "yytext".

And so on to an elex definition which recognises C and some C++:

        # start state
        %state COMMENT

        # the types we define
        %type IDENT     0xdb8ea4d
        %type PUNCT     0xe454e3a
        %type NUMBER    0xca1edaec
        %type COMMENT   0xee5ae423
        %type CONSTANT  0xd497741f
        %type PREPROC   0xdcf9b98d
        %type RESWORD1  0xb5ac6a6a
        %type RESWORD2  0xb5ac6a6b

        # and finally... the rules

        
<INITIAL>(auto|char|class|const|double|enum|extern|float|friend|inline|int|long|mutable|namespace|new|private|protected|public|register|requires|short|signed|static|this|struct|this|typedef|union|unsigned|void|volatile)\></>
 { return RESWORD1; }
        <INITIAL>(asm|break|case|catch|continue|default|do|else|for|goto|if|return|switch|throw|try|while)\></> { return RESWORD2; }
        <INITIAL>[a-zA-Z_][0-9a-zA-Z_]*</>      { return IDENT; }

        <INITIAL>([1-9][0-9]*|0x[0-9a-f]|0X[0-9A-F]+|0[0-7]*|'(\\.|[^'])*')</>          { return NUMBER; }
        <INITIAL>[ \t\n\r]+</>                  { return PUNCT; }

        <INITIAL>/\*</>                         { BEGIN(COMMENT); return COMMENT; }
        <COMMENT>[^\n]*\*/</>                   { BEGIN(INITIAL); return COMMENT; }
        <COMMENT>\n|[^\n]+</>                   { return COMMENT; }

        <INITIAL>//[^\n]*</>                    { return COMMENT; }

        <INITIAL>"(\\.|[^"])*"</>               { return CONSTANT; }

        <INITIAL>(==|[-]>|!=|<=|>=|~=|%=|&=|[*]=|[-]=|[+]=|[|]=|(<<|>>)=?)</>   { return PUNCT; }
        <INITIAL>[\u005b;(){}\u005d*<>,+/%~!\u005e&=|.?:\u002d]</>              { return PUNCT; }

        <INITIAL>#[ \t]*(define|el(se|if)|endif|error|if|ifn?def|include|line|pragma|undef)[^\n]*</>    { return PREPROC; }

Start states are explicitly used for rules, since it is easier to read
in practice.  Elex comments are eol-style comments, beginning '#' and
ending with '\n'.  Types can be defined using the "%type" directive,
and the unsigned 32bit value they take will be returned.  This is more
work than using magic constants, but much more readable in practice -
see the example calling program below.

A rule is of the form:

        <startstate>regexp</>   { BEGIN(newstate); return TYPE; }

(the BEGIN() is optional.  </> terminates the regexp - it's clearer
than wrapping everything in double quotes or some other normal
character which would then have to be escaped.  The return is also
optional, but is usually present in practice.)

Hex codepoints can be used to specify wildcard characters in character
classes - the ones enclosed in [] above).

Start states can be defined using the %state (or %x) directive, and
transitioned to using the BEGIN() action, in the same way as standard
lex(1).

Elex provides bookmarks, which are numbered numerically from 0.
Assuming a mark has already been successfully created using
"set-mark", the bookmark offset can be retrieved by using its index
using "get-mark", and the user can then seek to that offset.  This
makes it possible to set up, and re-visit sync points, in case there
was ever a programming language or other input devised which was
ambiguous to some degree.

A heavier-weight alternative to bookmarks is to simply clone the elex
structure at the point of possible divergence, and perform the
alternatives on the clones.  A deep copy is made during the clone
operation.

elex is called from a C (or other language) program in much the same
way as lex is:

        int
        main(int argc, char **argv)
        {
                uint64_t         len;
                elex_t          *elex;
                size_t           size;
                char            *lexfile;
                char            *input;
                char            *text;
                char            *s;
                int              graphic;
                int              type;
                int              i;

                input = lexfile = NULL;
                s = NULL;
                size = 0;
                graphic = 0;
                while ((i = getopt(argc, argv, "Vf:gi:")) != -1) {
                        switch(i) {
                        case 'V':
                                printf("elex version " ELEX_VERSION_S(ELEX_VERSION_NUM) "\n");
                                exit(EXIT_SUCCESS);
                        case 'f':
                                lexfile = optarg;
                                break;
                        case 'g':
                                graphic = 1;
                                break;
                        case 'i':
                                input = optarg;
                                break;
                        }
                }
                if ((elex = elex_new()) == NULL) {
                        errx(EXIT_FAILURE, "can't create new elex structure");
                }
                if (!elex_exec(elex, "read-defs-file", lexfile, 0)) {
                        errx(EXIT_FAILURE, "can't read defs file '%s'", lexfile);
                }
                if (input == NULL) {
                        input = argv[optind];
                }
                if ((s = readfile(input, &size)) == NULL || size == 0) {
                        errx(EXIT_FAILURE, "can't read input file '%s'", input);
                }
                if (!elex_exec(elex, "insert", s, size)) {
                        errx(EXIT_FAILURE, "can't set input");
                }
                while (elex_exec(elex, "next-token", NULL, 0) != 0) {
                        type = elex_exec(elex, "get-yytype", NULL, 0);
                        text = elex_exec_str(elex, "get-yytext", 0, &len);
                        if (graphic) {
                                switch(type) {
                                case /* "IDENT" */ 0xdb8ea4d:
                                case /* "PUNCT" */ 0xe454e3a:
                                case 1:
                                        /* no color, standard */
                                        printf("%.*s", (int)len, text);
                                        break;
                                case /* "NUMBER" */ 0xca1edaec:
                                case 2:
                                        p(CYAN, len, text);
                                        break;
                                case /* "COMMENT" */ 0xee5ae423:
                                case 3:
                                        p(BRIGHT_RED, len, text);
                                        break;
                                case /* "CONSTANT" */ 0xd497741f:
                                case 4:
                                        p(MAGENTA, len, text);
                                        break;
                                case /* "PREPROC" */ 0xdcf9b98d:
                                case 5:
                                        p(GREEN, len, text);
                                        break;
                                case /* "RESWORD1" */ 0xb5ac6a6a:
                                case 6:
                                        p(BRIGHT_GREEN, len, text);
                                        break;
                                case /* "RESWORD2" */ 0xb5ac6a6b:
                                case 7:
                                        p(BRIGHT_YELLOW, len, text);
                                        break;
                                default:
                                        printf("%.*s", (int)len, text);
                                        break;
                                }
                        } else {
                                printf("[%d] %d '%s'\n", type, (int)len, text);
                        }
                        free(text);
                }
                elex_dispose(&elex);
                exit(EXIT_SUCCESS);
        }

This program, when invoked with -g, and using the elex definition above,
will perform the same syntax coloring on C programs as is performed in
my editor.

All access to elex is through an opaque elex_t structure pointer.

There are 5 functions in the API:

        /* create and destroy */
        elex_t *elex_new(void);
        int elex_dispose(elex_t **/*elex*/);

        /* these functions do ALL the work */
        int64_t elex_exec(elex_t */*elex*/, const char */*info*/, const char */*s*/, int64_t /*n*/);
        void *elex_exec_str(elex_t */*elex*/, const char */*info*/, uint64_t /*n*/, uint64_t */*size*/);

        /* with one exeception - deal with states */
        int elex_make_new_rule(elex_t */*elex*/, const char */*startstate*/,
                const char */*pat*/, const char */*newstate*/, int /*ret*/);

An editable input buffer is held in the elex structure.  This can be
set, additional input inserted, and bytes deleted from it.  Callers
can seek within the input buffer, although it is the caller's
responsibility to set the start states correctly after any seeking.

Usually a single file is used (which can be re-used by other programs),
although new start states, types and rules can be done manually.

Note that the type returned on EOF is 0, as in lex, so it is best to
start your types at 1; an alternative is to use stringswitch methods
by returning a hash value of the state name. See the elex file above.

There are 2 main accessor functions:  elex_exec(), which executes an
action and returns a numeric result, and elex_exec_str() which
executes an action and returns a string (allocated using calloc(1),
and it is the caller's responsibility to free this storage).  Within
each of these functions, the second, string argument gives the action
to be performed.  These actions are:

elex_exec:
"delete" - delete n chars from the input
"get-input-length" - return the length of the input buffer
"get-mark" - return bookmark number n (see set-mark)
"get-yyleng" - get the length of the pattern matched
"get-yyline" - get the current line number in the input buffer
"get-yytype" - return the type of the pattern matched
"hash" - return djbhash value for a string
"insert" - insert bytes into the input buffer
"new-state" - make a new start state
"new-type" - make a new type
"next-token" - perform the equivalent of yylex(). Return the type of
        the pattern matched.  If no pattern is matched, print the
        current byte on stdout and advance 1 byte.
"read-defs" - read the elex definitions from memory
"read-defs-file" - read the elex definitions from a file
"seek" - move to an ofset in the input buffer. It is the caller's
        responsibility to set the start state after using this action
"tell" - return the current offset in the input buffer
"set-mark" - set a bookmark at the current offset, and return the index of
        the mark
"set-start-state" - set the current start state

elex_exec_str:
"clone" - make a deep copy of the elex struct
"get-rule" - return the full rule for the n'th rule
"get-subexpr" - for the current match, get the n'th matching subexpression in allocated space
"get-yyrule-pat" - get the input pattern for the matching regexp in
        allocated space
"get-yystate" - get the name of the current state in allocated space
"get-yytext" - get the full text of the current match in allocated space

There is also a function to add a new rule, elex_make_new_rule. In future,
this will be extended to allow changing rules on the fly (an example use
of this would be to use a regexp to recognise identifiers in the language).

Designing elex definition files
===============================

Firstly, any start states are defined. Start states are states in elex
which are different from the initial state, aka INITIAL. They are generally
used for complex recognition, like multi-line strings or comments - they
work more effectively, and the definitions are clearer.

After start states, the next section to be defined is the types
section.  This allows us to define the return types that elex will
return to the calling program.  Since a 0 type usually denotes end of
input, it is advised to define return types starting at 1.
Historically, in lex definitions, the user-defined types started at
256, and it was common to return ASCII values for single characters up
to 256.  Since this is no longer acceptable in a world with multibyte
characters, and because we tend to tokenise based on types of input
tokens, hopefully this practice will never be used again.

Current start states are specified at the start of the rule.

        <INITIAL>/\*</>                         { BEGIN(COMMENT); return COMMENT; }
        <COMMENT>[^\n]*\*/</>                   { BEGIN(INITIAL); return COMMENT; }
        <COMMENT>\n|[^\n]+</>                   { return COMMENT; }

To use the start state above as an example, if we are in the initial state, and
get a "/*" pattern in the input, we transition to the COMMENT state using the
"BEGIN" action, and return the COMMENT type.

Once we are in the COMMENT state, if we receive any text terminated by
a "*/" pattern, we transition to the INITIAL state, and return a
COMMENT type.  Having specified the termination of the COMMENT state,
if we receive any other text, we simply return a COMMENT at the end of
each line.

It is also possible just to specify a COMMENT action as

        <INITIAL>/\*.*\*/</>

but any special actions to be taken on a newline character would be missed
in this.

Usually, when tokenising programming language, there would be a number
of definitions for reserved words, and standard identifiers. There would
also be definitions for punctuation, and numeric and string constants.
Some languages have definitions for multiline strings.

Alistair Croooks
Thu Nov 18 16:57:44 PST 2021


To generate a diff of this commit:
cvs rdiff -u -r0 -r1.1 othersrc/external/bsd/elex/Makefile \
    othersrc/external/bsd/elex/README othersrc/external/bsd/elex/TODO
cvs rdiff -u -r0 -r1.1 othersrc/external/bsd/elex/bin/Makefile
cvs rdiff -u -r0 -r1.1 othersrc/external/bsd/elex/dist/Makefile \
    othersrc/external/bsd/elex/dist/agcre.c \
    othersrc/external/bsd/elex/dist/agcre.h \
    othersrc/external/bsd/elex/dist/elex.c \
    othersrc/external/bsd/elex/dist/elex.h \
    othersrc/external/bsd/elex/dist/main.c \
    othersrc/external/bsd/elex/dist/striter.c \
    othersrc/external/bsd/elex/dist/striter.h
cvs rdiff -u -r0 -r1.1 othersrc/external/bsd/elex/dist/tests/1.expected \
    othersrc/external/bsd/elex/dist/tests/1.in \
    othersrc/external/bsd/elex/dist/tests/1.lex \
    othersrc/external/bsd/elex/dist/tests/10.expected \
    othersrc/external/bsd/elex/dist/tests/10.in \
    othersrc/external/bsd/elex/dist/tests/11.expected \
    othersrc/external/bsd/elex/dist/tests/11.in \
    othersrc/external/bsd/elex/dist/tests/12.expected \
    othersrc/external/bsd/elex/dist/tests/12.lex \
    othersrc/external/bsd/elex/dist/tests/13.expected \
    othersrc/external/bsd/elex/dist/tests/13.lex \
    othersrc/external/bsd/elex/dist/tests/14.expected \
    othersrc/external/bsd/elex/dist/tests/14.in \
    othersrc/external/bsd/elex/dist/tests/14.lex \
    othersrc/external/bsd/elex/dist/tests/15.expected \
    othersrc/external/bsd/elex/dist/tests/15.in \
    othersrc/external/bsd/elex/dist/tests/15.lex \
    othersrc/external/bsd/elex/dist/tests/16.expected \
    othersrc/external/bsd/elex/dist/tests/16.in \
    othersrc/external/bsd/elex/dist/tests/16.lex \
    othersrc/external/bsd/elex/dist/tests/17.expected \
    othersrc/external/bsd/elex/dist/tests/17.in \
    othersrc/external/bsd/elex/dist/tests/17.lex \
    othersrc/external/bsd/elex/dist/tests/18.expected \
    othersrc/external/bsd/elex/dist/tests/18.in \
    othersrc/external/bsd/elex/dist/tests/18.lex \
    othersrc/external/bsd/elex/dist/tests/19.expected \
    othersrc/external/bsd/elex/dist/tests/19.in \
    othersrc/external/bsd/elex/dist/tests/19.lex \
    othersrc/external/bsd/elex/dist/tests/2.expected \
    othersrc/external/bsd/elex/dist/tests/2.in \
    othersrc/external/bsd/elex/dist/tests/20.expected \
    othersrc/external/bsd/elex/dist/tests/20.in \
    othersrc/external/bsd/elex/dist/tests/20.lex \
    othersrc/external/bsd/elex/dist/tests/21.expected \
    othersrc/external/bsd/elex/dist/tests/21.in \
    othersrc/external/bsd/elex/dist/tests/21.lex \
    othersrc/external/bsd/elex/dist/tests/22.expected \
    othersrc/external/bsd/elex/dist/tests/22.lex \
    othersrc/external/bsd/elex/dist/tests/23.expected \
    othersrc/external/bsd/elex/dist/tests/23.lex \
    othersrc/external/bsd/elex/dist/tests/24.expected \
    othersrc/external/bsd/elex/dist/tests/24.lex \
    othersrc/external/bsd/elex/dist/tests/25.expected \
    othersrc/external/bsd/elex/dist/tests/25.lex \
    othersrc/external/bsd/elex/dist/tests/26.expected \
    othersrc/external/bsd/elex/dist/tests/26.lex \
    othersrc/external/bsd/elex/dist/tests/27.expected \
    othersrc/external/bsd/elex/dist/tests/27.lex \
    othersrc/external/bsd/elex/dist/tests/28.expected \
    othersrc/external/bsd/elex/dist/tests/29.expected \
    othersrc/external/bsd/elex/dist/tests/29.in \
    othersrc/external/bsd/elex/dist/tests/29.lex \
    othersrc/external/bsd/elex/dist/tests/3.expected \
    othersrc/external/bsd/elex/dist/tests/3.in \
    othersrc/external/bsd/elex/dist/tests/3.lex \
    othersrc/external/bsd/elex/dist/tests/30.expected \
    othersrc/external/bsd/elex/dist/tests/30.lex \
    othersrc/external/bsd/elex/dist/tests/4.expected \
    othersrc/external/bsd/elex/dist/tests/4.lex \
    othersrc/external/bsd/elex/dist/tests/5.expected \
    othersrc/external/bsd/elex/dist/tests/5.in \
    othersrc/external/bsd/elex/dist/tests/5.lex \
    othersrc/external/bsd/elex/dist/tests/6.expected \
    othersrc/external/bsd/elex/dist/tests/6.in \
    othersrc/external/bsd/elex/dist/tests/6.lex \
    othersrc/external/bsd/elex/dist/tests/7.expected \
    othersrc/external/bsd/elex/dist/tests/7.in \
    othersrc/external/bsd/elex/dist/tests/7.lex \
    othersrc/external/bsd/elex/dist/tests/8.expected \
    othersrc/external/bsd/elex/dist/tests/8.in \
    othersrc/external/bsd/elex/dist/tests/8.lex \
    othersrc/external/bsd/elex/dist/tests/9.expected \
    othersrc/external/bsd/elex/dist/tests/9.in \
    othersrc/external/bsd/elex/dist/tests/9.lex
cvs rdiff -u -r0 -r1.1 othersrc/external/bsd/elex/lib/Makefile \
    othersrc/external/bsd/elex/lib/shlib_version

Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.




Home | Main Index | Thread Index | Old Index