Node:Parsing subranges, Next:, Previous:Parsing keywords, Up:Internals



Expressions as lower bounds of subranges

Extended Pascal allows arbitrary expressions as the lower bounds of subrange types. This leads to some following parsing difficulties:

type
  a = (expr1) .. expr2;

(if expr1 starts with an identifier) vs.

type
  a = (enum1, enum2);

If the enum type contains at least two items, we get no real conflicts, but what the bison manual calls "mystery conflicts", i.e. our grammer is LR(1), but not LALR(1) which bison requires, Mystery Conflicts.

Our solution is the one suggested in the bison manual, to add a bogus rule. For that we add a new terminal BOGUS which is never used and a new nonterminal conflict_id which contains the identifiers that are responsible for the six conflicts.

It gets more difficult if the enum type has only one item, i.e.:

type
  a = (enum1);

If further expr1 consists of a single identifier, the conflict cannot be resolved without reading the token following the right parenthesis. (This is inherent in the EP language.)

But already after reading the identifier (expr1 or enum1), our parser has to decide whether to reduce it to an expression or to keep it as an identifier. (The alternative would be to define an expression which is anything but a single identifier, and parse (identifier) as a distinct thing, but this would get quite hairy.)

We resolve it with a lexer hack. The lexer turns a right parenthesis which is followed by .. into the new token LEX_RPAR. Most places in the parser treat LEX_RPAR and ) as equivalent (nonterminal rpar). However, enum types allow only ) (they can never be followed by ..), and the lower bound of a subrange allows only LEX_RPAR (it is always followed by ..). This resolves the conflict.

But there are more conflicts if the lower bound is not enclosed in parentheses:

type
  a = Foo (42) .. expr2;

(where Foo can be one of certain built-in functions such as Sqr, or a type name for a type-cast) vs.

type
  a = Bar (42);

(where Bar is an undiscriminated schema type).

To resolve this, we let the lexer return a special token LEX_SCHEMA for identifiers which correspond to undiscriminated schema types. The parser accepts this token in new_identifier (so schema identifiers can be redefined) and typename (e.g. for parameters), but not in id (which appears in expressions) where undiscriminated schema types are invalid.

The last conflict:

type
  a = @Foo = (expr) .. expr2;

(where @ is the BP address operator - the = (expr) is there to create an ordinal (namely, Boolean) expression that starts with the address operator) vs.

type
  a = @Bar = (expr);

(where @ is a lexical alternative for ^, according to the standards).

The conflict arises already with the @ token. The = (as a comparison operator in the first case, and for a type initializer - EP extension, combined with a BP extension of using = instead of value) just adds to the problem. Since expr can be arbitrary long, the conflict is in fact not solvable with any fixed number of lookup tokens.

This conflict is decided using parser precedence rules, in favour of the latter interpretation. (BP itself can't parse the supposedly BP compatible former syntax.)