Pascal is a language that's easy to lex and parse. Then came Borland ...
A number of their ad-hoc syntax extensions cause lexing or parsing problems, and even ambiguities. This lexer tries to solve them as well as possible, sometimes with clever rules, other times with gross hacks and with help from the parser. (And, BTW, it handles regular Pascal as well. ;-)
Some of the problems are:
Real constants with a trailing `.`. Problem: They make the character sequence `2.)` ambiguous. It could be interpreted as `2.0` followed by `)`, or as `2` and `.)` (which is an alternative for `]`). This lexer chooses the latter interpretation, as BP does and as the standard requires. It would be possible to handle both by keeping a stack of the currently open parentheses and brackets and choosing the matching closing one, but since BP does not do this either, it doesn't seem worth the trouble. (Or maybe later ... ;-)
Real constants with a trailing `.` also cause a problem in the sequence `2..` (the start of an integer subrange), but this is easily solved by normal lexer look-ahead, since a real constant can't be followed by a `.` in any Pascal dialect we know of.
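The two trailing-dot cases above can be illustrated with a small look-ahead check. This is a minimal sketch in Python, not this lexer's actual code: the function name `lex_number` and the token kinds are hypothetical, and exponents are left out for brevity.

```python
import re

DIGITS = re.compile(r"[0-9]+")
FRACTION = re.compile(r"\.[0-9]*")

def lex_number(text, pos=0):
    """Return (kind, lexeme, next_pos) for the number starting at text[pos]."""
    m = DIGITS.match(text, pos)
    if m is None:
        raise ValueError("no digit at position %d" % pos)
    end = m.end()
    # A '.' after the digits belongs to the number (fraction or trailing dot)
    # unless it starts '..' (subrange) or '.)' (alternative for ']').
    if text.startswith(".", end) and not text.startswith(("..", ".)"), end):
        frac = FRACTION.match(text, end)
        return ("real", text[m.start():frac.end()], frac.end())
    return ("integer", m.group(), end)
```

With this rule, `2.)` and `2..5` both start with the integer `2`, while `2.5` and a bare `2.` are lexed as real constants.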
Numbers immediately followed by a keyword, without separating whitespace (e.g. `42to`). It gets worse with hex numbers (`$abcduntil`), but it's not really difficult to lex. However, we don't allow this with Extended Pascal non-decimal integer constants, e.g. `16#abcduntil`, where it would be a little more difficult (because whether or not `u` is a digit would depend on the base). Since BP does not even support EP non-decimal constants, there's no point in going to such trouble.
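A number glued to a following keyword can be split by consuming digits greedily and then lexing the rest as a word. The sketch below is only an illustration in Python under that assumption; the names `NUMBER`, `WORD` and `split_number_and_word` are hypothetical and not taken from this lexer.

```python
import re

# '$' hex constants are tried first, then plain decimal integers.
NUMBER = re.compile(r"\$[0-9A-Fa-f]+|[0-9]+")
WORD = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def split_number_and_word(text):
    """Split text such as '42to' into the number and the word after it."""
    num = NUMBER.match(text)
    if num is None:
        raise ValueError("text does not start with a number")
    # Whatever follows the digits is lexed separately as an identifier/keyword.
    word = WORD.match(text, num.end())
    return num.group(), (word.group() if word else None)
```

Because `u` is never a hex digit, `$abcduntil` splits cleanly into `$abcd` and `until`. In this sketch, greedy matching would also swallow a leading `d` of a following word (e.g. `$1do`), which hints at why the hex case is the worse one.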
Variable initialization with `=` rather than `value`. Problem: It makes initialized Boolean subrange variable declarations like `Foo: False .. True = False = False` ambiguous. They could be interpreted as `Foo: False .. (True = False) = False` or `Foo: False .. True = (False = False)`. This lexer, like BP, chooses the latter interpretation. To avoid conflicts in the parser, this is done with the `LEX_CONST_EQUAL` hack, counting parentheses and brackets so that in `Foo: False .. (True = False) = True` the second `=` will become the `LEX_CONST_EQUAL` token.
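The parenthesis counting can be pictured as follows. This Python sketch is an assumption-laden illustration: the text only says that parentheses and brackets are counted, so the rule "the first `=` at nesting depth zero becomes `LEX_CONST_EQUAL`" is a reading of the two examples, and `mark_const_equal` is a hypothetical helper working on the tokens after `Foo:`.

```python
OPENERS = {"(", "[", "(."}
CLOSERS = {")", "]", ".)"}

def mark_const_equal(tokens):
    """Return a copy of tokens in which the first '=' outside all
    parentheses and brackets is replaced by 'LEX_CONST_EQUAL'."""
    depth = 0
    result = list(tokens)
    for i, tok in enumerate(tokens):
        if tok in OPENERS:
            depth += 1
        elif tok in CLOSERS:
            depth -= 1
        elif tok == "=" and depth == 0:
            # Assumed rule: this '=' starts the initializer.
            result[i] = "LEX_CONST_EQUAL"
            break
    return result
```

For `False .. True = False = False` the first `=` is marked, giving the interpretation `False .. True = (False = False)`. For `False .. (True = False) = True` the `=` inside the parentheses is skipped and the second one is marked.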
Initializers enclosed in `(`, `)`. When they consist of a single entry (without an index as required in EP), they conflict with expressions in parentheses. This is resolved in the parser and the later processing of initializers.
The `external [<libname>] [name <name>]` construct, where `<libname>` and `<name>` can be string expressions. Since `name` is not a reserved word but an identifier, `external name name name` can be valid, which is difficult to parse. It could be solved by the parser, by making `name` a special identifier whose special meaning is recognized after `external` only.
Character constants with `#`. They conflict with the Extended Pascal non-decimal integer number notation. `#13#10` could mean `Chr (13) + Chr (10)` or `Chr (13#10)`. This lexer chooses the former interpretation, since the latter one would be a mix of BP and Extended Pascal features.
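The chosen reading of `#13#10` amounts to ending each character constant at the next `#`. A minimal Python sketch, assuming decimal codes only; `lex_char_constants` is a hypothetical name, not part of this lexer:

```python
import re

# One BP-style character constant: '#' followed by decimal digits.
CHAR_CONST = re.compile(r"#([0-9]+)")

def lex_char_constants(text):
    """Decode a run of '#nnn' constants, e.g. '#13#10', into a string."""
    chars = []
    pos = 0
    while pos < len(text):
        m = CHAR_CONST.match(text, pos)
        if m is None:
            raise ValueError("expected '#' and digits at position %d" % pos)
        # The digits stop at the next '#', so '#13#10' is two constants,
        # never the Extended Pascal base-13 number 13#10.
        chars.append(chr(int(m.group(1))))
        pos = m.end()
    return "".join(chars)
```

Here `#13#10` yields the two characters `Chr (13)` and `Chr (10)`.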
Character constants with `^` (was this “feature” meant as an AFJ or something???). GPC tries to make the best out of a stupid situation; see the next section, BP character constants, for details. It should be noted that BP itself fails in a number of situations involving such character constants, probably the clearest sign of a design bug.
Declarations starting with `var Foo: (Bar ...` are hard to parse, since `Bar` could be part of an expression in parentheses as the lower bound of a subrange, or the beginning of an enumeration type declaration. BP can't handle this situation. This will be solved with a GLR parser.
The `...` for variadic external function declarations causes a problem in the sequence `(...)`, which could mean `(`, `...`, `)`, i.e., a parameter list with only variadic arguments, or `(.`, `.`, `.)`. Since the latter token sequence is meaningless in any Pascal dialect we know of, this lexer chooses the former one, which is easily accomplished with normal look-ahead.
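The look-ahead for `(...)` can be sketched as one extra check before the usual longest-match rule. The Python code below is only an illustration restricted to punctuation; `PUNCTUATION` and `lex_punctuation` are hypothetical names, not this lexer's.

```python
# Candidate tokens, longest first so that plain matching prefers '(.' to '('.
PUNCTUATION = ["...", "(.", ".)", "..", "(", ")", "."]

def lex_punctuation(text):
    """Split a string of Pascal punctuation into tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        # Look-ahead: '(' directly followed by '...' stays a plain '(',
        # so that '(...)' becomes '(', '...', ')'.
        if text.startswith("(...", pos):
            tokens.append("(")
            pos += 1
            continue
        for tok in PUNCTUATION:
            if text.startswith(tok, pos):
                tokens.append(tok)
                pos += len(tok)
                break
        else:
            raise ValueError("unexpected character at position %d" % pos)
    return tokens
```

Without the check, this sketch would merge the `(` into `(.` and never produce the `...` token.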