General rule syntax
- Basic EBNF rule syntax
-
'c' is a character reference, only valid in a lexer grammar.
"string" is a string reference in a lexer grammar, or a literal reference in a parser or tree
parser grammar.
<CHAR1..CHAR2, CHAR3 - CHAR4, CHAR5> is a character set reference, only valid in a
lexer grammar. It denotes the set of characters from CHAR1 to CHAR2, plus CHAR3, but excluding CHAR4 and
CHAR5.
<TOKEN1, TOKEN2 - EXCLUDE1, EXCLUDE2> is a token set reference, only valid in
parser and tree parser grammars.
<TOKEN> is a token reference, a degenerate form of the general token set syntax
above.
rule() is a rule reference.
(Choice | Choice ...) is an EBNF block in which exactly one of the choices must occur
exactly once.
(Choice | Choice ...)? or [ Choice | Choice ] is an EBNF block in which any
one of the choices may occur zero or one time.
(Choice | Choice ...)+ is an EBNF block in which the choices must occur at least
once.
(Choice | Choice ...)* is an EBNF block in which the choices may occur zero or more
times.
(Choice | Choice ...):n or (Choice | Choice ...):0,max is an EBNF block
in which the choices occur n times, or between 0 and max times.
Important:
- This construct is still experimental.
- To simplify follow set calculation, the ():min,max construct can only accept
simple charSet or tokenSet elements. In particular, nested ()?, ()*, ()+, ruleRef or treeRef are
invalid.
- Also, the ():n construct is implicitly greedy, and LLK does not check it for choice
follow conflicts.
#node[:label] specifies the node name for a rule or an element. An element descriptor may
have an optional label (to be used in place of llkThis), since multiple element nodes may be declared in
a rule.
Studying LLK.ll and the other distributed grammars is the easiest way to understand the detailed syntax and
its use.
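To make the character set semantics concrete, the following is a minimal Java sketch of how a set such as <'a'..'z', '_' - 'q', 'x'> could be evaluated: the included ranges and characters are collected first, then the excluded ones are removed. The class CharSetSketch and its methods are hypothetical and are not part of LLK.

```java
import java.util.BitSet;

// Hypothetical helper, not part of LLK: models <includes - excludes>.
class CharSetSketch {
    private final BitSet members = new BitSet();

    // Add every character in the inclusive range lo..hi.
    CharSetSketch include(char lo, char hi) {
        members.set(lo, hi + 1); // BitSet's upper bound is exclusive
        return this;
    }

    // Add a single character.
    CharSetSketch include(char c) {
        return include(c, c);
    }

    // Remove a single character (an entry after the '-').
    CharSetSketch exclude(char c) {
        members.clear(c);
        return this;
    }

    boolean contains(char c) {
        return members.get(c);
    }

    // Models <'a'..'z', '_' - 'q', 'x'>.
    static CharSetSketch example() {
        return new CharSetSketch()
            .include('a', 'z')
            .include('_')
            .exclude('q')
            .exclude('x');
    }
}
```

With this sketch, CharSetSketch.example().contains('a') is true, while contains('q') and contains('0') are false.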
- Alias rule
- An alias rule has the following syntax:
alias_rule_name = target_rule_name | <charset>;
Here target_rule_name is the name of another rule (which may itself be an alias rule, as long as
there is no cycle and the chain ends at a normal rule). The alias rule is simply an alias name for the
target rule. Any reference to it is replaced by a reference to the target rule before code generation.
For a lexer grammar, the right-hand side can also be a charset. In that case, the named charset can later be
referenced through the alias name. Example:
ALPHANUM = <'a'..'z', '0'..'9'>;
void IDENTIFIER_START() :
{
<ALPHANUM, '_'>
}
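The substitution of alias references by their targets can be pictured with a small Java sketch. It assumes aliases are kept in a simple name-to-name map and follows the chain until it reaches a name that is not an alias, rejecting cycles. AliasResolver is a hypothetical class for illustration only and does not reflect LLK's actual implementation.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not LLK's actual implementation.
class AliasResolver {
    // alias name -> target rule name
    private final Map<String, String> aliases = new HashMap<>();

    void addAlias(String alias, String target) {
        aliases.put(alias, target);
    }

    // Follow the alias chain until a name that is not an alias is reached.
    String resolve(String name) {
        Set<String> seen = new HashSet<>();
        String current = name;
        while (aliases.containsKey(current)) {
            if (!seen.add(current)) {
                // The same alias was visited twice: the chain never ends.
                throw new IllegalStateException("Alias cycle involving: " + current);
            }
            current = aliases.get(current);
        }
        return current;
    }
}
```

For example, after addAlias("a", "b") and addAlias("b", "expr"), resolve("a") returns "expr"; a pair of aliases that point at each other causes an IllegalStateException.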
- Empty rule.
- Sometimes a lexer, parser or tree parser rule returns a different token type depending on
context. Since each rule can declare only one token type, the other token types have to be declared
separately as empty rules. Example:
REAL; DECIMAL; OCTAL; HEX;
DOT; RANGE;
void NUMBER() : {}
{
%LOOKAHEAD(( Digit() )+ ( '.' | 'e' | 'E' ))
Real() ( RealSuffix() )?
{ llkType=REAL; }
|
llkType=Dot()
|
Int() ( IntSuffix() )?
{ llkType=DECIMAL; }
|
Octal() ( IntSuffix() )?
{ llkType=OCTAL; }
|
Hex() ( IntSuffix() )?
{ llkType=HEX; }
}
protected int Dot(): {
int type=DOT;
}
{
'.'
(
'.'
{ type=RANGE; }
|
( Digit() )+ ( Exponent() )? ( RealSuffix() )?
{ type= REAL; }
)?
}
{ return type; }
Here the lexer rule NUMBER does not return a token of type NUMBER but one of type REAL, DECIMAL, etc., depending
on the parse result. REAL, DECIMAL, etc. are declared as empty lexer rules.
- DOT and RANGE are also declared as empty rules instead of literal rules. If they were declared
as literal rules, the literal rules would conflict with the Dot() rule, which is required to parse
REAL. Note that since DOT and RANGE are not declared as literal rules, they cannot be used literally (as "."
or "..") in the parser grammar; they have to be referenced as <DOT> and <RANGE>.
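The decision made by the Dot() rule can be restated as a hand-written Java sketch. This is only an illustration of the idea that one rule yields several token types; exponents and suffixes are left out, and the class DotSketch, its enum and its method are hypothetical names that LLK does not generate.

```java
// Hypothetical hand-written equivalent of the Dot() rule; the enum and
// method names are illustrative and are not generated by LLK.
class DotSketch {
    enum Type { DOT, RANGE, REAL }

    // Classify the token that starts at input[pos], which must be '.'.
    // Exponents and suffixes are omitted to keep the sketch short.
    static Type classify(String input, int pos) {
        if (pos >= input.length() || input.charAt(pos) != '.') {
            throw new IllegalArgumentException("expected '.' at " + pos);
        }
        Type type = Type.DOT;                 // default, as in the declaration block
        int next = pos + 1;
        if (next < input.length()) {
            char c = input.charAt(next);
            if (c == '.') {
                type = Type.RANGE;            // ".."
            } else if (c >= '0' && c <= '9') {
                type = Type.REAL;             // ".5"
            }
        }
        return type;                          // single exit, like the return action block
    }
}
```

A caller would then use the returned value as the token type, in the same way that NUMBER assigns the result of Dot() to llkType.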
- Labels.
- Each reference (charSet(), tokenSet(), ruleRef() or treeRef()) can be prefixed by a label and
an operator. Example:
int rule(int mod) : {
LLKToken start, end;
boolean ok;
}
{
start="{" ok=MatchPair(LBRACE, RBRACE, 1) end="}"
|
mod|=modifiers(mod)
}
LLK accepts the following operators:
"=" | "|=" | "&=" | "^=" | "+=" | "-=" | "*=" | "/=" | "%=" | "<<=" | ">>=" | ":"
The operators work just as they do in Java (and other languages). ":" works like "=" except that the
label is declared automatically, as in ANTLR. For treeRef(), LLK also automatically casts the right-hand
side of the ":" operator to the label type. A label can be used multiple times as long as the
context expects a type that is compatible with the type of the label.
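As a rough illustration of the compound operators, the Java sketch below shows the plain Java behaviour that a label such as mod|=modifiers(mod) relies on: each call's result is OR-ed into the existing value instead of replacing it. The flag constants and the keyword-based modifiers method are invented for this example and are assumptions, not LLK output.

```java
// Hypothetical sketch of what labelled references could translate to;
// the flag constants and rule methods are invented for illustration.
class LabelSketch {
    static final int PUBLIC = 1, STATIC = 2, FINAL = 4;

    // Stand-in for a modifiers rule that recognises one keyword.
    static int modifiers(String keyword) {
        switch (keyword) {
            case "public": return PUBLIC;
            case "static": return STATIC;
            case "final":  return FINAL;
            default:       return 0;
        }
    }

    // Accumulate flags the way a repeated "mod|=modifiers(...)" label would.
    static int parseModifiers(String... keywords) {
        int mod = 0;
        for (String k : keywords) {
            mod |= modifiers(k);   // keeps earlier bits, adds the new one
        }
        return mod;
    }
}
```

Using "=" instead of "|=" in the loop would keep only the last flag, which is why the choice of operator on a label matters.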
- Declaration block and optional return action block.
- The declaration block is inserted before any generated code, and the optional return action block
is appended after any generated code.
- Variables for a rule (especially the return variable) are usually declared in the
declaration block to ensure that they are available throughout the rule method. Since the code
generator may generate local code blocks, a variable declared inside a rule action may not be available
outside the user action where it is declared. It is also good practice to initialize the return value
variable to avoid uninitialized variable warnings when exception handling code is generated.
- Actions in the return action block are executed after all generated code, in
particular after the tree building code and the error handling code. It is good practice, and sometimes required, to
always put the return statement for the rule in the return action block.
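The scoping argument can be seen in a small Java sketch of how a generated rule method might be laid out. The layout is an assumption made for illustration (LLK's real output may differ), and RuleShapeSketch with its integer-parsing body is a hypothetical stand-in for generated parsing code.

```java
// Hypothetical shape of a generated rule method; LLK's real output may differ.
class RuleShapeSketch {
    static int rule(String input) {
        // --- declaration block: visible throughout the method ---
        int result = -1;               // initialized so every path has a value

        // --- generated code, which may introduce local blocks ---
        try {
            {
                int local = Integer.parseInt(input.trim());  // user action scope
                result = local * 2;
            }
            // 'local' is out of scope here; 'result' is still visible.
        } catch (NumberFormatException e) {
            // stand-in for generated error handling
        }

        // --- return action block: runs after all generated code ---
        return result;
    }
}
```

Because result is declared and initialized at the top, the final return compiles and has a defined value even when the error handling path is taken.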
- Public vs protected/private rules.
- Each rule is typically expanded into a method in the declared language. Unless otherwise
specified, most rules can be declared with access modifiers (public, protected and private) just like
a method declaration. However, modifiers sometimes take on special meanings beyond what they mean for method
declarations. Typically, a non-public rule can be referenced only from other rules in the
grammar.