LLK

General Grammar Syntax

General rule syntax

  • Basic EBNF rule syntax
    • 'c' is a character reference, valid only in a lexer grammar.
    • "string" is a string reference in a lexer grammar, and a literal reference in a parser or tree parser grammar.
    • <CHAR1..CHAR2, CHAR3 - CHAR4, CHAR5> is a character set reference, valid only in a lexer grammar. It denotes the set containing the characters from CHAR1 to CHAR2 plus CHAR3, excluding CHAR4 and CHAR5.
    • <TOKEN1, TOKEN2 - EXCLUDE1, EXCLUDE2> is a token set reference, valid only in parser and tree parser grammars.
    • <TOKEN> is a token reference, a degenerate form of the general token set syntax above.
    • rule() is a rule reference.
    • (Choice | Choice ...) is an EBNF block in which one of the choices must occur exactly once.
    • (Choice | Choice ...)? or [ Choice | Choice ] is an EBNF block in which one of the choices may occur zero times or once.
    • (Choice | Choice ...)+ is an EBNF block in which the choices must occur one or more times.
    • (Choice | Choice ...)* is an EBNF block in which the choices may occur zero or more times.
    • (Choice | Choice ...):n or (Choice | Choice ...):0,max is an EBNF block in which the choices occur n times, or between 0 and max times.

      Important:
      • This construct is still experimental.
      • To simplify follow set calculation, the ():min,max construct accepts only simple charSet or tokenSet elements. In particular, nested ()?, ()*, ()+, ruleRef or treeRef elements are invalid.
      • The ():n construct is also implicitly greedy, and LLK does not check it for choice follow conflicts.
    • #node[:label] specifies the node name for a rule or an element. An element descriptor may have an optional label (to be used in place of llkThis), since multiple element nodes may be declared in a rule.
    Studying LLK.ll and the other distributed grammars is the easiest way to understand the detailed syntax and its use.
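The character set notation can be read as "everything listed before the minus sign, except anything listed after it". The following Java sketch models that reading. It is an illustration only, not code that LLK generates: the class CharSetSketch and its methods are hypothetical names introduced here, and the example set <'a'..'z', '_' - 'q', 'x'> is made up for the demonstration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of <include, ... - exclude, ...>; not part of LLK. */
final class CharSetSketch {
    // Each entry is an inclusive range {from, to}; a single character is {c, c}.
    private final List<char[]> included = new ArrayList<>();
    private final List<char[]> excluded = new ArrayList<>();

    CharSetSketch include(char from, char to) {
        included.add(new char[] {from, to});
        return this;
    }

    CharSetSketch exclude(char from, char to) {
        excluded.add(new char[] {from, to});
        return this;
    }

    private static boolean inAny(List<char[]> ranges, char c) {
        for (char[] r : ranges) {
            if (c >= r[0] && c <= r[1]) {
                return true;
            }
        }
        return false;
    }

    /** A character matches if it is included and not excluded. */
    boolean contains(char c) {
        return inAny(included, c) && !inAny(excluded, c);
    }

    /** Models the made-up set <'a'..'z', '_' - 'q', 'x'>. */
    static CharSetSketch example() {
        return new CharSetSketch()
                .include('a', 'z')
                .include('_', '_')
                .exclude('q', 'q')
                .exclude('x', 'x');
    }
}
```

With this model, CharSetSketch.example().contains('a') is true, while contains('q') and contains('0') are both false.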
  • Alias rule
  • An alias rule has the following syntax:
    alias_rule_name = target_rule_name | <charset>;
    
    Here target_rule_name is the name of another rule. The target may itself be an alias rule, as long as the chain contains no cycle and ends at a normal rule. An alias rule is simply another name for its target: every reference to the alias is replaced by a reference to the target rule before code generation. In a lexer grammar, the right-hand side can also be a charset. In that case, the named charset can later be referenced through the alias name, for example:
    ALPHANUM = <'a'..'z', '0'..'9'>;
    
    void IDENTIFIER_START() :
    {
        <ALPHANUM, '_'>
    }
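The substitution described above amounts to following the alias chain until a normal rule is reached, and rejecting chains that loop. A minimal Java sketch of that idea follows. It is not LLK's actual implementation: AliasSketch, alias(), rule() and resolve() are hypothetical names, and charset aliases are left out for brevity.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical alias table; illustrates chain resolution only. */
final class AliasSketch {
    private final Map<String, String> aliases = new HashMap<>();
    private final Set<String> normalRules = new HashSet<>();

    /** Records "name = target;". */
    void alias(String name, String target) {
        aliases.put(name, target);
    }

    /** Records a normal (non-alias) rule. */
    void rule(String name) {
        normalRules.add(name);
    }

    /** Follows the alias chain from name and returns the normal rule at its end. */
    String resolve(String name) {
        Set<String> seen = new HashSet<>();
        String current = name;
        while (aliases.containsKey(current)) {
            if (!seen.add(current)) {
                throw new IllegalStateException("alias cycle at " + current);
            }
            current = aliases.get(current);
        }
        if (!normalRules.contains(current)) {
            throw new IllegalStateException("alias chain does not end at a normal rule: " + current);
        }
        return current;
    }
}
```

For example, after alias("a", "b"), alias("b", "c") and rule("c"), a call to resolve("a") returns "c"; a pair of aliases that point at each other makes resolve() throw.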
    
  • Empty rule.
  • Sometimes a lexer, parser or tree parser rule returns different token types depending on context. Since each rule can declare only one token type, the other token types have to be declared separately as empty rules. Example:
    REAL; DECIMAL; OCTAL; HEX;
    DOT; RANGE;
    
    void NUMBER() : {}
    {
        %LOOKAHEAD(( Digit() )+ ( '.' | 'e' | 'E' ))
        Real() ( RealSuffix() )?
        { llkType=REAL; }
        |
        llkType=Dot()
        |
        Int() ( IntSuffix() )?
        { llkType=DECIMAL; }
        |
        Octal() ( IntSuffix() )?
        { llkType=OCTAL; }
        |
        Hex() ( IntSuffix() )?
        { llkType=HEX; }
    }
    
    protected int Dot(): {
        int type=DOT;
    }
    {
        '.' 
        (
            '.'
            { type=RANGE; }
            |
            ( Digit() )+ ( Exponent() )? ( RealSuffix() )?
            { type= REAL; }
        )?
    }
    { return type; }
    
    Here the lexer rule NUMBER does not return a token of type NUMBER but one of type REAL, DECIMAL, etc., depending on the parse result. REAL, DECIMAL, etc. are declared as empty lexer rules.
  • DOT and RANGE are also declared as empty rules instead of literal rules. If they were declared as literal rules, those literal rules would conflict with the Dot() rule, which is required to parse REAL. Note that since DOT and RANGE are not declared as literal rules, they cannot be used literally (as "." or "..") in the parser grammar; they have to be referenced as <DOT> and <RANGE>.
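To make the Dot() rule's behaviour concrete, here is a hand-written Java sketch of the same decision: after a leading '.', a second '.' yields RANGE, a digit yields REAL, and anything else leaves the type as DOT. This is not the code LLK generates. The class DotSketch, the numeric constant values and the dot(String, int) signature are assumptions made for illustration, and the exponent and suffix parts of REAL are omitted.

```java
/** Hypothetical hand-written equivalent of the Dot() decision; not LLK output. */
final class DotSketch {
    // Token type constants; the values are arbitrary for this sketch.
    static final int DOT = 1;
    static final int RANGE = 2;
    static final int REAL = 3;

    /** Classifies the token that starts with '.' at position pos of in. */
    static int dot(String in, int pos) {
        if (pos >= in.length() || in.charAt(pos) != '.') {
            throw new IllegalArgumentException("expected '.' at position " + pos);
        }
        int type = DOT; // default, as in the declaration block of Dot()
        int next = pos + 1;
        if (next < in.length()) {
            char c = in.charAt(next);
            if (c == '.') {
                type = RANGE; // ".."
            } else if (c >= '0' && c <= '9') {
                type = REAL;  // '.' followed by digits
            }
        }
        return type; // single exit point, as in the return action block
    }
}
```

Under these assumptions, dot(".", 0) gives DOT, dot("..", 0) gives RANGE, and dot(".5", 0) gives REAL.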
  • Labels.
  • Each reference (charSet(), tokenSet(), ruleRef() or treeRef()) can be prefixed by a label and an operator. Example:
    int rule(int mod) : {
        LLKToken start, end;
        boolean ok;
    }
    {
        start="{" ok=MatchPair(LBRACE, RBRACE, 1) end="}"
        |
        mod|=modifiers(mod)
    }
    
    LLK accepts the following operators:
    "=" | "|=" | "&=" | "^=" | "+=" | "-=" | "*=" | "/=" | "%=" | "<<=" | ">>=" | ":"
    
    The operators work just as they do in Java (and other languages). ":" works like "=" except that the label is declared automatically, as in ANTLR. For treeRef(), LLK also automatically casts the right-hand side of the ":" operator to the label type. A label can be used multiple times, as long as each context expects a type that is compatible with the type of the label.
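Since the operators behave as in Java, a labelled reference such as mod|=modifiers(mod) can be read as an ordinary compound assignment whose right-hand side is the value returned by the referenced rule. The Java sketch below shows that reading only. It does not reproduce LLK's generated code: LabelSketch, modifier() and the bit constants are hypothetical, and a plain method call stands in for the rule reference.

```java
/** Hypothetical illustration of "label |= ruleRef()"; not LLK output. */
final class LabelSketch {
    // Made-up modifier bits.
    static final int PUBLIC = 1;
    static final int STATIC = 2;
    static final int FINAL = 4;

    /** Stands in for a rule reference that returns an int. */
    static int modifier(String word) {
        switch (word) {
            case "public": return PUBLIC;
            case "static": return STATIC;
            case "final":  return FINAL;
            default:       return 0;
        }
    }

    /** Accumulates modifier bits into the label 'mod', one reference per word. */
    static int rule(int mod, String... words) {
        for (String w : words) {
            mod |= modifier(w); // reads like: mod|=modifier()
        }
        return mod;
    }
}
```

Here rule(0, "public", "final") evaluates to PUBLIC | FINAL, and the incoming value of mod is preserved because "|=" only adds bits.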
  • Declaration block and Optional return action block.
  • The declaration block is inserted before any generated code, and the optional return action block is appended after any generated code.
  • Variables for a rule (especially the return variable) are usually declared in the declaration block to ensure that they are available throughout the rule method. Since the code generator may generate local code blocks, a variable declared inside a rule action may not be visible outside the user action where it is declared. It is also good practice to initialize the return variable, to avoid uninitialized variable warnings when exception handling code is generated.
  • Actions in the return action block are executed after all generated code, in particular after the tree building code and the error handling code. It is good practice, and sometimes required, to always put the return statement for the rule in the return action block.
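The layout described above can be pictured as a Java method with three parts: declarations first, generated matching and error handling code in the middle, and the return statement last. The sketch below is hand-written under that assumption and is not actual LLK output. RuleShapeSketch and digitCount() are hypothetical, and a simple loop with a try/catch stands in for the generated code.

```java
/** Hypothetical shape of a rule method; illustrates block placement only. */
final class RuleShapeSketch {
    static int digitCount(String input) {
        // Declaration block: placed before any generated code.
        // Initializing the return variable keeps the final 'return' valid
        // even when the code in the try block exits early.
        int count = 0;

        try {
            // Stands in for generated matching code.
            for (int i = 0; i < input.length(); i++) {
                { // local block: 'c' is not visible after the closing brace
                    char c = input.charAt(i);
                    if (c < '0' || c > '9') {
                        throw new IllegalStateException("unexpected '" + c + "'");
                    }
                }
                count++;
            }
        } catch (IllegalStateException e) {
            // Stands in for generated error handling code.
            count = -1;
        }

        // Return action block: runs after matching and error handling.
        return count;
    }
}
```

Because count is declared in the outermost scope, it is visible in the try block, the catch block and the final return. A variable declared inside the inner braces, like c, would not be.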
  • Public vs protected/private rules.
  • Each rule is typically expanded into a method in the declared language. Unless otherwise specified, most rules can be declared with an access modifier (public, protected or private), just like a method. However, the modifiers sometimes take on special meanings beyond what they mean for method declarations. Typically, a non-public rule can only be referenced from other rules in the grammar.