LLK

Lookahead Syntax

Lookahead syntax

There are different types of lookaheads:
// Local lookahead options
%LA=n | DEBUG

// Semantic lookahead
%LA([n, [max,]] {...})

// Fixed syntactic lookahead
%LA(n)

// Variable syntactic lookahead
%LA([n, [max,]] ...)

// Mixed syntactic lookahead
%LA([n, [max,]] (... {...})+)
  • Local lookahead options
  • Local lookahead option specify options for the whole Choices block (instead of just for the individual alternative as for syntax or semantic lookaheads).
  • %LA=n specify the local lookahead depth, k, for the Choices block which override the global Lookahead=n option. It is like saying %LA(n) for each alternative, but there is a subtle difference. %LA(n) would suppress conflict warnings while %LA=n would not.
  • %LA=DEBUG ask LLK to dump debug information for that Choices block.
  • Fixed syntactic lookahead
  • Lookahead the specified number of tokens (n) from the lookahead set of the alternative. If matched, commit to the alternative.
  • CAUTION: Since LLK lookahead set is a linear approximation, beware of pitfalls when using fixed lookahead of depth more than 1. Example:
    %LA(2)
    [ "extern" ] DestructorDeclaration()
    |
    [ modifiers() ]
    (
        Type()
        ...
    
    Here, %LA(2) for the first alternative see lookahead sets at k=1 { EXTERN, TILDE }, k=2 { TILDE, IDENTIFIER }. This actually included token sequence EXTERN IDENTIFIER in the alternative below. There is no conflict warning. LLK assumed user is intentionally giving priority to the DestructorDeclaration() through the explicit lookahead clause, which is incorrect. More specific lookahead eg. %LA( ["extern"] "~") should be used here.
  • Semantic lookahead
  • If the semantic predict boolean expression returns true, commit to the alternative. Fixed syntactic looakhead can be used together with semantic lookahead, example:
    (
        %LA(1, { !(LA(2) == IDENTIFIER && LA(3) == ASSIGN) })
        <COMMA>
        Expression()
    )*
    
    equivalent to:
    
    (
        %LA(0, 1, { LA1()==COMMA && !(LA(2) == IDENTIFIER && LA(3) == ASSIGN) })
        <COMMA>
        Expression()
    )*
    
    would generate code like this:
    _loop1 : while (true) {
    if ((LA1() == COMMA) && (!(LA(2) == IDENTIFIER && LA(3) == ASSIGN))) {
        llkInput.consume();
        Expression();
    } else {
        break _loop1;
    }
    
    Note that there is an optional second integer, max, when the fixed looakhead is used with semantic or syntactic lookaheads. It specify number of tokens that LLK can assume the lookahead has verified (verified depth). Typically, it is same as the fixed lookahead, n. However, if the semantic lookahead expression checked additional tokens, max (> n) can be specified to inform LLK about that so it may skip some checkings. If neither n or max is specified, both are default to 0.
  • CAUTION Don't increase verified depth if the semantic predict is a superset of the alternative lookahead set. For example,
    %LA(1, { LA(2)!=PAREN })
    <COMMA> <IDENTIFIER>
    |
    ...
    
    Here, it is still neccessary to check that LA(2) is indeed an IDENTIFIER.
  • Variable/Mixed syntactic lookahead
  • Lookahead whatever tokens specified in the syntactic predict. Syntactic predict is just like a rule except that no lookahead and user actions are allowed. LLK generate a seperate method for each syntactic predict (eg. _synPredict0()).
  • LLK allow semantic predicts, that looks like a user action, to be embedded inside a syntactic predict. Predict fail if any embedded semantic predict evaluated as false. For example:
    %LA(modifiers() { LA1() != CLASS })
    
    Here, LA1() look at the token that follows modifiers().
  • Note that unlike pure semantic predict that is evaluate inline inside the rule method, embedded semantic predicts are evaluated in the syntactic predict method (eg. _synPredict0()). As such, embedded semantic predicts should not reference any local variables in the rule method.
  • Variable syntactic lookahead may reference other rules. That works very much like the normal ruleRef(). However, LLK generate a separate set of methods (syntactic predict rule methods with name eg. _syn_modifiers()) for each rule that is involved in syntactic predicts. Syntactic predict methods, however, discard all the normal parameters, and user actions in the rule. And so, typically, ruleRef() inside variable syntactic predict have no arguments even though the normal rule method require arguments. Example:
    void AttributeSection(FormatBuffer buf) #AttributeSection : {}
    {
        %LA(attributeTarget() <COLON>)
        attributeTarget(buf)
        ...
    }
    
    void attributeTarget(FormatBuffer buf) #void : {}
    {
        ...
    }
    
    Note that lookahead to attributeTarget() has no argument although the attributeTarget() rule requires a FormatBuffer parameter.

Advanced lookahead topics

  • There are some occassions that syntactic lookahead may need to pass variables to the referenced rules. For example, the callee has semantic predicts that needs reference to the variables. There are several constructs in LLK to cater for such situations, namely, Inlined Sytactic Lookahead, Syntactic Lookahead Parameter and Syntactic Lookahead Actions. Here is an example:
    void typeSpecifier(... % int count /* Syntactic Lookahead Parameter */) : {}
    {
        ...
        %LA({ count == 0 })
        typedefName()
        ...
    }
    
    void declSpecifiers() : {}
    %{ int count = 0; } // Syntactic Lookahead Variable Declaration
    {
        (
            ...
            %LA %(typeSpecifier(count)) // Inlined Syntactic Lookahead
            typeSpecifier(count)
            %{ ++count; } // Syntactic Lookahead Action
            ...
        )+
    }
    
  • Syntactic Lookahead Parameter
  • Here typeSpecifier() required to use parameter count for the semantic predict. typeSpecifier() is also used in syntactic predicts, so a syntactic predict method _syn_typeSpecifier() has to be generated. The '%' separator ask LLK code generator to use parameters after '%' in the syntactic lookahead method, ie. generate _syn_typeSpecifier(int count) instead of _syn_typeSpecifier(). The normal rule method typeSpecifier(..., int count) has both the normal parameters and the syntactic lookahead parameters.
  • Inlined Syntactic Lookahead
  • Typical syntactic lookahead inside %LA() cannot pass arguments to the referenced rules. Because such predicts are generated in a separate method that take no arguments (eg. _synPredict0()). In order to pass variables to a syntactic lookahead, the special Inlined Syntactic Lookahead syntax has to be used to ask LLK to inline the method call so that variables in the caller method can be used as arguments in the syntactic lookahead. For example, LLK would generate code like this for the above example:
    _synMark() && _synRewind(typeSpecifier(count)) 
    
    as part of the lookahead expression. Inlined syntactic lookahead syntax only allow a single ruleRef(). For more complicate syntactic lookahead, package the lookahead into a dedicated rule and call it using the inlined syntactic lookahead syntax.
  • Syntactic Lookahead Parameter is also a normal parameter for the normal rule method. Thus the count argument is required for typeSpecifier() in both the syntactic lookahead and the normal rule reference.
  • Syntactic Lookahead Action
  • Typically, LLK strip all user action from the syntactic lookahead methods. To execute actions in syntactic lookahead methods, the syntactic lookahead action syntax %{ action(); } is required. As for the syntactic lookahead parameter, the syntactic lookahead action are generated in both the normal and the syntactic lookahead methods. For example, %{ ++count; } in the above example is required to mantain the state of the count variable inside the syntactic lookahead method _syn_declSpecifiers() as well as in the normal rule method declSpecifiers().
  • Note that count is incremented in declSpecifiers(), not in typeSpecifier(). Otherwise, it would be incremented in both the syntactic lookahead method call and the normal rule method call to typeSpecifier().
  • Return values from Lookahead
  • By passing an array or an object to an Inlined Syntactic Lookahead, it is possible to return values to the caller. Example:
    void rule1(): {}
    {
        ...
        %LA (MemberName() <LPAREN>)
        MethodDeclaration(mod, attrs)
        |
        %LA(MemberName() <LBRACE>)
        PropertyDeclaration(mod, attrs)
        ...
    }
    
    Here the two lookaheads have same prefix MemberName() and require to look into MemberName if it is not a MethodDeclaration. The second lookahead into MemberName() can be avoided if a flag is returned from the first lookahead:
    void rule1() : {
        int[] ret=new int[1]; 
    }
    {
        ...
        %LA %(lookaheadMemberName(ret))
        MethodDeclaration(mod, attrs)
        |
        %LA({ ret[0] == OPEN_BRACE })
        PropertyDeclaration(mod, attrs)
        ...
    }
    
    void lookaheadMemberName(% int[] ret) {
    }
    {
        MemberName()
        %{ ret[0]=LA1(); }
        <PAREN>
    }
    
    Here, lookaheadMemberName() returns the token after MemberName() and thus a simple semantic predict can be used thereafter to check what follows MemeberName().
  • %LA(1) and %LA(1, {...}) vs %LA(<COMMA>)
  • There is subtle difference between the first two and the third. The first two, fixed lookaheads, look into the LookaheadSet of the alternative while the later looks into the LookaheadSet of the specified syntactic lookahead (ie. <COMMA>).
  • %LA(1, { LA(2)==IDENTIFIER }) vs %LA(<COMMA> { LA1()==IDENTIFIER })
  • The two lookaheads are functionally the same, but there is subtle difference between the semantic predicts in the two lookaheads. The semantic predict in the first lookahead is looking ahead from offset before the COMMA token while in the second, it lookahead from offset after COMMA.
  • Is alternatives checked in order or in parallel?
  • LLK generate a switch() statement for alternatives whose lookahead set are mutually exclusive. If there are conflicts, LLK generate a series of if/else block in order that the alternatives appear in the grammar.
  • Cautions when using %LA
  • User has to be careful when adding %LA to resolve conflicts. An alternative with lookahead means whenever the lookahead is met, commit to the alternative. %LA give the alternative priority over other conflicting alternatives and suppress any conflict warnings. It is the user's responsibility to make sure that the lookahead is resolving the conflict correctly. Example:
    void ClassMember(): {}
    {
        %LA(1)
        ConstructorDeclaration()
        |
        MethodDeclaration()
        ...
    }
    
    void ConstructorDeclaration() : {}
    {
        <IDENTIFIER> <OPEN_PAREN>
        ...
    }
    
    void MethodDeclaration(): {}
    {
        type() qualifiedIdentifer() <OPEN_PAREN>
        ...
    }
    
    Here ConstructorDeclaration() and MethodDeclaration() has conflict, however, it is not reported. %LA(1) has given priority to ConstructorDeclaration() whenever an IDENTIFIER is seen although that lookahead is not sufficient to resolve the conflict. %LA(2) is requried here.

Error action and empty action.

  • LLK accept an action only alternative inside a EBNF block. Example:
    ...
    ( 
        "a" 
        | { llkError("a is missing"); } 
    )
    ( 
        "b" 
        | { System.err.println("There is no optional b."); } 
    )?
    ...
    
    For () and ()+ blocks, it is called and error action. For ()? and ()* block, it is called an empty action.
  • Error action is executed when none of the expected alternatives is matched. The action should call llkError(String msg) to throw a LLKParseException with a meaningful error message. In order to be consistent with the EBNF semantic, error action MUST throw a LLKParseException since it is handling an error condition.
  • Empty action is executed when there is zero occurence of the given block (which is a legal condition). In order to be consistent with the EBNF semantic, the empty action should not be used to handle NoViableAlternative error. () and ()+ block with error action should be used for such purpose.