LLK

Lexer Grammar Syntax

Lexer-specific grammar syntax

  • llkNextToken() rule.
  • Unlike parser rules, lexer rules are not called directly by the user. The top-level lexer rule, llkNextToken(), is usually tedious to write by hand, so LLK synthesizes it implicitly. The synthesized llkNextToken() rule has a ruleRef() to each public (literal or lexer) rule in the lexer grammar that is not annotated with #void, and uses these as its alternatives. It also implicitly resolves as many conflicts among them as it can.
  • To stop LLK from synthesizing llkNextToken(), specify a custom llkNextToken rule in the lexer grammar:
    public void llkNextToken() ::
    {
        ...
    }
    
    A custom llkNextToken rule differs from a normal lexer rule in several ways. In particular, it uses two global variables, _token and _llkType, to produce the returned token. _token holds the token to be returned and is implicitly initialized to null before the llkNextToken rule is invoked. If _token is still null when the rule finishes, a new token of the type given by _llkType is created and returned. See the sample grammars for detailed usage.
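
The _token / _llkType contract described above can be modelled in plain Java. This is only a sketch under stated assumptions: Token, NextTokenDriver, DemoDriver and the next() wrapper are hypothetical names invented for illustration, and the code is not what LLK actually generates.

```java
// Hypothetical model of the token-return contract; NOT LLK's generated code.
final class Token {
    final int type;
    final String text;

    Token(int type, String text) {
        this.type = type;
        this.text = text;
    }
}

abstract class NextTokenDriver {
    // Stand-ins for the two globals named in the text.
    protected Token _token;
    protected int _llkType;

    // Plays the role of the user-written llkNextToken rule body.
    // 'matched' is the text the rule is assumed to have consumed.
    protected abstract void llkNextTokenRule(String matched);

    // Wrapper showing the order of operations.
    final Token next(String matched) {
        _token = null;                 // reset before the rule is invoked
        llkNextTokenRule(matched);
        if (_token == null) {          // rule left it unset: build from _llkType
            _token = new Token(_llkType, matched);
        }
        return _token;
    }
}

final class DemoDriver extends NextTokenDriver {
    static final int WORD = 1;
    static final int NUMBER = 2;

    @Override
    protected void llkNextTokenRule(String matched) {
        if (!matched.isEmpty() && Character.isDigit(matched.charAt(0))) {
            // Explicit path: the rule creates the token itself.
            _token = new Token(NUMBER, matched.trim());
        } else {
            // Implicit path: only the token type is set.
            _llkType = WORD;
        }
    }
}
```

In this model a rule that needs a customized token (here, trimmed text) assigns _token directly, while a rule that only classifies its input sets _llkType and leaves token creation to the wrapper.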
  • Lexer Context
  • Since v0.4, LLK supports a context-switching construct in the lexer grammar. It is particularly useful in a custom llkNextToken() rule for implementing a context-dependent lexer.
    public void llkNextToken() ::
    {
        switch(CONTEXT) {
            case TAG
            tag()
            |
            case STRING
            string()
            |
            case TEXT:
            text()
        }
    }
    
    The rule above declares a lexer context named CONTEXT with four states (CONTEXT_NONE, CONTEXT_TAG, CONTEXT_STRING and CONTEXT_TEXT). Only one state can be active at any time, so the three alternatives in the switch() construct are implicitly mutually exclusive. The builtin method llkSetContext(int context, int state) sets the state of the specified context. The parser can also change the state of a lexer context with llkSetContext(int context, int state, ILLKToken lt0), which additionally takes care of rewinding the token stream to the end of lt0.
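
The dispatch that such a context switch implies can be sketched in plain Java. The class ContextLexerSketch and its setContext()/nextToken() methods are hypothetical stand-ins, not LLK's runtime API, and throwing in the CONTEXT_NONE state is an assumption made only to keep the sketch complete.

```java
// Hypothetical model of a single lexer context; NOT LLK's runtime.
final class ContextLexerSketch {
    static final int CONTEXT = 0;          // index of the only context here

    static final int CONTEXT_NONE = 0;
    static final int CONTEXT_TAG = 1;
    static final int CONTEXT_STRING = 2;
    static final int CONTEXT_TEXT = 3;

    // One current state per context, so the states are mutually exclusive.
    private final int[] states = { CONTEXT_NONE };

    // Stand-in for the two-argument llkSetContext described in the text.
    void setContext(int context, int state) {
        states[context] = state;
    }

    // Exactly one alternative is chosen, based on the current state.
    String nextToken() {
        switch (states[CONTEXT]) {
            case CONTEXT_TAG:
                return tag();
            case CONTEXT_STRING:
                return string();
            case CONTEXT_TEXT:
                return text();
            default:
                // Assumption: no alternative applies in the NONE state.
                throw new IllegalStateException("no alternative for state " + states[CONTEXT]);
        }
    }

    // Placeholder sub-rules; a real lexer would consume input here.
    private String tag() { return "tag"; }
    private String string() { return "string"; }
    private String text() { return "text"; }
}
```

Because the state is a single integer per context, no lookahead is needed to choose between the alternatives; the choice is made before any input is examined.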
  • Keyword rules
  • LLK lexer grammars use special syntax to declare keywords and literals explicitly. Both receive special treatment and optimization during code generation.
  • Keywords are declared in a KEYWORDS section and entered into a keyword table, which can be queried with the builtin int llkLookupKeyword(...) methods in the actions of any lexer rule. Keywords are public and represent valid token types that can be referenced in parsers.
  • Multiple string values can be declared for each keyword token. Example:
    %KEYWORDS {
        PUBLIC = "public";
        PROTECTED = "protected";
        CONST = "const" | "__const" | "__const__";
        case Property {
            GET = "get";
            SET = "set";
        }
        ...
    }
    
  • Keyword contexts
  • Keywords can optionally be qualified by a keyword context, so that a keyword is recognized only while the corresponding context is activated. Keyword contexts are deactivated by default. A keyword context is activated with the llkEnableKeyword() builtin method and deactivated with llkDisableKeyword(); both are called from parsers, not from the lexer. Multiple keyword contexts can be active at the same time, and they are independent of the lexer rule contexts.
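
A keyword table with optional contexts can be modelled roughly as follows. This is a sketch only: KeywordTableSketch, declare(), enable(), disable() and lookup() are hypothetical names, the real signatures of llkLookupKeyword(), llkEnableKeyword() and llkDisableKeyword() are not given in this document, and returning -1 for a non-keyword is an assumption.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical model of a context-aware keyword table; NOT LLK's runtime.
final class KeywordTableSketch {
    static final int NOT_A_KEYWORD = -1;   // assumed "no match" value

    private static final class Entry {
        final int type;
        final String context;              // null means "always active"

        Entry(int type, String context) {
            this.type = type;
            this.context = context;
        }
    }

    private final Map<String, Entry> table = new HashMap<>();
    private final Set<String> enabled = new HashSet<>();   // all contexts start disabled

    // One token type may have several spellings, as with CONST in the example.
    void declare(int type, String context, String... spellings) {
        for (String s : spellings) {
            table.put(s, new Entry(type, context));
        }
    }

    void enable(String context) { enabled.add(context); }

    void disable(String context) { enabled.remove(context); }

    // Returns the token type, or NOT_A_KEYWORD if the text is not a keyword
    // or its context is currently disabled.
    int lookup(String text) {
        Entry e = table.get(text);
        if (e == null) {
            return NOT_A_KEYWORD;
        }
        if (e.context != null && !enabled.contains(e.context)) {
            return NOT_A_KEYWORD;
        }
        return e.type;
    }
}
```

With this model, "get" would be treated as an ordinary identifier until the Property context is enabled, and several contexts can be enabled at once because the enabled set is independent of any lexer state.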
  • Literal rules
  • Literal rules declare simple string literal tokens. A literal rule can contain only stringRef() elements and nothing else (i.e. no actions, no return value, etc.). A literal rule can optionally be qualified by a context, just like a normal lexer rule. Multiple string values can be declared for each literal rule. Example:
    // Literal rules
    ASSIGN: "=" | ":=";
    EQUAL: "==";
    ...
    
    Since literals have fixed lengths, the code generator can automatically resolve conflicts between literal rules by giving the longest match the higher priority. Conflicts with normal lexer rules still need to be resolved explicitly in the grammar. Literal rules can be public or protected.
  • Normal lexer rules
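
The longest-match priority among literals can be illustrated with a small Java sketch. LiteralMatcherSketch is a hypothetical class written for this illustration; it uses a simple linear scan, whereas the code LLK generates is presumably more optimized and is not shown in this document.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical longest-match resolver for fixed-length literals; NOT LLK's generated code.
final class LiteralMatcherSketch {
    // Maps each spelling to the name of the literal rule that declared it.
    private final Map<String, String> literals = new LinkedHashMap<>();

    void declare(String ruleName, String... spellings) {
        for (String s : spellings) {
            literals.put(s, ruleName);
        }
    }

    // Returns the rule whose spelling is the longest literal starting at pos,
    // or null if no literal matches there.
    String match(String input, int pos) {
        String best = null;
        int bestLen = 0;
        for (Map.Entry<String, String> e : literals.entrySet()) {
            String spelling = e.getKey();
            if (spelling.length() > bestLen && input.startsWith(spelling, pos)) {
                best = e.getValue();
                bestLen = spelling.length();
            }
        }
        return best;
    }
}
```

With ASSIGN declared as "=" or ":=" and EQUAL as "==", the input "==" matches both "=" and "==" at position 0, and the longer spelling wins, so EQUAL is chosen.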
  • Normal lexer rules have standard LL(k) rule syntax and can be public, protected or private. Unless annotated with the #void attribute, a normal lexer rule implicitly returns the token type (via the llkType variable) to its caller, typically llkNextToken(). Normal lexer rules accept the following attributes: SPECIAL, IGNORE, IGNORE_CASE and void. If the token type is not TokenType.IGNORE, the llkNextToken() rule implicitly creates an LLKToken from the token type. A public lexer rule can also return an LLKToken explicitly (using the llkCreateToken() builtin methods); in that case, llkNextToken() simply returns that token. Example:
    void COMMENT() #SPECIAL : {}
    {
        "/*"
        (
            %LOOKAHEAD(1, {LA(2) != '/' })
            '*'
            |
            '\n'
            { llkInput.newline(); }
            |
            ( "\r\n" | '\r' )
            { llkInput.newline(); }
            |
            ~ < '*' | '\r' | '\n' >
        )*
        "*/"
    }
    
    LLKToken CHARACTER(): {
        int c=0;
    }
    {
        '\'' ( c=ESCAPE() | ~  '\'' { c=LA0(); } ) '\''
    }
    { return llkCreateToken(llkType, c); }
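
For readers unfamiliar with the grammar notation, the control flow of the COMMENT rule above corresponds roughly to the hand-written Java loop below. This is a sketch under assumptions: CommentScanSketch and scanComment() are hypothetical names, the newlines counter stands in for llkInput.newline(), and throwing on an unterminated comment is a choice made for this illustration.

```java
// Hand-written equivalent of the COMMENT rule's loop; NOT LLK's generated code.
final class CommentScanSketch {
    int newlines = 0;   // stands in for calls to llkInput.newline()

    // Scans a block comment starting at pos and returns the index just past "*/".
    int scanComment(String in, int pos) {
        if (!in.startsWith("/*", pos)) {
            throw new IllegalArgumentException("expected /* at " + pos);
        }
        int i = pos + 2;
        while (i < in.length()) {
            char c = in.charAt(i);
            if (c == '*') {
                // Mirrors the lookahead check: a '*' only continues the
                // comment body when the next character is not '/'.
                if (i + 1 < in.length() && in.charAt(i + 1) == '/') {
                    return i + 2;
                }
                i++;
            } else if (c == '\n') {
                newlines++;
                i++;
            } else if (c == '\r') {
                // "\r\n" and a lone '\r' each count as one line break.
                newlines++;
                i += (i + 1 < in.length() && in.charAt(i + 1) == '\n') ? 2 : 1;
            } else {
                i++;    // any other character
            }
        }
        throw new IllegalArgumentException("unterminated comment");
    }
}
```

The explicit newline branches are what keep line numbers accurate inside multi-line comments, which is why the grammar rule calls llkInput.newline() in those alternatives.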