LLK

TreeParser Generator Notes

TreeParser Generator

TreeParser generator generate a template tree parser grammar from a parser grammar. The generated treeparser grammar may not be valid, but it save a lot of tedious work in converting a parser grammar to a treeparser grammar.

Pass 1

  • All literals, tokenRef() are stripped.
  • All lookaheads (syntactic or semantic) are removed.
  • All user actions are removed.
  • All labels are removed.
  • All try/catch/finally constructs are removed. The Choices inside the try{} block would stay.
  • All rule parameters and ruleRef() arguments are removed.
  • Code rules are removed (Any reference to it would have been converted to a tokenRef() since a code rule never has children.
  • Rule that do not create nodes or reference to rule (in the parser grammar) that create nodes are be removed.
  • Rule that create node but create no children would be removed (Any reference to it should have been translated to an empty treeRef()).
  • Rule that create node and reference to rule (in the parser grammar) that create children would stay.
  • RuleRef() to rule that is removed would be removed.
  • RuleRef() to rule that create node that has no children would be translate to an empty treeRef().
  • RuleRef() to rule that create node and has children (ie. referenced to rules, in the parser grammar, that create node) would be translated to a treeRef() with a ruleRef() as child.
  • Remove private rules and references to private rules
  • Parser often pop nodes from the node stack and save it as a node attribute instead of adding it to the output AST. To avoid interpreting llkTree.pop() statements in the action block, LLKTreeParserGenerator assumed a convention that such nodes are declared as a private rule. Any private rule are removed and reference to private rules are simply ignored (assuming that they are not added to the output AST).
  • Remove duplicated elements
  • Alternative that is duplicate of others would be eliminated. Usually such alternatives are distinct in the parser grammar due to lookahead, literals or tokenRef(). Example:
    a() | a() | a() (b())*
    
    the first and second alternatives would be removed.
  • Sideward merge of elements
  • Element that can be merged with its neighbour are merged. Again, such elements are distinct in the parser grammar but not in the treeparser grammar:
    a() [b()] (b())* -> a() (b())*
    a() b() (b())* -> a() (b())+
    
  • An alternative that generate no node in the parser grammar becomes an empty alternative in the treeparser grammar. When a Choices block has empty alternatives, all the empty alternatives are removed and one of the remaining alternatives would be made optional if there is not already one.
  • Upward merge of elements
  • The output tree parser grammar often has a lot of simple nested elements. If an element has single element as child, the two levels can be merged into one and their EBNF modifiers are merged. Example:
    ( (a())+ )? -> ( a() )*
    
  • Pullup optional choice.
  • Since parser discard most tokens, it is often that a non optional choice in the parser becomes an optional choice in the treeparser. An optional choice is often not a valid LLK grammar. When there is an optional choice, the whole Choices block should be make optional instead. In such cases, tree parser generator perform an upward merge of optional choice to its parent Choices block (actually merge to the element that contains the parent Choices block) to fix the grammar. Example:
    (
        ...
        |
        [ #(parameterList parameterList()) ]
        |
        ...
    )
    
    =>
    (
        ...
        |
        #(parameterList parameterList())
        |
        ...
    )?
    
  • TODO: If the optional choice is under a rule, pullup should make all references to the rule optional.
  • Element node translation
  • For each element with node descriptor, create a rule for the node type using content of the element. Convert the element to a treeRef() to the created rule. If the element is conditional, clone the given number of children of the element as an alternative. Example:
    ( a() (b())* ) #node(>1)
    
    =>
    ( a() | #(node node_()) )
    ...
    void node_() : {}
    {
        a() (b())*
    }
    
  • TODO: Currently, children are not yet counted and all children of the conditional element are included as an alternative.
  • TODO: Left factoring of alternatives.
  • 1
    a() b()
    |
    a() c()
    
    =>
    a()
    ( b() | c() )
    
    2
    a() b()
    |
    ( a() c() )+
    
    =>
    a()
    ( b() | c() )
    ( a() c() )*
    

Bugs and limitations

The generated treeparser grammar may not be valid. In particular, the following cases are not handled:
  • Tree manipulated in user action.
  • Any manipulation of the AST in user action of the parser grammar are ignored. The generator do not interpret content of user actions. This can sometimes be avoided by using private parser rules.
  • Conditional nodes.
  • When nodes are created conditionally, LLKTreeParserGenerator generate two alternatives: a reference to the created node and a clone of the element. That is a superset of what would possibly generated. User need to trim cases that would not possibly happen due to the conditional expression.
  • Output treeparser grammar may has conflicts.
  • The generator do not implicitly add lookaheads nor perform left factoring (for now) to resolve any conflicts that may arise.