Parsers

Parsers use grammars to configure how they consume input and produce output.


Parsers implement grammars and can be either interpretive or compiled. Interpretive parsers such as NETS read the grammar and use it to parser some input and generate an output. Compiled parsers written in a language such as C or Java and are often generated from grammars. They also read some input and generate an output, but only for one grammar. GREP is an example of an interpreter for grammars and LEX/YACC is an example of a parser generator for grammars. Grammars are rarely portable between interpretive and compiled parsers.

Input and Output

Parsers act upon some input and generate an output. This may be as simple as a one-to-one mapping (an echo) between source and destination or it may involve a complex domain specific mapping. It is not unusual for the mapping between the source domain model and the target domain model to be a multistage process resulting in some final output. Parsers and interpreters therefore need to read and write from common sources like files, pipes and memory but also complex forms like an Abstract Syntax Trees (AST). ASTs have much in common with the World Wide Web Consortium (W3C) Document Object Models (DOM) - a popular technology found in web browsers - having nodes with a hierarchy of parents and children. NETS uses a W3C DOM like model for input and output of ASTs. NETS uses a common model for addressing files, pipes, memory and DOMs.

The syntax for defining input and output is to use the stream type ('file', 'mem', 'pipe', 'document', 'element', 'attribute', 'entity', 'entityreference', 'text', 'comment' and 'processinginstruction') followed by a colon (':') followed by the stream name (with the exception of 'document', 'text', 'comment' and 'processinginstruction').

We have already seen examples of input and output configured for memory and files as command line parameters. These same parameters can be used in rules as attributes of primaries. The following is therefore equivalent.

# CLI
>nets-parser -grammar=default.g -input=default.in -output=default.out

(* NEBNF *)
start = "Hello World" echo="" input="default.in" output="default.out";

<!-- GXML -->
<rule id="start">
  <terminal echo="" input="default.in" output="default.out">Hello World</terminal>
</rule>

File, Memory and Pipe

File, memory and pipe input and output are specified as follows.

Name Description
File

File input and output can be specified using no prefix, the 'file' prefix or 'pipe' prefix. The first two forms open the file in non-buffered mode. The 'pipe' prefix uses buffered input output.

(* NEBNF *)
start = "Hello World" echo="" input="default.in" output="default.out";
start = "Hello World" echo="" input="file:default.in" output="file:default.out";
start = "Hello World" echo="" input="pipe:default.in" output="pipe:default.out";

NETS parser will overwrite existing files (either non-buffered or piped) by default. Using the'+' symbol at the end of the filename will open the file in append mode. This can be used with both output and error files specified on the command line and in a grammar. This feature can be useful for appending to log files. Append is ignored on non-file stream types.

(* NEBNF *)
start = "Hello World" echo="" input="pipe:default.in" output="pipe:default.out+";

Pipe

Pipe input and output is designed for stdin and stdout streams as well as being used in pipelines and for buffering file input and output (as shown above).

(* NEBNF *)
start = "Hello World" echo="" input="pipe:stdin" output="pipe:stdout";
start = "Hello World" echo="" input="stdin" output="stdout";

Piped input and output can be used where NETS is being used as part of an operating system sequence of commands organised as a pipline.

Memory

Memory input and output is designed for in memory streams. 'mem' uses an internal two or four byte wide character representation of characters depending on the platform. Grammars therefore need to use wide character aware primaries and rules.

(* NEBNF *)
start = L"Hello World" echo="" input="mem:Hello World";

<!-- GXML -->
<rule id="start">
  <terminal echo="" input="mem:Hello World" encoding="WCHAR">Hello World</terminal>
</rule>

Streamed Document Object Model (SDOM)

Abstract Syntax Trees (ASTs) represent data structures. For example if the data represents contact information the corresponding AST may have nodes name, address and telephone number. NETS uses a W3C style DOM to represent the AST generated by the parser. The NETS DOM is stored in any supported stream type (ie. file, memory or pipe) and is known as a Streamed Document Object Model (SDOM).

In NETS the relationship between the grammar and the DOM is defined by assigning primaries in the grammar to DOM nodes using -input and -output attributes. This approach both unifies the input and output model of the parser, but also provides traceability between the grammar and the AST/DOM. The declarative approach to linking AST/DOMs to grammars is useful when creating complex multi-stage parsers. The declarative approach allows NETS to both read and write SDOMs.

NETS parser does not follow a strict implementation of the W3C DOM but supports a subset of features required for producing ASTs. The following table shows the types of node that may exist in a NETS DOM.

Node Type Description
Document

Documents must always be the root of the DOM tree. If not specified they are automatically created before the first child is added.

Document has children Element (1), Comment, Processing Instruction
Element

Elements can be built into a hierarchy using parent-child relationships. Elements have a name and a value.

Element has children Element, Text, Entity Reference, Comment, Processing Instruction, CDATA Section
Element has attributes
Attribute

Attributes can only be added to elements, entities and entity references. Attributes have a name and value.

Attribute has children Text and Entity Reference
Entity

Entities are used to define reusable text elements.

Entity has children Text and Entity Reference
Entity has attributes
Entity Reference

Entity references refer to entities and effectively substitute them during processing.

Entity Reference has no children
Entity Reference has attributes
Text

Text defines text elements.

Text has no children
CDATA

CDATA defines binary elements.

CDATA has no children
Comment

Comments add commentary to a DOM.

Comment has no children
Processing Instruction

Processing instructions define script or code to be interpretted as part of DOM processing.

ProcessingInstruction has no children

The following sections show examples of using the DOM model for input and output.

Elements

Elements can be used to construct complex hierarchies of nodes in the AST. A tree begins with a document node. It is recommended that text in the DOM is encoded as wide characters. The grammar below shows a pipeine that includes converting from ASCII to WCHAR, processing the input and producing a DOM structure, which is then converted into XML (toXML) and finally converted from WCHAR to ASCII.

Within the rule 'parser' a DOM Group (using the '(^' and '^)' notation) increments the level in the DOM hierarchy at which nodes are being added. In this instance the Element 'doc' is added as a child to the document node and the Element 'name' is added as a child to the 'doc' Element. The resulting output is shown at the end of the table.

(* NEBNF *)
start= asciitowchar : parser : toXML : wchartoascii;
parser = output="document:",
  (^output="element:doc",
    (^{wchar. output="element:name"}^)
  ^) ;

<!-- GXML -->
<rule id="start">
  <pipeline>
    <ruleref idref="asciitowchar"/>
    <ruleref idref="parser"/>
    <ruleref idref="toXML"/>
    <ruleref idref="wchartoascii"/>
  </pipeline>
</rule>
<rule id="parser">
  <sequence>
    <empty output="document:"/>
    <dom>
      <sequence>
        <empty output="element:doc"/>
        <dom>
          <iteration>
            <ruleref idref="wchar" echo="" output="element:name"/>
          </iteration>
        </dom>
      </sequence>
    </dom>
  </sequence>
</rule>

<!-- Output -->
<doc>
  <name>H</name>
  <name>e</name>
  <name>l</name>
  <name>l</name>
  <name>o</name>
  <name></name>
  <name>W</name>
  <name>o</name>
  <name>r</name>
  <name>l</name>
  <name>d</name>
</doc>

Levelling the SDOM

In the last grammar a new notation was introduced which changes the level of the DOM hierachy. The '(^...^)' NEBNF and <dom>...</dom> GXML notation makes the last element the parent of the next element in the hierarchy. This DOM hierarchy leveling can be nested to create arbitrarily complex DOM trees with many levels.

Attributes

Attributes are applied to the previous element in the output DOM. They must be unique for an element.

(* NEBNF *)
start = asciitowchar : parser : toXML : wchartoascii;
parser = output="document:",
  (^
    output="element:doc",
    (^
      {
        output="element:letter",
        wchar. output="attribute:name"
      }
    ^)
  ^) ;

<!-- GXML -->
<rule id="start">
  <pipeline>
    <ruleref idref="asciitowchar"/>
    <ruleref idref="parser"/>
    <ruleref idref="toXML"/>
    <ruleref idref="wchartoascii"/>
  </pipeline>
</rule>
<rule id="parser">
  <sequence>
    <empty output="document:"/>
    <dom>
      <sequence>
        <empty output="element:doc"/>
        <dom>
          <iteration>
            <sequence>
              <empty output="element:letter"/>
              <ruleref idref="wchar" echo="" output="attribute:name"/>
            </sequence>
          </iteration>
        </dom>
      </sequence>
    </dom>
  </sequence>
</rule>

<!-- Output -->
<doc>
  <letter name="H"/>
  <letter name="e"/>
  <letter name="l"/>
  <letter name="l"/>
  <letter name="o"/>
  <letter name=" "/>
  <letter name="W"/>
  <letter name="o"/>
  <letter name="r"/>
  <letter name="l"/>
  <letter name="d"/>
</doc>

Text

Text is added as children to Elements, Entities and Attributes to help build up compound objects.

(* NEBNF *)
start= asciitowchar : parser : toXML : wchartoascii;
parser = output="document:",
(^
  output="element:doc",
    (^
      {
        output="element:letters",
        (^
          wchar. output="textnode:",
          wchar. output="textnode:"
        ^)
    }
  ^)
^);

<!-- GXML -->
<rule id="start">
  <pipeline>
    <ruleref idref="asciitowchar"/>
    <ruleref idref="parser"/>
    <ruleref idref="toXML"/>
    <ruleref idref="wchartoascii"/>
  </pipeline>
</rule>
<rule id="parser">
  <sequence>
    <empty output="document:"/>
    <dom>
      <sequence>
        <empty output="element:doc"/>
        <dom>
          <iteration>
            <sequence>
              <empty output="element:letters"/>
              <dom>
                <sequence>
                  <ruleref idref="wchar" echo="" output="textnode:"/>
                  <ruleref idref="wchar" echo="" output="textnode:"/>
                </sequence>
              </dom>
            </sequence>
          </iteration>
        </dom>
      </sequence>
    </dom>
  </sequence>
</rule>

<!-- Output -->
<doc>
  <letters>He</letters>
  <letters>ll</letters>
  <letters>o </letters>
  <letters>Wo</letters>
  <letters>rl</letters>
</doc>

Dynamic Elements and Attributes

In some cases the name of an Element or Attribute needs to be detrmined dynamically by data in the input. NETS parser has two reserved Element names - 'Element' and 'Attribute' that are used to dynamically declare Elements and Attributes. The 'name' Attribute added to either the output="element:element" or output="element:attribute" creates an Element or Attribute with that name, dyanmically.

(* NEBNF *)
start= asciitowchar : parser : toXML : wchartoascii;
parser = output="document:",
(^
   output="element:doc",
   (^
     {
       output="element:element",wchar. output="attribute:name",
       (^
         wchar. output="textnode:",
         output="element:attribute",wchar. output="attribute:name",
         (^
           wchar. output="textnode:"
         ^)
       ^)
     }
   ^)
^) ;

<!-- GXML -->
<rule id="start">
  <pipeline>
    <ruleref idref="asciitowchar"/>
    <ruleref idref="parser"/>
    <ruleref idref="toXML"/>
    <ruleref idref="wchartoascii"/>
  </pipeline>
</rule>
<rule id="parser">
  <sequence>
    <empty output="document:"/>
    <dom>
      <sequence>
        <empty output="element:doc"/>
        <dom>
          <iteration>
            <sequence>
              <empty output="element:element"/>
              <ruleref idref="wchar" echo="" output="attribute:name"/>
              <dom>
                <sequence>
                  <ruleref idref="wchar" echo="" output="textnode:"/>
                  <empty output="element:attribute"/>
                  <ruleref idref="wchar" echo="" output="attribute:name"/>
                  <dom>
                    <ruleref idref="wchar" echo="" output="textnode:"/>
                  </dom>
                </sequence>
              </dom>
            </sequence>
          </iteration>
        </dom>
      </sequence>
    </dom>
  </sequence>
</rule>

<!-- Output -->
<doc>
  <H l="l">e</H>
  <o W="o"></o>
</doc>

Reading the SDOM

toXML takes structured output from the 'parser' rule and converts it into XML. We call the output from this process the Streamed Document Object Model (SDOM), so called because it is a hierarchical DOM stored onto a stream such as a file, memory or pipe. The SDOM structure can be consumed by a parser that knows how to read this data structure. toXML takes the SDOM structure and creates an XML representation. NETS helps build custom SDOM parsers using the input attribute of a primary. Once again the objective is to unify the approach to input and output.

The following example takes the input document and parses it into words using element:doc and element:word. The 'outputparser' rule takes the doc/word SDOM structure and outputs the text.

(* NEBNF *)
start= asciitowchar : inputparser : outputparser : wchartoascii;
  inputparser = output="document:",
  (^
     output="element:doc",
     (^
       {
         output="element:word",(^word. output="textnode:" ^),
         output="element:space",(^spaces. output="textnode:"^)
       }
     ^)
  ^) ;

outputparser = input="document:",
  (^
     input="element:doc",
     (^
       {
         ( L?Found Word ? input="element:word", (^word. input="textnode:"^), L?\n? ),
         ( L?Found Space ? input="element:space", (^spaces. input="textnode:"^), L?\n? )
       }
     ^)
  ^) ;

word = {walnum};
spaces = {wspace};

<!-- GXML -->
<rule id="start">
  <pipeline>
    <ruleref idref="asciitowchar"/>
    <ruleref idref="inputparser"/>
    <ruleref idref="outputparser"/>
    <ruleref idref="wchartoascii"/>
  </pipeline>
</rule>
<rule id="inputparser">
  <sequence>
    <empty output="document:"/>
    <dom>
      <sequence>
        <empty output="element:doc"/>
        <dom>
          <iteration>
            <sequence>
              <empty output="element:word"/>
              <dom>
                <ruleref idref="word" echo="" output="textnode:"/>
              </dom>
              <empty output="element:space"/>
              <dom>
                <ruleref idref="spaces" echo="" output="textnode:"/>
              </dom>
            </sequence>
          </iteration>
        </dom>
      </sequence>
    </dom>
  </sequence>
</rule>
<rule id="outputparser">
  <sequence>
    <empty input="document:"/>
    <dom>
      <sequence>
        <empty input="element:doc"/>
        <dom>
          <iteration>
            <sequence>
              <group>
                <sequence>
                  <special encoding="wchar" input="element:word">Found Word</special>
                  <dom>
                    <ruleref idref="word" echo="" input="textnode:"/>
                  </dom>
                  <special encoding="wchar"></special>
                </sequence>
              </group>
              <group>
                <sequence>
                  <special encoding="wchar" input="element:space">Found Space</special>
                  <dom>
                    <ruleref idref="spaces" echo="" input="textnode:"/>
                  </dom>
                  <special encoding="wchar"></special>
                </sequence>
              </group>
            </sequence>
          </iteration>
        </dom>
      </sequence>
    </dom>
  </sequence>
</rule>
<rule id="word">
  <iteration>
    <ruleref idref="walnum"/>
  </iteration>
</rule>
<rule id="spaces">
  <iteration>
    <ruleref idref="wspace"/>
  </iteration>
</rule>

Input
Hello World Hello2 World2

Output
Found Word Hello
Found Space
Found Word World
Found Space
Found Word Hello2
Found Space
Found Word World2
Found Space

Debugging the SDOM

The SDOM is a binary stream and is difficult to interpret using standard text editors and debugging aids. The toXML rule converts the SDOM into XML as shown in previous examples and is useful for understanding the SDOM stream, however, XML does not show the full SDOM structure like text nodes and entity references and the identity and relationships between nodes. The toTextTree rule can be used to generate a textual representation of the DOM. The previous example can be modified as follows to generate an SDOM.

The first line of the TextTree output has the word 'HEADER'. Subsequent lines display a node per line prefaces with byte offset values for the current node using 'I' for identity, the parent node using 'P', the child node using 'C', the sibling node using 'S' and 'A' for the attribute node. Following this is the node type (element, attribute, textnode...), followed by the node name (if present), followed by the node value (if present).

(* NEBNF *)
start= asciitowchar : inputparser : toTextTree : wchartoascii;
...

<!-- GXML -->
<rule id="start">
  <pipeline>
    <ruleref idref="asciitowchar"/>
    <ruleref idref="inputparser"/>
    <ruleref idref="toTextTree"/>
    <ruleref idref="wchartoascii"/>
  </pipeline>
</rule>
...

Input
Hello World Hello2 World2

Output
HEADER
(I:6,P:0,C:80,S:0,A:0)document--#document
     (I:80,P:6,C:142,S:0,A:0)element--doc
         (I:142,P:80,C:206,S:282,A:0)element--word
             (I:206,P:142,C:0,S:0,A:0)textnode--#text--Hello
         (I:282,P:80,C:348,S:416,A:0)element--space
             (I:348,P:282,C:0,S:0,A:0)textnode--#text--
         (I:416,P:80,C:480,S:556,A:0)element--word
             (I:480,P:416,C:0,S:0,A:0)textnode--#text--World
         (I:556,P:80,C:622,S:690,A:0)element--space
             (I:622,P:556,C:0,S:0,A:0)textnode--#text--
         (I:690,P:80,C:754,S:832,A:0)element--word
             (I:754,P:690,C:0,S:0,A:0)textnode--#text--Hello2
         (I:832,P:80,C:898,S:966,A:0)element--space
             (I:898,P:832,C:0,S:0,A:0)textnode--#text--
         (I:966,P:80,C:1030,S:1108,A:0)element--word
             (I:1030,P:966,C:0,S:0,A:0)textnode--#text--World2
         (I:1108,P:80,C:1174,S:1242,A:0)element--space
             (I:1174,P:1108,C:0,S:0,A:0)textnode--#text--

Actions

A parsing or generating process will need to perform actions, in addition to generating the AST, during processing. Formal language grammars do not define actions, but many tools provide extensions that allow actions to be defined, however, there is usually no declarative way of linking grammars with their actions without mixing code fragments with grammar. Making a clear separation between the grammar and the action code is necessary if grammars are to be reused for parsers built using different languages. But having a clear relationship between grammar and the action code through declaration is also desirable.

NETS uses actions as a means of defining when and what is executed as the result of the success or failure of a primary. In NETS actions can only be references to rules, but when NETS is used with code generation actions may also be scripts and function calls.

Name Description
onbefore The 'onbefore' action is executed prior to the primary.
onafter The 'onafter' action is executed after the primary, if it evaluates successfully.
onerror The 'onerror' action is executed after the primary, if it evaluates to false.

Logging

NETS logging features include the error stream, loglevel and log attributes.

Name Description
loglevel

The log level can be set through the 'loglevel' command line or on any primary.

  • MESSAGE (0) - the highest log level which is only used by the loginfo Attribute
  • ERROR (1) - error messages including the failure to open files
  • WARNING (2) - warning messages
  • INFO (3) - information messages include file opens
  • DEBUG (4) - debug messages including attribute additions
#Command Line
>nets-parser loglevel=1

(* NEBNF *)
start = asciitowchar : parser loglevel="1" : toXML : wchartoascii;

Note that when loglevel=4 and -grammar_xml are set, primaries are generated with unique ids and line number/offset information for each primary.

error The 'error' stream can be set either on the command line or on any primary. It is typically set to a file or 'stderr'. Changing error on a primary allows logs to be generated for particular parts of a gramamr at a particular loglevel.

# Command Line
>nets-parser error=default.log

(* NEBNF *)
start = asciitowchar : parser error="parser.log" : toXML : wchartoascii;
loginfo The 'loginfo' attribute logs the message in the attribute value to the error stream at loglevel=0.

(* NEBNF *)
start = asciitowchar : parser loginfo="We are parsing" : toXML : wchartoascii;
logerror The 'logerror' attribute adds the message in the attribute value to the error stream only if an error occurs in the primary.

(* NEBNF *)
start = asciitowchar : parser logerror="Parser has failed" : toXML : wchartoascii;
jobid The 'jobid' command line parameter adds the job id (instead of the thread id) to each line in the error stream.

Breakpoint

Breakpoints pause the process of parsing until the user strikes a key.

Name Description
breakpoint

Breakpoints can be set on any primary using the breakpoint attribute.

(* NEBNF *)
start = asciitowchar : parser breakpoint="" : toXML : wchartoascii;

<!-- GXML -->
<rule id="start">
  <pipeline>
    <ruleref idref="asciitowchar"/>
    <ruleref idref="parser" breakpoint=""/>
    <ruleref idref="toXML"/>
    <ruleref idref="wchartoascii"/>
  </pipeline>
</rule>