Parsers

Parsers use grammars to configure how they consume input and produce output.

Parsers implement grammars and can be either interpretive or compiled. Interpretive parsers such as NETS read the grammar and use it to parser some input and generate an output. Compiled parsers written in a language such as C or Java and are often generated from grammars. They also read some input and generate an output, but only for one grammar. GREP is an example of an interpreter for grammars and LEX/YACC is an example of a parser generator for grammars. Grammars are rarely portable between interpretive and compiled parsers.

Input and Output

Parsers act upon some input and generate an output. This may be as simple as a one-to-one mapping (an echo) between source and destination or it may involve a complex domain specific mapping. It is not unusual for the mapping between the source domain model and the target domain model to be a multistage process resulting in some final output. Parsers and interpreters therefore need to read and write from common sources like files, pipes and memory but also complex forms like an Abstract Syntax Trees (AST). ASTs have much in common with the World Wide Web Consortium (W3C) Document Object Models (DOM) - a popular technology found in web browsers - having nodes with a hierarchy of parents and children. NETS uses a W3C DOM like model for input and output of ASTs. NETS uses a common model for addressing files, pipes, memory and DOMs.

The syntax for defining input and output is to use the stream type ('file', 'mem', 'pipe', 'document', 'element', 'attribute', 'entity', 'entityreference', 'text', 'comment' and 'processinginstruction') followed by a colon (':') followed by the stream name (with the exception of 'document', 'text', 'comment' and 'processinginstruction').

We have already seen examples of input and output configured for memory and files as command line parameters. These same parameters can be used in rules as attributes of primaries. The following is therefore equivalent.


            # CLI

            >nets-parser -grammar=default.g -input=default.in -output=default.out


            (* NEBNF *)

            start = "Hello World" echo="" input="default.in" output="default.out";


            <!-- GXML -->

            <rule id="start">

              <terminal echo="" input="default.in" output="default.out">Hello World</terminal>

            </rule>

File, Memory and Pipe

File, memory and pipe input and output are specified as follows.

Name	Description
File	File input and output can be specified using no prefix, the `'file'` prefix or `'pipe'` prefix. The first two forms open the file in non-buffered mode. The `'pipe'` prefix uses buffered input output. `(* NEBNF ) start = "Hello World" echo="" input="default.in" output="default.out"; start = "Hello World" echo="" input="file:default.in" output="file:default.out"; start = "Hello World" echo="" input="pipe:default.in" output="pipe:default.out";` NETS parser will overwrite existing files (either non-buffered or piped) by default. Using the`'+'` symbol at the end of the filename will open the file in append mode. This can be used with both output and error files specified on the command line and in a grammar. This feature can be useful for appending to log files. Append is ignored on non-file stream types. `( NEBNF *) start = "Hello World" echo="" input="pipe:default.in" output="pipe:default.out+";`
Pipe	Pipe input and output is designed for stdin and stdout streams as well as being used in pipelines and for buffering file input and output (as shown above). `(* NEBNF *) start = "Hello World" echo="" input="pipe:stdin" output="pipe:stdout"; start = "Hello World" echo="" input="stdin" output="stdout";` Piped input and output can be used where NETS is being used as part of an operating system sequence of commands organised as a pipline.
Memory	Memory input and output is designed for in memory streams. `'mem'` uses an internal two or four byte wide character representation of characters depending on the platform. Grammars therefore need to use wide character aware primaries and rules. `(* NEBNF *) start = L"Hello World" echo="" input="mem:Hello World";` `<!-- GXML --> <rule id="start"> <terminal echo="" input="mem:Hello World" encoding="WCHAR">Hello World</terminal> </rule>`

Streamed Document Object Model (SDOM)

Abstract Syntax Trees (ASTs) represent data structures. For example if the data represents contact information the corresponding AST may have nodes name, address and telephone number. NETS uses a W3C style DOM to represent the AST generated by the parser. The NETS DOM is stored in any supported stream type (ie. file, memory or pipe) and is known as a Streamed Document Object Model (SDOM).

In NETS the relationship between the grammar and the DOM is defined by assigning primaries in the grammar to DOM nodes using -input and -output attributes. This approach both unifies the input and output model of the parser, but also provides traceability between the grammar and the AST/DOM. The declarative approach to linking AST/DOMs to grammars is useful when creating complex multi-stage parsers. The declarative approach allows NETS to both read and write SDOMs.

NETS parser does not follow a strict implementation of the W3C DOM but supports a subset of features required for producing ASTs. The following table shows the types of node that may exist in a NETS DOM.

Node Type	Description
`Document`	Documents must always be the root of the DOM tree. If not specified they are automatically created before the first child is added. Document has children `Element` (1), `Comment`, `Processing Instruction`
`Element`	Elements can be built into a hierarchy using parent-child relationships. Elements have a name and a value. `Element` has children `Element`, `Text`, `Entity Reference`, `Comment`, `Processing Instruction`, `CDATA Section` `Element` has attributes
`Attribute`	Attributes can only be added to elements, entities and entity references. Attributes have a name and value. `Attribute` has children `Text` and `Entity Reference`
`Entity`	Entities are used to define reusable text elements. `Entity` has children `Text` and `Entity Reference` `Entity` has attributes
`Entity Reference`	Entity references refer to entities and effectively substitute them during processing. `Entity Reference` has no children `Entity Reference` has attributes
`Text`	Text defines text elements. `Text` has no children
`CDATA`	CDATA defines binary elements. `CDATA` has no children
`Comment`	Comments add commentary to a DOM. `Comment` has no children
`Processing Instruction`	Processing instructions define script or code to be interpretted as part of DOM processing. `ProcessingInstruction` has no children

The following sections show examples of using the DOM model for input and output.

Elements

Elements can be used to construct complex hierarchies of nodes in the AST. A tree begins with a document node. It is recommended that text in the DOM is encoded as wide characters. The grammar below shows a pipeine that includes converting from ASCII to WCHAR, processing the input and producing a DOM structure, which is then converted into XML (toXML) and finally converted from WCHAR to ASCII.

Within the rule 'parser' a DOM Group (using the '(^' and '^)' notation) increments the level in the DOM hierarchy at which nodes are being added. In this instance the Element 'doc' is added as a child to the document node and the Element 'name' is added as a child to the 'doc' Element. The resulting output is shown at the end of the table.


            (* NEBNF *)

            start= asciitowchar : parser : toXML : wchartoascii;

            parser = output="document:",

              (^output="element:doc",

                (^{wchar. output="element:name"}^)

              ^) ;


            <!-- GXML -->

            <rule id="start">

              <pipeline>

                <ruleref idref="asciitowchar"/>

                <ruleref idref="parser"/>

                <ruleref idref="toXML"/>

                <ruleref idref="wchartoascii"/>

              </pipeline>

            </rule>

            <rule id="parser">

              <sequence>

                <empty output="document:"/>

                <dom>

                  <sequence>

                    <empty output="element:doc"/>

                    <dom>

                      <iteration>

                        <ruleref idref="wchar" echo="" output="element:name"/>

                      </iteration>

                    </dom>

                  </sequence>

                </dom>

              </sequence>

            </rule>

 <doc> <name>H</name> <name>e</name> <name>l</name> <name>l</name> <name>o</name> <name></name> <name>W</name> <name>o</name> <name>r</name> <name>l</name> <name>d</name> </doc>

Levelling the SDOM

In the last grammar a new notation was introduced which changes the level of the DOM hierachy. The '(^...^)' NEBNF and <dom>...</dom> GXML notation makes the last element the parent of the next element in the hierarchy. This DOM hierarchy leveling can be nested to create arbitrarily complex DOM trees with many levels.

Attributes

Attributes are applied to the previous element in the output DOM. They must be unique for an element.


            (* NEBNF *)

            start = asciitowchar : parser : toXML : wchartoascii;

            parser = output="document:",

              (^ 

                output="element:doc",

                (^

                  { 

                    output="element:letter",

                    wchar. output="attribute:name"

                  }

                ^)

              ^) ;


            <!-- GXML -->

            <rule id="start">

              <pipeline>

                <ruleref idref="asciitowchar"/>

                <ruleref idref="parser"/>

                <ruleref idref="toXML"/>

                <ruleref idref="wchartoascii"/>

              </pipeline>

            </rule>

            <rule id="parser">

              <sequence>

                <empty output="document:"/>

                <dom>

                  <sequence>

                    <empty output="element:doc"/>

                    <dom>

                      <iteration>

                        <sequence>

                          <empty output="element:letter"/>

                          <ruleref idref="wchar" echo="" output="attribute:name"/>

                        </sequence>

                      </iteration>

                    </dom>

                  </sequence>

                </dom>

              </sequence>

            </rule>

 <doc> <letter name="H"/> <letter name="e"/> <letter name="l"/> <letter name="l"/> <letter name="o"/> <letter name=" "/> <letter name="W"/> <letter name="o"/> <letter name="r"/> <letter name="l"/> <letter name="d"/> </doc>

Text

Text is added as children to Elements, Entities and Attributes to help build up compound objects.


            (* NEBNF *)

            start= asciitowchar : parser : toXML : wchartoascii;

            parser = output="document:",

            (^ 

              output="element:doc",

                (^

                  { 

                    output="element:letters",

                    (^

                      wchar. output="textnode:",

                      wchar. output="textnode:"

                    ^)

                }

              ^)

            ^);

            

            <!-- GXML -->

            <rule id="start">

              <pipeline>

                <ruleref idref="asciitowchar"/>

                <ruleref idref="parser"/>

                <ruleref idref="toXML"/>

                <ruleref idref="wchartoascii"/>

              </pipeline>

            </rule>

            <rule id="parser">

              <sequence>

                <empty output="document:"/>

                <dom>

                  <sequence>

                    <empty output="element:doc"/>

                    <dom>

                      <iteration>

                        <sequence>

                          <empty output="element:letters"/>

                          <dom>

                            <sequence>

                              <ruleref idref="wchar" echo="" output="textnode:"/>

                              <ruleref idref="wchar" echo="" output="textnode:"/>

                            </sequence>

                          </dom>

                        </sequence>

                      </iteration>

                    </dom>

                  </sequence>

                </dom>

              </sequence>

            </rule>

 <doc> <letters>He</letters> <letters>ll</letters> <letters>o </letters> <letters>Wo</letters> <letters>rl</letters> </doc>

Dynamic Elements and Attributes

In some cases the name of an Element or Attribute needs to be detrmined dynamically by data in the input. NETS parser has two reserved Element names - 'Element' and 'Attribute' that are used to dynamically declare Elements and Attributes. The 'name' Attribute added to either the output="element:element" or output="element:attribute" creates an Element or Attribute with that name, dyanmically.


            (* NEBNF *)

            start= asciitowchar : parser : toXML : wchartoascii;

            parser = output="document:",

            (^ 

                  output="element:doc",

                  (^

                    { 

                          output="element:element",wchar. output="attribute:name",

                      (^

                        wchar. output="textnode:",

                            output="element:attribute",wchar. output="attribute:name",

                        (^

                              wchar. output="textnode:"

                        ^)

                      ^)

                    }

                  ^)

            ^) ;


            <!-- GXML -->

            <rule id="start">

              <pipeline>

                <ruleref idref="asciitowchar"/>

                <ruleref idref="parser"/>

                <ruleref idref="toXML"/>

                <ruleref idref="wchartoascii"/>

              </pipeline>

            </rule>

            <rule id="parser">

              <sequence>

                <empty output="document:"/>

                <dom>

                  <sequence>

                    <empty output="element:doc"/>

                    <dom>

                      <iteration>

                        <sequence>

                          <empty output="element:element"/>

                          <ruleref idref="wchar" echo="" output="attribute:name"/>

                          <dom>

                            <sequence>

                              <ruleref idref="wchar" echo="" output="textnode:"/>

                              <empty output="element:attribute"/>

                              <ruleref idref="wchar" echo="" output="attribute:name"/>

                              <dom>

                                <ruleref idref="wchar" echo="" output="textnode:"/>

                              </dom>

                            </sequence>

                          </dom>

                        </sequence>

                      </iteration>

                    </dom>

                  </sequence>

                </dom>

              </sequence>

            </rule>

 <doc> <H l="l">e</H> <o W="o"></o> </doc>

Reading the SDOM

toXML takes structured output from the 'parser' rule and converts it into XML. We call the output from this process the Streamed Document Object Model (SDOM), so called because it is a hierarchical DOM stored onto a stream such as a file, memory or pipe. The SDOM structure can be consumed by a parser that knows how to read this data structure. toXML takes the SDOM structure and creates an XML representation. NETS helps build custom SDOM parsers using the input attribute of a primary. Once again the objective is to unify the approach to input and output.

The following example takes the input document and parses it into words using element:doc and element:word. The 'outputparser' rule takes the doc/word SDOM structure and outputs the text.


            (* NEBNF *)

            start= asciitowchar : inputparser : outputparser : wchartoascii;

              inputparser = output="document:",

              (^ 

                    output="element:doc",

                    (^

                      { 

                            output="element:word",(^word. output="textnode:" ^),

                        output="element:space",(^spaces. output="textnode:"^)

                      }

                    ^)

              ^) ;

            

            outputparser = input="document:",

              (^

                    input="element:doc",

                    (^

                      {

                            ( L?Found Word ? input="element:word", (^word. input="textnode:"^), L?\n? ),

                        ( L?Found Space ? input="element:space", (^spaces. input="textnode:"^), L?\n? )

                      }

                    ^)

              ^) ;

            

            word = {walnum};

            spaces = {wspace};


            <!-- GXML -->

            <rule id="start">

              <pipeline>

                <ruleref idref="asciitowchar"/>

                <ruleref idref="inputparser"/>

                <ruleref idref="outputparser"/>

                <ruleref idref="wchartoascii"/>

              </pipeline>

            </rule>

            <rule id="inputparser">

              <sequence>

                <empty output="document:"/>

                <dom>

                  <sequence>

                    <empty output="element:doc"/>

                    <dom>

                      <iteration>

                        <sequence>

                          <empty output="element:word"/>

                          <dom>

                            <ruleref idref="word" echo="" output="textnode:"/>

                          </dom>

                          <empty output="element:space"/>

                          <dom>

                            <ruleref idref="spaces" echo="" output="textnode:"/>

                          </dom>

                        </sequence>

                      </iteration>

                    </dom>

                  </sequence>

                </dom>

              </sequence>

            </rule>

            <rule id="outputparser">

              <sequence>

                <empty input="document:"/>

                <dom>

                  <sequence>

                    <empty input="element:doc"/>

                    <dom>

                      <iteration>

                        <sequence>

                          <group>

                            <sequence>

                              <special encoding="wchar" input="element:word">Found Word</special>

                              <dom>

                                <ruleref idref="word" echo="" input="textnode:"/>

                              </dom>

                              <special encoding="wchar"></special>

                            </sequence>

                          </group>

                          <group>

                            <sequence>

                              <special encoding="wchar" input="element:space">Found Space</special>

                              <dom>

                                <ruleref idref="spaces" echo="" input="textnode:"/>

                              </dom>

                              <special encoding="wchar"></special>

                            </sequence>

                          </group>

                        </sequence>

                      </iteration>

                    </dom>

                  </sequence>

                </dom>

              </sequence>

            </rule>

            <rule id="word">

              <iteration>

                <ruleref idref="walnum"/>

              </iteration>

            </rule>

            <rule id="spaces">

              <iteration>

                <ruleref idref="wspace"/>

              </iteration>

            </rule>

Input Hello World Hello2 World2 Output Found Word Hello Found Space Found Word World Found Space Found Word Hello2 Found Space Found Word World2 Found Space

Debugging the SDOM

The SDOM is a binary stream and is difficult to interpret using standard text editors and debugging aids. The toXML rule converts the SDOM into XML as shown in previous examples and is useful for understanding the SDOM stream, however, XML does not show the full SDOM structure like text nodes and entity references and the identity and relationships between nodes. The toTextTree rule can be used to generate a textual representation of the DOM. The previous example can be modified as follows to generate an SDOM.

The first line of the TextTree output has the word 'HEADER'. Subsequent lines display a node per line prefaces with byte offset values for the current node using 'I' for identity, the parent node using 'P', the child node using 'C', the sibling node using 'S' and 'A' for the attribute node. Following this is the node type (element, attribute, textnode...), followed by the node name (if present), followed by the node value (if present).


            (* NEBNF *)

            start= asciitowchar : inputparser : toTextTree : wchartoascii;

            ...

            

            <!-- GXML -->

            <rule id="start">

              <pipeline>

                <ruleref idref="asciitowchar"/>

                <ruleref idref="inputparser"/>

                <ruleref idref="toTextTree"/>

                <ruleref idref="wchartoascii"/>

              </pipeline>

            </rule>

            ...

Input Hello World Hello2 World2
Output HEADER (I:6,P:0,C:80,S:0,A:0)document--#document (I:80,P:6,C:142,S:0,A:0)element--doc (I:142,P:80,C:206,S:282,A:0)element--word (I:206,P:142,C:0,S:0,A:0)textnode--#text--Hello (I:282,P:80,C:348,S:416,A:0)element--space (I:348,P:282,C:0,S:0,A:0)textnode--#text-- (I:416,P:80,C:480,S:556,A:0)element--word (I:480,P:416,C:0,S:0,A:0)textnode--#text--World (I:556,P:80,C:622,S:690,A:0)element--space (I:622,P:556,C:0,S:0,A:0)textnode--#text-- (I:690,P:80,C:754,S:832,A:0)element--word (I:754,P:690,C:0,S:0,A:0)textnode--#text--Hello2 (I:832,P:80,C:898,S:966,A:0)element--space (I:898,P:832,C:0,S:0,A:0)textnode--#text-- (I:966,P:80,C:1030,S:1108,A:0)element--word (I:1030,P:966,C:0,S:0,A:0)textnode--#text--World2 (I:1108,P:80,C:1174,S:1242,A:0)element--space (I:1174,P:1108,C:0,S:0,A:0)textnode--#text--

Actions

A parsing or generating process will need to perform actions, in addition to generating the AST, during processing. Formal language grammars do not define actions, but many tools provide extensions that allow actions to be defined, however, there is usually no declarative way of linking grammars with their actions without mixing code fragments with grammar. Making a clear separation between the grammar and the action code is necessary if grammars are to be reused for parsers built using different languages. But having a clear relationship between grammar and the action code through declaration is also desirable.

NETS uses actions as a means of defining when and what is executed as the result of the success or failure of a primary. In NETS actions can only be references to rules, but when NETS is used with code generation actions may also be scripts and function calls.

Name	Description
`onbefore`	The `'onbefore'` action is executed prior to the primary.
`onafter`	The `'onafter'` action is executed after the primary, if it evaluates successfully.
`onerror`	The `'onerror'` action is executed after the primary, if it evaluates to false.

Logging

NETS logging features include the error stream, loglevel and log attributes.

Name	Description
`loglevel`	The log level can be set through the 'loglevel' command line or on any primary. `MESSAGE (0)` - the highest log level which is only used by the loginfo Attribute `ERROR (1)` - error messages including the failure to open files `WARNING (2)` - warning messages `INFO (3)` - information messages include file opens `DEBUG (4)` - debug messages including attribute additions `#Command Line >nets-parser loglevel=1` `(* NEBNF *) start = asciitowchar : parser loglevel="1" : toXML : wchartoascii;` Note that when `loglevel=4` and `-grammar_xml` are set, primaries are generated with unique ids and line number/offset information for each primary.
`error`	The `'error'` stream can be set either on the command line or on any primary. It is typically set to a file or `'stderr'`. Changing error on a primary allows logs to be generated for particular parts of a gramamr at a particular `loglevel`. `# Command Line >nets-parser error=default.log` `(* NEBNF *) start = asciitowchar : parser error="parser.log" : toXML : wchartoascii;`
`loginfo`	The `'loginfo'` attribute logs the message in the attribute value to the error stream at `loglevel=0`. `(* NEBNF *) start = asciitowchar : parser loginfo="We are parsing" : toXML : wchartoascii;`
`logerror`	The `'logerror'` attribute adds the message in the attribute value to the error stream only if an error occurs in the primary. `(* NEBNF *) start = asciitowchar : parser logerror="Parser has failed" : toXML : wchartoascii;`
`jobid`	The `'jobid'` command line parameter adds the job id (instead of the thread id) to each line in the error stream.

Breakpoint

Breakpoints pause the process of parsing until the user strikes a key.

Name	Description
`breakpoint`	Breakpoints can be set on any primary using the `breakpoint` attribute. `(* NEBNF *) start = asciitowchar : parser breakpoint="" : toXML : wchartoascii;` `<!-- GXML --> <rule id="start"> <pipeline> <ruleref idref="asciitowchar"/> <ruleref idref="parser" breakpoint=""/> <ruleref idref="toXML"/> <ruleref idref="wchartoascii"/> </pipeline> </rule>`