Parsers
Parsers use grammars to configure how they consume input and produce output.
Parsers implement grammars and can be either interpretive or compiled. Interpretive parsers such as NETS read the grammar and use it to parser some input and generate an output. Compiled parsers written in a language such as C or Java and are often generated from grammars. They also read some input and generate an output, but only for one grammar. GREP is an example of an interpreter for grammars and LEX/YACC is an example of a parser generator for grammars. Grammars are rarely portable between interpretive and compiled parsers.
Input and Output
Parsers act upon some input and generate an output. This may be as simple as a one-to-one mapping (an echo) between source and destination or it may involve a complex domain specific mapping. It is not unusual for the mapping between the source domain model and the target domain model to be a multistage process resulting in some final output. Parsers and interpreters therefore need to read and write from common sources like files, pipes and memory but also complex forms like an Abstract Syntax Trees (AST). ASTs have much in common with the World Wide Web Consortium (W3C) Document Object Models (DOM) - a popular technology found in web browsers - having nodes with a hierarchy of parents and children. NETS uses a W3C DOM like model for input and output of ASTs. NETS uses a common model for addressing files, pipes, memory and DOMs.
The syntax for defining input and output is to use the stream type ('file'
, 'mem'
, 'pipe'
, 'document'
,
'element'
, 'attribute'
, 'entity'
, 'entityreference'
, 'text'
, 'comment'
and 'processinginstruction'
) followed by a colon (':'
)
followed by the stream name (with the exception of 'document'
, 'text'
, 'comment'
and 'processinginstruction'
).
We have already seen examples of input and output configured for memory and files as command line parameters. These same parameters can be used in rules as attributes of primaries. The following is therefore equivalent.
# CLI
>nets-parser -grammar=default.g -input=default.in -output=default.out
(* NEBNF *)
start = "Hello World" echo="" input="default.in" output="default.out";
<!-- GXML -->
<rule id="start">
<terminal echo="" input="default.in" output="default.out">Hello World</terminal>
</rule>
File, Memory and Pipe
File, memory and pipe input and output are specified as follows.
Name | Description |
---|---|
File | File input and output can be specified using no prefix, the
(* NEBNF *)
NETS parser will overwrite existing files (either non-buffered or piped) by default. Using the
(* NEBNF *) |
Pipe |
Pipe input and output is designed for stdin and stdout streams as well as being used in pipelines and for buffering file input and output (as shown above).
(* NEBNF *)
Piped input and output can be used where NETS is being used as part of an operating system sequence of commands organised as a pipline. |
Memory | Memory input and output is designed for in memory streams.
(* NEBNF *)
<!-- GXML -->
|
Streamed Document Object Model (SDOM)
Abstract Syntax Trees (ASTs) represent data structures. For example if the data represents contact information the corresponding AST may have nodes name, address and telephone number. NETS uses a W3C style DOM to represent the AST generated by the parser. The NETS DOM is stored in any supported stream type (ie. file, memory or pipe) and is known as a Streamed Document Object Model (SDOM).
In NETS the relationship between the grammar and the DOM is defined by assigning primaries in the grammar to DOM nodes using -input and -output attributes. This approach both unifies the input and output model of the parser, but also provides traceability between the grammar and the AST/DOM. The declarative approach to linking AST/DOMs to grammars is useful when creating complex multi-stage parsers. The declarative approach allows NETS to both read and write SDOMs.
NETS parser does not follow a strict implementation of the W3C DOM but supports a subset of features required for producing ASTs. The following table shows the types of node that may exist in a NETS DOM.
Node Type | Description |
---|---|
Document |
Documents must always be the root of the DOM tree. If not specified they are automatically created before the first child is added. Document has childrenElement (1),
Comment , Processing Instruction
|
Element |
Elements can be built into a hierarchy using parent-child relationships. Elements have a name and a value. Element has children
Element , Text , Entity Reference ,
Comment , Processing Instruction ,
CDATA Section Element has attributes
|
Attribute |
Attributes can only be added to elements, entities and entity references. Attributes have a name and value. Attribute has children Text and Entity Reference
|
Entity |
Entities are used to define reusable text elements. Entity has children Text and Entity Reference Entity has attributes
|
Entity Reference |
Entity references refer to entities and effectively substitute them during processing. Entity Reference has no childrenEntity Reference has attributes
|
Text |
Text defines text elements. Text has no children
|
CDATA |
CDATA defines binary elements. CDATA has no children
|
Comment |
Comments add commentary to a DOM. Comment has no children
|
Processing Instruction |
Processing instructions define script or code to be interpretted as part of DOM processing. ProcessingInstruction has no children
|
The following sections show examples of using the DOM model for input and output.
Elements
Elements can be used to construct complex hierarchies of nodes in the AST. A tree begins with a document
node. It is recommended that text in
the DOM is encoded as wide characters. The grammar below shows a pipeine that includes converting from ASCII
to WCHAR
, processing
the input and producing a DOM structure, which is then converted into XML (toXML
) and finally converted from WCHAR
to ASCII
.
Within the rule 'parser'
a DOM Group (using the '(^'
and '^)'
notation) increments the level in the DOM hierarchy at which nodes
are being added. In this instance the Element 'doc'
is added as a child to the document
node and the Element
'name'
is added as a child to the 'doc'
Element. The resulting output is shown at the end of the table.
(* NEBNF *)
start= asciitowchar : parser : toXML : wchartoascii;
parser = output="document:",
(^output="element:doc",
(^{wchar. output="element:name"}^)
^) ;
<!-- GXML -->
<rule id="start">
<pipeline>
<ruleref idref="asciitowchar"/>
<ruleref idref="parser"/>
<ruleref idref="toXML"/>
<ruleref idref="wchartoascii"/>
</pipeline>
</rule>
<rule id="parser">
<sequence>
<empty output="document:"/>
<dom>
<sequence>
<empty output="element:doc"/>
<dom>
<iteration>
<ruleref idref="wchar" echo="" output="element:name"/>
</iteration>
</dom>
</sequence>
</dom>
</sequence>
</rule>
<!-- Output -->
<doc>
<name>H</name>
<name>e</name>
<name>l</name>
<name>l</name>
<name>o</name>
<name></name>
<name>W</name>
<name>o</name>
<name>r</name>
<name>l</name>
<name>d</name>
</doc>
Levelling the SDOM
In the last grammar a new notation was introduced which changes the level of the DOM hierachy. The '(^...^)'
NEBNF
and <dom>...</dom>
GXML notation makes the last element the parent of the next element in the hierarchy.
This DOM hierarchy leveling can be nested to create arbitrarily complex DOM trees with many levels.
Attributes
Attributes are applied to the previous element in the output DOM. They must be unique for an element.
(* NEBNF *)
start = asciitowchar : parser : toXML : wchartoascii;
parser = output="document:",
(^
output="element:doc",
(^
{
output="element:letter",
wchar. output="attribute:name"
}
^)
^) ;
<!-- GXML -->
<rule id="start">
<pipeline>
<ruleref idref="asciitowchar"/>
<ruleref idref="parser"/>
<ruleref idref="toXML"/>
<ruleref idref="wchartoascii"/>
</pipeline>
</rule>
<rule id="parser">
<sequence>
<empty output="document:"/>
<dom>
<sequence>
<empty output="element:doc"/>
<dom>
<iteration>
<sequence>
<empty output="element:letter"/>
<ruleref idref="wchar" echo="" output="attribute:name"/>
</sequence>
</iteration>
</dom>
</sequence>
</dom>
</sequence>
</rule>
<!-- Output -->
<doc>
<letter name="H"/>
<letter name="e"/>
<letter name="l"/>
<letter name="l"/>
<letter name="o"/>
<letter name=" "/>
<letter name="W"/>
<letter name="o"/>
<letter name="r"/>
<letter name="l"/>
<letter name="d"/>
</doc>
Text
Text is added as children to Elements
, Entities
and Attributes
to help build up compound objects.
(* NEBNF *)
start= asciitowchar : parser : toXML : wchartoascii;
parser = output="document:",
(^
output="element:doc",
(^
{
output="element:letters",
(^
wchar. output="textnode:",
wchar. output="textnode:"
^)
}
^)
^);
<!-- GXML -->
<rule id="start">
<pipeline>
<ruleref idref="asciitowchar"/>
<ruleref idref="parser"/>
<ruleref idref="toXML"/>
<ruleref idref="wchartoascii"/>
</pipeline>
</rule>
<rule id="parser">
<sequence>
<empty output="document:"/>
<dom>
<sequence>
<empty output="element:doc"/>
<dom>
<iteration>
<sequence>
<empty output="element:letters"/>
<dom>
<sequence>
<ruleref idref="wchar" echo="" output="textnode:"/>
<ruleref idref="wchar" echo="" output="textnode:"/>
</sequence>
</dom>
</sequence>
</iteration>
</dom>
</sequence>
</dom>
</sequence>
</rule>
<!-- Output -->
<doc>
<letters>He</letters>
<letters>ll</letters>
<letters>o </letters>
<letters>Wo</letters>
<letters>rl</letters>
</doc>
Dynamic Elements and Attributes
In some cases the name of an Element or Attribute needs to be detrmined dynamically by data in the input. NETS parser has two reserved Element names -
'Element'
and 'Attribute'
that are used to dynamically declare Elements and Attributes. The 'name'
Attribute added to
either the output="element:element"
or output="element:attribute"
creates an Element or Attribute with that name, dyanmically.
(* NEBNF *)
start= asciitowchar : parser : toXML : wchartoascii;
parser = output="document:",
(^
output="element:doc",
(^
{
output="element:element",wchar. output="attribute:name",
(^
wchar. output="textnode:",
output="element:attribute",wchar. output="attribute:name",
(^
wchar. output="textnode:"
^)
^)
}
^)
^) ;
<!-- GXML -->
<rule id="start">
<pipeline>
<ruleref idref="asciitowchar"/>
<ruleref idref="parser"/>
<ruleref idref="toXML"/>
<ruleref idref="wchartoascii"/>
</pipeline>
</rule>
<rule id="parser">
<sequence>
<empty output="document:"/>
<dom>
<sequence>
<empty output="element:doc"/>
<dom>
<iteration>
<sequence>
<empty output="element:element"/>
<ruleref idref="wchar" echo="" output="attribute:name"/>
<dom>
<sequence>
<ruleref idref="wchar" echo="" output="textnode:"/>
<empty output="element:attribute"/>
<ruleref idref="wchar" echo="" output="attribute:name"/>
<dom>
<ruleref idref="wchar" echo="" output="textnode:"/>
</dom>
</sequence>
</dom>
</sequence>
</iteration>
</dom>
</sequence>
</dom>
</sequence>
</rule>
<!-- Output -->
<doc>
<H l="l">e</H>
<o W="o"></o>
</doc>
Reading the SDOM
toXML
takes structured output from the 'parser'
rule and converts it into XML.
We call the output from this process the Streamed Document Object Model (SDOM), so called because it is a hierarchical DOM
stored onto a stream such as a file, memory or pipe. The SDOM structure can be consumed by a parser that knows how to read
this data structure. toXML
takes the SDOM structure and creates an XML representation. NETS helps
build custom SDOM parsers using the input
attribute of a primary.
Once again the objective is to unify the approach to input and output.
The following example takes the input document and parses it into words using element:doc
and element:word
. The 'outputparser'
rule takes the doc/word SDOM structure and outputs the text.
(* NEBNF *)
start= asciitowchar : inputparser : outputparser : wchartoascii;
inputparser = output="document:",
(^
output="element:doc",
(^
{
output="element:word",(^word. output="textnode:" ^),
output="element:space",(^spaces. output="textnode:"^)
}
^)
^) ;
outputparser = input="document:",
(^
input="element:doc",
(^
{
( L?Found Word ? input="element:word", (^word. input="textnode:"^), L?\n? ),
( L?Found Space ? input="element:space", (^spaces. input="textnode:"^), L?\n? )
}
^)
^) ;
word = {walnum};
spaces = {wspace};
<!-- GXML -->
<rule id="start">
<pipeline>
<ruleref idref="asciitowchar"/>
<ruleref idref="inputparser"/>
<ruleref idref="outputparser"/>
<ruleref idref="wchartoascii"/>
</pipeline>
</rule>
<rule id="inputparser">
<sequence>
<empty output="document:"/>
<dom>
<sequence>
<empty output="element:doc"/>
<dom>
<iteration>
<sequence>
<empty output="element:word"/>
<dom>
<ruleref idref="word" echo="" output="textnode:"/>
</dom>
<empty output="element:space"/>
<dom>
<ruleref idref="spaces" echo="" output="textnode:"/>
</dom>
</sequence>
</iteration>
</dom>
</sequence>
</dom>
</sequence>
</rule>
<rule id="outputparser">
<sequence>
<empty input="document:"/>
<dom>
<sequence>
<empty input="element:doc"/>
<dom>
<iteration>
<sequence>
<group>
<sequence>
<special encoding="wchar" input="element:word">Found Word</special>
<dom>
<ruleref idref="word" echo="" input="textnode:"/>
</dom>
<special encoding="wchar"></special>
</sequence>
</group>
<group>
<sequence>
<special encoding="wchar" input="element:space">Found Space</special>
<dom>
<ruleref idref="spaces" echo="" input="textnode:"/>
</dom>
<special encoding="wchar"></special>
</sequence>
</group>
</sequence>
</iteration>
</dom>
</sequence>
</dom>
</sequence>
</rule>
<rule id="word">
<iteration>
<ruleref idref="walnum"/>
</iteration>
</rule>
<rule id="spaces">
<iteration>
<ruleref idref="wspace"/>
</iteration>
</rule>
Input
Hello World Hello2 World2
Output
Found Word Hello
Found Space
Found Word World
Found Space
Found Word Hello2
Found Space
Found Word World2
Found Space
Debugging the SDOM
The SDOM is a binary stream and is difficult to interpret using standard text editors and debugging aids.
The toXML
rule converts the SDOM into XML as shown in previous examples and is useful for understanding the SDOM stream, however, XML does not show the full SDOM
structure like text nodes and entity references and the identity and relationships between nodes. The toTextTree
rule
can be used to generate a textual representation of the DOM. The previous example can be modified as follows
to generate an SDOM.
The first line of the TextTree
output has the word 'HEADER'
. Subsequent lines display a node per line prefaces with byte offset values
for the current node using 'I'
for identity, the parent node using 'P'
, the child node using 'C'
, the sibling node using 'S'
and 'A'
for the attribute node.
Following this is the node type (element
, attribute
, textnode
...), followed by the node name (if present), followed by the node value (if present).
(* NEBNF *)
start= asciitowchar : inputparser : toTextTree : wchartoascii;
...
<!-- GXML -->
<rule id="start">
<pipeline>
<ruleref idref="asciitowchar"/>
<ruleref idref="inputparser"/>
<ruleref idref="toTextTree"/>
<ruleref idref="wchartoascii"/>
</pipeline>
</rule>
...
Input
Hello World Hello2 World2
Output
HEADER
(I:6,P:0,C:80,S:0,A:0)document--#document
(I:80,P:6,C:142,S:0,A:0)element--doc
(I:142,P:80,C:206,S:282,A:0)element--word
(I:206,P:142,C:0,S:0,A:0)textnode--#text--Hello
(I:282,P:80,C:348,S:416,A:0)element--space
(I:348,P:282,C:0,S:0,A:0)textnode--#text--
(I:416,P:80,C:480,S:556,A:0)element--word
(I:480,P:416,C:0,S:0,A:0)textnode--#text--World
(I:556,P:80,C:622,S:690,A:0)element--space
(I:622,P:556,C:0,S:0,A:0)textnode--#text--
(I:690,P:80,C:754,S:832,A:0)element--word
(I:754,P:690,C:0,S:0,A:0)textnode--#text--Hello2
(I:832,P:80,C:898,S:966,A:0)element--space
(I:898,P:832,C:0,S:0,A:0)textnode--#text--
(I:966,P:80,C:1030,S:1108,A:0)element--word
(I:1030,P:966,C:0,S:0,A:0)textnode--#text--World2
(I:1108,P:80,C:1174,S:1242,A:0)element--space
(I:1174,P:1108,C:0,S:0,A:0)textnode--#text--
Actions
A parsing or generating process will need to perform actions, in addition to generating the AST, during processing. Formal language grammars do not define actions, but many tools provide extensions that allow actions to be defined, however, there is usually no declarative way of linking grammars with their actions without mixing code fragments with grammar. Making a clear separation between the grammar and the action code is necessary if grammars are to be reused for parsers built using different languages. But having a clear relationship between grammar and the action code through declaration is also desirable.
NETS uses actions as a means of defining when and what is executed as the result of the success or failure of a primary. In NETS actions can only be references to rules, but when NETS is used with code generation actions may also be scripts and function calls.
Name | Description |
---|---|
onbefore |
The 'onbefore' action is executed prior to the primary. |
onafter |
The 'onafter' action is executed after the primary, if it evaluates successfully. |
onerror |
The 'onerror' action is executed after the primary, if it evaluates to false. |
Logging
NETS logging features include the error stream, loglevel and log attributes.
Name | Description |
---|---|
loglevel |
The log level can be set through the 'loglevel' command line or on any primary.
#Command Line
(* NEBNF *)
Note that when |
error |
The 'error' stream can be set either on the command line or on any primary. It is typically set to a file or 'stderr' .
Changing error on a primary allows logs to be generated for particular parts of a gramamr at a particular loglevel .
# Command Line
(* NEBNF *)
|
loginfo |
The 'loginfo' attribute logs the message in the attribute value to the error stream at loglevel=0 .
(* NEBNF *) |
logerror |
The 'logerror' attribute adds the message in the attribute value to the error stream only if an error occurs in the primary.
(* NEBNF *)
|
jobid |
The 'jobid' command line parameter adds the job id (instead of the thread id) to each line in the error stream. |
Breakpoint
Breakpoints pause the process of parsing until the user strikes a key.
Name | Description |
---|---|
breakpoint |
Breakpoints can be set on any primary using the
(* NEBNF *)
<!-- GXML -->
|