Grammars
Grammars define the rules for configuring and running a parser.
Grammar Languages
Grammars come in many dialects, from initialisation files to formal grammars like Backus-Naur Form (BNF). One of the ideas behind NETS is a unified grammar model for small- and large-scale processes, which makes configuration easier and improves the reliability of systems.
The grammar language used in NETS is based on ISO/IEC 14977 (ISO), which incorporates the concepts of Extended Backus-Naur Form (EBNF) created by Niklaus Wirth. NETS also uses concepts from Parsing Expression Grammars (PEG), defined by Bryan Ford, and adds further concepts of particular value to real-world parser generators. This hybrid grammar is called NETS EBNF (NEBNF). Finally, NETS has an alternative XML notation for grammars called Grammar XML (GXML).
Any grammar defined in NEBNF can also be defined in GXML, and examples are presented in both NEBNF and GXML. NEBNF and GXML can be mixed together in the same grammar.
The principles behind NEBNF are explained in 'A Common Model for Language Grammars'.
Grammar Names
Grammars need to be identified uniquely with a name. In NEBNF the file name is the grammar name and can be referred to from the command line.
In GXML the grammar name is explicitly declared using the id attribute on the <grammar> element.
Grammar names help scope grammar rules so that they can be referenced across grammar corpora or libraries.
#Command line
nets-parser -loglevel=3 -grammar=default.g;another.g;grammar2.g -input=default.in -output=default.out
<!-- GXML -->
<grammar id="name">
...
</grammar>
Rules
Syntax rules are a structural element of grammars. Rules have a unique name that can be referenced in other parts of the grammar. In NEBNF a rule name is followed by an equals sign and the rule is terminated by a semicolon. In GXML a rule is defined using the <rule> element.
The first rule executed by NETS is the rule with id="start" unless the grammar_start command line parameter is used to refer to a different rule.
(* NEBNF *)
start = ... ;
<!-- GXML -->
<rule id="start">
...
</rule>
Primary
Primaries include an optional sequence, repeated sequence, grouped sequence, rule reference, terminal string, empty sequence, entity reference and pipeline. Primary is an abstract concept and is useful for describing features common across primaries.
Grammar Attributes
Attributes are a means of specifying and extending grammar characteristics. They are native to XML, and they may also be used in NEBNF at the end of a primary. id="name" and echo="" are examples of attributes.
(* NEBNF *)
start = L"Hello World".;
(* NEBNF with grammar attributes *)
start = "Hello World" encoding="wchar" echo="";
<!-- GXML -->
<rule id="start">
<terminal encoding="wchar" echo="">Hello World</terminal>
</rule>
Terminals
Terminals are literal symbols (or strings of characters) that may appear in the rules of a formal grammar and cannot be changed using the rules of the grammar. In NEBNF, terminals define both parsed input (consumption), written in quotes, and produced output (production), written between question marks.
Consumption Terminals
Consumption terminals are defined with either single or double quotes in NEBNF. GXML uses the <terminal> element to define a consumption terminal.
(* NEBNF *)
start = "Hello World";
<!-- GXML -->
<rule id="start">
<terminal>Hello World</terminal>
</rule>
Production Terminals
NEBNF uses the ISO/IEC 14977 special sequence notation (question marks) to indicate a production terminal. GXML uses the <special> element to define a production terminal.
(* NEBNF *)
start = ?Hello World?;
<!-- GXML -->
<rule id="start">
<special>Hello World</special>
</rule>
Encoding
Terminal encoding can be either multi-byte or wide character. NEBNF uses the C-style L prefix on strings to indicate a wchar-encoded terminal. GXML uses the encoding attribute with the values "char" and "wchar".
(* NEBNF *)
start = ?Hello World?;
start = L"Hello World";
<!-- GXML -->
<rule id="start">
<special encoding="char">Hello World</special>
</rule>
<rule id="start">
<terminal encoding="wchar">Hello World</terminal>
</rule>
Rule Reference
Rule references are symbols which can be replaced by rules. They are used to structure complex grammars into reusable rule sets or grammars. GXML uses the idref attribute to define rule references.
(* NEBNF *)
start = new_rule;
new_rule = "Hello World";
<!-- GXML -->
<rule id="start">
<ruleref idref="new_rule"/>
</rule>
<rule id="new_rule">
<terminal>Hello World</terminal>
</rule>
Referencing a rule in a specific grammar requires the rule reference to be qualified by the grammar name.
(* NEBNF *)
start = mygrammar.rule_name;
<!-- GXML -->
<rule id="start">
<ruleref grammar="mygrammar" idref="rule_name"/>
</rule>
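As a rough illustration of qualified lookup, the Python sketch below resolves a reference against a table of grammars. It is not NETS code: the `grammars` registry, the `resolve` function, and the rule that an unqualified name is looked up in the current grammar are all assumptions made for the example.

```python
from typing import Dict

# Hypothetical registry: grammar name -> rule name -> rule definition.
grammars: Dict[str, Dict[str, str]] = {
    "default": {"start": "mygrammar.rule_name"},
    "mygrammar": {"rule_name": '"Hello World"'},
}


def resolve(reference: str, current_grammar: str) -> str:
    """Look up a rule definition.

    A qualified reference has the form grammar.rule. An unqualified
    reference is assumed to belong to the current grammar.
    """
    grammar, _, rule = reference.rpartition(".")
    return grammars[grammar or current_grammar][rule]


# The start rule of "default" refers to a rule in "mygrammar".
target = resolve("start", "default")
print(resolve(target, "default"))
```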
Sequences
A sequence is a series of one or more expressions evaluated in order. Expressions include terminals, rule references, sequences, choices, groups, iterations, options and the empty sequence. ISO/IEC 14977 refers to sequences as a single definition. Sequences are usually implicit but are concretely implemented in syntax rules, grouped sequences, repetitions and options.
- A syntax rule is a named sequence
- A grouped sequence is an unnamed sequence
- An option is a sequence with zero or one occurrences
- A repetition is a sequence with zero or more occurrences
- An empty sequence has no primaries
- A DOM sequence changes the current SDOM level. This feature is specific to NEBNF and GXML
In NEBNF a comma ',' separates primaries in a sequence.
Group
A group is an unnamed sequence. It is defined using parentheses in NEBNF. GXML defines a group using the <group> element.
(* NEBNF *)
start = ("Hello ","World");
<!-- GXML -->
<rule id="start">
<group>
<terminal>Hello </terminal>
<terminal>World</terminal>
</group>
</rule>
Option
An option is a sequence that can occur zero or one times. It is defined using square brackets in NEBNF. GXML uses the <option> element.
(* NEBNF *)
start = ["Hello "],"World";
<!-- GXML -->
<rule id="start">
<option>
<terminal>Hello </terminal>
</option>
<terminal>World</terminal>
</rule>
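The behaviour of an option inside a sequence can be sketched in Python. This is an illustration only, not how NETS is implemented; the helpers `literal`, `option` and `sequence` are hypothetical names chosen for the example.

```python
from typing import Callable, Optional

# A matcher takes (text, pos) and returns the new position, or None on failure.
Matcher = Callable[[str, int], Optional[int]]


def literal(s: str) -> Matcher:
    """Hypothetical helper: match the exact string s at the current position."""
    def match(text: str, pos: int) -> Optional[int]:
        return pos + len(s) if text.startswith(s, pos) else None
    return match


def option(item: Matcher) -> Matcher:
    """Match item zero or one times; an option never fails."""
    def match(text: str, pos: int) -> Optional[int]:
        end = item(text, pos)
        return pos if end is None else end
    return match


def sequence(*items: Matcher) -> Matcher:
    """Match every item in order; fail as soon as one item fails."""
    def match(text: str, pos: int) -> Optional[int]:
        for item in items:
            end = item(text, pos)
            if end is None:
                return None
            pos = end
        return pos
    return match


# Mirrors: start = ["Hello "],"World";
start = sequence(option(literal("Hello ")), literal("World"))
print(start("Hello World", 0), start("World", 0), start("Hello ", 0))
```

With this sketch both "Hello World" and "World" are accepted, while "Hello " alone is rejected because the mandatory terminal is missing.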
Repetition
A repetition is a sequence that can occur zero or more times. In NEBNF and GXML it is referred to as iteration. It is defined using curly braces in NEBNF. In GXML it is defined using the <iteration> element.
(* NEBNF *)
start = {char};
<!-- GXML -->
<rule id="start">
<iteration>
<ruleref idref="char"/>
</iteration>
</rule>
Note: To echo characters from the input to the output, add a dot '.' symbol to the end of a primary in NEBNF. For GXML, add the echo="" attribute to a primary.
(* NEBNF *)
start = {char.};
<!-- GXML -->
<rule id="start">
<iteration>
<ruleref idref="char" echo=""/>
</iteration>
</rule>
For a repetition to occur one or more times, either repeat the primary before the repetition in NEBNF or use the minoccurs attribute in GXML.
(* NEBNF *)
start = char,{char};
<!-- GXML -->
<rule id="start">
<iteration minoccurs="1">
<ruleref idref="char"/>
</iteration>
</rule>
For a repetition to occur exactly n times, use the * symbol in NEBNF or the minoccurs and maxoccurs attributes in GXML.
(* NEBNF *)
start = 3 * char;
<!-- GXML -->
<rule id="start">
<iteration minoccurs="3" maxoccurs="3">
<ruleref idref="char"/>
</iteration>
</rule>
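The three repetition forms (zero or more, one or more, exactly n) differ only in their occurrence bounds. The Python sketch below illustrates this under stated assumptions: `iteration` and `any_char` are hypothetical names, the parameters simply borrow the GXML attribute names, and greedy matching is assumed rather than taken from NETS.

```python
from typing import Callable, Optional

# A matcher takes (text, pos) and returns the new position, or None on failure.
Matcher = Callable[[str, int], Optional[int]]


def any_char(text: str, pos: int) -> Optional[int]:
    """Hypothetical stand-in for the char rule: consume one character."""
    return pos + 1 if pos < len(text) else None


def iteration(item: Matcher, minoccurs: int = 0,
              maxoccurs: Optional[int] = None) -> Matcher:
    """Repeat item greedily; succeed if it matched at least minoccurs times."""
    def match(text: str, pos: int) -> Optional[int]:
        count = 0
        while maxoccurs is None or count < maxoccurs:
            end = item(text, pos)
            if end is None or end == pos:  # stop on failure or no progress
                break
            pos, count = end, count + 1
        return pos if count >= minoccurs else None
    return match


zero_or_more = iteration(any_char)                 # {char}
one_or_more = iteration(any_char, minoccurs=1)     # char,{char}
exactly_three = iteration(any_char, 3, 3)          # 3 * char
print(zero_or_more("", 0), one_or_more("", 0), exactly_three("abcd", 0))
```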
Empty Sequence
The empty sequence contains no primaries and always evaluates to true. Use the <empty> element in GXML.
(* NEBNF *)
start = ;
<!-- GXML -->
<rule id="start">
<empty/>
</rule>
Choice
A choice is a sequence of one or more items evaluated in order until one is found to be true. More than one item in the series may be valid, but only the first valid choice is used.
A vertical bar '|' is used to separate choice items in NEBNF. GXML uses the <choice> element to define choices.
(* NEBNF *)
start = a | b | c;
<!-- GXML -->
<rule id="start">
<choice>
<ruleref idref="a"/>
<ruleref idref="b"/>
<ruleref idref="c"/>
</choice>
</rule>
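The first-match behaviour of a choice can be sketched in Python. The code is an illustration of ordered choice in general, not NETS code; the helpers `literal` and `choice` are hypothetical names for the example.

```python
from typing import Callable, Optional

# A matcher takes (text, pos) and returns the new position, or None on failure.
Matcher = Callable[[str, int], Optional[int]]


def literal(s: str) -> Matcher:
    """Hypothetical helper: match the exact string s at the current position."""
    def match(text: str, pos: int) -> Optional[int]:
        return pos + len(s) if text.startswith(s, pos) else None
    return match


def choice(*items: Matcher) -> Matcher:
    """Try each item in order and return the result of the first that succeeds."""
    def match(text: str, pos: int) -> Optional[int]:
        for item in items:
            end = item(text, pos)
            if end is not None:
                return end  # later items are never tried
        return None
    return match


# Both "a" and "ab" are valid for the input "ab", but "a" is listed first.
start = choice(literal("a"), literal("ab"))
print(start("ab", 0))  # 1
```

Because only the first valid item is used, the order of items matters: listing the longer alternative first would consume both characters instead.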
Pipeline
A pipeline is a sequence of two or more primaries, with the production from a primary in the chain sent to its successor in the pipeline sequence. Pipelines are common operating system features and are significant in many grammars. The colon ':' symbol is used to separate pipeline primaries in NEBNF. In GXML the <pipeline> element is used.
(* NEBNF *)
start = asciitowchar : {wchar.} : wchartoascii;
<!-- GXML -->
<rule id="start">
<pipeline>
<ruleref idref="asciitowchar"/>
<iteration>
<ruleref idref="wchar" echo=""/>
</iteration>
<ruleref idref="wchartoascii"/>
</pipeline>
</rule>
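The idea of passing each production to the next primary can be sketched in Python as function chaining. This is a simplified model under stated assumptions, not the NETS implementation: `pipeline` and the three stage functions are hypothetical stand-ins for the primaries in the example, and each stage here processes its whole input at once.

```python
from functools import reduce
from typing import Any, Callable

Stage = Callable[[Any], Any]


def pipeline(*stages: Stage) -> Stage:
    """Feed the production of each stage to its successor."""
    def run(data: Any) -> Any:
        return reduce(lambda value, stage: stage(value), stages, data)
    return run


# Hypothetical stand-ins for asciitowchar, {wchar.} and wchartoascii.
def ascii_to_wchar(data: bytes) -> str:
    return data.decode("ascii")


def echo_wchars(text: str) -> str:
    return "".join(ch for ch in text)  # echo every character unchanged


def wchar_to_ascii(text: str) -> bytes:
    return text.encode("ascii")


start = pipeline(ascii_to_wchar, echo_wchars, wchar_to_ascii)
print(start(b"Hello World"))  # b'Hello World'
```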
Predicates
Predicates are used to test primaries without consuming input. There are three kinds:
- The not predicate continues evaluation of the sequence when the expression evaluates to false; no input is consumed and no output is produced. NEBNF uses a preceding exclamation mark to define a not predicate for a primary; GXML uses the predicate="not" attribute.
- The and predicate continues evaluation of the sequence when the expression evaluates to true; no input is consumed and no output is produced. NEBNF uses a preceding ampersand symbol to define an and predicate for a primary; GXML uses the predicate="and" attribute.
- The again predicate continues evaluation of the sequence when the expression evaluates to true; no input is consumed, but output is produced, which permits repeated evaluation of the same input. NEBNF uses a preceding double ampersand symbol to define the again predicate for a primary; GXML uses the predicate="again" attribute.
ISO/IEC 14977 also includes a form of predicate known as an exception, which NEBNF supports for compatibility with that standard. The not predicate provides equivalent behaviour.
Predicate | Example |
---|---|
Not |
(* NEBNF *)
<!-- GXML -->
|
And |
(* NEBNF *)
<!-- GXML -->
|
Again |
(* NEBNF *)
<!-- GXML -->
|
Exception |
(* NEBNF *)
<!-- GXML -->
|
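The difference between the three predicates can be sketched in Python. The code models the semantics described above and is not taken from NETS; `literal`, `not_`, `and_` and `again` are hypothetical names, and a match result is assumed to be a pair of the new input position and the produced output.

```python
from typing import Callable, Optional, Tuple

# A matcher returns (new position, produced output) on success, or None.
Result = Optional[Tuple[int, str]]
Matcher = Callable[[str, int], Result]


def literal(s: str) -> Matcher:
    """Hypothetical helper: consume s and echo it to the output."""
    def match(text: str, pos: int) -> Result:
        return (pos + len(s), s) if text.startswith(s, pos) else None
    return match


def not_(item: Matcher) -> Matcher:
    """Succeed when item fails; consume nothing, produce nothing."""
    return lambda text, pos: (pos, "") if item(text, pos) is None else None


def and_(item: Matcher) -> Matcher:
    """Succeed when item succeeds; consume nothing, produce nothing."""
    return lambda text, pos: (pos, "") if item(text, pos) is not None else None


def again(item: Matcher) -> Matcher:
    """Succeed when item succeeds; consume nothing but keep its output."""
    def match(text: str, pos: int) -> Result:
        result = item(text, pos)
        return (pos, result[1]) if result is not None else None
    return match


a = literal("a")
print(not_(a)("b", 0), and_(a)("a", 0), again(a)("a", 0))
```

In every successful case the position is unchanged, so the same input remains available to the next primary in the sequence.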
Entity
An entity is a rule with a single terminal as its definition. It is used to describe reusable text that can be referenced in either consumption or production terminals.
(* NEBNF *)
hello_world = "Hello World";
<!-- GXML -->
<entity id="hello_world">
<terminal>Hello World</terminal>
</entity>
Entity Reference
Entity references are commonly used in XML documents to refer to restricted and common characters and strings. Examples include &amp; and &quot;. Languages like C use the backslash in strings to prefix a predefined entity such as newline \n or a character specified in hex \x0A.
In Unix shells ${name} is used to reference environment variables. The term 'entity reference' is used to describe these types of reference.
NEBNF uses \xhhhh to define a character using a hexadecimal number.
GXML uses &#dd; to define a character using a decimal number.
(* NEBNF *)
start = ?${hello} ${world}\n?;
hello = "Hello ";
world = "World";
<!-- GXML -->
<rule id="start">
<special>&hello; &world;</special>
</rule>
<rule id="hello">
<terminal>Hello </terminal>
</rule>
<rule id="world">
<terminal>World</terminal>
</rule>
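The expansion of ${name} references can be approximated in Python with the standard library `string.Template` class, which uses the same notation. This is an analogy for the idea only; the `entities` table is a hypothetical stand-in for the hello and world rules, and NETS may resolve references differently.

```python
from string import Template

# Hypothetical entity table, mirroring the hello and world rules above.
entities = {"hello": "Hello ", "world": "World"}

# Template expands each ${name} reference from the table; the "\n" in the
# Python string literal is an ordinary newline escape.
production = Template("${hello} ${world}\n").substitute(entities)
print(repr(production))
```

In this sketch the literal space between the two references is kept in addition to the trailing space inside the hello entity, so the result contains two spaces between the words.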
Comments
NEBNF uses the (* *) combination and GXML uses standard XML comment notation <!-- --> to add comments to a grammar.
(* NEBNF *)
(* This is a comment which may wrap over
multiple lines*)
start = ?Hello World?;
<!--This is a comment which may wrap over
multiple lines-->
<rule id="start">
<special>Hello World</special>
</rule>
Mixing NEBNF and GXML
NETS grammars can be written using a mix of NEBNF and GXML.
<!-- GXML -->
<rule id="start">
(* NEBNF *)
start = ?Hello World?;
</rule>