Grammars

Grammars define the rules for configuring and running a parser.


Grammar Languages

Grammars come in many dialects, from initialisation files to formal grammars like Backus-Naur Form (BNF). Many grammar languages are in use, some of which are described in 'A Common Model for Language Grammars' (Nelms, 2013). One of the ideas behind NETS is a unified grammar model for small and large scale processes, which makes configuration easier and improves the reliability of systems.

The grammar language used in Nets parser is based on ISO/IEC 14977 (ISO), which incorporates the concepts of Extended Backus-Naur Form (EBNF) created by Niklaus Wirth. Nets parser also uses concepts from Parsing Expression Grammars (PEG), defined by Bryan Ford. This hybrid grammar is called Nets EBNF (NEBNF). Nets parser also provides an XML notation for defining grammars, called Grammar XML (GXML).

Any grammar defined in NEBNF can also be defined in GXML, and the examples in this section are presented in both NEBNF and GXML. NEBNF and GXML can also be mixed together in the same grammar.

Grammar Names

Grammars need to be identified uniquely with a name. In NEBNF the file name is the grammar name and can be referred to from the command line. In GXML the grammar name is declared explicitly using the id attribute on the <grammar> element. Grammar names scope grammar rules so that they can be referenced across grammar corpora or libraries.

#Command line
nets-parser -loglevel=3 -grammar=default.g;another.g;grammar2.g -input=default.in -output=default.out

<!-- GXML -->
<grammar id="name">
...
</grammar>

Rules

Syntax rules are a fundamental building block of grammars. Rules have a unique name that can be referenced in other parts of the grammar. In NEBNF a rule name is followed by an equals sign and terminates in a semicolon. In GXML a rule is defined using the <rule> element. The first rule executed by Nets parser is the rule with id="start" unless the grammar_start command line parameter is used to refer to a different rule.

(* NEBNF *)
start = ... ;

<!-- GXML -->
<rule id="start">
...
</rule>

Primary

Primaries include an optional sequence, repeated sequence, grouped sequence, rule reference, terminal string, empty sequence, entity reference and pipeline. Primary is an abstract concept and is useful for describing features common across primaries.

Grammar Attributes

Attributes are a means of specifying and extending grammar characteristics. They are a standard feature of XML but may also be used in NEBNF at the end of a primary. id="name" and echo="" are examples of attributes.

(* NEBNF *)
start = L"Hello World".;

(* NEBNF with grammar attributes *)
start = "Hello World" encoding="wchar" echo="";

<!-- GXML -->
<rule id="start">
  <terminal encoding="wchar" echo="">Hello World</terminal>
</rule>

Terminals

Terminals are literal symbols (or strings of characters) which may appear in the rules of a formal grammar and cannot be changed using the rules of the grammar. Terminals define both parsed input (consumption) and produced output (production) using either quotes or question marks respectively in NEBNF.

Consumption Terminals

Consumption terminals are defined with either single or double quotes in NEBNF. GXML uses the <terminal> element to define a consumption terminal.

(* NEBNF *)
start = "Hello World";

<!-- GXML -->
<rule id="start">
  <terminal>Hello World</terminal>
</rule>

Production Terminals

NEBNF uses the ISO/IEC 14977 special sequence notation - question marks - to indicate a production terminal. GXML uses the <special> element to define a production terminal.

(* NEBNF *)
start = ?Hello World?;

<!-- GXML -->
<rule id="start">
  <special>Hello World</special>
</rule>

Encoding

Terminal encoding can be either multi-byte or wide character. NEBNF uses the C-style L prefix on a string to indicate a wchar-encoded terminal. GXML uses the encoding attribute with the value "char" or "wchar".

(* NEBNF *)
start = ?Hello World?;
start = L"Hello World";

<!-- GXML -->
<rule id="start">
  <special encoding="char">Hello World</special>
</rule>
<rule id="start">
  <terminal encoding="wchar">Hello World</terminal>
</rule>

Rule Reference

Rule references are symbols which can be replaced by rules. They are used to structure complex grammars into reusable rule sets or grammars. GXML uses the <ruleref> element with the idref attribute to define rule references.

(* NEBNF *)
start = new_rule;
new_rule = "Hello World";

<!-- GXML -->
<rule id="start">
  <ruleref idref="new_rule"/>
</rule>
<rule id="new_rule">
  <terminal>Hello World</terminal>
</rule>

Referencing a rule in a specific grammar requires the rule reference to be qualified by the grammar name.

(* NEBNF *)
start = mygrammar.rule_name;

<!-- GXML -->
<rule id="start">
  <ruleref grammar="mygrammar" idref="rule_name"/>
</rule>

Sequences

A sequence is a series of one or more expressions evaluated in order. Expressions include terminals, rule references, sequences, choices, groups, iterations, options and the empty sequence. ISO/IEC 14977 refers to a sequence as a single definition. Sequences are usually implicit but are concretely implemented in syntax rules, grouped sequences, repetitions and options.

In NEBNF a comma ',' separates primaries in a sequence.
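
The ordering rule can be illustrated outside Nets parser with a minimal Python sketch. It models a primary as a function that takes the input and a position and returns the new position, or None on failure. The helper names terminal and sequence are hypothetical, and the all-or-nothing failure behaviour is an assumption borrowed from PEG-style parsers rather than a description of how Nets parser is implemented.

```python
from typing import Callable, Optional

# A matcher takes (text, position) and returns the new position, or None on failure.
Matcher = Callable[[str, int], Optional[int]]

def terminal(literal: str) -> Matcher:
    """Match a literal string at the current position."""
    def match(text: str, pos: int) -> Optional[int]:
        return pos + len(literal) if text.startswith(literal, pos) else None
    return match

def sequence(*parts: Matcher) -> Matcher:
    """Match every part in order; fail as a whole if any part fails (assumed)."""
    def match(text: str, pos: int) -> Optional[int]:
        for part in parts:
            result = part(text, pos)
            if result is None:
                return None
            pos = result
        return pos
    return match

# Similar in spirit to: start = "Hello ","World";
start = sequence(terminal("Hello "), terminal("World"))
print(start("Hello World", 0))   # 11
print(start("Hello There", 0))   # None
```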

Group

A group is an unnamed sequence. It is defined using parentheses in NEBNF. GXML defines a group using the <group> element.

(* NEBNF *)
start = ("Hello ","World");

<!-- GXML -->
<rule id="start">
  <group>
    <terminal>Hello </terminal>
    <terminal>World</terminal>
  </group>
</rule>

Option

An option is a sequence that can occur zero or one times. It is defined using square brackets in NEBNF. GXML uses the <option> element.

(* NEBNF *)
start = ["Hello "],"World";

<!-- GXML -->
<rule id="start">
  <option>
    <terminal>Hello </terminal>
  </option>
  <terminal>World</terminal>
</rule>

Repetition

A repetition is a sequence that can occur zero or more times. In NEBNF and GXML it is referred to as an iteration. It is defined using curly braces in NEBNF. In GXML it is defined using the <iteration> element.

(* NEBNF *)
start = {char};

<!-- GXML -->
<rule id="start">
  <iteration>
    <ruleref idref="char"/>
  </iteration>
</rule>

Note: To echo characters from the input to the output, add a dot '.' symbol to the end of a primary in NEBNF, or add the echo="" attribute to a primary in GXML.

(* NEBNF *)
start = {char.};

<!-- GXML -->
<rule id="start">
  <iteration>
    <ruleref idref="char" echo=""/>
  </iteration>
</rule>

For a repetition to occur one or more times, either repeat the primary before the repetition in NEBNF or use the minoccurs attribute in GXML.

(* NEBNF *)
start = char,{char};

<!-- GXML -->
<rule id="start">
  <iteration minoccurs="1">
    <ruleref idref="char"/>
  </iteration>
</rule>

For a repetition to occur exactly n times use the * symbol in NEBNF or minoccurs and maxoccurs attributes in GXML.

(* NEBNF *)
start = 3 * char;

<!-- GXML -->
<rule id="start">
  <iteration minoccurs="3" maxoccurs="3">
    <ruleref idref="char"/>
  </iteration>
</rule>
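
The three repetition forms above differ only in their lower and upper bounds. The following Python sketch shows one way such bounds could be evaluated. The iteration helper and the single-character char rule are hypothetical, and greedy matching is an assumption; the sketch is not taken from Nets parser.

```python
from typing import Callable, Optional

# A matcher takes (text, position) and returns the new position, or None on failure.
Matcher = Callable[[str, int], Optional[int]]

def char(text: str, pos: int) -> Optional[int]:
    """Hypothetical 'char' rule: match any single character."""
    return pos + 1 if pos < len(text) else None

def iteration(body: Matcher, minoccurs: int = 0,
              maxoccurs: Optional[int] = None) -> Matcher:
    """Repeat body greedily; succeed if the repeat count is within the bounds."""
    def match(text: str, pos: int) -> Optional[int]:
        count = 0
        while maxoccurs is None or count < maxoccurs:
            result = body(text, pos)
            if result is None or result == pos:  # stop on failure or no progress
                break
            pos = result
            count += 1
        return pos if count >= minoccurs else None
    return match

zero_or_more = iteration(char)                              # {char}
one_or_more = iteration(char, minoccurs=1)                  # char,{char}
exactly_three = iteration(char, minoccurs=3, maxoccurs=3)   # 3 * char

print(zero_or_more("", 0))        # 0
print(one_or_more("", 0))         # None
print(exactly_three("abcd", 0))   # 3
print(exactly_three("ab", 0))     # None
```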

Empty Sequence

The empty sequence contains no primaries and always evaluates to true. Use the <empty> element in GXML.

(* NEBNF *)
start = ;

<!-- GXML -->
<rule id="start">
  <empty/>
</rule>

Choice

A choice is a sequence of one or more items evaluated in order until one is found to be true. More than one item in the series may be valid, but only the first valid choice is used. A vertical bar '|' is used to separate choice items in NEBNF. GXML uses the <choice> element to define choices.

(* NEBNF *)
start = a | b | c;

<!-- GXML -->
<rule id="start">
  <choice>
    <ruleref idref="a"/>
    <ruleref idref="b"/>
    <ruleref idref="c"/>
  </choice>
</rule>
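
The "first valid choice is used" rule matters when more than one alternative could match. The Python sketch below illustrates it with hypothetical terminal and choice helpers; it shows ordered choice in general and is not the Nets parser implementation.

```python
from typing import Callable, Optional

# A matcher takes (text, position) and returns the new position, or None on failure.
Matcher = Callable[[str, int], Optional[int]]

def terminal(literal: str) -> Matcher:
    """Match a literal string at the current position."""
    def match(text: str, pos: int) -> Optional[int]:
        return pos + len(literal) if text.startswith(literal, pos) else None
    return match

def choice(*alternatives: Matcher) -> Matcher:
    """Try each alternative in order and return the first one that matches."""
    def match(text: str, pos: int) -> Optional[int]:
        for alternative in alternatives:
            result = alternative(text, pos)
            if result is not None:
                return result
        return None
    return match

# Both alternatives match "ab", but only the first one is used.
short_first = choice(terminal("a"), terminal("ab"))
long_first = choice(terminal("ab"), terminal("a"))

print(short_first("ab", 0))   # 1
print(long_first("ab", 0))    # 2
print(short_first("xy", 0))   # None
```

Because the order decides the result, longer alternatives are usually listed before shorter ones that share the same prefix.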

Pipeline

A pipeline is a sequence of two or more primaries evaluated simultaneously, with the production from each primary in the chain sent to its successor in the pipeline sequence. Pipelines are common operating system features and are significant in many grammars. The colon ':' symbol is used to separate pipeline primaries in NEBNF. In GXML the <pipeline> element is used.

(* NEBNF *)
start = asciitowchar : {wchar.} : wchartoascii;

<!-- GXML -->
<rule id="start">
  <pipeline>
    <ruleref idref="asciitowchar"/>
    <iteration>
      <ruleref idref="wchar" echo=""/>
    </iteration>
    <ruleref idref="wchartoascii"/>
  </pipeline>
</rule>
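
The data flow of this pipeline can be pictured with Python generators, where each stage pulls items from its predecessor so that all stages advance together. This is only an analogy: the three stage functions are hypothetical stand-ins for the asciitowchar, wchar and wchartoascii rules, whose definitions are not given here.

```python
from typing import Iterable, Iterator

def ascii_to_wchar(data: Iterable[int]) -> Iterator[str]:
    """Hypothetical first stage: turn ASCII byte values into characters."""
    for byte in data:
        yield chr(byte)

def echo_wchars(chars: Iterable[str]) -> Iterator[str]:
    """Middle stage: consume each character and echo it unchanged."""
    for ch in chars:
        yield ch

def wchar_to_ascii(chars: Iterable[str]) -> Iterator[int]:
    """Hypothetical last stage: turn characters back into ASCII byte values."""
    for ch in chars:
        yield ord(ch)

def run_pipeline(data: bytes) -> bytes:
    """Chain the stages; the production of each stage feeds its successor."""
    return bytes(wchar_to_ascii(echo_wchars(ascii_to_wchar(data))))

print(run_pipeline(b"Hello World"))   # b'Hello World'
```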

Predicates

Predicates are used to test primaries without consuming input. There are three kinds of predicate: not, and, and again. The exception notation ('-') is shown last; its GXML form uses the not predicate.

Not

(* NEBNF *)
start = !"Anything ","Hello ","World";

<!-- GXML -->
<rule id="start">
  <terminal predicate="not">Anything </terminal>
  <terminal>Hello </terminal>
  <terminal>World</terminal>
</rule>

And

(* NEBNF *)
start = &"Hello ","Hello ","World";

<!-- GXML -->
<rule id="start">
  <terminal predicate="and">Hello </terminal>
  <terminal>Hello </terminal>
  <terminal>World</terminal>
</rule>

Again

(* NEBNF *)
start = &&"Hello ","Hello ","World";

<!-- GXML -->
<rule id="start">
  <terminal predicate="again">Hello </terminal>
  <terminal>Hello </terminal>
  <terminal>World</terminal>
</rule>

Exception

(* NEBNF *)
start = "Hello " - "Anything ","World";

<!-- GXML -->
<rule id="start">
  <terminal predicate="not">Anything </terminal>
  <terminal>Hello </terminal>
  <terminal>World</terminal>
</rule>
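
The key property of a predicate is that it reports success or failure but leaves the input position unchanged. The Python sketch below illustrates this for the not and and predicates, using hypothetical helper names. The again predicate is not modelled because its behaviour is not described in this section.

```python
from typing import Callable, Optional

# A matcher takes (text, position) and returns the new position, or None on failure.
Matcher = Callable[[str, int], Optional[int]]

def terminal(literal: str) -> Matcher:
    """Match a literal string at the current position."""
    def match(text: str, pos: int) -> Optional[int]:
        return pos + len(literal) if text.startswith(literal, pos) else None
    return match

def sequence(*parts: Matcher) -> Matcher:
    """Match every part in order; fail as a whole if any part fails."""
    def match(text: str, pos: int) -> Optional[int]:
        for part in parts:
            result = part(text, pos)
            if result is None:
                return None
            pos = result
        return pos
    return match

def and_predicate(body: Matcher) -> Matcher:
    """Succeed without consuming input if body matches."""
    return lambda text, pos: pos if body(text, pos) is not None else None

def not_predicate(body: Matcher) -> Matcher:
    """Succeed without consuming input if body does NOT match."""
    return lambda text, pos: pos if body(text, pos) is None else None

# Similar in spirit to: start = !"Anything ","Hello ","World";
not_rule = sequence(not_predicate(terminal("Anything ")),
                    terminal("Hello "), terminal("World"))
# Similar in spirit to: start = &"Hello ","Hello ","World";
and_rule = sequence(and_predicate(terminal("Hello ")),
                    terminal("Hello "), terminal("World"))

print(not_rule("Hello World", 0))   # 11
print(and_rule("Hello World", 0))   # 11
print(and_predicate(terminal("Hello "))("Hello World", 0))   # 0
```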

Entity

An Entity is a rule with a single terminal as its definition. It is used to describe reusable text that can be referenced in either consumption or production terminals.

(* NEBNF *)
hello_world = "Hello World";

<!-- GXML -->
<entity id="hello_world">
  <terminal>Hello World</terminal>
</entity>

Entity Reference

Entity references are commonly used in XML documents to refer to restricted and common characters and strings. Examples include &amp; and &quot;. Languages like C use the backslash in strings to prefix a predefined entity such as newline \n or a character specified in hex such as \x0A. In Unix shells ${name} is used to reference environment variables. We use the term 'entity reference' to describe these types of reference.

NEBNF uses \xhhhh to define a character using a hexadecimal number. GXML uses &#dd; to define a character using a decimal number.

(* NEBNF *)
start = ?${hello} ${world}\n?;
hello = "Hello ";
world = "World";

<!-- GXML -->
<rule id="start">
  <special>&hello; &world;</special>
</rule>
<rule id="hello">
  <terminal>Hello </terminal>
</rule>
<rule id="world">
  <terminal>World</terminal>
</rule>
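
The effect of the ${name} references in a production terminal can be approximated by simple text substitution. The Python sketch below is an illustration under that assumption; the expand function and the entities table are hypothetical and do not describe how Nets parser resolves references internally.

```python
import re
from typing import Dict

# Entity texts, keyed by rule name (mirrors the hello and world rules above).
entities: Dict[str, str] = {"hello": "Hello ", "world": "World"}

def expand(template: str, table: Dict[str, str]) -> str:
    """Replace each ${name} in template with the text of the named entity.

    An unknown name raises KeyError in this sketch.
    """
    return re.sub(r"\$\{(\w+)\}", lambda m: table[m.group(1)], template)

print(repr(expand("${hello}${world}\n", entities)))   # 'Hello World\n'
```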

Comments

Comments can be added to grammars to help describe the grammar. NEBNF uses the (* *) combination and GXML uses standard XML comment notation <!-- -->.

(* NEBNF *)
(* This is a comment which may wrap over
multiple lines*)
start = ?Hello World?;

<!--This is a comment which may wrap over
multiple lines-->
<rule id="start">
  <special>Hello World</special>
</rule>

Mixing NEBNF and GXML

Nets parser grammars can be written using a mix of NEBNF and GXML.

<!-- GXML -->
<rule id="start">
(* NEBNF *)
  start = ?Hello World?;
</rule>