Creating parsers

Developing Tree-sitter parsers can have a difficult learning curve, but once you get the hang of it, it can be fun and even zen-like. This document should help you to build an effective mental model for parser development.

Getting Started

Understanding the problem

Writing a grammar requires creativity. There are an infinite number of CFGs (context-free grammars) that can be used to describe any given language. In order to produce a good Tree-sitter parser, you need to create a grammar with two important properties:

  1. An intuitive structure - Tree-sitter’s output is a concrete syntax tree; each node in the tree corresponds directly to a terminal or non-terminal symbol in the grammar. So in order to produce an easy-to-analyze tree, there should be a direct correspondence between the symbols in your grammar and the recognizable constructs in the language. This might seem obvious, but it is very different from the way that context-free grammars are often written in contexts like language specifications or Yacc/Bison parsers.

  2. A close adherence to LR(1) - Tree-sitter is based on the GLR parsing algorithm. This means that while it can handle any context-free grammar, it works most efficiently with a class of context-free grammars called LR(1) Grammars. In this respect, Tree-sitter’s grammars are similar to (but less restrictive than) Yacc and Bison grammars, but different from ANTLR grammars, Parsing Expression Grammars, or the ambiguous grammars commonly used in language specifications.

It’s unlikely that you’ll be able to satisfy these two properties just by translating an existing context-free grammar directly into Tree-sitter’s grammar format. There are a few kinds of adjustments that are often required. The following sections will explain these adjustments in more depth.

Dependencies

In order to develop a Tree-sitter parser, there are two dependencies that you need to install:

  1. Node.js - Tree-sitter grammars are written in JavaScript, and Tree-sitter uses Node.js to interpret JavaScript files, so the node command needs to be available on your PATH.

  2. A C compiler - Tree-sitter generates parsers that are written in C. In order to run and test these parsers with the tree-sitter parse or tree-sitter test commands, you must have a C/C++ compiler installed.

Installation

To create a Tree-sitter parser, you need to use the tree-sitter CLI. You can install the CLI in a few different ways:

  1. With npm: npm install tree-sitter-cli (this is the approach used in the project setup below)

  2. With Rust's package manager: cargo install tree-sitter-cli

  3. By downloading a prebuilt binary for your platform from the tree-sitter GitHub releases page

Setting up a Project

The preferred convention is to name the parser repository "tree-sitter-" followed by the name of the language.

mkdir tree-sitter-${YOUR_LANGUAGE_NAME}
cd tree-sitter-${YOUR_LANGUAGE_NAME}

You should create a package.json file that describes your project, and allows your parser to be used from Node.js.

# This will prompt you for input
npm init

# This installs a small module that lets your parser be used from Node
npm install --save nan

# This installs the Tree-sitter CLI itself
npm install --save-dev tree-sitter-cli

The last command will install the CLI into the node_modules folder in your project. An executable program called tree-sitter will be created at the path ./node_modules/.bin/tree-sitter. You may want to follow the Node.js convention of adding ./node_modules/.bin to your PATH so that you can easily run this program when working in this directory.
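
For example, in a POSIX shell you might run the following in this directory (or automate it with a tool like direnv; the exact mechanism is up to you):

export PATH=$PATH:./node_modules/.bin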

Once you have the CLI installed, create a file called grammar.js with the following skeleton:

module.exports = grammar({
  name: 'the_language_name',

  rules: {
    // The production rules of the context-free grammar
    source_file: $ => 'hello'
  }
});

Then run the following command:

tree-sitter generate

This will generate the C code required to parse this trivial language, as well as all of the files needed to compile and load this native parser as a Node.js module. You can test this parser by creating a source file with the contents hello and parsing it:

tree-sitter parse ./the-file

This should print the following:

(source_file [0, 0] - [0, 5])

You might notice that the first time you run tree-sitter parse, it takes a few seconds. This is because Tree-sitter automatically compiles your C code into a dynamically-loadable library. Whenever you make changes to your grammar, you can update the parser simply by re-running tree-sitter generate. When the parser changes, Tree-sitter will recompile it as needed.

Writing the Grammar

It’s usually a good idea to find a formal specification for the language you’re trying to parse. This specification will most likely contain a context-free grammar. As you read through the rules of this CFG, you will probably discover a complex and cyclic graph of relationships. It might be unclear how you should navigate this graph as you define your grammar.

Although languages have very different constructs, their constructs can often be categorized into similar groups like Declarations, Definitions, Statements, Expressions, Types, and Patterns. In writing your grammar, a good first step is to create just enough structure to include all of these basic groups of symbols. For an imaginary C-like language, this might look something like this:

{
  // ...

  rules: {
    source_file: $ => repeat($._definition),

    _definition: $ => choice(
      $.function_definition
      // TODO: other kinds of definitions
    ),

    function_definition: $ => seq(
      'func',
      $.identifier,
      $.parameter_list,
      $._type,
      $.block
    ),

    parameter_list: $ => seq(
      '(',
       // TODO: parameters
      ')'
    ),

    _type: $ => choice(
      'bool'
      // TODO: other kinds of types
    ),

    block: $ => seq(
      '{',
      repeat($._statement),
      '}'
    ),

    _statement: $ => choice(
      $.return_statement
      // TODO: other kinds of statements
    ),

    return_statement: $ => seq(
      'return',
      $._expression,
      ';'
    ),

    _expression: $ => choice(
      $.identifier,
      $.number
      // TODO: other kinds of expressions
    ),

    identifier: $ => /[a-z]+/,

    number: $ => /\d+/
  }
}

Some of the details of this grammar will be explained in more depth later on, but if you focus on the TODO comments, you can see that the overall strategy is breadth-first. Notably, this initial skeleton does not need to directly match an exact subset of the context-free grammar in the language specification. It just needs to touch on the major groupings of rules in as simple and obvious a way as possible.

With this structure in place, you can now freely decide what part of the grammar to flesh out next. For example, you might decide to start with types. One-by-one, you could define the rules for writing basic types and composing them into more complex types:

{
  // ...

  _type: $ => choice(
    $.primitive_type,
    $.array_type,
    $.pointer_type
  ),

  primitive_type: $ => choice(
    'bool',
    'int'
  ),

  array_type: $ => seq(
    '[',
    ']',
    $._type
  ),

  pointer_type: $ => seq(
    '*',
    $._type
  )
}

After developing the type sublanguage a bit further, you might decide to switch to working on statements or expressions instead. It’s often useful to check your progress by trying to parse some real code using tree-sitter parse.
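
For example, you might keep a few samples of real code in an examples directory (the directory name and file name here are just illustrative) and parse them as you go:

mkdir examples
# ... add some sample source files, then:
tree-sitter parse examples/some-sample-file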

Writing unit tests

For each rule that you add to the grammar, you should first create a test that describes how the syntax trees should look when parsing that rule. These tests are written using specially-formatted text files in a corpus directory in your parser’s root folder. Here is an example of how these tests should look:

==================
Return statements
==================

func x() int {
  return 1;
}

---

(source_file
  (function_definition
    (identifier)
    (parameter_list)
    (primitive_type)
    (block
      (return_statement (number)))))

The name of the test is written between two lines containing only = characters. Then the source code is written, followed by a line containing three or more - characters. Then, the expected syntax tree is written as an S-expression. The exact placement of whitespace in the S-expression doesn’t matter, but ideally the syntax tree should be legible. Note that the S-expression does not show syntax nodes like func, ( and ;, which are expressed as strings and regexes in the grammar. It only shows the named nodes, as described in the previous page.

These tests are important. They serve as the parser’s API documentation, and they can be run every time you change the grammar to verify that everything still parses correctly. You can run these tests using this command:

tree-sitter test

To run a particular test, you can use the -f flag:

tree-sitter test -f 'Return statements'

The recommendation is to be comprehensive in adding tests. If it's a visible node, add it to a test file in your corpus directory. It's typically a good idea to test all of the permutations of each language construct. This increases test coverage and also acquaints readers with the expected outputs and the "edges" of the language.
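
For example, you can add further cases to the same corpus file. A hypothetical test for pointer types in the toy grammar above might look like this:

==================
Pointer types
==================

func y() *bool {
  return z;
}

---

(source_file
  (function_definition
    (identifier)
    (parameter_list)
    (pointer_type (primitive_type))
    (block
      (return_statement (identifier)))))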

The Grammar DSL

The following built-in functions are used to define Tree-sitter grammars. Use-cases for some of these functions will be explained in more detail in later sections.

  seq(rule1, rule2, ...) - Matches the given rules one after another, like the right-hand side of a production in a CFG.
  choice(rule1, rule2, ...) - Matches any one of the given rules, like multiple alternative productions for a single symbol.
  repeat(rule) - Matches zero or more repetitions of the given rule.
  repeat1(rule) - Matches one or more repetitions of the given rule.
  optional(rule) - Matches the given rule or nothing at all.
  prec(number, rule) - Assigns a numerical precedence to the given rule, which is used to resolve LR(1) conflicts when the parser is generated.
  prec.left(number, rule) - Marks the given rule as left-associative, optionally with a numerical precedence.
  prec.right(number, rule) - Marks the given rule as right-associative, optionally with a numerical precedence.
  prec.dynamic(number, rule) - Assigns a precedence that is applied at runtime when the GLR algorithm is handling a genuine ambiguity.
  token(rule) - Marks the given rule as producing a single token, so that its entire contents are handled by the lexer as one unit.
  alias(rule, name) - Causes the given rule to appear in the syntax tree under the given name.

Strings and regular expressions match terminal tokens directly, and expressions like $.identifier refer to other rules by name.
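
As a brief illustration, several of these functions might combine in a rule for a comma-separated parameter list (a sketch; the rule names are placeholders, not part of Tree-sitter's API):

{
  // ...

  parameter_list: $ => seq(
    '(',
    optional(seq(
      $.parameter,
      repeat(seq(',', $.parameter))
    )),
    ')'
  ),

  parameter: $ => seq($.identifier, $._type)
}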

In addition to the name and rules fields, grammars have a few other optional public fields that influence the behavior of the parser:

  extras - An array of tokens that may appear anywhere in the language. This is often used for whitespace and comments; by default, whitespace is accepted anywhere.
  inline - An array of rule names that should be automatically removed from the grammar by replacing all of their usages with a copy of their definition.
  conflicts - An array of arrays of rule names. Each inner array represents a set of rules involved in an LR(1) conflict that is intended to exist in the grammar; Tree-sitter will resolve these conflicts at runtime using the GLR algorithm.
  externals - An array of token names which can be returned by an external scanner (see the External Scanners section below).
  word - The name of a token that will match keywords for the purpose of the keyword extraction optimization (see the Keywords section below).
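
For example, a grammar that allows comments and whitespace to appear anywhere might declare them like this (the comment syntax shown is just an illustration):

grammar({
  name: 'my_language',

  extras: $ => [
    /\s/,
    $.comment
  ],

  rules: {
    // ...

    comment: $ => /\/\/.*/
  }
});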

Adjusting existing grammars

Imagine that you were just starting work on the Tree-sitter JavaScript parser. You might try to directly mirror the structure of the ECMAScript Language Spec. To illustrate the problem with this approach, consider the following line of code:

return x + y;

According to the specification, this line is a ReturnStatement, the fragment x + y is an AdditiveExpression, and x and y are both IdentifierReferences. The relationship between these constructs is captured by a complex series of production rules:

ReturnStatement          ->  'return' Expression
Expression               ->  AssignmentExpression
AssignmentExpression     ->  ConditionalExpression
ConditionalExpression    ->  LogicalORExpression
LogicalORExpression      ->  LogicalANDExpression
LogicalANDExpression     ->  BitwiseORExpression
BitwiseORExpression      ->  BitwiseXORExpression
BitwiseXORExpression     ->  BitwiseANDExpression
BitwiseANDExpression     ->  EqualityExpression
EqualityExpression       ->  RelationalExpression
RelationalExpression     ->  ShiftExpression
ShiftExpression          ->  AdditiveExpression
AdditiveExpression       ->  MultiplicativeExpression
MultiplicativeExpression ->  ExponentiationExpression
ExponentiationExpression ->  UnaryExpression
UnaryExpression          ->  UpdateExpression
UpdateExpression         ->  LeftHandSideExpression
LeftHandSideExpression   ->  NewExpression
NewExpression            ->  MemberExpression
MemberExpression         ->  PrimaryExpression
PrimaryExpression        ->  IdentifierReference

The language spec encodes the 20 precedence levels of JavaScript expressions using 20 different non-terminal symbols. If we were to create a concrete syntax tree representing this statement according to the language spec, it would have twenty levels of nesting and it would contain nodes with names like BitwiseXORExpression, which are unrelated to the actual code.
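
By contrast, with the flatter grammar developed in the next section, the same statement can be represented by a tree shaped roughly like this (a sketch; the exact node names depend on the grammar you write):

(return_statement
  (binary_expression
    (identifier)
    (identifier)))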

Using precedence

To produce a readable syntax tree, we’d like to model JavaScript expressions using a much flatter structure like this:

{
  // ...

  _expression: $ => choice(
    $.identifier,
    $.unary_expression,
    $.binary_expression,
    // ...
  ),

  unary_expression: $ => choice(
    seq('-', $._expression),
    seq('!', $._expression),
    // ...
  ),

  binary_expression: $ => choice(
    seq($._expression, '*', $._expression),
    seq($._expression, '+', $._expression),
    // ...
  ),
}

Of course, this flat structure is highly ambiguous. If we try to generate a parser, Tree-sitter gives us an error message:

Error: Unresolved conflict for symbol sequence:

  '-'  _expression  •  '*'  …

Possible interpretations:

  1:  '-'  (binary_expression  _expression  •  '*'  _expression)
  2:  (unary_expression  '-'  _expression)  •  '*'  …

Possible resolutions:

  1:  Specify a higher precedence in `binary_expression` than in the other rules.
  2:  Specify a higher precedence in `unary_expression` than in the other rules.
  3:  Specify a left or right associativity in `unary_expression`
  4:  Add a conflict for these rules: `binary_expression` `unary_expression`

For an expression like -a * b, it’s not clear whether the - operator applies to the a * b or just to the a. This is where the prec function described above comes into play. By wrapping a rule with prec, we can indicate that certain sequences of symbols should bind to each other more tightly than others. For example, the '-', $._expression sequence in unary_expression should bind more tightly than the $._expression, '+', $._expression sequence in binary_expression:

{
  // ...

  unary_expression: $ => prec(2, choice(
    seq('-', $._expression),
    seq('!', $._expression),
    // ...
  ))
}
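
Once the remaining conflicts are resolved (see the next section), an expression like -a * b parses with the unary minus binding more tightly, producing a tree shaped roughly like this:

(binary_expression
  (unary_expression
    (identifier))
  (identifier))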

Using associativity

Applying a higher precedence in unary_expression fixes that conflict, but there is still another conflict:

Error: Unresolved conflict for symbol sequence:

  _expression  '*'  _expression  •  '*'  …

Possible interpretations:

  1:  _expression  '*'  (binary_expression  _expression  •  '*'  _expression)
  2:  (binary_expression  _expression  '*'  _expression)  •  '*'  …

Possible resolutions:

  1:  Specify a left or right associativity in `binary_expression`
  2:  Add a conflict for these rules: `binary_expression`

For an expression like a * b * c, it’s not clear whether we mean a * (b * c) or (a * b) * c. This is where prec.left and prec.right come into use. We want to select the second interpretation, so we use prec.left.

{
  // ...

  binary_expression: $ => choice(
    prec.left(2, seq($._expression, '*', $._expression)),
    prec.left(1, seq($._expression, '+', $._expression)),
    // ...
  ),
}
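
Now a * b * c is parsed as (a * b) * c, which corresponds to a tree shaped roughly like this:

(binary_expression
  (binary_expression
    (identifier)
    (identifier))
  (identifier))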

Hiding rules

You may have noticed in the above examples that some of the grammar rule names, like _expression and _type, begin with an underscore. Starting a rule’s name with an underscore causes the rule to be hidden in the syntax tree. This is useful for rules like _expression in the grammars above, which always just wrap a single child node. If these nodes were not hidden, they would add substantial depth and noise to the syntax tree without making it any easier to understand.
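
As a quick illustration, with the hidden _expression rule, the statement return x; in the toy grammar above produces:

(return_statement (identifier))

If the rule were instead named expression (without the underscore), the same statement would produce an extra wrapper node:

(return_statement (expression (identifier)))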

LR conflicts

Lexical Analysis

Tree-sitter’s parsing process is divided into two phases: parsing (which is described above) and lexing - the process of grouping individual characters into the language’s fundamental tokens. There are a few important things to know about how Tree-sitter’s lexing works.

Conflicting Tokens

Grammars often contain multiple tokens that can match the same characters. For example, a grammar might contain the tokens "if" and /[a-z]+/, both of which can match the string if. Tree-sitter differentiates between these conflicting tokens in a few ways:

  1. Context-aware Lexing - Tree-sitter performs lexing on-demand, during the parsing process. At any given position in a source document, the lexer only tries to recognize tokens that are valid at that position in the document.

  2. Lexical Precedence - When the precedence functions described above are used within the token function, the given precedence values serve as instructions to the lexer. If there are two valid tokens that match the characters at a given position in the document, Tree-sitter will select the one with the higher precedence (see the sketch after this list).

  3. Match Length - If multiple valid tokens with the same precedence match the characters at a given position in a document, Tree-sitter will select the token that matches the longest sequence of characters.

  4. Match Specificity - If there are two valid tokens with the same precedence and which both match the same number of characters, Tree-sitter will prefer a token that is specified in the grammar as a String over a token specified as a RegExp.
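
A minimal sketch of lexical precedence (the rule names here are illustrative, not part of any real grammar): the two tokens below can match exactly the same strings, and in a position where both are valid, the higher precedence makes the lexer produce a type_identifier.

{
  // ...

  type_identifier: $ => token(prec(1, /[a-z]+/)),

  identifier: $ => /[a-z]+/
}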

Keywords

Many languages have a set of keyword tokens (e.g. if, for, return), as well as a more general token (e.g. identifier) that matches any word, including many of the keyword strings. For example, JavaScript has a keyword instanceof, which is used as a binary operator, like this:

if (a instanceof Something) b();

The following, however, is not valid JavaScript:

if (a instanceofSomething) b();

A keyword like instanceof cannot be followed immediately by another letter, because then it would be tokenized as an identifier, even though an identifier is not valid at that position. Because Tree-sitter uses context-aware lexing, as described above, it would not normally impose this restriction. By default, Tree-sitter would recognize instanceofSomething as two separate tokens: the instanceof keyword followed by an identifier.

Keyword Extraction

Fortunately, Tree-sitter has a feature that allows you to fix this, so that you can match the behavior of other standard parsers: the word token. If you specify a word token in your grammar, Tree-sitter will find the set of keyword tokens that match strings also matched by the word token. Then, during lexing, instead of matching each of these keywords individually, Tree-sitter will match the keywords via a two-step process where it first matches the word token.

For example, suppose we added identifier as the word token in our JavaScript grammar:

grammar({
  name: 'javascript',

  word: $ => $.identifier,

  rules: {
    _expression: $ => choice(
      $.identifier,
      $.unary_expression,
      $.binary_expression
      // ...
    ),

    binary_expression: $ => choice(
      prec.left(1, seq($._expression, 'instanceof', $._expression))
      // ...
    ),

    unary_expression: $ => choice(
      prec.left(2, seq('typeof', $._expression))
      // ...
    ),

    identifier: $ => /[a-z_]+/
  }
});

Tree-sitter would identify typeof and instanceof as keywords. Then, when parsing the invalid code above, rather than scanning for the instanceof token individually, it would scan for an identifier first, and find instanceofSomething. It would then correctly recognize the code as invalid.

Aside from improving error detection, keyword extraction also has performance benefits. It allows Tree-sitter to generate a smaller, simpler lexing function, which means that the parser will compile much more quickly.

External Scanners

Many languages have some tokens whose structure is impossible or inconvenient to describe with a regular expression. Some examples:

  Indent and dedent tokens in Python
  Heredocs in Bash and Ruby
  Percent strings in Ruby

Tree-sitter allows you to handle these kinds of tokens using external scanners. An external scanner is a set of C functions that you, the grammar author, can write by hand in order to add custom logic for recognizing certain tokens.

To use an external scanner, there are a few steps. First, add an externals section to your grammar. This section should list the names of all of your external tokens. These names can then be used elsewhere in your grammar.

grammar({
  name: 'my_language',

  externals: $ => [
    $.indent,
    $.dedent,
    $.newline
  ],

  // ...
});

Then, add another C or C++ source file to your project. Currently, its path must be src/scanner.c or src/scanner.cc for the CLI to recognize it. Be sure to add this file to the sources section of your binding.gyp file so that it will be included when your project is compiled by Node.js.
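
For example, after this change, the sources list in binding.gyp might look roughly like this (the exact contents of the file vary between projects):

"sources": [
  "src/parser.c",
  "src/scanner.c",
  "src/binding.cc"
]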

In this new source file, define an enum type containing the names of all of your external tokens. The ordering of this enum must match the order in your grammar’s externals array.

#include <tree_sitter/parser.h>

enum TokenType {
  INDENT,
  DEDENT,
  NEWLINE
};

Finally, you must define five functions with specific names, based on your language’s name and five actions: create, destroy, serialize, deserialize, and scan. These functions must all use C linkage, so if you’re writing the scanner in C++, you need to declare them with the extern "C" qualifier.

Create

void * tree_sitter_my_language_external_scanner_create() {
  // ...
}

This function should create your scanner object. It will only be called once anytime your language is set on a parser. Often, you will want to allocate memory on the heap and return a pointer to it. If your external scanner doesn’t need to maintain any state, it’s ok to return NULL.
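
For example, a hypothetical scanner that tracks a stack of indentation levels (to support the indent/dedent tokens above) might be created like this:

#include <stdint.h>
#include <stdlib.h>

// Hypothetical scanner state: a fixed-size stack of indentation widths.
typedef struct {
  uint32_t indent_lengths[32];
  uint32_t indent_count;
} Scanner;

void *tree_sitter_my_language_external_scanner_create() {
  // Allocate zero-initialized state on the heap.
  return calloc(1, sizeof(Scanner));
}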

Destroy

void tree_sitter_my_language_external_scanner_destroy(void *payload) {
  // ...
}

This function should free any memory used by your scanner. It is called once when a parser is deleted or assigned a different language. It receives as an argument the same pointer that was returned from the create function. If your create function didn’t allocate any memory, this function can be a noop.
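
For the hypothetical scanner above, this is just:

void tree_sitter_my_language_external_scanner_destroy(void *payload) {
  free(payload);
}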

Serialize

unsigned tree_sitter_my_language_external_scanner_serialize(
  void *payload,
  char *buffer
) {
  // ...
}

This function should copy the complete state of your scanner into a given byte buffer, and return the number of bytes written. The function is called every time the external scanner successfully recognizes a token. It receives a pointer to your scanner and a pointer to a buffer. The maximum number of bytes that you can write is given by the TREE_SITTER_SERIALIZATION_BUFFER_SIZE constant, defined in the tree_sitter/parser.h header file.

The data that this function writes will ultimately be stored in the syntax tree so that the scanner can be restored to the right state when handling edits or ambiguities. For your parser to work correctly, the serialize function must store its entire state, and deserialize must restore the entire state. For good performance, you should design your scanner so that its state can be serialized as quickly and compactly as possible.
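
For the hypothetical indentation scanner above, serialization can copy the whole struct into the buffer (a sketch; this state is small enough to always fit in TREE_SITTER_SERIALIZATION_BUFFER_SIZE, which a real scanner should verify before writing):

#include <string.h>

unsigned tree_sitter_my_language_external_scanner_serialize(
  void *payload,
  char *buffer
) {
  Scanner *scanner = payload;
  memcpy(buffer, scanner, sizeof(*scanner));
  return sizeof(*scanner);
}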

Deserialize

void tree_sitter_my_language_external_scanner_deserialize(
  void *payload,
  const char *buffer,
  unsigned length
) {
  // ...
}

This function should restore the state of your scanner based on the bytes that were previously written by the serialize function. It is called with a pointer to your scanner, a pointer to the buffer of bytes, and the number of bytes that should be read.
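
Continuing the same hypothetical scanner (note that this function may be called with a length of zero, in which case the scanner should reset itself to a clean initial state):

void tree_sitter_my_language_external_scanner_deserialize(
  void *payload,
  const char *buffer,
  unsigned length
) {
  Scanner *scanner = payload;
  if (length == sizeof(*scanner)) {
    memcpy(scanner, buffer, length);
  } else {
    scanner->indent_count = 0;
  }
}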

Scan

bool tree_sitter_my_language_external_scanner_scan(
  void *payload,
  TSLexer *lexer,
  const bool *valid_symbols
) {
  // ...
}

This function is responsible for recognizing external tokens. It should return true if a token was recognized, and false otherwise. It is called with a “lexer” struct with the following fields:

  lookahead - The current next character in the stream, represented as a 32-bit unicode code point.
  result_symbol - The symbol that was recognized. Your scan function should assign one of the values of your TokenType enum to this field before returning true.
  advance(lexer, skip) - A function for advancing to the next character. If you pass true for the second argument, the current character will be treated as whitespace.
  mark_end(lexer) - A function for marking the end of the recognized token. By default, the token ends at the position where advance was most recently called, but mark_end allows you to read ahead past the end of the token.
  get_column(lexer) - A function for querying the current column position of the lexer.

The third argument to the scan function is an array of booleans that indicates which of your external tokens are currently expected by the parser. You should only look for a given token if it is valid according to this array. At the same time, you cannot backtrack, so you may need to combine certain pieces of logic.

if (valid_symbols[INDENT] || valid_symbols[DEDENT]) {

  // ... logic that is common to both `INDENT` and `DEDENT`

  if (valid_symbols[INDENT]) {

    // ... logic that is specific to `INDENT`

    lexer->result_symbol = INDENT;
    return true;
  }
}