Using Parsers

All of Tree-sitter’s parsing functionality is exposed through C APIs. Applications written in higher-level languages can use Tree-sitter via binding libraries like node-tree-sitter or rust-tree-sitter, which have their own documentation.

This document will describes the general concepts of how to use Tree-sitter, which should be relevant regardless of what language you’re using. It also goes into some C-specific details that are useful if you’re using the C API directly or are building a new binding to a different language.

Building the Runtime Library

Building the runtime library requires one git submodule: utf8proc. Make sure that utf8proc is downloaded by running this command from the Tree-sitter directory:

git submodule update --init

To build the runtime library on a POSIX system, run this script, which will create a static library called libruntime.a in the Tree-sitter folder:

script/build-runtime

Alternatively, you can use the runtime library in a larger project by adding one source file to the project. This source file needs three directories to be in the include path when compiled:

source file:

include directories:

The Objects

There are four main types of objects involved when using Tree-sitter: languages, parsers, syntax trees, and syntax nodes. In C, these are called TSParser, TSLanguage, TSTree, and TSNode.

An Example Program

Here’s an example of a simple C program that uses the Tree-sitter JSON parser.

// Filename - test-json-parser.c

#include <assert.h>
#include <string.h>
#include <stdio.h>
#include <tree_sitter/runtime.h>

// Declare the `tree_sitter_json` function, which is
// implemented by the `tree-sitter-json` library.
TSLanguage *tree_sitter_json();

int main() {
  // Create a parser.
  TSParser *parser = ts_parser_new();

  // Set the parser's language (JSON in this case).
  ts_parser_set_language(parser, tree_sitter_json());

  // Build a syntax tree based on source code stored in a string.
  const char *source_code = "[1, null]";
  TSTree *tree = ts_parser_parse_string(
    parser,
    NULL,
    source_code,
    strlen(source_code)
  );

  // Get the root node of the syntax tree.
  TSNode root_node = ts_tree_root_node(tree);

  // Get some child nodes.
  TSNode array_node = ts_node_named_child(root_node, 0);
  TSNode number_node = ts_node_named_child(array_node, 0);

  // Check that the nodes have the expected types.
  assert(strcmp(ts_node_type(root_node), "value") == 0);
  assert(strcmp(ts_node_type(array_node), "array") == 0);
  assert(strcmp(ts_node_type(number_node), "number") == 0);

  // Check that the nodes have the expected child counts.
  assert(ts_node_child_count(root_node) == 1);
  assert(ts_node_child_count(array_node) == 5);
  assert(ts_node_named_child_count(array_node) == 2);
  assert(ts_node_child_count(number_node) == 0);

  // Print the syntax tree as an S-expression.
  char *string = ts_node_string(root_node);
  printf("Syntax tree: %s\n", string);

  // Free all of the heap-allocated memory.
  free(string);
  ts_tree_delete(tree);
  ts_parser_delete(parser);
  return 0;
}

This program uses the Tree-sitter C API, which is declared in the header file tree_sitter/runtime.h, so we need to add the tree_sitter/include directory to the include path. We also need to link libruntime.a into the binary. We compile the source code of the JSON language directly into the binary as well.

clang                                   \
  -I tree-sitter/include                \
  test-json-parser.c                    \
  tree-sitter-json/src/parser.c         \
  tree-sitter/libruntime.a  \
  -o test-json-parser

./test-json-parser

Providing the Source Code

In the example above, we parsed source code stored in a simple string using the ts_parser_parse_string function:

TSTree *ts_parser_parse_string(
  TSParser *self,
  const TSTree *old_tree,
  const char *string,
  uint32_t length
);

You may want to parse source code that’s stored in a custom data structure, like a piece table or a rope. In this case, you can use the more general ts_parser_parse function:

TSTree *ts_parser_parse(
  TSParser *self,
  const TSTree *old_tree,
  TSInput input
);

The TSInput structure lets you to provide your own function for reading a chunk of text at a given byte offset and row/column position. The function can return text encoded in either UTF8 or UTF16. This interface allows you to efficiently parse text that is stored in your own data structure.

typedef struct {
  void *payload;
  const char *(*read)(
    void *payload,
    uint32_t byte_offset,
    TSPoint position,
    uint32_t *bytes_read
  );
  TSInputEncoding encoding;
} TSInput;

Syntax Nodes

Tree-sitter provides a DOM-style interface for inspecting syntax trees. A syntax node’s type is a string that indicates which grammar rule the node represents.

const char *ts_node_type(TSNode);

Syntax nodes store their position in the source code both in terms of raw bytes and row/column coordinates:

uint32_t ts_node_start_byte(TSNode);
uint32_t ts_node_end_byte(TSNode);

typedef struct {
  uint32_t row;
  uint32_t column;
} TSPoint;

TSPoint ts_node_start_point(TSNode);
TSPoint ts_node_end_point(TSNode);

Retrieving Nodes

Every tree has a root node:

TSNode ts_tree_root_node(const TSTree *);

Once you have a node, you can access the node’s children:

uint32_t ts_node_child_count(TSNode);
TSNode ts_node_child(TSNode, uint32_t);

You can also access its siblings and parent:

TSNode ts_node_next_sibling(TSNode);
TSNode ts_node_prev_sibling(TSNode);
TSNode ts_node_parent(TSNode);

These methods may all return a null node to indicate, for example, that a node does not have a next sibling. You can check if a node is null:

bool ts_node_is_null(TSNode);

Named vs Anonymous Nodes

Tree-sitter produces concrete syntax trees - trees that contain nodes for every individual token in the source code, including things like commas and parentheses. This is important for use-cases that deal with individual tokens, like syntax highlighting. But some types of code analysis are easier to perform using an abstract syntax tree - a tree in which the less important details have been removed. Tree-sitter’s trees support these use cases by making a distinction between named and anonymous nodes.

Consider a grammar rule like this:

if_statement: $ => seq(
  'if',
  '(',
  $._expression,
  ')',
  $._statement,
)

A syntax node representing an if_statement in this language would have 5 children: the condition expression, the body statement, as well as the if, (, and ) tokens. The expression and the statement would be marked as named nodes, because they have been given explicit names in the grammar. But the if, (, and ) nodes would not be named nodes, because they are represented in the grammar as simple strings.

You can check whether any given node is named:

bool ts_node_is_named(TSNode);

When traversing the tree, you can also choose to skip over anonymous nodes by using the _named_ variants of all of the methods described above:

TSNode ts_node_named_child(TSNode, uint32_t);
uint32_t ts_node_named_child_count(TSNode);
TSNode ts_node_next_named_sibling(TSNode);
TSNode ts_node_prev_named_sibling(TSNode);

If you use this group of methods, the syntax tree functions much like an abstract syntax tree.

Editing

In applications like text editors, you often need to re-parse a file after its source code has changed. Tree-sitter is designed to support this use case efficiently. There are two steps required. First, you must edit the syntax tree, which adjusts the ranges of its nodes so that they stay in sync with the code.

typedef struct {
  uint32_t start_byte;
  uint32_t old_end_byte;
  uint32_t new_end_byte;
  TSPoint start_point;
  TSPoint old_end_point;
  TSPoint new_end_point;
} TSInputEdit;

void ts_tree_edit(TSTree *, const TSInputEdit *);

Then, you can call ts_parser_parse again, passing in the old tree. This will create a new tree that internally shares structure with the old tree.

When you edit a syntax tree, the positions of its nodes will change. If you have stored any TSNode instances outside of the TSTree, you must update their positions separately, using the same TSInput value, in order to update their cached positions.

void ts_node_edit(TSNode *, const TSInputEdit *);

This ts_node_edit function is only needed in the case where you have retrieved TSNode instances before editing the tree, and then after editing the tree, you want to continue to use those specific node instances. Often, you’ll just want to re-fetch nodes from the edited tree, in which case ts_node_edit is not needed.

Multi-language Documents

Sometimes, different parts of a file may be written in different languages. For example, templating languages like EJS and ERB allow you to generate HTML by writing a mixture of HTML and another language like JavaScript or Ruby.

Tree-sitter handles these types of documents by allowing you to create a syntax tree based on the text in certain ranges of a file.

typedef struct {
  TSPoint start_point;
  TSPoint end_point;
  uint32_t start_byte;
  uint32_t end_byte;
} TSRange;

void ts_parser_set_included_ranges(
  TSParser *self,
  const TSRange *ranges,
  uint32_t range_count
);

For example, consider this ERB document:

<ul>
  <% people.each do |person| %>
    <li><%= person.name %></li>
  <% end %>
</ul>

Conceptually, it can be represented by three syntax trees with overlapping ranges: an ERB syntax tree, a Ruby syntax tree, and an HTML syntax tree. You could generate these syntax trees with the following code:

#include <string.h>
#include <tree_sitter/runtime.h>

// These functions are each implemented in their own repo.
const TSLanguage *tree_sitter_embedded_template();
const TSLanguage *tree_sitter_html();
const TSLanguage *tree_sitter_ruby();

int main(int argc, const char **argv) {
  const char *text = argv[1];
  unsigned len = strlen(src);

  // Parse the entire text as ERB.
  TSParser *parser = ts_parser_new();
  ts_parser_set_language(parser, tree_sitter_embedded_template());
  TSTree *erb_tree = ts_parser_parse_string(parser, NULL, text, len);
  TSNode erb_root_node = ts_tree_root_node(erb_tree);

  // In the ERB syntax tree, find the ranges of the `content` nodes,
  // which represent the underlying HTML, and the `code` nodes, which
  // represent the interpolated Ruby.
  TSRange html_ranges[10];
  TSRange ruby_ranges[10];
  unsigned html_range_count = 0;
  unsigned ruby_range_count = 0;
  unsigned child_count = ts_node_child_count(erb_root_node);

  for (unsigned i = 0; i < child_count; i++) {
    TSNode node = ts_node_child(erb_root_node, i);
    if (strcmp(ts_node_type(node), "content") == 0) {
      html_ranges[html_range_count++] = (TSRange) {
        ts_node_start_point(node),
        ts_node_end_point(node),
        ts_node_start_byte(node),
        ts_node_end_byte(node),
      };
    } else {
      TSNode code_node = ts_node_named_child(node, 0);
      ruby_ranges[ruby_range_count++] = (TSRange) {
        ts_node_start_point(code_node),
        ts_node_end_point(code_node),
        ts_node_start_byte(code_node),
        ts_node_end_byte(code_node),
      };
    }
  }

  // Use the HTML ranges to parse the HTML.
  ts_parser_set_language(parser, tree_sitter_html());
  ts_parser_set_included_ranges(parser, html_ranges, html_range_count);
  TSTree *html_tree = ts_parser_parse_string(parser, NULL, text, len);
  TSNode html_root_node = ts_tree_root_node(html_tree);

  // Use the Ruby ranges to parse the Ruby.
  ts_parser_set_language(parser, tree_sitter_ruby());
  ts_parser_set_included_ranges(parser, ruby_ranges, ruby_range_count);
  TSTree *ruby_tree = ts_parser_parse_string(parser, NULL, text, len);
  TSNode ruby_root_node = ts_tree_root_node(ruby_tree);

  // Print all three trees.
  char *erb_sexp = ts_node_string(erb_root_node);
  char *html_sexp = ts_node_string(html_root_node);
  char *ruby_sexp = ts_node_string(ruby_root_node);
  printf("ERB: %s\n", erb_sexp);
  printf("HTML: %s\n", html_sexp);
  printf("Ruby: %s\n", ruby_sexp);
  return 0;
}

This API allows for great flexibility in how languages can be composed. Tree-sitter is not responsible for mediating the interactions between languages. Instead, you are free to do that using arbitrary application-specific logic.

Concurrency

Tree-sitter supports multi-threaded use cases by making syntax trees very cheap to copy.

TSTree *ts_tree_copy(const TSTree *);

Internally, copying a syntax tree just entails incrementing an atomic reference count. Conceptually, it provides you a new tree which you can freely query, edit, reparse, or delete on a new thread while continuing to use the original tree on a different thread. Note that individual TSTree instances are not thread safe; you must copy a tree if you want to use it on multiple threads simultaneously.