Implementation

Tree-sitter consists of two components: a C library (libtree-sitter), and a command-line tool (the tree-sitter CLI).

The library, libtree-sitter, is used in combination with the parsers generated by the CLI, to produce syntax trees from source code and keep the syntax trees up-to-date as the source code changes. libtree-sitter is designed to be embedded in applications. It is written in plain C. Its interface is specified in the header file tree_sitter/api.h.

The CLI is used to generate a parser for a language by supplying a context-free grammar describing the language. The CLI is a build tool; it is no longer needed once a parser has been generated. It is written in Rust, and is available on crates.io, npm, and as a pre-built binary on GitHub.

The CLI

The tree-sitter CLI’s most important feature is the generate subcommand. This subcommand reads context-free grammar from a file called grammar.js and outputs a parser as a C file called parser.c. The source files in the cli/src directory all play a role in producing the code in parser.c. This section will describe some key parts of this process.

Parsing a Grammar

First, Tree-sitter must evaluate the JavaScript code in grammar.js and convert the grammar to a JSON format. It does this by shelling out to node. The format of the grammars is formally specified by the JSON schema in grammar.schema.json. The parsing is implemented in parse_grammar.rs.

Grammar Rules

A Tree-sitter grammar is composed of a set of rules - objects that describe how syntax nodes can be composed from other syntax nodes. There are several types of rules: symbols, strings, regexes, sequences, choices, repetitions, and a few others. Internally, these are all represented using an enum called Rule.

Preparing a Grammar

Once a grammar has been parsed, it must be transformed in several ways before it can be used to generate a parser. Each transformation is implemented by a separate file in the prepare_grammar directory, and the transformations are ultimately composed together in prepare_grammar/mod.rs.

At the end of these transformations, the initial grammar is split into two grammars: a syntax grammar and a lexical grammar. The syntax grammar describes how the language’s non-terminal symbols are constructed from other grammar symbols, and the lexical grammar describes how the grammar’s terminal symbols (strings and regexes) can be composed from individual characters.

Building Parse Tables

The Runtime

WIP