Implementation
Tree-sitter consists of two components: a C library (libtree-sitter
), and a command-line tool (the tree-sitter
CLI).
The library, libtree-sitter
, is used in combination with the parsers
generated by the CLI, to produce syntax trees from source code and keep the
syntax trees up-to-date as the source code changes. libtree-sitter
is designed to be embedded in applications. It is
written in plain C. Its interface is specified in the header file tree_sitter/api.h
.
The CLI is used to generate a parser for a language by supplying a context-free grammar describing the language. The CLI is a build tool; it is no longer needed once a parser has been generated. It is written in Rust, and is available on crates.io, npm, and as a pre-built binary on GitHub.
The CLI
The tree-sitter
CLI's most important feature is the generate
command. This subcommand reads in a context-free grammar
from a file called grammar.js
and outputs a parser as a C file called parser.c
. The source files in the cli/src
directory all play a role in producing the code in parser.c
. This section will describe some key parts of this process.
Parsing a Grammar
First, Tree-sitter must evaluate the JavaScript code in grammar.js
and convert the grammar to a JSON format. It does this
by shelling out to node
. The format of the grammars is formally specified by the JSON schema in grammar.schema.json.
The parsing is implemented in parse_grammar.rs.
Grammar Rules
A Tree-sitter grammar is composed of a set of rules — objects that describe how syntax nodes can be composed of other
syntax nodes. There are several types of rules: symbols, strings, regexes, sequences, choices, repetitions, and a few others.
Internally, these are all represented using an enum called Rule
.
Preparing a Grammar
Once a grammar has been parsed, it must be transformed in several ways before it can be used to generate a parser. Each
transformation is implemented by a separate file in the prepare_grammar
directory, and the transformations
are ultimately composed together in prepare_grammar/mod.rs
.
At the end of these transformations, the initial grammar is split into two grammars: a syntax grammar and a lexical grammar. The syntax grammar describes how the language's non-terminal symbols are constructed from other grammar symbols, and the lexical grammar describes how the grammar's terminal symbols (strings and regexes) can be composed of individual characters.
Building Parse Tables
The Runtime
WIP