Tree-sitter consists of two components: a C library (libtree-sitter), and a command-line tool (the tree-sitter CLI).
libtree-sitter
tree-sitter
The library, libtree-sitter, is used in combination with the parsers generated by the CLI, to produce syntax trees from source code and keep the syntax trees up-to-date as the source code changes. libtree-sitter is designed to be embedded in applications. It is written in plain C. Its interface is specified in the header file tree_sitter/api.h.
tree_sitter/api.h
The CLI is used to generate a parser for a language by supplying a context-free grammar describing the language. The CLI is a build tool; it is no longer needed once a parser has been generated. It is written in Rust, and is available on crates.io, npm, and as a pre-built binary on GitHub.
The tree-sitter CLI’s most important feature is the generate subcommand. This subcommand reads context-free grammar from a file called grammar.js and outputs a parser as a C file called parser.c. The source files in the cli/src directory all play a role in producing the code in parser.c. This section will describe some key parts of this process.
generate
grammar.js
parser.c
cli/src
First, Tree-sitter must evaluate the JavaScript code in grammar.js and convert the grammar to a JSON format. It does this by shelling out to node. The format of the grammars is formally specified by the JSON schema in grammar.schema.json. The parsing is implemented in parse_grammar.rs.
node
A Tree-sitter grammar is composed of a set of rules - objects that describe how syntax nodes can be composed from other syntax nodes. There are several types of rules: symbols, strings, regexes, sequences, choices, repetitions, and a few others. Internally, these are all represented using an enum called Rule.
Rule
Once a grammar has been parsed, it must be transformed in several ways before it can be used to generate a parser. Each transformation is implemented by a separate file in the prepare_grammar directory, and the transformations are ultimately composed together in prepare_grammar/mod.rs.
prepare_grammar
prepare_grammar/mod.rs
At the end of these transformations, the initial grammar is split into two grammars: a syntax grammar and a lexical grammar. The syntax grammar describes how the language’s non-terminal symbols are constructed from other grammar symbols, and the lexical grammar describes how the grammar’s terminal symbols (strings and regexes) can be composed from individual characters.
WIP