Tree-sitter consists of two components: a C library (libtree-sitter), and a command-line tool (the tree-sitter CLI).
The library, libtree-sitter, is used in combination with the parsers
generated by the CLI, to produce syntax trees from source code and keep the
syntax trees up-to-date as the source code changes. libtree-sitter is designed to be embedded in applications. It is written in plain C. Its interface is specified in the header file tree_sitter/api.h.
The CLI is
used to generate a parser for a language by supplying a context-free grammar describing the
language. The CLI is a build tool; it is no longer needed once a parser has been generated. It is written in Rust, and is available on crates.io, npm, and as a pre-built binary on GitHub.
The tree-sitter CLI’s most important feature is the generate subcommand. This subcommand reads context-free grammar from a file called grammar.js and outputs a parser as a C file called parser.c. The source files in the cli/src directory all play a role in producing the code in parser.c. This section will describe some key parts of this process.
A Tree-sitter grammar is composed of a set of rules - objects that describe how syntax nodes can be composed from other syntax nodes. There are several types of rules: symbols, strings, regexes, sequences, choices, repetitions, and a few others. Internally, these are all represented using an enum called Rule.
Once a grammar has been parsed, it must be transformed in several ways before it can be used to generate a parser. Each transformation is implemented by a separate file in the prepare_grammar directory, and the transformations are ultimately composed together in prepare_grammar/mod.rs.
At the end of these transformations, the initial grammar is split into two grammars: a syntax grammar and a lexical grammar. The syntax grammar describes how the language’s non-terminal symbols are constructed from other grammar symbols, and the lexical grammar describes how the grammar’s terminal symbols (strings and regexes) can be composed from individual characters.