Tree-sitter consists of two separate libraries, both of which expose C APIs.
The first library, libcompiler, is
used to generate a parser for a language by supplying a context-free grammar describing the
language. libcompiler is a build tool; it is no longer needed once a parser has been generated. It is written in C++, but exposes a C interface, which is declared in the header file compiler.h.
The second library, libruntime, is used in combination with the parsers
generated by libcompiler, to produce syntax trees from source code and keep the
syntax trees up-to-date as the source code changes. libruntime is designed to be embedded in applications. It is written in plain C. Its interface is specified in the header file runtime.h.
The libcompiler library exports only one function: ts_compile_grammar. This function takes a context-free grammar as a JSON string and returns a parser as a string of C code. The source files in the src/compiler directory all play a role in producing this C code. This section will describe some key parts of this process.
First, libcompiler must parse the JSON grammar. The format of the grammars is formally specified by the JSON schema in grammar-schema.json. The parsing is implemented in parser_grammar.cc. It uses udp/json-parser, one of Tree-sitter’s few library dependencies.
A Tree-sitter grammar is composed of a set of rules - objects that describe how syntax nodes can be composed from other syntax nodes. There are several types of rules: symbols, strings, regexes, sequences, choices, repetitions, and a few others. Internally, these are all represented using a tagged union class called Rule. This class has a method called match, which makes it easy to pattern-match a rule, processing each type of rule with separate code.
Once a grammar has been parsed, it must be transformed in several ways before it can be used to generate a parser. Each transformation is implemented by a separate file in the src/compiler/prepare_grammar directory, and the transformations are ultimately composed together in prepare_grammar.cc.
At the end of these transformations, the initial grammar is split into two grammars: a syntax grammar and a lexical grammar. The syntax grammar describes how the language’s non-terminal symbols are constructed from other grammar symbols, and the lexical grammar describes how the grammar’s terminal symbols (strings and regexes) can be composed from individual characters.