Syntax Highlighting

Syntax highlighting is a very common feature in applications that deal with code. Tree-sitter has built-in support for syntax highlighting, via the tree-sitter-highlight library, which is currently used on GitHub.com for highlighting code written in several languages. You can also perform syntax highlighting at the command line using the tree-sitter highlight command.

This document explains how the Tree-sitter syntax highlighting system works, using the command line interface. If you are using the tree-sitter-highlight library (either from C or from Rust), all of these concepts are still applicable, but the configuration data is provided using in-memory objects, rather than files.

Overview

All of the files needed to highlight a given language are normally included in the same git repository as the Tree-sitter grammar for that language (for example, tree-sitter-javascript, tree-sitter-ruby). In order to run syntax highlighting from the command line, three types of files are needed:

  1. Per-user configuration in ~/.config/tree-sitter/config.json
  2. Language configuration in grammar repositories’ tree-sitter.json files
  3. Tree queries in the grammar repositories’ queries folders

For an example of the language-specific files, see the tree-sitter.json file and queries directory in the tree-sitter-ruby repository. The following sections describe the behavior of each file.

Per-user Configuration

The Tree-sitter CLI automatically creates two directories in your home folder. One holds a JSON configuration file that lets you customize the behavior of the CLI. The other holds any compiled language parsers that you use.

These directories are created in the “normal” place for your platform:

The CLI will work if there’s no config file present, falling back on default values for each configuration option. To create a config file that you can edit, run this command:

tree-sitter init-config

(This will print out the location of the file that it creates so that you can easily find and modify it.)

Paths

The tree-sitter highlight command takes one or more file paths, and tries to automatically determine which language should be used to highlight those files. In order to do this, it needs to know where to look for Tree-sitter grammars on your filesystem. You can control this using the "parser-directories" key in your configuration file:

{
  "parser-directories": [
    "/Users/my-name/code",
    "/Users/my-name/other-code"
  ]
}

Currently, any folder within one of these parser directories whose name begins with tree-sitter- will be treated as a Tree-sitter grammar repository.
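
The discovery rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the CLI's actual implementation: the function name find_grammar_repos is hypothetical, and the sketch assumes that only the immediate child folders of each parser directory are considered.

```python
from pathlib import Path

def find_grammar_repos(parser_directories):
    """Return folders directly inside each parser directory named tree-sitter-*."""
    repos = []
    for directory in parser_directories:
        root = Path(directory).expanduser()
        if not root.is_dir():
            continue  # skip configured directories that do not exist
        for child in sorted(root.iterdir()):
            # Only folders count; a plain file named tree-sitter-foo is ignored.
            if child.is_dir() and child.name.startswith("tree-sitter-"):
                repos.append(child)
    return repos
```

Calling find_grammar_repos with the "parser-directories" list from the config file would return the candidate grammar repositories in a stable, sorted order.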

Theme

The Tree-sitter highlighting system works by annotating ranges of source code with logical “highlight names” like function.method, type.builtin, keyword, etc. In order to decide what color should be used for rendering each highlight, a theme is needed.

In your config file, the "theme" value is an object whose keys are dot-separated highlight names like function.builtin or keyword, and whose values are JSON expressions that represent text styling parameters.

Parse Theme

The tree-sitter parse command outputs a pretty-printed CST (concrete syntax tree) when the --cst option is used. You can control which colors are used for various parts of the tree in your configuration file. Note that omitting a field will cause the relevant text to be rendered with its default color.

{
  "parse-theme": {
    // The color of node kinds
    "node-kind": [20, 20, 20],
    // The color of text associated with a node
    "node-text": [255, 255, 255],
    // The color of node fields
    "field": [42, 42, 42],
    // The color of the range information for unnamed nodes
    "row-color": [255, 255, 255],
    // The color of the range information for named nodes
    "row-color-named": [255, 130, 0],
    // The color of extra nodes
    "extra": [255, 0, 255],
    // The color of ERROR nodes
    "error": [255, 0, 0],
    // The color of MISSING nodes and their associated text
    "missing": [153, 75, 0],
    // The color of newline characters
    "line-feed": [150, 150, 150],
    // The color of backtick characters
    "backtick": [0, 200, 0],
    // The color of literals
    "literal": [0, 0, 200]
  }
}

Highlight Names

A theme can contain multiple keys that share a common subsequence, such as function and function.builtin, or variable and variable.parameter.

For a given highlight produced, styling will be determined based on the longest matching theme key. For example, the highlight function.builtin.static would match the key function.builtin rather than function.
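
The lookup rule can be illustrated with a short Python sketch. It assumes that "matching" means the theme key is a dot-separated prefix of the highlight name, which is a simplification chosen for this example. The function name resolve_style and the color values are hypothetical.

```python
def resolve_style(theme, highlight_name):
    """Return the style of the longest theme key matching highlight_name, or None."""
    parts = highlight_name.split(".")
    # Try the full name first, then drop trailing components one at a time.
    for end in range(len(parts), 0, -1):
        key = ".".join(parts[:end])
        if key in theme:
            return theme[key]
    return None

theme = {"function": "blue", "function.builtin": "cyan"}
print(resolve_style(theme, "function.builtin.static"))  # cyan
print(resolve_style(theme, "function.method"))          # blue
```

Because the search starts from the most specific name, function.builtin.static picks up the function.builtin style, while function.method falls back to function.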

Styling Values

Styling values can be any of the following:

Language Configuration

The tree-sitter.json file is used by the Tree-sitter CLI. Within this file, the CLI looks for data nested under the top-level "grammars" key. This key is expected to contain an array of objects with the following keys:

Basics

These keys specify basic information about the parser:

Language Detection

These keys help to decide whether the language applies to a given file:

Query Paths

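These keys specify relative paths from the directory containing tree-sitter.json to the files that control syntax highlighting:

  - highlights: path to a highlights query. Default: queries/highlights.scm
  - locals: path to a local variables query. Default: queries/locals.scm
  - injections: path to an injections query. Default: queries/injections.scm

The behaviors of these three files are described in the next section.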

Example

Typically, the "grammars" array only needs to contain one object, which only needs to specify a few keys:

{
  "grammars": [
    {
      "scope": "source.ruby",
      "file-types": [
        "rb",
        "gemspec",
        "Gemfile",
        "Rakefile"
      ],
      "first-line-regex": "#!.*\\bruby$"
    }
  ]
}
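
To show how the detection keys in this example might be applied, here is a hedged Python sketch. It assumes that file-types entries are tried first, either as a file name suffix or as an entire file name, and that first-line-regex is only a fallback. The function detect_language and this exact ordering are assumptions made for illustration, not a description of the CLI's internals.

```python
import re
from pathlib import Path

def detect_language(path, first_line, grammars):
    """Return the scope of the first grammar that applies to the file, or None."""
    name = Path(path).name
    # Pass 1: file name suffixes (an entry may also equal the whole file name).
    for grammar in grammars:
        for suffix in grammar.get("file-types", []):
            if name == suffix or name.endswith("." + suffix):
                return grammar["scope"]
    # Pass 2: fall back to testing the first line of the file.
    for grammar in grammars:
        pattern = grammar.get("first-line-regex")
        if pattern and re.search(pattern, first_line):
            return grammar["scope"]
    return None

ruby = {
    "scope": "source.ruby",
    "file-types": ["rb", "gemspec", "Gemfile", "Rakefile"],
    "first-line-regex": "#!.*\\bruby$",
}
print(detect_language("lib/app.rb", "", [ruby]))                 # source.ruby
print(detect_language("bin/run", "#!/usr/bin/env ruby", [ruby]))  # source.ruby
```

The second call has no recognizable file name, so the shebang line decides the language.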

Queries

Tree-sitter’s syntax highlighting system is based on tree queries, which are a general system for pattern-matching on Tree-sitter’s syntax trees. See this section of the documentation for more information about tree queries.

Syntax highlighting is controlled by three different types of query files that are usually included in the queries folder. The default names for the query files use the .scm file extension. We chose this extension because it is commonly used for files written in Scheme, a popular dialect of Lisp, and these query files use a Lisp-like syntax.

Alternatively, you can think of .scm as an acronym for “Source Code Matching”.

Highlights

The most important query is called the highlights query. The highlights query uses captures to assign arbitrary highlight names to different nodes in the tree. Each highlight name can then be mapped to a color (as described above). Commonly used highlight names include keyword, function, type, property, and string. Names can also be dot-separated like function.builtin.

Example Input

For example, consider the following Go code:

func increment(a int) int {
    return a + 1
}

With this syntax tree:

(source_file
  (function_declaration
    name: (identifier)
    parameters: (parameter_list
      (parameter_declaration
        name: (identifier)
        type: (type_identifier)))
    result: (type_identifier)
    body: (block
      (return_statement
        (expression_list
          (binary_expression
            left: (identifier)
            right: (int_literal)))))))

Example Query

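Suppose we wanted to render this code with the following colors: the keywords func and return in purple, the function name increment in blue, the type int in green, and the number 1 in brown.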

We can assign each of these categories a highlight name using a query like this:

; highlights.scm

"func" @keyword
"return" @keyword
(type_identifier) @type
(int_literal) @number
(function_declaration name: (identifier) @function)

Then, in our config file, we could map each of these highlight names to a color:

{
  "theme": {
    "keyword": "purple",
    "function": "blue",
    "type": "green",
    "number": "brown"
  }
}

Result

Running tree-sitter highlight on this Go file would produce output like this:

func increment(a int) int {
    return a + 1
}
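
To make the relationship between highlight names and theme colors concrete, here is a small Python sketch that wraps annotated ranges in ANSI escape sequences. The span offsets are written by hand for the first line of the Go example. The render function and the table that maps color names to ANSI codes are assumptions made for this illustration, not the CLI's own rendering code.

```python
# Assumed mapping from theme color names to ANSI SGR foreground codes.
ANSI = {"purple": 35, "blue": 34, "green": 32, "brown": 33}

def render(source, spans, theme):
    """Wrap each (start, end, highlight_name) span in ANSI color escapes."""
    out, cursor = [], 0
    for start, end, name in sorted(spans):
        out.append(source[cursor:start])  # unstyled text before the span
        text = source[start:end]
        color = theme.get(name)
        if color in ANSI:
            text = f"\x1b[{ANSI[color]}m{text}\x1b[0m"
        out.append(text)
        cursor = end
    out.append(source[cursor:])  # unstyled text after the last span
    return "".join(out)

source = "func increment(a int) int {"
spans = [(0, 4, "keyword"), (5, 14, "function"), (17, 20, "type"), (22, 25, "type")]
theme = {"keyword": "purple", "function": "blue", "type": "green", "number": "brown"}
print(render(source, spans, theme))
```

Printed in a terminal that supports ANSI colors, the keyword, the function name, and both occurrences of int appear in their theme colors, while punctuation and the parameter name stay unstyled.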

Local Variables

Good syntax highlighting helps the reader to quickly distinguish between the different types of entities in their code. Ideally, if a given entity appears in multiple places, it should be colored the same in each place. The Tree-sitter syntax highlighting system can help you to achieve this by keeping track of local scopes and variables.

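The local variables query is different from the highlights query in that, while the highlights query uses arbitrary capture names which can then be mapped to colors, the local variables query uses a fixed set of capture names, each of which has a special meaning.

The capture names are @local.scope, which indicates that a syntax node introduces a new local scope; @local.definition, which indicates that a syntax node contains the name of a definition within the current local scope; and @local.reference, which indicates that a syntax node contains a name that may refer to a definition within some enclosing scope.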

When highlighting a file, Tree-sitter will keep track of the set of scopes that contains any given position, and the set of definitions within each scope. When processing a syntax node that is captured as a local.reference, Tree-sitter will try to find a definition for a name that matches the node’s text. If it finds a match, Tree-sitter will ensure that the reference and the definition are colored the same.
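
The bookkeeping described above can be sketched in Python as follows. This is a simplified model under stated assumptions: scopes are plain offset ranges, a root scope covers the whole file, and lookup walks from the innermost enclosing scope outward. The Scope class, the helper names, and the offsets are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Scope:
    start: int  # offset where the scope begins
    end: int    # offset where the scope ends (exclusive)
    definitions: dict = field(default_factory=dict)  # name -> definition offset

def enclosing_scopes(scopes, position):
    """Return the scopes that contain position, innermost (smallest) first."""
    containing = [s for s in scopes if s.start <= position < s.end]
    return sorted(containing, key=lambda s: s.end - s.start)

def define(scopes, name, position):
    """Record a definition in the innermost scope containing position."""
    enclosing_scopes(scopes, position)[0].definitions[name] = position

def resolve(scopes, name, position):
    """Return the offset of the definition a reference refers to, or None."""
    for scope in enclosing_scopes(scopes, position):
        if name in scope.definitions:
            return scope.definitions[name]
    return None

# Hypothetical offsets: a whole-file scope, a method scope, and a block scope.
scopes = [Scope(0, 100), Scope(10, 60), Scope(30, 50)]
define(scopes, "list", 12)   # method parameter
define(scopes, "item", 32)   # block parameter
print(resolve(scopes, "item", 40))  # 32: found in the block scope
print(resolve(scopes, "list", 40))  # 12: found in the enclosing method scope
print(resolve(scopes, "item", 70))  # None: the block scope has ended
```

A reference that resolves to a definition would then be given the same highlight as that definition, and a reference that resolves to None is not treated as a local variable.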

The information produced by this query can also be used by the highlights query. You can disable a pattern for nodes which have been identified as local variables by adding the predicate (#is-not? local) to the pattern. This is used in the example below:

Example Input

Consider this Ruby code:

def process_list(list)
  context = current_context
  list.map do |item|
    process_item(item, context)
  end
end

item = 5
list = [item]

With this syntax tree:

(program
  (method
    name: (identifier)
    parameters: (method_parameters
      (identifier))
    (assignment
      left: (identifier)
      right: (identifier))
    (method_call
      method: (call
        receiver: (identifier)
        method: (identifier))
      block: (do_block
        (block_parameters
          (identifier))
        (method_call
          method: (identifier)
          arguments: (argument_list
            (identifier)
            (identifier))))))
  (assignment
    left: (identifier)
    right: (integer))
  (assignment
    left: (identifier)
    right: (array
      (identifier))))

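There are several different types of names within this method:

  - process_list is a method.
  - Within this method, list is a formal parameter.
  - context is a local variable.
  - current_context is not a local variable, so it must be a method.
  - Within the do block, item is a formal parameter.
  - Later on, item and list are both local variables (not formal parameters).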

Example Queries

Let’s write some queries that let us clearly distinguish between these types of names. First, set up the highlighting query, as described in the previous section. We’ll assign distinct colors to method calls, method definitions, and formal parameters:

; highlights.scm

(call method: (identifier) @function.method)
(method_call method: (identifier) @function.method)

(method name: (identifier) @function.method)

(method_parameters (identifier) @variable.parameter)
(block_parameters (identifier) @variable.parameter)

((identifier) @function.method
 (#is-not? local))

Then, we’ll set up a local variable query to keep track of the variables and scopes. Here, we’re indicating that methods and blocks create local scopes, parameters and assignments create definitions, and other identifiers should be considered references:

; locals.scm

(method) @local.scope
(do_block) @local.scope

(method_parameters (identifier) @local.definition)
(block_parameters (identifier) @local.definition)

(assignment left: (identifier) @local.definition)

(identifier) @local.reference

Result

Running tree-sitter highlight on this Ruby file would produce output like this:

def process_list(list)
  context = current_context
  list.map do |item|
    process_item(item, context)
  end
end

item = 5
list = [item]

Language Injection

Some source files contain code written in multiple different languages. Examples include templating languages such as ERB, JavaScript embedded in HTML, and heredocs in Ruby.

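All of these examples can be modeled in terms of a parent syntax tree and one or more injected syntax trees, which reside inside of certain nodes in the parent tree. The language injection query allows you to specify these “injections” using two captures: @injection.content, which marks the node whose contents should be re-parsed using another language, and @injection.language, which marks the node whose text names the language that should be used to re-parse the content.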

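The language injection behavior can also be configured by properties that are associated with patterns using the #set! predicate. For example, the injection.language property can be used to hard-code the name of a specific language.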

Examples

Consider this Ruby code:

system <<-BASH.strip!
  abc --def | ghi > jkl
BASH

With this syntax tree:

(program
  (method_call
    method: (identifier)
    arguments: (argument_list
      (call
        receiver: (heredoc_beginning)
        method: (identifier))))
  (heredoc_body
    (heredoc_end)))

The following query would specify that the contents of the heredoc should be parsed using a language named “BASH” (because that is the text of the heredoc_end node):

(heredoc_body
  (heredoc_end) @injection.language) @injection.content

You can also force the language using the #set! predicate. For example, this will force the language to always be Ruby:

((heredoc_body) @injection.content
 (#set! injection.language "ruby"))

Unit Testing

Tree-sitter has a built-in way to verify the results of syntax highlighting. The interface is based on Sublime Text’s system for testing highlighting.

Tests are written as normal source code files that contain specially-formatted comments that make assertions about the surrounding syntax highlighting. These files are stored in the test/highlight directory in a grammar repository.

Here is an example of a syntax highlighting test for JavaScript:

var abc = function(d) {
  // <- keyword
  //          ^ keyword
  //               ^ variable.parameter
  // ^ function

  if (a) {
  // <- keyword
  // ^ punctuation.bracket

    foo(`foo ${bar}`);
    // <- function
    //    ^ string
    //          ^ variable
  }

  baz();
  // <- !variable
};

From the Sublime Text docs:

The two types of tests are:

Caret: ^ this will test the following selector against the scope on the most recent non-test line. It will test it at the same column the ^ is in. Consecutive ^s will test each column against the selector.

Arrow: <- this will test the following selector against the scope on the most recent non-test line. It will test it at the same column as the comment character is in.

Note that an exclamation mark (!) can be used to negate a selector. For example, !keyword will match any scope that is not the keyword class.
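
As a rough illustration of how these assertion comments can be interpreted, the following Python sketch extracts (row, column, highlight name, negated) tuples from a test file. It assumes that comments start with //, as in the JavaScript example, and it locates them with a regular expression rather than a parser. The parse_assertions function is hypothetical and is not part of the Tree-sitter CLI.

```python
import re

# Assumes line comments start with "//", as in the JavaScript example.
ASSERTION = re.compile(r"^(\s*)//\s*(<-|\^+)\s*(!?)([\w.]+)\s*$")

def parse_assertions(source):
    """Return (row, column, highlight_name, negated) for each assertion."""
    assertions = []
    target_row = None  # most recent line that is not an assertion comment
    for row, line in enumerate(source.splitlines()):
        match = ASSERTION.match(line)
        if match is None:
            target_row = row
            continue
        if target_row is None:
            continue  # no earlier line to test against
        indent, marker, bang, name = match.groups()
        if marker == "<-":
            columns = [len(indent)]  # column where the comment starts
        else:
            first = match.start(2)   # column of the first caret
            columns = [first + i for i in range(len(marker))]
        for column in columns:
            assertions.append((target_row, column, name, bang == "!"))
    return assertions
```

Each tuple says which position on the most recent non-test line should (or, when negated, should not) carry the given highlight name. A test runner would compare these expectations against the highlighter's actual output.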