Using Tree-sitter to highlight code on the web

Whenever I thought about starting a blog, I wondered how to handle syntax highlighting in code blocks. The two most common options seemed to be:

Use a JavaScript library to handle that dynamically
Skip syntax highlighting altogether

Neither seemed attractive for a blog post, to be honest. Using JavaScript could introduce the burden of an external runtime dependency, while having no syntax highlighting at all could backfire by making code snippets confusing instead of informative.

So with that in mind, my end goal was to have something static, consistent, complete, and customizable. And then I asked myself: “can’t I use Tree-sitter on the web?”

Enter Tree-sitter

The first time I got to know Tree-sitter was when I still used the Atom code editor. If I’m not wrong, the Tree-sitter project was created by the Atom team themselves, possibly making it the first code editor to use Tree-sitter.

It later followed me through my Neovim era, but I never really delved into how Tree-sitter worked; I only plugged it into my editor to get nice syntax highlighting.

But what’s Tree-sitter after all? Quoting the Tree-sitter website:

Tree-sitter is a parser generator tool and an incremental parsing library. It can build a concrete syntax tree for a source file and efficiently update the syntax tree as the source file is edited.

Very cool. This means we can feed the content of a code block into a Tree-sitter parser and get a syntax tree back. Amazing!

And, unlike code editors, we only need to highlight static code, so it should be even easier. But this also means we need to set up a build step for the blog posts.

Parsing code blocks from Markdown files

I think it’s not uncommon for bloggers to use Markdown in order to write their posts. And I’m no different (nor special). I’m using Go with two libraries to parse my blog posts:

https://github.com/adrg/frontmatter to parse Frontmatter metadata from posts
https://github.com/gomarkdown/markdown to parse the Markdown content itself

All the Markdown content is injected into a post.html HTML template during the build step:

{{define "content"}}
<div class="post">
  ...

  <article>
    {{.Content}}
  </article>

  ...
</div>
{{end}}
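The builder code that fills this template is not shown here, so the following is only a minimal sketch of how it could work, assuming Go's standard html/template package. The page struct and the renderPost helper are hypothetical names. The one detail worth noticing is that Content is typed as template.HTML; with a plain string, the already-rendered Markdown would be escaped instead of inserted as markup.

```go
package main

import (
	"html/template"
	"os"
	"strings"
)

// postTmpl mirrors the "content" template from the post, with the
// surrounding markup trimmed for brevity.
const postTmpl = `{{define "content"}}<div class="post"><article>{{.Content}}</article></div>{{end}}`

// page is a hypothetical view model. Content is template.HTML so that the
// already-rendered Markdown is inserted verbatim instead of being escaped.
type page struct {
	Content template.HTML
}

// renderPost executes the "content" template with the given HTML fragment.
func renderPost(fragment string) (string, error) {
	tmpl, err := template.New("post.html").Parse(postTmpl)
	if err != nil {
		return "", err
	}

	var sb strings.Builder
	data := page{Content: template.HTML(fragment)}
	if err := tmpl.ExecuteTemplate(&sb, "content", data); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	out, err := renderPost(`<pre><code class="lang-go">x</code></pre>`)
	if err != nil {
		panic(err)
	}
	os.Stdout.WriteString(out + "\n")
}
```

Marking a value as template.HTML disables escaping for it, so it should only ever wrap output produced by the build itself, never untrusted input.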

But, in order to capture code blocks during Markdown parsing, we need to use a custom renderer so that we can intercept code block nodes, parse them into a syntax tree and inject meaningful <span> tags surrounding code we want to highlight.

In order to build the syntax tree, I’m using https://github.com/tree-sitter/go-tree-sitter. It has awesome instructions on how to use it with existing grammars, so I will skip its setup here. You can always visit the source code for this website by clicking on the link at the footer. 🙂

With a Tree-sitter parser in hand, all we need is to create a Markdown render hook function. Its signature looks like this:

func(w io.Writer, node ast.Node, entering bool) (ast.WalkStatus, bool)

The two important pieces here are the writer w, to which we need to write our own output, and node, which contains the content of our Markdown code block. And with them, we are able to extract code blocks like this:

code, ok := node.(*ast.CodeBlock)
if !ok {
	return ast.GoToNext, false
}

lang := string(code.Info)

Whenever we’re not dealing with a code block, we return early and let the Markdown renderer apply its default handling to the node before moving on to the next one. Otherwise, we extract the language so that the Tree-sitter parser knows which grammar to use:

var tslang unsafe.Pointer
switch lang {
case "html":
	tslang = tshtml.Language()
case "css":
	tslang = tscss.Language()
case "yaml":
	tslang = tsyaml.Language()
case "go":
	tslang = tsgo.Language()
default:
	return ast.GoToNext, false
}

if err := tsParser.SetLanguage(ts.NewLanguage(tslang)); err != nil {
	return ast.Terminate, false
}
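One thing to watch out for: the info string of a fenced code block can carry more than a bare language name, and people spell languages differently. Below is a minimal sketch, not the code this site uses, of how the value could be normalized before it reaches the switch. The helper name normalizeLang and the alias table are assumptions made purely for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// aliases maps alternative spellings to the names used in the switch above.
// These entries are illustrative assumptions, not part of the original code.
var aliases = map[string]string{
	"golang": "go",
	"yml":    "yaml",
	"htm":    "html",
}

// normalizeLang extracts the language from a code block's info string: it
// keeps the first whitespace-separated word, lowercases it, and resolves
// known aliases. An empty or blank info string yields "".
func normalizeLang(info []byte) string {
	fields := strings.Fields(string(info))
	if len(fields) == 0 {
		return ""
	}

	lang := strings.ToLower(fields[0])
	if canonical, ok := aliases[lang]; ok {
		return canonical
	}
	return lang
}

func main() {
	for _, info := range []string{"go", "Golang linenos", "yml", ""} {
		fmt.Printf("%q -> %q\n", info, normalizeLang([]byte(info)))
	}
}
```

With something like this in place, the switch would receive normalizeLang(code.Info) instead of string(code.Info), and unknown languages would still fall through to the default branch.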

After the language is selected for the Tree-sitter parser, we build the tree, write the opening code block tag, and then proceed to walk over the syntax tree in order to process its leaves. For that, we use a tree cursor:

src := code.Literal
tree := tsParser.Parse(src, nil)
defer tree.Close()

fmt.Fprintf(w, `<pre><code class="lang-%s">`, lang)

var pos uint = 0
cursor := tree.RootNode().Walk()

And with that cursor in hand, we can now loop until there are no more leaves to visit:

loop:
	for {
		node := cursor.Node()

		// Checking the named child count lets us treat literal values as leaf nodes.
		if node.NamedChildCount() > 0 && cursor.GotoFirstChild() {
			continue
		}

		// From here on, it's a leaf.
		start, end := node.ByteRange()

		// Write what comes before the leaf node.
		if start > pos {
			before := src[pos:start]
			if _, err := w.Write([]byte(html.EscapeString(string(before)))); err != nil {
				return ast.Terminate, false
			}
		}

		// Then write the node itself wrapped by the appropriate HTML tag.
		value := html.EscapeString(string(src[start:end]))
		kind := node.Kind()
		// ...

		if _, err := fmt.Fprintf(w, `<span class="ts-%s">%s</span>`, kind, value); err != nil {
			return ast.Terminate, false
		}

		pos = end

		// Apply the same processing to the leaf's siblings.
		if cursor.GotoNextSibling() {
			continue
		}

		// If no siblings, go back to parents to find new nodes.
		for cursor.GotoParent() {
			if cursor.GotoNextSibling() {
				continue loop
			}
		}

		break
	}
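To see the gap-filling logic in isolation, here is a small sketch that applies the same idea to a hand-built tree, with no Tree-sitter dependency at all. Everything in it is hypothetical: the node type, the render and writeLeaves helpers, and the sample tree (including the made-up "lt" kind). It also differs from the loop above in two ways: it uses plain recursion instead of a tree cursor, and it flushes whatever text follows the last leaf.

```go
package main

import (
	"fmt"
	"html"
	"strings"
)

// node is a hypothetical stand-in for a syntax tree node: a kind plus the
// byte range it covers in the source.
type node struct {
	kind       string
	start, end int
	children   []*node
}

// writeLeaves visits leaves in source order. For each leaf it first copies
// any text between the previous leaf and this one, then wraps the leaf in a
// <span>. It returns the position right after the last leaf written.
func writeLeaves(sb *strings.Builder, src string, n *node, pos int) int {
	if len(n.children) > 0 {
		for _, child := range n.children {
			pos = writeLeaves(sb, src, child, pos)
		}
		return pos
	}

	if n.start > pos {
		sb.WriteString(html.EscapeString(src[pos:n.start]))
	}
	value := html.EscapeString(src[n.start:n.end])
	fmt.Fprintf(sb, `<span class="ts-%s">%s</span>`, n.kind, value)
	return n.end
}

// render highlights src using the tree rooted at root.
func render(src string, root *node) string {
	var sb strings.Builder
	pos := writeLeaves(&sb, src, root, 0)
	// Flush whatever follows the last leaf, such as a trailing newline.
	sb.WriteString(html.EscapeString(src[pos:]))
	return sb.String()
}

func main() {
	src := "a < b\n"
	root := &node{kind: "binary_expression", start: 0, end: 5, children: []*node{
		{kind: "identifier", start: 0, end: 1},
		{kind: "lt", start: 2, end: 3}, // made-up kind for the "<" token
		{kind: "identifier", start: 4, end: 5},
	}}
	fmt.Print(render(src, root))
}
```

The spaces between tokens belong to no leaf, which is why the text between pos and the start of each leaf has to be copied explicitly; otherwise the highlighted output would lose its whitespace.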

And finally, we close the code block tags and signal that what we wrote to w should be used instead of the default rendering:

fmt.Fprint(w, "</code></pre>")

return ast.GoToNext, true

The result of all this parsing is, for example, this kind of HTML code:

<pre><code class="lang-go">
  <span class="ts-func">func</span><span class="ts-(">(</span><span class="ts-identifier">w</span> <span class="ts-package_identifier">io</span><span class="ts-.">.</span><span class="ts-type_identifier">Writer</span><span class="ts-,">,</span> <span class="ts-identifier">node</span> <span class="ts-package_identifier">ast</span><span class="ts-.">.</span><span class="ts-type_identifier">Node</span><span class="ts-,">,</span> <span class="ts-identifier">entering</span> <span class="ts-type_identifier">bool</span><span class="ts-)">)</span> <span class="ts-(">(</span><span class="ts-package_identifier">ast</span><span class="ts-.">.</span><span class="ts-type_identifier">WalkStatus</span><span class="ts-,">,</span> <span class="ts-type_identifier">bool</span><span class="ts-)">)</span>
</code></pre>

Which I can then style using plain CSS classes:

code.lang-go {
  .ts-type_identifier {
    font-weight: bold;
    color: var(--light-cyan);
  }

  .ts-double_quote,
  .ts-interpreted_string_literal_content {
    color: var(--green);
  }

  .ts-import,
  .ts-func,
  .ts-if,
  .ts-var,
  .ts-switch,
  .ts-case,
  .ts-default,
  .ts-return,
  .ts-continue_statement,
  .ts-break_statement,
  .ts-for {
    font-weight: bold;
    color: var(--light-red);
  }

  .ts-nil {
    font-weight: bold;
    color: var(--blue);
  }

  .ts-comment {
    font-style: italic;
    color: var(--gray);
  }

  .ts-field_identifier {
    font-weight: bold;
    color: var(--magenta);
  }
}

Of course, it takes some work to style it properly, especially for languages the parser logic doesn’t handle yet, but once things look good, there’s not much left to change.

And finally, after all this processing, the website builder spits out multiple index.html files in structured directories under a pages directory. The static output is pushed to a different repository that is picked up by Codeberg Pages, and after a few minutes the website is updated.