ClickHouse and Friends (8): A Purely Hand-Crafted SQL Parser
In real life, when an item is labeled “purely hand-crafted,” the first impression is “premium quality,” or in a word: expensive. Think of Beijing's handmade cloth shoes.
But in the computer world, hearing that ClickHouse’s SQL parser is purely hand-crafted may come as a surprise. The question has caught the attention of many readers, so this post looks at ClickHouse’s hand-crafted parser: how it works underneath, and what its pros and cons are.
Let’s start with a plain SQL statement:
EXPLAIN SELECT a,b FROM t1
Token
First, the lexer inspects the SQL one character at a time, then groups adjacent characters into tokens based on how they relate to each other.
For example, a run of consecutive word characters (WordChars) forms a BareWord token. The tokenizing function is Lexer::nextTokenImpl(). Parsing call stack:
Tokens are the most basic units. On their own they have no relationship to one another; they are just a pile of cold words and symbols. Syntax parsing is what establishes the relationships between tokens and brings them to life.
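To make the tokenizing step concrete, here is a minimal sketch of a character-by-character lexer. It is not ClickHouse's Lexer: apart from BareWord, the token kinds, the Token struct, and the tokenize() and isWordChar() helpers are all hypothetical names chosen for illustration, and the definition of a word character (letter, digit, or underscore) is an assumption.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token kinds for this sketch; a real SQL lexer defines many more.
enum class TokenType { BareWord, Number, Comma, Other };

struct Token {
    TokenType type;
    std::string text;
};

// Assumption of this sketch: a word character is a letter, digit, or underscore.
inline bool isWordChar(char c) {
    return std::isalnum(static_cast<unsigned char>(c)) || c == '_';
}

// Walk the input one character at a time and group adjacent characters into tokens.
inline std::vector<Token> tokenize(const std::string & sql) {
    std::vector<Token> tokens;
    std::size_t pos = 0;
    while (pos < sql.size()) {
        const char c = sql[pos];
        if (std::isspace(static_cast<unsigned char>(c))) {
            ++pos;  // whitespace only separates tokens
            continue;
        }
        const std::size_t begin = pos;
        TokenType type = TokenType::Other;
        if (std::isdigit(static_cast<unsigned char>(c))) {
            // A run of digits becomes one Number token.
            while (pos < sql.size() && std::isdigit(static_cast<unsigned char>(sql[pos])))
                ++pos;
            type = TokenType::Number;
        } else if (isWordChar(c)) {
            // A run of consecutive word characters becomes one BareWord token.
            while (pos < sql.size() && isWordChar(sql[pos]))
                ++pos;
            type = TokenType::BareWord;
        } else {
            // Any other character is a single-character token.
            ++pos;
            type = (c == ',') ? TokenType::Comma : TokenType::Other;
        }
        tokens.push_back({type, sql.substr(begin, pos - begin)});
    }
    return tokens;
}
```

Calling tokenize("EXPLAIN SELECT a,b FROM t1") yields the flat sequence EXPLAIN, SELECT, a, the comma, b, FROM, t1. Nothing in that sequence says that a and b belong to the SELECT list or that t1 is a table, which is exactly the gap syntax parsing fills.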
As ClickHouse parses each token, it uses the current token to predict which state space applies, then decides the state transition: a parser is tried, and if its parse returns true, parsing continues inside that parser's sub-state space. For example:
You can see that the state spaces are constructed in advance when parsing the AST. For example, the select state space contains:
expression list
from tables
where
group by
with …
order by
limit
Within a state space, the bool returned by parse likewise decides whether to descend into further sub-state spaces, so the entire AST is parsed out recursively.
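The idea of bool-returning parsers that descend into sub-state spaces can be sketched as a tiny recursive-descent parser. This is a simplified illustration, not ClickHouse's parser classes: AstNode, Words, and every parse* function below are hypothetical, the input is assumed to be already split into words, and only EXPLAIN, SELECT, an expression list, and FROM are handled.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Simplified AST node for this sketch (hypothetical, not ClickHouse's AST classes).
struct AstNode {
    std::string kind;
    std::string value;
    std::vector<std::unique_ptr<AstNode>> children;
};
using AstPtr = std::unique_ptr<AstNode>;
using Words = std::vector<std::string>;  // already-tokenized input

// Consume one expected word; the position advances only on success.
inline bool parseKeyword(const Words & w, std::size_t & pos, const std::string & kw) {
    if (pos < w.size() && w[pos] == kw) { ++pos; return true; }
    return false;
}

// identifier: any word that is not a comma or a keyword of this sketch.
inline bool parseIdentifier(const Words & w, std::size_t & pos, AstPtr & node) {
    if (pos >= w.size() || w[pos] == "," || w[pos] == "SELECT" || w[pos] == "FROM")
        return false;
    node = std::make_unique<AstNode>();
    node->kind = "Identifier";
    node->value = w[pos++];
    return true;
}

// expression list: identifier (',' identifier)*
inline bool parseExpressionList(const Words & w, std::size_t & pos, AstPtr & node) {
    auto list = std::make_unique<AstNode>();
    list->kind = "ExpressionList";
    AstPtr item;
    if (!parseIdentifier(w, pos, item)) return false;
    list->children.push_back(std::move(item));
    while (parseKeyword(w, pos, ",")) {
        if (!parseIdentifier(w, pos, item)) return false;  // caller restores pos
        list->children.push_back(std::move(item));
    }
    node = std::move(list);
    return true;
}

// select: SELECT expression-list [FROM identifier]
inline bool parseSelect(const Words & w, std::size_t & pos, AstPtr & node) {
    const std::size_t saved = pos;
    if (!parseKeyword(w, pos, "SELECT")) return false;
    auto select = std::make_unique<AstNode>();
    select->kind = "SelectQuery";
    AstPtr exprs;
    if (!parseExpressionList(w, pos, exprs)) { pos = saved; return false; }
    select->children.push_back(std::move(exprs));
    if (parseKeyword(w, pos, "FROM")) {  // optional sub-state: from tables
        AstPtr table;
        if (!parseIdentifier(w, pos, table)) { pos = saved; return false; }
        select->children.push_back(std::move(table));
    }
    node = std::move(select);
    return true;
}

// explain: EXPLAIN select
inline bool parseExplain(const Words & w, std::size_t & pos, AstPtr & node) {
    const std::size_t saved = pos;
    if (!parseKeyword(w, pos, "EXPLAIN")) return false;
    AstPtr inner;
    if (!parseSelect(w, pos, inner)) { pos = saved; return false; }
    node = std::make_unique<AstNode>();
    node->kind = "ExplainQuery";
    node->children.push_back(std::move(inner));
    return true;
}

// Top level: try each alternative in turn; the first parse that returns true wins.
inline AstPtr parseQuery(const Words & w) {
    std::size_t pos = 0;
    AstPtr node;
    if ((parseExplain(w, pos, node) || parseSelect(w, pos, node)) && pos == w.size())
        return node;
    return nullptr;  // no alternative matched the whole input
}
```

For the words EXPLAIN, SELECT, a, the comma, b, FROM, t1, parseQuery returns an ExplainQuery node whose child is a SelectQuery holding a two-item ExpressionList and the identifier t1. Each function restores the position when it fails, so the caller can simply try the next alternative.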
Summary
The advantage of a hand-written parser is clean, concise code. Every detail is under the developer's control, error handling can be made friendly, and a local change does not set off a chain reaction elsewhere. The downside is the high manual cost: it takes extensive testing to ensure correctness, plus some fuzzing for reliability. Fortunately, ClickHouse’s implementation is already quite comprehensive, so new requirements can usually be met by patching the existing foundation.