ClickHouse and Friends (8) A Purely Hand-Crafted SQL Parser [ Bohu's Blog ]

In real life, when an item is labeled “purely hand-crafted,” the first impression is “premium quality” — in a word: expensive. Like Beijing handmade cloth shoes.

But in the computer world, if someone tells you ClickHouse’s SQL parser is purely hand-crafted, isn’t that surprising!
This question has caught the attention of many netizens, so this post discusses ClickHouse’s hand-crafted parser, looking at its underlying working mechanism and pros/cons.

Let’s start boringly with a SQL:

1	EXPLAIN SELECT a,b FROM t1

token

First, judge each character in the SQL one by one, then split it into tokens based on their relationships:

For example, consecutive WordChars form a BareWord. The parsing function is in Lexer::nextTokenImpl(). Parsing call stack:

DB::Lexer::nextTokenImpl() Lexer.cpp:63
DB::Lexer::nextToken() Lexer.cpp:52
DB::Tokens::operator[](unsigned long) TokenIterator.h:36
DB::TokenIterator::get() TokenIterator.h:62
DB::TokenIterator::operator->() TokenIterator.h:64
DB::tryParseQuery(DB::IParser&, char const*&, char const*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >&, bool, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, bool, unsigned long, unsigned long) parseQuery.cpp:224
DB::parseQueryAndMovePosition(DB::IParser&, char const*&, char const*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, bool, unsigned long, unsigned long) parseQuery.cpp:314
DB::parseQuery(DB::IParser&, char const*, char const*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, unsigned long, unsigned long) parseQuery.cpp:332
DB::executeQueryImpl(const char *, const char *, DB::Context &, bool, DB::QueryProcessingStage::Enum, bool, DB::ReadBuffer *) executeQuery.cpp:272
DB::executeQuery(DB::ReadBuffer&, DB::WriteBuffer&, bool, DB::Context&, std::__1::function<void (std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&)>) executeQuery.cpp:731
DB::MySQLHandler::comQuery(DB::ReadBuffer&) MySQLHandler.cpp:313
DB::MySQLHandler::run() MySQLHandler.cpp:150

ast

Tokens are the most basic tuples. They have no relationship — just a bunch of cold words and symbols. So we need to perform syntax parsing to establish relationships between these tokens, bringing them to life.

When ClickHouse parses each token, it predicts the state space based on the current token (if parse returns true, it enters the sub-state space to continue), then decides on state transitions. For example:

1	EXPLAIN -- TokenType::BareWord

The logic first enters the ParserQuery::parseImpl method in Parsers/ParserQuery.cpp:

bool res = query_with_output_p.parse(pos, node, expected)
    || insert_p.parse(pos, node, expected)
    || use_p.parse(pos, node, expected)
    || set_role_p.parse(pos, node, expected)
    || set_p.parse(pos, node, expected)
    || system_p.parse(pos, node, expected)
    || create_user_p.parse(pos, node, expected)
    || create_role_p.parse(pos, node, expected)
    || create_quota_p.parse(pos, node, expected)
    || create_row_policy_p.parse(pos, node, expected)
    || create_settings_profile_p.parse(pos, node, expected)
    || drop_access_entity_p.parse(pos, node, expected)
    || grant_p.parse(pos, node, expected);

Here it calls the parse method for all query types until one branch returns true.

Let’s look at layer one, query_with_output_p.parse Parsers/ParserQueryWithOutput.cpp:

bool parsed =
       explain_p.parse(pos, query, expected)
    || select_p.parse(pos, query, expected)
    || show_create_access_entity_p.parse(pos, query, expected)
    || show_tables_p.parse(pos, query, expected)
    || table_p.parse(pos, query, expected)
    || describe_table_p.parse(pos, query, expected)
    || show_processlist_p.parse(pos, query, expected)
    || create_p.parse(pos, query, expected)
    || alter_p.parse(pos, query, expected)
    || rename_p.parse(pos, query, expected)
    || drop_p.parse(pos, query, expected)
    || check_p.parse(pos, query, expected)
    || kill_query_p.parse(pos, query, expected)
    || optimize_p.parse(pos, query, expected)
    || watch_p.parse(pos, query, expected)
    || show_access_p.parse(pos, query, expected)
    || show_access_entities_p.parse(pos, query, expected)
    || show_grants_p.parse(pos, query, expected)
    || show_privileges_p.parse(pos, query, expected

Jump into layer two, explain_p.parse ParserExplainQuery::parseImpl state space:

bool ParserExplainQuery::parseImpl(Pos & pos, ASTPtr & node, Expected & expected)
{
    ASTExplainQuery::ExplainKind kind;
    bool old_syntax = false;

    ParserKeyword s_ast("AST");
    ParserKeyword s_analyze("ANALYZE");
    ParserKeyword s_explain("EXPLAIN");
    ParserKeyword s_syntax("SYNTAX");
    ParserKeyword s_pipeline("PIPELINE");
    ParserKeyword s_plan("PLAN");

    ... ...
    else if (s_explain.ignore(pos, expected))
    {
       ... ...
    }

    ... ...

    ParserSelectWithUnionQuery select_p;
    ASTPtr query;
    if (!select_p.parse(pos, query, expected))
        return false;
    ... ...

The s_explain.ignore method performs keyword parsing, producing an ast node:

1	EXPLAIN -- keyword

Leap into layer three, select_p.parse ParserSelectWithUnionQuery::parseImpl state space:

bool ParserSelectWithUnionQuery::parseImpl(Pos & pos, ASTPtr & node, Expected & expected)
{
    ASTPtr list_node;

    ParserList parser(std::make_unique<ParserUnionQueryElement>(), std::make_unique<ParserKeyword>("UNION ALL"), false);
    if (!parser.parse(pos, list_node, expected))
        return false;
...

parser.parse then calls layer four ParserSelectQuery::parseImpl state space:

bool ParserSelectQuery::parseImpl(Pos & pos, ASTPtr & node, Expected & expected)
{
    auto select_query = std::make_shared<ASTSelectQuery>();
    node = select_query;

    ParserKeyword s_select("SELECT");
    ParserKeyword s_distinct("DISTINCT");
    ParserKeyword s_from("FROM");
    ParserKeyword s_prewhere("PREWHERE");
    ParserKeyword s_where("WHERE");
    ParserKeyword s_group_by("GROUP BY");
    ParserKeyword s_with("WITH");
    ParserKeyword s_totals("TOTALS");
    ParserKeyword s_having("HAVING");
    ParserKeyword s_order_by("ORDER BY");
    ParserKeyword s_limit("LIMIT");
    ParserKeyword s_settings("SETTINGS");
    ParserKeyword s_by("BY");
    ParserKeyword s_rollup("ROLLUP");
    ParserKeyword s_cube("CUBE");
    ParserKeyword s_top("TOP");
    ParserKeyword s_with_ties("WITH TIES");
    ParserKeyword s_offset("OFFSET");

    ParserNotEmptyExpressionList exp_list(false);
    ParserNotEmptyExpressionList exp_list_for_with_clause(false);
    ParserNotEmptyExpressionList exp_list_for_select_clause(true);
    ...

            if (!exp_list_for_select_clause.parse(pos, select_expression_list, expected))
            return false;

Layer five, exp_list_for_select_clause.parse ParserExpressionList::parseImpl state space continues:

bool ParserExpressionList::parseImpl(Pos & pos, ASTPtr & node, Expected & expected)
{
    return ParserList(
        std::make_unique<ParserExpressionWithOptionalAlias>(allow_alias_without_as_keyword),
        std::make_unique<ParserToken>(TokenType::Comma))
        .parse(pos, node, expected);
}

… … Can’t write any more!

You can see that when parsing the ast, state spaces are pre-constructed. For example, the select state space:

expression list
from tables
where
group by
with …
order by
limit

Within a state space, you can also use the bool returned by parse to decide whether to continue entering sub-state spaces, recursively parsing out the entire ast.

Summary

The advantage of a hand-written parser is clean, concise code. Every detail is preventable and controllable, with friendly error handling. Changes won’t trigger a chain reaction.
The downside is the high manual cost. It requires extensive testing to ensure correctness and some fuzzing for reliability.
Fortunately, ClickHouse’s implementation is quite comprehensive. Even with new requirements, you can just patch things up based on the existing foundation.