ClickHouse and Friends (2) MySQL Protocol and Read Call Stack
For an OLAP DBMS, two ends matter most:
How users connect conveniently: this is the entry end
- Besides its native client, ClickHouse provides MySQL/PostgreSQL/gRPC/HTTP access
How data is mounted conveniently: this is the data source end
- Besides its own table engines, ClickHouse can mount external data sources such as MySQL and Kafka
With both ends connected, inside and out, more friends mean more paths, and ClickHouse gains orchestration capability at the “data” level.
Today’s topic is the entry end’s MySQL protocol, the first good friend in this ClickHouse series. Users can connect to ClickHouse directly through a MySQL client or its drivers to read and write data.
This article follows a MySQL Query request through its call stacks to trace ClickHouse’s complete read path.
How is it Implemented?
The entry file is at:
MySQLHandler.cpp
Handshake Protocol
- MySQLHandler sends a Greeting (Initial Handshake) packet to MySQLClient
- MySQLClient replies with a Handshake Response packet carrying its credentials
- MySQLHandler verifies the credentials
- MySQLHandler returns the authentication result (OK or ERR)
The MySQL protocol implementation lives in: Core/MySQLProtocol.h
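As a rough sketch of the flow (hypothetical, simplified names; the real packet classes such as Handshake, HandshakeResponse, OK_Packet and ERR_Packet are defined there):

```cpp
// Hypothetical, simplified sketch of the handshake sequence MySQLHandler
// drives; helpers like authenticate() stand in for the real logic.
void MySQLHandler::run()
{
    // 1. The server greets the client with version, connection id and auth seed.
    packet_sender->sendPacket(Handshake(server_version, connection_id, auth_seed));

    // 2. The client answers with user name, capability flags and auth data.
    HandshakeResponse response;
    packet_sender->receivePacket(response);

    // 3. The server verifies the credentials and replies with OK or ERR.
    if (authenticate(response.username, response.auth_response))
        packet_sender->sendPacket(OK_Packet());
    else
        packet_sender->sendPacket(ERR_Packet("Access denied"));
}
```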
Query Request
After authentication passes, normal data interaction can proceed.
When MySQLClient sends a request:
```sql
mysql> SELECT * FROM system.numbers LIMIT 5;
```
MySQLHandler call stack:
```
MySQLHandler::comQuery
  -> executeQuery
  -> pipeline->execute
  -> MySQLOutputFormat::consume
```
Finally, MySQLClient receives the results.
The second call in this chain, executeQuery (executeQuery.cpp), is the crucial one. It is the interface between all the frontend servers and the ClickHouse kernel: the first argument is the SQL text (e.g. 'SELECT 1'), and the second is where the result set should be written (the client's network socket).
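Its buffer-based overload looks roughly like this (paraphrased from Interpreters/executeQuery.h; the exact overloads vary across versions):

```cpp
// Paraphrased declaration: reads the SQL text from `istr` and streams the
// result set into `ostr`, which for the MySQL handler wraps the client socket.
void executeQuery(
    ReadBuffer & istr,    /// query text (and INSERT data, if any) is read from here
    WriteBuffer & ostr,   /// the result set is written here
    bool allow_into_outfile,
    Context & context,
    std::function<void(const String &, const String &, const String &, const String &)> set_result_details);
```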
Call Stack Analysis
```sql
SELECT * FROM system.numbers LIMIT 5
```
1. Get Data Source
StorageSystemNumbers data source:
```
DB::StorageSystemNumbers::read(std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&, std::__1::shared_ptr<DB::StorageInMemoryMetadata const> const&, DB::SelectQueryInfo const&, DB::Context const&, DB::QueryProcessingStage::Enum, unsigned long, unsigned int) StorageSystemNumbers.cpp:135
```
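Stripping the allocator noise, this is the generic IStorage::read interface that every table engine implements. With the type aliases restored, it reads roughly as:

```cpp
// The frame above in readable form: Names is a vector of strings,
// StorageMetadataPtr a shared_ptr to const StorageInMemoryMetadata.
Pipes StorageSystemNumbers::read(
    const Names & column_names,
    const StorageMetadataPtr & metadata_snapshot,
    const SelectQueryInfo & query_info,
    const Context & context,
    QueryProcessingStage::Enum processed_stage,
    size_t max_block_size,
    unsigned num_streams);
```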
The key piece here is the ReadFromStorageStep, which obtains the data-source Pipes from the concrete storage:

```cpp
Pipes pipes = storage->read(required_columns, metadata_snapshot, query_info, *context, processing_stage, max_block_size, max_streams);
```
2. Pipeline Construction
```
DB::LimitTransform::LimitTransform(DB::Block const&, unsigned long, unsigned long, unsigned long, bool, bool, std::__1::vector<DB::SortColumnDescription, std::__1::allocator<DB::SortColumnDescription> >) LimitTransform.cpp:21
```
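Mapping the frame's argument types back to names (my reading, following LimitTransform.h), the constructor is roughly:

```cpp
// The frame above in readable form; argument names are my interpretation.
LimitTransform(
    const Block & header,         // column structure of the stream
    UInt64 limit,                 // LIMIT n
    UInt64 offset,                // OFFSET m
    UInt64 num_streams,           // number of input/output port pairs
    bool always_read_till_end,    // keep consuming input even after the limit
    bool with_ties,               // LIMIT ... WITH TIES
    SortDescription description); // sort columns consulted by WITH TIES
```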
3. Pipeline Execution
```
DB::LimitTransform::prepare(std::__1::vector<unsigned long, std::__1::allocator<unsigned long> > const&, std::__1::vector<unsigned long, std::__1::allocator<unsigned long> > const&) LimitTransform.cpp:67
```
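prepare() is part of ClickHouse's processor contract: it only does cheap port negotiation and tells the scheduler what to do next, while the actual row work happens in work(). A simplified extract of the interface (see src/Processors/IProcessor.h):

```cpp
// Simplified extract of the IProcessor contract: prepare() must be fast and
// only move data between ports; work() may do heavy processing.
class IProcessor
{
public:
    enum class Status
    {
        NeedData,       // wait for input data
        PortFull,       // output is full, wait for the consumer
        Finished,       // nothing more to do
        Ready,          // call work()
        Async,          // an asynchronous job is in flight
        ExpandPipeline  // the processor wants to add new processors
    };

    virtual Status prepare() = 0;
    virtual void work() = 0;
    virtual ~IProcessor() = default;
};
```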
4. Output Execution and Sending
```
DB::MySQLOutputFormat::consume(DB::Chunk) MySQLOutputFormat.cpp:53
```
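consume() is where the pipeline output meets the wire: each Chunk is serialized row by row into MySQL text-protocol packets on the client connection. A simplified sketch (helper names modeled on MySQLOutputFormat.cpp, not copied verbatim):

```cpp
// Simplified sketch: every row of the Chunk becomes one MySQL
// text-protocol ResultsetRow packet written back to the client.
void MySQLOutputFormat::consume(Chunk chunk)
{
    const auto & columns = chunk.getColumns();
    for (size_t row = 0; row < chunk.getNumRows(); ++row)
    {
        ProtocolText::ResultsetRow row_packet(data_types, columns, row);
        packet_sender->sendPacket(row_packet);  // write the packet to the socket
    }
}
```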
Summary
ClickHouse’s modularization is quite clean: the pieces snap together like Lego blocks. When we execute:

```sql
SELECT * FROM system.numbers LIMIT 5
```
First, the kernel parses the SQL into an AST, then derives the data source from the AST: pipeline.Add(Source).
Next, it builds a QueryPlan from the AST and generates the corresponding Transforms from that plan: pipeline.Add(LimitTransform).
Then it adds the output sink as the destination for the data: pipeline.Add(OutputSink).
Finally, it executes the pipeline and every transformer starts working.
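To make the Lego analogy concrete, here is a tiny standalone C++ model of that assembly, using my own toy types rather than ClickHouse's: a numbers source, a limit transform, and a printing sink chained into one pipeline.

```cpp
#include <cstdint>
#include <functional>
#include <iostream>
#include <optional>
#include <utility>

// Toy model of the Source -> Transform -> Sink assembly described above.
// A "chunk" is a single number here; ClickHouse streams column blocks.
using Chunk = std::uint64_t;
using Source = std::function<std::optional<Chunk>()>;

// Source: emits 0, 1, 2, ... like system.numbers.
Source numbersSource()
{
    return [n = Chunk{0}]() mutable -> std::optional<Chunk> { return n++; };
}

// Transform: ends the stream after `limit` chunks, like LimitTransform.
Source limitTransform(Source input, std::uint64_t limit)
{
    return [input = std::move(input), limit, seen = std::uint64_t{0}]() mutable
        -> std::optional<Chunk>
    {
        if (seen >= limit)
            return std::nullopt;  // limit reached: finish the stream
        ++seen;
        return input();
    };
}

// Sink: consumes chunks and prints them, standing in for
// MySQLOutputFormat::consume writing result rows to the client.
void outputSink(Source input)
{
    while (auto chunk = input())
        std::cout << *chunk << '\n';
}

int main()
{
    // pipeline.Add(Source); pipeline.Add(LimitTransform); pipeline.Add(OutputSink)
    outputSink(limitTransform(numbersSource(), 5));  // prints 0..4
    return 0;
}
```

The real thing differs in two important ways: data moves as column blocks rather than single values, and execution is driven by a scheduler polling prepare()/work() instead of pull-through function calls.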
ClickHouse’s transformer scheduling framework is called Processor; it is also a performance-critical module. For details, see Pipeline Processors and Scheduler.
ClickHouse is a luxury sports car with a manual transmission, and it’s free to own. Cheers, everyone!