Search before asking
- I had searched in the issues and found no similar issues.
Description
Background
Doris is built on a column-oriented storage engine. In high-concurrency serving scenarios, users often want to retrieve a whole row from the system, but a column-oriented format massively amplifies random read IO when the table is wide.
The Doris query engine and planner are too heavy for simple queries such as point queries; we need a short, fast path for them. In addition, the FE is an access-layer service for SQL written in Java, and analyzing and parsing SQL there leads to very high CPU overhead under high-concurrency workloads.
To address these drawbacks, the following optimization methods can be applied:
- Row Store Format Optimization: In high-concurrency serving scenarios, users often want to retrieve entire rows. To address the high random read IO on wide tables, a row store format can be introduced: all columns of a row are stored together, so an entire row can be retrieved in a single read operation, reducing the number of disk accesses and improving performance.
- Short Path Optimization for Point Queries: The heavy query engine and plan can lead to high overhead for simple point queries. A short-path optimization can bypass the heavy query engine and use a fast, efficient path to directly retrieve the required data.
- Prepared Statement Optimization: High CPU overhead in high-concurrency queries can partly be attributed to the CPU-intensive process of analyzing and parsing SQL in the frontend (FE) layer. A prepared statement is a precompiled SQL statement that can be executed multiple times, reducing the overhead of analyzing and parsing SQL and improving performance.
In conclusion, these optimizations can help address the performance issues faced by Doris in high concurrency scenarios. By providing a row store format, implementing a short path optimization for point queries, and using prepared statements, Doris can deliver fast and efficient performance for high concurrency queries.
I'll discuss the details of each one below.
Row store
- Flexibility for schema evolution: As data evolves and changes, the schema of a database may also change. It's important that the row store format provides flexibility to handle these changes. By using JSONB as the codec, changes in the schema can be easily reflected in the binary format, allowing the row store format to adapt to changes in the data and schema.
- Performance considerations: Accessing columns in a row store format can be faster than in a column store format, as data is stored in a single row. This can significantly reduce disk access overhead in high concurrency scenarios. Additionally, by mapping each column ID to its corresponding value, fast access to individual columns can be achieved.
- Storage considerations: The use of JSONB as the codec for the row store format can also help reduce disk storage costs. The compact binary format can reduce the overall size of data stored on disk, making it more cost-efficient.
In conclusion, the implementation of a row store format in Doris using JSONB as the codec can help address the performance and storage issues faced in high concurrency scenarios. The simple and flexible codec, along with the added column called __DORIS_SOURCE_COL__ for fast access, can provide fast and cost-efficient access to data in Doris.
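To make the encoding idea concrete, below is a minimal Java sketch of the column-id-to-value mapping described above. It is not Doris's actual JSONB codec; the RowStoreEncoderSketch class and its binary layout are invented for illustration and only show how a whole row can be packed into, and read back from, a single compact value keyed by column id.

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.*;

// Hypothetical illustration of the row-store idea: pack all column values of a
// row into one compact binary value keyed by column id, so a point query can
// fetch the whole row with a single read. Doris's real codec is JSONB; this
// sketch only mimics the column-id -> value mapping it relies on.
public class RowStoreEncoderSketch {

    // Encode columnId -> value pairs into a single binary blob.
    static byte[] encode(SortedMap<Integer, String> row) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeInt(row.size());                       // number of columns
        for (Map.Entry<Integer, String> e : row.entrySet()) {
            out.writeInt(e.getKey());                   // column id
            byte[] v = e.getValue().getBytes(StandardCharsets.UTF_8);
            out.writeInt(v.length);                     // value length
            out.write(v);                               // value bytes
        }
        return bos.toByteArray();
    }

    // Decode the blob back into a columnId -> value map.
    static SortedMap<Integer, String> decode(byte[] blob) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(blob));
        SortedMap<Integer, String> row = new TreeMap<>();
        int n = in.readInt();
        for (int i = 0; i < n; i++) {
            int columnId = in.readInt();
            byte[] v = new byte[in.readInt()];
            in.readFully(v);
            row.put(columnId, new String(v, StandardCharsets.UTF_8));
        }
        return row;
    }

    public static void main(String[] args) throws IOException {
        SortedMap<Integer, String> row = new TreeMap<>();
        row.put(0, "user-123");        // e.g. the key column
        row.put(1, "field0-value");
        row.put(2, "field1-value");
        byte[] blob = encode(row);     // stored in the __DORIS_SOURCE_COL__ column
        System.out.println(decode(blob));  // whole row recovered from one read
    }
}

Because values are keyed by column id rather than by position, a newly added column simply introduces a new id and old rows just lack it, which is the schema-evolution flexibility described above.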
Short-circuit plan
Due to the simplicity of point queries, the query planner has implemented corresponding shortcuts, allowing the point query to bypass the entire optimization process of the Optimizer. After the query is received by the FrontEnd, it is processed by the Fixed Planner, which generates the appropriate Fixed Plan, serving as the physical plan for the point query. The Fixed Planner is extremely lightweight, without the need for any equivalent transformations, logical optimizations, or physical optimizations. It simply performs some basic analysis based on the AST tree, constructing the corresponding Fixed Plan, and reducing the overhead of the optimizer.
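As a rough illustration of the kind of check such a planner shortcut performs, the sketch below tests whether a statement qualifies for the fast path: the WHERE clause must be a conjunction of equality predicates that binds every key column, so the query maps to a single primary key. The types and the isPointQuery helper are hypothetical and are not the actual FE classes.

import java.util.*;

// Hypothetical sketch of the short-circuit eligibility check: a query can take
// the point-query fast path only if its WHERE clause is a conjunction of
// "keyColumn = constant" predicates covering every key column of the table.
public class PointQuerySketch {

    // A single "column = constantValue" predicate.
    record EqPredicate(String column, String constantValue) {}

    static boolean isPointQuery(List<String> keyColumns, List<EqPredicate> conjuncts) {
        Set<String> bound = new HashSet<>();
        for (EqPredicate p : conjuncts) {
            // A predicate on a non-key column keeps the query on the normal path.
            if (!keyColumns.contains(p.column())) {
                return false;
            }
            bound.add(p.column());
        }
        // Every key column must be bound so the query maps to one primary key.
        return bound.size() == keyColumns.size();
    }

    public static void main(String[] args) {
        List<String> keys = List.of("pk1", "pk2");
        // select * from tbl where pk1 = 123 and pk2 = 456  -> eligible
        System.out.println(isPointQuery(keys,
                List.of(new EqPredicate("pk1", "123"), new EqPredicate("pk2", "456"))));
        // select * from tbl where pk1 = 123                -> not eligible
        System.out.println(isPointQuery(keys,
                List.of(new EqPredicate("pk1", "123"))));
    }
}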
When it comes to simple queries such as select * from tbl where pk1 = 123 and pk2 = 456, which only act on a single tablet, a lightweight RPC interface can be utilized to retrieve the data. This alternative approach eliminates the need to generate a complex Fragment plan and execute it on a heavy execution engine. The details of the RPC interface are as follows:
// We use thrift definitions for some structures, since TExpr,
// list<Exprs.TExpr>, and Descriptors.TDescriptorTable are all in thrift format.
// Converting them to protobuf would be redundant work.
message PTabletKeyLookupRequest {
    required int64 tablet_id = 1;
    repeated KeyTuple key_tuples = 2;
    // serialized from Descriptors.TDescriptorTable
    optional bytes desc_tbl = 4;
    // serialized from TExprList
    optional bytes output_expr = 5;
}
message PTabletKeyLookupResponse {
    required PStatus status = 1;
    optional bytes row_batch = 5;
    optional bool empty_batch = 6;
}
rpc tablet_fetch_data(PTabletKeyLookupRequest) returns (PTabletKeyLookupResponse);
tablet_id is calculated from the primary key condition columns, and key_tuples is the string form of the primary key; in the above example key_tuples looks like ['123', '456']. The key tuples are encoded into the primary key format, and lookup_row_key is used to identify the positions of the keys in the tablet. lookup_row_key locates the key's row position and checks whether the key is in the delete bitmap; if not, it returns its RowLocation, otherwise it returns NotFound. The location is then used to read the row directly from the related segment. For point queries we only need the __DORIS_SOURCE_COL__ column, so we simply locate the row in this column, read the raw JSONB value, and decode it into a Block for later expression evaluation.
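The lookup flow above can be summarized with the following sketch. The real code is C++ in the BE; every type here (RowLocation, the key index, delete bitmap, and row-store maps) is an in-memory stand-in used only to show the order of operations: encode the key tuple, lookup_row_key, delete-bitmap check, then read the row-store value at the located position.

import java.util.*;

// Hedged Java stand-in for the BE-side point lookup flow described above.
public class TabletLookupSketch {

    record RowLocation(int segmentId, long rowId) {}

    public static void main(String[] args) {
        // Stand-in primary key index: encoded key -> row location.
        Map<String, RowLocation> keyIndex = new HashMap<>();
        // Stand-in delete bitmap: locations whose rows were deleted.
        Set<RowLocation> deleteBitmap = new HashSet<>();
        // Stand-in row store column (__DORIS_SOURCE_COL__): location -> row blob.
        Map<RowLocation, String> rowStore = new HashMap<>();

        RowLocation loc = new RowLocation(0, 42L);
        keyIndex.put(encodeKey(List.of("123", "456")), loc);
        rowStore.put(loc, "{\"0\":\"123\",\"1\":\"456\",\"2\":\"field-value\"}");

        System.out.println(lookup(List.of("123", "456"), keyIndex, deleteBitmap, rowStore));
        System.out.println(lookup(List.of("999", "999"), keyIndex, deleteBitmap, rowStore));
    }

    // key_tuples (e.g. ["123", "456"]) are encoded into a single primary key.
    static String encodeKey(List<String> keyTuple) {
        return String.join("\u0000", keyTuple);
    }

    // Mirrors lookup_row_key plus the row fetch; NotFound is modeled as Optional.empty().
    static Optional<String> lookup(List<String> keyTuple,
                                   Map<String, RowLocation> keyIndex,
                                   Set<RowLocation> deleteBitmap,
                                   Map<RowLocation, String> rowStore) {
        RowLocation loc = keyIndex.get(encodeKey(keyTuple));
        if (loc == null || deleteBitmap.contains(loc)) {
            return Optional.empty();       // key absent or marked deleted
        }
        // Only the row-store column is read; decoding it into a Block happens later.
        return Optional.ofNullable(rowStore.get(loc));
    }
}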
Prepared Statement
Prepared Statement is a statement caching mechanism provided by MySQL to reduce the CPU overhead of analyzing and parsing SQL statements in high concurrency scenarios. The feature is fully compatible with the MySQL protocol and is currently only supported for point queries.
By using the Prepared Statement feature, we can improve the performance of our system by up to 4x. The mechanism works by caching the pre-computed SQL and expressions in a session-level in-memory HashMap, which can then be reused later. This reduces the CPU cost of analyzing and parsing the SQL statement.
To implement Prepared Statement, we use the MySQL binary protocol for transferring the statement. This protocol is implemented in the file mysql_row_buffer.[h|cpp] and is compliant with the standard MySQL binary protocol.
In addition to caching the statement in the FE, we also need to cache the structures that are reused in the BE. This includes pre-allocated block pools, query descriptors, and expressions, which are also CPU-intensive to serialize and deserialize. For each Prepared Statement, a unique UUID named cacheID is attached. When the Prepared Statement is executed, the related cacheID is used to execute and reuse the aforementioned structures in the BE.
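A hedged sketch of the caching idea follows: a session-level map from cacheID to the pre-analyzed statement, so repeated executions only bind parameters instead of re-parsing and re-analyzing. The class and method names are illustrative and do not correspond to the actual FE or BE code.

import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of prepared-statement caching: prepare() parses/analyzes
// once and stores the result under a cacheID; execute() reuses it and only
// binds parameters, which is what removes the per-query parse/analyze cost.
public class PreparedStatementCacheSketch {

    // Stand-in for whatever the FE keeps after parse/analyze (and the BE keeps
    // for descriptors and output expressions).
    record CachedStatement(String sqlTemplate) {
        String bind(String... params) {
            String sql = sqlTemplate;
            for (String p : params) {
                sql = sql.replaceFirst("\\?", "'" + p + "'");
            }
            return sql;
        }
    }

    // Session-level cache: cacheID -> pre-analyzed statement.
    private final Map<UUID, CachedStatement> sessionCache = new ConcurrentHashMap<>();

    // Prepare phase: analyze once, hand back a cacheID.
    UUID prepare(String sql) {
        UUID cacheId = UUID.randomUUID();
        sessionCache.put(cacheId, new CachedStatement(sql));
        return cacheId;
    }

    // Execute phase: only bind parameters, no re-parsing or re-analysis.
    String execute(UUID cacheId, String... params) {
        return sessionCache.get(cacheId).bind(params);
    }

    public static void main(String[] args) {
        PreparedStatementCacheSketch session = new PreparedStatementCacheSketch();
        UUID id = session.prepare("SELECT * FROM usertable WHERE YCSB_KEY = ?");
        System.out.println(session.execute(id, "user-1"));
        System.out.println(session.execute(id, "user-2"));  // reuses the cached statement
    }
}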
Benchmark
I ran a standard YCSB benchmark to measure the performance improvement. Below is the result.
(The test environment is a single cloud machine with 16 cores, 64 GB of memory, and 4 * 1 TB HDDs. Before the benchmark I did some warmup for caching. I will test performance on a cluster later on; it should scale linearly.)
The run used about 12 cores.
[OVERALL], Throughput(ops/sec), 31134.704298145927
[READ], Operations, 9882727
[READ], AverageLatency(us), 638.3491656705685
[READ], MinLatency(us), 250
[READ], MaxLatency(us), 152447
[READ], 95thPercentileLatency(us), 856
[READ], 99thPercentileLatency(us), 1277
[READ], Return=OK, 9882727
[READ], Return=NOT_FOUND, 117273
The table was created with the following SQL:
CREATE TABLE `usertable` (
    `YCSB_KEY` varchar(255) NULL,
    `FIELD0` text NULL,
    `FIELD1` text NULL,
    `FIELD2` text NULL,
    `FIELD3` text NULL,
    `FIELD4` text NULL,
    `FIELD5` text NULL,
    `FIELD6` text NULL,
    `FIELD7` text NULL,
    `FIELD8` text NULL,
    `FIELD9` text NULL
) ENGINE=OLAP
UNIQUE KEY(`YCSB_KEY`)
COMMENT 'OLAP'
DISTRIBUTED BY HASH(`YCSB_KEY`) BUCKETS 16
PROPERTIES (
    "replication_allocation" = "tag.location.default: 1",
    "in_memory" = "false",
    "persistent" = "false",
    "storage_format" = "V2",
    "enable_unique_key_merge_on_write" = "true",
    "light_schema_change" = "true",
    "store_row_column" = "true",
    "disable_auto_compaction" = "false"
);
and the query below was executed with useServerPrepStmts=true enabled:
SELECT * from usertable WHERE YCSB_KEY = ?
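For reference, a typical JDBC client for the query above looks like the following, assuming MySQL Connector/J is on the classpath; the host, port, user, and password are placeholders.

import java.sql.*;

// Minimal JDBC client for the benchmark query above. useServerPrepStmts=true
// makes Connector/J use server-side prepared statements over the MySQL binary
// protocol, so the statement is prepared once and executed many times.
public class PointQueryClient {
    public static void main(String[] args) throws SQLException {
        String url = "jdbc:mysql://127.0.0.1:9030/ycsb"
                + "?useServerPrepStmts=true&cachePrepStmts=true";
        try (Connection conn = DriverManager.getConnection(url, "root", "");
             PreparedStatement stmt =
                     conn.prepareStatement("SELECT * FROM usertable WHERE YCSB_KEY = ?")) {
            // Prepare once, then execute repeatedly with different keys.
            for (String key : new String[] {"user1", "user2", "user3"}) {
                stmt.setString(1, key);
                try (ResultSet rs = stmt.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString("YCSB_KEY"));
                    }
                }
            }
        }
    }
}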
Solution
No response
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project's Code of Conduct