This document provides a technical overview of Apache Cloudberry, an open-source distributed relational database management system. It introduces the system's architecture, core components, and key design concepts. For detailed information about specific subsystems, refer to their dedicated pages.
Apache Cloudberry is a massively parallel processing (MPP) database system derived from Greenplum Database, which is built on PostgreSQL 14. The codebase extends PostgreSQL's single-node architecture with distributed query execution, parallel processing across segments, and advanced query optimization through GPORCA.
Version Information:
- Product: Apache Cloudberry configure:583
- Version: 3.0.0-devel configure:585
- Catalog version number: 302509031 src/include/catalog/catversion.h:59

Key Differentiators from PostgreSQL:
- Distributed transaction management src/backend/cdb/cdbtm.c

Sources: src/include/catalog/catversion.h:1-62 configure:583-588
Diagram: Apache Cloudberry System Architecture with Code Entities
The system architecture consists of five logical layers:
Client Layer: Applications connect via the libpq protocol. The PQexec() function in src/interfaces/libpq/fe-exec.c sends queries and receives results (see the client sketch after this list).
Coordinator Layer: The Postmaster process src/backend/postmaster/postmaster.c:1-300 forks backend processes that run PostgresMain() src/backend/tcop/postgres.c:100-200. Each backend parses SQL via raw_parser() src/backend/parser/gram.y, analyzes it via parse_analyze(), and plans queries using either standard_planner() src/backend/optimizer/plan/planner.c:100-200 or GPORCA's optimize_query().
Dispatch Layer: For distributed queries, CdbDispatchPlan() src/backend/cdb/dispatcher/cdbdisp_query.c divides the plan into slices and sends them to segment backends.
Executor Layer: ExecutorStart() and ExecutorRun() src/backend/executor/execMain.c process plan nodes. Motion nodes src/backend/executor/execMotion.c handle inter-segment data transfer.
Storage Layer: Data modifications go through heap access methods like heap_insert() src/backend/access/heap/heapam.c and are logged via XLogInsert() src/backend/access/transam/xlog.c:1-100. System catalogs are cached by catcache.c.
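To make the client layer concrete, the following minimal libpq program connects, runs a query with PQexec(), and prints the result. The API calls are the real libpq interface; the connection string is a placeholder, so adjust host, port, and database for your cluster.

```c
/*
 * Minimal libpq client: connect, run a query via PQexec(), print the result.
 * Build (paths vary per installation):
 *   cc client.c -I$(pg_config --includedir) -L$(pg_config --libdir) -lpq
 */
#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>

int
main(void)
{
    /* Connection parameters are illustrative; adjust for your cluster. */
    PGconn *conn = PQconnectdb("host=localhost port=5432 dbname=postgres");

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        PQfinish(conn);
        return EXIT_FAILURE;
    }

    /* PQexec() blocks until the coordinator returns a complete result set. */
    PGresult *res = PQexec(conn, "SELECT version()");

    if (PQresultStatus(res) == PGRES_TUPLES_OK && PQntuples(res) > 0)
        printf("server: %s\n", PQgetvalue(res, 0, 0));
    else
        fprintf(stderr, "query failed: %s", PQerrorMessage(conn));

    PQclear(res);
    PQfinish(conn);
    return EXIT_SUCCESS;
}
```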
Sources: src/backend/postmaster/postmaster.c:1-300 src/backend/tcop/postgres.c:1-200 src/backend/optimizer/plan/planner.c:1-200 src/backend/cdb/dispatcher/cdbdisp_query.c src/backend/executor/execMain.c
Diagram: Process Architecture with Code References
The Postmaster process src/backend/postmaster/postmaster.c:200-500 acts as the supervisor process that:
Listens for client connections: The ServerLoop() function src/backend/postmaster/postmaster.c:1700-1800 monitors ListenSocket[] on port PostPortNumber (configurable via the port GUC, default 5432) src/backend/postmaster/postmaster.c:230.
Forks backend processes: Upon accepting a connection via BackendStartup() src/backend/postmaster/postmaster.c:4200-4400, the postmaster calls fork() and the child runs PostgresMain() src/backend/tcop/postgres.c:4000-4100. Each backend has a unique MyProcPid and MyBackendId (a simplified sketch of this flow appears at the end of this section).
Manages auxiliary processes: Background workers are forked during startup via:
- StartupProcessMain() src/backend/postmaster/startup.c - Replays WAL on startup
- BackgroundWriterMain() src/backend/postmaster/bgwriter.c - Writes dirty buffers
- CheckpointerMain() - Coordinates checkpoints and syncs XLogCtl
- WalWriterMain() src/backend/postmaster/walwriter.c - Flushes WAL buffers via XLogWrite()
- AutoVacLauncherMain() src/backend/postmaster/autovacuum.c - Schedules vacuum operations
- PgstatCollectorMain() src/backend/postmaster/pgstat.c - Aggregates statistics

Cloudberry-specific processes:
- FtsProbeMain() src/backend/postmaster/fts.c - Monitors segment availability
- WalReceiverMain() src/backend/replication/walreceiver.c - Receives WAL from the primary (replication)

Shared Memory Architecture:
All processes attach to shared memory segments created by PGSharedMemoryCreate() src/backend/storage/ipc/pg_shmem.c. Key shared structures include:
- XLogCtlData src/backend/access/transam/xlog.c:651-791 - WAL control structure with insertion locks, buffer pointers, and checkpoint state
- BufferDescriptors - Buffer pool metadata for shared_buffers
- ProcArray src/backend/storage/ipc/procarray.c - Tracks active transactions for MVCC snapshots
- LockManager structures for lock tables

Process Communication:
Processes signal each other via kill() and signal handlers src/backend/tcop/postgres.c:3000-3100.

Sources: src/backend/postmaster/postmaster.c:1-500 src/backend/tcop/postgres.c:1-200 src/backend/access/transam/xlog.c:651-791 src/backend/storage/ipc/pg_shmem.c src/backend/postmaster/fts.c
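The fork-per-connection flow described above can be reduced to the following toy sketch. This is not the postmaster's actual code: the real ServerLoop()/BackendStartup() path also handles multiple listen sockets, signals, shared-memory attachment, and crash recovery, and backend_main() here is only a stand-in for PostgresMain().

```c
/*
 * Toy sketch of the fork-per-connection model (NOT the real postmaster code).
 * One child process is forked per accepted client connection.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <netinet/in.h>
#include <arpa/inet.h>

static void
backend_main(int client_fd)
{
    /* Stand-in for PostgresMain(): a real backend would attach to shared
     * memory and enter the query loop here. */
    close(client_fd);
    _exit(0);
}

int
main(void)
{
    int                 listen_fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in  addr = {0};

    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(5432);            /* PostPortNumber's default */

    if (listen_fd < 0 ||
        bind(listen_fd, (struct sockaddr *) &addr, sizeof(addr)) < 0 ||
        listen(listen_fd, 128) < 0)
    {
        perror("socket/bind/listen");
        return EXIT_FAILURE;
    }

    for (;;)                                /* analogous to ServerLoop() */
    {
        int client_fd = accept(listen_fd, NULL, NULL);

        if (client_fd < 0)
            continue;

        if (fork() == 0)                    /* analogous to BackendStartup() */
            backend_main(client_fd);        /* child: one backend per client */

        close(client_fd);                   /* parent keeps listening */
        while (waitpid(-1, NULL, WNOHANG) > 0)
            ;                               /* reap exited backends */
    }
}
```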
Diagram: Query Processing Pipeline with Code Functions
Query execution follows a multi-stage pipeline:
1. Protocol & Parsing Phase:
- The client submits the query with PQexec() src/interfaces/libpq/fe-exec.c using the frontend/backend protocol
- PostgresMain() src/backend/tcop/postgres.c:4100-4200 calls ReadCommand() to receive the query string (stored in debug_query_string)
- raw_parser() src/backend/parser/parser.c invokes base_yyparse(), which is generated from src/backend/parser/gram.y:1-300
- The result is a RawStmt tree src/include/nodes/parsenodes.h

2. Analysis Phase:
- parse_analyze() src/backend/parser/analyze.c performs semantic analysis
- transformStmt() dispatches on statement type (SELECT → transformSelectStmt(), INSERT → transformInsertStmt(), etc.)
- Relations are looked up via RangeVarGetRelid(), and name/type resolution is performed
- The result is a Query tree src/include/nodes/parsenodes.h with resolved relations and expressions

3. Planning Phase:
The system chooses between two optimizers based on the optimizer GUC:
Standard Planner src/backend/optimizer/plan/planner.c:200-300:
- standard_planner() calls subquery_planner() → grouping_planner()
- query_planner() generates join orders and access paths
- create_paths() src/backend/optimizer/path/allpaths.c builds RelOptInfo structures with different access paths (SeqScan, IndexScan, etc.)
- create_plan() src/backend/optimizer/plan/createplan.c:200-300 converts the cheapest Path into a Plan tree

GPORCA src/backend/optimizer/orca:
- When optimizer=on, optimize_query() translates the Query to DXL (Data eXchange Language)
- COptTasks::OptimizeTask() (in gpopt) performs advanced optimization
- The output is a PlannedStmt src/include/nodes/plannodes.h containing the execution plan tree
4. Execution Phase:
- CreateQueryDesc() src/backend/tcop/pquery.c wraps the plan in a QueryDesc structure
- ExecutorStart() src/backend/executor/execMain.c:100-200 calls InitPlan() to initialize executor state (EState and PlanState nodes)
- ExecutorRun() src/backend/executor/execMain.c:300-400 calls ExecutePlan(), which repeatedly invokes ExecProcNode() src/backend/executor/execProcnode.c; dispatch depends on the node type (a simplified iterator sketch follows this pipeline):
  - SeqScanState → ExecSeqScan() src/backend/executor/nodeSeqscan.c
  - IndexScanState → ExecIndexScan() src/backend/executor/nodeIndexscan.c
  - ModifyTableState → ExecModifyTable() src/backend/executor/nodeModifyTable.c
- ExecutorEnd() src/backend/executor/execMain.c:500-600 cleans up executor state

5. Storage Access:
Data modifications go through heap access methods:
- heap_insert() src/backend/access/heap/heapam.c:2000-2100 - Insert new tuples
- heap_update() - Update existing tuples (marks the old version as deleted, inserts a new version)
- heap_delete() - Mark a tuple as deleted (MVCC: sets xmax rather than physically deleting)

All modifications are logged via XLogInsert() src/backend/access/transam/xlog.c:900-1000 before being applied (write-ahead logging ensures durability).
Sources: src/backend/tcop/postgres.c:4100-4200 src/backend/parser/gram.y:1-300 src/backend/parser/analyze.c src/backend/optimizer/plan/planner.c:1-300 src/backend/optimizer/path/allpaths.c src/backend/optimizer/plan/createplan.c:1-300 src/backend/executor/execMain.c:1-600 src/backend/executor/execProcnode.c src/backend/access/heap/heapam.c:2000-2100
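The per-node dispatch in the execution phase follows the classic iterator ("Volcano") model: each call to ExecProcNode() returns at most one tuple, pulling from child nodes on demand. The standalone sketch below mimics that shape; DemoNode and DemoTuple are illustrative toy types, not the executor's real PlanState and TupleTableSlot structures.

```c
/*
 * Simplified illustration of the Volcano (iterator) execution model used by
 * ExecProcNode(): each node returns one tuple per call, pulling from its
 * child as needed.
 */
#include <stdio.h>
#include <stdbool.h>

typedef struct DemoTuple { int id; } DemoTuple;

typedef struct DemoNode
{
    struct DemoNode *child;                         /* input node, if any */
    bool (*next)(struct DemoNode *, DemoTuple *);   /* "ExecProcNode" hook */
    int  scan_pos;                                  /* state for the scan */
} DemoNode;

/* Leaf node: behaves like a sequential scan over five dummy tuples. */
static bool
demo_seqscan_next(DemoNode *node, DemoTuple *out)
{
    if (node->scan_pos >= 5)
        return false;                               /* end of relation */
    out->id = node->scan_pos++;
    return true;
}

/* Inner node: behaves like a filter, keeping only even ids. */
static bool
demo_filter_next(DemoNode *node, DemoTuple *out)
{
    while (node->child->next(node->child, out))
        if (out->id % 2 == 0)
            return true;                            /* qualifying tuple */
    return false;
}

int
main(void)
{
    DemoNode scan   = { .next = demo_seqscan_next };
    DemoNode filter = { .child = &scan, .next = demo_filter_next };
    DemoTuple tup;

    /* The top-level loop mirrors ExecutePlan(): pull until the plan is done. */
    while (filter.next(&filter, &tup))
        printf("tuple %d\n", tup.id);
    return 0;
}
```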
The Grand Unified Configuration (GUC) system manages runtime and startup parameters:
Configuration Sources:
- postgresql.conf - Primary configuration file src/backend/utils/misc/guc.c:127
- postgresql.auto.conf - Auto-generated settings from ALTER SYSTEM commands
- Command-line options passed to the postgres process
- SET commands at the session level

Key GUC Functions:
- ProcessConfigFile() - Reads and applies configuration files src/backend/utils/misc/guc.c:256-258
- SetConfigOption() - Sets individual parameters
- GetConfigOption() - Retrieves current values (a usage sketch follows at the end of this section)

Cloudberry Extensions: Additional MPP-specific parameters are defined in src/backend/utils/misc/guc_gp.c, including:
- gp_role - Process role (coordinator/segment)
- gp_session_id - Unique session identifier across the cluster
- optimizer - Enable/disable GPORCA src/backend/utils/misc/guc_gp.c:85

Sources: src/backend/utils/misc/guc.c:1-200 src/backend/utils/misc/guc_gp.c:1-150
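As a sketch of how server-side code (for example, an extension running inside a backend) uses the GUC entry points listed above, the snippet below reads and then overrides a setting. SetConfigOption() and GetConfigOption() are the real functions from guc.c; demo_adjust_work_mem() and the choice of work_mem are illustrative.

```c
/*
 * Sketch: reading and writing GUCs from backend C code. Assumes it runs
 * inside a backend process (e.g., in an extension).
 */
#include "postgres.h"
#include "utils/guc.h"

static void
demo_adjust_work_mem(void)
{
    /* Read the current setting as a string (errors if the GUC is unknown). */
    const char *current = GetConfigOption("work_mem", false, false);

    elog(LOG, "work_mem is currently %s", current);

    /*
     * Change it for this session only; PGC_USERSET / PGC_S_SESSION mirror an
     * interactive SET work_mem command.
     */
    SetConfigOption("work_mem", "64MB", PGC_USERSET, PGC_S_SESSION);
}
```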
The WAL system ensures durability and crash recovery:
Core Components:
- XLogInsert() - Writes WAL records src/backend/access/transam/xlog.c (see the sketch below)
- XLogFlush() - Forces WAL to disk
- pg_wal/ directory - On-disk location of WAL segment files src/backend/access/transam/xlog.c:109-132

Configuration Parameters:
- max_wal_size_mb - Maximum WAL size before checkpoint src/backend/access/transam/xlog.c:109
- wal_level - Amount of information logged src/backend/access/transam/xlog.c:127
- wal_sync_method - Method for flushing WAL src/backend/access/transam/xlog.c:126

Recovery Process: During startup, the startup process replays WAL records from the last checkpoint to restore the database to a consistent state src/backend/access/transam/xlog.c:234-280
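Inside the backend, the write path follows a fixed pattern: assemble a record, insert it, and (when immediate durability is needed) flush up to its LSN. The sketch below shows that pattern with placeholder rmgr/info values; real callers such as heap_insert() also register buffers and record-specific structs and use their own resource manager IDs.

```c
/*
 * Sketch of the generic WAL insertion pattern (simplified; the rmgr id,
 * info flag, and payload are placeholders).
 */
#include "postgres.h"
#include "access/xlog.h"
#include "access/xloginsert.h"
#include "catalog/pg_control.h"     /* for the XLOG_NOOP placeholder */

static void
demo_log_change(char *payload, int len)
{
    XLogRecPtr  lsn;

    XLogBeginInsert();                  /* start assembling a record */
    XLogRegisterData(payload, len);     /* attach the record payload */

    /* Emit the record; real callers pass their own rmgr id and info flags. */
    lsn = XLogInsert(RM_XLOG_ID, XLOG_NOOP);

    /* Force the record to disk before the change is considered durable. */
    XLogFlush(lsn);
}
```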
Sources: src/backend/access/transam/xlog.c:1-500
Metadata is stored in system catalogs within the pg_catalog schema:
Core Catalogs:
| Catalog | Purpose | File |
|---|---|---|
| pg_proc | Functions and operators | src/include/catalog/pg_proc.h |
| pg_class | Tables, indexes, views | src/include/catalog/pg_class.h |
| pg_attribute | Table columns | src/include/catalog/pg_attribute.h |
| pg_type | Data types | src/include/catalog/pg_type.h |
| pg_database | Databases | src/include/catalog/pg_database.h |
Catalog Version Control:
The CATALOG_VERSION_NO macro src/include/catalog/catversion.h:59 tracks catalog schema changes. Incompatible catalog changes require initdb to recreate the database cluster.
Catalog Queries: All metadata lookups use the syscache mechanism for performance, caching frequently accessed catalog tuples in memory.
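For example, backend code typically fetches a pg_class row through the syscache rather than scanning the catalog directly. The lookup below uses the real syscache API (SearchSysCache1(), ReleaseSysCache(), the RELOID cache); the wrapping function name is illustrative.

```c
/*
 * Sketch: look up a relation's pg_class tuple by OID through the syscache.
 */
#include "postgres.h"
#include "access/htup_details.h"
#include "catalog/pg_class.h"
#include "utils/syscache.h"

static char
demo_get_relkind(Oid relid)
{
    HeapTuple       tuple;
    Form_pg_class   classForm;
    char            relkind;

    /* RELOID is the syscache keyed on pg_class.oid. */
    tuple = SearchSysCache1(RELOID, ObjectIdGetDatum(relid));
    if (!HeapTupleIsValid(tuple))
        elog(ERROR, "cache lookup failed for relation %u", relid);

    classForm = (Form_pg_class) GETSTRUCT(tuple);
    relkind = classForm->relkind;       /* 'r' = table, 'i' = index, ... */

    ReleaseSysCache(tuple);             /* always release the pinned tuple */
    return relkind;
}
```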
Sources: src/include/catalog/pg_proc.h:1-130 src/include/catalog/catversion.h:1-62 doc/src/sgml/catalogs.sgml:1-100
Diagram: Distributed Query Execution (MPP Architecture)
Cloudberry's MPP architecture extends PostgreSQL with parallel execution across segments:
1. Query Distribution Phase:
After standard planning, the coordinator performs distributed query preparation:
Plan Parallelization: cdbllize_decorate() src/backend/cdb/cdbllize.c inserts Motion nodes into the plan tree where data must move between segments:
- MOTION_GATHER - Collect results from all segments to the coordinator
- MOTION_REDISTRIBUTE - Rehash data across segments on different keys
- MOTION_BROADCAST - Send the same data to all segments
- MOTION_EXPLICIT - Direct motion for specific operations

Slice Table Construction: BuildSliceTable() src/backend/cdb/dispatcher/cdbdisp_query.c:1500-1600 divides the plan into execution slices. Each Motion node creates a slice boundary. A SliceTable src/include/nodes/execnodes.h tracks which segments execute each slice.
Gang Allocation: AllocateGang() src/backend/cdb/cdbgang.c:500-600 establishes connections to segment backends. A "gang" is a set of QE processes (one per segment) that execute a slice. Gang types:
- PRIMARY_READER_GANG - Read-only gangs for SELECT queries
- PRIMARY_WRITER_GANG - Writer gangs for DML operations
- SINGLETON_GANG - Single-segment execution

2. Plan Dispatch:
CdbDispatchPlan() src/backend/cdb/dispatcher/cdbdisp_query.c:200-300 serializes the plan and sends it to segments via cdb_dispatch(). The dispatched payload includes:

- The PlannedStmt (via serializeNode())
- The SliceTable identifying which slices run on which segments
- Session state such as GUC settings (e.g., random_page_cost)

3. Segment Execution:
Each segment backend receives the plan and:
- ReadCommand() deserializes the plan via stringToNode()
- standard_ProcessUtility() or ProcessQuery() routes to ExecutorRun()
- The executor runs ExecProcNode() on local data
- heap_insert(), heap_update(), etc. operate only on locally stored tuples

4. Inter-Segment Communication (Motion Nodes):
Motion nodes coordinate data movement:
- ExecMotion() src/backend/executor/execMotion.c:200-400 handles tuple sending/receiving
- SetupInterconnect() src/backend/cdb/cdbinterconnect.c establishes connections between segments
- ic_tcp.c src/backend/cdb/motion/ic_tcp.c - TCP-based interconnect (default)
- ic_udp.c src/backend/cdb/motion/ic_udp.c - UDP-based interconnect (if enabled)
- Senders call SendTuple() to transmit tuples to receivers on other segments
- Receivers call RecvTuple() to pull tuples from senders

5. Distribution Policies:
Table distribution is defined by GpPolicy src/include/catalog/gp_distribution_policy.h, stored in the gp_distribution_policy catalog:
- Hash-distributed tables: rows are mapped to segments by cdbhash() src/backend/cdb/cdbhash.c based on the distribution key columns; the query executor routes tuples to the matching segments (a simplified illustration follows below)

The gp_segment_configuration catalog src/include/catalog/gp_segment_configuration.h tracks segment addresses, ports, and roles (primary/mirror).
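The toy program below illustrates only the idea behind hash distribution: hash the distribution key and reduce it to a segment number. It deliberately uses a generic string hash; the real cdbhash() machinery is type-aware, handles multi-column keys and NULLs, and supports legacy hash modes, so this is not its actual algorithm.

```c
/*
 * Conceptual illustration of hash distribution (NOT the cdbhash.c algorithm).
 */
#include <stdio.h>
#include <stdint.h>

/* Toy FNV-1a hash standing in for the type-aware hashing in cdbhash.c. */
static uint32_t
toy_hash(const char *key)
{
    uint32_t h = 2166136261u;

    for (; *key; key++)
    {
        h ^= (uint8_t) *key;
        h *= 16777619u;
    }
    return h;
}

/* Reduce the hash to a segment number for an n-segment cluster. */
static int
toy_target_segment(const char *dist_key, int num_segments)
{
    return (int) (toy_hash(dist_key) % (uint32_t) num_segments);
}

int
main(void)
{
    const char *keys[] = {"alice", "bob", "carol"};
    int         nseg = 4;               /* e.g., a 4-segment cluster */

    for (int i = 0; i < 3; i++)
        printf("row with key '%s' -> segment %d\n",
               keys[i], toy_target_segment(keys[i], nseg));
    return 0;
}
```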
6. Result Gathering:
- The final plan slice ends in a MOTION_GATHER that collects results to the coordinator
- CdbDispatchGetResults() src/backend/cdb/dispatcher/cdbdisp_query.c waits for all segments to complete

Process Roles:
Gp_role GUC src/backend/utils/misc/guc_gp.c distinguishes process types:
- GP_ROLE_DISPATCH - Coordinator process (QD)
- GP_ROLE_EXECUTE - Segment process (QE)
- GP_ROLE_UTILITY - Single-node mode

Sources: src/backend/cdb/cdbllize.c src/backend/cdb/dispatcher/cdbdisp_query.c:200-300 src/backend/cdb/cdbgang.c:500-600 src/backend/executor/execMotion.c:200-400 src/backend/cdb/cdbinterconnect.c src/backend/cdb/motion/ic_tcp.c src/backend/cdb/cdbhash.c
The build system supports both Unix (autoconf/make) and Windows (MSVC) platforms:
Unix Build:
- The configure script configure:1-100 detects platform capabilities
- configure generates src/Makefile.global with build settings from src/Makefile.global.in
- make compiles source files into executables

Key Build Outputs:
- postgres - Backend server executable
- libpq.so - Client library
- psql - Interactive terminal
- pg_dump / pg_restore - Backup utilities

Windows Build:
Mkvcbuild.pm src/tools/msvc/Mkvcbuild.pm:1-50 generates Visual Studio project files.

GPORCA Integration:
When --enable-orca is specified during configure, the build system links against the GPORCA optimizer library src/backend/optimizer/plan/planner.c:90-92.
Sources: configure:1-100 src/Makefile.global.in:1-50 src/tools/msvc/Mkvcbuild.pm:1-50
Cloudberry uses MVCC to provide concurrent access without read locks:
- Each tuple version carries visibility information (xmin, xmax transaction IDs)
- Obsolete tuple versions remain until VACUUM removes them (a simplified visibility check is sketched below)

Each client connection gets a dedicated backend process forked from the postmaster. This provides isolation but requires shared memory for inter-process communication.
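A drastically simplified view of MVCC visibility is sketched below, assuming every transaction older than the snapshot horizon has committed. The real checks in the heap visibility code also consult commit status, in-progress and aborted transactions, and the current command ID; the types here are illustrative.

```c
/*
 * Simplified MVCC visibility: a tuple version is visible if it was inserted
 * before the snapshot and not deleted before the snapshot.
 */
#include <stdbool.h>
#include <stdio.h>
#include <stdint.h>

typedef uint32_t DemoXid;

typedef struct DemoTupleHeader
{
    DemoXid xmin;               /* inserting transaction id */
    DemoXid xmax;               /* deleting transaction id, 0 if live */
} DemoTupleHeader;

/* Treat every xid below the snapshot horizon as committed (simplification). */
static bool
demo_visible(const DemoTupleHeader *tup, DemoXid snapshot_xmax)
{
    bool inserted_before_snapshot = tup->xmin < snapshot_xmax;
    bool deleted_before_snapshot  = tup->xmax != 0 && tup->xmax < snapshot_xmax;

    return inserted_before_snapshot && !deleted_before_snapshot;
}

int
main(void)
{
    DemoTupleHeader live    = { .xmin = 100, .xmax = 0 };
    DemoTupleHeader deleted = { .xmin = 100, .xmax = 150 };

    printf("live tuple visible to snapshot 200?    %d\n", demo_visible(&live, 200));
    printf("deleted tuple visible to snapshot 200? %d\n", demo_visible(&deleted, 200));
    printf("deleted tuple visible to snapshot 120? %d\n", demo_visible(&deleted, 120));
    return 0;
}
```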
The query planner estimates execution costs for different access paths:
Statistics collected by ANALYZE are stored in the pg_statistic catalog and inform these cost estimates.
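As a rough worked example of the cost model, a sequential scan's estimate is dominated by page fetches plus per-tuple CPU cost. The numbers below use stock PostgreSQL defaults (seq_page_cost = 1.0, cpu_tuple_cost = 0.01); the real cost_seqscan() in costsize.c also charges per-qual CPU cost and accounts for parallel workers.

```c
/*
 * Back-of-the-envelope sequential-scan cost estimate:
 *   cost ~= pages * seq_page_cost + tuples * cpu_tuple_cost
 */
#include <stdio.h>

int
main(void)
{
    double seq_page_cost  = 1.0;    /* default seq_page_cost */
    double cpu_tuple_cost = 0.01;   /* default cpu_tuple_cost */

    double pages  = 1000.0;         /* relation size, as in pg_class.relpages */
    double tuples = 100000.0;       /* row estimate, as in pg_class.reltuples */

    double total_cost = pages * seq_page_cost + tuples * cpu_tuple_cost;

    /* 1000 * 1.0 + 100000 * 0.01 = 2000.00 cost units */
    printf("estimated seqscan cost: %.2f\n", total_cost);
    return 0;
}
```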
Sources: src/backend/optimizer/path/costsize.c doc/src/sgml/catalogs.sgml
This wiki is organized hierarchically: each section contains detailed subsections covering specific subsystems with code references and implementation details.
Sources: Table of Contents