This document provides a technical overview of Apache Cloudberry, an open-source distributed relational database management system. It introduces the system's architecture, core components, and key design concepts. For detailed information about specific subsystems, refer to their dedicated pages.
Apache Cloudberry is a massively parallel processing (MPP) database system derived from Greenplum Database, which is built on PostgreSQL 14. The codebase extends PostgreSQL's single-node architecture with distributed query execution, parallel processing across segments, and advanced query optimization through GPORCA.
Version Information:
- Product: Apache Cloudberry configure:583
- Version: 3.0.0-devel configure:585
- Catalog version number: 302509031 src/include/catalog/catversion.h:59

Key Differentiators from PostgreSQL:
- Distributed transaction management src/backend/cdb/cdbtm.c

Sources: src/include/catalog/catversion.h:1-62 configure:583-588
Diagram: Apache Cloudberry System Architecture with Code Entities
The system architecture consists of five logical layers:
Client Layer: Applications connect via the libpq protocol. The PQexec() function in src/interfaces/libpq/fe-exec.c sends queries and receives results (see the client sketch after this list).
Coordinator Layer: The Postmaster process src/backend/postmaster/postmaster.c:1-300 forks backend processes that run PostgresMain() src/backend/tcop/postgres.c:100-200. Each backend parses SQL via raw_parser() src/backend/parser/gram.y, analyzes it via parse_analyze(), and plans queries using either standard_planner() src/backend/optimizer/plan/planner.c:100-200 or GPORCA's optimize_query().
Dispatch Layer: For distributed queries, CdbDispatchPlan() src/backend/cdb/dispatcher/cdbdisp_query.c divides the plan into slices and sends them to segment backends.
Executor Layer: ExecutorStart() and ExecutorRun() src/backend/executor/execMain.c process plan nodes. Motion nodes src/backend/executor/execMotion.c handle inter-segment data transfer.
Storage Layer: Data modifications go through heap access methods like heap_insert() src/backend/access/heap/heapam.c and are logged via XLogInsert() src/backend/access/transam/xlog.c:1-100. System catalogs are cached by catcache.c.
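To make the client layer concrete, the following minimal libpq program connects, runs a query with PQexec(), and prints the result. The API calls are the real libpq interface; the connection string is a placeholder, so adjust host, port, and database for your cluster.

```c
/*
 * Minimal libpq client: connect, run a query via PQexec(), print the result.
 * Build (paths vary per installation):
 *   cc client.c -I$(pg_config --includedir) -L$(pg_config --libdir) -lpq
 */
#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>

int
main(void)
{
    /* Connection parameters are illustrative; adjust for your cluster. */
    PGconn *conn = PQconnectdb("host=localhost port=5432 dbname=postgres");

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        PQfinish(conn);
        return EXIT_FAILURE;
    }

    /* PQexec() blocks until the coordinator returns a complete result set. */
    PGresult *res = PQexec(conn, "SELECT version()");

    if (PQresultStatus(res) == PGRES_TUPLES_OK && PQntuples(res) > 0)
        printf("server: %s\n", PQgetvalue(res, 0, 0));
    else
        fprintf(stderr, "query failed: %s", PQerrorMessage(conn));

    PQclear(res);
    PQfinish(conn);
    return EXIT_SUCCESS;
}
```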
Sources: src/backend/postmaster/postmaster.c:1-300 src/backend/tcop/postgres.c:1-200 src/backend/optimizer/plan/planner.c:1-200 src/backend/cdb/dispatcher/cdbdisp_query.c src/backend/executor/execMain.c
Diagram: Process Architecture with Code References
The Postmaster process src/backend/postmaster/postmaster.c:200-500 acts as the supervisor process that:
Listens for client connections: The ServerLoop() function src/backend/postmaster/postmaster.c:1700-1800 monitors ListenSocket[] on port PostPortNumber (configurable via the port GUC, default 5432) src/backend/postmaster/postmaster.c:230.
Forks backend processes: Upon accepting a connection via BackendStartup() src/backend/postmaster/postmaster.c:4200-4400, the postmaster calls fork() and the child runs PostgresMain() src/backend/tcop/postgres.c:4000-4100. Each backend has a unique MyProcPid and MyBackendId (a simplified sketch of this flow appears at the end of this section).
Manages auxiliary processes: Background workers are forked during startup via:
- StartupProcessMain() src/backend/postmaster/startup.c - Replays WAL on startup
- BackgroundWriterMain() src/backend/postmaster/bgwriter.c - Writes dirty buffers
- CheckpointerMain() - Coordinates checkpoints and syncs XLogCtl
- WalWriterMain() src/backend/postmaster/walwriter.c - Flushes WAL buffers via XLogWrite()
- AutoVacLauncherMain() src/backend/postmaster/autovacuum.c - Schedules vacuum operations
- PgstatCollectorMain() src/backend/postmaster/pgstat.c - Aggregates statistics

Cloudberry-specific processes:
- FtsProbeMain() src/backend/postmaster/fts.c - Monitors segment availability
- WalReceiverMain() src/backend/replication/walreceiver.c - Receives WAL from the primary (replication)

Shared Memory Architecture:
All processes attach to shared memory segments created by PGSharedMemoryCreate() src/backend/storage/ipc/pg_shmem.c. Key shared structures include:
- XLogCtlData src/backend/access/transam/xlog.c:651-791 - WAL control structure with insertion locks, buffer pointers, and checkpoint state
- BufferDescriptors - Buffer pool metadata for shared_buffers
- ProcArray src/backend/storage/ipc/procarray.c - Tracks active transactions for MVCC snapshots
- LockManager structures for lock tables

Process Communication:
Processes signal each other via kill() and signal handlers src/backend/tcop/postgres.c:3000-3100.

Sources: src/backend/postmaster/postmaster.c:1-500 src/backend/tcop/postgres.c:1-200 src/backend/access/transam/xlog.c:651-791 src/backend/storage/ipc/pg_shmem.c src/backend/postmaster/fts.c
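The fork-per-connection flow described above can be reduced to the following toy sketch. This is not the postmaster's actual code: the real ServerLoop()/BackendStartup() path also handles multiple listen sockets, signals, shared-memory attachment, and crash recovery, and backend_main() here is only a stand-in for PostgresMain().

```c
/*
 * Toy sketch of the fork-per-connection model (NOT the real postmaster code).
 * One child process is forked per accepted client connection.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <netinet/in.h>
#include <arpa/inet.h>

static void
backend_main(int client_fd)
{
    /* Stand-in for PostgresMain(): a real backend would attach to shared
     * memory and enter the query loop here. */
    close(client_fd);
    _exit(0);
}

int
main(void)
{
    int                 listen_fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in  addr = {0};

    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(5432);            /* PostPortNumber's default */

    if (listen_fd < 0 ||
        bind(listen_fd, (struct sockaddr *) &addr, sizeof(addr)) < 0 ||
        listen(listen_fd, 128) < 0)
    {
        perror("socket/bind/listen");
        return EXIT_FAILURE;
    }

    for (;;)                                /* analogous to ServerLoop() */
    {
        int client_fd = accept(listen_fd, NULL, NULL);

        if (client_fd < 0)
            continue;

        if (fork() == 0)                    /* analogous to BackendStartup() */
            backend_main(client_fd);        /* child: one backend per client */

        close(client_fd);                   /* parent keeps listening */
        while (waitpid(-1, NULL, WNOHANG) > 0)
            ;                               /* reap exited backends */
    }
}
```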
Diagram: Query Processing Pipeline with Code Functions
Query execution follows a multi-stage pipeline:
1. Protocol & Parsing Phase:
- The client submits the query with PQexec() src/interfaces/libpq/fe-exec.c using the frontend/backend protocol
- PostgresMain() src/backend/tcop/postgres.c:4100-4200 calls ReadCommand() to receive the query string (stored in debug_query_string)
- raw_parser() src/backend/parser/parser.c invokes base_yyparse(), which is generated from src/backend/parser/gram.y:1-300
- The result is a RawStmt tree src/include/nodes/parsenodes.h

2. Analysis Phase:
- parse_analyze() src/backend/parser/analyze.c performs semantic analysis
- transformStmt() dispatches on statement type (SELECT → transformSelectStmt(), INSERT → transformInsertStmt(), etc.)
- Relations are looked up via RangeVarGetRelid(), and name/type resolution is performed
- The result is a Query tree src/include/nodes/parsenodes.h with resolved relations and expressions

3. Planning Phase:
The system chooses between two optimizers based on the optimizer GUC:
Standard Planner src/backend/optimizer/plan/planner.c:200-300:
- standard_planner() calls subquery_planner() → grouping_planner()
- query_planner() generates join orders and access paths
- create_paths() src/backend/optimizer/path/allpaths.c builds RelOptInfo structures with different access paths (SeqScan, IndexScan, etc.)
- create_plan() src/backend/optimizer/plan/createplan.c:200-300 converts the cheapest Path into a Plan tree

GPORCA src/backend/optimizer/orca:
- When optimizer=on, optimize_query() translates the Query to DXL (Data eXchange Language)
- COptTasks::OptimizeTask() (in gpopt) performs advanced optimization
- The output is a PlannedStmt src/include/nodes/plannodes.h containing the execution plan tree
4. Execution Phase:
- CreateQueryDesc() src/backend/tcop/pquery.c wraps the plan in a QueryDesc structure
- ExecutorStart() src/backend/executor/execMain.c:100-200 calls InitPlan() to initialize executor state (EState and PlanState nodes)
- ExecutorRun() src/backend/executor/execMain.c:300-400 calls ExecutePlan(), which repeatedly invokes ExecProcNode() src/backend/executor/execProcnode.c; dispatch depends on the node type (a simplified iterator sketch follows this pipeline):
  - SeqScanState → ExecSeqScan() src/backend/executor/nodeSeqscan.c
  - IndexScanState → ExecIndexScan() src/backend/executor/nodeIndexscan.c
  - ModifyTableState → ExecModifyTable() src/backend/executor/nodeModifyTable.c
- ExecutorEnd() src/backend/executor/execMain.c:500-600 cleans up executor state

5. Storage Access:
Data modifications go through heap access methods:
- heap_insert() src/backend/access/heap/heapam.c:2000-2100 - Insert new tuples
- heap_update() - Update existing tuples (marks the old version as deleted, inserts a new version)
- heap_delete() - Mark a tuple as deleted (MVCC: sets xmax rather than physically deleting)

All modifications are logged via XLogInsert() src/backend/access/transam/xlog.c:900-1000 before being applied (write-ahead logging ensures durability).
Sources: src/backend/tcop/postgres.c:4100-4200 src/backend/parser/gram.y:1-300 src/backend/parser/analyze.c src/backend/optimizer/plan/planner.c:1-300 src/backend/optimizer/path/allpaths.c src/backend/optimizer/plan/createplan.c:1-300 src/backend/executor/execMain.c:1-600 src/backend/executor/execProcnode.c src/backend/access/heap/heapam.c:2000-2100
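The per-node dispatch in the execution phase follows the classic iterator ("Volcano") model: each call to ExecProcNode() returns at most one tuple, pulling from child nodes on demand. The standalone sketch below mimics that shape; DemoNode and DemoTuple are illustrative toy types, not the executor's real PlanState and TupleTableSlot structures.

```c
/*
 * Simplified illustration of the Volcano (iterator) execution model used by
 * ExecProcNode(): each node returns one tuple per call, pulling from its
 * child as needed.
 */
#include <stdio.h>
#include <stdbool.h>

typedef struct DemoTuple { int id; } DemoTuple;

typedef struct DemoNode
{
    struct DemoNode *child;                         /* input node, if any */
    bool (*next)(struct DemoNode *, DemoTuple *);   /* "ExecProcNode" hook */
    int  scan_pos;                                  /* state for the scan */
} DemoNode;

/* Leaf node: behaves like a sequential scan over five dummy tuples. */
static bool
demo_seqscan_next(DemoNode *node, DemoTuple *out)
{
    if (node->scan_pos >= 5)
        return false;                               /* end of relation */
    out->id = node->scan_pos++;
    return true;
}

/* Inner node: behaves like a filter, keeping only even ids. */
static bool
demo_filter_next(DemoNode *node, DemoTuple *out)
{
    while (node->child->next(node->child, out))
        if (out->id % 2 == 0)
            return true;                            /* qualifying tuple */
    return false;
}

int
main(void)
{
    DemoNode scan   = { .next = demo_seqscan_next };
    DemoNode filter = { .child = &scan, .next = demo_filter_next };
    DemoTuple tup;

    /* The top-level loop mirrors ExecutePlan(): pull until the plan is done. */
    while (filter.next(&filter, &tup))
        printf("tuple %d\n", tup.id);
    return 0;
}
```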
The Grand Unified Configuration (GUC) system manages runtime and startup parameters:
Configuration Sources:
- postgresql.conf - Primary configuration file src/backend/utils/misc/guc.c:127
- postgresql.auto.conf - Auto-generated settings from ALTER SYSTEM commands
- Command-line options passed to the postgres process
- SET commands at the session level

Key GUC Functions:
- ProcessConfigFile() - Reads and applies configuration files src/backend/utils/misc/guc.c:256-258
- SetConfigOption() - Sets individual parameters
- GetConfigOption() - Retrieves current values (a usage sketch follows at the end of this section)

Cloudberry Extensions: Additional MPP-specific parameters are defined in src/backend/utils/misc/guc_gp.c, including:
- gp_role - Process role (coordinator/segment)
- gp_session_id - Unique session identifier across the cluster
- optimizer - Enable/disable GPORCA src/backend/utils/misc/guc_gp.c:85

Sources: src/backend/utils/misc/guc.c:1-200 src/backend/utils/misc/guc_gp.c:1-150
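As a sketch of how server-side code (for example, an extension running inside a backend) uses the GUC entry points listed above, the snippet below reads and then overrides a setting. SetConfigOption() and GetConfigOption() are the real functions from guc.c; demo_adjust_work_mem() and the choice of work_mem are illustrative.

```c
/*
 * Sketch: reading and writing GUCs from backend C code. Assumes it runs
 * inside a backend process (e.g., in an extension).
 */
#include "postgres.h"
#include "utils/guc.h"

static void
demo_adjust_work_mem(void)
{
    /* Read the current setting as a string (errors if the GUC is unknown). */
    const char *current = GetConfigOption("work_mem", false, false);

    elog(LOG, "work_mem is currently %s", current);

    /*
     * Change it for this session only; PGC_USERSET / PGC_S_SESSION mirror an
     * interactive SET work_mem command.
     */
    SetConfigOption("work_mem", "64MB", PGC_USERSET, PGC_S_SESSION);
}
```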
The WAL system ensures durability and crash recovery:
Core Components:
- XLogInsert() - Writes WAL records src/backend/access/transam/xlog.c (see the sketch below)
- XLogFlush() - Forces WAL to disk
- pg_wal/ directory - On-disk location of WAL segment files src/backend/access/transam/xlog.c:109-132

Configuration Parameters:
- max_wal_size_mb - Maximum WAL size before checkpoint src/backend/access/transam/xlog.c:109
- wal_level - Amount of information logged src/backend/access/transam/xlog.c:127
- wal_sync_method - Method for flushing WAL src/backend/access/transam/xlog.c:126

Recovery Process: During startup, the startup process replays WAL records from the last checkpoint to restore the database to a consistent state src/backend/access/transam/xlog.c:234-280
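Inside the backend, the write path follows a fixed pattern: assemble a record, insert it, and (when immediate durability is needed) flush up to its LSN. The sketch below shows that pattern with placeholder rmgr/info values; real callers such as heap_insert() also register buffers and record-specific structs and use their own resource manager IDs.

```c
/*
 * Sketch of the generic WAL insertion pattern (simplified; the rmgr id,
 * info flag, and payload are placeholders).
 */
#include "postgres.h"
#include "access/xlog.h"
#include "access/xloginsert.h"
#include "catalog/pg_control.h"     /* for the XLOG_NOOP placeholder */

static void
demo_log_change(char *payload, int len)
{
    XLogRecPtr  lsn;

    XLogBeginInsert();                  /* start assembling a record */
    XLogRegisterData(payload, len);     /* attach the record payload */

    /* Emit the record; real callers pass their own rmgr id and info flags. */
    lsn = XLogInsert(RM_XLOG_ID, XLOG_NOOP);

    /* Force the record to disk before the change is considered durable. */
    XLogFlush(lsn);
}
```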
Sources: src/backend/access/transam/xlog.c:1-500
Metadata is stored in system catalogs within the pg_catalog schema:
Core Catalogs:
| Catalog | Purpose | File |
|---|---|---|
| pg_proc | Functions and operators | src/include/catalog/pg_proc.h |
| pg_class | Tables, indexes, views | src/include/catalog/pg_class.h |
| pg_attribute | Table columns | src/include/catalog/pg_attribute.h |
| pg_type | Data types | src/include/catalog/pg_type.h |
| pg_database | Databases | src/include/catalog/pg_database.h |
Catalog Version Control:
The CATALOG_VERSION_NO macro src/include/catalog/catversion.h:59 tracks catalog schema changes. Incompatible catalog changes require initdb to recreate the database cluster.
Catalog Queries: All metadata lookups use the syscache mechanism for performance, caching frequently accessed catalog tuples in memory.
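For example, backend code typically fetches a pg_class row through the syscache rather than scanning the catalog directly. The lookup below uses the real syscache API (SearchSysCache1(), ReleaseSysCache(), the RELOID cache); the wrapping function name is illustrative.

```c
/*
 * Sketch: look up a relation's pg_class tuple by OID through the syscache.
 */
#include "postgres.h"
#include "access/htup_details.h"
#include "catalog/pg_class.h"
#include "utils/syscache.h"

static char
demo_get_relkind(Oid relid)
{
    HeapTuple       tuple;
    Form_pg_class   classForm;
    char            relkind;

    /* RELOID is the syscache keyed on pg_class.oid. */
    tuple = SearchSysCache1(RELOID, ObjectIdGetDatum(relid));
    if (!HeapTupleIsValid(tuple))
        elog(ERROR, "cache lookup failed for relation %u", relid);

    classForm = (Form_pg_class) GETSTRUCT(tuple);
    relkind = classForm->relkind;       /* 'r' = table, 'i' = index, ... */

    ReleaseSysCache(tuple);             /* always release the pinned tuple */
    return relkind;
}
```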
Sources: src/include/catalog/pg_proc.h:1-130 src/include/catalog/catversion.h:1-62 doc/src/sgml/catalogs.sgml:1-100
Diagram: Distributed Query Execution (MPP Architecture)
Cloudberry's MPP architecture extends PostgreSQL with parallel execution across segments:
1. Query Distribution Phase:
After standard planning, the coordinator performs distributed query preparation:
Plan Parallelization: cdbllize_decorate() src/backend/cdb/cdbllize.c inserts Motion nodes into the plan tree where data must move between segments:
- MOTION_GATHER - Collect results from all segments to the coordinator
- MOTION_REDISTRIBUTE - Rehash data across segments on different keys
- MOTION_BROADCAST - Send the same data to all segments
- MOTION_EXPLICIT - Direct motion for specific operations

Slice Table Construction: BuildSliceTable() src/backend/cdb/dispatcher/cdbdisp_query.c:1500-1600 divides the plan into execution slices. Each Motion node creates a slice boundary. A SliceTable src/include/nodes/execnodes.h tracks which segments execute each slice.
Gang Allocation: AllocateGang() src/backend/cdb/cdbgang.c:500-600 establishes connections to segment backends. A "gang" is a set of QE processes (one per segment) that execute a slice. Gang types:
- PRIMARY_READER_GANG - Read-only gangs for SELECT queries
- PRIMARY_WRITER_GANG - Writer gangs for DML operations
- SINGLETON_GANG - Single-segment execution

2. Plan Dispatch:
CdbDispatchPlan() src/backend/cdb/dispatcher/cdbdisp_query.c:200-300 serializes the plan and sends it to segments via cdb_dispatch(). The dispatched payload includes:

- The PlannedStmt (via serializeNode())
- The SliceTable identifying which slices run on which segments
- Session state such as GUC settings (e.g., random_page_cost)

3. Segment Execution:
Each segment backend receives the plan and:
- ReadCommand() deserializes the plan via stringToNode()
- standard_ProcessUtility() or ProcessQuery() routes to ExecutorRun()
- The executor runs ExecProcNode() on local data
- heap_insert(), heap_update(), etc. operate only on locally stored tuples

4. Inter-Segment Communication (Motion Nodes):
Motion nodes coordinate data movement:
- ExecMotion() src/backend/executor/execMotion.c:200-400 handles tuple sending/receiving
- SetupInterconnect() src/backend/cdb/cdbinterconnect.c establishes connections between segments
- ic_tcp.c src/backend/cdb/motion/ic_tcp.c - TCP-based interconnect (default)
- ic_udp.c src/backend/cdb/motion/ic_udp.c - UDP-based interconnect (if enabled)
- Senders call SendTuple() to transmit tuples to receivers on other segments
- Receivers call RecvTuple() to pull tuples from senders

5. Distribution Policies:
Table distribution is defined by GpPolicy src/include/catalog/gp_distribution_policy.h, stored in the gp_distribution_policy catalog:
- Hash-distributed tables: rows are mapped to segments by cdbhash() src/backend/cdb/cdbhash.c based on the distribution key columns; the query executor routes tuples to the matching segments (a simplified illustration follows below)

The gp_segment_configuration catalog src/include/catalog/gp_segment_configuration.h tracks segment addresses, ports, and roles (primary/mirror).
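The toy program below illustrates only the idea behind hash distribution: hash the distribution key and reduce it to a segment number. It deliberately uses a generic string hash; the real cdbhash() machinery is type-aware, handles multi-column keys and NULLs, and supports legacy hash modes, so this is not its actual algorithm.

```c
/*
 * Conceptual illustration of hash distribution (NOT the cdbhash.c algorithm).
 */
#include <stdio.h>
#include <stdint.h>

/* Toy FNV-1a hash standing in for the type-aware hashing in cdbhash.c. */
static uint32_t
toy_hash(const char *key)
{
    uint32_t h = 2166136261u;

    for (; *key; key++)
    {
        h ^= (uint8_t) *key;
        h *= 16777619u;
    }
    return h;
}

/* Reduce the hash to a segment number for an n-segment cluster. */
static int
toy_target_segment(const char *dist_key, int num_segments)
{
    return (int) (toy_hash(dist_key) % (uint32_t) num_segments);
}

int
main(void)
{
    const char *keys[] = {"alice", "bob", "carol"};
    int         nseg = 4;               /* e.g., a 4-segment cluster */

    for (int i = 0; i < 3; i++)
        printf("row with key '%s' -> segment %d\n",
               keys[i], toy_target_segment(keys[i], nseg));
    return 0;
}
```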
6. Result Gathering:
- The final plan slice ends in a MOTION_GATHER that collects results to the coordinator
- CdbDispatchGetResults() src/backend/cdb/dispatcher/cdbdisp_query.c waits for all segments to complete

Process Roles:
Gp_role GUC src/backend/utils/misc/guc_gp.c distinguishes process types:
- GP_ROLE_DISPATCH - Coordinator process (QD)
- GP_ROLE_EXECUTE - Segment process (QE)
- GP_ROLE_UTILITY - Single-node mode

Sources: src/backend/cdb/cdbllize.c src/backend/cdb/dispatcher/cdbdisp_query.c:200-300 src/backend/cdb/cdbgang.c:500-600 src/backend/executor/execMotion.c:200-400 src/backend/cdb/cdbinterconnect.c src/backend/cdb/motion/ic_tcp.c src/backend/cdb/cdbhash.c
The build system supports both Unix (autoconf/make) and Windows (MSVC) platforms:
Unix Build:
- The configure script configure:1-100 detects platform capabilities
- configure generates src/Makefile.global with build settings from src/Makefile.global.in
- make compiles source files into executables

Key Build Outputs:
- postgres - Backend server executable
- libpq.so - Client library
- psql - Interactive terminal
- pg_dump / pg_restore - Backup utilities

Windows Build:
Mkvcbuild.pm src/tools/msvc/Mkvcbuild.pm:1-50 generates Visual Studio project files.

GPORCA Integration:
When --enable-orca is specified during configure, the build system links against the GPORCA optimizer library src/backend/optimizer/plan/planner.c:90-92.
Sources: configure:1-100 src/Makefile.global.in:1-50 src/tools/msvc/Mkvcbuild.pm:1-50
Cloudberry uses MVCC to provide concurrent access without read locks:
- Each tuple version carries visibility information (xmin, xmax transaction IDs)
- Obsolete tuple versions remain until VACUUM removes them (a simplified visibility check is sketched below)

Each client connection gets a dedicated backend process forked from the postmaster. This provides isolation but requires shared memory for inter-process communication.
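A drastically simplified view of MVCC visibility is sketched below, assuming every transaction older than the snapshot horizon has committed. The real checks in the heap visibility code also consult commit status, in-progress and aborted transactions, and the current command ID; the types here are illustrative.

```c
/*
 * Simplified MVCC visibility: a tuple version is visible if it was inserted
 * before the snapshot and not deleted before the snapshot.
 */
#include <stdbool.h>
#include <stdio.h>
#include <stdint.h>

typedef uint32_t DemoXid;

typedef struct DemoTupleHeader
{
    DemoXid xmin;               /* inserting transaction id */
    DemoXid xmax;               /* deleting transaction id, 0 if live */
} DemoTupleHeader;

/* Treat every xid below the snapshot horizon as committed (simplification). */
static bool
demo_visible(const DemoTupleHeader *tup, DemoXid snapshot_xmax)
{
    bool inserted_before_snapshot = tup->xmin < snapshot_xmax;
    bool deleted_before_snapshot  = tup->xmax != 0 && tup->xmax < snapshot_xmax;

    return inserted_before_snapshot && !deleted_before_snapshot;
}

int
main(void)
{
    DemoTupleHeader live    = { .xmin = 100, .xmax = 0 };
    DemoTupleHeader deleted = { .xmin = 100, .xmax = 150 };

    printf("live tuple visible to snapshot 200?    %d\n", demo_visible(&live, 200));
    printf("deleted tuple visible to snapshot 200? %d\n", demo_visible(&deleted, 200));
    printf("deleted tuple visible to snapshot 120? %d\n", demo_visible(&deleted, 120));
    return 0;
}
```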
The query planner estimates execution costs for different access paths:
Statistics collected by ANALYZE are stored in the pg_statistic catalog and inform these cost estimates.
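As a rough worked example of the cost model, a sequential scan's estimate is dominated by page fetches plus per-tuple CPU cost. The numbers below use stock PostgreSQL defaults (seq_page_cost = 1.0, cpu_tuple_cost = 0.01); the real cost_seqscan() in costsize.c also charges per-qual CPU cost and accounts for parallel workers.

```c
/*
 * Back-of-the-envelope sequential-scan cost estimate:
 *   cost ~= pages * seq_page_cost + tuples * cpu_tuple_cost
 */
#include <stdio.h>

int
main(void)
{
    double seq_page_cost  = 1.0;    /* default seq_page_cost */
    double cpu_tuple_cost = 0.01;   /* default cpu_tuple_cost */

    double pages  = 1000.0;         /* relation size, as in pg_class.relpages */
    double tuples = 100000.0;       /* row estimate, as in pg_class.reltuples */

    double total_cost = pages * seq_page_cost + tuples * cpu_tuple_cost;

    /* 1000 * 1.0 + 100000 * 0.01 = 2000.00 cost units */
    printf("estimated seqscan cost: %.2f\n", total_cost);
    return 0;
}
```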
Sources: src/backend/optimizer/path/costsize.c doc/src/sgml/catalogs.sgml
This wiki is organized hierarchically: each section contains detailed subsections covering specific subsystems with code references and implementation details.
Sources: Table of Contents