Latest Blog Posts

Which indexes can be corrupted after an operating system upgrade?
Posted by Laurenz Albe in Cybertec on 2025-12-16 at 06:00

Dialog between two men: "We had some corruption after the operating system upgrade." — "What, Windows 11 and a different codepage in Notepad?" — "No, PostgreSQL and a natural language collation."
© Laurenz Albe 2025

Most major Linux distributions have upgraded to the GNU C library version 2.28 or later. Therefore, there is a growing awareness that an operating system upgrade can lead to index corruption. However, rebuilding all your indexes can slow down the upgrade process considerably. In this article, I want to discuss how you can avoid rebuilding more indexes than necessary.

A management summary of the problem

For natural language collation support (the rules to compare and sort strings in natural languages), PostgreSQL largely relies on external libraries. Depending on how you ran initdb, that could be the C library or the ICU library. Sometimes bug fixes in the new library versions that you install during an operating system upgrade change a collation. Since database indexes are pre-sorted data structures, a change in the rules for sorting strings can leave such indexes incorrectly sorted. This index corruption can lead to wrong query results and violated unique constraints.

The solution is to rebuild an affected index with the REINDEX command.

Keeping the down time for an operating system upgrade short

For many users, it is important that any down time of the database is as short as possible. For such users, the need to rebuild indexes is a problem. If you want to play it safe, you cannot start working with the database before rebuilding the indexes is completed. You could also take a more relaxed attitude and risk starting the application right away. Then you would use REINDEX INDEX CONCURRENTLY to rebuild problematic indexes while your application is running. The risk is that your index scans may return wrong results for a while, and few people are willing to take such a risk.
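As a reference, here is a minimal sketch of both steps: checking whether the recorded collation version still matches what the library currently reports (the datcollversion column and pg_database_collation_actual_version() are available from PostgreSQL 15 on), and rebuilding an index. The index name is a placeholder.

-- Compare the collation version recorded for the database with the one
-- the C library / ICU currently reports (PostgreSQL 15 or later).
SELECT datname,
       datcollversion                            AS recorded_version,
       pg_database_collation_actual_version(oid) AS actual_version
FROM   pg_database
WHERE  datname = current_database();

-- Rebuild an affected index, either offline or while the application runs.
REINDEX INDEX my_index;
REINDEX INDEX CONCURRENTLY my_index;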

One way to perform an operating system upgrade that avoids the problem of corrupted indexes and can keep the down time very short is logical replication. Here, the subscriber builds its indexes based on logical information (similar to SQL statements). Consequently, th

[...]

From PGDays to PGCon/fs
Posted by Valeria Kaplan in Data Egret on 2025-12-15 at 22:51

A Journey Through PostgreSQL Events

• Introduction
• Community recognition (transparency, inclusivity, organisational balance)
• If I organise a PostgreSQL event — does it have to be recognised?
• Are community-recognised events better than those that aren’t?
  • Conferences — a quick flyover (pgDays, pgConfs, FOSDEM, PGConf.dev etc.)
• PostgreSQL User Groups / meet-ups
• Key takeaways
• Which PostgreSQL event to attend?

Introduction

Working at a PostgreSQL-focused company for over 10 years takes you places. Specifically, it takes you to PostgreSQL community events that are very different from your standard commercial conference.

The dynamic of community events is different. They feel like a village fête, where people have known each other for ages. There’s familiarity, shared purpose, and a genuine sense of belonging.


At the same time, there are many different Postgres-focused events — community and commercial ones, small meet-ups, bigger pgDays, and even larger pgConfs — each with a slightly different audience and scope in mind.

So when I was recently asked, “Which Postgres conference would you recommend going to?” I realised this might be a good topic for a post.

A quick note — here, I’ll mention a handful of events, predominantly in Europe, but there are, of course, many other Postgres-focused events out there. From PG Down Under in Australia to pgDay Mexico to PGConf NYC 2025 to growing user group meetings in Madagascar and Armenia, and the virtual ones, such as POSETTE and Postgres meetup for all — Postgres definitely has momentum.

When you see 🔔 below, those are actions or deadlines worth noting.

Community recognition

The majority of PostgreSQL conferences can be found on the PostgreSQL.org events page, and of those, some are community recognised (note the small gold medal next to their titles) while others aren’t.

screenshot of conferences listed on postgresql.org. Some conferences have small golden medals next to them.
Source postgresql.org

Community event recognition indicates that an event complies with the PostgreSQL event recog

[...]

Improved Markdown Parsing
Posted by David E. Wheeler on 2025-12-15 at 15:55

Quick announcement to say that I’ve replaced the ancient markdown parser with a new one, discount, which supports tables, code fences, definition lists, and more. I reindexed pg_clickhouse this morning and it’s sooo nice to see the table properly formatted.

New uploads will use this parser for Markdown (but not MultiMarkdown) files from now on. I think I’ll start working on reindexing all existing extensions, too. Give me a holler if you don’t see an improvement in your extensions in the next few days.


PostgreSQL 18 Asynchronous Disk I/O – Deep Dive Into Implementation
Posted by Josef Machytka in credativ on 2025-12-15 at 09:53

AI generated image symbolizing an asynchronously running database

PostgreSQL 17 introduced streaming I/O – grouping multiple page reads into a single system call and using smarter posix_fadvise() hints. That alone gave up to ~30% faster sequential scans in some workloads, but it was still strictly synchronous: each backend process would issue a read and then sit there waiting for the kernel to return data before proceeding. Before PG17, PostgreSQL typically read one 8kB page at a time.

PostgreSQL 18 takes the next logical step: a full asynchronous I/O (AIO) subsystem that can keep multiple reads in flight while backends keep doing useful work. Reads become overlapped instead of only serialized. The AIO subsystem is deliberately targeted at operations that know their future block numbers ahead of time and can issue multiple reads in advance:
  • Heap sequential scans, like plain SELECT and COPY operations that stream lots of data
  • VACUUM on big tables and indexes
  • ANALYZE sampling
  • Bitmap heap scans

Autovacuum benefits from this change too, since its workers share the same VACUUM/ANALYZE code paths. Other operations still remain synchronous for now:

  • B‑tree index scans / index‑only scans
  • Recovery & replication
  • All write operations (INSERT, UPDATE, DELETE, WAL writes)
  • Small OLTP lookups that touch a single heap page

Future work is expected to widen coverage, especially index‑only scans and some write‑path optimizations.
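As a rough orientation (not from the article itself), PostgreSQL 18 exposes the new subsystem through a couple of settings; the exact names, defaults, and restart requirements below are worth verifying against your build and the documentation.

-- Inspect the asynchronous I/O settings introduced in PostgreSQL 18
SHOW io_method;    -- 'worker' (default), 'sync', or 'io_uring' on Linux builds with liburing
SHOW io_workers;   -- number of background I/O worker processes

-- Settings can be changed cluster-wide; changing io_method needs a server restart
ALTER SYSTEM SET io_method = 'io_uring';
ALTER SYSTEM SET io_workers = 8;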

Significant improvements for cloud volumes

Community benchmarks show that PostgreSQL 18 AIO significantly improves cold cache data reads in cloud setups with network‑attached storage where latency is high. AWS documentation states that the average latency of Block Express volumes is “under 500 microseconds for 16 KiB I/O size”, while the latency of General Purpose volumes can exceed 800 microseconds. Some articles suggest that under high load each physical block read from disk can cost around 1ms, while page processing in PostgreSQL is much cheaper. By combining many pages into one read, all

[...]

PostgreSQL Contributor Story: Nishant Sharma
Posted by Floor Drees in EDB on 2025-12-15 at 07:56
Earlier this year we started a program (“Developer U”) to help colleagues who show promise for PostgreSQL Development to become contributors. Meet: Nishant Sharma, Staff SDE, who maintains a list of places in the world he has yet to visit.

Anonymising PII in PostgreSQL with pgEdge Anonymizer
Posted by Dave Page in pgEdge on 2025-12-15 at 06:13

Data privacy regulations such as GDPR, CCPA, and HIPAA have made it increasingly important for organisations to protect personally identifiable information (PII) in their databases. Whether you're creating a development environment from production data, sharing datasets with third parties, or simply trying to minimise risk, you'll often need to anonymise sensitive data whilst maintaining the structure and relationships within your database.

I've been working on a tool to address this need: pgEdge Anonymizer. It's a command-line utility that replaces PII in PostgreSQL databases with realistic but fake values, all whilst preserving referential integrity and data consistency.

The Problem

Consider a typical scenario: you have a production database containing customer records, and you need to create a copy for your development team. The data includes names, email addresses, phone numbers, National Insurance numbers, and credit card details. You can't simply hand over the production data as that would be a compliance nightmare, but you also need the development database to contain realistic data that exercises the same code paths as production.

Manually anonymising this data is tedious and error-prone. You need to ensure that:
  • The same customer email appears consistently across all tables
  • Foreign key relationships remain intact
  • The process is repeatable and auditable
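pgEdge Anonymizer takes care of this for you, but to make the consistency requirement concrete, here is a plain-SQL illustration (not the tool's implementation; table and column names are hypothetical): deriving the fake value deterministically from the original means the same input always maps to the same output, so values stay aligned across tables.

-- Deterministic pseudonymisation: the same source email always produces the
-- same fake address, so customers and orders stay consistent (hypothetical schema).
UPDATE customers
SET    email = 'user_' || substr(md5(email), 1, 12) || '@example.com';

UPDATE orders o
SET    contact_email = 'user_' || substr(md5(o.contact_email), 1, 12) || '@example.com';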

Enter pgEdge Anonymizer

pgEdge Anonymizer addresses these challenges with a simple YAML-based configuration approach. You define which columns contain PII and what type of data they hold, and the tool handles the rest.

Installation

Building from source is straightforward; it produces a single binary in the directory that you can copy wherever you need it.

Configuration

The configuration file defines your database connection and the columns to anonymise. Here's a typical example. Each column is specified using its fully-qualified name () and assigned a pattern [...]

Contributions for week 51, 2025
Posted by Cornelia Biacsics in postgres-contrib.org on 2025-12-14 at 20:57

On December 11, the PGDay CERN 2026 CfP committee met to finalize the talk selection. The committee members are listed here:

  • Tobias Bussmann, SCNAT (chair, non voting)
  • Abel Cabezas Alonso, CERN
  • Maurizio De Giorgi, CERN
  • Andreas Geppert, ZKB
  • Julia Gugel, migrolino
  • Tom Hagel, Volue
  • Svitlana Lytvynenko, CYBERTEC
  • Gülçin Yıldırım Jelinek, Xata

IT Tage Frankfurt Dec 8-11 2025 Workshops:

  • Hans-Jürgen Schönig

Talks:

  • Thomas Koch
  • Raphael Salguero Aragón
  • Robert Baric
  • Hans-Jürgen Schönig
  • Adrien Obernesser
  • Bernd Patolla
  • Michael Mühlbeyer
  • Daniel Westermann
  • Dirk Krautschick
  • Chris Norman Bischoff
  • Jonas Gassenmeyer
  • Markus Dahm
  • Dr. Veikko Krypczyk

Meetup Frankfurt on December 10 2025 organised by Ilya Kosmodemiansky and Diego Calvo de No

Speakers

  • Andreas Scherbaum
  • Dirk Aumueller

New podcast episode What Postgres developers can expect from PGConf.dev published by Claire Giordano from her series “Talking Postgres” with Melanie Plageman.

On December 11 2025, Jimmy Angelakos organized the first PostgreSQL Edinburgh meetup. Chris Ellis and Jimmy Angelakos spoke at this meetup.

The Swiss PostgreSQL Users had a Meetup in Zurich; the speakers were Stefan Keller and Lars Herrmann.

The Talk Selection Committee of the Prague PostgreSQL Developer Day finalized the schedule for P2D2. The team consists of:

  • Pavlo Golub
  • Pavel Hák
  • Hana Litavská
  • Teresa Lopes
  • Mayur B
  • Esther Miñano
  • Josef Šimánek
  • Pavel Stěhule
  • Tomáš Vondra

How does the PostgreSQL Buildfarm check upgrades across versions?
Posted by Andrew Dunstan in EDB on 2025-12-12 at 19:47

From time to time I see questions from otherwise well-informed people about how the PostgreSQL Buildfarm checks pg_upgrade across versions, e.g. how it checks upgrading from release 9.5 to release 18. I realize that this isn't well documented anywhere, so here is a description of the process.

All of the code referenced here can be found at https://github.com/PGBuildFarm/client-code.

The principal buildfarm client script is run_build.pl, which builds and tests a single branch (e.g. REL_17_STABLE). There is another script called run_branches.pl, which calls this run_build

Zero-Downtime PostgreSQL Maintenance with pgEdge
Posted by Antony Pegg in pgEdge on 2025-12-12 at 06:02

PostgreSQL maintenance doesn't have to mean downtime anymore. With pgEdge's zero-downtime node addition, you can perform critical maintenance tasks like version upgrades, hardware replacements, and cluster expansions without interrupting production workloads. Your applications stay online. Your users stay connected. Your business keeps running.

This capability is available across both single-primary deployments (with integration of the open source extension Spock) and globally distributed deployments (by default), giving you the same operational advantages whether you're running a single-region deployment or a globally distributed system. Now, for an incredibly quick and easy approach to zero-downtime node addition, you can use the pgEdge Postgres Control Plane (hosted on GitHub). This approach provides drastically simplified management and orchestration of Postgres databases using a declarative API, whether you're running a single-primary or a globally distributed deployment of PostgreSQL clusters. Spock and other high-availability components come built in alongside community PostgreSQL for quick database administration with simple commands.

And because pgEdge and all associated components are 100% open source under the PostgreSQL license, using 100% core community PostgreSQL, you get to leverage the high-availability components that enable zero-downtime maintenance without vendor lock-in or compatibility concerns.

What Is the Spock Extension?

Spock is pgEdge's advanced logical replication extension for PostgreSQL that enables active-active (multi-master) replication in clusters with row filtering, column projection, conflict handling, and more. Even though Spock originated from earlier projects like pgLogical and BDR 1, it has seen enormous growth. Our dedicated team of PostgreSQL experts continues to push Spock forward, making it a high-performance, enterprise-grade replication system built for distributed environments.

Zero-Downtime Node Addition: Maintenance Without Interruption

Adding a new node [...]

VACUUM Is a Lie (About Your Indexes)
Posted by Radim Marek on 2025-12-11 at 23:30

There is a common misconception that troubles most developers using PostgreSQL: tune VACUUM or run VACUUM, and your database will stay healthy. Dead tuples will get cleaned up. Transaction IDs recycled. Space reclaimed. Your database will live happily ever after.

But there are a couple of dirty "secrets" people are not aware of. The first of them: VACUUM is lying to you about your indexes.

The anatomy of storage

When you delete a row in PostgreSQL, it is just marked as a 'dead tuple': invisible to new transactions but still physically present. Only when all transactions that might still see the row are finished can VACUUM come along and actually remove it, reclaiming the space in the heap (table).

To understand why this matters differently for tables versus indexes, you need to picture how PostgreSQL actually stores your data.

Your table data lives in the heap - a collection of 8 KB pages where rows are stored wherever they fit. There's no inherent order. When you INSERT a row, PostgreSQL finds a page with enough free space and slots the row in. Delete a row, and there's a gap. Insert another, and it might fill that gap - or not - it might fit somewhere else entirely.

This is why SELECT * FROM users without an ORDER BY can return rows in insertion order at first, then in seemingly random order after some updates - and that order can keep changing over time. The heap is like Tetris: rows drop into whatever space is available, leaving gaps when deleted.

Heap Page

When VACUUM runs, it removes those dead tuples and compacts the remaining rows within each page. If an entire page becomes empty, PostgreSQL can reclaim it entirely.
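You can watch this happen with the standard statistics views; here is a small sketch (the users table is just an example).

-- Delete some rows, then look at the dead-tuple counters for the table.
DELETE FROM users WHERE last_login < now() - interval '1 year';

SELECT n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM   pg_stat_user_tables
WHERE  relname = 'users';

VACUUM (VERBOSE) users;   -- removes the dead tuples from the heap pages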

And while indexes are, on the surface, the same collection of 8KB pages, they are different. A B-tree index must maintain sorted order - that's the whole point of its existence and the reason why WHERE id = 12345 is so fast. PostgreSQL can binary-search down the tree instead of scanning every possible row. You can learn more about the fundamentals of B-Tree Indexes and what makes th

[...]

pgBackRest PITR in Docker: a simple demo
Posted by Stefan Fercot in Data Egret on 2025-12-11 at 13:25

While moving production database workloads towards cloud-native (Kubernetes) environments has become very popular lately, plenty of users still rely on good old Docker containers. Compared to running PostgreSQL on bare metal, on virtual machines, or via a Kubernetes operator, Docker adds a bit of complexity, especially once you want to go beyond simple pg_dump / pg_restore for backups, upgrades, and disaster recovery.

A few years back, it was common to find open-sourced Dockerfiles from trusted companies bundling PostgreSQL with the extensions and tooling you needed. Most of those projects are now deprecated, as the ecosystem’s focus has shifted hard towards cloud-native patterns.

In many of my conversations with pgBackRest users, one theme comes up regularly: deploying pgBackRest for backups in Docker is straightforward, but restoring from those backups feels much harder. The usual advice was to “use a trusted Docker image and follow their guidelines”, but those images are mostly out-of-date now. These days you typically need to maintain your own image, and more importantly you need a reliable recovery playbook.

So I asked myself: how hard is point-in-time recovery (PITR) with pgBackRest for a PostgreSQL 18 Docker container? Turns out: not that hard, once you’ve seen it end-to-end.

This post is a small lab you can run locally. You’ll:

  • build a PostgreSQL 18 + pgBackRest image
  • take a full backup
  • create restore points
  • delete data on purpose
  • restore to the moment just before the delete
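The exact commands are shown later in the lab, but for orientation: a named restore point is created from SQL with pg_create_restore_point(), and that name can later serve as a recovery target for the restore.

-- Create a named restore point before the risky operation; the WAL location
-- it returns marks the spot we can later recover to.
SELECT pg_create_restore_point('before_delete');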

The tiny lab setup

The lab image is deliberately small. It just layers pgBackRest on top of the official PostgreSQL 18 image, adds a minimal config, and enables WAL archiving on first init.

Dockerfile

FROM postgres:18

# Install PGDG repository
RUN apt-get update && apt-get install -y \
wget \
gnupg \
lsb-release \
&& wget --quiet -O /usr/share/keyrings/postgresql-archive-keyring.asc https://www.postgresql.org/media/keys/ACCC4CF8.asc \
&& echo "deb [signed-by=/usr/share/keyri
[...]

Postgres 18 New Default for Data Checksums and How to Deal with Upgrades
Posted by Greg Sabino Mullane in Crunchy Data on 2025-12-11 at 13:00

With a recent Postgres patch authored by Greg Sabino Mullane, Postgres takes a new step forward for data integrity: data checksums are now enabled by default.

This appears in the release notes as a fairly minor change, but it significantly boosts the defense against one of the sneakiest problems in data management: silent data corruption.

Let’s dive into what this feature is, what the new default means for you, and how it impacts upgrades.

What is a data checksum?

A data checksum is a simple but powerful technique to verify the integrity of data pages stored on disk. It's like a digital fingerprint for every 8KB block of data (a "page") in your database.

  • Creation: When Postgres writes a data page (table and indexes) to disk, it runs an algorithm on the page's contents to calculate a derived, small value—the checksum.
  • Storage: This checksum is stored in the page header alongside the data.
  • Verification: Whenever Postgres reads that page back from disk, it immediately recalculates the checksum from the data and compares it to the stored value.

If the two values do not match, it means the data page has been altered or corrupted since it was last written. This is important because data corruption can happen silently. By detecting a mismatch, Postgres can immediately raise an error and alert you to a potential problem. Checksums are also an integral part of pgBackRest which uses these checksums to verify backups.
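Two quick checks you can run on an existing cluster (data_checksums is a read-only setting, and the per-database failure counters exist since PostgreSQL 12):

-- Is the cluster running with data checksums?
SHOW data_checksums;

-- Have any checksum failures been detected while reading pages?
SELECT datname, checksum_failures, checksum_last_failure
FROM   pg_stat_database
WHERE  datname IS NOT NULL;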

What is initdb and why does it matter?

The initdb command in Postgres is the utility used to create a new Postgres database cluster and initialize the data directory where Postgres stores all its permanent data. When you run initdb, it does things like:

  1. create the directory structure
  2. create the template databases like template1 and postgres
  3. populate the initial system catalog tables
  4. create the initial version of the server configuration files
  5. enable and start keeping track of checksums

The syntax often looks something lik

[...]

PGIBZ 2025: An Event for the Postgres Community in Ibiza
Posted by Evan Stanton in Data Bene on 2025-12-11 at 00:00

Postgres Ibiza (PGIBZ): An open source conference designed to bring together people with a love for PostgreSQL in Ibiza, a relaxed place for fresh and innovative discussions. An international event run by the nonprofit PostgreSQL España.

This was the first time that the Data Bene team attended the event, and we’re happy to share that it was a very positive experience.

The Conference

Location and Venue

As its name suggests, this conference takes place on the Mediterranean island of Ibiza. For those who are less familiar, the Official Tourism Site provides a nice overview that you may refer to. While it should go without saying, this UNESCO World Heritage Site has incredible offerings outside of the conference itself. Combined with a potentially long-haul flight, it’s strongly recommended to plan an extra day or two (or more) to explore the island and enjoy the local experience.

The event itself is hosted at the Palacio de Congresos de Ibiza, a two-story conference center with multiple rooms and catering. Though located outside of the capital city, the venue can be reached by car in good time (~15 minutes). In addition, the venue is within walking distance of numerous hotels, restaurants, and a beach! We stayed at Hotel Tres Torres, which we can recommend, though admittedly it’s difficult to make a wrong decision, especially after a sprinkle of background research.

Briefly, on the topic of extracurriculars, our go-to recommendation would be to visit the capital city, Ibiza. There you will find a surprisingly large amount of shopping, restaurants, and sightseeing. Add to that a walk through the historic neighborhood and you’ve got quite the memorable experience.

Attendance

This year’s PGIBZ welcomed about 30-40 attendees from across the globe, including China and the United States. And while there are certainly two sides to the coin, I consider this group size as more of a positive. Indeed, during both the presentation Q/A sessions, as well

[...]

Introducing pg_clickhouse
Posted by David Wheeler on 2025-12-10 at 16:34
PostgreSQL Logo ⇔ pg_clickhouse ⇔ ClickHouse Logo

The ClickHouse blog has posted a piece by yours truly introducing pg_clickhouse, a PostgreSQL extension to run ClickHouse queries from PostgreSQL:

While clickhouse_fdw and its predecessor, postgres_fdw, provided the foundation for our FDW, we set out to modernize the code & build process, to fix bugs & address shortcomings, and to engineer into a complete product featuring near universal pushdown for analytics queries and aggregations.

Such advances include:

  • Adopting standard PGXS build pipeline for PostgreSQL extensions
  • Adding prepared INSERT support to and adopting the latest supported release of the ClickHouse C++ library
  • Creating test cases and CI workflows to ensure it works on PostgreSQL versions 13-18 and ClickHouse versions 22-25
  • Support for TLS-based connections for both the binary protocol and the HTTP API, required for ClickHouse Cloud
  • Bool, Decimal, and JSON support
  • Transparent aggregate function pushdown, including for ordered-set aggregates like percentile_cont()
  • SEMI JOIN pushdown

I’ve spent most of the last couple months working on this project, learning a ton about ClickHouse, foreign data wrappers, C and C++, and query pushdown. Interested? Try out the Docker image:

docker run --name pg_clickhouse -e POSTGRES_PASSWORD=my_pass \
       -d ghcr.io/clickhouse/pg_clickhouse:18
docker exec -it pg_clickhouse psql -U postgres -c 'CREATE EXTENSION pg_clickhouse'

Or install it from PGXN (requires C and C++ build tools, cmake, and the openssl libs, libcurl, and libuuid):

pgxn install pg_clickhouse

Or download it and build it yourself from:

Let me know what you think!

More about…
[...]

What you should know about constraints in PostgreSQL
Posted by Gülçin Yıldırım Jelínek in Xata on 2025-12-10 at 12:00
In this blog, we explore Postgres constraints through the pg_constraint catalog, covering table vs. column constraints, constraint triggers, domains and more.

Building a RAG Server with PostgreSQL - Part 3: Deploying Your RAG API
Posted by Dave Page in pgEdge on 2025-12-10 at 09:24

In Part 1 we loaded our documentation into PostgreSQL. In Part 2 we chunked those documents and generated vector embeddings. Now it's time to put it all together with an API that your applications can use.
In this final post, we'll deploy the pgEdge RAG Server to provide a simple HTTP API for asking questions about your content. By the end, you'll have a working RAG system that can answer questions using your own documentation.

What the RAG Server Does

The RAG server sits between your application and the LLM, handling the retrieval part of Retrieval-Augmented Generation. When a query comes in, it:
  • Converts the query to a vector embedding
  • Searches for relevant chunks using both semantic (vector) and keyword (BM25) matching
  • Combines and ranks the results
  • Formats the top results as context for the LLM
  • Sends the context and query to the LLM
  • Returns the generated answer
This hybrid search approach - combining vector similarity with traditional keyword matching - tends to give better results than either method alone. Vector search catches semantically related content even when the exact words differ, while BM25 ensures you don't miss obvious keyword matches.
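The pgEdge RAG Server implements this for you, and its internal queries will differ (BM25 in particular is not part of core Postgres), but a rough plain-SQL sketch of the idea, assuming a hypothetical chunks table with a pgvector embedding column and a tsvector column, could look like this; ts_rank stands in for BM25 and the weights are arbitrary.

-- Hypothetical hybrid retrieval: blend vector similarity with keyword relevance.
-- $1 = query embedding (vector), $2 = query text
SELECT id,
       content,
       1 - (embedding <=> $1)                     AS vector_score,
       ts_rank(content_tsv, plainto_tsquery($2))  AS keyword_score
FROM   chunks
ORDER  BY 0.7 * (1 - (embedding <=> $1))
        + 0.3 * ts_rank(content_tsv, plainto_tsquery($2)) DESC
LIMIT  5;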

Prerequisites

Before we start, you'll need:
  • The database we set up in Parts 1 and 2, with documents and embeddings
  • An API key for your chosen LLM provider (Anthropic, OpenAI, or local Ollama)
  • Go 1.23 or later for building from source

Installing the RAG Server

Clone and build the server; this creates the binary at .

Configuration

The RAG server uses a YAML configuration file. Here's a basic setup. Save this as . Let's break down the key sections:
  • Where the API listens. Default is port 8080 on all interfaces.
  • Paths to files containing your API keys. Each file should contain just the key, nothing else. Make sure they have restrictive permissions (chmod 600).
  • This is where it gets interesting. A pipeline defines a complete RAG configuration: which database to query, which tables to sear[...]

PostGIS Performance: Simplification
Posted by Paul Ramsey in Crunchy Data on 2025-12-09 at 13:00

There’s nothing simple about simplification! It is very common to want to slim down the size of geometries, and there are lots of different approaches to the problem.

We will explore different methods starting with ST_Letters for this rendering of the letter “a”.


SELECT ST_Letters('a');

This is a good starting point, but to show the different effects of different algorithms on things like redundant linear points, we need a shape with more vertices along the straights, and fewer along the curves.


SELECT ST_RemoveRepeatedPoints(ST_Segmentize(ST_Letters('a'), 1), 1);

Here we add in vertices every one meter with ST_Segmentize and ST_RemoveRepeatedPoints to thin out the points along the curves. Already we are simplifying!

Let’s apply the same “remove repeated” algorithm, with a 10 meter tolerance.


WITH a AS (
  SELECT ST_RemoveRepeatedPoints(ST_Segmentize(ST_Letters('a'), 1), 1) AS a
)
SELECT ST_RemoveRepeatedPoints(a, 10) FROM a;

We do have a lot fewer points, and the constant angle curves are well preserved, but some straight lines are no longer legible as such, and there are redundant vertices in the vertical straight lines.

The ST_Simplify function applies the Douglas-Peucker line simplification algorithm to the rings of the polygon. Because it is a line simplifier, it does a cruder job preserving some aspects of the polygon area, like the squareness of the top ligature.


WITH a AS (
  SELECT ST_RemoveRepeatedPoints(ST_Segmentize(ST_Letters('a'), 1), 1) AS a
)
SELECT ST_Simplify(a, 1) FROM a;

The ST_SimplifyVW function applies the Visvalingam–Whyatt algorithm to the rings of the polygon. Visvalingam–Whyatt is better at preserving the shapes of polygons than Douglas-Peucker, but the differences are subtle.


WITH a AS (
  SELECT ST_RemoveRepeatedPoints(ST_Segmentize(ST_Letters('a'), 1), 1) AS a
)
SELECT ST_SimplifyVW(a, 5) FROM a;

Coercing a shape onto a fixed precision grid is another form of simplification, sometimes used to force the edges of adjacent objects

[...]

PostgreSQL High-Availability Architectures
Posted by Hans-Juergen Schoenig in Cybertec on 2025-12-09 at 08:25

PostgreSQL is highly suitable for powering critical applications in all industries. However, to run critical applications, there are key requirements which are absolutely needed: high availability and automatic failover.

This document explains which options are available and which problems one can solve with PostgreSQL. We have listed the most common setups and some of the most common recommendations. 

Scenario 1: Automatic failover with Patroni

The first scenario has been designed to provide a solution to the following requirements and scenario:

  • Run servers on premise or on VMs
  • Automatically handle server errors
  • Make the cluster transparent to the application
  • Handle recovery automatically

This is a typical scenario which is commonly seen in real world deployments. An application has to rely on a highly available database and in case of error we simply want to switch to a backup system.

The solution in this case is a Patroni cluster. It uses distributed consensus based on “etcd” and fully automated failover as well as database recovery. The following image shows how this works:


The image shows a 3-node cluster. The primary server is able to handle read as well as write transactions. The remaining two servers will accept reads. 

Advantages of this approach

This approach has a couple of advantages that are important to understand:

  • The primary server is a single source of truth
  • Replicas allow read scalability
  • A “Shared nothing infrastructure” is highly resilient
  • Easy to scale as needed
  • Replicas can exist in various locations

In general, achieving high availability is possible if you follow a simple design pattern: Keep it simple. The more complexity you add to the system, the more likely it is that something goes wrong or something impacts operations. Simplicity is a key advantage of this approach. 

Transparent connection handling

When dealing with clusters, there is an important question which has to be answered: H

[...]

Building a RAG Server with PostgreSQL - Part 2: Chunking and Embeddings
Posted by Dave Page in pgEdge on 2025-12-09 at 06:30

In Part 1 of this series, we loaded our documentation into PostgreSQL using the pgEdge Document Loader. Our documents are sitting in the database as clean Markdown content, ready for the next step: turning them into something an LLM can search through semantically.

In this post, we'll use pgEdge Vectorizer to chunk those documents and generate vector embeddings. By the end, you'll have a searchable vector database that can find relevant content based on meaning rather than just keywords.

What Are Embeddings and Why Chunk?

Before we dive in, let's quickly cover the concepts. Embeddings are numerical representations of text that capture semantic meaning. Similar concepts end up close together in vector space, so "PostgreSQL replication" and "database synchronisation" would have similar embeddings even though they share few words. This is what makes semantic search possible. Chunking is necessary because embedding models have token limits (typically 8,000 tokens or so), and more importantly, smaller chunks provide more focused results. If you embed an entire 50-page document as one vector, searching for "how to create an index" might return that whole document when you really want just the relevant paragraph. Breaking documents into smaller pieces gives you more precise retrieval.
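As a mental model (not pgEdge Vectorizer's actual schema), a chunk store built on pgvector boils down to something like the following, where the vector dimension depends on the embedding model:

-- Hypothetical chunk table: one row per chunk, with its embedding
CREATE TABLE chunks (
    id        bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    doc_id    bigint NOT NULL,
    content   text   NOT NULL,
    embedding vector(1536)
);

-- Semantic search: smallest cosine distance (<=>) to the query embedding wins
-- ($1 = query embedding produced by the same model)
SELECT doc_id, content
FROM   chunks
ORDER  BY embedding <=> $1
LIMIT  5;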

Why Vectorize in the Database?

You could run a separate service to generate embeddings, but pgEdge Vectorizer takes a different approach: it runs inside PostgreSQL as an extension. When you insert or update documents, triggers automatically chunk the text and queue it for embedding. Background workers process the queue asynchronously, calling your chosen embedding API and storing the results (a bare-bones illustration of this pattern follows the list below). This has several advantages:
  • No external service to deploy and manage
  • Embeddings stay in sync with your data automatically
  • Everything lives in one transactional system
  • You can use standard SQL for similarity search
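To be clear, pgEdge Vectorizer ships its own schema, triggers, and workers; purely to illustrate the trigger-plus-queue pattern described above, a stripped-down version might look like this (assuming a hypothetical documents table with id and content columns):

-- Illustrative only: a work queue plus a trigger that enqueues changed rows
CREATE TABLE embedding_queue (
    doc_id      bigint NOT NULL,
    enqueued_at timestamptz NOT NULL DEFAULT now()
);

CREATE FUNCTION enqueue_for_embedding() RETURNS trigger AS $$
BEGIN
    INSERT INTO embedding_queue (doc_id) VALUES (NEW.id);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER documents_embed_queue
AFTER INSERT OR UPDATE OF content ON documents
FOR EACH ROW EXECUTE FUNCTION enqueue_for_embedding();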

Prerequisites

Before we start, you'll need:
  • PostgreSQL 14 or later (we set this up in Part 1)
  • The pgvector ex
[...]

PostGIS Day 2025 Recap: AI, Lakehouses and Geospatial Community
Posted by Paul Ramsey in Crunchy Data on 2025-12-08 at 17:23

On Nov. 20, the day after GIS Day, Elizabeth Christensen and I hosted the 7th annual PostGIS Day, a celebration of the Little Spatial Database That Could. Brought to you this year by Snowflake, the event featured an amazing collection of speakers from around the globe — from India to Africa, Europe to North America.

The themes this year were, if you can forgive the pun, all over the map! Moving beyond the generic hype of AI into the practical reality of agents, we saw tools from Felt, Carto and Bunting Labs that can actually write, execute and debug spatial SQL on the fly. We also saw the lakehouse architecture take center stage, with PostGIS acting less as a data silo and more as a high-performance connector to the vast world of object storage, Iceberg and GeoParquet via tools such as pg_lake and Apache Sedona.

But it wasn't all bleeding edge; we grounded the day in massive, industrial-scale success stories. From enabling mapping at IGN France for more than 20 years to powering State Farm’s critical disaster response platforms, PostGIS remains the bedrock of modern geospatial infrastructure. A full playlist of PostGIS Day 2025 is available on the Snowflake YouTube channel. Here's a deeper look at what we learned.

Spatial AI agents: The real deal

We've all heard the buzzwords, but this year, we saw the code. The concept of AI agents was everywhere, but crucially, it wasn't just chatbots hallucinating SQL. We saw systems designed to iterate.

Brendan Ashworth from Bunting Labs drilled down to what "agent" actually means in the geospatial context. It isn't just asking a question and getting an answer. It's a loop. The agent writes a query, runs it against PostGIS, reads the error message if it fails, corrects itself, and runs it again. He showed us Mundi, an AI-native WebGIS that treats PostGIS as its execution engine, solving complex spatial problems without a human holding its hand.

Jaime Sanchez and Mamata Akella from Felt showed how they are teaching Claude to think spatially. They are

[...]

PostgreSQL, MongoDB, and what “cannot scale” really means
Posted by Umair Shahid in Stormatics on 2025-12-08 at 15:58

Last week, I read The Register’s coverage of MongoDB CEO Chirantan “CJ” Desai telling analysts that a “super-high growth AI company … switched from PostgreSQL to MongoDB because PostgreSQL could not just scale.” (The Register)

I believe you can show the value of your own technology without tearing down another. That is really what this post is about.

I run a company that lives inside PostgreSQL production issues every day. We deal with performance, HA and DR design, cost optimisation, and the occasional “we are down, please help” call from a fintech or SaaS platform. My perspective comes from that hands-on work, not from a quarterly earnings script.

Scaling stories are always more complex than a soundbite

When a CEO says, “This AI customer could not scale with PostgreSQL,” a lot of details disappear:

  • What did the schema and access patterns look like?
  • How was the application managing connections and transactions?
  • What was running underneath? Cloud managed PostgreSQL, a home-rolled cluster, or something else?
  • Were vertical and horizontal scaling options fully explored?
  • Did the team have PostgreSQL specialists involved, or was everything on default settings?

None of that reduces MongoDB’s strengths. Document databases are excellent for certain workloads: highly variable document shapes, rapid iteration, and development teams that think in JSON first. MongoDB has earned its place.

My concern is with the narrative that PostgreSQL, as a technology, “cannot scale,” especially given that it is the most popular database among professional developers and has overtaken MongoDB in the DB-Engines ranking over the last decade. (DB-Engines)


The method for calculating this ranking is detailed here: https://db-engines.com/en/ranking_definition

Every dat

[...]

Contributions for week 50, 2025
Posted by Cornelia Biacsics in postgres-contrib.org on 2025-12-07 at 21:26

PGUG.EE met on December 3 2025 in Estonia, organized by Kaarel Moppel & Ervin Weber

Talks

  • Mayuresh Bagayatkar
  • Alexander Matrunich
  • Ants Aasma
  • Kaarel Moppel

Bruce Momjian spoke at the PG Armenia Community Meetup, organised by Emma Saroyan on December 4 2025.

A Meetup Quiz?
Posted by Bruce Momjian in EDB on 2025-12-05 at 13:00

I have attended over one hundred Postgres meetups over the years. The usual format is: food with individual discussion, lecture with group questions, and finally more individual discussion. I just spoke at an Armenia PostgreSQL User Group meetup and the event organizer Emma Saroyan did something different — she did a group mobile phone quiz after my lecture. To create the quiz questions before the event, she studied my slides and viewed a previous recording of my talk. The winner got a copy of her book.

I think the quiz was a great post-lecture group activity, and gave dynamism to the content. I know people will remember the quiz fondly when they think about the meetup. I encourage all meetups to have similar quizzes, though they do take work to create.

A deeper look at old UUIDv4 vs new UUIDv7 in PostgreSQL 18
Posted by Josef Machytka in credativ on 2025-12-05 at 11:43

In the past there have been many discussions about using UUID as a primary key in PostgreSQL. For some applications, even a BIGINT column does not have sufficient range: it is a signed 8‑byte integer with a range of −9,223,372,036,854,775,808 to +9,223,372,036,854,775,807. Although these values look big enough, if we think about web services that collect billions or more records daily, this number becomes less impressive. Simple integer values can also cause value conflicts in distributed systems, or in data lakehouses when combining data from multiple source databases, etc.

However, the main practical problem with UUIDv4 as a primary key in PostgreSQL was not lack of range, but the complete randomness of the values. This randomness causes frequent B‑tree page splits, a highly fragmented primary key index, and therefore a lot of random disk I/O. There have already been many articles and conference talks describing this problem. What many of these resources did not do, however, was dive deep into the on‑disk structures. That’s what I wanted to explore here.
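To make the comparison tangible, PostgreSQL 18 ships a built-in uuidv7() function alongside the familiar gen_random_uuid() (UUIDv4); a quick sketch with a hypothetical table:

-- Time-ordered UUIDv7 vs fully random UUIDv4
SELECT uuidv7();            -- new in PostgreSQL 18, time-ordered
SELECT gen_random_uuid();   -- UUIDv4, fully random

-- Using UUIDv7 as a primary key keeps new keys arriving in roughly ascending
-- order, which avoids the random B-tree page splits described above.
CREATE TABLE events (
    id      uuid PRIMARY KEY DEFAULT uuidv7(),
    payload jsonb
);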


What are UUIDs

UUID (Universally Unique Identifier) is a 16‑byte integer value (128 bits), which has 2^128 possible combinations (approximately 3.4 × 10^38). This range is so large that, for most applications, the probability of a duplicate UUID is practically zero. Wikipedia shows a calculation demonstrating that the probability of finding a duplicate within 103 trillion version‑4 UUIDs is about one in a billion. Another often‑quoted rule of thumb is that to get a 50% chance of one collision, you’d have to generate roughly 1 billion UUIDs every second for about 86 years.


Values are usually represented as a 36‑character string with hexadecimal digits and hyphens, for example: f47ac10b-58cc-4372-a567-0e02b2c3d479. The canonical layout is 8‑4‑4‑4‑12 characters. The first character in the third block and the first character in the fourth block have special meaning: xxxxxxxx-xxxx-Vxxx-Wxxx-xxxxxxxxxxxx. V marks UUI

[...]

The Future of the PostgreSQL Hacking Workshop
Posted by Robert Haas in EDB on 2025-12-04 at 18:58

The PostgreSQL Hacking Workshop will be taking a well-earned Christmas break in December of 2025. The future of the workshop is a little bit unclear, because I'm continuing to have a bit of trouble finding enough good talks online to justify doing one per month: the best source of talks for the event is pgconf.dev, but not all of those talks are about hacking on PostgreSQL, and not all of those that are about hacking are equally interesting to potential attendees. Also, we've already eaten through much of the backlog of older hacking-related talks that are posted online. Many conferences have few hacking-related talks, or just don't post any videos (which are expensive and time-consuming to produce).

Read more »

PostgreSQL Contributor Story: Bryan Green
Posted by Floor Drees in EDB on 2025-12-04 at 15:52
Earlier this year we started a program (“Developer U”) to help colleagues who show promise for PostgreSQL Development to become contributors. Meet Bryan Green, working on the Platform Operations team at EDB, who just enjoys understanding how things work at the lowest levels.

JIT, episode III: warp speed ahead
Posted by Pierre Ducroquet on 2025-12-04 at 14:43

Previously…

In our first JIT episode, we discussed how we could, using copy-patch, easily create a JIT compiler for PostgreSQL, with a slight improvement in performance compared to the PostgreSQL interpreter.

In our second episode, I talked about the performance wall and how hard it was to have a real leap in performance compared to the interpreter. But it ended with a positive outlook, a nice performance jump that I was preparing at that moment…

Beating the interpreter, let’s get back to the basics

The interpreter will run each opcode for every record it has to process. Everything it has to do for each record that could be done only once is better done once, obviously. And this is where a JIT can beat it. The JIT compiler can choose optimizations that would require checks at each opcode in the interpreter, and would thus be self-defeating for the interpreter. For instance, I mentioned creating inlined opcodes for common function calls like int4eq: replacing the indirect call to int4eq with a comparison of the function pointer and then an inlined version would indeed be silly, since the comparison is going to waste a lot of time already.

So, what can’t the interpreter do? It sure can’t easily remove indirect calls, but this is a 1% performance gain, 2% at most. You won’t get to the headlines with that, right? Well, when in doubt, look at the past…

A short story about belief oriented programming…

A decade ago, I worked at a small company where I heard the weirdest thing ever regarding system performance: “our application is slower when built in 64 bits mode because the bigger pointer size makes it slower”. I didn’t buy this, spent two days digging into the code, and found that it was the opposite: 64 bits brought such a performance improvement that the entire system collapsed on a mutex that held a core structure in the application… Removing the mutex made the application fly in both 32 and 64 bits, with 64 bits beating 32 bits obviously.

But why is 64 bits faster? We are talking databa

[...]

Postgres Scan Types in EXPLAIN Plans
Posted by Elizabeth Garrett Christensen in Crunchy Data on 2025-12-04 at 13:00

The secret to unlocking performance gains often lies not just in what you ask in a query, but in how Postgres finds the answer. The Postgres EXPLAIN system is great for understanding how data is being queried. One of the secrets to reading EXPLAIN plans is understanding the type of scan done to retrieve the data. The scan type can be the difference between a lightning-fast response and a slow query.


Today I’ll break down the most common scan types, how they work, and when you’ll see them in your queries.

Sequential scan


This type of data scan reads the entire table, row by row checking to see what matches the query conditions. If you have a WHERE or FILTER, Postgres just scans each row looking for matches.

Sequential scans are kind of the foundation of how scans are done, and for many searches, this is what Postgres will use. For very large data sets, or those queried often, sequential scans are not ideal and an index scan may be faster. For that reason, knowing how to spot a seq scan vs an index scan is one of the most important parts of reading an EXPLAIN plan.

EXPLAIN select * from accounts;

QUERY PLAN
-------------------------------------------------------------
Seq Scan on accounts  (cost=0.00..22.70 rows=1270 width=36)
(1 row)

Index Scan


When you create an index in Postgres, you’re creating a column or multi-column reference that is stored on disk. Postgres is able to use this index as a map to the data stored in the table. A basic index scan uses a B-tree to quickly find the exact location of the data in a two-step process: first Postgres finds the entry in the index, then it follows that reference and fetches the rest of the row data from the table.

EXPLAIN select * from accounts where id = '5';

                                  QUERY PLAN
-------------------------------------------------------------------------------
 Index Scan using accounts_pkey on accounts  (cost=0.15..2.37 rows=1 width=36)
   Index Cond: (id = 5)
(2 
[...]

Integrating Custom Storages with pgwatch
Posted by ahmed gouda in Cybertec on 2025-12-04 at 08:32

As a PostgreSQL-specific monitoring solution, pgwatch is mostly known for storing collected metrics in a PostgreSQL database. While that is great (you probably should "just use Postgres for everything" xD), in some scenarios and specific setups this might be a limitation.

You may want to store your metrics in a particular OLAP database, or a specific format, etc. This is where pgwatch's gRPC client shines, as it enables users to integrate custom storage solutions by developing a compatible gRPC server that writes the metrics to their desired destination.

In this post, we will go through the process of developing a PoC for a custom gRPC server in Go that pgwatch can use to write metrics to ClickHouse.

pgwatch protobuf

First, let's understand the pgwatch gRPC client protobuf file.

In the file, you can see the main Receiver service defined with a couple of methods; let's explain them one by one.

  1. UpdateMeasurements(MeasurementEnvelope) returns (Reply).
    • It's the main write function that gets passed the measurement and writes it to the desired destination.
    • The passed MeasurementEnvelope data type consists of DBName, which is the monitored source name, Tags, representing the tags associated with this source, MetricName, the metric being measured, and Data, a list of rows that resulted from running the metric query.
    • Returns a log message in Reply.

  2. SyncMetric(SyncReq) returns (Reply).
    • It's a special function that synchronizes to the server the addition/deletion of a metric under a specific monitored source, so it can pre-allocate/free resources.
    • The passed SyncReq data type consists of DBName, representing the source name under which the metric got added/deleted, MetricName, the metric added/deleted, and Operation, either addition or deletion.
    • Returns a log message in Reply.

  3. DefineMetrics(google.protobuf.Struct) returns (Reply).
    • It's an optional, re
[...]

Building a RAG Server with PostgreSQL - Part 1: Loading Your Content
Posted by Dave Page in pgEdge on 2025-12-04 at 06:30

Retrieval-Augmented Generation (RAG) has become one of the most practical ways to give Large Language Models (LLMs) access to your own data. Rather than fine-tuning a model or hoping it somehow knows about your documentation, RAG lets you retrieve relevant content from your own sources and provide it as context to the LLM at query time. The result is accurate, grounded responses based on your actual content.

In this three-part series, I'll walk through building a complete RAG server using PostgreSQL as the foundation. We'll cover:

  • Part 1 (this post): Creating a schema and loading your documents
  • Part 2: Chunking documents and generating embeddings with pgEdge Vectorizer
  • Part 3: Deploying a RAG API server for your applications

By the end of the series, you'll have a working RAG system that can answer questions using your own documentation or knowledge base.

Why PostgreSQL for RAG?

If you're already running PostgreSQL (and let's face it, you probably are), adding RAG capabilities to your existing infrastructure makes a lot of sense. With the pgvector extension, Postgres becomes a capable vector database without requiring you to deploy and manage yet another specialised system. Your documents, embeddings, and application data can all live in one place, with the transactional guarantees and operational tooling you already know.
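As a taste of where the series is heading, a starting schema can be as simple as the sketch below; the layout is hypothetical, and the actual posts use the pgEdge Document Loader's own schema.

-- Hypothetical starting point: enable pgvector and store raw documents
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
    id        bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    path      text NOT NULL,
    title     text,
    content   text NOT NULL,                     -- cleaned Markdown body
    loaded_at timestamptz NOT NULL DEFAULT now()
);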

The Architecture

Our RAG system consists of three components. In this first post, we'll focus on getting your documents into the d[...]

Feeds

Planet

  • Policy for being listed on Planet PostgreSQL.
  • Add your blog to Planet PostgreSQL.
  • List of all subscribed blogs.
  • Manage your registration.

Contact

Get in touch with the Planet PostgreSQL administrators at planet at postgresql.org.