Just a Theory

By David E. Wheeler

Posts about Packaging

Adventures in Extension Packaging

I gave a presentation at PGConf.dev last week, Adventures in Extension Packaging. It summarizes stuff I learned in the past year in developing the PGXN Meta v2 RFC, re-packaging all of the extensions on pgt.dev, and experimenting with the CloudNativePG community’s proposal to mount extension OCI images in immutable PostgreSQL containers.

Turns out a ton of work and experimentation remains to be done.

Previous work covers the first half of the talk.

The rest of the talk encompasses newer work. Read on for details.

Automated Packaging Challenges

Back in December I took over maintenance of the Trunk registry, a.k.a., pgt.dev, refactoring and upgrading all 200+ extensions and adding Postgres 17 builds. This experience opened my eyes to the wide variety of extension build patterns and configurations, even when supporting a single OS (Ubuntu 22.04 “Jammy”). Some examples:

  • pglogical requires an extra make param to build on PostgreSQL 17: make -C LDFLAGS_EX="-L/usr/lib/postgresql/17/lib"
  • Some pgrx extensions require additional build params
  • pljava needs a pointer to libjvm: mvn clean install -Dpljava.libjvmdefault=/usr/lib/x86_64-linux-gnu/libjvm.so
  • plrust needs files to be moved around, a shell script to be run, and to be built from a subdirectory
  • bson also needs files to be moved around and a pointer to libbson
  • timescale requires an environment variable and shell script to run before building
  • Many extensions require patching to build for various configurations and OSes, like this tweak to build pguri on Postgres 17 and this patch to get duckdb_fdw to build at all

Doubtless there’s much more. These sorts of challenges led the RPM and APT packaging systems to support explicit scripting and patches for every package. I don’t think it would be sensible to support build scripting in the meta spec.

However, the PGXN meta SDK I developed last year supports the merging of multiple META.json files, so that downstream packagers could maintain files with additional configurations, including explicit build steps or lists of packages, to support these use cases.
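
To illustrate, a downstream packager might maintain an overlay file to be merged over the upstream META.json. This is only a sketch: the dependencies structure follows the v2 draft, but JSON admits no comments, so note here that the steps key is entirely hypothetical, a stand-in for whatever shape explicit build scripting might eventually take (the mvn command comes from the pljava example above):

{
  "dependencies": {
    "packages": {
      "build": {
        "requires": {
          "pkg:generic/libjvm": 0
        }
      }
    }
  },
  "steps": [
    "mvn clean install -Dpljava.libjvmdefault=/usr/lib/x86_64-linux-gnu/libjvm.so"
  ]
}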

Furthermore, the plan to add reporting to PGXN v2 means that downstream packagers could report build failures, which would appear on PGXN, where they’d encourage some maintainers, at least, to fix issues within their control.

Dependency Resolution

Dependencies present another challenge. The v2 spec supports third party dependencies — those not part of Postgres itself or the ecosystem of extensions. Ideally, an extension like pguri would define its dependence on the uriparser library like so:

{
  "dependencies": {
    "postgres": { "version": ">= 9.3" },
    "packages": {
      "build": {
        "requires": {
          "pkg:generic/uriparser": 0,
        }
      }
    }
  }
}

An intelligent build client will parse the dependencies, provided as purls, to determine the appropriate OS packages to install to satisfy them. For example, when building on a Debian-based system, it would know to install liburiparser-dev to build the extension and to require liburiparser1 to run it.
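
A minimal shell sketch of what that client-side resolution might look like, assuming a static purl-to-package mapping (the Debian names come from the example above; the Fedora name is my guess):

# Hypothetical resolution of pkg:generic/uriparser to OS packages
. /etc/os-release
case "$ID" in
    debian | ubuntu)
        apt-get install -y liburiparser-dev  # build-time dependency
        ;;                                   # runtime: liburiparser1
    fedora | rhel)
        dnf install -y uriparser-devel       # assumed Fedora package name
        ;;
esac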

With the aim to support multiple OSes and versions — not to mention Postgres versions — the proposed PGXN binary registry would experience quite the combinatorial explosion to support all possible dependencies on all possible OSes and versions. While I propose to start simple (Linux and macOS, Postgres 14-18) and gradually grow, it could quickly get quite cumbersome.

So much so that I can practically hear Christoph’s and Devrim’s reactions from here:

Photo of Ronald Reagan and his team laughing uproariously with the white Impact Bold-style meme text at the top that reads, “AND THEN HE SAID…”, followed by large text at the bottom that reads, “WE’LL PACKAGE EVERY EXTENSION FOR EVERY PLATFORM!”

Photo of Christoph, Devrim, and other long-time packagers laughing at me.

Or perhaps:

Photo of two German shepherds looking at a pink laptop and appearing to laugh hysterically, with the white Impact Bold-style meme text at the top that reads, “AND THEN HE SAID…”, followed by large text at the bottom that reads, “UPSTREAM MAINTAINERS WILL FIX BUILD FAILURES!”

Photo of Christoph and Devrim laughing at me.

I hardly blame them.

A CloudNativePG Side Quest

Gabriele Bartolini blogged the proposal to deploy extensions to CloudNativePG containers without violating the immutability of the container. The introduction of the extension_control_path GUC in Postgres 18 and the ImageVolume feature in Kubernetes 1.33 enable the pattern, likely to be introduced in CloudNativePG v1.27. Here’s a sample CloudNativePG cluster manifest with the proposed extension configuration:

 1 apiVersion: postgresql.cnpg.io/v1
 2 kind: Cluster
 3 metadata:
 4   name: postgresql-with-extensions
 5 spec:
 6   instances: 1
 7   imageName: ghcr.io/cloudnative-pg/postgresql-trunk:18-devel
 8   postgresql:
 9     extensions:
10       - name: vector
11         image:
12           reference: ghcr.io/cloudnative-pg/pgvector-18-testing
13   storage:
14     storageClass: standard
15     size: 1Gi

The extensions object at lines 9-12 configures pgvector simply by referencing an OCI image that contains nothing but the files for the extension. To “install” the extension, the proposed patch triggers a rolling update, replicas first. For each instance, it takes the following steps:

  • Mounts each extension as a read-only ImageVolume under /extensions; in this example, /extensions/vector provides the complete contents of the image

  • Updates LD_LIBRARY_PATH to include the path to the lib directory of each extension, e.g., /extensions/vector/lib.

  • Updates the extension_control_path and dynamic_library_path GUCs to point to the share and lib directories of each extension, in this example:

    extension_control_path = '$system:/extensions/vector/share'
    dynamic_library_path   = '$libdir:/extensions/vector/lib'
    

This works! Alas, the pod restart is absolutely necessary, whether or not any extension requires it1, because:

  • Kubernetes resolves volume mounts, including ImageVolumes, at pod startup
  • The dynamic_library_path and extension_control_path GUCs require a Postgres restart
  • Each extension requires another path to be appended to both of these GUCs, as well as the LD_LIBRARY_PATH

Say we wanted to use five extensions. The extensions part of the manifest would look something like this:

extensions:
  - name: vector
    image:
      reference: ghcr.io/cloudnative-pg/pgvector-18-testing
  - name: semver
    image:
      reference: ghcr.io/example/semver:0.40.0
  - name: auto_explain
    image:
      reference: ghcr.io/example/auto_explain:18
  - name: bloom
    image:
      reference: ghcr.io/example/bloom:18
  - name: postgis
    image:
      reference: ghcr.io/example/postgis:18

To support this configuration, CNPG must configure the GUCs like so:

extension_control_path = '$system:/extensions/vector/share:/extensions/semver/share:/extensions/auto_explain/share:/extensions/bloom/share:/extensions/postgis/share'

dynamic_library_path   = '$libdir:/extensions/vector/lib:/extensions/semver/lib:/extensions/auto_explain/lib:/extensions/bloom/lib:/extensions/postgis/lib'

And also LD_LIBRARY_PATH:

LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/extensions/vector/lib:/extensions/semver/lib:/extensions/auto_explain/lib:/extensions/bloom/lib:/extensions/postgis/lib"

In other words, every additional extension requires another prefix to be appended to each of these configurations. Ideally we could use a single prefix for all extensions, avoiding the need to update these configs and therefore to restart Postgres. Setting aside the ImageVolume limitation2 for the moment, this pattern would require no rolling restarts and no GUC updates unless a newly-added extension requires pre-loading via shared_preload_libraries.

Getting there, however, requires a different extension file layout than PostgreSQL currently uses.

RFC: Extension Packaging and Lookup

Imagine this:

  • A single extension search path GUC
  • Each extension in its own eponymous directory
  • Pre-defined subdirectory names used inside each extension directory

The search path might look something like:

extension_search_path = '$system:/extensions:/usr/local/extensions'

The contents of one of these directories, /extensions, would be the extension directories themselves:

❯ ls -1 extensions
auto_explain
bloom
postgis
semver
vector

And the contents of one of these extension directories would be something like:

❯ tree extensions/semver
extensions/semver
├── doc
│   └── semver.md
├── lib
│   └── semver.so
├── semver.control
└── sql
    ├── semver--0.31.0--0.31.1.sql
    ├── semver--0.31.1--0.31.2.sql
    ├── semver--0.31.2--0.32.0.sql
    └── semver--0.5.0--0.10.0.sql

For this pattern, Postgres would look in each of the paths for an appropriately-named directory containing a control file. To find the semver extension, for example, it would look for /extensions/semver/semver.control.

All the other files for the extension would live in specifically-named subdirectories: doc for documentation files, lib for shared libraries, sql for SQL deployment files, plus bin, man, html, include, locale, and any other likely resources.

With all of the files required for an extension bundled into well-defined subdirectories of a single directory, this layout lends itself to the proposed binary distribution format. Couple it with OCI distribution and it becomes a natural fit for ImageVolume deployment: simply map each extension OCI image to a subdirectory of the desired search path and you’re done. The extensions object in the CNPG Cluster manifest remains unchanged, and CNPG no longer needs to manipulate any GUCs.

Some might recognize this proposal from a previous RFC post. It not only simplifies the CloudNativePG use cases, but because it houses all of the files for an extension in a single bundle, it also vastly simplifies installation on any system:

  1. Download the extension package
  2. Validate its signature & contents
  3. Unpack its contents into a directory named for the extension in the extension search path

Simple!
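
A shell sketch of those three steps, assuming a hypothetical download URL, a detached checksum file, and a tarball-based Trunk package; the actual archive format and signature tooling remain to be defined:

# Hypothetical URL and file names; signature verification tooling TBD
curl -fsSLO https://example.com/trunk/semver-0.40.0.trunk.tar.gz
sha256sum --check semver-0.40.0.trunk.tar.gz.sha256
mkdir -p /usr/local/extensions/semver
tar -xzf semver-0.40.0.trunk.tar.gz -C /usr/local/extensions/semver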

Fun With Dependencies

Many extensions depend on external libraries and rely on the OS to find them. OS packagers follow the dependency patterns of their packaging systems: they require the installation of other packages to satisfy those dependencies.

How could the Trunk Packaging Format generalize this pattern to work on all OSes? I see two potential approaches:

  1. List the dependencies as purls that the installing client translates to the appropriate OS packages and installs.
  2. Bundle dependencies in the Trunk package itself.

Option 1 will work well for most use cases, but not for immutable systems like CloudNativePG. Option 2 could work for such situations. But perhaps you noticed the omission of LD_LIBRARY_PATH manipulation in the packaging and lookup discussion above. Setting aside the multitude of reasons to avoid LD_LIBRARY_PATH3, how else could the OS find shared libraries needed by an extension?

Typically, one installs shared libraries in one of a few directories known to tools like ldconfig, which must run after each install to cache their locations. But one cannot rely on ldconfig in immutable environments, because the cache of course cannot be mutated.

We could, potentially, rely on rpath, a feature of modern dynamic linkers that reads a list of known paths from the header of a binary file. In fact, most modern OSes support $ORIGIN as an rpath value4 (or @loader_path on Darwin/macOS), which refers to the same directory in which the binary file appears. Imagine this pattern:

  • The Trunk package for an extension includes dependency libraries alongside the extension module
  • The module is compiled with rpath=$ORIGIN

To test this pattern, let’s install the Postgres 18 beta and try it with the pguri extension. First, remove the $libdir/ prefix (as discussed previously) and patch the extension for Postgres 17+:

perl -i -pe 's{\$libdir/}{}' pguri/uri.control pguri/*.sql
perl -i -pe 's/^(PG_CPPFLAGS.+)/$1 -Wno-int-conversion/' pguri/Makefile

Then compile it with CFLAGS to set rpath and install it with a prefix parameter:

make CFLAGS='-Wl,-rpath,\$$ORIGIN'
make install prefix=/usr/local/postgresql

With the module installed, move the liburiparser shared library from OS packaging to the lib directory under the prefix, resulting in these contents:

❯ ls -1 /usr/local/postgresql/lib
liburiparser.so.1
liburiparser.so.1.0.30
uri.so

The chrpath utility shows that the extension module, uri.so, has its RUNPATH (the modern implementation of rpath) properly configured:

❯ chrpath /usr/local/postgresql/lib/uri.so 
uri.so: RUNPATH=$ORIGIN

Will the OS be able to find the dependency? Use ldd to find out:

❯ ldd /usr/local/postgresql/lib/uri.so 
	linux-vdso.so.1
	liburiparser.so.1 => /usr/local/postgresql/lib/liburiparser.so.1
	libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6
	/lib/ld-linux-aarch64.so.1

The second line of output shows that it does in fact find liburiparser.so.1 where we put it. So far so good. Just need to tell the GUCs where to find them and restart Postgres:

extension_control_path = '$system:/usr/local/postgresql/share'
dynamic_library_path   = '$libdir:/usr/local/postgresql/lib'

And then it works!

❯ psql -c "CREATE EXTENSION uri"
CREATE EXTENSION
❯ psql -c "SELECT 'https://example.com/'::uri"
         uri          
----------------------
 https://example.com/

Success! So we can adopt this pattern, yes?

A Wrinkle

Well, maybe. Try it with a second extension, http, once again building it with rpath=$ORIGIN and installing it in the custom lib directory:

perl -i -pe 's{\$libdir/}{}g' *.control
make CFLAGS='-Wl,-rpath,\$$ORIGIN'
make install prefix=/usr/local/postgresql

Make sure it took:

❯ chrpath /usr/local/postgresql/lib/http.so 
http.so: RUNPATH=$ORIGIN

Now use ldd to see what shared libraries it needs:

❯ ldd /usr/local/postgresql/lib/http.so
	linux-vdso.so.1 
	libcurl.so.4 => not found
	libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6

Naturally it needs libcurl; let’s copy it from another system and try again:

 1 ❯ scp dev:libcurl.so.4 /usr/local/postgresql/lib/
 2 ❯ ldd /usr/local/postgresql/lib/http.so
 3 	linux-vdso.so.1
 4 	libcurl.so.4 => /usr/local/postgresql/lib/libcurl.so.4
 5 	libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6
 6 	/lib/ld-linux-aarch64.so.1
 7 	libnghttp2.so.14 => not found
 8 	libidn2.so.0 => /lib/aarch64-linux-gnu/libidn2.so.0
 9 	librtmp.so.1 => not found
10 	libssh.so.4 => not found
11 	libpsl.so.5 => not found
12 	libssl.so.3 => /lib/aarch64-linux-gnu/libssl.so.3
13 	libcrypto.so.3 => /lib/aarch64-linux-gnu/libcrypto.so.3
14 	libgssapi_krb5.so.2 => /lib/aarch64-linux-gnu/libgssapi_krb5.so.2
15 	libldap.so.2 => not found
16 	liblber.so.2 => not found
17 	libzstd.so.1 => /lib/aarch64-linux-gnu/libzstd.so.1
18 	libbrotlidec.so.1 => not found
19 	libz.so.1 => /lib/aarch64-linux-gnu/libz.so.1

Line 4 shows it found libcurl.so.4 where we put it, but the rest of the output lists a bunch of new dependencies that need to be satisfied. These did not appear before because the http.so module doesn’t depend on them; the libcurl.so library does. Let’s add libnghttp2 and try again:

 1 ❯ scp dev:libnghttp2.so.14 /usr/local/postgresql/lib/
 2 ❯ ldd /usr/local/postgresql/lib/http.so
 3 	linux-vdso.so.1
 4 	libcurl.so.4 => /usr/local/postgresql/lib/libcurl.so.4
 5 	libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6
 6 	/lib/ld-linux-aarch64.so.1
 7 	libnghttp2.so.14 => not found
 8 	libidn2.so.0 => /lib/aarch64-linux-gnu/libidn2.so.0
 9 	librtmp.so.1 => not found
10 	libssh.so.4 => not found
11 	libpsl.so.5 => not found
12 	libssl.so.3 => /lib/aarch64-linux-gnu/libssl.so.3
13 	libcrypto.so.3 => /lib/aarch64-linux-gnu/libcrypto.so.3
14 	libgssapi_krb5.so.2 => /lib/aarch64-linux-gnu/libgssapi_krb5.so.2
15 	libldap.so.2 => not found
16 	liblber.so.2 => not found
17 	libzstd.so.1 => /lib/aarch64-linux-gnu/libzstd.so.1
18 	libbrotlidec.so.1 => not found
19 	libz.so.1 => /lib/aarch64-linux-gnu/libz.so.1

Sadly, as line 7 shows, it still can’t find libnghttp2.so.

It turns out that rpath works only for immediate dependencies. To solve this problem, libcurl and all other shared libraries must also be compiled with rpath=$ORIGIN — which means we can’t simply copy those libraries from OS packages5. In the meantime, only direct dependencies can be bundled with an extension.
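
One possible workaround, untested here: rewrite the RUNPATH of the bundled libraries after copying them, using a tool like patchelf, so that each library also resolves its own dependencies relative to itself:

# Post-hoc RUNPATH rewriting (a sketch, not tested here)
patchelf --set-rpath '$ORIGIN' /usr/local/postgresql/lib/libcurl.so.4
chrpath /usr/local/postgresql/lib/libcurl.so.4

Of course, this only helps for libraries one controls at packaging time; footnote 5’s point stands for everything else.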

Project Status

The vision of accessible, easy-install extensions everywhere remains intact. I’m close to completing a first release of the PGXN v2 build SDK with support for meta spec v1 and v2, PGXS, and pgrx extensions. I expect the first deliverable to be a command-line client to complement and eventually replace the original CLI. It will be put to work building all the extensions currently distributed on PGXN, which will surface new issues and patterns that inform the development and completion of the v2 meta spec.

In the future, I’d also like to:

  • Finish working out Trunk format and dependency patterns
  • Develop and submit the proposed extension_search_path patch
  • Submit ImageVolume feedback to Kubernetes to allow runtime mounting
  • Start building and distributing OCI Trunk packages
  • Make the pattern available for distributed registries, so anyone can build their own Trunk releases!
  • Hack fully-dynamic extension loading into CloudNativePG

Let’s Talk

I recognize the ambition here, but feel equal to it. Perhaps not every bit will work out, but I firmly believe in setting a clear vision and executing toward it while pragmatically revisiting and revising it as experience warrants.

If you’d like to contribute to the project or employ me to continue working on it, let’s talk! Hit me up via one of the services listed on the about page.


  1. The feature does not yet support pre-loading shared libraries. Presumably a flag will be introduced to add the extension to shared_preload_libraries. ↩︎

  2. Though we should certainly request the ability to add new ImageVolume mounts without a restart. We can’t be the only ones thinking about this kind of feature, right? ↩︎

  3. In general, one should avoid LD_LIBRARY_PATH for a variety of reasons, not least of which its bluntness. For various security reasons, macOS ignores it unless SIP is disabled, and SELinux prevents its propagation to new processes. ↩︎

  4. Although not Windows, alas. ↩︎

  5. Unless packagers could be persuaded to build all libraries with rpath=$ORIGIN, which seems like a tall order. ↩︎

RFC: Extension Packaging & Lookup

Several weeks ago, I started a pgsql-hackers thread proposing a new extension file organization and a search path GUC for finding extensions. The discussion of Christoph Berg’s extension_destdir patch inspired this proposal. These threads cover quite a lot of territory, so I want to pull together a more unified, public proposal.

Here goes.

Challenges

A number of challenges face extension users, thanks to extension file organization in the Postgres core. The common thread among them is the need to add extensions without changing the contents of the Postgres installation itself.

Packager Testing

On Debian systems, the user account that creates extension packages lacks permission to add files to the Postgres install. But testing extensions requires installing the extension where Postgres can find it. Moreover, extensions ideally build against a clean Postgres install; adding an extension in order to run make installcheck would pollute it.

Christoph’s patch solves these problems by adding a second lookup path for extensions and dynamic modules, so that Postgres can load them directly from the package build directory.

Alas, the patch isn’t ideal, because it simply specifies a prefix and appends the full pg_config directory paths to it. For example, if --sharedir outputs /opt/share and the extension_destdir GUC is set to /tmp/build/myext, the patch will search in /tmp/build/myext/opt/share. This approach works for the packaging use case, which explicitly uses full paths with a prefix, but would be weird for other use cases.

Peter Eisentraut proposed an alternate patch with a new GUC, extension_control_path, that provides a more typical search path pattern to find extension control files, but doesn’t account for shared modules that ship with an extension, requiring that they still live in the dynamic_library_path. Installing into custom directories requires the undocumented datadir and pkglibdir variables:

make install datadir=/else/where/share pkglibdir=/else/where/lib

This pattern can probably be simplified.

OCI Immutability

OCI (née Docker) images are immutable, while a container runs on a writeable but non-persistent file system. To install persistent extensions in a container, one must create a persistent volume, map it to SHAREDIR/extensions, and copy over all the extensions it needs (or muck with symlink magic). Then do it again for shared object libraries (PKGLIBDIR), and perhaps also for other pg_config directories, like --bindir. Once it’s all set up, one can install a new extension and its files will be distributed to the relevant persistent volumes.

This pattern makes upgrades tricky, because the core extensions are mixed in with third-party extensions. Worse, the number of directories that must be mounted into volumes depends on the features of an extension, increasing deployment configuration complexity. It would be preferable to have all the files for an extension in one place, rather than scattered across multiple persistent volumes.

Peter Eisentraut’s patch addresses much of this issue by adding a search path for extension control files and related data/share files (generally SQL files). One can create a single volume with a lib directory for shared modules and share/extension directory for control and data/share files.

OCI Extension Images

However, an additional wrinkle is the ambition from the CloudNativePg (CNPG) community to eliminate the need for a persistent volume, and rely instead on mounting images that each contain all the files for a single extension as their own volumes, perhaps using the Kubernetes image volume feature (currently in alpha).

This feature requires all the files in an extension to live in a single directory, so that a volume mounted from an extension image contains all the files required to use the extension. The search path patches proposed so far do not enable this behavior.

Postgres.app Immutability

The macOS Postgres.app supports extensions. But installing one into SHAREDIR/extensions changes the contents of the Postgres.app bundle, breaking Apple-required signature validation. The OS will no longer be able to validate that the app is legit and will refuse to start it.

Peter Eisentraut’s new patch addresses this issue as well, with all the same caveats as for the packager testing challenges.

Solution

To further address these issues, this RFC proposes to change file organization and lookup patterns for PostgreSQL extensions.

Extension Directories

First, when an extension is installed, by default all of its files will live in a single directory named for the extension. The contents include:

  • The control file that describes the extension
  • Subdirectories for SQL, shared modules, docs, binaries, etc.

Subdirectories roughly correspond to the pg_config --*dir options:

  • bin: Executables
  • doc: Documentation files
  • html: HTML documentation files
  • lib: Dynamically loadable modules
  • locale: Locale support files
  • man: Manual pages
  • share: SQL and other architecture-independent support files

This layout reduces the cognitive overhead for understanding what files belong to what extension. Want to know what’s included in the widget extension? Everything is in the widget directory. It also simplifies installation of an extension: one need add only a directory named for and containing the files required by the extension.

Configuration Parameter

Add a new pg_config value that returns the directory into which extensions will by default be installed:

 --extdir   show location of extensions

Its default value would be $(pg_config --sharedir)/extension, but could be set at compile time like other configuration parameters. Its contents consist of subdirectories that each contain an extension, as described in Extension Directories. With a few extensions installed, it would look something like:

❯ ls -1 "$(pg_config --extdir)"
auto_explain
bloom
isn
pair
plperl
plpgsql
plv8
xml2
semver
vector

Extension Path

Add an extension lookup path GUC akin to dynamic_library_path, called extension_path. It lists all the directories that Postgres will search for extensions and their files. The default value for this GUC will be:

extension_path = '$extdir'

The special string $extdir corresponds to the pg_config option of the same name, and functions exactly as $libdir does for the dynamic_library_path GUC, substituting the appropriate value.

Lookup Execution

Update PostgreSQL’s CREATE EXTENSION command to search the directories in extension_path for an extension. For each directory in the list, it will look for the extension control file in a directory named for the extension:

$dir/$extension/$extension.control

The first match will be considered the canonical location for the extension. For example, if Postgres finds the control file for pair at /opt/pg17/ext/pair/pair.control, it will load files only from the appropriate subdirectories, e.g.:

  • SQL files from /opt/pg17/ext/pair/share
  • Shared module files from /opt/pg17/ext/pair/lib

PGXS

Update the extension installation behavior of PGXS to install extension files into the new layout. A new variable, $EXTDIR, will define the directory into which to install extension directories, and default to $(pg_config --extdir). It can be set to any literal path, which must exist and be accessible by the PostgreSQL service.

The $EXTENSION variable will be changed to allow only one extension name. If it’s set, the installation behavior will be changed for the following variables:

  • EXTENSION: Creates $EXTDIR/$EXTENSION, installs $EXTDIR/$EXTENSION/$EXTENSION.control
  • MODULES and MODULE_big: Installed into $EXTDIR/$EXTENSION/lib
  • MODULEDIR: Removed
  • DATA and DATA_built: Installed into $EXTDIR/$EXTENSION/share
  • DATA_TSEARCH: Installed into $EXTDIR/$EXTENSION/share/tsearch_data
  • DOCS: Installed into $EXTDIR/$EXTENSION/doc
  • PROGRAM, SCRIPTS and SCRIPTS_built: Installed into $EXTDIR/$EXTENSION/bin

Each of these locations can still be overridden by setting one of the (currently undocumented) installation location options (e.g., datadir, pkglibdir, etc.).
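
For a concrete sense of the change, here’s a sketch of a Makefile for the pair extension under this proposal; it’s standard PGXS, and only the installed layout differs:

EXTENSION = pair
DATA = sql/pair--1.0--1.1.sql sql/pair--1.1.sql
DOCS = doc/pair.md

PG_CONFIG ?= pg_config
PGXS := $(shell $(PG_CONFIG) --pgxs)
include $(PGXS)

Running make install would then create $EXTDIR/pair, installing pair.control at its root and the SQL and doc files into its share and doc subdirectories; make install EXTDIR=/else/where would install into /else/where/pair instead.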

Note

External projects that install extensions without using PGXS, like pgrx, must also be updated to either follow the same pattern or to delegate installation to PGXS.

Control File

The directory control file parameter will be deprecated and ignored.

The module_pathname parameter should only name a shared module in the lib subdirectory of an extension directory. Any existing use of a $libdir prefix will be stripped out and ignored before replacing the MODULE_PATHNAME string in SQL files. The implication for loading extension dynamic modules1 differs from the existing behavior as follows:

  1. If the name is an absolute path, the given file is loaded.
  2. If the name does not contain a directory part, the file is searched for in the lib subdirectory of the extension’s directory ($EXTDIR/$EXTENSION/lib).
  3. Otherwise (the file was not found in the path, or it contains a non-absolute directory part), the dynamic loader will try to take the name as given, which will most likely fail. (It is unreliable to depend on the current working directory.)

Use Cases

Here’s how the proposed file layout and extension_path GUC addresses the use cases that inspired this RFC.

Packager Testing

A packager who wants to run tests without modifying a PostgreSQL install would follow these steps:

  • Prepend a directory under the packaging install to the extension_path GUC. The resulting value would be something like $RPM_BUILD_ROOT/$(pg_config --extdir):$extdir.
  • Install the extension into that directory: make install EXTDIR=$RPM_BUILD_ROOT
  • Make sure the PostgreSQL server can access the directory, then run make installcheck

This will allow PostgreSQL to find and load the extension during the tests. The Postgres installation will not have been modified; only the extension_path will have changed.
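
Sketched as shell, following the steps above (the paths, config-editing, and restart mechanics are illustrative only):

# Illustrative packager test run; paths assumed
make install EXTDIR="$RPM_BUILD_ROOT$(pg_config --extdir)"
cat >> "$PGDATA/postgresql.conf" <<CONF
extension_path = '$RPM_BUILD_ROOT$(pg_config --extdir):\$extdir'
CONF
pg_ctl restart -D "$PGDATA"
make installcheck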

OCI/Kubernetes

To allow extensions to be added to an OCI container and to persist beyond its lifetime, one or more volumes could be used. Some examples:

  • Mount a persistent volume for extensions and prepend the path to that directory to the extension_path GUC. Then Postgres can find any extensions installed there, and they will persist. Files for all extensions will live on a single volume.
  • Or, to meet a desire to keep some extensions separate (e.g., open-source vs company-internal extensions), two or more persistent volumes could be mounted, as long as they’re all included in extension_path, are accessible by PostgreSQL, and users take care to install extensions in the proper locations.

CNPG Extension Images

To meet the CNPG ambition to “install” an extension by mounting a single directory for each, create separate images for each extension, then use the Kubernetes image volume feature (currently in alpha) to mount each as a read-only volume in the appropriate subdirectory of a directory included in extension_path. Thereafter, any new containers would simply have to mount all the same extension image volumes to provide the same extensions to all containers.

Postgres.app

To allow extension installation without invalidating the Postgres.app bundle signature, the default configuration could prepend a well-known directory outside the app bundle, such as /Library/Application Support/Postgres, to extension_path. Users wishing to install new extensions would then need to point the EXTDIR parameter to that location, e.g.,

$ make install EXTDIR="/Library/Application Support/Postgres"

Or the app could get trickier, setting the --extdir value to that location so that users don’t need to use EXTDIR. As long as extension_path includes both the bundle’s own extension directory and this external directory, Postgres will be able to find and load all extensions.

Extension Directory Examples

A core extension like citext would have a structure similar to:

citext
├── citext.control
├── lib
│   ├── citext.dylib
│   └── bitcode
│       ├── citext
│       │   └── citext.bc
│       └── citext.index.bc
└── share
    ├── citext--1.0--1.1.sql
    ├── citext--1.1--1.2.sql
    ├── citext--1.2--1.3.sql
    ├── citext--1.3--1.4.sql
    ├── citext--1.4--1.5.sql
    ├── citext--1.4.sql
    └── citext--1.5--1.6.sql

A pure SQL extension named “pair” would live in a directory named “pair”, structured something like this:

pair
├── LICENSE.md
├── README.md
├── pair.control
├── doc
│   ├── html
│   │   └── pair.html
│   └── pair.md
└── share
    ├── pair--1.0--1.1.sql
    └── pair--1.1.sql

A binary application like pg_top would live in the pg_top directory, structured something like:

pg_top
├── HISTORY.rst
├── INSTALL.rst
├── LICENSE
├── README.rst
├── bin
│   └── pg_top
└── doc
    └── man
        └── man3
            └── pg_top.3

And a C extension like semver would live in the semver directory and be structured something like:

semver
├── LICENSE
├── README.md
├── semver.control
├── doc
│   └── semver.md
├── lib
│   ├── semver.dylib
│   └── bitcode
│       ├── semver
│       │   └── semver.bc
│       └── semver.index.bc
└── share
    ├── semver--1.0--1.1.sql
    └── semver--1.1.sql

Phase Two: Preloading

The above-proposed solution does not allow shared modules distributed with extensions to compatibly be loaded via shared library preloading, because extension modules will no longer live in the dynamic_library_path. Users can specify full paths, however. For example, instead of:

shared_preload_libraries = 'pg_partman_bgw'

One could use the path to the lib subdirectory of the extension’s directory:

shared_preload_libraries = '/opt/postgres/extensions/pg_partman_bgw/lib/pg_partman_bgw'

But users will likely find this pattern cumbersome, especially for extensions with multiple shared modules. Perhaps some special syntax could be added to specify a single extension module, such as:

shared_preload_libraries = '$extension_path::pg_partman_bgw'

But this overloads the semantics of shared_preload_libraries and the code that processes it rather heavily, not to mention the LOAD command.

Therefore, as a follow up to the solution proposed above, this RFC proposes additional changes to PostgreSQL.

Extension Preloading

Add new GUCs that complement shared library preloading, but for extension module preloading:

  • shared_preload_extensions
  • session_preload_extensions
  • local_preload_extensions

Each takes a list of extensions for which to preload shared modules. In addition, another new GUC, local_extensions, will contain a list of administrator-approved extensions users are allowed to include in local_preload_extensions. This GUC complements local_preload_libraries’s use of a plugins directory.

Then modify the preloading code to also preload these files. For each extension in a list, it would:

  • Search each path in extension_path for the extension.
  • When found, load all the shared libraries from $extension/lib.

For example, to load all shared modules in the pg_partman extension, set:

shared_preload_extensions = 'pg_partman'

To load a single shared module from an extension, give its name after the extension name and two colons. This example will load only the pg_partman_bgw shared module from the pg_partman extension:

shared_preload_extensions = 'pg_partman::pg_partman_bgw'

This change requires a one-time change to existing preload configurations on upgrade.
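
For example, a configuration that previously preloaded pg_partman’s background worker would change like so (drawn from the examples above):

# postgresql.conf before upgrade
shared_preload_libraries = 'pg_partman_bgw'

# postgresql.conf after upgrade
shared_preload_extensions = 'pg_partman::pg_partman_bgw'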

Future: Deprecate LOAD

For a future change, consider modifying CREATE EXTENSION to support shared module-only extensions. This would allow extensions with no SQL component, such as auto_explain, to be handled like any other extension; it would live under one of the directories in extension_path with a structure like this:

auto_explain
├── auto_explain.control
└── lib
    ├── auto_explain.dylib
    └── bitcode
        ├── auto_explain
        │   └── auto_explain.bc
        └── auto_explain.index.bc

Note the auto_explain.control file. It would need a new parameter to indicate that the extension includes no SQL files, so CREATE EXTENSION and related commands wouldn’t try to find them.
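
Such a control file might look something like this sketch, where no_sql is a hypothetical name for that new parameter:

# auto_explain.control (sketch; "no_sql" is a made-up parameter name)
comment = 'log execution plans of slow statements'
default_version = '1.0'
no_sql = true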

With these changes, extensions could become the primary, recommended interface for extending PostgreSQL. Perhaps the LOAD command could be deprecated, and the *_preload_libraries GUCs along with it.

Compatibility Issues

  • The module_pathname control file variable would contain only the name of a shared module. The code that replaces the MODULE_PATHNAME string in SQL files would strip out the $libdir/ prefix, if present.
  • The behavior of loading dynamic modules that ship with extensions (i.e., the value of the AS part of CREATE FUNCTION) would change to look for a library name (with no directory part) in the lib subdirectory of the extension directory.
  • The directory control file parameter and the MODULEDIR PGXS variable would be deprecated and ignored.
  • *_preload_libraries would no longer be used to find extension modules without full paths. Administrators would have to remove module names from these GUCs and add the relevant extension names to the new *_preload_extensions variables. To ease upgrades, we might consider adding a PGXS variable that, when true, would symlink shared modules into --pkglibdir.
  • LOAD would no longer be able to find shared modules included with extensions, unless we add a PGXS variable that, when true, would symlink shared modules into --pkglibdir.
  • The EXTENSION PGXS variable will no longer support multiple extension names.
  • The change in extension installation locations must also be adopted by projects that don’t use PGXS for installation, like pgrx. Or perhaps they could be modified to also use PGXS. Long term it might be useful to replace the Makefile-based PGXS with another installation system, perhaps a CLI.

Out of Scope

This RFC does not include or attempt to address the following issue:

  • How to manage third-party shared libraries. Making system dependencies consistent in an OCI/Kubernetes environment or for non-system binary packaging patterns presents its own challenges, though they’re not specific to PostgreSQL or the patterns described here. Research is ongoing into potential solutions, and will be addressed elsewhere.

Acknowledgements

A slew of PostgreSQL community members contributed feedback, asked hard questions, and suggested moderate to significant revisions to this RFC via the pgsql-hackers list, in-person discussion at PGConf.eu, and pull request comments. I’d especially like to thank:

All remaining errors and omissions remain my own.


  1. But not non-extension modules; see Phase Two and Future for further details on preloading extension modules and eventually deprecating non-extension modules. ↩︎

RFC: PGXN Metadata Sketch

Update 2024-03-22: To complement high-level discussions on the #extensions channel on the Postgres Slack, I also created a pull request to make it easier to comment directly on this post, point by point. Keep the feedback coming, and thank you!

Update 2024-03-25: Clarified the definition of “Release” and made “Maintainers” plural. Thanks to Matthias van de Meent for the suggestions!

Update 2024-07-06 Fixed some typos.

Update 2024-07-15 Converted to RFC–2; blog post.


This post proposes a new metadata standard for extension packaging, distribution, and delivery, building on the PGXN Meta Spec to address its shortcomings and emerging use cases 12 years on. The goals include:

  • Don’t break the existing standard, but add to it
  • Add explicit support for different types of Postgres extensions, such as background workers and loadable modules
  • Add additional metadata for curation and automated compilation and binary packaging
  • Add hooks for registry and third-party-contributed metadata, including binary distribution options and stats & reports.

There are two high level categories of metadata, split into two separate files:

  1. Package Metadata provided by extension package authors
  2. Registry Metadata aggregated by the root registry from various sources, including data derived from the extension source code or package metadata, but also trusted third parties such as packagers, smoke testers, security scanners, and more

Following community discussion of this proposal, the Package Metadata will lead to a draft for PGXN Meta Spec version 2.0.0, while the Registry Metadata will guide the design and implementation of the Root Registry APIs required to provide it.

Terms

Definition of terms as used in this document, because every language and distribution system uses terms differently. These are ours.

  • Extension: A software component that extends the capabilities of a PostgreSQL database or cluster. Extensions may be CREATE EXTENSION extensions, background workers, command-line apps, loadable modules, shared libraries, and more.

  • Package: A collection of extensions that are released, versioned, and distributed together. Packages may be downloaded directly from version control repositories or in archive files generated by a release tag.

  • Package Path: Identifies a package, declared in the package metadata file. A package path should describe both what the package does and where to find it. Typically, a package path consists of a repository root path — the directory that contains the metadata file — and a directory within the repository.

  • Repository: The version control repository where the package is developed, referenced as package path or URL.

  • Repository root path: The portion of the package path that corresponds to the root directory of the version control repository where the package is developed, and which contains the metadata file. Most packages are defined in their repository’s root directory, so this is usually the entire path. For example, github.com/example/pgtap is the repository root path for the package of the same name.

  • Source Distribution: The contents of a single package bundled together with package metadata into a distributable archive file, usually named with the last part of the package path or the main extension, a dash, and the version, e.g., pgtap-1.14.3.zip.

  • Binary Distribution: A binary software distribution of a package, compiled and packaged for a particular OS version, platform, and hardware architecture. Examples include deb, RPM, and wheel.

  • Release: A single version of the package made available to the public on PGXN, expressed as the package path, an at sign, and the semver. Example: github.com/theory/pgtap@1.14.3.

Package Metadata

The extension author provides this data as part of a release submitted to the root registry. The registry itself will not change this data in any way, but provide it as-is.

Essentials

Essential information about the package itself, including its path, name, version, list of authors, license, etc. Ecosystem applications use this data for indexing, naming conventions, and display information.

Fields (all required):

  • Package: A package path uniquely identifying the package (usually the source repo host name and path; see this RFC)
  • Name: The name of the extension package, usually the same as the main extension, not unique but conflicts are resolved by escalation to package
  • Version: The version of the package, expressed as a SemVer
  • Abstract: A short description of the purpose of the package
  • Description: A longer description of the package, answering the question “what is this thing and what value is it?”
  • Maintainers: List of maintainers, each an object with name and either email or url (or both)
  • License: An SPDX 2.3 license expression comprised of one or more licenses from the SPDX License List
  • Spec: Information about the metadata spec itself, a version and URL

Example:

{
  "package": "github.com/example/pg-pair",
  "name": "pair",
  "version": "1.1.0",
  "abstract": "A key/value pair data type",
  "description": "Adds a key/value pair data type to Postgres, useful for specifying arbitrary key/value function parameters.",
  "maintainers": [
    {
        "name": "Naomi Nagata",
        "email": "[email protected]",
        "url": "https://example.com/roci/naomi"
    }
  ],
  "license": "PostgreSQL OR MIT",
  "spec": {
    "version": "2.0.0",
    "url": "https://pgxn.org/meta/spec.txt"
  }
}

Other possibilities:

  • License File: An alternative to license for nonstandard licenses (from cargo)
  • Contributors: List of people who have contributed to the package
  • Emeriti: List of former maintainers and contributors

PGXN Compatibility:

  • Overloads the string representation of the spec v1 field, but it should be able to parse either one, especially with the spec version to differentiate.
  • Uniqueness of package and relationship to name
  • PGXN has maintainer; is it really worth pluralizing it? Maybe we don’t need it at all, given repository and issue links.

Artifacts

A list of links and checksums for downloading the extension package in one or more formats, including source code, binaries, system packages, and more. Apps use this information (along with registry metadata) to determine the best option for installing an extension on a particular system.

This information is optional, and provided in addition to the source code available from the root registry itself. The idea is that, on release, the release manager creates artifacts in one or more locations (such as GitHub releases), and it’s useful to link to them from the registry. Binary builds may be unlikely today, but leave the option open for now by allowing arbitrary text in the type field.

Each item must have:

  • Type: The type of artifact: Source, binary, RPM, etc.
  • URL: Link to the artifact
  • SHA: A SHA checksum

Each URL must properly resolve and the checksum must match.

[
  {
    "type": "source",
    "url": "https://github.com/theory/pg-pair/releases/download/v1.1.0/pair-1.1.0.zip",
    "sha256": "2b9d2416096d2930be51e5332b70bcd97846947777a93e4a3d65fe1b5fd7b004"
  },
  {
    "type": "binary",
    "url": "https://github.com/theory/pg-pair/releases/download/v1.1.0/pair-1.1.0-linux-amd64.tar.gz",
    "sha1": "12d9bc5cfb6bc3c453627eac69511f48be63cfc0"
  },
  {
    "type": "binary",
    "url": "https://github.com/theory/pg-pair/releases/download/v1.1.0/pair-1.1.0-linux-arm64.tar.gz",
    "sha1": "787dc39137f7d1510a33ab0a1b8905cd5f3f72d1"
  }
]
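
A client might consume this metadata something like the following sketch, assuming the artifacts array has been extracted to a file named artifacts.json:

# Download the first (source) artifact and verify its checksum
url=$(jq -r '.[0].url' artifacts.json)
sha=$(jq -r '.[0].sha256' artifacts.json)
curl -fsSLO "$url"
echo "$sha  $(basename "$url")" | sha256sum --check -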

PGXN Compatibility:

New object, no conflicts.

Resources

External information about the package, mostly links, including source code repository, bug reporting, documentation, badges, funding, etc. Apps use this data for links, of course, but also full text indexing, documentation rendering, and displaying useful information about the extension.

Fields:

  • Homepage: URL for the package’s home page
  • Issues: URL to issues
  • Documentation: URL to documentation
  • Support: URL to support
  • Repository: Source code repository URL
  • Badges: Array of badge link objects
  • Funding: Array of funding objects
{
  "homepage": "https://pair.example.com",
  "issues": "https://github.com/example/pair/issues",
  "documentation": "https://pair.example.com/docs",
  "support": "https://github.com/example/pair/discussions",
  "repository": "https://github.com/example/pair",
  "badges": [
    {
      "alt": "Test Status",
      "src": "https://test.packages.postgresql.org/github.com/example/pair.svg"
    }
  ],
  "funding": [
    {
      "type" : "individual",
      "url" : "http://example.com/donate"
    },
    {
      "type" : "patreon",
      "url" : "https://www.patreon.com/example"
    }
  ]
}

PGXN Compatibility:

  • issues replaces bugtracker and is just a URL. Can support either
  • repository is just a URL instead of an object. Can support either but with the package key referencing the repository path perhaps we don’t need to change it.

Contents

A description of what’s included in the package. Often a package consists of multiple extensions, such as PostGIS, which includes postgis, postgis_tiger_geocoder, address_standardizer, and more.

Furthermore, some extensions are not CREATE EXTENSION-type extensions at all, such as background workers, command-line apps, loadable modules, shared libraries, and more. Each should be listed along with documentation links where they differ from the package overall (or are simply more specific).

Fields:

  • Extensions: Object describing CREATE EXTENSION extensions. Keys are extension names, fields are objects with the fields sql, doc, abstract, control (same as control file, but can use the file itself), tle (meaning it can be used as a trusted language extension), and preload (to indicate whether its libraries need to be loaded in advance via shared_preload_libraries, session_preload_libraries, or local_preload_libraries).
  • Workers: Object describing background workers. Keys are worker names, fields are objects with the fields bin, doc, and abstract.
  • Apps: Object describing applications, command-line or otherwise. Keys are app names, fields are objects with the fields bin, doc, and abstract.
  • Modules: Object describing loadable modules that can be loaded into Postgres (not necessary for extensions that include libraries). Keys are module names, fields are objects with the fields lib (without file extension), doc, abstract, and preload.
  • Libraries: Other libraries that may ship in the package and need to be installed but are not loadable modules, such as a dynamic library used by an app. Keys are library names, fields are objects with the fields lib (without file extension) or dir (for a directory of files), doc, and abstract.
{
  "extensions": {
    "pair": {
      "sql": "sql/pair.sql",
      "doc": "doc/pair.md",
      "abstract": "A key/value pair data type",
      "preload": "session",
      "tle": true,
      "control": {
        "directory": "pair",
        "module_pathname": "$libdir/pair",
        "relocatable": true
      }
    }
  },
  "workers": {
    "pair_pruner": {
      "bin": "bin/pair_pruner",
      "doc": "doc/pair_pruner.md",
      "abstract": "A worker to periodically prune pairs"
    }
  },
  "apps": {
    "pair_rand": {
      "bin": "bin/pair_rand",
      "doc": "doc/pair_rand.md",
      "abstract": "Command to generate random pairs of strings"
    }
  },
  "modules": {
    "lib_pair": {
      "lib": "lib/lib_pair",
      "doc": "doc/lib_pair.md",
      "abstract": "A library hooking function calls to convert pairs to named parameters",
      "load": "shared_preload_libraries"
    }
  },
  "libraries": {
    "ruby_pair": {
      "dir": "lib/gems",
      "abstract": "Ruby libraries required to run the extension"
    }
  }
}

Other options:

  • Allow or require an array of docs?
  • Specify support for specific doc formats (markdown, manpage, plain text, MkDocs, etc.)?
  • Support glob patterns?
  • Add libraries as sub-key for workers and apps, e.g. for a Ruby app that requires a directory of gems?

PGXN Compatibility:

Significantly different from the existing provides object, so key this one under contents or manifest and continue to support both.

Dependencies

A list of external dependencies required to configure, build, test, install, and run the extensions in the package. These include not only other extension packages, but also external libraries and system dependencies. And let’s not forget the versions of Postgres required, as well as any OS and version dependencies (e.g., does it work on Windows? FreeBSD? What versions?) and architectures (arm64, amd64, etc.)

Fields:

  • platforms: Platforms supported by the package, defined as OS and architecture. Assume all platforms if none are listed.

  • postgres: Versions of PostgreSQL required by the package, specified as a version range, as well as compiled features (--with-* configure options)

  • pipeline: The build pipeline required to build the extension. Options: pgxs, meson (?), pgrx, gnu (./configure), gem, cpan, pip, go, etc.

  • prereqs: Object defining external dependencies required for different phases:

    • configure: Dependencies to configure the package (e.g., items required for ./configure or make to work)
    • build: Dependencies to build the package (in addition to run dependencies)
    • test: Dependencies to test the package (in addition to build and run dependencies)
    • run: Dependencies to run the package
    • develop: Dependencies to develop the package (in addition to all other phase dependencies)

    Each points to an object with at least one of these keys:

    • requires: Required to use the package
    • recommends: Not required, but recommended as a best practice
    • suggests: Not required, but nice to have
    • conflicts: Package will not work with these items

    These, in turn, contain at least one of these keys:

    • packages: Other packages, expressed by their package paths, pointing to a version expression
    • external: External dependencies not included with the package
    • contrib: List of Postgres contrib or development packages
  • variations: A list of dependency variations, each with two fields:

    • where: Specifies a configuration that, when matched, includes the dependencies. For example, to specify alternate dependencies for Linux arm64, it would be "platforms": { "linux": "arm64" }. For a specific version of Postgres, it would be something like "postgres": { "version": ">= 16, < 17" }. [This likely needs some deep thinking through.]
    • dependencies: An object containing the other key structures in this object, defining the additional dependencies for environments that match the where field.

The packages and external keys point to objects in which the keys are dependency names and the values are version range expressions (empty string means any version). The postgres.version field is also a version range expression.

The contrib key points to an array of Postgres-supplied apps and extensions. Versions are not necessary because they effectively have the same version as a supported Postgres version.

Example:

{
  "postgres": {
    "version": ">= 12, < 17",
    "with": [ "xml", "uuid", "perl" ]
  },
  "pipeline": "pgrx",
  "platforms": {
    "linux": [ "amd64", "arm64" ],
    "darwin": [ "amd64", "arm64" ],
    "windows": [ "amd64" ],
    "freebsd": [ "amd64" ]
  },
  "prereqs": {
    "configure": {
      "requires": {
        "external": { "cargo-pgrx": "" }
      }
    },
    "build": {
      "requires": {
        "external": {
          "curl": "",
          "awk": "",
          "sed": "",
          "perl": "5.20"
        }
      },
      "recommends": {
        "external": {
          "jq": "",
          "perl": "5.38"
        }
      }
    },
    "test": {
      "requires": {
        "contrib": [ "pg_regress", "plpgsql", "plperl" ],
        "packages": {
          "github.com/theory/pgtap": "1.1.0"
        }
      }
    },
    "run": {
      "requires": {
        "contrib": [ "plperl" ],
        "packages": {
          "github.com/theory/hostname": ""
        }
      }
    }
  }
}

Notes:

The external field is the tricky one. How does one provide a canonical dependency for a third-party library or app that may have loads of different names on different platforms, different packaging systems (apt vs rpm), and even different versions of them, or split apart for development and runtime use? Some possibilities:

  • Use purl and let downstream services handle the conversion (sketched below)
  • Create a rule set that allows one to predict dependency name changes across ecosystems, similar to repology-rules
  • Just use Repology?
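
To make the purl option concrete, the uriparser dependency from the earlier example might map out like this (the Debian name is real; the Fedora name is my assumption):

pkg:generic/uriparser            # canonical purl in the package metadata
pkg:deb/debian/liburiparser-dev  # resolved for Debian/Ubuntu builds
pkg:rpm/fedora/uriparser-devel   # assumed Fedora equivalent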

Other notes:

  • Decide on a version range standard to adopt. Start with the PGXN version range, pep-0508, npm ranges, and Gem::Version.
  • Is pipeline really necessary, given configure requirements? I think so, because it tells the client the preferred build system to use, in case it can’t detect it for some reason.

PGXN Compatibility:

Significantly differs from the existing prereqs object, so this one is keyed under dependencies and we can support either.

Classification

Classification metadata lets the extension developer associate additional information to improve discovery, such as keywords. It might also allow selections from a curated list of extension classifications, such as the category slugs supported for the cargo categories field.

Ecosystem apps use this data to organize extensions under keywords or categories, making it easier for users to find extensions that are frequently used together or for specific workloads or tasks.

Example:

{
  "tags": [
    "testing",
    "pair",
    "parameter"
  ],
  "categories": [
    "Machine Learning"
  ]
}

PGXN Compatibility:

Conflicts with the existing root-level tags array, but we can support either one.

Example: PGML Extension

A hand-coded example that attempts to capture all the complexity of the pgml extension:

{
  "package": "github.com/postgresml/postgresml/pgml-extension",
  "name": "pgml",
  "version": "2.8.2",
  "abstract": "pgml: Created by the PostgresML team",
  "maintainers": [
    {
      "name": "the PostgresML team",
      "url": "https://github.com/postgresml/postgresml/"
    }
  ],
  "license": "MIT",
  "spec": {
    "version": "2.0.0",
    "url": "https://pgxn.org/meta/spec.txt"
  },
  "artifacts": [
    {
      "type": "source zip",
      "url": "https://github.com/postgresml/postgresml/archive/refs/tags/v2.8.2.zip",
      "sha256": "2b9d2416096d2930be51e5332b70bcd97846947777a93e4a3d65fe1b5fd7b004"
    },
    {
      "type": "source tgz",
      "url": "https://github.com/postgresml/postgresml/archive/refs/tags/v2.8.2.tar.gz",
      "sha256": "845f28339c6159ac32daccea1cd17b386ea083c3e60bb8d58fb737725afe7eb5"
    }
  ],
  "resources": {
    "homepage": "https://postgresml.org/",
    "issues": "https://github.com/postgresml/postgresml/issues",
    "documentation": "https://postgresml.org/docs/",
    "support": "https://discord.com/invite/DmyJP3qJ7U",
    "repository": "https://github.com/postgresml/postgresml",
    "badges": [
      {
        "alt": "Tests Passing",
        "src": "https://github.com/postgresml/postgresml/actions/workflows/ci.yml/badge.svg"
      },
      {
        "alt": "Chat Activity",
        "src": "https://camo.githubusercontent.com/1988e7aaaa6a0c8ddcf880faacf1ec3263e23fac4aeff4710922082fde84442a/68747470733a2f2f696d672e736869656c64732e696f2f646973636f72642f31303133383638323433303336393330303939",
        "link": "https://discord.gg/DmyJP3qJ7U"
      }
    ]
  },
  "manifest": {
    "extensions": {
      "pgml": {
        "sql": "sql/pgml--2.8.2.sql",
        "preload": "shared",
        "tle": false,
        "control": {
          "comment": "pgml:  Created by the PostgresML team",
          "default_version": "2.8.2",
          "module_pathname": "$libdir/pgml",
          "relocatable": false,
          "superuser": true,
          "schema": "pgml",
          "trusted": true
        }
      }
    }
  },
  "dependencies": {
    "postgres": {
      "version": ">= 15, < 16"
    },
    "pipeline": "pgxs",
    "platforms": {
      "linux": [
        "amd64",
        "arm64"
      ],
      "darwin": [
        "amd64",
        "arm64"
      ]
    },
    "dependencies": {
      "configure": {
        "requires": {
          "external": {
            "cargo-pgrx": "",
            "bison": "",
            "cmake": "",
            "flex": "",
            "libclang-dev": "",
            "libopenblas-dev": "",
            "libpython3-dev": "",
            "libreadline-dev": "",
            "libssl-dev": "",
            "pkg-config": ""
          }
        }
      },
      "run": {
        "requires": {
          "external": {
            "libopenblas": "",
            "libpython3": "",
            "libreadline": "",
            "libssl": "",
            "python3": ""
          }
        },
        "recommends": {
          "external": {
            "python(pyarrow)": "=11.0.0",
            "python(catboost)": "",
            "python(lightgbm)": "",
            "python(torch)": "",
            "python(torchaudio)": "",
            "python(torchvision)": "",
            "python(xgboost)": "",
            "python(accelerate)": "",
            "python(bitsandbytes)": "",
            "python(ctransformers)": "",
            "python(huggingface-hub)": "",
            "python(deepspeed)": "",
            "python(einops)": "",
            "python(optimum)": "",
            "python(peft)": "",
            "python(tokenizers)": "",
            "python(transformers)": "",
            "python(transformers-stream-generator)": "",
            "python(InstructorEmbedding)": "",
            "python(sentence-transformers)": "",
            "python(rouge)": "",
            "python(sacrebleu)": "",
            "python(sacremoses)": "",
            "python(datasets)": "",
            "python(orjson)": "",
            "python(langchain)": ""
          }
        }
      }
    },
    "variations": [
      {
        "where": {
          "platforms": {
            "linux": []
          }
        },
        "dependencies": {
          "prereqs": {
            "run": {
              "recommends": {
                "external": {
                  "python(auto-gptq)": "",
                  "python(xformers)": ""
                }
              }
            }
          }
        }
      }
    ]
  },
  "classification": {
    "tags": [
      "machine learning",
      "ml",
      "transformers"
    ],
    "categories": [
      "Machine Learning"
    ]
  }
}

Note that PostgresML also provides a dashboard app, but it does not ship with the extension, so it’s not listed here. It could have its own package metadata and be separately distributed.

For the Python dependencies, in the absence of a neutral standard for specifying packages, here they’re specified using a syntax borrowed from RPM to make clear that they’re Python dependencies. A couple of those dependencies are Linux-only, so they’re listed in variations where the OS is Linux.

Registry Metadata

The second metadata format is registry metadata. It starts with a copy of the package metadata, but builds from there, adding information (such as the artifact link, SHA, and cryptographic signature from the registry itself) and sections (defined below).
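A minimal sketch of the shape it might take, with hypothetical field names (the RFC hasn’t settled them):

{
  "package": { "//": "verbatim copy of the package metadata" },
  "registry": {
    "url": "https://master.pgxn.org/dist/pair/0.1.7/pair-0.1.7.zip",
    "sha256": "...",
    "signature": "..."
  },
  "distributions": [],
  "reports": {}
}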

Providing a separate metadata format enables broader community collaboration to augment the metadata for extension packages.

Ideally an ecosystem of trusted services will be developed by various parties, who will report back to the root registry as the source of record for all metadata about extension packages. For example, a binary distribution service like apt.postgresql.org or trunk will submit links and checksums for binary packages and perhaps periodic download stats.

Broadly speaking, registry metadata falls into three categories: Distributions, Reports, and Classifications.

Distributions

A list of links and checksums for downloading binary distributions for the extension package. The root registry page for the package can therefore list all the ways to install an extension and expand the list as new binary registries add it.

Along with the Artifacts package metadata, the distribution metadata allows installer tools to select the best option to install based on the OS, Postgres version, and cluster configuration (e.g., install the community Apt package if the cluster was installed from the community Apt registry).

Each item must have:

  • Registry: The unique name of the distribution registry
  • Type: The type of distribution: Apt, RPM, TLE, etc.
  • URL: Link to the registry’s page for the distribution
  • Command: In lieu of a URL and SHA, instructions to install the distribution, e.g., apt-get install postgresql16-pair

Each distribution registry must be trusted by the root registry to submit this information to add to the metadata, and have its own page with additional contextual information: its home page, docs, basic instructions, whatever. A README of sorts.

Example:

[
  {
    "registry": "trunk.pgxn.org",
    "type": "trunk",
    "command": "pgxn install github.com/example/pair"
  },
  {
    "registry": "apt.postgresql.org",
    "type": "apt",
    "command": "apt-get install postgresql16-pair"
  },
  {
    "registry": "yum.postgresql.org",
    "type": "rpm",
    "command": "yum install postgresql16-pair"
  },
  {
    "registry": "pgt.dev",
    "type": "apt",
    "url": "https://pgt.dev/extensions/pair",
    "command": "trunk install pair"
  },
  {
    "registry": "pgxman.com",
    "type": "apt",
    "url": "https://pgxman.com/x/pair",
    "command": "pgxman install pair"
  },
  {
    "registry": "database.dev",
    "type": "sql",
    "url": "https://database.dev/example/pair",
    "command": "SELECT dbdev.install('example-pair');"
  }
]

Other Options:

  • Information about installing from a registry that doesn’t offer URLs, like Homebrew, where one just needs to know what the formula is called. Maybe the same should apply to Apt?

Metrics and Reports

This object includes data such as number of downloads, build and test status on various Postgres/OS/version combinations, binary packaging distributions, test coverage, security certification, vulnerability detection, quality metrics and user ratings, and more. Some of these bits of data might include badges, e.g., for ratings or security assessments.

In the broader ecosystem, it would be the responsibility of the root registry to ensure that data comes only from trusted sources. However, this data will be publicly readable, allowing any downstream applications to extend and publish metrics and reports with their own information.

Example:

{
    "aggregates": {
      "downloads": 32
    },
    "sources": {
      "smoke-tester.example.com": {
        "stats": {
          "passes": 32,
          "fails": 8,
          "unknown": 0
        },
        "link": "https://smoke-tester.example.com/extension/pair",
        "badge": "https://smoke-tester.example.com/extension/pair.svg"
      },
      "pgt.dev": {
        "downloads": 12,
        "tags": ["two", "kv"],
        "categories": ["Analytics"]
      },
      "pgxman.com": {
        "downloads": 20
      },
      "ratings.example.com": {
        "stats": {
          "ratings": 324,
          "average": 3.2,
          "comments": 13
        },
        "link": "https://ratings.example.com/extension/pair",
        "badge": "https://ratings.example.com/extension/pair.svg"
      },
      "stats.example.com": {
        "contributors": 2,
        "stars": 14,
        "last_active": "2024-01-12",
      },
      "security-scan.example.com": {
        "advisories": {
          "high": 2,
          "moderate": 11,
          "low": 1,
          "cves": 0
        },
        "link": "https://security-scan.example.com/extension/pair",
        "badge": "https://security-scan.example.com/extension/pair.svg"
      }
    }
}

Notes:

  • The aggregates section aggregates results from multiple sources, for example summing all downloads or averaging ratings. The list of items to aggregate could evolve regularly.

  • Each key in sources identifies a trusted downstream source of information. Each would have its own schema describing its objects and their meaning, along with URI templates to link to. For example, stats.example.com might have these templates:

    {
      "contributors": {
        "alt": "List of people who have contributed to the {name} project",
        "badge": "https://stats.example.com/{ package }/contributors.svg",
        "uri": "https://stats.example.com/{ package }/contributors"
      },
      "stars": {
        "alt": "Star ratings for the {name} project",
        "badge": "https://stats.example.com/{ package }/stars.svg",
        "uri": "https://stats.example.com/{ package }/stars"
      }
    }
    
  • It might be worth adopting more of an API of multiple files, rather than one big file. Example: Homebrew analytics. Note also its support for date ranges for analytics (30/90/365 days) and for distinguishing direct installs from installs as a dependency.
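For instance, a multi-file API for the pair extension might expose hypothetical per-range endpoints along these lines, loosely modeled on Homebrew’s analytics URLs:

GET /meta/pair/stats/downloads/30d.json
GET /meta/pair/stats/downloads/90d.json
GET /meta/pair/stats/downloads/365d.json
GET /meta/pair/stats/builds/latest.json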

Contemplating Decentralized Extension Publishing

TL;DR

As I think through the future of the Postgres extension ecosystem as a key part of the new job, I wanted to understand how Go decentralized publishing works. In this post I work it out, and think through how we might do something similar for Postgres extension publishing. It covers the Go architecture, namespacing challenges, and PGXS abuse; then experiments with URL-based namespacing and ponders reorganizing installed extension files; and closes with a high-level design for making it work now and in the future.

It is, admittedly, a lot, mainly written for my own edification and for the information of my fellow extension-releasing travelers.

I found it fascinating and learned a ton. Maybe you will too! But feel free to skip this post if you’re less interested in the details of the journey and want to wait for more decisive posts once I’ve reached the destination.

Introduction

Most language registries require developers to take some step to make releases. Many automate the process in CI/CD pipelines, but it requires some amount of effort on the developer’s part:

  • Register for an account
  • Learn how to format things to publish a release
  • Remember to publish again for every new version
  • Create a pipeline to automate publishing (e.g., a GitHub workflow)

Decentralized Publishing

Go’s decentralized publishing has revised this pattern: it does not require user registration or authentication to publish a module to pkg.go.dev. Rather, Go developers simply tag the source repository, and the first time someone refers to the tag in Go tools, the Go module index will include it.

For example, publishing v1.2.1 of a module in the github.com/golang/example repository takes just three commands:

git tag v1.2.1 -sm 'Tag v1.2.1'
git push --tags
go list -m github.com/golang/example@v1.2.1

After a few minutes, the module will show up in the index and then on pkg.go.dev. Anyone can run go get -u github.com/golang/example to get the latest version. Go developers rest easy in the knowledge that they’re getting the exact module they need thanks to the global checksum database, which Go uses “in many situations to detect misbehavior by proxies or origin servers”.

This design requires go get to understand multiple source code management systems: it supports Git, Subversion, Mercurial, Bazaar, and Fossil.1 It also needs the go.mod metadata file to live in the project defining the package.
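For reference, the go.mod for such a module need contain little more than the module path, which doubles as its globally unique name:

module github.com/golang/example

go 1.21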

But that’s really it. From the developer’s perspective it could not be easier to publish a module, because it’s a natural extension of the module development tooling and workflow of committing, tagging, and fetching code.

Decentralized Extension Publishing

Could we publish Postgres extensions in such a decentralized pattern? It might look something like this:

  • The developer places a metadata file in the proper location (control file, META.json, Cargo.toml, whatever — standard TBD)
  • To publish a release, the developer tags the repository and calls some sort of indexing service hook (perhaps from a tag-triggered release workflow)
  • The indexing service validates the extension and adds it to the index

Note that there is no registration required. It simply trusts the source code repository. It also avoids name collision: github.com/bob/hash is distinct from github.com/carol/hash.
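A sketch of what that workflow might look like from the developer’s side, assuming a hypothetical indexing service at index.example.com (both the endpoint and the payload are invented for illustration):

git tag v0.1.0 -sm 'Tag v0.1.0'
git push --tags
curl -X POST https://index.example.com/publish \
     --data '{"repository": "https://github.com/example/pair", "tag": "v0.1.0"}'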

This design does raise challenges for clients, whether they’re compiling extensions on a production system or building binary packages for distribution: they have to support various version control systems to pull the code (though starting with Git is a decent 90% solution).

Namespacing

Then there are name conflicts. Perhaps github.com/bob/hash and github.com/carol/hash both create an extension named hash. By the current control file format, the script directory and module path can use any name, but in all likelihood they use these defaults:

directory = 'extension'
module_pathname = '$libdir/hash'

This means .sql files will be installed in the Postgres share/extension subdirectory — along with all the other installed extensions — and library files will be installed in the library directory along with all other libraries. Something like this:

pgsql
├── lib
│   └── hash.so
└── share
    ├── extension
    │   ├── hash.control
    │   └── hash--1.0.0.sql
    └── doc
        └── hash.md

If both projects include, say, hash.control, hash--1.0.0.sql, and hash.so, the files from one will stomp all over the files of the other.

Installer Abuse

Go avoids this issue by using the domain and path from each package’s repository in its directory structure. The ~/go/pkg/mod directory has subdirectories for each VCS host name, and each of those has subdirectories for package paths. Modules from google.golang.org repositories, for example, all land under a google.golang.org directory. For the github.com/bob/hash example, the files would all live in ~/go/pkg/mod/github.com/bob/hash.

Could a Postgres extension build tool follow a similar distributed pattern by renaming the control file and installation files and directories to something specific for each, say github.com+bob+hash and github.com+carol+hash? That is, using the repository host name and path, but replacing the slashes in the path with some other character that wouldn’t create subdirectories — because PostgreSQL won’t find control files in subdirectories. The control file entries for github.com/carol/hash would look like this:

directory = 'github.com+carol+hash'
module_pathname = '$libdir/github.com+carol+hash'

Since PostgreSQL expects the control file to have the same name as the extension, and for SQL scripts to start with that name, the files would have to be named like so:

hash
├── Makefile
├── github.com+carol+hash.control
└── sql
    └── github.com+carol+hash--1.0.0.sql

And the Makefile contents:

EXTENSION  = github.com+carol+hash
MODULEDIR  = $(EXTENSION)
DATA       = sql/$(EXTENSION)--1.0.0.sql
PG_CONFIG ?= pg_config

PGXS := $(shell $(PG_CONFIG) --pgxs)
include $(PGXS)

In other words, the extension name is the full repository host name and path, and the Makefile MODULEDIR variable tells the PGXS build infrastructure to put all the SQL and documentation files into a directory named github.com+carol+hash — preventing them from conflicting with any other extension.

Finally, the github.com+carol+hash.control file — so named because it must have the same name as the extension — contains:

default_version = '1.0.0'
relocatable = true
directory = 'github.com+carol+hash'
module_pathname = '$libdir/github.com+carol+hash'

Note the directory parameter, which must match MODULEDIR from the Makefile, so that CREATE EXTENSION can find the SQL files. Meanwhile, module_pathname ensures that the library file has a unique name — the same as the long extension name — again to avoid conflicts with other projects.

That unsightly naming extends to SQL: using the URL format could get to be a mouthful:

CREATE EXTENSION "github.com+carol+hash";

Which is do-able, but some new SQL syntax might be useful, perhaps something like:

CREATE EXTENSION hash FROM "github.com+carol+hash";

Or, if we’re gonna really go for it, use slashes after all!

CREATE EXTENSION hash FROM "github.com/carol/hash";

Want to use both extensions but they have conflicting objects (e.g., both create a “hash” data type)? Put them into separate schemas (assuming relocatable = true in the control file):

CREATE EXTENSION hash FROM "github.com/carol/hash" WITH SCHEMA carol;
CREATE EXTENSION hash FROM "github.com/bob/hash" WITH SCHEMA bob;
CREATE TABLE try (
    h1 carol.hash,
    h2 bob.hash
);

Of course it would be nice if PostgreSQL added support for something like Oracle packages, but using schemas in the meantime may be sufficient.

Clearly we’re getting into changes to the PostgreSQL core, so let’s put that aside; in the meantime we can just use long names for creating, modifying, and dropping extensions, but not necessarily otherwise:

CREATE EXTENSION "github.com+carol+hash" WITH SCHEMA carol;
CREATE EXTENSION "github.com+bob+hash" WITH SCHEMA bob;
CREATE EXTENSION "gitlab.com+barack+kicker_type";
CREATE TABLE try (
    h1 carol.hash,
    h2 bob.hash,
    kt kicker
);

Namespacing Experiment

To confirm that this approach might work, I committed 24134fd and pushed it in the namespace-experiment branch of the semver extension. This commit changes the extension name from semver to github.com+theory+pg-semver, and follows the above steps to ensure that its files are installed with that name.

Abusing the Postgres extension installation infrastructure like this does work, but suffers from a number of drawbacks, including:

  • The extension name is super long, as before, but now so too are the files in the repository (as opposed to the installer renaming them on install). The shared library file has to have the long name, and so, therefore, does the .c source file. The SQL files must all start with github.com+theory+pg-semver, although I skipped that bit in this commit; instead the Makefile generates just one from sql/semver.sql.
  • Any previous installation of the semver type would remain unchanged, with no upgrade path. Changing an extension’s name isn’t a great idea.

I could probably script renaming and modifying file contents like this and make it part of the build process, but it starts to get complicated. We could also modify installers to make the changes, but there are a bunch of moving parts they would have to compensate for, and given how dynamic this can be (e.g., the semver Makefile reads the extension name from META.json), we would rapidly enter the territory of edge case whac-a-mole. I suspect it’s simply too error-prone.

Proposal: Update Postgres Extension Packaging

Perhaps the Go directory pattern could inspire a similar model in Postgres, eliminating the namespace issue by teaching the Postgres extension infrastructure to include all but one of the files for an extension in a single directory. In other words, rather than files distributed like so for semver:

pgsql
├── lib
│   └── semver.so
└── share
    ├── extension
    │   ├── semver.control
    │   ├── semver--0.32.1.sql
    │   └── semver--0.32.0--0.32.1.sql
    └── doc
        └── semver.md

Make it more like this:

pgsql
└── share
    └── extension
        └── github.com
            └── theory
                └── pg-semver
                    ├── extension.control
                    ├── lib
                    │   └── semver.so
                    ├── sql
                    │   ├── semver--0.32.1.sql
                    │   └── semver--0.32.0--0.32.1.sql
                    └── doc
                        └── semver.md

Or perhaps:

pgsql
└── share
    └── extension
        └── github.com
            └── theory
                └── pg-semver
                    ├── extension.control
                    ├── semver.so
                    ├── semver--0.32.1.sql
                    ├── semver--0.32.0--0.32.1.sql
                    └── semver.md

The idea is to copy the files exactly as they’re stored in or compiled in the repository. Meanwhile, the new semver.name file — the only relevant file stored outside the extension module directory — simply points to that path:

github.com/theory/pg-semver

Then for CREATE EXTENSION semver, Postgres reads semver.name and knows where to find all the files to load the extension.

This configuration would require updates to the control file, now named extension.control, to record the full package name and appropriate locations. Add:

name = 'semver'
package = 'github.com/theory/pg-semver'

This pattern could also allow aliasing. Say we try to install a different semver extension from github.com/example/pg-semver. This is in its extension.control file:

name = 'semver'
package = 'github.com/example/pg-semver'

The installer detects that semver.name already exists for a different package and raises an error. The user could then give it a different name by running something like:

make install ALIAS_EXTENSION_NAME=semver2

This would add semver2.name right next to semver.name, and its contents would contain github.com/example/pg-semver, where all of its files are installed. This would allow CREATE EXTENSION semver2 to load it without issue (assuming no object conflicts, hopefully resolved by relocate-ability).
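At that point the share/extension directory would contain two pointer files, something like this (a sketch of the hypothetical .name layout described above):

$ cat share/extension/semver.name
github.com/theory/pg-semver

$ cat share/extension/semver2.name
github.com/example/pg-semver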

I realize that a lot of extensions with shared libraries could wreak some havoc on the library resolver, which would have to search so many library directories, but perhaps there’s some way around that as well? Curious what techniques experienced C developers might have adopted.

Back to Decentralized Publishing

An updated installed extension file structure would be nice, and is surely worth a discussion, but even if it shipped in Postgres 20, we need an updated extension ecosystem today, to work well with all supported versions of Postgres. So let’s return to the idea of decentralized publishing without such changes.

I can think of two pieces that’d be required to get Go-style decentralized extension publishing to work with the current infrastructure.

Module Uniqueness

The first is to specify a new metadata field to be unique for the entire index, and which would contain the repository path. Call it module, after Go (a single Git repository can have multiple modules). In PGXN Meta Spec-style JSON it’d look something like this:

{
    "module": "github.com/theory/pg-semver",
    "version": "0.32.1",
    "provides": {
      "semver": {
         "abstract": "A semantic version data type",
      }
    }
}

Switch from the PGXN-style uniqueness on the distribution name (usually the name of the extension) to uniqueness on the module, globally. This would allow another party to release an extension with the same name. Even a fork where only the module is changed:

{
    "module": "github.com/example/pg-semver",
    "version": "0.32.1",
    "provides": {
      "semver": {
         "abstract": "A semantic version data type",
      }
    }
}

Both would be indexed and appear under the module name, and both would be find-able by the provided extension name, semver.

Where that name must still be unique is in a given install. In other words, while github.com/theory/pg-semver and github.com/example/pg-semver both exist in the index, the semver extension can be installed from only one of them in a given Postgres system, where the extension name semver defines its uniqueness.

This pattern would allow for much more duplication of ideas while preserving the existing per-cluster namespacing. It also allows for a future Postgres release that supports something like the flexible per-cluster packaging as described above.2

Extension Toolchain App

The second piece is an extension management application that understands all this stuff and makes it possible. It would empower both extension development workflows — including testing, metadata management, and releasing — and extension user workflows — finding, downloading, building, and installing.

Stealing from Go, imagine a developer making a release with something like this:

git tag v1.2.1 -sm 'Tag v1.2.1'
git push --tags
pgmod list -m github.com/theory/pg-semver@v1.2.1

The creatively named pgmod tells the registry to index the new version directly from its Git repository. Thereafter anyone can find it and install it with:

  • pgmod get github.com/theory/pg-semver@v1.2.1 — installs the specified version
  • pgmod get github.com/theory/pg-semver — installs the latest version
  • pgmod get semver — installs the latest version or shows a list of matching modules to select from

Any of these would fail if the cluster already has an extension named semver with a different module name. But with something like the updated extension installation locations in a future version of Postgres, that limitation could be loosened.

Challenges

Every new idea comes with challenges, and this little thought experiment is no exception. Some that immediately occur to me:

  • Not every extension can be installed directly from its repository. Perhaps the metadata could include a download link for a tarball with the results of any pre-release execution?
  • Adoption of a new CLI could be tricky. It would be useful to include the functionality in existing tools people already use, like pgrx.
  • Updating the uniqueness constraint in existing systems like PGXN might be a challenge. Most record the repository info in the resources META.json object, so it would be do-able to adapt into a new META format, either on PGXN itself or in a new registry, should we choose to build one.
  • Getting everyone to standardize on versioning tags might take some effort. Go had the benefit of controlling its entire toolchain, while Postgres extension versioning and release management has been all over the place. However, PGXN long ago standardized on semantic versioning, and those who have released extensions on PGXN have had few issues (one can still use other version formats in the control file, for better or worse).
  • Some PGXN distributions have shipped different versions of extensions in a single release, or the same version as in other releases. The release version of the overall package (repository, really) would have to become canonical.

I’m sure there are more, I just thought of these offhand. What have you thought of? Post ’em if you got ’em in the #extensions channel on the Postgres Slack, or give me a holler on Mastodon or via email.


  1. Or does it? Yes, it does. Although the Go CLI downloads most public modules from a module proxy server like proxy.golang.org, it still must know how to download modules from a version control system when a proxy is not available. ↩︎

  2. Assuming, of course, that if and when the Postgres core adopts more bundled packaging that they’d use the same naming convention as we have in the broader ecosystem. Not a perfectly safe assumption, but given the Go precedent and wide adoption of host/path-based projects, it seems sound. ↩︎