
Problems with Standby Clusters #2108

@SockenSalat

Description


I'm experimenting with standby clusters via S3 storage and ran into some
problems with postgres-operator 4.5.1.

The scenario is as follows:

  1. I have a cluster test-db with local and S3 storage.
  2. I have a standby cluster test-db-standby in another k8s cluster, which syncs via the S3 storage.

Setup of the test-db:

pgo -n postgres \
	create cluster test-db \
	--sync-replication \
	--pgbackrest-storage-type=local,s3 \
	--password-superuser=superUser123 \
	--password-replication=replication123 \
	--password=password123

Setup of the test-db-standby:

pgo -n postgres \
	create cluster test-db-standby \
	--standby \
	--sync-replication \
	--pgbackrest-storage-type=local,s3 \
	--pgbackrest-repo-path=/backrestrepo/test-db-backrest-shared-repo/ \
	--password-superuser=superUser123 \
	--password-replication=replication123 \
	--password=password123

My intention is then to switch the roles of the clusters back and forth, to ensure the
procedure is repeatable.

I'm trying to keep this post short, so I'm not writing down every step or detail that is obvious.
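For reference, the round trip I'm attempting follows the documented standby workflow. This is a sketch of the commands as I understand them from the 4.5 docs; flag names may differ between versions:

```shell
# 1. Promote the standby cluster so it starts accepting writes
pgo -n postgres update cluster test-db-standby --promote-standby

# 2. Shut down the old primary before demoting it, so both clusters
#    never write to the shared S3 repo at the same time
pgo -n postgres update cluster test-db --shutdown

# 3. Convert the old primary into a standby and start it again
pgo -n postgres update cluster test-db --enable-standby
pgo -n postgres update cluster test-db --startup
```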

Errors when promoting the standby cluster

After pgo update cluster test-db-standby --promote-standby

the database pod seems to be OK, but the operator spams the log with:

2020-12-09 16:16:20 time="2020-12-09T15:16:20Z" level=error msg="command terminated with exit code 1" func="internal/kubeapi.ExecToPodThroughAPI()" file="internal/kubeapi/exec.go:76" version=4.5.1

The reason seems to be the following, found in the logs of the database container:

ERROR:  syntax error at or near "'SELECT pg_is_in_recovery();'" at character 1
STATEMENT:  'SELECT pg_is_in_recovery();'

Looks like the single quotes are the problem 🤔
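My guess is that the operator execs psql inside the pod with an SQL string that has already been wrapped in an extra pair of single quotes, so the quotes survive into the statement itself. A minimal sketch of the effect, using printf as a stand-in for psql -c so the argument that actually arrives is visible:

```shell
# Correctly quoted: the SQL arrives bare and would be valid
printf 'psql got: %s\n' "SELECT pg_is_in_recovery();"
# psql got: SELECT pg_is_in_recovery();

# Double-quoted: the inner single quotes become part of the statement,
# reproducing the syntax error at or near "'SELECT pg_is_in_recovery();'"
printf 'psql got: %s\n' "'SELECT pg_is_in_recovery();'"
# psql got: 'SELECT pg_is_in_recovery();'
```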

After a while the operator stops spamming and logs:

level=error msg="timed out waiting for cluster test-db-standby to accept writes after disabling standby mode" func="internal/controller/pod.(*Controller).onUpdate()" file="internal/controller/pod/podcontroller.go:133" version=4.5.1

but the database is writable and everything seems to be fine.

Errors when demoting the old primary cluster to standby

After starting up the new standby cluster test-db, no primary instance is found:

bash-4.2$ pgo test test-db

cluster : test-db
	Services
		primary (10.100.229.181:5432): UP
	Instances
		replica (test-db-8468b484-vbs9t): UP

and the operator reports:

level=error msg="Unable to find primary when creating backrest job test-db-stanza-create" func="internal/operator/backrest.Backrest()" file="internal/operator/backrest/backup.go:95" version=4.5.1

which I solved with a manual failover.
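For anyone hitting the same thing, the manual failover was along these lines. The target name below is a placeholder; --query lists the actual candidates:

```shell
# List failover candidates for the cluster
pgo -n postgres failover test-db --query

# Promote one of the listed targets (placeholder name)
pgo -n postgres failover test-db --target=<replica-name>
```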

After that, the new standby cluster behaves as expected.

Errors when reversing the role switch

When I try to promote test-db back to primary, this step does not work and the operator logs:

level=error msg="Configuration is missing from configMap" func="internal/controller/pgcluster.(*Controller).onUpdate()" file="internal/controller/pgcluster/pgclustercontroller.go:219" version=4.5.1

while the database container says:

INFO: no action. i am the standby leader with the lock

but pgo show cluster test-db has another opinion:

bash-4.2$ pgo show cluster test-db

cluster : test-db (crunchy-postgres-ha:centos7-12.5-4.5.1)
	pod : test-db-8468b484-vbs9t (Running) on node-4.ote.bdorf.yars.io (2/2) (primary)
		pvc: test-db (2Gi)
	resources : Memory: 128Mi
	deployment : test-db
	deployment : test-db-backrest-shared-repo
	service : test-db - ClusterIP (10.100.229.181) - Ports (9187/TCP, 2022/TCP, 5432/TCP)
	labels : sync-replication=true workflowid=29edafaf-e7e3-49e2-aa57-8291638d1766 crunchy-pgbadger=false crunchy-postgres-exporter=true deployment-name=test-db name=test-db pgo-backrest=true pgo-version=4.5.1 autofail=true crunchy-pgha-scope=test-db pg-cluster=test-db pg-pod-anti-affinity= pgouser=admin 
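When the container log and pgo show disagree like this, the database itself can settle which role the instance is actually in. A sketch, assuming kubectl access to the pod and that the PostgreSQL container is named "database":

```shell
# Ask PostgreSQL whether this instance is in recovery (standby) mode.
# "t" means it is still a standby, "f" means it is a writable primary.
# Pod name taken from the pgo show output above.
kubectl -n postgres exec test-db-8468b484-vbs9t -c database -- \
    psql -U postgres -tAc "SELECT pg_is_in_recovery();"
```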
