Description
I'm experimenting with standby clusters via S3 storage and ran into some
problems with postgres-operator 4.5.1.
The scenario is as follows:
- I have a cluster test-db with local and S3 storage
- I have a standby cluster test-db-standby in another k8s cluster which syncs via the S3 storage
Setup of the test-db:
pgo -n postgres \
create cluster test-db \
--sync-replication \
--pgbackrest-storage-type=local,s3 \
--password-superuser=superUser123 \
--password-replication=replication123 \
--password=password123
Setup of the test-db-standby:
pgo -n postgres \
create cluster test-db-standby \
--standby \
--sync-replication \
--pgbackrest-storage-type=local,s3 \
--pgbackrest-repo-path=/backrestrepo/test-db-backrest-shared-repo/ \
--password-superuser=superUser123 \
--password-replication=replication123 \
--password=password123
My intention is then to switch the roles of the clusters back and forth, to ensure the
procedure is repeatable.
To keep this post short, I'm not writing down every step or detail that is obvious.
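For context, the rough role-switch sequence I had in mind (a sketch pieced together from the 4.5 standby docs; flag names may be slightly off for your version):
# promote the standby cluster to an active cluster
pgo -n postgres update cluster test-db-standby --promote-standby
# convert the old primary into a standby: shut it down, enable standby mode, start it again
pgo -n postgres update cluster test-db --shutdown
pgo -n postgres update cluster test-db --enable-standby
pgo -n postgres update cluster test-db --startup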
Errors when promoting the standby cluster
After pgo update cluster test-db-standby --promote-standby
the database pod seems to be OK, but the operator spams the log with:
2020-12-09 16:16:20 time="2020-12-09T15:16:20Z" level=error msg="command terminated with exit code 1" func="internal/kubeapi.ExecToPodThroughAPI()" file="internal/kubeapi/exec.go:76" version=4.5.1
The reason seems to be this entry in the logs of the database container:
ERROR: syntax error at or near "'SELECT pg_is_in_recovery();'" at character 1
STATEMENT: 'SELECT pg_is_in_recovery();'
Looks like the surrounding single quotes are the problem 🤔
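A minimal reproduction of that quoting problem from inside the database container (just to illustrate the error message, not taken from the operator code):
psql -c "'SELECT pg_is_in_recovery();'"
ERROR:  syntax error at or near "'SELECT pg_is_in_recovery();'"
LINE 1: 'SELECT pg_is_in_recovery();'
        ^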
After a while the operator stops spamming and instead logs:
level=error msg="timed out waiting for cluster test-db-standby to accept writes after disabling standby mode" func="internal/controller/pod.(*Controller).onUpdate()" file="internal/controller/pod/podcontroller.go:133" version=4.5.1
But the database is writable and all seems to be fine.
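For what it's worth, a quick manual check from inside the database container confirms this (my own verification, not something the operator runs; smoke_test is just a throwaway table name):
psql -c "SELECT pg_is_in_recovery();"         # returns f once the promotion has gone through
psql -c "CREATE TABLE IF NOT EXISTS smoke_test (id int)"
psql -c "INSERT INTO smoke_test VALUES (1)"   # succeeds, so writes are accepted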
Errors when downgrading the old primary cluster to standby
After startup of the new standby cluster test-db, no primary instance is found:
bash-4.2$ pgo test test-db
cluster : test-db
Services
primary (10.100.229.181:5432): UP
Instances
replica (test-db-8468b484-vbs9t): UP
and the operator reports:
level=error msg="Unable to find primary when creating backrest job test-db-stanza-create" func="internal/operator/backrest.Backrest()" file="internal/operator/backrest/backup.go:95" version=4.5.1
which I solved with a manual failover.
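For reference, the manual failover was roughly this (the target name is whatever the --query output lists for the cluster):
pgo -n postgres failover test-db --query
pgo -n postgres failover test-db --target=<target-from-query>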
After that, the new standby cluster behaves as expected.
Errors when reversing the role switch
When I try to promote test-db again to become the primary, this step does not work and the operator logs:
level=error msg="Configuration is missing from configMap" func="internal/controller/pgcluster.(*Controller).onUpdate()" file="internal/controller/pgcluster/pgclustercontroller.go:219" version=4.5.1
while the database container says:
INFO: no action. i am the standby leader with the lock
but pgo show cluster test-db has another opinion:
bash-4.2$ pgo show cluster test-db
cluster : test-db (crunchy-postgres-ha:centos7-12.5-4.5.1)
pod : test-db-8468b484-vbs9t (Running) on node-4.ote.bdorf.yars.io (2/2) (primary)
pvc: test-db (2Gi)
resources : Memory: 128Mi
deployment : test-db
deployment : test-db-backrest-shared-repo
service : test-db - ClusterIP (10.100.229.181) - Ports (9187/TCP, 2022/TCP, 5432/TCP)
labels : sync-replication=true workflowid=29edafaf-e7e3-49e2-aa57-8291638d1766 crunchy-pgbadger=false crunchy-postgres-exporter=true deployment-name=test-db name=test-db pgo-backrest=true pgo-version=4.5.1 autofail=true crunchy-pgha-scope=test-db pg-cluster=test-db pg-pod-anti-affinity= pgouser=admin
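Two things I checked to reconcile these conflicting views (names are my best guess for 4.5): whether the cluster's pgha ConfigMap still has its configuration entries, and what Patroni itself reports from inside the pod (patronictl may need -c pointing at the local Patroni config):
kubectl -n postgres get configmap test-db-pgha-config -o yaml
kubectl -n postgres exec -it test-db-8468b484-vbs9t -c database -- patronictl list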