
Problems with Standby Clusters #2108

@SockenSalat

Description


I'm experimenting with standby clusters via S3 storage and ran into some
problems with postgres-operator 4.5.1.

The scenario is as follows:

  1. I have a cluster test-db with local and S3 storage.
  2. I have a standby cluster test-db-standby in another k8s cluster, which syncs via the S3 storage.

Setup of the test-db:

pgo -n postgres \
	create cluster test-db \
	--sync-replication \
	--pgbackrest-storage-type=local,s3 \
	--password-superuser=superUser123 \
	--password-replication=replication123 \
	--password=password123

Setup of the test-db-standby:

pgo -n postgres \
	create cluster test-db-standby \
	--standby \
	--sync-replication \
	--pgbackrest-storage-type=local,s3 \
	--pgbackrest-repo-path=/backrestrepo/test-db-backrest-shared-repo/ \
	--password-superuser=superUser123 \
	--password-replication=replication123 \
	--password=password123

My intention is then to switch the roles of the clusters back and forth, to ensure the
procedure is repeatable.

I'm trying to keep this post short, so I'm not writing down every step or detail that is obvious.
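For reference, the round trip I'm attempting follows the documented standby workflow. This is a sketch of the commands as I understand them from the 4.5 docs; flag names may differ between versions:

```shell
# 1. Promote the standby cluster so it starts accepting writes
pgo -n postgres update cluster test-db-standby --promote-standby

# 2. Shut down the old primary before demoting it, so both clusters
#    never write to the shared S3 repo at the same time
pgo -n postgres update cluster test-db --shutdown

# 3. Convert the old primary into a standby and start it again
pgo -n postgres update cluster test-db --enable-standby
pgo -n postgres update cluster test-db --startup
```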

Errors when promoting the standby cluster

After pgo update cluster test-db-standby --promote-standby

the database pod seems to be OK, but the operator spams the log with:

2020-12-09 16:16:20 time="2020-12-09T15:16:20Z" level=error msg="command terminated with exit code 1" func="internal/kubeapi.ExecToPodThroughAPI()" file="internal/kubeapi/exec.go:76" version=4.5.1

The reason seems to be the following, found in the logs of the database container:

ERROR:  syntax error at or near "'SELECT pg_is_in_recovery();'" at character 1
STATEMENT:  'SELECT pg_is_in_recovery();'

Looks like the single quotes are the problem 🤔
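My guess is that the operator execs psql inside the pod with an SQL string that has already been wrapped in an extra pair of single quotes, so the quotes survive into the statement itself. A minimal sketch of the effect, using printf as a stand-in for psql -c so the argument that actually arrives is visible:

```shell
# Correctly quoted: the SQL arrives bare and would be valid
printf 'psql got: %s\n' "SELECT pg_is_in_recovery();"
# psql got: SELECT pg_is_in_recovery();

# Double-quoted: the inner single quotes become part of the statement,
# reproducing the syntax error at or near "'SELECT pg_is_in_recovery();'"
printf 'psql got: %s\n' "'SELECT pg_is_in_recovery();'"
# psql got: 'SELECT pg_is_in_recovery();'
```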

After a while the operator stops spamming and logs:

level=error msg="timed out waiting for cluster test-db-standby to accept writes after disabling standby mode" func="internal/controller/pod.(*Controller).onUpdate()" file="internal/controller/pod/podcontroller.go:133" version=4.5.1

but the database is writable and everything seems to be fine.

Errors when demoting the old primary cluster to standby

After starting up the new standby cluster test-db, no primary instance is found:

bash-4.2$ pgo test test-db

cluster : test-db
	Services
		primary (10.100.229.181:5432): UP
	Instances
		replica (test-db-8468b484-vbs9t): UP

and the operator reports:

level=error msg="Unable to find primary when creating backrest job test-db-stanza-create" func="internal/operator/backrest.Backrest()" file="internal/operator/backrest/backup.go:95" version=4.5.1

which I solved with a manual failover.
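For anyone hitting the same thing, the manual failover was along these lines. The target name below is a placeholder; --query lists the actual candidates:

```shell
# List failover candidates for the cluster
pgo -n postgres failover test-db --query

# Promote one of the listed targets (placeholder name)
pgo -n postgres failover test-db --target=<replica-name>
```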

After that, the new standby cluster behaves as expected.

Errors when reversing the role switch

When I try to promote test-db back to primary, this step does not work and the operator logs:

level=error msg="Configuration is missing from configMap" func="internal/controller/pgcluster.(*Controller).onUpdate()" file="internal/controller/pgcluster/pgclustercontroller.go:219" version=4.5.1

while the database container says:

INFO: no action. i am the standby leader with the lock

but pgo show cluster test-db has another opinion:

bash-4.2$ pgo show cluster test-db

cluster : test-db (crunchy-postgres-ha:centos7-12.5-4.5.1)
	pod : test-db-8468b484-vbs9t (Running) on node-4.ote.bdorf.yars.io (2/2) (primary)
		pvc: test-db (2Gi)
	resources : Memory: 128Mi
	deployment : test-db
	deployment : test-db-backrest-shared-repo
	service : test-db - ClusterIP (10.100.229.181) - Ports (9187/TCP, 2022/TCP, 5432/TCP)
	labels : sync-replication=true workflowid=29edafaf-e7e3-49e2-aa57-8291638d1766 crunchy-pgbadger=false crunchy-postgres-exporter=true deployment-name=test-db name=test-db pgo-backrest=true pgo-version=4.5.1 autofail=true crunchy-pgha-scope=test-db pg-cluster=test-db pg-pod-anti-affinity= pgouser=admin 
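When the container log and pgo show disagree like this, the database itself can settle which role the instance is actually in. A sketch, assuming kubectl access to the pod and that the PostgreSQL container is named "database":

```shell
# Ask PostgreSQL whether this instance is in recovery (standby) mode.
# "t" means it is still a standby, "f" means it is a writable primary.
# Pod name taken from the pgo show output above.
kubectl -n postgres exec test-db-8468b484-vbs9t -c database -- \
    psql -U postgres -tAc "SELECT pg_is_in_recovery();"
```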
