Describe the bug
The service accessing the DB runs into connection refused / connection timeout errors.
Expected behavior
The application should always be able to communicate with the db.
Deployment:
The DB cluster is a PostgreSQL cluster deployed with the Postgres Operator, with 1 primary instance
and 2 replica instances.
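For reference, a cluster with this topology can be created through the pgo client roughly as follows. This is only a sketch: the cluster name and namespace are taken from the commands later in this report, --replica-count is assumed to be available as in pgo 4.x, and the storage/WAL options actually used for this cluster are not known.
# Hypothetical example; actual flags (storage config, WAL PVC, resources) are unknown.
pgo create cluster edr-incident-db --replica-count=2 -n sedsep-verify-edr-incident-db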
Diagnosis:
Upon examining the cluster, we found that the primary instance had its WAL PVC filled up and was crashing with 'No space left on device'.
Only the primary instance, out of the 1 primary and 2 replicas, had its WAL PVC filled up. Deleting the primary pod recovered the cluster temporarily,
as one of the replicas was promoted to primary, but the new primary's WAL PVC then started filling up and reached 100% capacity within a few hours.
We could see that the old WAL segment files on the primary instance under the pgwal dir were not being cleaned up as new segment files continued to be created.
Upon examining the processes, we could see 2 walsender processes on the primary and 1 walreceiver process on each replica.
We also found the following log on the primary instance, which hints that the primary could be holding onto the WAL segments in the hope of replicating them to a failed replica:
postgresql.log.day-12:2020-12-12 01:31:39 UTC:240.0.10.23(39708):primaryuser@[unknown]:[1162]: LOG: terminating walsender process due to replication timeout
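A quick way to confirm whether the primary is retaining WAL for a disconnected standby is to compare pg_stat_replication against pg_replication_slots. The commands below are only a sketch: the container name 'database' and passwordless local psql access are assumptions about this deployment.
# Hypothetical diagnostic; which standbys are actually streaming right now?
kubectl exec edr-incident-db-856f775fdd-dprqb -c database -- \
  psql -U postgres -c "SELECT application_name, state, sent_lsn, replay_lsn FROM pg_stat_replication;"
# Hypothetical diagnostic; an inactive slot with an old restart_lsn pins WAL on the primary.
kubectl exec edr-incident-db-856f775fdd-dprqb -c database -- \
  psql -U postgres -c "SELECT slot_name, active, restart_lsn FROM pg_replication_slots;"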
However, in the output of the pgo show cluster command we could see that pod 'edr-incident-db-ulqo-5bc689d4d9-bxgwd' was already in a failed state (it also showed as 'Evicted' in the kubectl get pods output):
The pgo df command returned an error in this condition.
$kubectl get pods
NAME READY STATUS RESTARTS AGE
apmia-postgres-monitor-d99dd58db-s869s 1/1 Running 0 16d
edr-incident-db-856f775fdd-dprqb 2/3 Running 0 20d
edr-incident-db-backrest-shared-repo-744664c678-4zvrb 1/1 Running 0 16d
edr-incident-db-dwdu-56f577f867-gn4c4 3/3 Running 0 16d
edr-incident-db-ulqo-5bc689d4d9-2v9xf 3/3 Running 0 13h
edr-incident-db-ulqo-5bc689d4d9-bxgwd 0/3 Evicted 0 20d
pgo-client-86bbdcc77d-pt9qq 1/1 Running 0 20d
pgo-command-runner-26-5fl78 0/1 Completed 0 7h57m
postgres-operator-6c5d8fb6c9-mhqt7 4/4 Running 0 20d
bash-4.2$ pgo show cluster edr-incident-db -n sedsep-verify-edr-incident-db
cluster : edr-incident-db (crunchy-postgres-ha:centos7-12.2-4.3.0)
pod : edr-incident-db-856f775fdd-dprqb (Running) on gke-svcstus-saas-gke1-pool-01-0ab246e5-n02t (2/3) (primary)
pvc : edr-incident-db
pod : edr-incident-db-dwdu-56f577f867-gn4c4 (Running) on gke-svcstus-saas-gke1-pool-01-916322a4-pe7x (3/3) (replica)
pvc : edr-incident-db-dwdu
pod : edr-incident-db-ulqo-5bc689d4d9-2v9xf (Running) on gke-svcstus-saas-gke1-pool-01-0ab246e5-idz8 (3/3) (replica)
pvc : edr-incident-db-ulqo
pod : edr-incident-db-ulqo-5bc689d4d9-bxgwd (Failed) on gke-svcstus-saas-gke1-pool-01-0ab246e5-79ix (0/0) (primary)
pvc : edr-incident-db-ulqo
resources : CPU: 2 Memory: 16G
storage : Primary=50G Replica=50G
deployment : edr-incident-db
deployment : edr-incident-db-backrest-shared-repo
deployment : edr-incident-db-dwdu
deployment : edr-incident-db-ulqo
service : edr-incident-db - ClusterIP (172.24.252.132)
service : edr-incident-db-replica - ClusterIP (172.24.241.176)
pgreplica : edr-incident-db-dwdu
pgreplica : edr-incident-db-ulqo
labels : name=edr-incident-db pg-cluster=edr-incident-db pgouser=admin workflowid=c43c9d92-44b4-4b15-b608-4ef8256533b2 custom-config=custom-config-edr-incident-db crunchy-pgbadger=true crunchy-pgha-scope=edr-incident-db crunchy_collect=true deployment-name=edr-incident-db pg-pod-anti-affinity=required pgo-backrest=true pgo-version=4.3.0 autofail=true
bash-4.2$ pgo df edr-incident-db -n sedsep-verify-edr-incident-db
Error:
bash-4.2$
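To confirm why that pod was evicted (for example, node disk or memory pressure), its recorded status and events can be inspected. A sketch, assuming the pods live in the namespace targeted by the pgo commands above:
# Hypothetical check; the kubelet records the eviction reason and message on the pod.
kubectl describe pod edr-incident-db-ulqo-5bc689d4d9-bxgwd -n sedsep-verify-edr-incident-db
kubectl get pod edr-incident-db-ulqo-5bc689d4d9-bxgwd -n sedsep-verify-edr-incident-db \
  -o jsonpath='{.status.reason}: {.status.message}'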
We deleted the primary pod (edr-incident-db-856f775fdd-dprqb) using the kubectl delete pod command, and could then see the patronictl list output below.
Note the change in leader. The lag on replica pod edr-incident-db-856f775fdd-khfnd eventually came down to 0, and its TL value reached 13.
The lag on the Evicted pod 'edr-incident-db-ulqo-5bc689d4d9-bxgwd' continued to rise until the issue was finally resolved.
+ Cluster: edr-incident-db (6890704157037822136) ------+--------+---------+----+-----------+
| Member                                | Host         | Role   | State   | TL | Lag in MB |
+---------------------------------------+--------------+--------+---------+----+-----------+
| edr-incident-db-856f775fdd-khfnd      | 240.0.28.116 |        | running | 10 |     79472 |
| edr-incident-db-dwdu-56f577f867-gn4c4 | 240.0.13.21  | Leader | running | 13 |           |
| edr-incident-db-ulqo-5bc689d4d9-2v9xf | 240.0.69.77  |        | running | 13 |         0 |
| edr-incident-db-ulqo-5bc689d4d9-bxgwd | 240.0.40.13  |        | running | 10 |     25152 |
+---------------------------------------+--------------+--------+---------+----+-----------+
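While the lag drained, the main concern was the WAL PVC on the new primary filling up again; its usage can be watched with something like the sketch below. The /pgwal mount path is an assumption based on the pgwal dir mentioned above, and 'database' is again an assumed container name.
# Hypothetical monitoring loop; prints WAL volume usage on the new primary once a minute.
while true; do
  kubectl exec edr-incident-db-dwdu-56f577f867-gn4c4 -c database -- df -h /pgwal
  sleep 60
done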
The error resolved on its own after a few days, and we could no longer see the 'Evicted' pod 'edr-incident-db-ulqo-5bc689d4d9-bxgwd'.
Below is the output after the issue was resolved:
$kubectl get pods
NAME READY STATUS RESTARTS AGE
apmia-postgres-monitor-d99dd58db-s869s 1/1 Running 0 21d
edr-incident-db-856f775fdd-bk9v5 3/3 Running 0 18h
edr-incident-db-backrest-shared-repo-744664c678-4zvrb 1/1 Running 0 21d
edr-incident-db-dwdu-56f577f867-nrhxh 3/3 Running 0 3d22h
edr-incident-db-full-sch-backup-qn5rf 0/1 Completed 0 14h
edr-incident-db-ulqo-5bc689d4d9-ndrnf 3/3 Running 0 17h
pgo-client-86bbdcc77d-whj8f 1/1 Running 0 18h
pgo-command-runner-26-5fl78 0/1 Completed 0 5d10h
postgres-operator-6c5d8fb6c9-4f2ht 4/4 Running 0 18h
bash-4.2$ pgo show cluster edr-incident-db -n sedsep-verify-edr-incident-db
cluster : edr-incident-db (crunchy-postgres-ha:centos7-12.2-4.3.0)
pod : edr-incident-db-856f775fdd-bk9v5 (Running) on gke-svcstus-saas-gke1-pool-01-0ab246e5-xl69 (3/3) (replica)
pvc : edr-incident-db
pod : edr-incident-db-dwdu-56f577f867-nrhxh (Running) on gke-svcstus-saas-gke1-pool-01-916322a4-2uyh (3/3) (primary)
pvc : edr-incident-db-dwdu
pod : edr-incident-db-ulqo-5bc689d4d9-ndrnf (Running) on gke-svcstus-saas-gke1-pool-01-0ab246e5-2r3v (3/3) (replica)
pvc : edr-incident-db-ulqo
resources : CPU: 2 Memory: 16G
storage : Primary=50G Replica=50G
deployment : edr-incident-db
deployment : edr-incident-db-backrest-shared-repo
deployment : edr-incident-db-dwdu
deployment : edr-incident-db-ulqo
service : edr-incident-db - ClusterIP (172.24.252.132)
service : edr-incident-db-replica - ClusterIP (172.24.241.176)
pgreplica : edr-incident-db-dwdu
pgreplica : edr-incident-db-ulqo
labels : custom-config=custom-config-edr-incident-db name=edr-incident-db pgo-backrest=true pgo-version=4.3.0 pgouser=admin workflowid=c43c9d92-44b4-4b15-b608-4ef8256533b2 autofail=true crunchy-pgha-scope=edr-incident-db crunchy_collect=true deployment-name=edr-incident-db-dwdu pg-cluster=edr-incident-db pg-pod-anti-affinity=required crunchy-pgbadger=true
bash-4.2$ pgo df edr-incident-db -n sedsep-verify-edr-incident-db
PVC INSTANCE POD TYPE USED CAPACITY % USED
edr-incident-db edr-incident-db edr-incident-db-856f775fdd-bk9v5 data 17GiB 466GiB 4%
edr-incident-db-wal edr-incident-db edr-incident-db-856f775fdd-bk9v5 wal 1GiB 466GiB 0%
edr-incident-db-pgbr-repo edr-incident-db-backrest-shared-repo edr-incident-db-backrest-shared-repo-744664c678-4zvrb pgbackrest 159GiB 466GiB 34%
edr-incident-db-dwdu edr-incident-db-dwdu edr-incident-db-dwdu-56f577f867-nrhxh data 17GiB 466GiB 4%
edr-incident-db-dwdu-wal edr-incident-db-dwdu edr-incident-db-dwdu-56f577f867-nrhxh wal 1GiB 466GiB 0%
edr-incident-db-ulqo edr-incident-db-ulqo edr-incident-db-ulqo-5bc689d4d9-ndrnf data 38GiB 466GiB 8%
edr-incident-db-ulqo-wal edr-incident-db-ulqo edr-incident-db-ulqo-5bc689d4d9-ndrnf wal 1GiB 466GiB 0%
bash-4.2$
bash-4.2$ patronictl list
+ Cluster: edr-incident-db (6890704157037822136) -----+--------+---------+----+-----------+
| Member                                | Host        | Role   | State   | TL | Lag in MB |
+---------------------------------------+-------------+--------+---------+----+-----------+
| edr-incident-db-856f775fdd-bk9v5      | 240.0.69.14 |        | running | 17 |         0 |
| edr-incident-db-dwdu-56f577f867-nrhxh | 240.0.0.128 | Leader | running | 17 |           |
| edr-incident-db-ulqo-5bc689d4d9-ndrnf | 240.0.45.7  |        | running | 17 |         0 |
+---------------------------------------+-------------+--------+---------+----+-----------+
bash-4.2$
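After recovery, the same replication-slot check can be repeated to verify that no slot is pinning WAL anymore (a sketch, with the same assumptions about container name and psql access as above):
# Hypothetical verification; retained_wal should stay small for every active slot.
kubectl exec edr-incident-db-dwdu-56f577f867-nrhxh -c database -- \
  psql -U postgres -c "SELECT slot_name, active, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal FROM pg_replication_slots;"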
Steps to reproduce the behavior:
No concrete steps to reproduce, but getting some Postgres pods into the Kubernetes 'Evicted' state may help reproduce the problem; see the sketch below.
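One way to try to force that state is to make the kubelet evict a Postgres pod, for example by exceeding its ephemeral-storage limit. This is only an untested sketch and assumes an ephemeral-storage limit is actually set on the pod:
# Hypothetical reproduction attempt; writing into the container's writable layer counts
# against the pod's ephemeral-storage limit, and exceeding it gets the pod Evicted.
kubectl exec <postgres-pod> -c database -- dd if=/dev/zero of=/tmp/fill bs=1M count=10240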
Please tell us about your environment:
- Operating System: Linux
- Where is this running (Local, Cloud Provider): GCP
- Storage being used (NFS, Hostpath, Gluster, etc): Google persistent disks
- Container Image Tag: centos7-4.3.0 / centos7-12.2-4.3.0
- PostgreSQL Version: 12.2
- Platform (Docker, Kubernetes, OpenShift): GKE
- Platform Version: 1.17