Describe the bug
The service accessing the DB runs into connection refused / connection timeout errors.
Expected behavior
The application should always be able to communicate with the db.
Deployment:
The DB cluster is a PostgreSQL cluster deployed with the Postgres Operator, with 1 primary instance
and 2 replica instances.
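For reference, a cluster with this topology can be created through the pgo client roughly as follows. This is only a sketch: the cluster name and namespace are taken from the commands later in this report, --replica-count is assumed to be available as in pgo 4.x, and the storage/WAL options actually used for this cluster are not known.
# Hypothetical example; actual flags (storage config, WAL PVC, resources) are unknown.
pgo create cluster edr-incident-db --replica-count=2 -n sedsep-verify-edr-incident-db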
Diagnosis:
Upon examining the cluster, we found that the primary instance had its WAL PVC filled up and was crashing with 'No space left on device'.
Only the primary instance, out of the 1 primary and 2 replicas, had its WAL PVC filled up. Deleting the primary pod recovered the cluster temporarily,
as one of the replicas was promoted to primary, but the new primary's WAL PVC then started filling up and reached 100% capacity within a few hours.
We could see that the old WAL segment files on the primary instance under the pgwal dir were not being cleaned up as new segment files continued to be created.
Upon examining the processes, we could see 2 walsender processes on the primary and 1 walreceiver process on each replica.
We also found the following log on the primary instance, which hints that the primary could be holding onto the WAL segments in the hope of replicating them to a failed replica:
postgresql.log.day-12:2020-12-12 01:31:39 UTC:240.0.10.23(39708):primaryuser@[unknown]:[1162]: LOG: terminating walsender process due to replication timeout
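A quick way to confirm whether the primary is retaining WAL for a disconnected standby is to compare pg_stat_replication against pg_replication_slots. The commands below are only a sketch: the container name 'database' and passwordless local psql access are assumptions about this deployment.
# Hypothetical diagnostic; which standbys are actually streaming right now?
kubectl exec edr-incident-db-856f775fdd-dprqb -c database -- \
  psql -U postgres -c "SELECT application_name, state, sent_lsn, replay_lsn FROM pg_stat_replication;"
# Hypothetical diagnostic; an inactive slot with an old restart_lsn pins WAL on the primary.
kubectl exec edr-incident-db-856f775fdd-dprqb -c database -- \
  psql -U postgres -c "SELECT slot_name, active, restart_lsn FROM pg_replication_slots;"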
However, in the output of the pgo show cluster command we could see that pod 'edr-incident-db-ulqo-5bc689d4d9-bxgwd' was already in a failed state (it also showed as 'Evicted' in the kubectl get pods output):
The pgo df command returned an error in this condition.
$kubectl get pods
NAME READY STATUS RESTARTS AGE
apmia-postgres-monitor-d99dd58db-s869s 1/1 Running 0 16d
edr-incident-db-856f775fdd-dprqb 2/3 Running 0 20d
edr-incident-db-backrest-shared-repo-744664c678-4zvrb 1/1 Running 0 16d
edr-incident-db-dwdu-56f577f867-gn4c4 3/3 Running 0 16d
edr-incident-db-ulqo-5bc689d4d9-2v9xf 3/3 Running 0 13h
edr-incident-db-ulqo-5bc689d4d9-bxgwd 0/3 Evicted 0 20d
pgo-client-86bbdcc77d-pt9qq 1/1 Running 0 20d
pgo-command-runner-26-5fl78 0/1 Completed 0 7h57m
postgres-operator-6c5d8fb6c9-mhqt7 4/4 Running 0 20d
bash-4.2$ pgo show cluster edr-incident-db -n sedsep-verify-edr-incident-db
cluster : edr-incident-db (crunchy-postgres-ha:centos7-12.2-4.3.0)
pod : edr-incident-db-856f775fdd-dprqb (Running) on gke-svcstus-saas-gke1-pool-01-0ab246e5-n02t (2/3) (primary)
pvc : edr-incident-db
pod : edr-incident-db-dwdu-56f577f867-gn4c4 (Running) on gke-svcstus-saas-gke1-pool-01-916322a4-pe7x (3/3) (replica)
pvc : edr-incident-db-dwdu
pod : edr-incident-db-ulqo-5bc689d4d9-2v9xf (Running) on gke-svcstus-saas-gke1-pool-01-0ab246e5-idz8 (3/3) (replica)
pvc : edr-incident-db-ulqo
pod : edr-incident-db-ulqo-5bc689d4d9-bxgwd (Failed) on gke-svcstus-saas-gke1-pool-01-0ab246e5-79ix (0/0) (primary)
pvc : edr-incident-db-ulqo
resources : CPU: 2 Memory: 16G
storage : Primary=50G Replica=50G
deployment : edr-incident-db
deployment : edr-incident-db-backrest-shared-repo
deployment : edr-incident-db-dwdu
deployment : edr-incident-db-ulqo
service : edr-incident-db - ClusterIP (172.24.252.132)
service : edr-incident-db-replica - ClusterIP (172.24.241.176)
pgreplica : edr-incident-db-dwdu
pgreplica : edr-incident-db-ulqo
labels : name=edr-incident-db pg-cluster=edr-incident-db pgouser=admin workflowid=c43c9d92-44b4-4b15-b608-4ef8256533b2 custom-config=custom-config-edr-incident-db crunchy-pgbadger=true crunchy-pgha-scope=edr-incident-db crunchy_collect=true deployment-name=edr-incident-db pg-pod-anti-affinity=required pgo-backrest=true pgo-version=4.3.0 autofail=true
bash-4.2$ pgo df edr-incident-db -n sedsep-verify-edr-incident-db
Error:
bash-4.2$
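To confirm why that pod was evicted (for example, node disk or memory pressure), its recorded status and events can be inspected. A sketch, assuming the pods live in the namespace targeted by the pgo commands above:
# Hypothetical check; the kubelet records the eviction reason and message on the pod.
kubectl describe pod edr-incident-db-ulqo-5bc689d4d9-bxgwd -n sedsep-verify-edr-incident-db
kubectl get pod edr-incident-db-ulqo-5bc689d4d9-bxgwd -n sedsep-verify-edr-incident-db \
  -o jsonpath='{.status.reason}: {.status.message}'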
We deleted the primary pod (edr-incident-db-856f775fdd-dprqb) using the kubectl delete pod command, and could then see the patronictl list output below.
Note the change in leader. The lag on replica pod edr-incident-db-856f775fdd-khfnd eventually came down to 0, and its TL value reached 13.
The lag on the Evicted pod 'edr-incident-db-ulqo-5bc689d4d9-bxgwd' continued to rise until the issue was finally resolved.
+ Cluster: edr-incident-db (6890704157037822136) ------+--------+---------+----+-----------+
| Member                                | Host         | Role   | State   | TL | Lag in MB |
+---------------------------------------+--------------+--------+---------+----+-----------+
| edr-incident-db-856f775fdd-khfnd      | 240.0.28.116 |        | running | 10 |     79472 |
| edr-incident-db-dwdu-56f577f867-gn4c4 | 240.0.13.21  | Leader | running | 13 |           |
| edr-incident-db-ulqo-5bc689d4d9-2v9xf | 240.0.69.77  |        | running | 13 |         0 |
| edr-incident-db-ulqo-5bc689d4d9-bxgwd | 240.0.40.13  |        | running | 10 |     25152 |
+---------------------------------------+--------------+--------+---------+----+-----------+
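While the lag drained, the main concern was the WAL PVC on the new primary filling up again; its usage can be watched with something like the sketch below. The /pgwal mount path is an assumption based on the pgwal dir mentioned above, and 'database' is again an assumed container name.
# Hypothetical monitoring loop; prints WAL volume usage on the new primary once a minute.
while true; do
  kubectl exec edr-incident-db-dwdu-56f577f867-gn4c4 -c database -- df -h /pgwal
  sleep 60
done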
The error resolved on its own after a few days, and we could no longer see the 'Evicted' pod 'edr-incident-db-ulqo-5bc689d4d9-bxgwd'.
Below is the output after the issue was resolved:
$kubectl get pods
NAME READY STATUS RESTARTS AGE
apmia-postgres-monitor-d99dd58db-s869s 1/1 Running 0 21d
edr-incident-db-856f775fdd-bk9v5 3/3 Running 0 18h
edr-incident-db-backrest-shared-repo-744664c678-4zvrb 1/1 Running 0 21d
edr-incident-db-dwdu-56f577f867-nrhxh 3/3 Running 0 3d22h
edr-incident-db-full-sch-backup-qn5rf 0/1 Completed 0 14h
edr-incident-db-ulqo-5bc689d4d9-ndrnf 3/3 Running 0 17h
pgo-client-86bbdcc77d-whj8f 1/1 Running 0 18h
pgo-command-runner-26-5fl78 0/1 Completed 0 5d10h
postgres-operator-6c5d8fb6c9-4f2ht 4/4 Running 0 18h
bash-4.2$ pgo show cluster edr-incident-db -n sedsep-verify-edr-incident-db
cluster : edr-incident-db (crunchy-postgres-ha:centos7-12.2-4.3.0)
pod : edr-incident-db-856f775fdd-bk9v5 (Running) on gke-svcstus-saas-gke1-pool-01-0ab246e5-xl69 (3/3) (replica)
pvc : edr-incident-db
pod : edr-incident-db-dwdu-56f577f867-nrhxh (Running) on gke-svcstus-saas-gke1-pool-01-916322a4-2uyh (3/3) (primary)
pvc : edr-incident-db-dwdu
pod : edr-incident-db-ulqo-5bc689d4d9-ndrnf (Running) on gke-svcstus-saas-gke1-pool-01-0ab246e5-2r3v (3/3) (replica)
pvc : edr-incident-db-ulqo
resources : CPU: 2 Memory: 16G
storage : Primary=50G Replica=50G
deployment : edr-incident-db
deployment : edr-incident-db-backrest-shared-repo
deployment : edr-incident-db-dwdu
deployment : edr-incident-db-ulqo
service : edr-incident-db - ClusterIP (172.24.252.132)
service : edr-incident-db-replica - ClusterIP (172.24.241.176)
pgreplica : edr-incident-db-dwdu
pgreplica : edr-incident-db-ulqo
labels : custom-config=custom-config-edr-incident-db name=edr-incident-db pgo-backrest=true pgo-version=4.3.0 pgouser=admin workflowid=c43c9d92-44b4-4b15-b608-4ef8256533b2 autofail=true crunchy-pgha-scope=edr-incident-db crunchy_collect=true deployment-name=edr-incident-db-dwdu pg-cluster=edr-incident-db pg-pod-anti-affinity=required crunchy-pgbadger=true
bash-4.2$ pgo df edr-incident-db -n sedsep-verify-edr-incident-db
PVC INSTANCE POD TYPE USED CAPACITY % USED
edr-incident-db edr-incident-db edr-incident-db-856f775fdd-bk9v5 data 17GiB 466GiB 4%
edr-incident-db-wal edr-incident-db edr-incident-db-856f775fdd-bk9v5 wal 1GiB 466GiB 0%
edr-incident-db-pgbr-repo edr-incident-db-backrest-shared-repo edr-incident-db-backrest-shared-repo-744664c678-4zvrb pgbackrest 159GiB 466GiB 34%
edr-incident-db-dwdu edr-incident-db-dwdu edr-incident-db-dwdu-56f577f867-nrhxh data 17GiB 466GiB 4%
edr-incident-db-dwdu-wal edr-incident-db-dwdu edr-incident-db-dwdu-56f577f867-nrhxh wal 1GiB 466GiB 0%
edr-incident-db-ulqo edr-incident-db-ulqo edr-incident-db-ulqo-5bc689d4d9-ndrnf data 38GiB 466GiB 8%
edr-incident-db-ulqo-wal edr-incident-db-ulqo edr-incident-db-ulqo-5bc689d4d9-ndrnf wal 1GiB 466GiB 0%
bash-4.2$
bash-4.2$ patronictl list
+ Cluster: edr-incident-db (6890704157037822136) -----+--------+---------+----+-----------+
| Member                                | Host        | Role   | State   | TL | Lag in MB |
+---------------------------------------+-------------+--------+---------+----+-----------+
| edr-incident-db-856f775fdd-bk9v5      | 240.0.69.14 |        | running | 17 |         0 |
| edr-incident-db-dwdu-56f577f867-nrhxh | 240.0.0.128 | Leader | running | 17 |           |
| edr-incident-db-ulqo-5bc689d4d9-ndrnf | 240.0.45.7  |        | running | 17 |         0 |
+---------------------------------------+-------------+--------+---------+----+-----------+
bash-4.2$
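After recovery, the same replication-slot check can be repeated to verify that no slot is pinning WAL anymore (a sketch, with the same assumptions about container name and psql access as above):
# Hypothetical verification; retained_wal should stay small for every active slot.
kubectl exec edr-incident-db-dwdu-56f577f867-nrhxh -c database -- \
  psql -U postgres -c "SELECT slot_name, active, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal FROM pg_replication_slots;"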
Steps to reproduce the behavior:
No concrete steps to reproduce, but getting some Postgres pods into the Kubernetes 'Evicted' state may help reproduce the problem; see the sketch below.
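One way to try to force that state is to make the kubelet evict a Postgres pod, for example by exceeding its ephemeral-storage limit. This is only an untested sketch and assumes an ephemeral-storage limit is actually set on the pod:
# Hypothetical reproduction attempt; writing into the container's writable layer counts
# against the pod's ephemeral-storage limit, and exceeding it gets the pod Evicted.
kubectl exec <postgres-pod> -c database -- dd if=/dev/zero of=/tmp/fill bs=1M count=10240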
Please tell us about your environment:
- Operating System: Linux
- Where is this running (Local, Cloud Provider): GCP
- Storage being used (NFS, Hostpath, Gluster, etc): Google persistent disks
- Container Image Tag: centos7-4.3.0 / centos7-12.2-4.3.0
- PostgreSQL Version: 12.2
- Platform (Docker, Kubernetes, OpenShift): GKE
- Platform Version: 1.17