Skip to content
This repository was archived by the owner on Dec 17, 2024. It is now read-only.
This repository was archived by the owner on Dec 17, 2024. It is now read-only.

Leaking file descriptors #412

@eshkinkot

Description

@eshkinkot

I noticed in my production that with a large number of metrics errors the pgwatch2 process is heavily leaking file descriptors.

It is looks like this, after start:

$ for p in $(pgrep pgwatch2-daemon); do echo $p $(sudo ls /proc/$p/fd/ | wc -l); done
16873 9
16874 10
16875 10
16876 10
16877 10
16878 10
16905 10
16912 9
16913 10
16922 10

After some time:

16873 1024
16874 18
16875 17
16876 10
16877 10
16878 9
16905 11
16912 10
16913 10
16922 203

Process 16873 is stopping monitoring, and 16922 will stop in near future.

[16873]: 2021/09/29 12:24:58 ERRO ReadMetricsFromFolder: open /etc/pgwatch2/metrics/: too many open files
[16873]: 2021/09/29 12:24:58 ERRO main: Could not refresh metric definitions: open /etc/pgwatch2/metrics/: too many open files
...
[16873]: 2021/09/29 12:25:57 ERRO DetectSprocChanges: could not read sproc_hashes from monitored host:  XXX_postgres , err: dial tcp X.X.X.X:PPPP: socket: too many open files
[16873]: 2021/09/29 12:25:57 ERRO DetectTableChanges: could not read table_hashes from monitored host: XXX_postgres , err: dial tcp X.X.X.X:PPPP: socket: too many open files
[16873]: 2021/09/29 12:25:57 ERRO DetectIndexChanges: could not read index_hashes from monitored host: XXX_postgres , err: dial tcp X.X.X.X:PPPP: socket: too many open files

I add some debug messages (#410) and found two race conditions:

  1. two metrics (goroutine) can allocate two pools simultaneously, second pool pointer will overwrite first in cache and first pool will be lost with opened connections
  2. when the pool is closed, there is no lock between closing and writing nil to the cache, so nil can overwrite already allocated new pool and it will be lost too, with opened connections
2021/09/30 17:58:34 INFO DBExecReadByDbUniqueName: 0x0 if cache test:db_size exists true {0 0 0 0 0 0s 0 0 0}
2021/09/30 17:58:34 INFO DBExecReadByDbUniqueName: 0xc00027e630 conn opened test:db_size
2021/09/30 17:58:34 INFO DBExecReadByDbUniqueName: 0x0 if cache test:err_metric exists true {0 0 0 0 0 0s 0 0 0}
2021/09/30 17:58:34 INFO DBExecReadByDbUniqueName: 0xc0005ba840 conn opened test:err_metric
--> 2021/09/30 17:58:34 INFO DBExecReadByDbUniqueName: 0xc00027e630 assign to cache 0x0 test:db_size
--> 2021/09/30 17:58:34 INFO DBExecReadByDbUniqueName: 0xc0005ba840 assign to cache 0xc00027e630 test:err_metric
2021/09/30 17:58:34 INFO DBExecReadByDbUniqueName: 0xc0005ba840 cache unlock test:err_metric
2021/09/30 17:58:34 INFO DBExecReadByDbUniqueName: 0xc00027e630 cache unlock test:db_size

Second new pool 0xc0005ba840 overwrite already allocated 0xc00027e630 in cache.

2021/09/30 17:58:43 INFO DBExecReadByDbUniqueName: 0x0 if cache test:db_size exists true {0 0 0 0 0 0s 0 0 0}
2021/09/30 17:58:43 INFO DBExecReadByDbUniqueName: 0xc00027f410 conn opened test:db_size
2021/09/30 17:58:43 INFO DBExecReadByDbUniqueName: 0xc00027f410 assign to cache 0x0 test:db_size
2021/09/30 17:58:43 INFO DBExecReadByDbUniqueName: 0xc00027f410 cache unlock test:db_size
--> 2021/09/30 17:58:43 INFO DBExecRead: 0xc00027ec00 conn close test {2 0 0 0 0 0s 0 0 0}
--> 2021/09/30 17:58:43 INFO DBExecRead: 0xc00027f410 cache assign nil test

After old pool 0xc00027ec00 closed, new pool 0xc00027f410 allocated and overwrited by nil in the closing goroutine.

Output after bugfix (eshkinkot#1):

2021/09/30 20:25:57 INFO DBExecReadByDbUniqueName: 0x0 if cache test:db_size exists true {0 0 0 0 0 0s 0 0 0}
2021/09/30 20:25:57 INFO DBExecReadByDbUniqueName: 0x0 if cache test:err_metric1 exists true {0 0 0 0 0 0s 0 0 0}
2021/09/30 20:25:57 INFO DBExecReadByDbUniqueName: 0x0 recheck test:db_size exists true
2021/09/30 20:25:57 INFO DBExecReadByDbUniqueName: 0xc0003f9170 conn opened test:db_size
--> 2021/09/30 20:25:57 INFO DBExecReadByDbUniqueName: 0xc0003f9170 assign to cache 0x0 test:db_size
2021/09/30 20:25:57 INFO DBExecReadByDbUniqueName: 0xc0003f9170 cache unlock test:db_size
2021/09/30 20:25:57 INFO DBExecReadByDbUniqueName: 0xc0003f9170 recheck test:err_metric1 exists true
2021/09/30 20:25:57 INFO DBExecReadByDbUniqueName: 0xc0003f9170 cache unlock test:err_metric1

Now there is only one assignment to the cache.

2021/09/30 20:26:03 INFO DBExecReadByDbUniqueName: 0x0 if cache test:err_metric exists true {0 0 0 0 0 0s 0 0 0}
2021/09/30 20:26:03 INFO DBExecReadByDbUniqueName: 0x0 recheck test:err_metric exists true
2021/09/30 20:26:03 INFO DBExecReadByDbUniqueName: 0xc0004f1500 conn opened test:err_metric
2021/09/30 20:26:03 INFO DBExecReadByDbUniqueName: 0xc0004f1500 assign to cache 0x0 test:err_metric
2021/09/30 20:26:03 INFO DBExecReadByDbUniqueName: 0xc0004f1500 cache unlock test:err_metric
--> 2021/09/30 20:26:03 INFO DBExecRead: 0xc0004f0bd0 conn close test {2 0 0 0 0 0s 0 0 0}
2021/09/30 20:26:03 INFO FetchMetrics: [test:db_size] failed to fetch metrics: sql: database is closed
2021/09/30 20:26:03 INFO DBExecRead: 0xc0004f1500 conn close test {2 0 0 0 0 0s 0 0 0}
2021/09/30 20:26:03 INFO DBExecRead: 0xc0004f1500 cache assign nil test
2021/09/30 20:26:03 INFO FetchMetrics: [test:err_metric] failed to fetch metrics: pq: column "error" does not exist

And there is no overwrite new pool 0xc0004f1500 after closing 0xc0004f0bd0.

I rewrote this bugfix over current master and create PR #411, please review.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions