libnet/d/bridge: handleFirewalldReloadNw: fix deadlock#50620
libnet/d/bridge: handleFirewalldReloadNw: fix deadlock#50620thaJeztah merged 2 commits intomoby:masterfrom
Conversation
The bridge driver was embedding `sync.Mutex` which is unconventional and makes it harder to analyze locks ordering. Signed-off-by: Albin Kerouanton <albinker@gmail.com>
handleFirewalldReloadNw locks `d.mu` and then `d.configNetworks`. However, the rest of the driver locks `d.configNetworks` first and then `d.mu`. This could result in deadlocks if `handleFirewalldReloadNw` is called while the bridge driver is already holding `d.configNetworks` lock. Other code paths were checked to ensure that they all follow the same locking order. This bug was introduced by commit a527e5a. Signed-off-by: Albin Kerouanton <albinker@gmail.com>
austinvazquez
left a comment
There was a problem hiding this comment.
Change for consistent lock ordering LGTM.
| // mu is used to protect accesses to config and networks. Do not hold this lock while locking configNetwork. | ||
| mu sync.Mutex |
There was a problem hiding this comment.
FWIW, it's common practice to put the mutex above the fields it protects; in this case (from the changes in this code), it's config and networks, so those could possibly be moved under this one (or the mutex moved to the top).
It's not 100% clear to me what configNetwork protects though. Perhaps worth (could be a follow-up) to add a comment on that as well.
| // Make sure the network isn't being deleted, and ProgramExternalConnectivity | ||
| // isn't modifying iptables rules, while restoring the rules. | ||
| d.configNetwork.Lock() | ||
| defer d.configNetwork.Unlock() | ||
|
|
||
| d.mu.Lock() | ||
| defer d.mu.Unlock() |
There was a problem hiding this comment.
FWIW there's other places where we acquite a lock for configNetwork, but don't hold a lock for networks (mu), or at least not for the whole duration;
moby/daemon/libnetwork/drivers/bridge/bridge_linux.go
Lines 758 to 766 in cf15d5b
moby/daemon/libnetwork/drivers/bridge/bridge_linux.go
Lines 725 to 730 in cf15d5b
There was a problem hiding this comment.
Yup, I checked all of them and wrote down my analysis in the PR description. This is the only case where the lock ordering is reversed.
There was a problem hiding this comment.
Oh! I missed the collapsed block, nice!
thaJeztah
left a comment
There was a problem hiding this comment.
LGTM
but there's definitely improvements to be made; at least the locking strategy seems complex and brittle (and could use touch-ups in documentation)
- What I did
1st commit: libnet/d/bridge: driver: un-embed mutex
The bridge driver was embedding
sync.Mutexwhich is unconventional and makes it harder to analyze locks ordering.2nd commit: libnet/d/bridge: handleFirewalldReloadNw: fix deadlock
handleFirewalldReloadNw locks
d.muand thend.configNetworks. However, the rest of the driver locksd.configNetworksfirst and thend.mu.This could result in deadlocks if
handleFirewalldReloadNwis called while the bridge driver is already holdingd.configNetworkslock.Other code paths were checked to ensure that they all follow the same locking order.
Locks ordering analysis
d.configNetwork:
(*driver).configure()locksd.configNetwork—d.munot acquired(*driver).CreateNetwork()locksd.configNetwork-d.munot acquired(*driver).DeleteNetwork()locksd.configNetwork-d.munot acquired(*driver).handleFirewalldReloadNwlocksd.configNetwork-d.muIS acquired(*driver).ProgramExternalConnectivitylocksd.configNetwork-d.munot acquiredd.mu.Lock:
(*driver).configure()locksd.mu-d.configNetworknot acquired(*driver).CreateEndpoint()locksd.mu-d.configNetworknot acquired(*driver).createNetwork()locksd.mu-d.configNetworkIS acquired by callers:(*driver).CreateNetwork()LOCKSd.configNetwork->(*driver).createNetwork()(*driver).configure()LOCKSd.configNetwork->(*driver).initStore()->(*driver).populateNetworks()->(*driver).createNetwork()(*driver).CreateNetwork()locksd.mu-d.configNetworknot acquired(*driver).DeleteEndpoint()locksd.mu-d.configNetworknot acquired(*driver).deleteNetwork()locksd.mu-d.configNetworkIS acquired by caller:(*driver).DeleteNetwork()LOCKSd.configNetwork->(*driver).deleteNetwork()(*driver).EndpointOperInfo()locksd.mu-d.configNetworknot acquired(*driver).getNetwork()locksd.mu—d.configNetworkIS acquired by caller:(*driver).Join()doesn't lockd.configNetwork->(*driver).getNetwork()(*driver).Leave()doesn't lockd.configNetwork->(*driver).getNetwork()(*driver).ProgramExternalConnectivity()LOCKSd.configNetwork->(*driver).getNetwork()(*driver).getNetworks()locksd.mu—d.configNetworkIS acquired by caller:(*driver).CreateNetwork()LOCKSd.configNetwork->(*driver).checkConflict()->(*driver).getNetworks()(*driver).handleFirewalldReload()locksd.mu—d.configNetworknot acquired(*driver).handleFirewalldReloadNw()locksd.mu-d.configNetworklocked afterd.mu- Human readable description for the release notes
- Fix a deadlock that could happen if a firewalld reload was processed while the bridge networking driver was starting up, or creating or deleting a network, or creating port-mappings