Skip to content

libnet: Controller: cache networks & endpoints in-memory#49736

Merged
vvoland merged 3 commits intomoby:masterfrom
akerouanton:cache-endpoint-in-memory
Apr 7, 2025
Merged

libnet: Controller: cache networks & endpoints in-memory#49736
vvoland merged 3 commits intomoby:masterfrom
akerouanton:cache-endpoint-in-memory

Conversation

@akerouanton
Copy link
Member

@akerouanton akerouanton commented Apr 3, 2025

- What I did

  • Maintain a cache of Endpoint and Network in libnet Controller.
  • Use that cache to determine if any --config-from network, or endpoint is referencing a network before deleting it.
  • Remove the endpointCnt struct.

Libnetwork is afflicted by long-standing data consistency issues caused by its persistence layer creating new instances of Network, Endpoint and Sandbox every time an object is loaded from the datastore or its in-memory cache. See for instance the following code that ends with a CopyTo:

func (c *cache) get(kvObject KVObject) error {
kmap, err := c.kmap(kvObject)
if err != nil {
return err
}
c.mu.Lock()
defer c.mu.Unlock()
o, ok := kmap[Key(kvObject.Key()...)]
if !ok {
return ErrKeyNotFound
}
return o.CopyTo(kvObject)
}

For instance, this led to #47186.

Fixing the data persistence layer itself is particularly hard as it touches virtually all code paths in libnetwork. We need to be especially cautious with mutex locking -- libnetwork makes heavy use of mutexes, but since two concurrent API calls referencing the same object result in to two different instances of that object living in memory, the concurrency model might be broken (unknowingly).

By maintaining a separate cache of all objects in libnet's Controller, we'll be able to slowly iterate on all code paths, and verify each one through static analysis, without touching the data persistence layer.

This PR specifically adds an Endpoint and a Network cache to the Controller. This allows us to delete the struct endpointCnt which was used to ref count networks (either when they were referenced by config-from networks, or by endpoints) to prevent deleting networks in use.

Instead, (*Network).delete() looks for endpoints / config-from networks that reference the network being deleted from the cache maintained in the Controller.

- How I did it

See individual commits.

- How to verify it

Through unit and integration tests.

- Human readable description for the release notes

- Improve how network-endpoint relationships are tracked to reduce the chance of the "has active endpoints" error to be wrongfully returned.

@akerouanton akerouanton added area/networking Networking kind/refactor PR's that refactor, or clean-up code labels Apr 3, 2025
@akerouanton akerouanton added this to the 28.1.0 milestone Apr 3, 2025
@akerouanton akerouanton requested review from robmry and thaJeztah April 3, 2025 09:38
@akerouanton akerouanton self-assigned this Apr 3, 2025
@@ -0,0 +1,11 @@
package maputil
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CI is failing because of that:

Suggested change
package maputil
//go:build go1.22
package maputil

@akerouanton akerouanton force-pushed the cache-endpoint-in-memory branch 4 times, most recently from 99867d9 to 40f7c61 Compare April 3, 2025 15:14
Copy link
Contributor

@robmry robmry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Just some suggestions/nits ... and should it have a changelog comment, saying this might solve a long-standing issue with a "has active endpoints" on network deletion?

}
}
}
c.cacheEndpoint(ep)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like getEndpointFromStore will already have cached it, so this should be moved to the if err != nil block above?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yikes! That's where it was originally 🙈

// maintained by the Controller.
//
// This method is thread-safe.
func (c *Controller) deleteEndpoint(ep *Endpoint) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name might be a bit misleading - deleteEndpoint doesn't actually delete the endpoint, it forgets it ... the caller should already have deleted it from the Sandbox etc? Perhaps deleteStoredEndpoint or something like that?

At some point, it might be good to put the cache in a separate package to protect its map/mutex, then the call would be something like c.epStore.Delete(ep) would also work. But, as you noted offline, that'd be a bigger job because Endpoint/Network would create a circular dependency, or the cache would need to be implemented with a generic.

(upsertEndpoint is clearer, but ditto for deleteNetwork.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed upsertEndpoint into storeEndpoint, and deleteEndpoint into deleteStoredEndpoint (and followed the same pattern for networks).


package maputil

func FilterValues[K comparable, V any](in map[K]V, fn func(V) bool) []V {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The callers only want a count, the slice always gets discarded after a len() ... so maybe CountValues, and countEndpoints/countNetworks - and we can add findXYZ funcs if it turns out they're needed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then you could get rid of these warnings for findEndpoints/findNetworks, because there'd be no danger ...

// This method is thread-safe, but do not use it unless you're sure your code
// uses the returned endpoints in thread-safe way (see the comment on
// Controller.endpoints).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and we can add findXYZ funcs if it turns out they're needed?

I didn't want to change ActiveEndpointsError in this PR to make clear it's focused on data persistence.

But once this one is merged, I'll open another PR to change it, and findNetworks / findEndpoints will be needed. So, to avoid needless code churn, I think it's fine to add these here.

@akerouanton akerouanton force-pushed the cache-endpoint-in-memory branch 3 times, most recently from e49ce34 to baf40a2 Compare April 4, 2025 07:27
The `Controller`'s store is used by:

- `deleteFromStore`
- `getEndpointFromStore`
- `getEndpointsFromStore`
- `updateToStore`
- … and other methods that can't store / delete / retrieve an Endpoint

Calls to `updateToStore` and `deleteFromStore` have been replaced with
`upsertEndpoint` and `deleteEndpoint`.

Both `getEndpointFromStore` and `getEndpointsFromStore` call
`cacheEndpoint` to ensure endpoints loaded from the datastore are kept
in-memory.

Finally, `sandboxRestore` was instantiating `Endpoint` itself. These are
cached too.

Signed-off-by: Albin Kerouanton <albinker@gmail.com>
The `Controller`'s store is used by:

- `deleteFromStore`
- `getNetworks`
- `getNetworksFromStore`
- `updateToStore`
- … and other methods that can't store / delete / retrieve a Network

Calls to `updateToStore` and `deleteFromStore` have been replaced with
`upsertNetwork` and `deleteNetwork`.

Both `getNetworks` and `getNetworksFromStore` call `cacheNetwork` to
ensure networks loaded from the datastore are kept in-memory.

Finally, `sandboxRestore` was instantiating `Network` itself. These are
cached too.

Signed-off-by: Albin Kerouanton <albinker@gmail.com>
@akerouanton akerouanton force-pushed the cache-endpoint-in-memory branch from baf40a2 to 1169763 Compare April 4, 2025 08:03
endpointCnt is a refcounter used to track how many endpoints use a
network, and how many networks references a config-only network. It's
stored separately from the network.

This is only used to determine if a network can be removed.

This commit removes the `endpointCnt` struct and all its references. The
refcounter is replaced by two lookups in the newly introduced `networks`
and `endpoints` caches added to the `Controller`.

Signed-off-by: Albin Kerouanton <albinker@gmail.com>
@akerouanton akerouanton force-pushed the cache-endpoint-in-memory branch from 1169763 to 51d7f95 Compare April 4, 2025 09:21
@akerouanton akerouanton requested a review from vvoland April 4, 2025 09:24
Comment on lines +30 to +32
assert.Equal(t, len(found), 2)
assert.Equal(t, found[0], ep1)
assert.Equal(t, found[1], ep2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: could use assert.Check

Comment on lines +40 to +41
assert.Equal(t, len(found), 1)
assert.Equal(t, found[0], ep2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

@vvoland vvoland merged commit da3b31f into moby:master Apr 7, 2025
148 checks passed
@akerouanton akerouanton deleted the cache-endpoint-in-memory branch April 7, 2025 10:37
"gotest.tools/v3/assert"
)

func TestEndpointStore(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, hit some flakiness in #49762 (comment) as well

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/networking Networking impact/changelog kind/refactor PR's that refactor, or clean-up code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants