Addressing crashes from failing background connections #609
Conversation
Force-pushed from 73a9e72 to d32b895
I need to make a patch to js-quic to expose the `dcid`.
Why do you need the dcid? We already have a connection ID. And the dcid is only used for negotiation, I thought.
Changes:
* Exposed the connection destination ID.
* Created a common ID that is identical for both sides of the connection, derived from the source and destination IDs.
* The `quiche` connection is now the source of truth for these IDs. The destination ID is only known after establishing and is only valid while running.
* Added a test demonstrating the new IDs and how they relate to each other.

Related: MatrixAI/Polykey#609 [ci skip]
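As a minimal sketch of how such an order-independent common ID could be derived (the helper name and byte-level scheme here are assumptions, not the actual js-quic implementation):

```ts
/**
 * Hypothetical sketch: derive an ID that is identical on both sides of a
 * connection. Node A sees (source = a, destination = b) while node B sees
 * (source = b, destination = a), so sorting the pair before concatenating
 * makes the result independent of which side computes it.
 */
function deriveCommonId(sourceId: Buffer, destinationId: Buffer): Buffer {
  const [low, high] =
    Buffer.compare(sourceId, destinationId) <= 0
      ? [sourceId, destinationId]
      : [destinationId, sourceId];
  return Buffer.concat([low, high]);
}
```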
Could go ahead with event optimisation soon too.
Sorry, I thought I replied to this. For this PR I just need some information about the connection that is common to both sides. On top of that, it's just useful to have the ID to collate connection pairs in logs between multiple nodes.
Progress update: now when processing a reverse connection, and a connection already exists in the connection map for that node ID, both sides will deterministically select the same connection to keep by comparing the connection IDs. Implementation-wise, it creates a common ID for each connection and compares them. That aspect of the problem is solved.

A second problem is that if we use the connection right after the concurrent establishment, the connection will be closed while we're using it. To avoid this we need to apply slightly stricter locking on the connection map.

There is also a possibility that the reverse connection's streams need to be handled while deciding which one to keep. I'll need to keep an eye out for a race condition there.

Ultimately we'd want to be able to concurrently establish a connection between two nodes and not have any intermittent failures due to race conditions. We may not be able to fully eliminate them, but we can significantly reduce the chance of them happening.
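A hedged sketch of that deterministic tie-break; the `commonId` field and the shape of `Conn` are illustrative, assuming each connection carries the common ID from the earlier commit:

```ts
type Conn = { commonId: Buffer /* plus the real connection state */ };

/**
 * Keep the connection with the lexicographically smaller common ID.
 * Both nodes compare the same pair of common IDs, so both arrive at the
 * same winner regardless of which connection is "existing" locally.
 */
function selectConnectionToKeep(
  existing: Conn,
  incoming: Conn,
): { keep: Conn; drop: Conn } {
  return Buffer.compare(existing.commonId, incoming.commonId) <= 0
    ? { keep: existing, drop: incoming }
    : { keep: incoming, drop: existing };
}
```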
I think while I'm here, I need to use a monitor for re-entrant locking.
I'm starting to think that no matter what we do, locking isn't going to solve the problem of an RPC request being cancelled when handling this. I think we just need to be more lax about it. What we can do is have both connections be valid, but the one we want to clean up will not be used to start any new streams. Instead we will destroy it gracefully. While this is happening, we just synchronously swap the connection we want to keep into the connection map. This should allow us to avoid cancelling any active RPC requests. The downside is that we could have a duplicate connection existing while it's handling streams. This seems like an OK compromise.
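A rough sketch of that swap-and-drain idea; `preventNewStreams`, `streamsSettled`, and the map shape are assumed stand-ins for the real `NodeConnectionManager` internals:

```ts
interface DrainableConn {
  preventNewStreams(): void; // hypothetical: refuse to open new streams
  streamsSettled(): Promise<void>; // hypothetical: resolves when streams end
  destroy(opts: { force: boolean }): Promise<void>;
}

async function resolveDuplicate(
  connectionMap: Map<string, DrainableConn>,
  nodeId: string,
  keep: DrainableConn,
  drop: DrainableConn,
): Promise<void> {
  // Synchronously swap the winner into the map so new RPC calls use it.
  connectionMap.set(nodeId, keep);
  // The loser stays valid for its active streams but starts nothing new.
  drop.preventNewStreams();
  await drop.streamsSettled();
  // Gracefully destroy once its streams have finished.
  await drop.destroy({ force: false });
}
```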
So the problem is you have two concurrent things happening out of order, and only one can succeed. Locking isn't sufficient for this because there's no neutral sequencer. In this case what you can do is the STONITH strategy. But you want to make sure that both nodes choose the same one to do it to.
See https://chat.openai.com/share/476a7f17-ab44-43cb-aa2d-dd133bb1f0df. Apply a STONITH strategy deterministically, and also apply a small random jitter to connection initiations. A random delay of between 0 and 50 ms?
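A minimal sketch of that jitter, assuming a caller-supplied `connect` function; the 0 to 50 ms window comes from the suggestion above:

```ts
// Delay connection initiation by a random 0-50 ms so that two nodes
// deciding to connect at the same moment are less likely to collide.
async function connectWithJitter(
  connect: () => Promise<void>,
  maxJitterMs: number = 50,
): Promise<void> {
  const delayMs = Math.random() * maxJitterMs;
  await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
  await connect();
}
```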
I've pretty much implemented that already; I have most of a solution now. Both nodes do select the same connection to kill. The other part of the problem is that the connection could already be handling a stream, and I want to handle the stream gracefully while cleaning up the connection. To this end, when we enter what I'm going to call a draining state, the connection will finish its existing streams but won't start any new ones. I need to check something with js-quic first.
Just checked, all calls with …
Two things left before this PR is done.
Due to the other changes, there is now only one place where the magic error code is used for stopping a connection, and that's here:

```ts
await this.quicConnection.stop(
  force
    ? {
        isApp: true,
        errorCode: 1,
        reason: Buffer.from('NodeConnection is forcing destruction'),
        force,
      }
    : {},
);
```

Ultimately this will only be called when the `NodeConnection` is being force-destroyed. I'm thinking that the code and reason can be passed into the stop call. As for the enum, it's just going to be a single entry. I can still add it. Keep in mind that …
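Assuming the `ConnectionErrorCode` and `ConnectionErrorReason` enums proposed later in this thread, the same call site could read as follows; this is a sketch of the refactor, not the merged code:

```ts
await this.quicConnection.stop(
  force
    ? {
        isApp: true,
        errorCode: ConnectionErrorCode.ForceClose,
        reason: Buffer.from(ConnectionErrorReason.ForceClose),
        force,
      }
    : {},
);
```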
This is pretty much done, but there are three weird test failures happening, so there are still some things I need to check. I'm out of energy to deal with it today.
Two tests in … are failing. There is a discovery test failing due to trying to start a stream when the connection is stopping. Two odd things here: 1. the connection is stopping and I'm not sure why (maybe deadlock + timeout?); 2. the error is bubbling up when it should be handled. Still looking into this.
I'm stuck on a problem with a discovery test. It's functioning as expected, but one of the nodes is trying to connect to another expecting the wrong NodeId. This is really weird, and I'm not even sure the problem is related to my changes. I'm still digging into it.
I found the problem. There was a bug introduced in 2834a09 that caused this one specific discovery test to fail sometimes. If nothing else is broken, then I should have this done soon.
Force-pushed from 2e76198 to 575165f [ci skip]
Force-pushed from 575165f to 6d08ff5
Commit messages should be more descriptive. What was the bug?
What are the new exceptions being used in place of enums? Can you spec that out in the PR spec?
CI failure. I'll need to look into this. Looking deeper, it's a failure of an … test. Any insights @amydevs?
…s with new `NodeId` found when searching [ci skip]
Force-pushed from 6d08ff5 to 1a26cb0
We're using an enum in place of a magic number error code and message when forcing a connection to end. Two new enums:

```ts
enum ConnectionErrorCode {
  ForceClose = 1,
}

enum ConnectionErrorReason {
  ForceClose = 'NodeConnection is forcing destruction',
}
```

This is only needed in one place now: when forcing a `NodeConnection` to destroy.
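On the receiving side, the same enum could distinguish an intentional force-close from a genuine failure. A hedged sketch, since the actual close-info shape in js-quic may differ:

```ts
function isForcedClose(closeInfo: {
  isApp: boolean;
  errorCode: number;
}): boolean {
  // An application-level close carrying our ForceClose code means the
  // peer tore the connection down on purpose, not that it failed.
  return (
    closeInfo.isApp && closeInfo.errorCode === ConnectionErrorCode.ForceClose
  );
}
```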
It's weird that only connection error codes are enums but not everything else that relies on exceptions.
Plus don't you need all your enums to be together?
It's not really an exception in the application here. It's just a code and message for stopping the connection. We don't directly need to throw anything in this case. The enums are together here? I'm not sure what you mean.

Description
This PR contains stability fixes for failing connections. It directly addresses issue #592.
Issues Fixed
- `PolykeyAgent` crashes from connection failures #592

Tasks
- [ ] `NCM` should never throw when a connection fails.
- [ ] Use an `enum` to get the code and reason; all forced connection stops should use this.

Final checklist