
Conversation

@tegefaulkes
Contributor

@tegefaulkes tegefaulkes commented Oct 25, 2023

Description

This PR addresses stability fixes for failing connections.

It directly addresses issue #592.

Issues Fixed

Tasks

  • 1. Prevent any connection failures in the background from bubbling up to the top of the program. The NCM should never throw when a connection fails.
  • 2. Expand tests to include connection failures and concurrent connection failures.
  • 3. Remove magic number error codes from the nodes domain and use an enum to get the code and reason. All forced connection stops should use this.
  • 4. Ensure that concurrent connections result in a single successful active connection.

Final checklist

  • Domain specific tests
  • Full tests
  • Updated inline-comment documentation
  • Lint fixed
  • Squash and rebased
  • Sanity check the final build

@tegefaulkes tegefaulkes self-assigned this Oct 25, 2023
@tegefaulkes tegefaulkes changed the title from "Feature connection crash fix" to "Addressing crashes from failing background connections" Oct 25, 2023

@tegefaulkes tegefaulkes force-pushed the feature-connection-crash-fix branch 2 times, most recently from 73a9e72 to d32b895 on October 26, 2023 03:24
@tegefaulkes
Contributor Author

I need to make a patch to js-quic to expose the dcid of the connection.

@CMCDragonkai
Member

I need to make a patch to js-quic to expose the dcid of the connection.

Why do you need the dcid? We already have connection ID. And the dcid is only used for negotiation I thought.

tegefaulkes added a commit to MatrixAI/js-quic that referenced this pull request Oct 27, 2023
Changes:
* Exposed connection destination ID
* Created a common ID that is identical for both sides of the connection. Derived from the source and destination ID.
* The `quiche` connection is now the Source of Truth for these IDs. The destinationId is only known after establishing and is only valid while running.
* Added a test demonstrating the new IDs and how they relate to each other.

Related: MatrixAI/Polykey#609

[ci skip]
@CMCDragonkai
Member

Could go ahead with event optimisation soon too.

@tegefaulkes
Contributor Author

I need to make a patch to js-quic to expose the dcid of the connection.

Why do you need the dcid? We already have connection ID. And the dcid is only used for negotiation I thought.

Sorry, I thought I replied to this. For this PR I just need some information about the connection common to both sides. The scid and dcid are good for this.

On top of that, it's just useful to have the id to collate connection pairs in logs between multiple nodes.

@tegefaulkes
Contributor Author

tegefaulkes commented Oct 30, 2023

Progress update: now when processing a reverse connection while a connection already exists in the connection map for that node ID, both sides will deterministically select the same connection to keep by comparing the connection IDs.

Implementation-wise, it creates a common connectionId that is the same on both sides by concatenating the connectionId and connectionIdPeer together to form a connectionIdShared; the 'lower' ID comes first in this derived ID. When comparing the two connections, we keep the one with the 'lower' connectionIdShared. This way both nodes select the same connection when making this decision (sketched below).
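
For illustration, a minimal TypeScript sketch of that selection, assuming the connection IDs are comparable as hex strings; the names follow the description above but are otherwise illustrative, not the actual Polykey implementation:

    type Conn = { connectionId: string; connectionIdPeer: string };

    // Derive an ID that is identical on both sides of the connection:
    // concatenate the two IDs with the lexicographically lower one first.
    function connectionIdShared(conn: Conn): string {
      const [a, b] = [conn.connectionId, conn.connectionIdPeer].sort();
      return a + b;
    }

    // Both nodes compare the same pair of shared IDs, so they agree on
    // which connection survives.
    function selectConnectionToKeep(existing: Conn, incoming: Conn): Conn {
      return connectionIdShared(existing) <= connectionIdShared(incoming)
        ? existing
        : incoming;
    }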

That aspect of the problem is solved. A second problem is that if we use the connection right after the concurrent establishment, it may be closed while in use. To avoid this we need to apply slightly stricter locking on the connection map.

I need to use an RWLockWriter for the object map. Creation and destruction will take write locks, while usage will take read locks.
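
A rough sketch of that locking discipline, using a hypothetical RWLock interface (withReadF/withWriteF are stand-ins, not necessarily the real @matrixai/async-locks API):

    interface RWLock {
      withReadF<T>(f: () => Promise<T>): Promise<T>;
      withWriteF<T>(f: () => Promise<T>): Promise<T>;
    }

    class ConnectionMap<C> {
      protected connections = new Map<string, C>();
      public constructor(protected lock: RWLock) {}

      // Creation (and destruction) mutate the map, so they take the write lock.
      public async create(nodeId: string, make: () => Promise<C>): Promise<C> {
        return this.lock.withWriteF(async () => {
          const conn = await make();
          this.connections.set(nodeId, conn);
          return conn;
        });
      }

      // Usage only reads the map, so concurrent users share the read lock
      // and cannot race with creation or destruction.
      public async use<T>(nodeId: string, f: (conn: C) => Promise<T>): Promise<T> {
        return this.lock.withReadF(async () => {
          const conn = this.connections.get(nodeId);
          if (conn == null) throw new Error(`No connection for ${nodeId}`);
          return f(conn);
        });
      }
    }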

There is also a possibility that the reverse connection's streams need to be handled while deciding which one to keep. I'll need to keep an eye out for a race condition there.

Ultimately we'd want to be able to concurrently establish a connection between two nodes and not have any intermittent failures happen due to race conditions. We may not be able to fully eliminate it, but we can significantly reduce the chance of it happening.

@tegefaulkes
Contributor Author

I think while I'm here I need to use a monitor for re-entrant locking.

@tegefaulkes
Contributor Author

I'm starting to think that no matter what we do, locking isn't going to solve the problem of an RPC request being cancelled when handling this.

I think we just need to be more lax about it. What we can do is have both connections be valid, but the one we want to clean up will not be used to start any new streams. Instead we will destroy it with force: false and wait for all the streams to finish.

While this is happening, we just synchronously swap the connection we want to keep into the connection map. This should allow us to avoid cancelling any active RPC requests. The downside is that a duplicate connection could exist while it's still handling streams. This seems like an OK compromise; a sketch follows below.
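
A hypothetical sketch of that swap-then-drain approach, where destroy({ force: false }) mirrors the option named in this thread and everything else is illustrative:

    interface NodeConn {
      connectionId: string;
      connectionIdPeer: string;
      destroy(opts: { force: boolean }): Promise<void>;
    }

    function sharedId(conn: NodeConn): string {
      const [a, b] = [conn.connectionId, conn.connectionIdPeer].sort();
      return a + b;
    }

    async function resolveDoppelganger(
      connectionMap: Map<string, NodeConn>,
      nodeId: string,
      incoming: NodeConn,
    ): Promise<void> {
      const existing = connectionMap.get(nodeId);
      if (existing == null) {
        connectionMap.set(nodeId, incoming);
        return;
      }
      // Both sides keep the connection with the lower shared ID, so they agree.
      const keep = sharedId(existing) <= sharedId(incoming) ? existing : incoming;
      const drop = keep === existing ? incoming : existing;
      // Synchronous swap: new streams immediately go to the kept connection.
      connectionMap.set(nodeId, keep);
      // Graceful drain: force: false lets active streams on the loser finish.
      await drop.destroy({ force: false });
    }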

@CMCDragonkai
Member

So the problem is you have two concurrent things happening out of order, and only one can succeed. Locking isn't sufficient for this because there's no neutral sequencer.

In this case what you can do is the STONITH strategy. But you want to make sure that both nodes choose the same one to do it to.

@CMCDragonkai
Member

See https://chat.openai.com/share/476a7f17-ab44-43cb-aa2d-dd133bb1f0df

Apply a STONITH strategy deterministically, and also apply a small random jitter to connection initiations. A random delay of between 0 and 50 ms?
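
The jitter part could be as simple as the following sketch (assuming a plain setTimeout-based sleep; the 0 to 50 ms range is the one proposed above):

    const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

    // Delay each initiation by a uniformly random 0-50 ms before dialling,
    // reducing the chance that both sides initiate at exactly the same time.
    async function connectWithJitter(connect: () => Promise<void>): Promise<void> {
      await sleep(Math.random() * 50);
      await connect();
    }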

@tegefaulkes
Contributor Author

I've pretty much implemented that already, and I have most of a solution now. Both nodes do select the same connection to kill; the other part of the problem is that the connection could already be handling a stream. I want to handle the stream gracefully while cleaning up the connection.

To this end, when we enter what I'm going to call a doppelganger scenario, we synchronously swap the existing connection with the new connection if needed. The connection we don't want will be destroyed with force: false so any streams can end gracefully.

I need to check something with js-quic: that destroying a connection with force: false works as expected on both ends. Seems like there might be an issue with that. In the meantime, delaying destruction of the connections seems to work around it.

@tegefaulkes
Contributor Author

Just checked: all calls to withConnF inside the NodeConnectionManager handle any expected connection errors and prevent them from throwing. Task 1 should be addressed.
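
Illustratively, the shape of that handling looks something like the following sketch; ErrorNodeConnection is a stand-in for the real Polykey error classes, not the actual API:

    // Stand-in for the real Polykey connection error classes.
    class ErrorNodeConnection extends Error {}

    async function withConnSafe(f: () => Promise<void>): Promise<void> {
      try {
        await f();
      } catch (e) {
        // Expected failure of a background connection: swallow it so it
        // never bubbles to the top of the program.
        if (e instanceof ErrorNodeConnection) return;
        // Anything unexpected still propagates.
        throw e;
      }
    }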

@tegefaulkes
Contributor Author

Two things left before this PR is done.

  1. Remove magic error codes when stopping the NodeConnection.
  2. Check if streams can end gracefully when the connection is stopped with force: false. This needs to be checked in js-quic.

@tegefaulkes
Contributor Author

tegefaulkes commented Nov 1, 2023

Due to the other changes, there is now only one place where the magic error code is used for stopping a connection, and that's in the NodeConnection.destroy method.

    await this.quicConnection.stop(
      force
        ? {
            isApp: true,
            errorCode: 1,
            reason: Buffer.from('NodeConnection is forcing destruction'),
            force,
          }
        : {},
    );

Ultimately this will only be called when the NCM stops and closes all connections.

I'm thinking that the code and reason can be passed into the NodeConnection.stop().

As for the enum, it's just going to be a single entry, but I can still add it. Keep in mind that enums can only map to numbers or strings in TypeScript, so I need separate ones for the code and the reason.
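
Inside NodeConnection.destroy, the stop call above could then look something like this sketch, using the enum names given later in this thread (the code and reason could equally be passed in as parameters of NodeConnection.stop()):

    enum ConnectionErrorCode {
      ForceClose = 1,
    }

    enum ConnectionErrorReason {
      ForceClose = 'NodeConnection is forcing destruction',
    }

    await this.quicConnection.stop(
      force
        ? {
            isApp: true,
            errorCode: ConnectionErrorCode.ForceClose,
            reason: Buffer.from(ConnectionErrorReason.ForceClose),
            force,
          }
        : {},
    );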

@tegefaulkes
Contributor Author

This is pretty much done, but there are three weird test failures happening, so there are still some things I need to check. I'm out of energy to deal with it today.

@tegefaulkes
Contributor Author

Two tests in VaultManager.test.ts are failing due to the writable stream being locked. I have no idea how anything I've changed could've affected this. I haven't looked at it yet.

There is a discovery test failing due to trying to start a stream while the connection is stopping. Two odd things here: 1. the connection is stopping and I'm not sure why (maybe a deadlock plus timeout?); 2. the error is bubbling up when it should be handled. Still looking into this.

@CMCDragonkai
Member

Is that WritableStream a web stream or one of the old FS streams that we haven't replaced?

@tegefaulkes
Contributor Author

I'm stuck on a problem with a discovery test. It's functioning as expected, but one of the nodes is trying to connect to another while expecting the wrong NodeId. This is really weird, and I'm not even sure the problem is related to my changes.

I'm still digging into it.

@tegefaulkes
Contributor Author

I found the problem. A bug introduced in 2834a09 caused this one specific discovery test to fail sometimes.

If nothing else is broken, then I should have this done soon.

@tegefaulkes tegefaulkes force-pushed the feature-connection-crash-fix branch from 2e76198 to 575165f on November 2, 2023 08:22
@tegefaulkes tegefaulkes marked this pull request as ready for review November 2, 2023 08:30
@tegefaulkes tegefaulkes force-pushed the feature-connection-crash-fix branch from 575165f to 6d08ff5 on November 2, 2023 08:31
@CMCDragonkai
Member

Commit messages should be more descriptive. What was the bug?

@CMCDragonkai
Member

What are the new exceptions being used in place of enums? Can you spec that out into the PR spec?

@tegefaulkes
Contributor Author

tegefaulkes commented Nov 2, 2023

CI failure:

FAIL tests/nodes/NodeConnectionManager.mdns.test.ts
  NodeConnectionManager MDNS test
    ✓ should find local node without seedNodes (335 ms)
    ✕ acquireConnection should create local connection without seednodes (1216 ms)
  ● NodeConnectionManager MDNS test › acquireConnection should create local connection without seednodes
    ErrorNodeGraphNodeIdNotFound:
      766 |       addresses = await this.findNodeAll(targetNodeId, undefined, ctx);
      767 |       if (addresses.length === 0) {
    > 768 |         throw new nodesErrors.ErrorNodeGraphNodeIdNotFound();
          |               ^
      769 |       }
      770 |     }
      771 |     // Then we just get the connection, it should already exist.
      at constructor_.getConnection (src/nodes/NodeConnectionManager.ts:768:15)

I'll need to look into this.

Looking deeper, it's a failure of an MDNS test, which is out of scope for this PR. The problem seems to be that it's failing to find the node through MDNS here. But it runs fine locally, so maybe a race condition is happening in the CI.

Any insights @amydevs ?

…s with new `NodeId` found when searching

[ci skip]
@tegefaulkes tegefaulkes force-pushed the feature-connection-crash-fix branch from 6d08ff5 to 1a26cb0 on November 3, 2023 01:24
@tegefaulkes
Contributor Author

What are the new exceptions being used in place of enums? Can you spec that out into the PR spec?

We're using an enum in place of a magic number error code and message when forcing a connection to end.

Two new enums have been created. I would've used one, but enums can only take numbers or strings, not an object of both.

    enum ConnectionErrorCode {
      ForceClose = 1,
    }

    enum ConnectionErrorReason {
      ForceClose = 'NodeConnection is forcing destruction',
    }

This is only needed in one place now. That's when forcing a NodeConnection to destroy.

@tegefaulkes tegefaulkes merged commit 2523bb6 into staging Nov 3, 2023
@CMCDragonkai
Member

It's weird that only connection error codes are enums but not everything else that relies on exceptions.

@CMCDragonkai
Member

Plus don't you need all your enums to be together?

@tegefaulkes
Contributor Author

It's not really an exception in the application here; it's just a code and message for stopping the connection. We don't directly need to throw anything in this case.

The enums are together here? I'm not sure what you mean.
