
Conversation

@tegefaulkes
Contributor

@tegefaulkes tegefaulkes commented Oct 25, 2023

Description

This PR addresses stability fixes for failing connections.

It directly addresses issue #592.

Issues Fixed

Tasks

  • 1. Prevent any connection failures in the background from bubbling up to the top of the program. The NCM should never throw when a connection fails.
  • 2. Expand tests to include connection failures and concurrent connection failures.
  • 3. Remove magic number error codes from the nodes domain and use an enum to get the code and reason. All forced connection stops should use this.
  • 4. Ensure that concurrent connections result in a single successful active connection.

Final checklist

  • Domain specific tests
  • Full tests
  • Updated inline-comment documentation
  • Lint fixed
  • Squash and rebased
  • Sanity check the final build

@tegefaulkes tegefaulkes self-assigned this Oct 25, 2023
@tegefaulkes tegefaulkes changed the title from "Feature connection crash fix" to "Addressing crashes from failing background connections" Oct 25, 2023

@tegefaulkes tegefaulkes force-pushed the feature-connection-crash-fix branch 2 times, most recently from 73a9e72 to d32b895 on October 26, 2023 03:24
@tegefaulkes
Contributor Author

I need to make a patch to js-quic to expose the dcid of the connection.

@CMCDragonkai
Member

I need to make a patch to js-quic to expose the dcid of the connection.

Why do you need the dcid? We already have connection ID. And the dcid is only used for negotiation I thought.

tegefaulkes added a commit to MatrixAI/js-quic that referenced this pull request Oct 27, 2023
Changes:
* Exposed connection destination ID
* Created a common ID that is identical for both sides of the connection. Derived from the source and destination ID.
* The `quiche` connection is now the Source of Truth for these IDs. The destinationId is only known after establishing and is only valid while running.
* Added a test demonstrating the new IDs and how they relate to each other.

Related: MatrixAI/Polykey#609

[ci skip]
@CMCDragonkai
Member

Could go ahead with event optimisation soon too.

@tegefaulkes
Contributor Author

I need to make a patch to js-quic to expose the dcid of the connection.

Why do you need the dcid? We already have connection ID. And the dcid is only used for negotiation I thought.

Sorry, I thought I replied to this. For this PR I just need some information about the connection common to both sides. The scid and dcid are good for this.

On top of that, it's just useful to have the id to collate connection pairs in logs between multiple nodes.

@tegefaulkes
Contributor Author

tegefaulkes commented Oct 30, 2023

Progress update: now when processing a reverse connection while a connection already exists in the connection map for that node ID, both sides will deterministically select the same connection to keep by comparing the connection IDs.

Implementation-wise, it creates a common connectionId that is the same on both sides by concatenating the connectionId and connectionIdPeer together to form a connectionIdShared; the 'lower' ID comes first in this derived ID. When comparing the two connections, we keep the one with the 'lower' connectionIdShared. This way both nodes select the same connection when making this decision (sketched below).
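
For illustration, a minimal TypeScript sketch of that selection, assuming the connection IDs are comparable as hex strings; the names follow the description above but are otherwise illustrative, not the actual Polykey implementation:

    type Conn = { connectionId: string; connectionIdPeer: string };

    // Derive an ID that is identical on both sides of the connection:
    // concatenate the two IDs with the lexicographically lower one first.
    function connectionIdShared(conn: Conn): string {
      const [a, b] = [conn.connectionId, conn.connectionIdPeer].sort();
      return a + b;
    }

    // Both nodes compare the same pair of shared IDs, so they agree on
    // which connection survives.
    function selectConnectionToKeep(existing: Conn, incoming: Conn): Conn {
      return connectionIdShared(existing) <= connectionIdShared(incoming)
        ? existing
        : incoming;
    }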

That aspect of the problem is solved. A second problem is that if we use the connection right after the concurrent establishment, it may be closed while in use. To avoid this we need to apply slightly stricter locking on the connection map.

I need to use an RWLockWriter for the object map. Creation and destruction will take write locks, while usage will take read locks.
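
A rough sketch of that locking discipline, using a hypothetical RWLock interface (withReadF/withWriteF are stand-ins, not necessarily the real @matrixai/async-locks API):

    interface RWLock {
      withReadF<T>(f: () => Promise<T>): Promise<T>;
      withWriteF<T>(f: () => Promise<T>): Promise<T>;
    }

    class ConnectionMap<C> {
      protected connections = new Map<string, C>();
      public constructor(protected lock: RWLock) {}

      // Creation (and destruction) mutate the map, so they take the write lock.
      public async create(nodeId: string, make: () => Promise<C>): Promise<C> {
        return this.lock.withWriteF(async () => {
          const conn = await make();
          this.connections.set(nodeId, conn);
          return conn;
        });
      }

      // Usage only reads the map, so concurrent users share the read lock
      // and cannot race with creation or destruction.
      public async use<T>(nodeId: string, f: (conn: C) => Promise<T>): Promise<T> {
        return this.lock.withReadF(async () => {
          const conn = this.connections.get(nodeId);
          if (conn == null) throw new Error(`No connection for ${nodeId}`);
          return f(conn);
        });
      }
    }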

There is also a possibility that the reverse connection's streams need to be handled while deciding which one to keep. I'll need to keep an eye out for a race condition there.

Ultimately we'd want to be able to concurrently establish a connection between two nodes and not have any intermittent failures happen due to race conditions. We may not be able to fully eliminate it, but we can significantly reduce the chance of it happening.

@tegefaulkes
Contributor Author

I think while I'm here I need to use a monitor for re-entrant locking.

@tegefaulkes
Contributor Author

I'm starting to think that no matter what we do, locking isn't going to solve the problem of an RPC request being cancelled when handling this.

I think we just need to be more lax about it. What we can do is have both connections be valid, but the one we want to clean up will not be used to start any new streams. Instead we will destroy it with force: false and wait for all the streams to finish.

While this is happening, we just synchronously swap the connection we want to keep into the connection map. This should allow us to avoid cancelling any active RPC requests. The downside is that a duplicate connection could exist while it's still handling streams. This seems like an OK compromise; a sketch follows below.
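
A hypothetical sketch of that swap-then-drain approach, where destroy({ force: false }) mirrors the option named in this thread and everything else is illustrative:

    interface NodeConn {
      connectionId: string;
      connectionIdPeer: string;
      destroy(opts: { force: boolean }): Promise<void>;
    }

    function sharedId(conn: NodeConn): string {
      const [a, b] = [conn.connectionId, conn.connectionIdPeer].sort();
      return a + b;
    }

    async function resolveDoppelganger(
      connectionMap: Map<string, NodeConn>,
      nodeId: string,
      incoming: NodeConn,
    ): Promise<void> {
      const existing = connectionMap.get(nodeId);
      if (existing == null) {
        connectionMap.set(nodeId, incoming);
        return;
      }
      // Both sides keep the connection with the lower shared ID, so they agree.
      const keep = sharedId(existing) <= sharedId(incoming) ? existing : incoming;
      const drop = keep === existing ? incoming : existing;
      // Synchronous swap: new streams immediately go to the kept connection.
      connectionMap.set(nodeId, keep);
      // Graceful drain: force: false lets active streams on the loser finish.
      await drop.destroy({ force: false });
    }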

@CMCDragonkai
Member

So the problem is you have two concurrent things happening out of order, and only one can succeed. Locking isn't sufficient for this because there's no neutral sequencer.

In this case what you can do is the STONITH strategy. But you want to make sure that both nodes choose the same one to do it to.

@CMCDragonkai
Member

See https://chat.openai.com/share/476a7f17-ab44-43cb-aa2d-dd133bb1f0df

Apply a STONITH strategy deterministically, and also apply a small random jitter to connection initiations. A random delay of between 0 and 50 ms?
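
The jitter part could be as simple as the following sketch (assuming a plain setTimeout-based sleep; the 0 to 50 ms range is the one proposed above):

    const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

    // Delay each initiation by a uniformly random 0-50 ms before dialling,
    // reducing the chance that both sides initiate at exactly the same time.
    async function connectWithJitter(connect: () => Promise<void>): Promise<void> {
      await sleep(Math.random() * 50);
      await connect();
    }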

@tegefaulkes
Contributor Author

I've pretty much implemented that already, and I have most of a solution now. Both nodes do select the same connection to kill; the other part of the problem is that the connection could already be handling a stream. I want to handle the stream gracefully while cleaning up the connection.

To this end, when we enter what I'm going to call a doppelganger scenario, we synchronously swap the existing connection with the new connection if needed. The connection we don't want will be destroyed with force: false so any streams can end gracefully.

I need to check something with js-quic: that destroying a connection with force: false works as expected on both ends. Seems like there might be an issue with that. In the meantime, delaying destruction of the connections seems to work around it.

@tegefaulkes
Contributor Author

Just checked: all calls to withConnF inside the NodeConnectionManager handle any expected connection errors and prevent them from throwing. Task 1 should be addressed.
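
Illustratively, the shape of that handling looks something like the following sketch; ErrorNodeConnection is a stand-in for the real Polykey error classes, not the actual API:

    // Stand-in for the real Polykey connection error classes.
    class ErrorNodeConnection extends Error {}

    async function withConnSafe(f: () => Promise<void>): Promise<void> {
      try {
        await f();
      } catch (e) {
        // Expected failure of a background connection: swallow it so it
        // never bubbles to the top of the program.
        if (e instanceof ErrorNodeConnection) return;
        // Anything unexpected still propagates.
        throw e;
      }
    }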

@tegefaulkes
Contributor Author

Two things left before this PR is done.

  1. Remove magic error codes when stopping the NodeConnection.
  2. Check if streams can end gracefully when the connection is stopped with force: false. This needs to be checked in js-quic.

@tegefaulkes
Contributor Author

tegefaulkes commented Nov 1, 2023

Due to the other changes, there is now only one place where the magic error code is used for stopping a connection, and that's in the NodeConnection.destroy method.

    await this.quicConnection.stop(
      force
        ? {
            isApp: true,
            errorCode: 1,
            reason: Buffer.from('NodeConnection is forcing destruction'),
            force,
          }
        : {},
    );

Ultimately this will only be called when the NCM stops and closes all connections.

I'm thinking that the code and reason can be passed into the NodeConnection.stop().

As for the enum, it's just going to be a single entry, but I can still add it. Keep in mind that enums can only map to numbers or strings in TypeScript, so I need separate ones for the code and the reason.
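
Inside NodeConnection.destroy, the stop call above could then look something like this sketch, using the enum names given later in this thread (the code and reason could equally be passed in as parameters of NodeConnection.stop()):

    enum ConnectionErrorCode {
      ForceClose = 1,
    }

    enum ConnectionErrorReason {
      ForceClose = 'NodeConnection is forcing destruction',
    }

    await this.quicConnection.stop(
      force
        ? {
            isApp: true,
            errorCode: ConnectionErrorCode.ForceClose,
            reason: Buffer.from(ConnectionErrorReason.ForceClose),
            force,
          }
        : {},
    );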

@tegefaulkes
Contributor Author

This is pretty much done, but there are three weird test failures happening, so there are still some things I need to check. I'm out of energy to deal with it today.

@tegefaulkes
Contributor Author

Two tests in VaultManager.test.ts are failing due to the writable stream being locked. I have no idea how anything I've changed could've affected this. I haven't looked at it yet.

There is a discovery test failing due to trying to start a stream while the connection is stopping. Two odd things here: 1. the connection is stopping and I'm not sure why (maybe a deadlock plus timeout?); 2. the error is bubbling up when it should be handled. Still looking into this.

@CMCDragonkai
Member

Is that WritableStream a web stream or one of the old FS streams that we haven't replaced?

@tegefaulkes
Contributor Author

I'm stuck on a problem with a discovery test. It's functioning as expected, but one of the nodes is trying to connect to another while expecting the wrong NodeId. This is really weird, and I'm not even sure the problem is related to my changes.

I'm still digging into it.

@tegefaulkes
Contributor Author

I found the problem. A bug introduced in 2834a09 caused this one specific discovery test to fail sometimes.

If nothing else is broken, then I should have this done soon.

@tegefaulkes tegefaulkes force-pushed the feature-connection-crash-fix branch from 2e76198 to 575165f on November 2, 2023 08:22
@tegefaulkes tegefaulkes marked this pull request as ready for review November 2, 2023 08:30
@tegefaulkes tegefaulkes force-pushed the feature-connection-crash-fix branch from 575165f to 6d08ff5 on November 2, 2023 08:31
@CMCDragonkai
Member

Commit messages should be more descriptive. What was the bug?

@CMCDragonkai
Member

What are the new exceptions being used in place of enums? Can you spec that out into the PR spec?

@tegefaulkes
Contributor Author

tegefaulkes commented Nov 2, 2023

CI failure:

FAIL tests/nodes/NodeConnectionManager.mdns.test.ts
  NodeConnectionManager MDNS test
    ✓ should find local node without seedNodes (335 ms)
    ✕ acquireConnection should create local connection without seednodes (1216 ms)
  ● NodeConnectionManager MDNS test › acquireConnection should create local connection without seednodes
    ErrorNodeGraphNodeIdNotFound:
      766 |       addresses = await this.findNodeAll(targetNodeId, undefined, ctx);
      767 |       if (addresses.length === 0) {
    > 768 |         throw new nodesErrors.ErrorNodeGraphNodeIdNotFound();
          |               ^
      769 |       }
      770 |     }
      771 |     // Then we just get the connection, it should already exist.
      at constructor_.getConnection (src/nodes/NodeConnectionManager.ts:768:15)

I'll need to look into this.

Looking deeper, it's a failure of an MDNS test, which is out of scope for this PR. The problem seems to be that it's failing to find the node through MDNS here. But it runs fine locally, so maybe a race condition is happening in the CI.

Any insights @amydevs ?

…s with new `NodeId` found when searching

[ci skip]
@tegefaulkes tegefaulkes force-pushed the feature-connection-crash-fix branch from 6d08ff5 to 1a26cb0 on November 3, 2023 01:24
@tegefaulkes
Contributor Author

What are the new exceptions being used in place of enums? Can you spec that out into the PR spec?

We're using an enum in place of a magic number error code and message when forcing a connection to end.

Two new enums have been created. I would've used one, but enums can only take numbers or strings, not an object of both.

    enum ConnectionErrorCode {
      ForceClose = 1,
    }

    enum ConnectionErrorReason {
      ForceClose = 'NodeConnection is forcing destruction',
    }

This is only needed in one place now. That's when forcing a NodeConnection to destroy.

@tegefaulkes tegefaulkes merged commit 2523bb6 into staging Nov 3, 2023
@CMCDragonkai
Member

It's weird that only connection error codes are enums but not everything else that relies on exceptions.

@CMCDragonkai
Member

Plus don't you need all your enums to be together?

@tegefaulkes
Contributor Author

It's not really an exception in the application here; it's just a code and message for stopping the connection. We don't directly need to throw anything in this case.

The enums are together here? I'm not sure what you mean.
