
Conversation

@greg7mdp
Contributor

@greg7mdp greg7mdp commented Mar 4, 2025

Resolves #1183.

It is needed for #1169.

Performance tests for reflecting peer keys from the chainbase db into the unordered_flat_map (release build, AMD 7950x). I used 10,000 keys, which is many more than can be expected.

  • update empty unordered_flat_map with 10,000 {name/public_key} pairs from chainbase: 1.71573ms
  • update unordered_flat_map containing any number of {name/public_key} pairs with 32 new keys: ~0.01ms.
  • each pair occupies 88 bytes, so 10,000 pairs occupy about 1MB per map, or 2MB total.

Note: when a peerkey row is deleted in chainbase via delpeerkey, the key is not deleted from the unordered_flat_map. I don't think it should be a problem.

auto lower = idx.lower_bound(boost::make_tuple(t_id->id));
auto upper = idx.lower_bound(boost::make_tuple(next_tid));

for (auto itr = lower; itr != upper; ++itr) {
Contributor

The table could be rather large. Seems like we should loop through the producer schedule, find each one, and only copy those out.

Contributor

There are 679 registered producers in EOS. Granted there are only 30 finalizers, but that can grow unbounded.

Contributor Author

A few hundred name/public-key pairs is tiny, a few KB. I feel that adding extra code is not desirable since it will not make any measurable difference, but that may not be the prevalent viewpoint at ENF :-)

Contributor

But what if the table is maliciously made large?

Contributor Author

I'll address this with the version scheme as well.

Contributor

I think we should avoid a host function if at all possible.

Contributor Author

@heifner one question I have: should we consider both proposers and finalizers?
Also, should we use the current producers and the ones on the next schedule?
Should we consider the current finalizers plus all the pending ones?

Upon reflection, I think it is better to just efficiently reflect the table from chainbase, using the version scheme I described above.

Contributor

I think I agree with that. The net_plugin can track the current proposers and finalizers easily. It already tracks the active and pending proposer schedule. The account has to pay for the on-chain ram, so duplicating it in the node seems like it would be safe. We could limit the copy time to say 5ms so it can't be weaponized.

Contributor

However, why bother keeping a copy? Chainbase access should only be slightly slower than a C++ memory lookup. You could use a hash lookup instead of a chainbase lookup, but it seems not worth making a copy.

Contributor Author

@greg7mdp greg7mdp Mar 5, 2025

However, why bother keeping a copy.

So it can be accessed without locks from any net_plugin thread, even when the main thread is executing transactions and updating the chainbase db.
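The lock-free access pattern being described can be sketched as follows. This is an illustrative sketch, not Spring's actual code: the class and method names (`peer_keys_view`, `snapshot`, `publish`) are made up, and it uses the standard atomic free functions for `shared_ptr`. Readers on net_plugin threads grab an immutable snapshot via an atomic pointer load, while the main thread publishes a freshly built map after applying chainbase updates.

```cpp
#include <atomic>
#include <memory>
#include <string>
#include <unordered_map>

// stand-in for the name -> public-key map reflected from chainbase
using peer_key_map_t = std::unordered_map<std::string, std::string>;

class peer_keys_view {
   std::shared_ptr<const peer_key_map_t> _map = std::make_shared<peer_key_map_t>();
public:
   // called from any net_plugin thread: no lock, just an atomic pointer load;
   // the returned snapshot stays valid even if a new map is published meanwhile
   std::shared_ptr<const peer_key_map_t> snapshot() const {
      return std::atomic_load(&_map);
   }
   // called from the main thread after reflecting updates from chainbase
   void publish(peer_key_map_t new_map) {
      std::atomic_store(&_map, std::shared_ptr<const peer_key_map_t>(
                                  std::make_shared<peer_key_map_t>(std::move(new_map))));
   }
};
```

Because readers hold a `shared_ptr` to the old map, the writer can swap in a new one at any time without invalidating in-flight lookups; the old map is freed when the last reader drops its snapshot.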

@spoonincode
Contributor

Apologies if this was covered in some design doc I missed: Was using a ROtrx considered instead of fishing in the tables manually?

@greg7mdp
Contributor Author

greg7mdp commented Mar 4, 2025

Apologies if this was covered in some design doc I missed: Was using a ROtrx considered instead of fishing in the tables manually?

I didn't consider it. Wouldn't really know how to do it either.

@heifner
Contributor

heifner commented Mar 4, 2025

Apologies if this was covered in some design doc I missed: Was using a ROtrx considered instead of fishing in the tables manually?

Interesting idea. Currently read-only trxs are not enabled on all nodes, but that could be changed if we wanted to use them "internally" for something like this. Also I wonder how much overhead that would be compared to direct table access.

@ericpassmore
Contributor

Note:start
category: Other
component: P2P
summary: Retrieval of peers public keys from chainbase db.
Note:end

fc::raw::unpack(ds, row_name);
fc::raw::unpack(ds, row_version);
fc::raw::unpack(ds, row_key);
assert(row_version > _version && row_version <= *cb_version);
Contributor

Is this putting too much trust in to the contract that the row contents will always be well formed? Previously one malformed table row would "just" cause the entire peer list to fail to update so I didn't bring it up, but now a malformed table row can abort the whole process if asserts are enabled.

Contributor Author

How could a row not be well formed though?

Contributor

It's too short (throws exception), public key is malformed (exception), or row_version==0 (assert), etc

Contributor Author

You mean the serialized data we unpack? How is it possible that the system contract would not populate a correct row?

Contributor Author

a malformed table row can abort the whole process if asserts are enabled.

I feel that that's the whole point of enabling asserts. You want the process to abort if something that shouldn't be possible does happen.

Contributor

That's not how I remember us describing the use case for asserts. I remember asserts being present to describe invariants that cannot be true. In this case this isn't an invariant but rather input validation on what the contract is communicating to nodeos. asserts should be impossible from the viewpoint of nodeos but in this case a misbehaving contract can force an assert to be true.

Contributor Author

@greg7mdp greg7mdp Mar 10, 2025

invariants that cannot be true

According to how we understand the code to work IMO! And that should include contract code.
Really an assert says: If my code and the compiler are correct, this has to be true.

Different from user input validation that we use EOS_ASSERT for.

Contributor

I don't consider the system contract and nodeos to be one entity, so communication between them would qualify as requiring input validation. Even the system contract isn't worthy enough to blindly trust to the point of allowing it to fail asserts.

Contributor Author

I have EOS_ASSERTs now so we should be fine.

size_t num_updated = 0;
try {
const auto& db = chain.db();
auto cb_version = _get_version(db);
Contributor

I am not really seeing any benefit to using chainbase's revision over just simply the applied block num? That's really what the contract is tracking -- the last block num the row was modified at. (version seems a misnomer in that aspect) Relying on chainbase revision == head block num might be an implementation detail.

Contributor Author

@greg7mdp greg7mdp Mar 10, 2025

This tracks the version of the peerkeys table, and this version is incremented every time the peerkeys table is modified.
What I meant with cb_version is the version of peerkeys in chainbase, vs the version of peerkeys we currently have in the unordered_flat_map.
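The versioning scheme being described can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not Spring's actual code: `row_t`, `peer_keys_cache`, and the flat `rows` vector (standing in for the chainbase secondary index ordered by version) are all made up for the example. The contract bumps a table-wide version on every `peerkeys` change and records, per row, the version at which the row was last modified; the node then only needs to apply rows whose version lies in `(_version, cb_version]`.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// a chainbase peerkeys row, reduced to the fields relevant here
struct row_t { std::string name; uint64_t version; std::string key; };

struct peer_keys_cache {
   uint64_t _version = 0;                               // table version reflected so far
   std::unordered_map<std::string, std::string> _map;   // cached name -> key map

   // `rows` stands in for the chainbase table; in the real code a secondary
   // index keyed by version lets us iterate only the new rows directly
   size_t update(uint64_t cb_version, const std::vector<row_t>& rows) {
      size_t num_updated = 0;
      if (cb_version == _version)
         return num_updated;                            // nothing to do
      for (const auto& r : rows) {
         if (r.version > _version && r.version <= cb_version) {
            _map[r.name] = r.key;                       // only rows changed since last update
            ++num_updated;
         }
      }
      _version = cb_version;
      return num_updated;
   }
};
```

The point of the scheme is the early-out: when the cached `_version` already matches the table version in chainbase, the update is a single comparison, and otherwise only the delta is copied.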

Contributor

But afaict you're just using chainbase's revision as a proxy for current block num. I don't believe that's a pattern used elsewhere -- normally you'd look at forkdb or the block passed in accepted_block.

Contributor Author

just using chainbase's revision as a proxy for current block num

Where do I do that? peer_keys_db_t::_get_version() returns the version stored in the peerkeysver table.

Contributor

whoops you are right I misread the earlier code

auto cb_version = _get_version(db);
if (!cb_version || *cb_version == _version)
return num_updated; // nothing to do
assert(*cb_version > _version);
Contributor

Won't this potentially assert on fork switches?

Contributor Author

Hum, good point, let me think about it.

Contributor Author

Should be addressed by 2bcfc00 and
VaultaFoundation/system-contracts@90773cb

which now update the reflected maps only up to lib.

Contributor

Doesn't matter what signal you attach to, the state of the node is as of HEAD unless you are running in IRREVERSIBLE mode.

Contributor

Sorry, I misunderstood your comment. You are only pulling rows with block numbers up to lib from the table.

auto lower = secidx.lower_bound(std::make_tuple(t_id->id._id, _version + 1));
auto upper = secidx.upper_bound(std::make_tuple(t_id->id._id, *cb_version));

auto new_map = boost::make_shared<peer_key_map_t>(*get_peer_key_map());
Contributor

Do we have confidence that given the size of this map can be maliciously controlled that performance will not be a problem at 100k, 500k, etc entries when doing the deep copy? It may be possible someone can make bloated 512KB PUB_WA in these entries as well, making this a very heavy copy (this may be something that the contract can defend against though by simply disallowing PUB_WA here). I'd like to see more data on the overhead and/or defense against abuse.

Contributor Author

@greg7mdp greg7mdp Mar 10, 2025

Good point for the PUB_WA. I'll disallow it in the contract.

regpeerkey registers a key for an existing account.

To register 500k peerkeys, you'd have to create 500k accounts. At a cost of 1GB of ram.

The cost for the table would only be 50MB. And because those updates would not occur in a single block it would not take an excessive time. It doesn't seem that that would be the most effective attack against spring.

I do believe that if someone created 500K accounts, there are likely many other things in Spring that would degrade more than the user registering 500K peer keys and using 50MB of RAM.

There is a possible attack by creating an account with a name, registering a key, then deleting the account which I think would return the ram cost, and then doing the same thing over with a different name. I can add some protection against that if necessary.

Contributor Author

disallowing webauthn keys: VaultaFoundation/system-contracts@fd86ed8

Contributor

I wasn't too worried about changes; worst case (assuming action executes 100µs) I guess you would end up with 40000 changed keys to iterate through. I was more worried about the unbounded size of the map being copied. Nobody who owns 1GB, 10GB, or whatever should be able to cause large delays inside of nodeos.

Contributor Author

@greg7mdp greg7mdp Mar 10, 2025

The map with 10,000 keys is copied in 0.059ms. I don't see the large delays inside of nodeos happening.

But if you think this is worth worrying about, I can add some code to ensure that no large delay occurs even if the map was to grow to 1M entries or more.

Contributor Author

Another alternative would be to keep two updated unordered_flat_map in memory, and switch the atomic shared_ptr back and forth between them.
I think I like this solution better because it makes an attack impractical, and in the case where there is no attack the cost of keeping duplicated maps in memory is trivial.

Contributor Author

Implemented the flip-flop solution f7384d5

Contributor

I'm not sure how the separate thread solution would work. Couldn't you just put a time limit of say 5ms per accept block? If it is delayed in getting the data no real harm especially if the data is an attack.

Contributor Author

Couldn't you just put a time limit of say 5ms per accept block?

I implemented the flip-flop solution, so a time limit is not needed anymore. I think the current version is safe.

Contributor

I'm not sure how the separate thread solution would work

I was thinking the copy and merge would occur on a separate thread.

Contributor

@spoonincode spoonincode left a comment

requires further discussion:
#1228 (comment)

peer_keys_db.cpp
Contributor

Originally this was in net_plugin. I think that is a better place for it.

Contributor Author

I moved it into controller because I wanted to be able to test it in unit tests, which I think is useful.

Contributor

net_plugin has unit tests.

Contributor Author

The unit test is in eos-system-contracts/tests and accesses the controller via the tester. I don't think the cmake system is plumbed for accessing net_plugin, but I could be wrong.

Contributor

I didn't realize you were using it in eos-system-contracts/tests. Seems like those tests should be moved to Spring as they test Spring functionality.

Contributor Author

@greg7mdp greg7mdp Mar 11, 2025

I didn't realize you were using it in eos-system-contracts/tests. Seems like those tests should be moved to Spring as they test Spring functionality.

Well, I need the contracts from eos-system-contracts to modify chainbase memory when registering keys using the regpeerkey action, so the test has to be in eos-system-contracts/tests. But in the same test I verify the reading-from-chainbase part, which is in libraries/chain, so I can check that both the registering of keys and the reflection into an unordered_flat_map work.

Contributor

I think I would rather test the eos-system-contracts in Spring than test Spring in eos-system-contracts. Not sure @spoonincode thoughts on this. I can approve as is, but seems cleaner to move all the tests to Spring.

Contributor

I think I would rather test the eos-system-contracts in Spring than test Spring in eos-system-contracts

The short answer is I agree with that.

I had a longer answer how maybe the ideal way to test this feature is in spring but not via eos-system-contracts, because my understanding from prior discussions is that reference-contracts/eos-system-contracts being part of spring's tests has been considered somewhat[1] an anti-pattern, especially outside of integration level tests. But since this will ultimately involve net plugin tests too perhaps it will eventually be part of an integration test.

Footnotes

  1. Or maybe not even 'somewhat' depending who you ask

Contributor Author

@greg7mdp greg7mdp Mar 13, 2025

It is hard to make sure that the code writing into chainbase in eos-system-contracts is in sync with the code reading the data in spring, if we are not using both spring code and eos-system-contracts in the tests.

For example when I changed the code in eos-system-contracts to store block_num as uint32_t instead of uint64_t, the test failed because the code was not in sync with spring. I wouldn't want to lose that connection by duplicating the contract code in spring.

Contributor

Thinking on it more, it's clearer that the right place to test is in spring for two reasons --

  1. I take it eventually there will be an integration test for the feature as a whole (i.e. with net stuff tested) too and that can only be done in spring anyways. But even if not making such a test,
  2. contract tests should not depend on random internals of libchain because it prevents us from creating a clean libtester api. In other words, contract unit tests directly interacting beyond the tester apis is not something we want to encourage -- we want to reduce the leakiness not add more. As an extreme example of why it's a problem, consider if we want to rewrite the system contracts tests in Vert[1]. It's literally impossible to have this test with Vert.

I see @heifner has approved so I won't block the PR on this issue either. I'll rereview shortly.

Footnotes

  1. This might not even be an 'extreme' example since it has been floated more than once!

@greg7mdp
Contributor Author

Silly me, I realize there is an issue with my flip-flop scheme... moving the PR back to draft.

@greg7mdp greg7mdp marked this pull request as draft March 11, 2025 21:37
@greg7mdp greg7mdp marked this pull request as ready for review March 12, 2025 00:58
@greg7mdp
Contributor Author

Code is ready for review again. I ended up using a mutex, but considering the updates to the hash_map are quite fast and infrequent (per block containing new peer keys), I feel it shouldn't be an issue.
I could use a shared_mutex to only read lock for operator[](), but I doubt it would be an improvement.
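The mutex-based design being described can be sketched as below. This is a hedged sketch, not Spring's actual `peer_keys_db_t` interface: the class name `peer_keys_map_t` and the method names are illustrative. It shows the `shared_mutex` variant mentioned above, where lookups take a shared (read) lock and the infrequent per-block update takes an exclusive lock.

```cpp
#include <mutex>
#include <optional>
#include <shared_mutex>
#include <string>
#include <unordered_map>

class peer_keys_map_t {
   mutable std::shared_mutex                    _m;
   std::unordered_map<std::string, std::string> _map;   // name -> public key
public:
   // lookups from net_plugin threads: shared lock, many concurrent readers allowed
   std::optional<std::string> get_peer_key(const std::string& name) const {
      std::shared_lock l(_m);
      if (auto it = _map.find(name); it != _map.end())
         return it->second;
      return std::nullopt;
   }
   // update from the main thread: exclusive lock, fast and at most once per
   // block containing new peer keys, so contention stays negligible
   void set_peer_key(const std::string& name, std::string key) {
      std::unique_lock l(_m);
      _map[name] = std::move(key);
   }
};
```

As the comment notes, with updates this fast and infrequent a plain `std::mutex` is likely indistinguishable in practice; the `shared_mutex` form only pays off if reads are frequent enough to contend.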


if(CMAKE_CXX_COMPILER_ID MATCHES "Clang" AND CMAKE_CXX_COMPILER_VERSION VERSION_GREATER_EQUAL 14.0)
target_compile_options(eosio_chain PUBLIC -Wthread-safety)
endif()
Contributor

In the future we should consider making this common somewhere and not copy paste it all over.

Also, since this is PUBLIC it means everything that links to libchain also gets it enabled. This is true today for everything that links to net_plugin -- for example nodeos' main.cpp currently have -Wthread-safety enabled. Is that a problem? If it's not a problem, all the more reason to consider just enabling it globally.

Contributor Author

Would you like me to add it to the top CMakeLists.txt?

Contributor

I wasn't really sure what the side effect was for enabling -Wthread-safety on code that isn't prepared for it. We're already doing that a little because of net_plugin and now we're going to be doing it a lot because of libchain. If that's fine maybe we blanket enable it at top anyways, if that's bad then maybe we should change these to PRIVATE instead. Doesn't need to be done now.

Contributor Author

@greg7mdp greg7mdp Mar 15, 2025

I was just trying it out and it builds fine. Checking it in to see if there is any test issue. I can undo that commit if desired.
eb52ee9

Contributor Author

I wasn't really sure what the side effect was for enabling -Wthread-safety on code that isn't prepared for it.

I believe that it should be fine (a no-op)
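For context on why it is effectively a no-op: `-Wthread-safety` enables Clang's thread-safety analysis, which only fires on code carrying annotations such as `guarded_by`. A minimal illustration (the macro guard is an assumption for portability; with libc++, `std::mutex` is already annotated as a capability, while with other standard libraries the analysis may need an annotated wrapper type):

```cpp
#include <mutex>

// expand the annotation only under Clang; other compilers see plain code
#if defined(__clang__)
#define GUARDED_BY(m) __attribute__((guarded_by(m)))
#else
#define GUARDED_BY(m)
#endif

class counter {
   std::mutex _m;
   int        _n GUARDED_BY(_m) = 0;   // analysis: _n may only be touched with _m held
public:
   int increment() {
      std::lock_guard g(_m);           // ok: _m is held while _n is accessed
      return ++_n;
   }
   // int bad() { return _n; }         // would warn under clang -Wthread-safety
};
```

Unannotated code compiles identically with or without the flag, which is why enabling it broadly (or globally, as discussed) costs nothing for translation units that don't use the annotations.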

private:
std::optional<uint64_t> _get_version(const chainbase::database& db);

bool _active; // if not active (the default), no update occurs
Contributor

Looks like this needs a = false

Contributor Author

Thanks c60b6e9

"deserialized invalid version from `peerkeys`");

fc::raw::unpack(ds, row_key);
EOS_ASSERT(row_key.valid(), misc_exception, "deserialized invalid public key from `peerkeys`");
Contributor

I am suspicious this opens up the possibility for a denial of service. It looks like you should just ignore entries with a bad key instead of failing to update all entries.

Contributor Author

I believe this is what we are doing with the try/catch block within the loop.
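The per-row try/catch being referred to can be sketched like this. It is an illustrative sketch only: `parse_row` is a hypothetical stand-in for the real `fc::raw::unpack` + `EOS_ASSERT` validation, and the names are made up. The key point is the inner try block, which scopes a deserialization failure to one row instead of aborting the whole update.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// hypothetical stand-in for unpacking + validating one serialized peerkeys row;
// throws on malformed input, like fc::raw::unpack / EOS_ASSERT would
static std::pair<std::string, std::string> parse_row(const std::string& raw) {
   auto pos = raw.find(':');
   if (pos == std::string::npos || pos == 0 || pos + 1 == raw.size())
      throw std::runtime_error("malformed peerkeys row");
   return {raw.substr(0, pos), raw.substr(pos + 1)};
}

static size_t update_from_rows(const std::vector<std::string>& rows,
                               std::unordered_map<std::string, std::string>& out) {
   size_t num_updated = 0;
   for (const auto& raw : rows) {
      try {                                   // inner try: a failure affects one row only
         auto [name, key] = parse_row(raw);
         out[name] = key;
         ++num_updated;
      } catch (...) {
         // skip the bad row and keep going; remaining entries still update
      }
   }
   return num_updated;
}
```

An outer try block (as in the quoted `size_t num_updated = 0; try { ... }` hunk above) can still guard against failures outside the loop, which is the doubled-up structure the reviewers were discussing.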

Contributor

you are right, I got cross-eyed with the doubled-up try blocks

Contributor

@spoonincode spoonincode left a comment

EOS_ASSERT(row_block_num > static_cast<uint64_t>(_block_num), misc_exception,
"deserialized invalid version from `peerkeys`");

fc::raw::unpack(ds, row_version); // no need to check, all versions must include the `key` optional
Contributor

Similar to https://github.com/eosnetworkfoundation/eos-system-contracts/pull/185/files#r2001804923 -- comment does not match my understanding of prior discussion. It seems like if version != 0 this should skip the row.

@greg7mdp greg7mdp merged commit 329f7d9 into main Mar 24, 2025
36 checks passed
@greg7mdp greg7mdp deleted the gh_1183 branch March 24, 2025 19:17

Development

Successfully merging this pull request may close these issues.

Add support for accessing the on-chain peer public keys