This week @joshuef brings some news on progress with comms (communications between nodes). As patient readers will know, this has been a blocker for a while now, but we see a definite light at the end of this particular tunnel.
A quick round-up of what the team are up to.
@anselme has been investigating a rogue key attack which affects some other implementations of BLS and has found ours is vulnerable too. He has submitted a fix to the blsttc crate. He’s also looking into a bug that’s preventing clients from joining the network in tests.
@chriso is working on chunk and register signing, which is where data (or its Xorname in the case of a chunk) is signed by the elders to make it valid on the network.
Related to this, @bochaco is making changes to the client API for commands and chunks, and debugging the associated errors on CI.
Mostafa is busy on test cases for our consensus mechanism, and @bzee is continuing to poke at qp2p, where we believe there’s a deep-seated flaw that’s affecting connectivity.
So one thing we’re working on is removing our connection reduction code, and just relying on the underlying quinn code to drop connections after a smaller timeout. This way we’re assuming less, and no longer second guessing what’s open and when.
In our testing this has had a positive impact on client tests, removing the likelihood that our connections are closed part way through.
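To illustrate why an idle timeout is gentler than pruning connections ourselves, here is a minimal sketch using only the Rust standard library. It is a toy model, not quinn's implementation: `IdleTracker` and its methods are hypothetical names invented for this example. The point is simply that any activity resets the clock, so a connection that is in use is never dropped part way through.

```rust
use std::time::{Duration, Instant};

/// Hypothetical per-connection record (not a quinn type).
/// Only the time of the last activity matters for this model.
struct IdleTracker {
    idle_timeout: Duration,
    last_activity: Instant,
}

impl IdleTracker {
    fn new(idle_timeout: Duration, now: Instant) -> Self {
        IdleTracker { idle_timeout, last_activity: now }
    }

    /// Any send or receive resets the idle clock.
    fn touch(&mut self, now: Instant) {
        self.last_activity = now;
    }

    /// The connection is considered dead only after a full
    /// timeout's worth of silence.
    fn is_expired(&self, now: Instant) -> bool {
        now.duration_since(self.last_activity) >= self.idle_timeout
    }
}
```

With a 10 second timeout, a connection touched at the 9 second mark is still alive at 15 seconds and only expires once 10 quiet seconds have passed.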
In parallel, we’re also moving to streamline client/node comms using bi-directional streams. This means that we remove some state management complexity and will just wait on a response from elders. Previously (a long time ago in node-land), we’d used this kind of stream to communicate with clients, but managing the response stream was a complicated nightmare. But now the node is much simpler (due to the work done over the last year and a half or so), and this is much more manageable.
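A rough sketch of the idea, using standard library channels as a stand-in for a bi-directional stream (this is not the real client or qp2p code, and `Request`, `spawn_elder` and `send_and_wait` are hypothetical names). Each request carries its own reply half, so the caller just waits on that one response and needs no table matching message IDs to pending requests.

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread;

/// Hypothetical request: the payload plus the reply half of its own "stream".
struct Request {
    payload: String,
    reply: Sender<String>,
}

/// Stand-in for an elder: answers each request on the stream it arrived on.
fn spawn_elder() -> Sender<Request> {
    let (tx, rx) = channel::<Request>();
    thread::spawn(move || {
        // The loop ends once every sender handle has been dropped.
        for req in rx {
            // Ignore send errors: the client may have gone away.
            let _ = req.reply.send(format!("ok:{}", req.payload));
        }
    });
    tx
}

/// Client side: send, then simply wait on the paired response.
fn send_and_wait(elder: &Sender<Request>, payload: &str) -> Option<String> {
    let (reply, response) = channel();
    elder
        .send(Request { payload: payload.to_string(), reply })
        .ok()?;
    response.recv().ok()
}
```

Calling `send_and_wait(&elder, "get-chunk")` blocks until the elder replies on that request's own channel, which is the "just wait on a response" behaviour described above.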
We’ve also been looking to remove anything that has been covering up these issues (such as our comms abstractions as mentioned above), and also client layer retries. We now send ACK (acknowledgement) messages to the client (they were introduced a few months back). These help tell us when a command has been seen by elders. But we were still just querying until a chunk was returned. @bochaco has been looking to be more strict here… not allowing retries and just saying “we’ve seen the ACK… so why are we not seeing success first time?”. This has exposed some errors in file read/write. (It looks like deserialising the storage commands can take longer than queries… so even if the queries are sent afterwards, we’re processing them first.)
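The gap between "seen" and "stored" can be shown with a deliberately simplified model. This is an assumption-laden toy, not the node's real message flow: `Elder`, `Ack`, `receive_store`, `process_pending` and `query` are all hypothetical names. The ACK is returned as soon as the command arrives, while the write happens later, so a query that slips in between finds nothing.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical ACK: "an elder has seen this command", nothing more.
#[derive(Debug, PartialEq)]
struct Ack;

/// Toy elder with a queue of not-yet-applied store commands.
struct Elder {
    pending: VecDeque<(String, Vec<u8>)>,
    stored: HashMap<String, Vec<u8>>,
}

impl Elder {
    fn new() -> Self {
        Elder { pending: VecDeque::new(), stored: HashMap::new() }
    }

    /// Acknowledge receipt immediately; the write itself is deferred.
    fn receive_store(&mut self, name: &str, data: &[u8]) -> Ack {
        self.pending.push_back((name.to_string(), data.to_vec()));
        Ack
    }

    /// Apply every queued store command.
    fn process_pending(&mut self) {
        while let Some((name, data)) = self.pending.pop_front() {
            self.stored.insert(name, data);
        }
    }

    /// Queries only see data that has actually been stored.
    fn query(&self, name: &str) -> Option<&[u8]> {
        self.stored.get(name).map(|d| d.as_slice())
    }
}
```

In this model a client that receives the `Ack` and queries straight away gets `None`, and only gets the chunk back after `process_pending` has run. Client retries would hide exactly this ordering problem.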
In an effort to more properly regulate node command handling, we’ve previously added code which should have been organising incoming messages, and ordering them for processing. We’ve actually seen that this was not having the impact we were looking for at all (in fact it just triggered all messages to be processed one after the other regardless of priority).
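For reference, this is roughly what priority ordering of incoming messages looks like when it does work. It is a generic sketch and not the code we removed: `Priority` and `MsgQueue` are hypothetical names. A binary heap pops the highest priority first and, within the same priority, the oldest message first.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Hypothetical priority levels; derived ordering gives Low < Normal < High.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Priority {
    Low,
    Normal,
    High,
}

/// Max-heap keyed on (priority, reversed arrival number, message).
struct MsgQueue {
    heap: BinaryHeap<(Priority, Reverse<u64>, String)>,
    next_seq: u64,
}

impl MsgQueue {
    fn new() -> Self {
        MsgQueue { heap: BinaryHeap::new(), next_seq: 0 }
    }

    fn push(&mut self, priority: Priority, msg: &str) {
        // Reverse(seq) makes earlier arrivals compare as larger,
        // so ties on priority are broken first-in, first-out.
        self.heap.push((priority, Reverse(self.next_seq), msg.to_string()));
        self.next_seq += 1;
    }

    fn pop(&mut self) -> Option<String> {
        self.heap.pop().map(|(_, _, msg)| msg)
    }
}
```

If every message ends up with the same effective priority, this degenerates into a plain FIFO queue, which is the kind of one-after-the-other behaviour described above.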
Removing this code has allowed a lot of node simplification. Most significantly, we’ve been able to move node process handling off-thread in a lock-free way (after our single-threaded push removed the vast majority of locking code).
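One common way to get off-thread handling without locks in the handler is to let a single worker thread own the state and talk to it over channels. The sketch below shows that pattern with the standard library only. It is an assumption about the general shape, not the node's actual code, and `Cmd`, `Event` and `spawn_handler` are hypothetical names.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{Receiver, Sender};
use std::thread::{self, JoinHandle};

/// Hypothetical incoming command.
enum Cmd {
    Store(String, Vec<u8>),
}

/// Hypothetical outgoing event, queued for sending.
#[derive(Debug, PartialEq)]
enum Event {
    Stored(String, usize),
}

/// Spawn a handler thread that exclusively owns the chunk map.
/// Because no other thread touches the map, no Mutex or RwLock is needed.
fn spawn_handler(cmds: Receiver<Cmd>, out: Sender<Event>) -> JoinHandle<usize> {
    thread::spawn(move || {
        let mut chunks: HashMap<String, Vec<u8>> = HashMap::new();
        // Runs until every command sender has been dropped.
        for cmd in cmds {
            match cmd {
                Cmd::Store(name, data) => {
                    let len = data.len();
                    chunks.insert(name.clone(), data);
                    // Ignore send errors: the receiver may have gone away.
                    let _ = out.send(Event::Stored(name, len));
                }
            }
        }
        // Report how many chunks were held when the handler shut down.
        chunks.len()
    })
}
```

The caller sends `Cmd` values in, reads `Event` values out, and never shares the map itself, so each message can be handled as soon as it arrives.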
Previously, we could track messages coming in, being queued for processing, being handled and then queued for sending. This process under load could take considerable time. Sometimes it was relatively fast, sub second. Sometimes, with all the command processing and messaging IO… it could take 20 seconds.
Now we’re seeing messages routinely coming in to nodes, being processed immediately and messages going out sub-millisecond.
This is much closer to what we’d have expected from our comms previously, and feels very much like the right direction.
We’re looking to fully incorporate the bi-directional streams into clients, removing a lot of state management there. And we’ll also be aiming to have this same flow in elder->adult comms where responses are required, which should further simplify things there too. It should actually allow our ACK messages to be reflective of adult storage, too, rather than just elders saying they’ve seen a message, which should also help with failed verifications.
Feel free to reply below with links to translations of this dev update and moderators will add them here:
As an open source project, we’re always looking for feedback, comments and community contributions - so don’t be shy, join in and let’s create the Safe Network together!