Here are some of the main things to highlight since the last dev update:
- Work to bring sn_node and sn_client in line with the latest sn_data_types and other changes continues apace, with CI now fully passing in sn_node.
- Routing has now been updated to flag when a new node should be accepted to the network, with work underway in sn_node to take advantage of this.
- The routing work to improve lost peer detection was this week approved and merged.
- The Safe Network lexicon changes discussed 2 weeks ago have begun filtering through to the UX designs. Sneak preview below!
Safe Client, Nodes and qp2p
After adding signing of Sequence CRDT operations last week, this week we’ve been integrating it into the stack, with changes in both sn_node and sn_client to get all operations signed before being passed on to the network. This highlighted some other issues with regard to local caching of Sequences (when to use local vs network, etc.). For simplicity, we’ve opted to remove this cache for now. This has helped shore up the test suite, allowing us to see some CRDT-type flows in action more clearly. For example, pushed changes cannot be pulled from the network immediately, so the tests have to wait for the expected changes to appear there.
With the above changes in, we’ve also been debugging some new section startup issues that were raised off the back of these, with some small client tweaks added to prevent failure if an elder isn’t available (but enough others still are). With this in, we can now see a subset of sn_client tests passing against an eleven-member section on CI.
The rest of the failures that we are seeing are mostly due to a few hidden bugs in choosing destinations, and in chunk retrieval during the blob chunk replication that we enabled last week. Rest assured, we are well underway with fixes for them and are ticking off the last of the failing tests from the client e2e test suite. Searching for these errors also highlighted that some elements of messaging were still missing, and these are now required to bring the control flow / error handling one step closer to what we call “lazy messaging”. Work is underway to address this.
Lazy messaging is where we get a message we cannot handle for whatever reason (out of sync, future sequence number, etc.) and we error that message back to the sender along with our last known history. The sender then knows they need to provide us with the missing link. We can also do the inverse, without an error, and update the sender if it is they who are behind. This saves us from holding messages until they can be ordered, which could be exploited to attack the network and would also mean more complex code. Lazy messaging is much closer to a message-passing actor model, and we have extended that to handle partially ordered events.
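To make the idea concrete, here is a minimal sketch of a lazy-messaging receiver. It is an illustration only, not the sn_node implementation: the `Message`, `Outcome` and `Receiver` types are hypothetical, and it assumes a simple totally ordered history indexed by sequence number rather than the partially ordered events mentioned above.

```rust
// Hypothetical sketch of lazy messaging; none of these names come from sn_node.

/// A message tagged with its position in the shared history (0-based).
#[derive(Debug, Clone, PartialEq)]
pub struct Message {
    pub seq: u64,
    pub payload: String,
}

/// What the receiver does with an incoming message.
#[derive(Debug, PartialEq)]
pub enum Outcome {
    /// The message was the next one expected and has been applied.
    Applied,
    /// We are behind: the message is bounced back together with how much
    /// history we know, so the sender can supply the missing part.
    ErrorBehind { known_len: u64, bounced: Message },
    /// The sender is behind: no error, we just send what they are missing.
    UpdateSender { missing: Vec<Message> },
}

/// Receiver state: only the history applied so far, never a buffer of
/// out-of-order messages waiting to be ordered.
#[derive(Default)]
pub struct Receiver {
    history: Vec<Message>,
}

impl Receiver {
    /// Number of messages applied so far, which is also the next expected seq.
    pub fn known_len(&self) -> u64 {
        self.history.len() as u64
    }

    pub fn handle(&mut self, msg: Message) -> Outcome {
        let next = self.known_len();
        if msg.seq == next {
            self.history.push(msg);
            Outcome::Applied
        } else if msg.seq > next {
            // A message from the "future": do not hold it, error it back.
            Outcome::ErrorBehind { known_len: next, bounced: msg }
        } else {
            // The sender tried to write at a position we already have,
            // so reply with everything from that position onwards.
            let missing = self.history[msg.seq as usize..].to_vec();
            Outcome::UpdateSender { missing }
        }
    }
}
```

The point of the design is that `handle` always returns immediately with one of three answers, so the receiver holds no queue that an attacker could fill with messages that never become orderable.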
Following a change to the new node joining dynamics in Routing (see PR#2234), we’ve also begun updating sn_node so that nodes take responsibility for allowing new nodes to join the network. Effectively, the elders of the network will now keep track of the supply of and demand for resources in their section and, accordingly, ask routing to let queued new nodes join their section.
We’ve also started getting authd and the CLI adapted to the new UI/UX terminology, for example, moving away from “Accounts” to “Safes”, as well as making the necessary changes to have authd store the applications’ keypairs on the network using a Map as the storage data type for the “Safe”. We’ll be continuing with this task to get all CLI commands and auth features aligned with these new terminologies as well as with the new sn_client API.
Once done with the above updates, fixes, and any bugs they throw up, we’ll be all set to fire up our internal testnets once again at full throttle, tidy up the various modules, double-check their stability, tie up loose ends and hopefully deliver an early Christmas present to you folks!
BRB - Byzantine Reliable Broadcast
This week our consultant has advanced the Generation Clock idea mentioned last week and presented a pseudo-code algorithm to the team for comment. This hybrid approach imposes a total order over infrequent join and leave operations, but only a partial order over much more frequent data operations. In plain English, this means that join/leave must be bounded (i.e. we cannot allow non-responsive nodes to exist) and use a form of total order, but we can handle many leaves at once, etc, whereas regular data operations can occur with high levels of concurrency, so long as each is from a different source (Actor in CRDT parlance). So far the proposal seems solid and the next step is to implement it in code and write some tests. More on that next week.
Routing
As discussed in previous weeks, the work to improve lost peer detection was this week approved and merged. This takes advantage of the connection pooling feature in qp2p. This change means that the routing code base has been simplified and now allows more complex integration tests to be added to verify the features of the production code.
Some API feature work was also completed this week: an indication of section start-up, an Age getter, and a notification when the key changes during relocation. With these in place, nodes will be better informed of the routing status during start-up, and of the updated keypair being used.
While testing, we observed an issue during bootstrap: when the NodeApproval message was followed immediately by another message, say Relocate, bootstrap completed as soon as the NodeApproval was processed. This left any following message, such as Relocate, in the intermediary channel buffer, never to be taken out and processed, i.e. we were losing that message. We’ve merged a PR, “Fix losing messages during bootstrap”, to resolve this issue. It removes the intermediary channel, replacing it with a simple wrapper around the ConnectionEvent receiver. Thus the “push/pull” model is changed into a simple “pull” model. This way, a message is never retrieved from the channel unless the node is ready to process it.
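The pull model described above can be sketched roughly as follows. This is not the routing code: `Event` stands in for the real ConnectionEvent, and `EventStream` and `bootstrap` are hypothetical names, with a standard library channel used in place of the qp2p receiver.

```rust
use std::sync::mpsc::Receiver;

/// Hypothetical stand-in for the events delivered during bootstrap.
#[derive(Debug, Clone, PartialEq)]
pub enum Event {
    NodeApproval,
    Relocate,
}

/// A thin wrapper around the receiver. There is no intermediary buffer:
/// an event leaves the channel only when the current owner asks for it.
pub struct EventStream {
    rx: Receiver<Event>,
}

impl EventStream {
    pub fn new(rx: Receiver<Event>) -> Self {
        Self { rx }
    }

    /// Blocks until the next event arrives; returns None once all senders
    /// have been dropped and the channel is empty.
    pub fn pull(&mut self) -> Option<Event> {
        self.rx.recv().ok()
    }
}

/// Bootstrap stage: pull events until approval arrives, then hand the stream
/// back so the next stage continues from exactly where bootstrap stopped.
/// (For brevity this sketch ignores any other event seen before approval.)
pub fn bootstrap(mut stream: EventStream) -> (bool, EventStream) {
    while let Some(event) = stream.pull() {
        if event == Event::NodeApproval {
            return (true, stream);
        }
    }
    (false, stream)
}
```

Because `bootstrap` stops pulling the moment it sees the approval, a Relocate sent right behind it is still sitting in the channel when the next stage takes over the stream.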
The work to allow nodes to tell routing whether or not to accept new nodes, mentioned in last week’s update, was also completed and merged this week. Routing assumes the elder nodes will track all the adult nodes in the section and, when they detect that the average storage capacity (or some other resource) has become too low, they will flip the flag so the section starts accepting new nodes. All the elder nodes should detect this at more or less the same time, so that consensus can be reached. In addition to flipping the flag, if the section already has infants, one of them will be promoted with its age increased by one, effectively making it relocate and immediately join back as an adult.
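As a rough illustration of the elder-side check, the sketch below decides when to flip the flag and bumps one infant’s age. All names and the threshold are hypothetical; the real resource metric and the rule for choosing which infant to promote are not specified in this update, so picking the oldest infant is purely an assumption.

```rust
/// Hypothetical view an elder keeps of one adult's storage, in bytes.
pub struct AdultStorage {
    pub used: u64,
    pub capacity: u64,
}

/// Returns true when the average free space per adult drops below
/// `min_avg_free`. A section with no adults is assumed to need nodes.
pub fn should_accept_new_nodes(adults: &[AdultStorage], min_avg_free: u64) -> bool {
    if adults.is_empty() {
        return true;
    }
    let total_free: u64 = adults
        .iter()
        .map(|a| a.capacity.saturating_sub(a.used))
        .sum();
    let avg_free = total_free / adults.len() as u64;
    avg_free < min_avg_free
}

/// Hypothetical record of an infant node in the section.
pub struct Infant {
    pub name: String,
    pub age: u8,
}

/// Promotes one infant by increasing its age by one and returns the new age.
/// Choosing the oldest infant is an assumption made for this sketch.
pub fn promote_oldest_infant(infants: &mut [Infant]) -> Option<u8> {
    let infant = infants.iter_mut().max_by_key(|i| i.age)?;
    infant.age = infant.age.saturating_add(1);
    Some(infant.age)
}
```

Since every elder runs the same deterministic check over the same view of the adults, they should reach the same decision at roughly the same time, which is what lets consensus be reached on flipping the flag.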
Safe Network App & UX
Thanks for all your feedback on the proposed changes to the Safe lexicon. We’ve begun to filter these changes through the UX flows and Safe Network App prototypes, and you should see them popping up in the various Figma files as we work through it all.
While not directly related to the language changes, one interesting little side project that popped out of the work was a revisit of some of the onboarding flows, for example, when a user is ready to create their own Safe.
If you recall, the existing version set out all the options a user had for creating a Safe (or account, as it was at the time), and let them select the appropriate route, with step-by-step instructions.
It looked like this:
But could we make it smoother? Could we perhaps make it less daunting, help a user quickly get their Safe up and running without any outside assistance, and then get them moving on to earning Safe Network Tokens to boot?
Here’s a small clickable prototype of the new approach—happy path only—just to give you a flavour.
This won’t be the only route through to getting a Safe; there will be alternative flows with a little less hand-holding. But for the first-time user, it will be interesting to see how this compares to the existing approach.
It’s also a pattern that could be applied to other areas of the app—such as earning tokens the first time, creating a SafeID, or choosing strong credentials.
It’s a bit more work for us from a design and flow-logic point of view, but if it is smoother and less intimidating for the user, and gets more Safes and nodes up on the Network, it’ll pay off.
Feel free to reply below with links to translations of this dev update and moderators will add them here:
As an open source project, we’re always looking for feedback, comments and community contributions - so don’t be shy, join in and let’s create the Safe Network together!