Here are some of the main things to highlight since the last dev update:
- There will be a Community Safe Chat virtual meeting on Friday 26th February (tomorrow) at 9PM GMT. Full details here.
- Bug identifying and squashing continues, along with several efficiency improvements as we make significant progress this week on the path to a public testnet.
- There have been some significant sn_routing PRs merged this week, specifically PR #2323, PR #2328, and PR #2336. Full details below.
- Two further significant sn_routing PRs, #2335 and #2336, both critical for a stable testnet, have been raised and should be reviewed and merged in the coming days.
- @scorch has submitted a PR to (finally!) resolve “that issue” where the versions of the CLI itself and of external binaries such as sn_authd were being confused.
- Check out @bzee’s excellent digging and analysis here as he looks to improve and update the bindings to Node.js.
Community Safe Chat: Friday 26th February, 9PM GMT
There have been discussions and strategising on the forum, and in small Zoom meetings over the past few weeks, including this one where we discussed some of @sotros25’s market research on adjacent projects.
We are delighted to have another virtual hang-out this Friday, with an open invitation for all community members to participate.
The first 45 minutes of the conversation will be about the community marketing strategy. After that, we’ll open up the floor for broader discussion on the Safe Network. The aim is to help define the marketing strategy, and also to use content from these discussions as video to build awareness and engagement.
Please be aware that this call will be recorded, streamed, and rebroadcast, so those whose schedules or time zones don’t quite work won’t miss out.
Safe Client, Nodes and qp2p
Work here this week has all been about stabilising networks. We started the week investigating some odd behaviour, only seen with certain logs enabled, which eventually led us to a wee snippet in routing where we were holding a lock for the duration of a long async call (sending a client message), causing a deadlock in responses. It was a tricky one to figure out; but once we realised that the lock was acquired as part of the if statement, we just had to move it out and ensure the lock was dropped once we had what we needed. This got client responses coming through again.
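The shape of that bug, and of the fix, can be sketched in plain Rust (all names here are illustrative, not the actual sn_routing code): a guard created inside an `if` condition is a temporary that lives until the end of the whole statement, so the fix is to scope the lock and take out only the data you need before the long call.

```rust
use std::sync::Mutex;

// Hypothetical state guarded by a lock (names are invented for the example).
struct Node {
    state: Mutex<Vec<String>>,
}

impl Node {
    // The problematic shape was roughly:
    //   if self.state.lock().unwrap().contains(...) { send_long_message(..).await }
    // The temporary guard created in the `if` condition lives until the end of
    // the statement, so it was still held during the long async send, which
    // deadlocked responses that also needed the lock.
    //
    // Fix: take what we need under the lock, drop the guard, then send.
    fn respond(&self, key: &str) -> Option<String> {
        let payload = {
            let guard = self.state.lock().unwrap(); // lock scoped to this block
            guard.iter().find(|s| s.starts_with(key)).cloned()
        }; // guard dropped here, before the long-running send
        payload.map(|p| self.send(p))
    }

    fn send(&self, p: String) -> String {
        // stand-in for the long async client-message send
        format!("sent:{}", p)
    }
}
```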
With the changes to messaging infrastructure completed last week, nodes no longer include any logic for aggregation of signatures, relying fully on routing’s ability to do so. Previously we only had aggregation at source, where routing aggregated the signatures at the Elders before sending the message to the target. This resulted in the message with the aggregated signature being sent to the destination multiple times, with duplicates being filtered out at the target node.

By moving the signature aggregation to the destination, we take some load off the Elders and significantly reduce the number of messages being exchanged. We added support for destination accumulation in routing and used it in sn_node for the communication between a section and its chunk-holding Adults.

With the above two fixes, we now have all client tests passing against a single section, with massively simplified node code. However, a follow-up PR is needed to cover an additional use case: comms between Elders in one section and Elders in another section, which is part of the rewards flow (as section funds are managed by one section, but held/verified in another). This is being covered as we speak, and should be merged before the end of the week.
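As a rough illustration of accumulation at destination (with made-up types — the real code aggregates BLS signature shares inside sn_routing), the destination can simply collect distinct Elder shares per message until a threshold is met, instead of each Elder sending a fully aggregated copy:

```rust
use std::collections::{HashMap, HashSet};

// Minimal sketch of accumulation-at-destination. Types are invented for the
// example: we count distinct Elder shares per message hash until a threshold
// is reached, rather than aggregating real BLS signature shares.
struct Accumulator {
    threshold: usize,
    shares: HashMap<u64, HashSet<u8>>, // msg hash -> elder ids seen so far
}

impl Accumulator {
    fn new(threshold: usize) -> Self {
        Self { threshold, shares: HashMap::new() }
    }

    // Returns true exactly once: the first time `msg` reaches the threshold.
    fn add(&mut self, msg: u64, elder: u8) -> bool {
        let set = self.shares.entry(msg).or_default();
        let before = set.len();
        set.insert(elder); // duplicate shares are ignored by the set
        before < self.threshold && set.len() >= self.threshold
    }
}
```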
On that front, with a little update to some routing events, we now get our sibling section’s PublicKey on split, so we know where to send tokens to populate the resulting child section wallets. After some further flow debugging, that appears to be going through, and we’re now debugging a wee routing loop on split, where section info is not being detected properly and we repeatedly pass the same (wrong) info to a new node. We are also refactoring the communication pattern among clients, nodes and their sections, where previously outdated section keys were causing bugs in the rewards and transfer flows. We’ll therefore now enforce PublicKey knowledge checking and updating with every message sent across the network, keeping all peers up to date with the latest knowledge of the network.
To start with, the split of the section funds will be one transfer chained after another, as we still don’t have one-to-many transfers, and chaining was a trivial task, working well enough for a testnet. However, with the refactor of TransferAgreementProof a couple of months ago into a signed debit and a signed credit, we can now relatively easily implement one-to-many transfers by including a set of credits. A goodie for later :).
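A hedged sketch of what one-to-many could look like under the debit/credit split (all names and types are invented for the example, and signing is omitted): a single debit paired with a set of credits whose amounts must sum to the debit.

```rust
// Illustrative only: a one-to-many transfer modelled as one debit plus a set
// of credits, following the split of TransferAgreementProof into a debit side
// and a credit side. Field names are invented; real transfers are signed.
#[derive(Debug, PartialEq)]
struct Debit { from: &'static str, amount: u64 }

#[derive(Debug, PartialEq)]
struct Credit { to: &'static str, amount: u64 }

// Build credits for many recipients from a single debit; the debit amount is
// exactly the sum of the credits, and an empty/zero payout is rejected.
fn one_to_many(
    from: &'static str,
    payouts: &[(&'static str, u64)],
) -> Option<(Debit, Vec<Credit>)> {
    let total: u64 = payouts.iter().map(|p| p.1).sum();
    if total == 0 {
        return None;
    }
    let credits = payouts
        .iter()
        .map(|&(to, amount)| Credit { to, amount })
        .collect();
    Some((Debit { from, amount: total }, credits))
}
```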
As a lower-priority task, and in parallel to the above, we started preparing for the upgrade to a new Quinn version that allows us to finally upgrade to the stable Tokio v1. We’ve given it a try and are preparing the PRs so this migration can happen as soon as Quinn v0.7.0 is published.
Another improvement which made it in this week concerned the deletion of private data. Before the newly merged changes in sn_client, deleting a private blob meant deleting only the root blob, which was the data map of the actual data that is self-encrypted and stored on the network. Our latest addition to the team, @kanav, has implemented a recursive deletion approach that deletes the individual chunks, along with the chunks that store the data map(s), achieving deletion in a true sense.
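The recursive idea can be sketched with a toy store (types invented for illustration, not the sn_client API): a stored item is either a raw chunk or a data map listing child addresses, and deletion walks the whole tree instead of removing only the root.

```rust
use std::collections::HashMap;

// Toy model of recursive private-data deletion. A stored blob is either a
// raw chunk or a data map listing the addresses of further chunks (which may
// themselves be data maps for large files). Deleting only the root previously
// left all the underlying chunks behind on the network.
enum Stored {
    Chunk,
    DataMap(Vec<u64>), // addresses of child chunks / nested data maps
}

// Remove the item at `addr` and, if it is a data map, everything it points to.
fn delete_recursive(store: &mut HashMap<u64, Stored>, addr: u64) {
    if let Some(item) = store.remove(&addr) {
        if let Stored::DataMap(children) = item {
            for child in children {
                delete_recursive(store, child);
            }
        }
    }
}
```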
API and CLI
@scorch has submitted a PR to remove the -V option from CLI subcommands, to avoid confusion between the version of the CLI itself and the version of external binaries such as sn_authd. This change also adds a bin-version subcommand to the $ safe node and $ safe auth subcommands to fetch the version of the external binaries, so that the semantics, and the distinction between the CLI version and the versions of the external binaries, are clear.
Currently, the qjsonrpc lib implements the JSON-RPC 2.0 standard. That said, certain error codes defined in the spec were not exposed by the crate, meaning consumers needed to redefine the same constants themselves, which shouldn’t be necessary since the codes are in some sense part of the implementation. For this reason @scorch also submitted a PR to expose these error codes as constants from qjsonrpc.
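For reference, these are the error codes the JSON-RPC 2.0 spec reserves; exposing them as public constants (as the qjsonrpc PR does) saves each consumer redefining them. The constant names below are illustrative, but the values come straight from the spec:

```rust
// Error codes reserved by the JSON-RPC 2.0 specification. Names here are
// illustrative; the numeric values are fixed by the spec itself.
pub const PARSE_ERROR: i64 = -32700; // invalid JSON received
pub const INVALID_REQUEST: i64 = -32600; // JSON is not a valid Request object
pub const METHOD_NOT_FOUND: i64 = -32601; // method does not exist
pub const INVALID_PARAMS: i64 = -32602; // invalid method parameters
pub const INTERNAL_ERROR: i64 = -32603; // internal JSON-RPC error
```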
As mentioned in the previous section, we’ve also been trying to get ready for upgrading to Tokio v1, thus we have been preparing the CLI and authd crates for such an upgrade by doing some preliminary tests.
We’ve been iterating on the CRDT underlying the Sequence type in sn_data_types. Previously, Sequence was implemented with LSeq. We tried out a simpler List to resolve some panics with deep inserts, and then moved to GList to support the grow-only use case. On analysis, none of these CRDTs do model versioning as we’d like: they try to linearise the order of documents, when in reality a document history forms a DAG. We have a design for a Merkle-DAG Register CRDT which would allow us to model document history faithfully and to read the most up-to-date versions.
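To see why a DAG-shaped register reads differently from a linearised list, here is a minimal sketch (hashing, signing and networking omitted; entry ids are plain integers, and all names are invented): each write points at the heads it supersedes, so concurrent writes simply leave multiple heads, and a read returns every current tip rather than forcing a false linear order.

```rust
use std::collections::{HashMap, HashSet};

// Hedged sketch of a "Register over a Merkle-DAG" idea. In a real design the
// entry ids would be content hashes and writes would be signed; here ids are
// sequential integers so the structure is the only thing on show.
struct Register {
    entries: HashMap<u64, (String, Vec<u64>)>, // id -> (value, parent ids)
    heads: HashSet<u64>,                       // current branch tips
    next_id: u64,
}

impl Register {
    fn new() -> Self {
        Self { entries: HashMap::new(), heads: HashSet::new(), next_id: 0 }
    }

    // Write a value on top of a set of current heads; those heads are
    // superseded and the new entry becomes a head itself.
    fn write(&mut self, value: &str, parents: &[u64]) -> u64 {
        let id = self.next_id;
        self.next_id += 1;
        for p in parents {
            self.heads.remove(p);
        }
        self.entries.insert(id, (value.to_string(), parents.to_vec()));
        self.heads.insert(id);
        id
    }

    // Read returns all current tips: the "most up-to-date versions".
    fn read(&self) -> Vec<&str> {
        let mut tips: Vec<&str> =
            self.heads.iter().map(|id| self.entries[id].0.as_str()).collect();
        tips.sort(); // deterministic order for display/tests only
        tips
    }
}
```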
We have also started removing the mutability of Policies from our mutable data types, i.e. from our CRDT-based data types like Sequence. In our current implementation we’ve been trying to resolve every type of conflict that concurrent Policy mutations can create on CRDT data. This has proven to make things quite complicated while still not covering all the possible scenarios for conflict resolution. We have therefore decided to move to a different approach, where Policies become immutable once they have been defined at the creation of a piece of content. Changing a Policy will then mean cloning the content onto a new address with the new Policy; some mechanism for linking these different instances can eventually be created and used by applications on a case-by-case basis.
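A small sketch of the immutable-Policy semantics (all types invented for the example): “changing” a policy clones the content to a new address under the new policy, leaving the original, and its policy, untouched.

```rust
use std::collections::HashMap;

// Toy types for illustration only. A Policy is fixed at creation; there is
// deliberately no API to mutate it in place.
#[derive(Clone, PartialEq, Debug)]
struct Policy { owner: String }

#[derive(Clone)]
struct Content { data: Vec<String>, policy: Policy }

// "Change" a policy by republishing the content at a new address under the
// new policy. Returns false if the source address does not exist.
fn republish(
    store: &mut HashMap<u64, Content>,
    from: u64,
    to: u64,
    new_policy: Policy,
) -> bool {
    let data = match store.get(&from) {
        Some(c) => c.data.clone(),
        None => return false,
    };
    store.insert(to, Content { data, policy: new_policy });
    true
}
```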
BRB - Byzantine Reliable Broadcast
Our attempt to integrate the sn_fs filesystem prototype with BRB has exposed a couple of rough edges. The reason is that sn_fs receives operations from the operating system kernel faster than BRB can apply them. To this end, we’ve come up with a couple of related solutions: 1) bypass the network layer when sending an operation to self, and 2) keep track of when peers have received ProofOfAgreement, so we can avoid sending the next operation until 2/3 of peers have applied the current op. This is necessary to meet the source-ordering requirement of BRB: operations coming from the same source (actor) must be sequentially ordered, while operations from many different actors may be processed concurrently.
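The release gate described in (2) can be sketched like this (names invented; the real BRB machinery tracks ProofOfAgreement per op and per peer): a source keeps at most one op in flight and releases the next queued op only once 2/3 of peers have acknowledged applying the current one, preserving source ordering.

```rust
use std::collections::VecDeque;

// Illustrative source-ordering gate, not the actual brb crate API. Ops from
// this source are queued, and at most one is "in flight" at a time.
struct SourceQueue {
    peers: usize,
    acks: usize,
    in_flight: Option<String>,
    pending: VecDeque<String>,
}

impl SourceQueue {
    fn new(peers: usize) -> Self {
        Self { peers, acks: 0, in_flight: None, pending: VecDeque::new() }
    }

    // Submit an op; it is sent immediately only if nothing is in flight,
    // otherwise it waits its turn in the queue.
    fn submit(&mut self, op: &str) -> Option<String> {
        if self.in_flight.is_none() {
            self.in_flight = Some(op.to_string());
            self.acks = 0;
            self.in_flight.clone()
        } else {
            self.pending.push_back(op.to_string());
            None
        }
    }

    // Record one peer's acknowledgement (ProofOfAgreement applied). Once at
    // least 2/3 of peers have acked, release the next queued op, if any.
    fn ack(&mut self) -> Option<String> {
        self.acks += 1;
        if self.acks * 3 >= self.peers * 2 {
            self.in_flight = self.pending.pop_front();
            self.acks = 0;
            self.in_flight.clone()
        } else {
            None
        }
    }
}
```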
Also as part of the sn_fs integration, we modified the brb_dt_tree crate to support sending multiple tree operations within a single BRB op. This effectively gives us an atomic transaction property for applying logically related CRDT ops in all-or-nothing fashion. We intend to use this same pattern in other BRB data types.
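The all-or-nothing property can be illustrated with toy types (not the actual brb_dt_tree API): validate the whole batch against a scratch copy of the state first, and commit only if every op in the batch applies cleanly.

```rust
// Toy CRDT-ish ops for illustration; the real crate batches tree operations.
#[derive(Clone)]
enum TreeOp {
    Add(u32),
    Remove(u32),
}

// Apply a batch atomically: either every op succeeds, or the state is left
// exactly as it was. Validation runs against a scratch copy, so a failing
// op midway through the batch cannot leave partial effects behind.
fn apply_batch(state: &mut Vec<u32>, batch: &[TreeOp]) -> bool {
    let mut scratch = state.clone();
    for op in batch {
        match op {
            TreeOp::Add(n) => scratch.push(*n),
            TreeOp::Remove(n) => match scratch.iter().position(|x| x == n) {
                Some(i) => {
                    scratch.remove(i);
                }
                None => return false, // reject the whole batch; nothing applied
            },
        }
    }
    *state = scratch; // commit atomically
    true
}
```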
This week we merged a PR changing the way client messages are handled, so they can now be routed through the network the same way as node messages. This means a client can send a request outside of its section, and receive a response back, even when the recipient(s) of the request cannot connect directly to the client due to restrictive NATs or similar issues.
As detailed in the Nodes section above, we also implemented message signature accumulation at destination, which means the users of routing no longer have to implement this flow manually, resulting in simpler code.
Finally, the fork resolution PR is now up and undergoing review. During the work on this PR we discovered a few additional bugs that were not related to fork handling. Throughout the week we have been busy debugging them, and as of today it looks like they are mostly fixed. The internal stress-testing results look very promising, and we managed to run a localhost network with 111(!) nodes on a single machine, with everything going smoothly. A PR with those fixes is currently up in draft status, and should be ready for review soon.
Feel free to reply below with links to translations of this dev update and moderators will add them here:
As an open source project, we’re always looking for feedback, comments and community contributions - so don’t be shy, join in and let’s create the Safe Network together!