Right now, joining a test net as a node is like starting the worst job you ever had. That one where you were dropped into the thick of it by a sadistic boss, with no proper training or instructions; where you had to deal with constant demands to deliver this message or do that task; and where at some stage you ran out of time or neglected to smile - at which point you were promptly fired.
While we don’t want malingering nodes screwing up the network, the rules as they are currently implemented amount to ‘one strike and you’re out’, which is why recent playgrounds have been short-lived: every time we hoof out a node, the data it holds needs to be relocated, causing floods of chunks and messages, which lead ultimately to network seizure.
This is not entirely unexpected, as many optimisations are yet to be put in place, as we explain below, but there’s no substitute for real-world testing when it comes to showing where we need to focus - so heartfelt thanks to everyone who joined in the playgrounds and comnets. It may sometimes feel like we’re going backwards, but fear not! It’s all part of the plan.
@Jimcollinson, @heather_burns, and @andrew.james have been working through the documentation required by the Swiss authorities in setting up the new foundation there. The good news is, it’s all eminently doable and there are no obvious hurdles, which validates our choice of that country. Our documentation has been submitted for the foundation’s incorporation, and we’ll shortly start work on our registration with the Swiss financial authority.
@Anselme has been working on section handovers and consensus – how we select which adults will be promoted to elders on a split, and how we resolve the situation when an adult joining a section is older than the elders. We need to ensure the elders agree on the same set of candidates when handover is performed, so consensus is needed. @Davidrusu has now pretty much completed the consensus algorithm, so we’re ready to integrate it.
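To give a flavour of the handover selection problem, here’s a minimal Rust sketch - all the names, the tie-breaking rule and `ELDER_COUNT` are our illustrative assumptions, not the actual safe_network code. The point is determinism: every elder that runs this over the same membership set arrives at the same candidate list, which is what the consensus step then confirms.

```rust
// Illustrative sketch only - types and constants are assumptions,
// not the real safe_network implementation.

const ELDER_COUNT: usize = 7;

#[derive(Clone, Debug, PartialEq)]
struct Adult {
    name: u64, // stand-in for a 256-bit XorName
    age: u8,
}

/// Pick elder candidates for a section: the oldest adults, with the
/// name as a deterministic tie-breaker, so every node running this
/// over the same membership set agrees on the same result.
fn elder_candidates(mut adults: Vec<Adult>) -> Vec<Adult> {
    adults.sort_by(|a, b| b.age.cmp(&a.age).then(a.name.cmp(&b.name)));
    adults.truncate(ELDER_COUNT);
    adults
}

fn main() {
    let adults: Vec<Adult> = (0..10u64)
        .map(|n| Adult { name: n, age: (n % 5) as u8 + 1 })
        .collect();
    let candidates = elder_candidates(adults);
    assert_eq!(candidates.len(), ELDER_COUNT);
    assert_eq!(candidates[0].age, 5); // oldest adult comes first
    println!("{:?}", candidates);
}
```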
As well as finalising the pull model and liveness tests (see below), @Yogesh has set up a local dashboard using ELK and Filebeat so we can analyse the logs more easily. Results so far are good, and he’s now working to make it more robust, and to ensure it captures all the metrics it needs to.
So why is the network less stable at the moment? The answer is that measures introduced to test for dysfunctional behaviour are currently all or nothing: previously they were switched off, so ‘nothing’, now they are ‘all’. We are basically killing off nodes for minor misdemeanours which means excessive churn and data relocation. So we need to dial back the punishments and introduce other checks.
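What “dialling back the punishments” could look like is sketched below in Rust - the issue types, weights and threshold are all hypothetical, not the actual safe_network API. Instead of one strike and you’re out, each issue adds a weighted amount to a node’s dysfunction score, and punishment only kicks in once the score crosses a threshold.

```rust
use std::collections::HashMap;

// Illustrative sketch of graduated punishment - names, weights and
// the threshold are assumptions, not the real safe_network code.

enum Issue {
    MissedDataRequest, // minor
    DroppedConnection, // moderate
    BadMessage,        // serious
}

fn weight(issue: Issue) -> u32 {
    match issue {
        Issue::MissedDataRequest => 1,
        Issue::DroppedConnection => 3,
        Issue::BadMessage => 10,
    }
}

const PUNISH_THRESHOLD: u32 = 20;

#[derive(Default)]
struct DysfunctionTracker {
    scores: HashMap<u64, u32>, // node name -> accumulated score
}

impl DysfunctionTracker {
    /// Record an issue against a node, adding its weight to the score.
    fn record(&mut self, node: u64, issue: Issue) {
        *self.scores.entry(node).or_insert(0) += weight(issue);
    }

    /// Only punish once the accumulated score crosses the threshold.
    fn should_punish(&self, node: u64) -> bool {
        self.scores.get(&node).copied().unwrap_or(0) >= PUNISH_THRESHOLD
    }
}

fn main() {
    let mut tracker = DysfunctionTracker::default();
    tracker.record(42, Issue::MissedDataRequest);
    assert!(!tracker.should_punish(42)); // one minor slip is no longer fatal
    for _ in 0..2 {
        tracker.record(42, Issue::BadMessage);
    }
    assert!(tracker.should_punish(42)); // repeated serious issues are
}
```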
Right now elders only track a node’s ‘liveness’ around its handling of data. However, this is only one metric for determining whether nodes are behaving in a dysfunctional manner. We also need to manage (and compare) things like connectivity, message quality and the number of messages sent.
We need to watch connectivity in case nodes are rebooting or upgrading. If they can reboot and still be responsive we should not demote them, but we first need to check that this is the case. Because the network has no global clock, their responsiveness needs to be judged relative to their neighbours’ activity. Messaging can be monitored similarly.
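Since there is no global clock to measure against, one way to make responsiveness relative is to compare a node’s fulfilled-request count with the median of its neighbours. This Rust sketch is our illustration of the idea - the function names and the half-the-median threshold are assumptions, not the real implementation.

```rust
// Illustrative sketch: responsiveness judged relative to neighbours
// rather than against wall-clock timeouts. Names and the threshold
// are assumptions, not the real safe_network code.

fn median(mut counts: Vec<u64>) -> u64 {
    counts.sort_unstable();
    counts[counts.len() / 2]
}

/// Flag a node as unresponsive only if it has handled fewer than half
/// as many requests as the median neighbour.
fn is_unresponsive(node_count: u64, neighbour_counts: &[u64]) -> bool {
    if neighbour_counts.is_empty() {
        return false;
    }
    let med = median(neighbour_counts.to_vec());
    node_count * 2 < med
}

fn main() {
    let neighbours = vec![90, 100, 110, 95, 105];
    assert!(!is_unresponsive(80, &neighbours)); // a bit behind: fine
    assert!(is_unresponsive(10, &neighbours));  // far behind: flagged
    // A node that rebooted but is catching up is not demoted.
    assert!(!is_unresponsive(60, &neighbours));
}
```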
Because it’s handy to have all this functionality for checking for malicious or malfunctioning nodes in one place, we are considering a new top-level crate in the safe_network repo, replacing the liveness tracking we have now. This new crate will have expanded functionality to allow us to track and manage all kinds of node dysfunction.
Data replication is another major factor for a smooth running, stable network.
When a node goes offline, we need to transfer its data to other nodes. In the past, data was pushed from node to node under the control of the elders.
We now have adult-to-adult messaging for data replication, whereby if an adult goes offline or a new adult joins, all the other adults know about it. Knowing the current membership, every adult can calculate what data they should be holding and what data needs to be redistributed/replicated at the other adults to ensure the network maintains a minimum number of copies of a chunk.
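The calculation each adult performs can be sketched as follows - a minimal Rust illustration assuming a chunk belongs on the `REPLICA_COUNT` adults whose names are XOR-closest to its name. The u64 names stand in for 256-bit XorNames, and the function names and constants are our assumptions, not the actual safe_network code.

```rust
// Illustrative sketch - names are u64 stand-ins for XorNames, and
// REPLICA_COUNT and all function names are assumptions.

const REPLICA_COUNT: usize = 4;

/// The adults responsible for a chunk: the REPLICA_COUNT members
/// closest to the chunk's name in XOR distance.
fn holders(chunk_name: u64, adults: &[u64]) -> Vec<u64> {
    let mut sorted: Vec<u64> = adults.to_vec();
    sorted.sort_by_key(|&a| a ^ chunk_name);
    sorted.truncate(REPLICA_COUNT);
    sorted
}

/// What this adult should hold, given the current membership and the
/// chunk names it knows about. Every adult can compute this locally
/// whenever membership changes - no elder has to orchestrate it.
fn my_chunks(me: u64, adults: &[u64], chunk_names: &[u64]) -> Vec<u64> {
    chunk_names
        .iter()
        .copied()
        .filter(|c| holders(*c, adults).contains(&me))
        .collect()
}

fn main() {
    let adults = vec![0b0001, 0b0010, 0b0100, 0b1000, 0b1111, 0b1010];
    let chunks = vec![0b0011, 0b1100];
    println!("{:?}", my_chunks(0b0001, &adults, &chunks)); // prints [3]
}
```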
This works in theory, but the playgrounds and comnets have demonstrated a few practical shortcomings, including replication messages failing to reach their targets and being lost, malicious nodes deliberately dropping messages, and bursts of data and messages when many adults notice a node going offline at the same time – which again leads to messages getting dropped.
The new approach, which @yogesh has been working on, aims to solve these limitations by implementing a pull model. Whenever there is a change in the set of adults, nodes notify each other of what data they should be holding. The receiving nodes are then responsible for pulling this data from any one of the existing nodes that hold it.
This makes sure that adults only pull data they are meant to hold, and that they are responsible for working this out themselves. If they already have the data, the flow stops at the notification round. Data is sent only when replication is required - instead of the current fire-and-forget messaging, which takes the network’s capacity for granted.
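From a single adult’s point of view, the pull step boils down to a set difference - a minimal Rust sketch, with hypothetical names rather than the real safe_network messages: after a notification we compare what we should hold against what we already have, and only request the missing chunks from current holders.

```rust
use std::collections::HashSet;

// Illustrative sketch of the pull step - names are assumptions,
// not the real safe_network messages.

/// Given the chunk names we should hold and the names we already
/// hold, return the names to pull from the current holders. If we
/// already have everything, no data flows at all.
fn chunks_to_pull(should_hold: &HashSet<u64>, have: &HashSet<u64>) -> Vec<u64> {
    let mut missing: Vec<u64> = should_hold.difference(have).copied().collect();
    missing.sort_unstable(); // deterministic order, handy for batching
    missing
}

fn main() {
    let should_hold: HashSet<u64> = vec![1, 2, 3, 4].into_iter().collect();
    let have: HashSet<u64> = vec![2, 4].into_iter().collect();
    assert_eq!(chunks_to_pull(&should_hold, &have), vec![1, 3]);
    // Already up to date: the flow stops at the notification round.
    assert!(chunks_to_pull(&have, &have).is_empty());
}
```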
We will also batch the data to reduce the number of messages required. Our playground testnet proved that one message per chunk was inefficient: nodes were going down when there was a lot of data to be replicated.
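Batching itself is simple - this Rust sketch packs chunk names into fixed-size batches so 120 chunks cost three messages instead of 120. `BATCH_SIZE` and the function name are assumptions for illustration, not the real constants.

```rust
// Illustrative sketch - BATCH_SIZE is an assumption, not the
// real safe_network constant.

const BATCH_SIZE: usize = 50;

/// Pack chunk names into fixed-size batches; the last batch may be
/// shorter. One message per batch instead of one per chunk.
fn into_batches(chunk_names: Vec<u64>) -> Vec<Vec<u64>> {
    chunk_names
        .chunks(BATCH_SIZE)
        .map(|batch| batch.to_vec())
        .collect()
}

fn main() {
    let names: Vec<u64> = (0..120).collect();
    let batches = into_batches(names);
    // 120 chunks -> 3 messages instead of 120.
    assert_eq!(batches.len(), 3);
    assert_eq!(batches[2].len(), 20);
}
```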
Once this is fully in place, we’ll fire up a new playground to test it out.
Feel free to reply below with links to translations of this dev update and moderators will add them here:
As an open source project, we’re always looking for feedback, comments and community contributions - so don’t be shy, join in and let’s create the Safe Network together!