A healthy network is made up of healthy nodes, devices that do what is required of them as expected, within an accepted range of performance. It is the job of the elders to make sure the adults in their section are up to scratch and to take action by voting if they spot anything amiss. Dysfunction tracking is a big part of what the team is working on now.
This week we realised we were potentially losing nodes during high churn and splits: a node would think it had joined, but with churn the valid section key had already moved on. Thinking everything was fine, the node would not try to reconnect. So @anselme has been working on verifying our network knowledge and, after updating it, retrying the join process for these lost nodes.
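To make the idea concrete, here is a minimal sketch of "verify, then retry the join". It is not the actual implementation: `SectionKey`, `JoinOutcome`, `verify_and_rejoin` and the `try_join` callback are all hypothetical names, and the key is a plain integer standing in for a real section key.

```rust
/// Hypothetical stand-in for a section key (assumption: a plain integer
/// is enough to illustrate the comparison).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SectionKey(pub u64);

/// What happened when the node checked its membership.
#[derive(Debug, PartialEq, Eq)]
pub enum JoinOutcome {
    /// The key we joined under is still the valid one; nothing to do.
    StillMember,
    /// The key had moved on and a retry succeeded after `attempts` tries.
    Rejoined { attempts: u32 },
    /// The key had moved on and every retry failed.
    GaveUp,
}

/// Compare the key the node joined under with the key from freshly
/// updated network knowledge, and retry the join if they differ.
/// `try_join` is a hypothetical callback that attempts one join.
pub fn verify_and_rejoin<F>(
    joined_under: SectionKey,
    current: SectionKey,
    max_attempts: u32,
    mut try_join: F,
) -> JoinOutcome
where
    F: FnMut(SectionKey) -> bool,
{
    if joined_under == current {
        return JoinOutcome::StillMember;
    }
    for attempt in 1..=max_attempts {
        if try_join(current) {
            return JoinOutcome::Rejoined { attempts: attempt };
        }
    }
    JoinOutcome::GaveUp
}
```

The point of the sketch is simply that a node should not assume it is a member: it compares what it joined under against what the network now says, and acts on any mismatch.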
DBC work continues, with the initial steps for storing the SpentBook in place.
In running testnets to debug data instability we’ve uncovered a pair of potential deadlocks. In one instance we saw DataStorage reads hanging during replication. So we set about creating some benchmarks to test ChunkStorage, and with this we uncovered a deadlock in the underlying storage code.
Another (unconfirmed) lock was during the PeerLink cleanup process. It’s difficult to say if this was definitely happening, but we had seen stalled nodes and our cleanup code had been called frequently around the lockups. Digging in, there was a lot of potential for simplification, so we did just that.
Yet another potential lock was occurring in dysfunction, with the nested data structure incorrectly being read without a
It’s difficult to say if all these were happening for certain (requiring specific conditions to arise), but certainly these changes should be steps in the right direction!
With all those in, we’ve also got the DKG-Handover with Generation work integrated at last (which was much more stable atop other recent changes). We’re chipping away at the stability, with more tests/benchmarks to come for DataStorage, and some other tweaks to data replication flows being tested (we need to expand the pool of nodes we ask for data, as with heavy churn, the odds of hitting another freshly joined node increase and it looks like we start to lose data retention!).
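A rough bit of arithmetic shows why a bigger pool helps. This is a simplified model, not a measurement: it assumes a fraction of the nodes we might ask have freshly joined (and so don't hold the data yet) and that we pick nodes independently. The function name is hypothetical.

```rust
/// Probability that every one of `asked` independently chosen nodes is
/// freshly joined, given that `fresh_fraction` of candidates are fresh.
/// Simplified model: independent picks, fraction clamped to [0, 1].
pub fn all_fresh_probability(fresh_fraction: f64, asked: u32) -> f64 {
    fresh_fraction.clamp(0.0, 1.0).powi(asked as i32)
}
```

Under these assumptions, if half the candidates are fresh, asking one node misses the data half the time, while asking four misses it only when all four are fresh, which is 0.5 to the power of 4, or 6.25% of the time.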
Dysfunction tracking is an ongoing process. There are always new behaviours to test and model, and so it’s an iterative process, allowing improvements in performance and stability as we move forward.
Dysfunction tracking is different from handling malice. Malice is an objectively provable bad action, such as signing an invalid network message, and it’s the job of all nodes to identify such nodes and punish them. Dysfunction, on the other hand, means ‘substandard behaviour’ and it’s the duty of the elders to work out what that means.
A node may be underperforming because of environmental factors such as temporary local internet slowness, or conditions that build up over time such as insufficient storage or memory. Or dysfunction may be sudden, such as a power failure or forced reboot.
Dysfunction covers operational factors too, including the quality and number of messages, and the storing and releasing of data on request.
Some types of dysfunction (data loss, extended connectivity loss) are more serious than others and so should be treated differently.
We can test a node’s performance relative to other nodes in the section, but what if the whole section is substandard relative to other sections? What then?
As you can tell, dysfunction tracking is a complex issue with many variables.
By monitoring nodes as they go about their duties, we want to nip any problems in the bud before they grow. Ejecting a node should be a last resort. Instead we want to take action - or have the owner of the node, the farmer, take action - to correct the issue.
This process of progressively correcting node behaviour should be automated and flexible, able to react to changing conditions rather than based on arbitrary hard-coded parameters.
Dysfunction tracking is about optimisation, but it’s not just optimising for performance - that would lead to centralisation. Home users would be unable to compete with data centre instances - they would always be relatively dysfunctional.
Currently we have a simplified version of dysfunction in place.
Liveness testing checks that nodes are online, enabling elders to take action if they are not. This has been expanded to penalise nodes not only for dropping chunks, but also for dropping connections and being behind in terms of network knowledge.
Liveness testing, once tweaked to ensure we’re not being too harsh (as we have been so far) or too soft on misbehaving nodes, is a good first step in ensuring a stable network, but with dysfunction tracking we want to go further and build a model that optimises for other factors too.
Penalising nodes currently means dropping them, but will soon incorporate other measures like halving the node age. Decisions on what course of action to take will be made by the elders through consensus.
As mentioned, some degree of ‘dysfunctional behaviour’ is inevitable, and right now we’re experimenting with classifying nodes as good (95%+ success ratio for a given operation), mediocre (75%) or bad (30%) so we can treat each class differently according to its seriousness, without hard coding any ‘expected’ values (PR #1179). We also want to know how many nodes of each class are in our section.
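As a rough illustration of this kind of classification, here is a minimal sketch. It is not the code from the PR: the names `NodeClass`, `Cutoffs`, `classify` and `count_by_class` are hypothetical. It assumes the figures above act as band boundaries, so anything below the mediocre cut-off counts as bad, and the cut-offs are passed in as parameters rather than baked in.

```rust
use std::collections::BTreeMap;

/// The three classes mentioned above.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum NodeClass {
    Good,
    Mediocre,
    Bad,
}

/// Cut-offs supplied by the caller (assumption: success ratios in 0.0..=1.0).
#[derive(Debug, Clone, Copy)]
pub struct Cutoffs {
    pub good: f64,
    pub mediocre: f64,
}

/// Classify one node by its success ratio for a given operation.
/// Returns `None` when there is no data to judge the node on.
pub fn classify(successes: u64, attempts: u64, cutoffs: Cutoffs) -> Option<NodeClass> {
    if attempts == 0 {
        return None;
    }
    let ratio = successes as f64 / attempts as f64;
    Some(if ratio >= cutoffs.good {
        NodeClass::Good
    } else if ratio >= cutoffs.mediocre {
        NodeClass::Mediocre
    } else {
        NodeClass::Bad
    })
}

/// Count how many nodes of each class are in a section, given
/// `(successes, attempts)` per node. Nodes with no attempts are skipped.
pub fn count_by_class(records: &[(u64, u64)], cutoffs: Cutoffs) -> BTreeMap<NodeClass, usize> {
    let mut counts = BTreeMap::new();
    for &(successes, attempts) in records {
        if let Some(class) = classify(successes, attempts, cutoffs) {
            *counts.entry(class).or_insert(0) += 1;
        }
    }
    counts
}
```

With cut-offs of 0.95 and 0.75, a node succeeding 96 times out of 100 would be good, 80 out of 100 mediocre, and 30 out of 100 bad. Keeping the cut-offs as inputs leaves room for deriving them from what the section actually observes.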
sn_dysfunction crate is here.
We want to expand the number of tests we do and parameters we check in order to ensure we are modelling the network in a meaningful way.
Ideas include regular polling of adults to give them a small proof-of-work which could also check whether they are holding an up-to-date version of some mutable data. Mutable data is a CRDT and will eventually converge, but intermittent connectivity may mean that one replica returns an out-of-date version. How out-of-date it is would have a bearing on its dysfunction score, a cumulative tally accorded to each node. If this exceeds a threshold value, that node may be reclassified as ‘mediocre’ or ‘bad’ and be dealt with accordingly. Over what time period we keep adding to the score will need testing.
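One way to picture a cumulative score with a time window is sketched below. It is only an illustration under stated assumptions: the `Issue` variants, their weights, the `ScoreTracker` type and the use of plain seconds for timestamps are all hypothetical and not taken from the sn_dysfunction crate. Events are assumed to be recorded in chronological order.

```rust
use std::collections::{BTreeMap, VecDeque};

/// Kinds of issue an elder might log against an adult (illustrative list).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Issue {
    DroppedChunk,
    DroppedConnection,
    StaleKnowledge,
    /// A replica returned mutable data this many versions out of date.
    StaleData { versions_behind: u32 },
}

impl Issue {
    /// Hypothetical weights: more serious issues add more to the score.
    fn weight(self) -> u32 {
        match self {
            Issue::DroppedChunk => 10,
            Issue::DroppedConnection => 5,
            Issue::StaleKnowledge => 2,
            // The further behind, the higher the weight, capped at 10.
            Issue::StaleData { versions_behind } => versions_behind.min(10),
        }
    }
}

/// Keeps a per-node tally of recent issues.
pub struct ScoreTracker {
    window_secs: u64,
    events: BTreeMap<String, VecDeque<(u64, u32)>>,
}

impl ScoreTracker {
    pub fn new(window_secs: u64) -> Self {
        Self { window_secs, events: BTreeMap::new() }
    }

    /// Log an issue against `node` at time `now_secs`.
    pub fn record(&mut self, node: &str, now_secs: u64, issue: Issue) {
        self.events
            .entry(node.to_string())
            .or_default()
            .push_back((now_secs, issue.weight()));
    }

    /// Cumulative score for `node`, dropping events older than the window.
    pub fn score(&mut self, node: &str, now_secs: u64) -> u32 {
        let cutoff = now_secs.saturating_sub(self.window_secs);
        match self.events.get_mut(node) {
            Some(queue) => {
                while queue.front().map_or(false, |&(t, _)| t < cutoff) {
                    queue.pop_front();
                }
                queue.iter().map(|&(_, w)| w).sum::<u32>()
            }
            None => 0,
        }
    }

    /// True if the node's current score is above `threshold`.
    pub fn exceeds(&mut self, node: &str, now_secs: u64, threshold: u32) -> bool {
        self.score(node, now_secs) > threshold
    }
}
```

The window length is exactly the open question mentioned above: too short and persistent problems are forgotten, too long and a node never recovers from a bad patch.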
Not all issues can be tracked by comparing performance between peers (if all nodes are bad that means a bad section; if too many sections are bad that means a bad network), so we will look at suitable global parameters.
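The difference between a relative check and a global one can be sketched as follows. Everything here is an assumption for illustration: the `Verdict` type, the `assess` function, the use of the section mean, and both parameters (`relative_factor` and `global_ceiling`) are hypothetical. Higher scores mean more dysfunction.

```rust
/// Result of assessing one node in the context of its section.
#[derive(Debug, PartialEq, Eq)]
pub enum Verdict {
    /// Nothing stands out.
    Ok,
    /// The node is much worse than its peers.
    NodeOutlier,
    /// The section as a whole is above the global bound, so comparing
    /// peers with each other would hide the problem.
    SectionWide,
}

/// Compare a node's dysfunction score with its section's mean, and the
/// section's mean with a global ceiling.
pub fn assess(
    node_score: f64,
    section_scores: &[f64],
    relative_factor: f64,
    global_ceiling: f64,
) -> Verdict {
    if section_scores.is_empty() {
        return Verdict::Ok;
    }
    let mean = section_scores.iter().sum::<f64>() / section_scores.len() as f64;
    if mean > global_ceiling {
        Verdict::SectionWide
    } else if node_score > mean * relative_factor {
        Verdict::NodeOutlier
    } else {
        Verdict::Ok
    }
}
```

In a section scoring 1, 2 and 9 with a factor of 2 and a ceiling of 10, the node on 9 stands out against the mean of 4. In a section scoring 20, 22 and 18, no node stands out against its peers, yet the mean of 20 is above the ceiling, which is the case a purely relative test would miss.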
We also need to consider how we can use dysfunction tracking to encourage node diversity and avoid leading to centralisation, so that both an AWS instance and a Raspberry Pi can play a role.
And we want to consider how we present dysfunction information to the end user, the farmer, possibly as a set of default parameters that are tweakable via a config file, and later via a GUI.
Ultimately, Safe is like a global computer, with the elders being a multithreaded CPU and the adults as a giant hard drive. We want this hard drive to be self-healing, and for the CPU to be able to adapt to changes.
All of this takes us into the realm of chaos engineering, as used by the likes of Netflix and Google on their cloud platforms, to ensure that individual server failures don’t bring the whole system down.
There is likely something we can learn here too as we work to make the Safe Network robust and reliable.
Feel free to reply below with links to translations of this dev update and moderators will add them here:
As an open source project, we’re always looking for feedback, comments and community contributions - so don’t be shy, join in and let’s create the Safe Network together!