This week we revisit node age and look at some tweaks to leverage it for more network operations. To head off any wailing and gnashing of teeth, don’t worry, it’s not the sort of architectural revamp that’s going to take months. It builds on what’s already there in terms of new nodes having to prove themselves and move around the network, but it takes away critical data handling duties from those young nodes and restricts them to those that have already proved themselves.
After lengthy and occasionally heated community discussions, @JimCollinson and @andrew have got the spreadsheets out again and have worked through the various options for token distribution. We sincerely hope this provides the basis to move forward on this now.
@joshuef has been experimenting with testnets with even tinier virtual machines and small nodes. It’s been going pretty well, but there have been a few bugs that appear to be around the DKG (elders voting) process, where sometimes votes aren’t received. Related to that, @anselme, @maqi and @davidrusu are taking a close look at DKG and what exactly triggers it, including looking into SAP generation (a new record of elders that is created every time there’s churn) and exactly where that triggers a DKG round.
@oetyng has simplified the join process by moving it into the regular msg flows. After that, the relocation flow was simplified by making it also be a join, but to another section and including a relocation proof. @davidrusu found a potential need to assert that a valid churn event was used; that work is coming up.
@bochaco has been debugging and finalising sn_comms, the communications module, which he is continuing to refactor.
And Mostafa has finished testing the consensus algorithm and added it to the main repo.
Thanks to @southside for suggesting the ChatGPT code commentary initiative. Anyone who wants to help out there (no tech skills required) should check out this post.
Node age and data
Responsibilities in the network are based on the notion of node age.
Node age does not increase linearly but exponentially: each increase in age requires twice as many events as the previous one did.
Time in the network is measured in the number of events, and the measurement is approximate as we are doing a probabilistic evaluation.
Age A happens after ~n events, and age A+1 happens after ~2n events.
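The doubling relationship can be sketched as follows. This is only illustrative: `BASE_EVENTS` and the function names are made up for the example, not actual network parameters.

```rust
// Illustrative sketch of the doubling rule: if reaching the next
// age takes ~n events at age A, it takes ~2n at age A+1.
// BASE_EVENTS is an example constant, not a real network parameter.
const BASE_EVENTS: u64 = 16;

/// Approximate number of events needed to go from `age` to `age + 1`.
fn events_for_next_age(age: u32) -> u64 {
    BASE_EVENTS << age // doubles with each increment of age
}

/// Approximate total events to reach `age` starting from age 0.
fn total_events_to_reach(age: u32) -> u64 {
    (0..age).map(events_for_next_age).sum()
}

fn main() {
    // Each age step costs twice as many events as the one before.
    assert_eq!(events_for_next_age(3), 2 * events_for_next_age(2));
    println!("events to reach age 5: ~{}", total_events_to_reach(5));
}
```

So the total number of events needed grows exponentially with age, which is what makes old nodes a strong signal of stability.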
The reason node age is measured in this way comes from the empirical observation that a node that has stayed online for time x is likely to stay online for at least another x. So, if you have spent time t in the network, your total time in the network is likely to end up being at least 2t.
What this means is simply that the younger the node, the more likely it is to go offline, and the older the node, the more likely it is to stay online.
Having very stable and very unstable nodes both storing live data, as we do now, is hard to manage when there is lots of churn. If a node goes offline, its data must be transferred to the next XOR-nearest candidate, which takes time. New nodes are not reliable and can go offline rapidly, meaning lots of data movement and a headache for the elders who have to manage it.
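The XOR-closeness rule mentioned above can be sketched like this. The short addresses and helper names are invented for the example; the real network uses full-length XOR addresses and its own types.

```rust
// Minimal sketch of choosing the next holder for data when a node
// goes offline: pick the candidate whose address is XOR-nearest to
// the data's address. Addresses are shortened to 4 bytes here.
type Addr = [u8; 4];

/// Byte-wise XOR distance; arrays compare lexicographically,
/// which matches comparing the distances as big integers.
fn xor_distance(a: &Addr, b: &Addr) -> Addr {
    [a[0] ^ b[0], a[1] ^ b[1], a[2] ^ b[2], a[3] ^ b[3]]
}

/// The candidate XOR-nearest to `data_addr`, if any exist.
fn nearest_candidate<'a>(data_addr: &Addr, candidates: &'a [Addr]) -> Option<&'a Addr> {
    candidates.iter().min_by_key(|c| xor_distance(data_addr, c))
}

fn main() {
    let data = [0x0a, 0, 0, 0];
    let nodes = [[0x08, 0, 0, 0], [0xf0, 0, 0, 0], [0x0b, 0, 0, 0]];
    // 0x0a ^ 0x0b = 0x01 is the smallest distance of the three.
    assert_eq!(nearest_candidate(&data, &nodes), Some(&[0x0b, 0, 0, 0]));
}
```

Every churn event potentially re-runs this selection for all the data the departed node held, which is why churn among unreliable young nodes is so costly.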
Primary and secondary storage
We’re looking at concepts around a stable set of nodes (more on that to come down the line). But one idea this gives us is separating nodes into two storage tiers based upon age and (therefore) likelihood of churning, and thus giving them different duties.
For example, we’d want the most stable nodes (say, age 10+) to be responsible for primary data storage. These nodes look after the data and give it up to a client on request. They are not likely to churn any time soon.
Nodes outside of such a stable set, those that are still working on increasing their node age, hold extra copies of data (secondary storage). In doing so, they provide redundancy to support the stable set.
Their behaviour in handling this data is also used to evaluate their quality in the usual way. But since they only hold extra copies they do not need to be tracked so closely by the elders, and can fail without causing serious problems to the network or requiring mass data migration.
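The proposed split could be sketched as below. The threshold comes from the “say, age 10+” example in the text; it is illustrative only, not a settled parameter, and the type names are made up.

```rust
// Illustrative sketch of the two-tier idea: nodes at or above a
// stability threshold handle primary storage, while younger nodes
// hold extra (secondary) copies. STABLE_AGE uses the "age 10+"
// example from the text and is not a real network constant.
const STABLE_AGE: u32 = 10;

#[derive(Debug, PartialEq)]
enum StorageTier {
    Primary,   // stable nodes: hold the data, serve it to clients
    Secondary, // younger nodes: redundant copies, still being vetted
}

fn tier_for(age: u32) -> StorageTier {
    if age >= STABLE_AGE {
        StorageTier::Primary
    } else {
        StorageTier::Secondary
    }
}

fn main() {
    assert_eq!(tier_for(12), StorageTier::Primary);
    assert_eq!(tier_for(5), StorageTier::Secondary);
}
```

The payoff of this split is that a secondary node failing only costs a redundant copy, so elders need not track those nodes as closely.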
This would allow us to have an increased replication count for data while testing out incoming nodes more thoroughly, without sacrificing data stability to do so. All by leveraging our existing node age system.
Feel free to reply below with links to translations of this dev update and moderators will add them here:
Russian ; German ; Spanish ; French ; Bulgarian
As an open source project, we’re always looking for feedback, comments and community contributions - so don’t be shy, join in and let’s create the Safe Network together!