Many thanks to @stout77 for another cover image
One of the simplest but also most fundamental and important features in Safe Network design is Node Age. Essentially, Node Age replaces systems like Proof Of Work in rewarding good behaviour, punishing bad, and making life very difficult for a Sybil attacker. It provides an important measure of the quality and ongoing trustworthiness of every node, and is our featured topic this time around.
On the back of our groundbreaking work with DBCs, in which @davidrusu and others have taken the ‘digital cash’ concept in a whole new direction making it Byzantine fault-tolerant and thus fit for a decentralised network, we are happy to announce that David Rusu will be heading up a new Safe Labs division. This will be our R&D umbrella for state-of-the-art cryptography, networking and more. Research will be primarily Safe-oriented rather than blue sky, but we want to pull in expertise from wherever it may exist in a more formal and structured way.
@Anselme finalised a PR for checking the SAP on handover, and has started looking into Byzantine behaviours in handover (the process of redistributing data on a churn event).
David Rusu gave a presentation on conflict-free replicated data types (CRDTs) at a Toronto CompSci meetup, mentioning what he’s been doing at MaidSafe (naturally!). Lots of interest in the topic and plenty of contacts to be made. He’s going back for another one on CRDT trees.
@Bochaco has completed a PR to check permission at the client side when performing operations on registers (mutable data) and is also working on the spent book client API.
And @Chriso has been looking at testnet failures caused by the temporary removal of features like
Testing internally, Metricbeat has shown us some nodes creeping up to some verrry high mem usage over a day or so. Diving in, we realised that there appeared to be quite an edge-casey deadlock occurring there (centred around clean-up of connections). We’ve a few fix options here and so are just looking and testing to see what makes the most sense there.
Meanwhile @Qi_ma gave a talk to the team on
Every node on the network has an address which is decided by its ID, which is actually a key that’s generated when it joins the network. This node ID is essentially a very large random number. Its first few bits (e.g. 0101101…) determine what section the node will be in and therefore what data it will look after, while the last eight bits (e.g. 00000101) signify its Node Age - in this case 5.
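To make the layout concrete, here is a minimal sketch of reading the section prefix and the Node Age out of an ID. It assumes the ID is handled as 32 raw bytes with the age in the final byte; the names `NodeId`, `node_age` and `matches_prefix` are made up for illustration and are not taken from the Safe codebase.

```rust
/// A node ID treated as 32 raw bytes (256 bits). The width is an assumption.
type NodeId = [u8; 32];

/// The Node Age is carried in the last eight bits of the ID.
fn node_age(id: &NodeId) -> u8 {
    id[31]
}

/// Returns true if the leading bits of `id` match the section prefix.
/// `prefix` is a string of '0'/'1' characters, e.g. "0101";
/// any character other than '1' is treated as a 0 bit.
fn matches_prefix(id: &NodeId, prefix: &str) -> bool {
    prefix.len() <= 256
        && prefix.chars().enumerate().all(|(i, c)| {
            // Bit i counts from the most significant bit of the first byte.
            let bit = (id[i / 8] >> (7 - (i % 8))) & 1;
            (bit == 1) == (c == '1')
        })
}
```

With an ID whose first byte is 01011010 and whose last byte is 00000101, `node_age` gives 5 and `matches_prefix(&id, "0101")` holds.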
When a node is first accepted into the network it is given a Node Age of 5, so its ID ends …00000101 (the joining node must keep generating ED25519 keys until it gets one with the correct ending and the correct prefix, generally a sub-second process).
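The "keep generating until it fits" loop might look roughly like the sketch below. Real nodes generate ED25519 keypairs; to stay dependency-free this example swaps in a toy, non-cryptographic SplitMix64 generator (`ToyKeyGen`, a hypothetical stand-in), so only the shape of the loop should be taken from it.

```rust
type NodeId = [u8; 32];

/// Stand-in for real key generation: a tiny SplitMix64 generator.
/// NOT cryptographic; a real node would generate ED25519 keypairs instead.
struct ToyKeyGen(u64);

impl ToyKeyGen {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Fill a 32-byte candidate ID from four generator outputs.
    fn next_id(&mut self) -> NodeId {
        let mut id = [0u8; 32];
        for chunk in id.chunks_mut(8) {
            chunk.copy_from_slice(&self.next_u64().to_be_bytes());
        }
        id
    }
}

/// True if the leading bits of `id` equal `prefix` (most significant bit first).
fn matches_prefix(id: &NodeId, prefix: &[bool]) -> bool {
    prefix.len() <= 256
        && prefix.iter().enumerate().all(|(i, &want)| {
            let bit = ((id[i / 8] >> (7 - (i % 8))) & 1) == 1;
            bit == want
        })
}

/// Keep generating candidate IDs until one has the wanted prefix and age byte.
/// Returns the ID together with the number of attempts it took.
fn generate_id(keygen: &mut ToyKeyGen, prefix: &[bool], age: u8) -> (NodeId, u64) {
    let mut attempts = 0;
    loop {
        attempts += 1;
        let id = keygen.next_id();
        if id[31] == age && matches_prefix(&id, prefix) {
            return (id, attempts);
        }
    }
}
```

With uniformly random IDs, the age byte matches one time in 256 and each prefix bit halves the odds again, so a p-bit prefix needs about 256 × 2^p attempts on average.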
The longer the node remains an active participant on the network, the larger its Node Age will grow, up to a highly unlikely maximum of 255. But there are a couple of catches: (1) its Node Age will only grow if it proves itself reliable at storing data chunks and giving them up when requested over a certain time period. (2) Each time its Node Age is incremented, it must move to another section.
But Safe Network has no concept of time, so how can we track how long the node has been behaving? The answer is we use churn events (section membership changes) as a proxy for time.
Each section will contain 7 elders (decision-making nodes) and 60+ adults (storage nodes). Each time a node goes offline or joins the section, which happens frequently with adults, elders vote on what has happened. Each churn event has a 256-bit ID, which is the combined BLS signature of 5 out of the 7 elders. This churn ID is also effectively a random number and cannot be predicted beforehand.
If the new node proves itself to be dysfunctional within the first few churn events it will be ejected and will need to ask to join again. No point in wasting resources on a dead weight.
On the other hand, if our new node performs its duties properly for a few churn events we want to reward it and increase its age by 1, but we don’t want to have to track it and record when it joined etc. So we use the churn ID as a sort of lottery ticket.
The churn ID (a random number, remember) provides two functions so far as nodes are concerned. First it provides a way for nodes to get their Node Age increased, and second, since we don’t want nodes to build their reputation in just one section because of the risk of malicious behaviour, the churn ID also decides which random section the newly promoted node will join.
If the churn ID is exactly divisible by 2 to the power of the node's Node Age (churn ID % 2^age == 0), the node gets promoted. So for our new node of age 5, if the churn ID is divisible by 32 - which will happen on average once every 32 churns - it gets its Node Age bumped up to 6 and is moved to a new section. It will then likely have to wait another 64 churns in its new section before it gets promoted again - promotion becomes exponentially more difficult the longer it remains. This means that Elders, the oldest 7 nodes in the section, have been around a long time and proved themselves in many different sections before achieving their voting status.
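The divisibility test is cheap because churn ID % 2^age == 0 is the same as saying the lowest `age` bits of the churn ID are all zero. A minimal sketch, assuming the 256-bit churn ID is held as 32 big-endian bytes (the byte order and the function names are assumptions made for this example):

```rust
/// A churn ID as 32 big-endian bytes (256 bits). Byte order is an assumption.
type ChurnId = [u8; 32];

/// Number of trailing zero bits in the 256-bit value.
fn trailing_zero_bits(id: &ChurnId) -> u32 {
    let mut count = 0;
    // Walk from the least significant byte towards the most significant one.
    for byte in id.iter().rev() {
        if *byte == 0 {
            count += 8;
        } else {
            return count + byte.trailing_zeros();
        }
    }
    count
}

/// churn ID % 2^age == 0 holds exactly when the lowest `age` bits are zero.
fn is_promotion_trigger(churn_id: &ChurnId, age: u8) -> bool {
    trailing_zero_bits(churn_id) >= u32::from(age)
}

/// Average number of churn events between triggers for a given age: 2^age.
fn expected_churns(age: u8) -> f64 {
    2f64.powi(i32::from(age))
}
```

A churn ID ending in …00100000 triggers promotion for age 5 but not for age 6, and `expected_churns` returns 32 for age 5 and 64 for age 6, matching the figures above.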
How does it work? On every churn event, the elders check whether the churn ID is divisible by 2^age for each age, starting with the oldest (255) and working down to the youngest (5). When one of those ages matches a set of nodes in our section, we relocate up to elder_count/2 nodes which have that Node Age. There will usually be only one in that age bracket, but in the case of an excess we select the nodes with a node ID closest to the churn ID.
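The selection step could be sketched as follows. Two things here are assumptions rather than facts from the update: that "closest" means XOR distance between the node ID and the churn ID, and that both are 32 big-endian bytes with the age in the node ID's last byte. The function names are hypothetical.

```rust
type Id = [u8; 32];

/// Node Age is read from the last byte of the node ID.
fn age_of(id: &Id) -> u8 {
    id[31]
}

/// Number of trailing zero bits of a 256-bit big-endian value.
fn trailing_zero_bits(id: &Id) -> u32 {
    let mut count = 0;
    for byte in id.iter().rev() {
        if *byte == 0 {
            count += 8;
        } else {
            return count + byte.trailing_zeros();
        }
    }
    count
}

/// XOR distance; arrays compare lexicographically, i.e. as big-endian numbers.
/// Using XOR as the meaning of "closest" is an assumption.
fn xor_distance(a: &Id, b: &Id) -> Id {
    let mut out = [0u8; 32];
    for i in 0..32 {
        out[i] = a[i] ^ b[i];
    }
    out
}

/// Pick the nodes to relocate for this churn event (possibly none).
fn relocation_candidates(nodes: &[Id], churn_id: &Id, elder_count: usize) -> Vec<Id> {
    let zeros = trailing_zero_bits(churn_id);
    // Ages present in the section, oldest first.
    let mut ages: Vec<u8> = nodes.iter().map(age_of).collect();
    ages.sort_unstable_by(|a, b| b.cmp(a));
    ages.dedup();
    for age in ages {
        // The churn ID is divisible by 2^age when it has at least `age` trailing zeros.
        if zeros >= u32::from(age) {
            let mut matched: Vec<Id> =
                nodes.iter().filter(|n| age_of(n) == age).cloned().collect();
            matched.sort_by_key(|n| xor_distance(n, churn_id));
            matched.truncate(elder_count / 2);
            return matched;
        }
    }
    Vec::new()
}
```

Because a churn ID divisible by 2^age is also divisible by every smaller power of two, walking the ages from oldest to youngest means the oldest eligible bracket is the one that gets relocated.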
Nodes can also be demoted for dysfunctional behaviour (bad performance in comparison with their peers). In this case, Node Age is halved before they are relocated.
This scheme has three main benefits. The first is Sybil resistance. In order to control a section, an attacker will need to control at least three elders. The process of becoming an elder is long and hard, and it’s impossible to know which section you’ll end up in. When the network is large, a 7 elders to 60+ adults ratio will make such Sybil attacks extremely difficult. In addition, new nodes are only allowed to join a section when more storage is needed, so attackers cannot flood the network with new joiners.
The second is avoiding undue work. If a node fails, it will likely do so early, so we kick it out before it can progress any further.
The third is general randomisation. Forcing the nodes to hop from section to section to gain trust also has the benefit of distributing capability evenly.
@Qi_ma has been working on the implementation of Node Age including the messaging flows between the section elders, the candidate for promotion, and the elders in the target section. He gave a talk to the team this week. Here is one of his slides.
Elders in the source section
- Agree on a churn event (membership change) and sign it (Churn ID)
- Check if there are any candidates for relocation
- Pick the oldest candidate(s)
- Calculate their destination sections from their node ID combined with the Churn ID
- Increase their age by 1
- Cast a vote for each one to be relocated
- When enough vote shares have been gathered, inform each candidate node
Candidate node
- Receives message from elders
- Acknowledges relocation process starting
- Generates a new ID with correct initial bits (section) and trailing bits (its new age)
- Bootstraps to the new section [it has authority to do so from its original section]
Elders in destination section
- Check the source section’s knowledge is up to date (the SAP)
- Update them if not and tell them to resend
- Check relocation signatures and details are in order
- Vote on the candidate joining
- If all goes well, candidate joins new section
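As a rough illustration of the "enough vote shares" step in the source section, here is a minimal tally sketch. The slide does not state the threshold for relocation votes; this example assumes the same 5-of-7 used for signing churn events, and `RelocationVote` is a hypothetical type, not one from the Safe codebase.

```rust
use std::collections::HashSet;

/// Hypothetical tally of elder vote shares for one relocation proposal.
struct RelocationVote {
    /// Shares needed before the candidate is informed (assumed: 5 of 7).
    threshold: usize,
    /// Indices of the elders that have voted so far.
    voters: HashSet<u8>,
}

impl RelocationVote {
    fn new(threshold: usize) -> Self {
        RelocationVote { threshold, voters: HashSet::new() }
    }

    /// Record a vote share from an elder. Duplicate shares from the same
    /// elder are ignored. Returns true once enough shares are gathered.
    fn add_share(&mut self, elder_index: u8) -> bool {
        self.voters.insert(elder_index);
        self.voters.len() >= self.threshold
    }
}
```

In this sketch the fifth distinct elder's share is what flips `add_share` to true, at which point the elders would inform the candidate node.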
Feel free to reply below with links to translations of this dev update and moderators will add them here:
As an open source project, we’re always looking for feedback, comments and community contributions - so don’t be shy, join in and let’s create the Safe Network together!