Step-by-step: the road to Fleming, 4: Network restarts

This one is quite a big topic, and while we haven’t answered everything, we wanted to share what we have. It’s also not entirely in the scope of Fleming, but we spent some time thinking about the issue to make sure that we don’t burn any bridges before we get to designing a full solution in the future.

The problem

SAFE's goal is to be a reliable data and communications network. Reliability means that even if a large part of the network crashes, disconnects, or suffers some other kind of catastrophic failure, it should be able to come back online and continue providing the data it was storing before the failure.

This is not straightforward. There are two main challenges to consider:

  • After the failure, we need to restore trust. When nodes come back online, we need to be able to make sure that these nodes are legitimate members of the Network and can be trusted to restore the pre-crash sections to normal operation.
  • We need to restore data. We have to ensure that the data the restored nodes are attempting to republish has actually been legitimately put on the Network before.

During normal operation of the Network, we maintain trust by keeping Data Chains and running a consensus algorithm - PARSEC. But when the Network is failing, these things might not be enough - if more than one third of the nodes in a section go offline at roughly the same time, the consensus algorithm will be stalled and the section won’t be able to process any requests. The Network will stop functioning.
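To make that threshold concrete, here is a minimal sketch of the arithmetic. The function name and numbers are illustrative, not the actual PARSEC API: consensus keeps making progress only while strictly more than two thirds of a section's voters are reachable.

```rust
/// Illustrative only, not the real PARSEC interface: a section can keep
/// reaching consensus while strictly more than two thirds of its voters
/// are still online.
fn section_can_make_progress(total_voters: usize, offline_voters: usize) -> bool {
    let online = total_voters - offline_voters;
    3 * online > 2 * total_voters
}

fn main() {
    assert!(section_can_make_progress(10, 3));  // 7 of 10 online: still making progress
    assert!(!section_can_make_progress(10, 4)); // 6 of 10 online: stalled
}
```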

So, we need some special provisions to handle such events correctly. What can we do?

Attempts at a solution

More persistent data

First things first - if a node wants to rejoin the Network in the same section it was in before it lost connection, it needs to know what section that was and what its identity was. This means that we need to persist more data to disk than we currently do, namely:

  • the node’s public ID (and the corresponding private key),
  • the node’s routing table, complete with the section’s and neighbouring sections’ members’ endpoints,
  • the parts of the data chain the node is responsible for holding, which is important for restoring trust and keeping data about the nodes’ age,
  • all the node’s active PARSEC instances - there might be multiple ones during splits and merges, corresponding to the pre- and post-change sections,
  • the node’s stored data,
  • the node’s endpoint information.

This is fortunately a relatively simple thing to do.
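As a rough illustration of what that persisted state could look like, here is a sketch assuming the serde and bincode crates for serialisation; all type and field names are illustrative, not the actual routing crate types.

```rust
use serde::{Deserialize, Serialize};
use std::net::SocketAddr;

/// Illustrative shape of the state a node could write to disk so that it
/// can attempt to rejoin its old section after a restart.
#[derive(Serialize, Deserialize)]
struct PersistedNodeState {
    /// The node's public ID and the corresponding private key material.
    keypair: Vec<u8>,
    /// Routing table: member endpoints of our section and neighbouring sections.
    routing_table: Vec<(Vec<u8>, SocketAddr)>,
    /// The parts of the data chain this node is responsible for holding.
    data_chain_blocks: Vec<Vec<u8>>,
    /// Every active PARSEC instance (there can be several around splits and merges).
    parsec_instances: Vec<Vec<u8>>,
    /// The chunks of data the node stores.
    stored_data: Vec<Vec<u8>>,
    /// The node's own endpoint information.
    own_endpoint: SocketAddr,
}

/// Write the state to disk; called whenever any of the above changes.
fn persist(state: &PersistedNodeState, path: &std::path::Path) -> std::io::Result<()> {
    let bytes = bincode::serialize(state).expect("serialising persisted state");
    std::fs::write(path, bytes)
}
```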

Handling nodes leaving

So the Network sees that some nodes are losing connection. In most cases it won't be able to tell right away whether this is the beginning of some larger failure or just regular churn. To be prepared for both scenarios, we need to make sure that if we lose most of the nodes and they start returning, we will be able to resume normal operation.

This prompted us to consider options such as the following: when a node leaves and we vote it offline, instead of removing it from our routing table completely, we will mark it offline. An offline node will only be removed from the routing table when it is the youngest offline node and a new node joins. We can then also roughly measure the health of the Network by the number of offline nodes that we have (if there are many of them, it means that more nodes were leaving than joining for some time).
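A sketch of that idea (the names are ours, not the actual routing code): offline members stay in the table, the youngest offline member is evicted when a newcomer joins, and the offline count doubles as a rough health metric.

```rust
/// Illustrative member record; not the actual routing table types.
struct Member {
    name: [u8; 32],
    age: u8,
    online: bool,
}

struct SectionMembers {
    members: Vec<Member>,
}

impl SectionMembers {
    /// A member has been voted offline: keep the entry, just flip the flag.
    fn mark_offline(&mut self, name: &[u8; 32]) {
        if let Some(member) = self.members.iter_mut().find(|m| &m.name == name) {
            member.online = false;
        }
    }

    /// A new node joins: evict the *youngest* offline member to make room.
    fn handle_join(&mut self, joining: Member) {
        if let Some(idx) = self
            .members
            .iter()
            .enumerate()
            .filter(|(_, m)| !m.online)
            .min_by_key(|&(_, m)| m.age)
            .map(|(idx, _)| idx)
        {
            self.members.remove(idx);
        }
        self.members.push(joining);
    }

    /// Rough health metric: many offline members means nodes have been
    /// leaving faster than they have been joining.
    fn offline_count(&self) -> usize {
        self.members.iter().filter(|m| !m.online).count()
    }
}
```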

Handling nodes joining

From a section’s point of view

Say that a node tries to join our section. If it is rejoining after a brief disconnection, how do we ensure that it regains its status instead of starting from scratch?

There are two sides to this. One is that we must be able to recognise that this node is not new and is in fact rejoining. It would need an opportunity to claim that it has been a member before (which means modifications to the structure of the messages exchanged upon joining), and we need to be able to verify that this claim is true - which is where the way of handling leaving described above helps. This is also related to node ageing: we want rejoining nodes to be able to reclaim half of the age they accumulated before leaving the Network, so that they don't have to start accumulating age from scratch.
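The age part of this is simple enough to show in a couple of lines (illustrative only; the claim-and-verify plumbing is omitted):

```rust
/// Illustrative: a verified rejoining node reclaims half of its previous
/// age instead of starting over; a brand-new node is assumed to start at 1.
fn age_after_rejoin(previous_age: u8) -> u8 {
    (previous_age / 2).max(1)
}
```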

The other side is that it must be secure. Allowing rejoining could open us up to an attack in which someone buys a node identity (both private and public keys) from someone else. They could then bribe the owners of a few elders in a section, have them disconnect and use their identities to rejoin, thus hijacking the section. This shows that we need limitations on rejoining. One possibility we considered was limiting rejoining to nodes that reconnect from the same endpoint (IP and port) - but many people currently have dynamically assigned IPs, so if their Internet connection goes down, they might have a different endpoint when it comes back up again. Also, we want to use random ports for the nodes for security reasons, which further complicates matters. A partial solution we came up with is to restart nodes on both the old endpoint and a new one, where possible - the old endpoint would be used to restore the connections from before the crash, and the new one for establishing new connections.
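A minimal sketch of that dual-endpoint restart, using bare std sockets purely for illustration (real nodes go through the Network's connection layer): keep the old port for pre-crash peers if we can still claim it, and ask the OS for a fresh random port for new connections.

```rust
use std::net::UdpSocket;

/// Illustrative: try to reclaim the pre-crash endpoint and, independently,
/// open a new one on a random free port.
fn bind_endpoints(old_port: u16) -> std::io::Result<(Option<UdpSocket>, UdpSocket)> {
    // Reclaiming the old endpoint can fail (port taken, IP changed);
    // in that case we simply carry on without it.
    let old = UdpSocket::bind(("0.0.0.0", old_port)).ok();
    // Port 0 asks the OS for any free port, used for new connections.
    let new = UdpSocket::bind(("0.0.0.0", 0))?;
    Ok((old, new))
}
```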

From the node’s point of view

Using the old endpoint alongside the new one would also help in another respect. Imagine that your whole section went down, but you regained Network access and are trying to reconnect to your old section buddies. The first thing you will probably try is contacting them on the same endpoints they had before the downtime - but what if that fails? If we allow for the possibility that their endpoints changed, we need a way of finding their new endpoints. If, on the other hand, we can expect them to still be using the old ones, then when contact fails we simply assume that they are, indeed, offline and either wait for them or declare them lost.
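In other words, the reconnection policy can stay very simple - something along these lines (hypothetical names and timings):

```rust
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

/// Illustrative: try a peer's last known endpoint a few times; if every
/// attempt fails, treat the peer as offline (wait for it or declare it lost).
fn peer_reachable(endpoint: SocketAddr, attempts: u32) -> bool {
    (0..attempts).any(|_| {
        TcpStream::connect_timeout(&endpoint, Duration::from_secs(5)).is_ok()
    })
}
```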

What about the data?

While data is out of scope for Fleming, it is an important aspect of restarts. Say there was a failure, nodes lost their connections, but they are coming back online. Imagine that they are somehow reconnecting successfully, but at a given point only half of them, or even fewer, have come back. This means that the Network has less than half of the storage space it had before. If storage was more than 50% full before the failure, the Network won't be able to hold all the data from before the crash at this point. What should we do, then - should we drop some of the data, or should we wait for more nodes to come back? They might not come back, though, so how long do we wait? These are the questions that we will be answering in the scope of Maxwell, when we design the precise way data is handled in the Network.
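The back-of-the-envelope arithmetic behind that statement, with made-up numbers:

```rust
/// Illustrative: if only a fraction of the nodes (and hence of the storage)
/// returns, any pre-crash utilisation above that fraction no longer fits.
fn data_still_fits(pre_crash_capacity: u64, utilisation: f64, fraction_returned: f64) -> bool {
    let stored = pre_crash_capacity as f64 * utilisation;
    let remaining_capacity = pre_crash_capacity as f64 * fraction_returned;
    stored <= remaining_capacity
}

fn main() {
    assert!(data_still_fits(1_000_000, 0.40, 0.5));  // 40% full, half return: fits
    assert!(!data_still_fits(1_000_000, 0.60, 0.5)); // 60% full, half return: does not
}
```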

Summary

I hope this short write-up gave you some insight into the problems with Network restarts and how we are intending to approach them. While the issues are far from fully resolved, devoting time to them has helped us understand the challenges that are still ahead while focussing on the tasks at hand. Next up, we’ll be discussing Network upgrades - that is, how we enable updating all the nodes’ software while making sure that the Network keeps running correctly… Stay tuned!

53 Likes

Really nice clear explanation @bart - even I think I understand most of it anyway!

One thing that occurred to me is that allowing nodes to rejoin at their old trust level creates an opportunity for attackers to ‘steal nodes’ by copying their data and identity and then deleting them from the target. It would be expensive, but maybe it creates an efficient vector for taking over a section.

Just a thought.

9 Likes

Thanks Bart, this was a good read. I am wondering though…
How likely is it that a network restart will ever be necessary? I mean, 1/3 of all nodes ever getting disconnected at the same time… What would have to happen for that to be an issue? Some major natural disaster? World War 3?
It’s great that MaidSafe is thinking of all possibilities and wants to have everything covered. But are we not getting carried away with devoting precious resources to a scenario that is very unlikely to ever happen?

8 Likes

How likely is it that a network restart will ever be necessary? I mean, 1/3 of all nodes ever getting disconnected at the same time…

I can see a couple of scenarios that could happen:

  • an update introduces a bug that causes many nodes to crash
  • an attacker is trying to target nodes in a specific section and manages to take out a number of nodes in the same section (it’s not a full network restart, but from the local point of view of a node it looks very close, and the same mitigations could be applied)
  • one of the big farming operations goes down (or AWS, Digital Ocean or the like). That could take down a big chunk of the network at once

Altogether, I think it is a scenario we must have an answer for in the long run if we want to be sure the Network delivers on its resilience goals.

20 Likes

Fully agree with that! Also, Bart mentioned that it’s not entirely in the scope of Fleming either way, so I understand that you’re limiting your efforts to the minimum restart requirements for Fleming (e.g. persistent ID management with regard to node ageing?).

6 Likes

It seems there are still some unresolved issues. Do we all think these will be resolved in the next 6-8 months? Perhaps by Beta, if it launches by the end of 2019?

Just trying to make sure these get the attention they need : )

FYI, probing for ETAs is pretty overplayed here.

There is one aspect of this that I’d like to bring up.

Geographical concentration gives natural fluctuations during the day. I would assume that we are moving towards smaller fluctuations, as devices increasingly are always on. But even so…

Have a look at this animation; it shows how individual devices are connected during the day:

It basically shows how SAFENetwork data is moving around the globe as devices connect and disconnect - the data concentration on earth over the course of the day. (A lot more is at play though, considering what kinds of devices, bandwidth, etc.)

The data is from 2012 IIRC, so fresher data is needed. But the principle is the same.
I would like to correlate this with connectivity, bandwidth numbers and IT resources per capita (how many devices, what kind of devices), so as to get a +/- percentage from the average network storage capacity over the day. If we assume there will continue to be some fluctuations over the course of a day as devices are turned on in the mornings and off at night, it seems to me that this can aggregate into quite large differences as densely populated regions go on/off.

The geographical concentrations make local/regional events such as power outages, censorship, solar flares, etc. have a larger effect. And if they were to occur at a time when the network is at a capacity minimum…
Well, without having compiled the numbers, it just seems possible that there can be quite disadvantageous circumstances.

Data centers are another factor.

On geographical data concentration, a look at the most densely populated regions gives a quick hint:

  • Tokyo,
  • New York,
  • Sao Paulo,
  • Seoul,
  • Mexico City,
  • Osaka/Kobe/Kyoto,
  • Manila,
  • Mumbai
    etc…

and add to that the fact that South East Asia is overrepresented. So that region is specifically “hot”.

However, it is said that the largest population growth in the coming 50 years will not be in that area of the world, but in Africa, going from 1 billion to 3 billion (or so…). And Africa is quite large. So, at least some balancing continent-wise.

I think it would be good to delve a little deeper into these numbers, just to be able to rule things out (fresher data might show much smaller fluctuations, for example), and maybe find something that can be taken into consideration.

12 Likes

Actually, have you thought of a third scenario? That instead of the other nodes disappearing, it’s a set of nodes losing connectivity to the rest of the section in stages, so that the set of nodes losing connectivity do not realise it (e.g., as mentioned above, a data centre losing connectivity as its various outside links go down - it could be a DNS setting error).

So if you assume the node is still OK and it’s the other nodes that need to rejoin, then the node will forever remain disconnected. This may be reinforced because the node is still connected to 1/3 of the original section.

Basically the node thinks that it’s losing most of the nodes and waits for them to rejoin. But in fact the rest of the nodes are waiting for it and its mates to rejoin.

One simple solution is that rejoining nodes cannot be elders for quite a “time”. Being an elder only benefits the network and does not affect the user’s ability to have a vault and earn safecoin.

So the rejoining node cannot become an elder until it has aged the normal amount that a node needs in order to become an elder. I.e. if the average age at which a node becomes an elder in the section is 16, and the rejoining node was age 40 and thus rejoins at age 20 (half its age), then that node needs to reach age 36 before it can become an elder again.
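In code form, the rule proposed here boils down to a one-liner (illustrative names only):

```rust
/// Illustrative: a rejoining node keeps half its age but must re-earn the
/// section's usual "age to eldership" on top of that before becoming an
/// elder again. E.g. previous age 40, average elder age 16 -> 20 + 16 = 36.
fn eldership_age_after_rejoin(previous_age: u8, avg_elder_age: u8) -> u8 {
    previous_age / 2 + avg_elder_age
}
```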

What about adjacent/neighbouring sections having some info on endpoints in that section? Then, as others in the section come alive, the neighbouring sections can provide that information, suggesting where the nodes in the failed section can be found.

2 Likes

I’m wondering whether, in such an event, the network could choose not to jump straight back into action (which might complicate matters), but instead pause while it rebalances and then move forward once recovered.

… and if the node is thinking about state, perhaps it considers what contribution it can best make to the network, which might add a level of flexibility? So home nodes, etc., do whatever is most useful relative to the resources they have available.

I was thinking about what can happen to sections and what we can do about it, and I realized I understand (sort of) how members of a section communicate, but I have very little knowledge of how communication works between different sections. Is there anything one can read to better understand the “upper level” communication?

My naive thinking is that we could use some knowledge about the rest of the network to help a section “regroup” after big network errors (datacenter outage, cut undersea cable, some country trying a “national firewall”, …).

1 Like

Is there anything one can read to better understand the “upper level” communication?

The disjoint groups (now known as disjoint sections) RFC is probably a good starting point.

It may be a bit dense, but hopefully it can get you started.
For some discussions on this forum: RFC: Disjoint Groups - #21 by JBishop.

And if you have any questions while checking these resources, feel free to ask :smile:

6 Likes

Excellent write up @bart. Clear and concise. Looking forward to hearing you talk about full asynchrony in PARSEC at some point too.

5 Likes

Having “elder” trusted nodes in the sections to maintain stability seems to be a good idea, but you still seem to have concerns. I agree. One way to address this is to have circulating elders. Take the elder concept to the global level for the pool of possible elders, then have the network re-assign elders to different sections, so they cannot maintain a long-term presence in any one section.

This may require these “elders” to have two addresses: the “owner’s” address (stable), and a different section address for elder responsibilities. This splits the elder’s local section for regular SAFE activities from their elder responsibilities. The section they would be an elder in would change periodically to another section on some quasi-random basis, and the specific section would be chosen by the network. Elders could start out being assigned to a section in need of an elder, or their regular section (as the current process seems to be). A strategy would need to be developed so that an elder has no choice of the section or of when such a change would occur. If it happened weekly, that would probably be sufficient, and the changes would not all happen at the same time or even on the same day.

The gossip graphs could be transferred, along with a notice of the change of address for the elder replacement, to create a transparent change in the gossip graph. If these are done as swaps, each departing elder can confirm the transfer of the new incoming elder both in the section they are leaving and in the section they are being transferred to, for validation in the gossip graphs. This lets all elders in both sections monitor the new elder to ensure they are performing their duties.

Excuse my ignorance, but are the network base parameters (section size, elder numbers, etc.) dynamic? As in, is the network, or will it be, able to adjust those parameters without restarts? Can this dynamic behavior be used to facilitate not only restarts but also churn, increasing or decreasing resistance to Sybil attacks in a dynamic way, relaxing or tightening depending on the circumstances?

The elders, like the rest of the nodes, are also relocated to different sections from time to time. What changes is that, with increasing age, the chances of relocation decrease. Also, as the network grows, the number of sections increases, so the total number of elders in the network should also increase.

If this did not occur, the age of the elders would stagnate while the age of the rest of the nodes would continue to rise, which would lead to their replacement.

1 Like

Thank you for your reply, digipl. From what I recall, sections are defined by XOR distance, which can change as the network grows and possibly shrinks, and the address assigned to a node (account) does not change. So it is possible to remain in a section for a long time. Relocation would only happen when the XOR space size (distance) for a section changes. Do all sections have the same XOR space size? Is there a limit to the number of nodes in a section? Are there documents describing the details of this process?

This process is significantly different from what I proposed, which would seem to make it very difficult for elders to turn hostile (for whatever reason) in an effort to take over part of the network. A key feature of the SAFE network is the lack of trust, which helps to prevent malicious activities. The elder position is one of trust, so the trust should remain only with the duties, which are monitored continuously through gossip, and there should be no possibility of remaining within a section where time could be exploited for future malicious activity in that section. Am I not understanding something correctly here? Thanks.

Here you have information about the life of a node in the network, including relocation (in the Source section).

https://raw.githack.com/maidsafe/routing/64116a8adacc3d928da899f331ce2c7446afffe6/routing_model/index.html

Basically, a “Work Unit” counter will be used, and those nodes that exceed 2^age units will be relocated.
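In other words (with made-up names), the relocation trigger described there is simply:

```rust
/// Illustrative check of the rule above: relocate a node once its
/// accumulated work units exceed 2^age.
fn due_for_relocation(work_units: u64, age: u32) -> bool {
    work_units > 1u64 << age
}
```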

2 Likes

For a high-level view, Chapters 4 and 5 of the Primer are still reasonably current, although some of the details have changed: https://primer.safenetwork.org/

Basically, an elder is just one of the oldest members in a section. So a certain proportion (not settled upon yet AFAIK) of the nodes in a section will be given elder status by virtue of their node age. If they prove to be unreliable they will be demoted, moved or kicked off the network.