Step-by-step: the road to Fleming, 4: Network restarts

Thanks Bart, this was a good read. I am wondering though…
how likely is it that a network restart will ever be necessary? I mean 1/3 of all nodes
ever getting disconnected at the same time… What would have to happen for that to
be an issue? Some major natural disaster? World War 3?
It’s great that MaidSafe is thinking of all possibilities and wants to have
everything covered. But are we not getting carried away with devoting precious
resources to a scenario that is very unlikely to ever happen?

8 Likes

how likely is it that a network restart will ever be necessary? I mean 1/3 of all nodes
ever getting disconnected at the same time…

I can see a couple of scenarios that could happen:

  • an update introduces a bug that causes many nodes to crash
  • an attacker targets nodes in a specific section and manages to take out a number of nodes in the same section. (It’s not a full network restart, but from the local point of view of a node it looks very close, and the same mitigations could be applied)
  • one of the big farming operations goes down (or AWS, DigitalOcean or the like). That could take down a big chunk of the network at once

Altogether, I think it is a scenario we must have an answer for in the long run if we want to be sure the Network delivers on its resilience goals.

20 Likes

Fully agree with that! Also, Bart mentioned that it’s not entirely in the scope of Fleming anyway, so I understand that you’re limiting your efforts to the minimum restart requirements for Fleming (e.g. persistent ID management with regard to node aging?).

6 Likes

It seems there are still some unresolved issues. Do we think all of these will be resolved in the next 6-8 months? Perhaps by Beta, if it launches by the end of 2019?

Just trying to make sure these get the attention they need : )

FYI, probing for ETAs is pretty overplayed here.

There is one aspect of this that I’d like to bring up.

Geographical concentration gives natural fluctuations during the day. I would assume that we are moving towards smaller fluctuations, as devices increasingly are always on. But even so…

Have a look at this animation; it shows how individual devices are connected during the day:


It basically shows how SAFE Network data would move around the globe as devices connect and disconnect.
It shows the data concentration on Earth over the course of the day.
(There is a lot more at play though, considering what kind of devices, bandwidth, etc.)

The data is from 2012 IIRC, so fresher data is needed, but the principle is the same.
I would like to correlate this with connectivity, bandwidth numbers and IT resources per capita (how many devices, what kind of devices), so as to get a +/- percentage from the average network storage capacity over the day. If we assume there will continue to be some fluctuation over the course of a day as devices are turned on in the morning and off at night, it seems to me that this can aggregate to quite large differences as densely populated regions go on/off.
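
Roughly what I have in mind, as a minimal sketch with made-up hourly numbers (the real figures would have to come from the connectivity and per-capita data above):

```rust
// Hypothetical sketch: given hourly estimates of online storage capacity,
// compute each hour's deviation from the daily average as a +/- percentage
// and report the worst dip.
fn capacity_fluctuation(hourly_capacity_tb: &[f64]) -> Vec<f64> {
    let avg: f64 = hourly_capacity_tb.iter().sum::<f64>() / hourly_capacity_tb.len() as f64;
    hourly_capacity_tb
        .iter()
        .map(|c| (c - avg) / avg * 100.0)
        .collect()
}

fn main() {
    // Made-up 24-hour capacity profile in TB.
    let hourly = vec![
        950.0, 930.0, 920.0, 915.0, 925.0, 960.0, 1010.0, 1060.0,
        1100.0, 1120.0, 1130.0, 1135.0, 1130.0, 1125.0, 1120.0, 1115.0,
        1110.0, 1100.0, 1090.0, 1070.0, 1040.0, 1010.0, 985.0, 965.0,
    ];
    let deviations = capacity_fluctuation(&hourly);
    let worst_dip = deviations.iter().cloned().fold(f64::INFINITY, f64::min);
    println!("worst dip vs. daily average: {:.1}%", worst_dip);
}
```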

The geographical concentrations make local/regional events such as power outages, censorship, solar flares, etc. have a larger effect. And if they were to occur at a time when the network is at a capacity minimum…
Well, without having compiled numbers, it just seems possible that there can be quite disadvantageous circumstances.

Data centers are another factor.

On geographical data concentration, a look at the most densely populated regions gives a quick hint:

  • Tokyo
  • New York
  • São Paulo
  • Seoul
  • Mexico City
  • Osaka/Kobe/Kyoto
  • Manila
  • Mumbai
  • etc…

and add to that that Southeast Asia is overrepresented. So that region is specifically “hot”.

However, it is said that the largest population growth in the coming 50 years will not be in that area of the world, but in Africa, going from 1 billion to 3 billion (or so…). And Africa is quite large. So, at least some balancing continent-wise.

I think it would be good to delve a little deeper into these numbers, just to be able to rule things out (fresher data might show much less fluctuation, for example), and maybe find something that can be taken into consideration.

12 Likes

Actually, have you thought of a third scenario? Instead of the other nodes disappearing, it’s a set of nodes losing connectivity to the rest of the section in stages, so that the set of nodes losing connectivity don’t realise it (e.g. as mentioned above, a data centre losing connectivity as its various outside links go down - it could be a DNS setting error).

So if a node assumes it is still OK and it’s the other nodes that need to rejoin, then the node will remain disconnected forever. This may be reinforced because the node is still connected to 1/3 of the original section.

Basically the node thinks that it’s losing most of the nodes and waits for them to rejoin. But in fact the rest of the nodes are waiting for it and its mates to rejoin.

One simple solution is that rejoining nodes cannot be elders for quite a “time”. Being an elder only benefits the network and does not affect the user’s ability to have a vault and earn safecoin.

So the rejoining node cannot become an elder until it ages the normal amount that a node needs to become an elder. I.e. if the average age at which a node becomes an elder in the section is 16, and the rejoining node was age 40 and thus rejoins at age 20 (half its age), then that node needs to reach age 36 before it can become an elder again.
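
A rough sketch of that arithmetic (the function names are my own, nothing like this exists in routing):

```rust
// Hypothetical rule from the example above: a disconnected node rejoins at
// half its previous age and must then age the "normal" elder amount again
// before it can hold elder status.
fn rejoin_age(age_before_disconnect: u8) -> u8 {
    age_before_disconnect / 2
}

fn re_eligible_for_elder_at(age_before_disconnect: u8, avg_age_to_become_elder: u8) -> u8 {
    rejoin_age(age_before_disconnect) + avg_age_to_become_elder
}

fn main() {
    // Elders typically reach that role at age 16; a node of age 40 rejoins
    // at age 20, so it becomes eligible again at age 36.
    assert_eq!(re_eligible_for_elder_at(40, 16), 36);
    println!("eligible again at age {}", re_eligible_for_elder_at(40, 16));
}
```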

What about adjacent/neighbouring sections having some info on endpoints in that section? Then, as others in the section come alive, the neighbouring sections can provide that information, suggesting where the nodes in the failed section can be found.

2 Likes

Wondering whether, in such an event, it could choose not to jump straight back into action (which might complicate matters), but pause while it rebalances and then move forward once recovered.

… and if the node is thinking about state, perhaps it wonders what contribution it can best make to the network, which might add a level of flexibility?.. so home nodes, etc., do what is most useful relative to the resources they have available.

I was thinking about what can happen to sections and what we can do about it, and I realized I understand (sort of) how members of a section communicate, but I have very little knowledge of how communication works between different sections. Is there anything one can read to better understand the “upper level” communication?

My naive thinking is that we could use some knowledge about the rest of the network to help a section “regroup” after big network errors (datacenter outage, cut undersea cable, some country trying a “national firewall”, …).

1 Like

Is there anything one can read to better understand the “upper level” communication?

The disjoint groups (now known as disjoint sections) RFC is probably a good starting point.

It may be a bit dense, but hopefully it can get you started.
For some discussions on this forum: RFC: Disjoint Groups.

And if you have any questions while checking these resources, feel free to ask :smile:

6 Likes

Excellent write up @bart. Clear and concise. Looking forward to hearing you talk about full asynchrony in PARSEC at some point too.

5 Likes

Having “elder” trusted nodes in the sections to maintain stability seems to be a good idea, but you still seem to have concerns. I agree. One way to address this is to have circulating elders. Take the elder concept to the global level for the pool of possible elders, then have the network re-assign elders to different sections, so they cannot maintain a long-term presence in any one section. This may require these “elders” to have 2 addresses: the “owner’s” address (stable), and a different section address for elder responsibilities. This splits the elder’s local section for regular SAFE activities from their elder responsibilities.

The section they would be an elder in would change periodically to another section on some quasi-random basis, with the specific section chosen by the network. Elders could start out being assigned to a section in need of an elder, or to their regular section (as the current process seems to be). A strategy would need to be developed so that an elder has no choice of the section or of when such a change occurs. If it happened weekly, that would probably be sufficient, and the changes would not all happen at the same time or even on the same day.

The gossip graphs could be transferred, along with a notice of the change of address for the elder replacement, to create a transparent change on the gossip graph. If these are done as swaps, each leaving elder can confirm the transfer of the new elder entering both in the section they are leaving and in the new section they are being transferred to, for validation in the gossip graphs. This lets all elders in both sections monitor the new elder to ensure they are performing their duties.
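
Purely to illustrate the idea (nothing like this exists in the network today): the duty section for each period could be derived from a hash of the elder’s stable address and the period number, so that neither the elder nor any single section chooses the assignment.

```rust
// Hypothetical sketch of quasi-random elder rotation: hash the elder's
// stable id together with the period (e.g. week) number and map the result
// onto one of the current sections.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn duty_section(elder_stable_id: &str, period: u64, section_count: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    elder_stable_id.hash(&mut hasher);
    period.hash(&mut hasher);
    hasher.finish() % section_count
}

fn main() {
    // The assignment changes from one period to the next without the elder
    // having any say in where it lands.
    println!("week 100: section {}", duty_section("elder-abc", 100, 64));
    println!("week 101: section {}", duty_section("elder-abc", 101, 64));
}
```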

Excuse my ignorance, but are the network base parameters (section size, elder numbers, etc.) dynamic? As in, is/will the network be able to adjust those parameters without restarts? Could this dynamic behavior be used to facilitate restarts, but also churn, increasing/decreasing resistance to Sybil attacks dynamically, relaxing or tightening depending on the circumstances?

The elders, like the rest of the nodes, are relocated to different sections from time to time. What changes is that, with increasing age, the chances of relocation decrease. Also, as the network grows, the number of sections increases, so the total number of elders in the network should also increase.

If this relocation did not occur, the age of the elders would stagnate while the age of the rest of the nodes would continue to rise, which would lead to the elders being replaced.

1 Like

Thank you for your reply, digipl. From what I recall, sections are defined by XOR distance, which can change as the network grows and possibly shrinks, and the address assigned to a node (account) does not change. So it is possible to remain in a section for a long time; relocation would only happen when the XOR space size (distance) for a section changes. Do all sections have the same XOR space size? Is there a limit to the number of nodes in a section? Are there documents describing the details of this process?

This process is significantly different from what I proposed, which would seem to make it very difficult for elders to turn hostile (for whatever reason) in an effort to take over part of the network. A key feature of the SAFE Network is the lack of trust, which helps to prevent malicious activities. The elder position is one of trust, so the trust should remain only with the duties, which are monitored continuously through gossip, and an elder should not have any possibility of remaining within a section where time could be exploited for future malicious activity. Am I not understanding something correctly here? Thanks.

Here you have information about the life of a node in the network, including relocation (in the Source section).

https://raw.githack.com/maidsafe/routing/64116a8adacc3d928da899f331ce2c7446afffe6/routing_model/index.html

Basically, a “Work Unit” counter will be used, and those nodes whose counter exceeds 2^age units will be relocated.
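
A minimal sketch of that trigger (the field names are my own, not taken from the routing model):

```rust
// Assumed relocation rule: each node accumulates work units, and once the
// counter exceeds 2^age the node becomes due for relocation. The threshold
// doubles with every increment of age, so older nodes relocate ever more
// rarely.
struct NodeState {
    age: u32,
    work_units: u64,
}

impl NodeState {
    fn due_for_relocation(&self) -> bool {
        self.work_units > (1u64 << self.age)
    }
}

fn main() {
    let young = NodeState { age: 4, work_units: 20 };  // threshold 2^4 = 16
    let old = NodeState { age: 10, work_units: 20 };   // threshold 2^10 = 1024
    println!("young due: {}", young.due_for_relocation()); // true
    println!("old due: {}", old.due_for_relocation());     // false
}
```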

2 Likes

For a high-level view, Chapters 4 and 5 of the Primer are still reasonably current, although some of the details have changed. https://safenetworkprimer.com/

Basically, an elder is just one of the oldest members in a section. So a certain proportion (not settled upon yet AFAIK) of the nodes in a section will be given elder status by virtue of their node age. If they prove to be unreliable they will be demoted, moved or kicked off the network.

This is desirable to ensure a homogeneous workload on the vaults. Some machinery must be implemented to tend towards this ideal work model, like relocation to a section having a bigger XOR space size (or equivalently, a smaller prefix).

I did some simulations with 2 important metrics to optimize for network homogeneity:

  • Number of distinct section XOR space sizes
  • Density gap (important because a bigger section may not be a problem when it has more nodes)

Yes, when the number of nodes grows too big, the section is split into 2 new sections. In the current implementation, with the default min section size (8), this happens when the section has 22 nodes or slightly more. This is not a fixed limit, but the probability of a split increases with the number of nodes in the section.
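
A sketch of how I understand the split condition (assuming both halves, grouped by the next prefix bit, must reach the min section size plus a small buffer; with 8 + 3 per half that gives the 22-node minimum, while an unbalanced section splits later):

```rust
// Assumed split rule: the section splits only when both child prefixes
// would still hold at least MIN_SECTION_SIZE + SPLIT_BUFFER nodes.
const MIN_SECTION_SIZE: usize = 8;
const SPLIT_BUFFER: usize = 3;

fn can_split(nodes_with_prefix_bit_0: usize, nodes_with_prefix_bit_1: usize) -> bool {
    let needed = MIN_SECTION_SIZE + SPLIT_BUFFER;
    nodes_with_prefix_bit_0 >= needed && nodes_with_prefix_bit_1 >= needed
}

fn main() {
    println!("{}", can_split(11, 11)); // true: 22 nodes, perfectly balanced
    println!("{}", can_split(15, 8));  // false: 23 nodes, but one half too small
}
```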

1 Like

Thank you digipl, JPL and tfa for the additional info. The Primer did not have much to say about elder relocation, but it does nicely address node aging. Sorry, I should have read that first. Since the Primer does not explicitly exclude elders from the aging process, it implies they relocate. The link digipl provided discusses this in more detail.

If relocation occurs on a 2^age count of Work Units, nodes (adults and elders) relocate at exponentially increasing intervals as they age, unless they are doing a lot more work as they age. Without a limit, this still seems to give elders too much section stability over time. That stability could be worth building up and selling, as gamers do. This issue was raised in Step-by-step: the road to Fleming, 4: Network restarts; in Handling nodes joining; From a section’s point of view. It would be addressed to some extent by section splits, as long as section growth is relatively consistent and elders are split up like adults. Ensuring no node is excluded from being relocated to another section after some minimum amount of Work Units and/or time seems worth considering. Maybe I’m still not understanding this process correctly.

BTW, though the SAFE Network is taking a long time to go live, those making it happen are doing an impressive job. Resolving the large-scale failure problem is a very difficult task. While the network could collapse quickly (worst case scenario), it might come back only slowly, making recovery even more difficult.

6 Likes