Step-by-step: the road to Fleming, 1: The Big Questions: SAFE Fleming and Beyond

development
fleming
routing

#1

Setting the scene
With SAFE-Fleming, we’ll be releasing a fully functional, permissionless, decentralised Network based on a combination of features that we’ve developed. As we discussed in ‘What is SAFE-Fleming?’, a critical component of this is PARSEC, which solves one of the toughest challenges that any decentralised Network faces: consensus between peers.

Let’s talk about some of the discussions we’re having around implementation post-Fleming. Whilst these topics don’t translate to functionality you’ll see in Fleming itself, it’s crucial that the work we’re doing today takes account of the future requirements of the Network. So let’s kick off by giving you a high-level view of our approach to each question.

Thinking in layers
Let’s start with a general point: Routing in the SAFE Network (the way that nodes connect to each other) involves many interconnected design aspects. We introduced the concept of ‘layers’ to frame our approach to each design challenge. Considering what impact any decision might have on other layers of the Network allows us to work more efficiently, as we approach each new question with a clear focus and an understanding of its impact.

Network Upgrade: Handle change
Once live, the Network will need to evolve in order to satisfy user demands and adapt to changes. Like all software, it’s critical that it has the ability to upgrade. However, upgrading peer-to-peer Networks comes with its own set of unique challenges. For example, the Network needs to be fully functional from the moment of an upgrade: all stored information intact and all peers communicating with each other.

This can be broken down into a number of questions:

  • How do you ensure that upgrading peers doesn’t disrupt the Network?
  • How do the Network’s peers make the decision to upgrade (or use a newer version of the protocol)?
  • Should the Network handle multiple protocol and software versions, and if so, how?

Many other concrete decisions need to be taken as well, for example:

  • What data format should peers use for communication to ensure interoperability? Which protocol versions should be used?
  • How does the Network distribute upgrades to peers?

So, in summary: how does the SAFE Network handle all future upgrades smoothly?
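One way to frame the interoperability question above is version negotiation: each peer advertises the range of protocol versions it speaks, and two peers talk using the newest version they share. The sketch below is purely illustrative; the names (`Peer`, `negotiated_version`) and the idea of a min/max range are assumptions, not the SAFE Network’s actual API.

```python
# Hypothetical sketch of protocol version negotiation between peers.
# Names and structure are invented for illustration.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Peer:
    name: str
    min_version: int  # oldest protocol version this peer still speaks
    max_version: int  # newest protocol version this peer speaks

def negotiated_version(a: Peer, b: Peer) -> Optional[int]:
    """Return the highest protocol version both peers support, or None."""
    low = max(a.min_version, b.min_version)
    high = min(a.max_version, b.max_version)
    return high if high >= low else None

old = Peer("old-node", min_version=1, max_version=2)
new = Peer("new-node", min_version=2, max_version=3)
legacy = Peer("legacy-node", min_version=1, max_version=1)

print(negotiated_version(old, new))     # 2: they overlap on version 2
print(negotiated_version(new, legacy))  # None: no common version
```

Under this toy model, an upgrade only disrupts communication once a peer’s supported range no longer overlaps with its neighbours’ ranges.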

Network Restart: Handle massive failure
The internet is very resilient, yet large services still fail on occasion. Failures may be due to accident or attack, but these services tend to recover fairly quickly, often because they can rely on a centralised infrastructure. A decentralised Network can’t afford this luxury, which presents a unique challenge for any restart.

How can the Network handle and recover from a temporary catastrophic failure or a large number of peers leaving the Network at a similar time?

The peers in each Section of the SAFE Network together hold a specific part of the Network’s information, and the rest of the Network trusts them to keep it. So as it recovers, peers need not only to communicate with each other, but also to ensure that the information in each Section is recovered, trustworthy, and managed by a functional Section.

Also, losing a large number of peers may impact other functionality. For example, peers will join and leave on occasion. If the number of peers grows or shrinks sufficiently, the Network adapts its structure by merging or splitting Sections to maintain its overall integrity. We need to design against unintended consequences when a large proportion of peers leave, only to rejoin after the event.
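The merge/split adaptation described above can be sketched as a pair of size thresholds with a gap between them (hysteresis), so that a mass leave followed by a mass rejoin doesn’t cause Sections to flap between merging and splitting. The thresholds here are invented for illustration; the real SAFE Network rules differ.

```python
# Toy sketch of section split/merge decisions with hysteresis.
# The thresholds are hypothetical, chosen only to illustrate the idea.

SPLIT_AT = 20   # split when a section grows to this many peers (hypothetical)
MERGE_AT = 8    # merge when a section shrinks below this (hypothetical)

def next_action(section_size: int) -> str:
    if section_size >= SPLIT_AT:
        return "split"    # e.g. two sections of roughly section_size / 2 each
    if section_size < MERGE_AT:
        return "merge"    # e.g. combine with a sibling section
    return "stable"       # the gap between thresholds avoids flapping

print(next_action(25))  # split
print(next_action(5))   # merge
print(next_action(12))  # stable
```

Because the merge and split thresholds don’t touch, a section that loses peers and then regains them crosses each boundary at most once rather than oscillating.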

Having the ability to recover from a full shutdown is essential for the reliability of the Network. We see the design of this functionality as an opportunity to also support upgrades.

Scalability: Handle growth
The SAFE Network is an infrastructure for everyone to use. It needs to handle a very large number of peers that will provide resources to the Network. As a result, we need to identify and mitigate the limitations of any components that would restrain the Network’s ability to grow.

While designing parts of the Network, we need to constantly consider how each component will scale within the constraints that are required for the Network to operate efficiently.
Consequently, we’re looking at larger Section sizes. This would impact different parts of the system, such as connectivity with other peers and the efficiency of our consensus algorithm, PARSEC. As a result, we need to ensure that all of these components can handle such requirements as part of this Fleming work.

Connectivity: Handle Networking constraints
There are certain strict constraints in the way that computers can connect. For instance, many consumer computers and routers can only maintain connections with a limited number of other computers (from tens to a few hundred).

In a peer-to-peer Network like SAFE, each computer needs to communicate with many of its peers. We need to work out the optimal way forward here to avoid any hindrance to scaling and any negative impact on our considerations for Network upgrade and restart.

There are many ways to deal with this issue, and each has its own pros and cons. For example, direct connections can be considered scarce and should be used sparingly. In addition, we can exploit certain capabilities of some computers to enable them to connect with more peers. We can also shield some parts of the design from these technical details with the right abstractions. On top of that, we can combine several of these techniques.
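The idea of treating direct connections as scarce can be sketched as a connection budget: a node makes direct connections until a limit is reached, then falls back to relayed connections. The class name, the limit, and the relay fallback are all assumptions for illustration, not SAFE’s actual connectivity layer.

```python
# Hypothetical sketch: treat direct connections as a scarce budget and fall
# back to relayed connections once it is exhausted. Illustrative only.

class ConnectionManager:
    def __init__(self, max_direct: int):
        self.max_direct = max_direct
        self.direct = set()    # peers connected directly
        self.relayed = set()   # peers reached via a relay

    def connect(self, peer_id: str) -> str:
        """Prefer a direct connection; use a relay when the budget is spent."""
        if len(self.direct) < self.max_direct:
            self.direct.add(peer_id)
            return "direct"
        self.relayed.add(peer_id)
        return "relayed"

mgr = ConnectionManager(max_direct=2)
print(mgr.connect("peer-a"))  # direct
print(mgr.connect("peer-b"))  # direct
print(mgr.connect("peer-c"))  # relayed: the direct budget is used up
```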

History Pruning
Each peer needs to retain information about what the Network stores and whom to trust, and to have proof that the current state is the result of valid changes. As the Network changes and provides services, more and more data points are created, each with its own overhead (each data point needs proof of validity, for example), and some older ones become obsolete (data about a long-gone peer, for example).

In the blockchain world, all transactions since the inception of a network are stored in an ever-growing ledger. Compare that to our data chain approach, which allows nodes to forget information they no longer need as the Network evolves. SAFE can use the trusted history of a piece of information to establish that the information is valid. In an asynchronous network, where nodes receive information at different times and in a different order, it needs to preserve some history in order to convince nodes that are slightly behind in their view of the world. Exactly how much can we prune? Where exactly is the cutoff point? How do we maintain proof that the pruned history is correct? These are some of the questions we’ve looked into.
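The general shape of pruning-with-proof can be illustrated with a hash chain: old entries are dropped, but the running hash at the cut point is kept as a checkpoint committing to everything pruned, so the retained suffix still verifies against the same head. This is a generic sketch, not the actual data chain format or its proofs.

```python
# Illustrative sketch (not the actual data chain design): prune old entries
# but keep a running hash that commits to everything pruned, so the retained
# suffix can still be verified against the same chain head.

import hashlib

def chain_hash(prev_hash: bytes, entry: bytes) -> bytes:
    return hashlib.sha256(prev_hash + entry).digest()

entries = [b"event-1", b"event-2", b"event-3", b"event-4"]

# Full chain: fold every entry into a running hash.
h = b""
hashes = []
for e in entries:
    h = chain_hash(h, e)
    hashes.append(h)

# Prune the first two entries, keeping only their final running hash as a
# checkpoint. Replaying the suffix from the checkpoint reproduces the same
# head hash as the full chain, so the pruned history stays provable.
checkpoint = hashes[1]
h2 = checkpoint
for e in entries[2:]:
    h2 = chain_hash(h2, e)

print(h2 == hashes[-1])  # True
```

The open questions in the post then become: how far back the checkpoint can be moved, and what extra material (e.g. signatures over the checkpoint) a lagging node needs in order to trust it.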

Assignment of identities to peers
As we integrate the existing design for Node Ageing and Address relocation, we must consider when and how to persist a peer’s identity. For instance, stripping a peer of its identity may be an appropriate punishment for malicious behaviour. It deprives them of their valuable age. However, we also want to incentivise nodes to run the latest version of the software so clearly a node should also maintain its identity after a Network upgrade. When considering such nuances, we need of course to remain mindful of the threat model for any chosen approach.

Opportunities of an open culture
As we are answering all of the questions above, we favour simpler solutions that have a clear and well-defined impact on a small number of layers of the Network. If solutions already exist in the wider world, it’s worth stressing the point again: we’re always open to taking inspiration and collaborating rather than reinventing the wheel for the sake of it. We’re keen not to miss any insight from other teams in our space so we’ve been spending some time analysing how other projects have answered similar challenges.

What’s next?
Now we’ve given some context to the questions that we need to take account of on the Road to Fleming. Whilst they may not be implemented today, each needs to be considered now in order to ensure that the system fits together in the most efficient way after Fleming is out in the wild. On that note, next up we’ll be diving into a topic that’s of interest to everyone in the decentralised space: how the SAFE Network defends itself against Sybil attacks.


Step-by-step: the road to Fleming, 0: What is SAFE-Fleming?
Step-by-step: the road to Fleming, 2: Sybil resilience
#2

Great! I like this different view of the project and where it’s going.
Can’t wait for the next installment and to see how the whole series rolls out.


#3

I think there is a typo here. It should be:

Now we’ve given some context to the questions that we need to take account of on the Road to Beta. Whilst they may not be implemented today, each needs to be considered now in order to ensure that the system fits together in the most efficient way after Fleming is out in the wild.

Somebody correct me if I am wrong.


#4

You are both right :smiley: Some of these things will not be in Fleming, but do have to be considered for Fleming. For instance, we had two hangouts yesterday and today on some of the proposed solutions for these, though taking them further will be post-Fleming. So yes, these are road-to-Beta maybe in terms of doing them, but hopefully we can answer them all beforehand.


#5

I hope you don’t plan for automatic updates? They are a big NO for me, simply for security reasons.

A voluntary network shutdown for an upgrade could be another reason to restart a network, but no, I don’t want MaidSafe implementing a kill switch.

If the network is really decentralized, then MaidSafe should have no power to force updates or to stop the network. This limits the possible modifications to compatible ones, but it doesn’t block the evolution of the network, as illustrated by the bitcoin network.


#6

Hello :slight_smile:

Whether peers upgrade automatically or manually, we will probably want a mechanism to provide updates. This is one aspect we have been looking at.

We are looking more at the way the network can recover when heavily disrupted, as it could be during a mass upgrade (depending on the process).


#7

In my considered opinion, the network has to be able to function with at least the previous version and the current (new) version. Ideally it would work even with a few versions’ variation among the nodes.

To attempt to run all nodes on one version, especially after releasing a new version, is folly and doomed to cause massive failure of the network.

  • It would only take one mistake of a certain kind and the network would never recover in a suitable way. It would require people to restart nodes with a different version in an attempt to recover.
  • Common security is to NEVER accept an update without verifying its worth, security and viability first. No matter how good the automatic checking systems are. Some people may accept it and trust the “system”, but anyone who has had anything to do with network updates or security will not.
  • The logistics of trying to only ever run one version at a time are overwhelming and amount to a network restart at a coordinated time. <---- This violates one of the fundamentals concerning Time. It would require the nodes/protocols to respect actual time in order to coordinate the restart, which cannot work in the real world.
  • etc

The best situation is to have upgrades tolerant of various versions. So the question should be “How many versions back can we support?”

That requires updates to be written in a tolerant way, specifying for each upgrade which versions will no longer be supported. And of course that support should span at least many months of upgrades.

It is also possible for the network to be segmented for a long period due to a country being segregated, and the chances of some data not being accessible are high. So the network needs to be able to recover that data once the nodes in that country return. (IIRC @tfa was one of those who showed this probability is very high.) If the nodes are not tolerant of older versions (at least 6-12 months), then that data is likely to be lost, since those returning nodes can no longer be part of the network, and that data/those files are permanently lost, breaking perpetuity.


#8

Do you think versioning could work / use solutions (e.g. canary deployments) in the same way micro-services are updated?


#9

100% agree with Neo.

Watching some other decentralized networks: if they are going to update to a non-compatible version, they first release one or two versions beforehand to make the transition compatible step by step. They always check whether the majority has the latest version, so they can hold back each release until the previous one is proven stable and about 90% of the network is ready for the non-compatible update.


#10

That of course excludes the one-single-version concept for the whole network at any particular time, since a subset is running the new version while the others are running the old version on the network.

It is, though, a method that could be used, and David has suggested something similar as a potential method.


#11

It is great to see these challenges being openly considered.

There is the old adage: “Be conservative in what you send and liberal in what you accept.” This may well be relevant to the topic of upgrades. I agree with the consensus expressed in this thread that single-event migrations are not feasible or desirable in a decentralised network.

One possible option would be to consider a sliding window of interoperable versions, such that any peer running a particular version is permitted to fully interoperate with any other node running a version within the permitted window. Extending this concept to include a wider sliding window that allowed nodes running older versions (up to a point) to continue to be part of the network without losing their identity, but which would only permit them to be a receiver and not a transmitter, may facilitate a more graceful upgrade process with less opportunity for the network to become disjointed.

Using the concept of a sliding window would afford the potential that a distant part of the network (in terms of peer hop connectivity) could be running a version that is incompatible with the local (pioneer) neighbourhood - but be bridged by a majority of nodes running versions whose sliding windows overlap both the pioneer and legacy versions.
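The sliding-window idea above can be sketched very simply: each node accepts peers whose version falls within some distance of its own, and nodes in the middle of the range bridge peers that are too far apart to talk directly. The window width is an invented parameter for illustration.

```python
# Sketch of the sliding-window interoperability idea described above.
# WINDOW is a hypothetical parameter, not a real SAFE Network constant.

WINDOW = 2  # accept peers up to 2 versions away (illustrative)

def interoperable(v1: int, v2: int) -> bool:
    return abs(v1 - v2) <= WINDOW

# A "pioneer" node on version 7 and a "legacy" node on version 3 cannot talk
# directly, but a bridge node on version 5 overlaps both windows.
print(interoperable(7, 3))  # False: too far apart
print(interoperable(7, 5))  # True
print(interoperable(5, 3))  # True: version 5 bridges the two
```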

What will be the motivation for a node to upgrade? Will this be a stick or a carrot or both? It sounds like @Jean-Philippe is suggesting a potential stick of losing identity - so what might be the carrot?

How will upgrade releases be controlled? Can anyone initiate the injection of an upgrade? In a truly decentralised network then the answer probably should be “yes”. Will consensus be used to agree that an upgrade be accepted into the network and for it to be rolled out? Again, this brings my thoughts back to the need for incentives: both for the proposer node (of an accepted upgrade) as well as pioneer nodes whose administrators put in the effort to assess and validate an upgrade’s worth, stability, security etc. As @tfa stated: automatic upgrades are a big NO - and I suspect that this may well be the position of a large proportion of administrators of non-consumer operated nodes - then my thoughts come back to the incentives for testing and validation.

Will there be the concept of roll-back? I can see the desirability for this for more than one reason:

  • a proposed upgrade does not gain consensus
  • a previously accepted upgrade is found to be less beneficial (stable, performant, secure etc.) than the previous version after it has reached consensus (but who can have the authority to make this decision to roll back? - or should it just roll forwards?)

I guess the network will distribute its own upgrade code. If the network were to test and validate the code automatically and, by consensus, upgrade automatically then it would have the potential to autonomously evolve - that is if nodes were to propose upgrades without human intervention. (Perhaps the incentive for submitting and testing upgrades shouldn’t be financial!) This may even lead to the network evolving in a way that purposefully excludes groups of nodes and culls certain content in a manner that we can not conceive at this time.


#12

PS. On the topic of node ageing and (trusted) elders - what is being considered to guard against the espionage concept of sleepers?
rfcs0045 leaves things a little open - even its mention of the potential to use blacklists raises questions about whether this could be used as an attack vector.


#13

Consideration of the Tesla Model would be useful here:

  1. Roll out the update to a small control group first. In Tesla’s case this is car owners who work for them. Analyze the feedback and make any necessary changes after a short period of time then re-release to the control group if necessary.
  2. Then expand the rollout to a slightly larger group, analyze feedback then, finally, send to the entire network.
  3. Updates should be voluntary for a certain period of time. After that time has expired the client will be required to update or be “orphaned”.

Not sure if this “controlled” rollout would be feasible in the Safe Network but maybe a variation of it could be implemented.


#14

The punishment of sleepers nodes is covered in the Datachain RFC.

Routing must punish nodes ASAP on failure to transmit a Link NodeBlock on a churn event. Links will validate on majority, but routing will require to maintain security of the chain by ensuring all nodes participate effectively. These messages should be high priority.


#15

Thx - though we might be talking at cross purposes, as the type of sleeper I was thinking about is more akin to the discussion in the section on Archive nodes in the Datachain RFC.

These more reliable nodes and will have a vote weight higher than a less capable node within a group. There will still require to be a majority of group members who agree on votes though, regardless of these high weighted nodes. This is to prevent attacks where nodes lasting for long periods in a group cannot collude via some out of band method such as publishing ID’s on a website and soliciting other nodes in the group to collude and attack that group.

If nodes can get in and gain a higher level of trust over time then they have the potential to be more disruptive in the future if they go rogue (esp. with collusion).

Routing must punish nodes ASAP on failure to transmit a Link NodeBlock on a churn event. Links will validate on majority, but routing will require to maintain security of the chain by ensuring all nodes participate effectively. These messages should be high priority.

Whilst it makes reference to “Routing must punish …” it doesn’t state the form of the punishment. Can you point me to where the consequences are documented?

P.S. The hypertext reference from the word NodeBlock in the quoted section of rfcs/0029-data-chains.md is broken.


#16

Will it be possible for the network to provide the update, while the nodes are updating?

In case I’m not asking that well… I hope that the updates will be available on the network… Could the network sort of roll over from <50% nodes updated to >50% nodes updated, all on the fly, while providing/hosting the update?

If updates happen automatically, would it likely be staggered? Something like 20% of the nodes at a time?


#17

Well, this is inherent in the concept of Node Ageing. Only those who have demonstrated their good behaviour over time reach the status of elders and participate in the consensus. Of course an elder has a greater capacity to harm, but it is also much less likely to do so.
In the end, preventing a sufficient number of evil elders from colluding in the same section is what will secure the network.

This part is under development and the punishment, as far as I know, has yet to be defined (removal, age reduction, warning,…)


#18

Thanks a lot for all the feedback, I’m really happy to see it :slight_smile:

Definitively a lot of good points and we are keeping many of them in mind during our investigations.
With this overview post, I did not go into much detail on each aspect.

One thing I tried to do is to sketch the challenges and not really discuss the solutions we are considering, as we are still in the process, but also because each topic likely warrants its own post. It is also great to see all the directions this conversation is taking, bringing new thoughts to our team.


#19

I’m a really non-technical, non-code-literate sort of a person, but I’ve read a bit about the subject and I have tried NixOS lately. I don’t know if this will have any bearing on the issues you are discussing, but I have a feeling it may be worth reading as you consider possible update mechanisms for the SAFE network.
https://nixos.org/nix/about.html


#20

An interesting philosophical question would be to consider the following two positions:

A. in a decentralized network you trust no one
B. you have to trust someone to use their updates

Sounds paradoxical? The premise of updates seems to suggest there has to be some central figure of trust - let’s say MaidSafe. But again, if we start to introduce this sort of requirement on whom you can trust, some niceties of a decentralized network start to break down. If we stick to the principle and don’t do that, then anyone can try to advertise updates, and the consensus algorithm needs to be able to figure out which update is ‘real’.
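One common way to soften this paradox is to require that an update be endorsed by a quorum of known maintainer identities rather than a single publisher. The sketch below uses plain strings where a real system would use cryptographic signatures; the names and threshold are invented for illustration.

```python
# Toy sketch of quorum-endorsed updates: an update is accepted only when a
# threshold of known maintainer identities have endorsed it. Real systems
# would verify cryptographic signatures; plain strings stand in for them.

KNOWN_MAINTAINERS = {"maintainer-a", "maintainer-b", "maintainer-c"}
QUORUM = 2  # hypothetical threshold

def accept_update(endorsements: set) -> bool:
    valid = endorsements & KNOWN_MAINTAINERS  # ignore unknown endorsers
    return len(valid) >= QUORUM

print(accept_update({"maintainer-a", "maintainer-b"}))  # True
print(accept_update({"maintainer-a", "stranger"}))      # False
```

This shifts the trust question from one central figure to the set of keys and the threshold, which could themselves be managed by the network’s consensus, though that simply moves the problem rather than fully solving it.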

Another way of solving this paradox might be to build updates as a mechanism outside the autonomous Network itself - e.g., you have to manually download the update program from somewhere, make sure it’s legit, and then run it across your client apps. And of course this doesn’t sound very nice or safe.