This one is quite a big topic, and while we haven’t answered everything, we wanted to share what we have. It’s also not entirely in the scope of Fleming, but we spent some time thinking about the issue to make sure that we don’t burn any bridges before we get to designing a full solution in the future.
SAFE's goal is to be a reliable data and communications network. Reliability means that even if a large part of the network crashes, disconnects, or suffers some other kind of catastrophic failure, it should be able to come back online and continue providing the data it was storing before the failure.
This is not straightforward. There are two main challenges to consider:
- After the failure, we need to restore trust. When nodes come back online, we need to be able to make sure that these nodes are legitimate members of the Network and can be trusted to restore the pre-crash sections to normal operation.
- We need to restore data. We have to ensure that the data the restored nodes are attempting to republish has actually been legitimately put on the Network before.
During normal operation of the Network, we maintain trust by keeping Data Chains and running a consensus algorithm - PARSEC. But when the Network is failing, these things might not be enough - if more than one third of the nodes in a section go offline at roughly the same time, the consensus algorithm will be stalled and the section won’t be able to process any requests. The Network will stop functioning.
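To make that threshold concrete, here is a minimal sketch (our own function names, not the actual PARSEC API) of the condition a section effectively depends on:

```rust
/// A section can keep reaching consensus only while no more than a third of
/// its members are offline; "more than one third offline" is the point at
/// which it stalls. This is an illustrative check, not real PARSEC code.
fn can_reach_consensus(total_members: usize, offline_members: usize) -> bool {
    3 * offline_members <= total_members
}

fn main() {
    assert!(can_reach_consensus(9, 3)); // exactly a third offline: still OK
    assert!(!can_reach_consensus(9, 4)); // more than a third offline: the section stalls
}
```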
So, we need some special provisions to handle such events correctly. What can we do?
Attempts at a solution
More persistent data
First things first - if a node wants to rejoin the Network in the same section it was in before it lost connection, it needs to know what section that was and what its identity was. This means that we need to persist more data to disk than we currently do (a rough sketch of such a persisted state follows the list), namely:
- the node’s public ID (and the corresponding private key),
- the node’s routing table, complete with the section’s and neighbouring sections’ members’ endpoints,
- the parts of the data chain the node is responsible for holding, which is important for restoring trust and keeping data about the nodes’ age,
- all the node’s active PARSEC instances - there might be multiple ones during splits and merges, corresponding to the pre- and post-change sections,
- the node’s stored data,
- the node’s endpoint information.
This is fortunately a relatively simple thing to do.
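As a rough sketch, the persisted state could be gathered into a single structure along these lines; the type names here are placeholders for illustration, not the actual types used in the codebase:

```rust
use std::net::SocketAddr;

// Placeholder types standing in for the real routing and PARSEC structures.
type FullId = Vec<u8>; // public ID plus the corresponding private key material
type RoutingTableSnapshot = Vec<(Vec<u8>, SocketAddr)>; // member IDs with their endpoints
type DataChainFragment = Vec<u8>; // the chain blocks this node is responsible for
type ParsecSnapshot = Vec<u8>; // serialised state of one PARSEC instance
type Chunk = Vec<u8>; // one stored data chunk

/// Everything a node would write to disk so that it can later rejoin its old section.
#[allow(dead_code)]
struct PersistedNodeState {
    full_id: FullId,
    routing_table: RoutingTableSnapshot,
    data_chain: DataChainFragment,
    /// There may be several instances around splits and merges.
    parsec_instances: Vec<ParsecSnapshot>,
    stored_data: Vec<Chunk>,
    endpoint: SocketAddr,
}
```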
Handling nodes leaving
So the Network sees that some nodes are losing their connections. In most cases the remaining nodes won't be able to tell right away whether this is the beginning of a larger failure or just regular churn. To be prepared for both scenarios, we need to make sure that if we lose most of the nodes and they start returning, we will be able to resume normal operation.
This prompted us to consider options such as the following: when a node leaves and we vote it offline, instead of removing it from our routing table completely, we mark it as offline. An offline node will only be removed from the routing table when it is the youngest offline node and a new node joins. We can then also roughly measure the health of the Network by the number of offline nodes we have (if there are many of them, it means that more nodes have been leaving than joining for some time).
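A minimal sketch of this idea, with our own names and with members keyed by age purely for illustration (the real routing table is keyed by node IDs), might look like this:

```rust
use std::collections::BTreeMap;

#[derive(Clone, Copy, PartialEq)]
enum MemberState {
    Online,
    Offline,
}

/// Section membership where leaving nodes are marked offline rather than removed.
struct SectionMembers {
    // Keyed by age, so that "youngest" is simply the smallest key in this sketch.
    members: BTreeMap<u8, MemberState>,
}

impl SectionMembers {
    /// Voting a node offline only flips its state; the entry stays in the table.
    fn mark_offline(&mut self, age: u8) {
        if let Some(state) = self.members.get_mut(&age) {
            *state = MemberState::Offline;
        }
    }

    /// When a new node joins, the youngest offline member is finally removed.
    fn handle_new_member(&mut self, age: u8) {
        let youngest_offline = self
            .members
            .iter()
            .find(|(_, state)| **state == MemberState::Offline)
            .map(|(a, _)| *a);
        if let Some(old) = youngest_offline {
            self.members.remove(&old);
        }
        self.members.insert(age, MemberState::Online);
    }

    /// A rough health measure: many offline entries mean more nodes have been
    /// leaving than joining for some time.
    fn offline_count(&self) -> usize {
        self.members
            .values()
            .filter(|state| **state == MemberState::Offline)
            .count()
    }
}
```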
Handling nodes joining
From a section’s point of view
Say that a node tries to join our section. If it is rejoining after a brief disconnection, how do we ensure that it regains its status instead of starting from scratch?
There are two sides to this. One is that we must be able to recognise that this node is not new and is in fact rejoining. It would therefore need an opportunity to claim that it has been a member before (which means modifications to the structure of the messages exchanged upon joining), and we need to be able to verify that this claim is true - which is where the way of handling leaving described above helps. This is also related to node ageing: we want rejoining nodes to be able to reclaim half of the age they accumulated before leaving the Network, so that they don't have to start accumulating age from scratch.
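The age rule could be as simple as the sketch below; the constant for a brand-new node's age and the exact rounding are our assumptions for illustration:

```rust
// Our placeholder for the age assigned to a brand-new node.
const INITIAL_AGE: u8 = 1;

/// A node that proves it was a member before reclaims half of the age it had
/// accumulated, instead of starting from scratch (illustrative formula only).
fn age_after_rejoin(previous_age: u8) -> u8 {
    (previous_age / 2).max(INITIAL_AGE)
}
```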
The other side is that it must be secure. Allowing rejoining could open us up to an attack in which someone buys a node identity (both private and public keys) from someone else. They could then bribe the owners of a few elders in a section, have them disconnect and use their identities to rejoin, thus hijacking the section. This shows that we need limitations on rejoining. One possibility we considered was limiting rejoining to nodes that reconnect from the same endpoint (IP address and port) - but many people currently have dynamically assigned IPs, so if their Internet connection goes down, they might have a different endpoint when it comes back up again. Also, we want to use random ports for the nodes for security reasons, which further complicates matters. A partial solution we came up with is to restart the nodes on both the old endpoint and a new one, if possible - the old endpoint would be used to restore the connections from before the crash, and the new one for establishing new connections.
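A sketch of that partial solution, using plain TCP from the standard library purely to keep the example self-contained (the real Network has its own transport layer), could look like this:

```rust
use std::net::{SocketAddr, TcpListener};

/// On restart, try to reclaim the endpoint used before the crash (to restore
/// old connections) in addition to a fresh, OS-assigned one (for new ones).
fn bind_endpoints(old_endpoint: SocketAddr) -> (Option<TcpListener>, TcpListener) {
    // Rebinding the old endpoint may fail, e.g. because our IP has changed,
    // so it is strictly best-effort.
    let old = TcpListener::bind(old_endpoint).ok();
    // Port 0 asks the OS for a random free port, standing in for the random
    // ports mentioned above.
    let new = TcpListener::bind("0.0.0.0:0").expect("could not bind any local port");
    (old, new)
}
```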
From the node’s point of view
Using the old endpoint alongside the new one would also help in another respect. Imagine that your whole section went down, but you regained Network access and are trying to reconnect to your old section buddies. The first thing you will probably try is contacting them on the same endpoints they had before the downtime - but what if that fails? If we allow for the possibility that their endpoints have changed, we need a way of finding their new endpoints. If, on the other hand, we can expect them to still be using their old endpoints, then a failed attempt simply means that they are, indeed, offline, and we wait for them or declare them lost.
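Under the assumption that peers keep listening on their old endpoints, the node-side logic reduces to something like this sketch (again using plain TCP only for illustration):

```rust
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

/// Try each stored peer endpoint once; anyone we cannot reach is simply
/// treated as still offline rather than searched for at a new address.
fn reconnect(peers: &[SocketAddr]) -> (Vec<SocketAddr>, Vec<SocketAddr>) {
    let timeout = Duration::from_secs(5);
    let mut reachable = Vec::new();
    let mut presumed_offline = Vec::new();
    for &peer in peers {
        match TcpStream::connect_timeout(&peer, timeout) {
            Ok(_) => reachable.push(peer),
            Err(_) => presumed_offline.push(peer),
        }
    }
    (reachable, presumed_offline)
}
```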
What about the data?
While data is not in the scope of Fleming, it is an important aspect of restarts. Say there was a failure, nodes lost their connections, but they are coming back online. Imagine that they are somehow reconnecting successfully, but at a given point only half of them, or even fewer, have come back. This means that the Network has less than half of the storage space it had before. If that space was more than 50% full, the Network won't be able to hold all the data from before the crash at this point. What should we do, then - should we drop some of the data, or should we wait for more nodes to come back? They might not come back, though, so how long do we wait? These are questions that we will be answering in the scope of Maxwell, when we design the precise way data is handled in the Network.
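To put made-up numbers on this situation, here is a trivial check of whether the returned nodes can still hold all the pre-failure data:

```rust
/// Back-of-the-envelope check: can the nodes that have returned still hold
/// all the data stored before the failure? All numbers below are made up.
fn data_still_fits(stored_gb: u64, capacity_per_node_gb: u64, nodes_returned: u64) -> bool {
    stored_gb <= capacity_per_node_gb * nodes_returned
}

fn main() {
    // 100 nodes with 100 GB each were 60% full, i.e. 6,000 GB of data stored.
    assert!(data_still_fits(6_000, 100, 60)); // 60 nodes back: it still fits
    assert!(!data_still_fits(6_000, 100, 50)); // only 50 back: 5,000 GB is not enough
}
```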
I hope this short write-up gave you some insight into the problems with Network restarts and how we intend to approach them. While the issues are far from fully resolved, devoting time to them has helped us understand the challenges that are still ahead while focussing on the tasks at hand. Next up, we'll be discussing Network upgrades, which is how we enable updating all the nodes' software while making sure that the Network keeps running correctly… Stay tuned!