The question I would be asking is how many times you are going to be changing any of the actual protocols. Bug fixes to a protocol do not usually require a change to the protocol version, since the protocol must have been working well enough previously. If it is like TCP/IP (the IP header's version field is 4 bits) then 4 bits would be enough, but if you expect regular changes then maybe 16 bits would be best, as you really do not want to wrap back to zero if at all possible, because that will bite you hard one day. Well, maybe not you, but your children or later programmers.
Most definitely, even to the point of knowing whether the peer can accept the later version of that protocol. Maybe in the handshake you send the highest and lowest versions you can handle, and so the 2 peers talk in the highest version that both recognise.
And later versions can remove code for lower versions that are no longer in the network.
Obviously this may sometimes end up with an elder or farmer unable to function if they “never” upgrade and some features of the new protocol are essential. I guess at that stage the upgraded nodes would not accept packets/messages that are that many versions old.
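To make the field-size point concrete, here is a minimal sketch of a message header with a 16-bit version field. The layout (version, message type, payload length) and all the names are assumptions for illustration only, not the project's actual wire format.

```python
import struct

# Hypothetical header layout (an assumption, not the real wire format):
# 16-bit protocol version, 16-bit message type, 32-bit payload length,
# all in network (big-endian) byte order.
HEADER = struct.Struct("!HHI")

# A 4-bit field allows 16 versions before wrapping; 16 bits allows 65536.
MAX_VERSION = 0xFFFF


def pack_header(version, msg_type, payload_len):
    """Build the fixed-size header, refusing versions that would wrap."""
    if not 0 <= version <= MAX_VERSION:
        raise ValueError("protocol version does not fit in 16 bits")
    return HEADER.pack(version, msg_type, payload_len)


def unpack_header(data):
    """Return (version, msg_type, payload_len) from the start of data."""
    return HEADER.unpack(data[:HEADER.size])
```

Refusing an out-of-range version outright, rather than letting it wrap silently, is the design choice here: a wrap back to zero would make a very new node look like a very old one.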
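A rough sketch of that handshake idea, assuming each peer advertises an inclusive (lowest, highest) range of protocol versions. The function name and the integer version numbers are made up for the example.

```python
def negotiate(local_min, local_max, peer_min, peer_max):
    """Return the highest version both peers support, or None if none overlaps.

    Each side advertises an inclusive range of protocol versions it can speak.
    """
    low = max(local_min, peer_min)    # lowest version both sides accept
    high = min(local_max, peer_max)   # highest version both sides accept
    return high if low <= high else None
```

For example, a node speaking versions 3 to 7 and a peer speaking 5 to 9 would settle on 7, while a node that "never" upgraded and only speaks 1 to 2 would get `None` from that same peer, which is the case where the upgraded node refuses the connection.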
You will potentially have a collapse of the network on each upgrade, and the probability of one (any one) section failing is way too high.
You need to support the previous version of the protocols no matter the method of upgrade, and I’d suggest supporting more than one previous version if the changes are more frequent. This is extremely important since nodes can restart after a significant time period (e.g. a large block of the internet is segmented by cable cuts or by government). You definitely need to support the previous version and any versions less than 6 or 12 months old (excepting a seriously faulty one).
Ah, you recognise it too. Restarts can also be caused by things other than an upgrade, and can come after a period of time too, as happens with a cut-off block of the internet.
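One way that support rule could be expressed, as a sketch only: keep the current version, always keep the one before it, keep anything released inside the window, and drop any version flagged as seriously faulty. The release table, the 365-day default and the `faulty` set are all assumptions for the example.

```python
from datetime import date, timedelta


def supported_versions(releases, today, window_days=365, faulty=frozenset()):
    """Return the sorted list of protocol versions a node should still accept.

    releases: dict mapping version number -> release date (hypothetical data).
    """
    ordered = sorted(releases)
    keep = {ordered[-1]}              # the current version
    if len(ordered) > 1:
        keep.add(ordered[-2])         # always keep the previous version
    cutoff = today - timedelta(days=window_days)
    # Anything released inside the window stays supported as well.
    keep.update(v for v, released in releases.items() if released >= cutoff)
    # A seriously faulty version is dropped regardless of its age.
    return sorted(keep - set(faulty))
```

With a 6 month policy you would pass `window_days=183` or thereabouts; the point is that the previous version survives even when it is older than the window.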
Here is a suggestion.
Since you are storing the state of the node in case a restart of the node s/w is needed, and you want a seamless cutover, do what is done in the power industry when spinning down a generator and replacing it with another. You have both generators spun up and synchronised, then remove the generator you wanted spun down when the voltage/current is crossing the zero line.
Translated into terms of the nodes, this is:
- The current node is running.
- An "upgrade available" message propagates through the network with details of location, checksum, authentication etc.
- The current node initiates a download of that software, which includes an install script.
- The installation uses a version-specific directory so as not to interfere with the running node.
- The state is kept in another directory so that it does not live in the node s/w directory.
- The current node verifies the new version using the details that are in the update messages.
- The current node starts the new version in a special idle state.
- The new node is not communicating but initialises itself ready to start.
- It reads the current state so that its state matches the current node’s state.
- Once the current node receives a signal from the new node that it has synchronised, it waits for a suitable moment.
- At the moment the current node determines that it can hand over operations to the new node, it signals the new node to take over.
- The current node does no more communicating with the other nodes on the network.
- The new node now does the communications to the network.
- At this point you could get creative and have the old node watch the new node to see if it continues to function.
- If the new node dies or shows some unexpected behaviour, the now-previous node could kill -9 the new node and resume operations.
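The handover part of those steps can be sketched as a tiny in-process simulation. This is only an illustration under heavy assumptions: a real node would be separate OS processes talking over some IPC channel, and `Node`, `handover` and `is_healthy` are hypothetical names, not anything from the actual node software.

```python
class Node:
    """Stand-in for one version of the node software."""

    def __init__(self, version, active=False):
        self.version = version
        self.active = active   # True only while talking to the network
        self.state = None      # in-memory copy of the persisted node state
        self.alive = True

    def sync_from(self, stored_state):
        """Load the shared state while idle; True means 'I have synchronised'."""
        self.state = dict(stored_state)
        return True


def handover(old, new, stored_state, is_healthy):
    """Run the cutover and return whichever node ends up active."""
    # The new version starts idle: it reads the state but does not communicate.
    if not new.sync_from(stored_state):
        return old
    # At a suitable moment the old node goes quiet and the new one takes over.
    old.active = False
    new.active = True
    # The old node keeps watching; on failure it kills the new node and resumes.
    if not is_healthy(new):
        new.alive = False      # stands in for `kill -9` on the new process
        new.active = False
        old.active = True
        return old
    return new
```

Calling `handover(Node("1.0", active=True), Node("1.1"), state, check)` with a `check` that returns `False` exercises the rollback path, which is worth testing at least as hard as the happy path.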
EDITS: fix my (engineer) bad grammer & speeling, still not perfect but hopefully readable.