We are taking down TEST4 now as it has served its purpose (we thought it would only last a few days at most, but it has been more persistent than that).
Findings of TEST4:
- Breaking messages up into much smaller chunks before sending has been successful (a chunking sketch follows this list).
- Using OS-provided event notification for socket handling (via the mio crate) has driven down resource usage considerably (an event-loop sketch follows this list).
- Not sending GROUP_SIZE messages each time has reduced traffic considerably.
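To make the chunking finding concrete, here is a minimal Rust sketch of splitting a serialised message into fixed-size parts before sending. The `split_message` helper is hypothetical, not the routing crate's actual API; the 20 KB constant is the current chunk size mentioned later in this update.

```rust
/// Illustrative only: 20 KB is the chunk size mentioned below;
/// `split_message` is a hypothetical helper, not routing's API.
const CHUNK_SIZE: usize = 20 * 1024;

fn split_message(payload: &[u8]) -> Vec<&[u8]> {
    // `chunks` yields slices of at most CHUNK_SIZE bytes;
    // the final part may be shorter.
    payload.chunks(CHUNK_SIZE).collect()
}

fn main() {
    let msg = vec![0u8; 50 * 1024];
    let parts = split_message(&msg);
    assert_eq!(parts.len(), 3); // 20 KB + 20 KB + 10 KB
    println!("sending {} parts", parts.len());
}
```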
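And a minimal sketch of the event-loop style mio enables: one thread waits on OS readiness notifications (epoll/kqueue) instead of dedicating a thread per socket. This uses the current mio `Poll` API, which differs from the version in use at the time of this update; the address and capacity are arbitrary.

```rust
use mio::net::TcpListener;
use mio::{Events, Interest, Poll, Token};

const SERVER: Token = Token(0);

fn main() -> std::io::Result<()> {
    let mut poll = Poll::new()?; // wraps epoll/kqueue under the hood
    let mut events = Events::with_capacity(128);
    let mut listener = TcpListener::bind("127.0.0.1:5483".parse().unwrap())?;
    poll.registry()
        .register(&mut listener, SERVER, Interest::READABLE)?;

    loop {
        // Block until the OS reports readiness; no per-socket threads.
        poll.poll(&mut events, None)?;
        for event in events.iter() {
            if event.token() == SERVER {
                // A production loop would accept until WouldBlock.
                let (_stream, peer) = listener.accept()?;
                println!("accepted connection from {}", peer);
            }
        }
    }
}
```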
All in all, we have reached a point where resource usage is more than acceptable for now. It is also nice to see fully static binaries for Linux and ARM making a long overdue appearance.
We still have not focussed on data persistence, but that is about to change.
The next steps
We will now carry out more tests over today and the weekend, until Tuesday when TEST5 will hopefully start. We are looking at testing the following immediately on droplets before releasing to the community again:
- Swapping the service discovery to not start vaults if it finds another close by. This is the reverse of its purpose, but beneficial for this test. In the last test we still had too many nodes per machine, to the test's detriment.
- Testing the rewrite of nat_traversal (it should be as good as it was, but still needs to go further).
- Never dropping messages related to data relocation, even under high load. We need to evaluate whether this will help prevent data loss or hurt the network by increasing traffic.
- Re-enabling caching, which has been reimplemented to work with split messages now.
- Re-enabling the bootstrap cache. This will allow nodes to find other bootstrap nodes without requiring us to keep droplets running, which also means community tests can continue almost automatically (a sketch follows this list).
- Reducing message chunks even further, below the 20 KB we currently use.
- Altering group and quorum sizes to confirm resilience (see the accumulator sketch after this list).
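On the bootstrap cache item: the idea is simply to persist the addresses of peers a node has successfully connected to, so it can rejoin after a restart without a seed droplet. A minimal sketch, with an assumed file path and plain-text format rather than crust's actual implementation:

```rust
use std::fs;
use std::net::SocketAddr;

/// Assumed path and format; illustrative only.
const CACHE_PATH: &str = "bootstrap_cache.txt";

/// Persist the addresses of peers we have successfully connected to.
fn save_cache(peers: &[SocketAddr]) -> std::io::Result<()> {
    let lines: Vec<String> = peers.iter().map(|p| p.to_string()).collect();
    fs::write(CACHE_PATH, lines.join("\n"))
}

/// On restart, reload cached peers and try them before any hard-coded seed.
fn load_cache() -> Vec<SocketAddr> {
    fs::read_to_string(CACHE_PATH)
        .map(|s| s.lines().filter_map(|l| l.parse().ok()).collect())
        .unwrap_or_default()
}

fn main() -> std::io::Result<()> {
    let peers: Vec<SocketAddr> = vec!["203.0.113.7:5483".parse().unwrap()];
    save_cache(&peers)?;
    println!("cached peers: {:?}", load_cache());
    Ok(())
}
```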
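On group and quorum sizes: a group of GROUP_SIZE nodes manages each part of the address space, and the network acts on a message once a quorum of distinct group members has sent it. Here is a minimal vote-accumulator sketch; the constants and the `u64` node ids are illustrative, and the real values are exactly the parameters these tests will vary.

```rust
use std::collections::HashSet;

/// Illustrative values only; the real constants live in the routing crate.
const GROUP_SIZE: usize = 8;
const QUORUM_SIZE: usize = 5;

/// Accumulates votes for a single message from group members.
struct Accumulator {
    voters: HashSet<u64>, // ids of nodes that have sent this message
}

impl Accumulator {
    fn new() -> Self {
        Accumulator { voters: HashSet::new() }
    }

    /// Record a vote; true once QUORUM_SIZE distinct members have voted.
    fn add_vote(&mut self, node_id: u64) -> bool {
        self.voters.insert(node_id);
        self.voters.len() >= QUORUM_SIZE
    }
}

fn main() {
    let mut acc = Accumulator::new();
    for id in 0..GROUP_SIZE as u64 {
        if acc.add_vote(id) {
            println!("quorum reached after {} votes", acc.voters.len());
            break;
        }
    }
}
```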
Two main issues we want to address fairly urgently:
Small-capability nodes. This is when a vault is so under-resourced that it can only damage the network. The network has to cut off such nodes (they may fluctuate in capability, so this is non-trivial). We have reduced the minimum requirement, but have not taken any steps to address the root issue of maintaining a minimum level of resource. The network should and will measure and enforce this.
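As a rough illustration of what measuring and maintaining a minimum resource level could look like, here is a hypothetical check; the fields, thresholds, and policy are assumptions, not the network's actual rules:

```rust
use std::time::Duration;

/// Hypothetical resource probe for a vault; names and thresholds
/// are illustrative, not the network's actual policy.
struct NodeStats {
    free_disk_bytes: u64,
    avg_response: Duration,
}

const MIN_DISK: u64 = 1 << 30; // assumed 1 GiB floor
const MAX_RESPONSE: Duration = Duration::from_millis(500); // assumed

/// Capability fluctuates, so a real network would sample repeatedly and
/// only cut off a node that fails several consecutive probes.
fn meets_minimum(stats: &NodeStats) -> bool {
    stats.free_disk_bytes >= MIN_DISK && stats.avg_response <= MAX_RESPONSE
}

fn main() {
    let weak = NodeStats {
        free_disk_bytes: 1 << 20, // 1 MiB
        avg_response: Duration::from_secs(2),
    };
    assert!(!meets_minimum(&weak)); // such a node would be disconnected
}
```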
Data retention. We require the network to be able to restart and republish all data. This is linked to the first issue above, as potentially not all nodes need to store all the data in the group (they should not have to). There are many advantages to this approach, but the focus here is on protecting against data loss and on the security of data. We will present further advantages soon; for now, though, the focus is on data retention and security.
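One way to let a subset of a group hold each piece of data is to pick the k group members whose ids are closest, by XOR distance as in Kademlia-style routing, to the data's name. A sketch under those assumptions; `k`, the `u64` ids, and the metric are illustrative:

```rust
/// Sketch: choose the k group members whose ids are closest to the data
/// name by XOR distance. All names and values here are assumptions.
fn holders(mut group: Vec<u64>, data_name: u64, k: usize) -> Vec<u64> {
    group.sort_by_key(|&id| id ^ data_name);
    group.truncate(k);
    group
}

fn main() {
    let group = vec![0x1a, 0x2b, 0x3c, 0x4d, 0x5e, 0x6f, 0x70, 0x81];
    // Only 4 of the 8 group members hold this piece of data.
    let chosen = holders(group, 0x33, 4);
    println!("holders: {:02x?}", chosen);
}
```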
These may happen during Alpha and possibly not make it into Alpha 1, but this is what the tests are all about. Alpha, though, will include installers and more. So when we have a "download here" button we will be in Alpha; we will also let everyone know in advance.
Refactors currently happening (to catch up with test findings and fixes)
- Splitting the routing client from the node in core.rs.
- Moving much of the core.rs handling into a peer_manager, to handle peer-related states in routing more simply and robustly (a hypothetical sketch follows this list).
- Having safe_core pass back more information to the API, allowing Launcher to display much better user feedback.
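For a rough sense of the peer_manager refactor, here is a hypothetical sketch of per-peer state tracking pulled out of core.rs; the variants and fields are invented for illustration and are not routing's actual types:

```rust
use std::collections::HashMap;

/// Invented states for illustration; not routing's actual types.
#[derive(Debug)]
enum PeerState {
    Bootstrapping,
    Connecting,
    Connected,
    Routing,
}

#[derive(Default)]
struct PeerManager {
    peers: HashMap<u64, PeerState>, // keyed by peer id
}

impl PeerManager {
    /// Centralising transitions here keeps core.rs free of peer bookkeeping.
    fn set_state(&mut self, peer: u64, state: PeerState) {
        self.peers.insert(peer, state);
    }
}

fn main() {
    let mut pm = PeerManager::default();
    pm.set_state(42, PeerState::Connecting);
    pm.set_state(42, PeerState::Routing);
    println!("{:?}", pm.peers.get(&42));
}
```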
Thanks again folks