Laptop vs Desktop Performance
For comparison, here’s the difference in performance between a desktop i7-7700 and a laptop i7-4500U.
The laptop exhibits behaviour the desktop didn't: it starts slowing down with larger safenetworks. At low client loads there's no slowdown, but once the client load gets high the larger safenetwork becomes about 30% slower than the smaller one.
The heatmaps all use the same scale of 0-2000s; anything above 2000s is rendered as the same shade of red. This allows colours to be compared directly between charts, and shows an 'encroaching red' for slower machines vs a 'spreading green' for faster machines.
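As a rough sketch of how such a clamped scale works (assuming matplotlib and a hypothetical 2-D array of run times; this is not the actual charting code used for these results):

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder grid of run times (s); rows = vault counts, cols = client counts.
times = np.random.uniform(100, 3000, size=(8, 8))

# vmin/vmax clamp the colour scale to 0-2000s: anything above 2000s
# saturates at the same shade of red, so colours compare across charts.
plt.imshow(times, cmap="RdYlGn_r", vmin=0, vmax=2000)
plt.colorbar(label="time to run (s)")
plt.xlabel("clients")
plt.ylabel("vaults")
plt.show()
```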
i7-7700 Desktop
i7-4500U Laptop
There’s definitely some maximum combination of vaults+clients, but ‘lots of clients’ hits the limit faster than ‘lots of vaults’.
This is sort of reassuring, since the safenetwork should ideally be pretty efficient at looking after itself (routing messages, verifying signatures etc.) and should spend most of its time looking after user data. The characteristics of the two scaling factors (clients vs network size) seem to suggest this is the case.
To directly address your question @andreruigrok, this behaviour is expected (for the reason above). The stress test is specifically designed to stress the network. Seeing the safenetwork increase in size without being additionally stressed is very positive, and seeing safenetwork stress double when client stress doubles is expected. If this were a 'normal client test' I'd be worried, since there should be spare capacity to absorb that extra load, but it's a client stress test, so the extra load is by design. If the safenetwork stress grew exponentially with client load that would be worrying, but it grows linearly, so I consider it a reasonable result.
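One simple way to check the linear-vs-exponential claim is to fit the log-log slope of run time against client count. A minimal sketch with placeholder numbers (not my measured data):

```python
import numpy as np

# Placeholder client counts and run times (s) - not the measured results.
clients = np.array([1, 2, 4, 8, 16])
run_time = np.array([50, 98, 205, 410, 830])

# Fit log(time) against log(clients): a slope near 1 means time grows
# linearly with client load; a slope well above 1 would be superlinear.
slope, _ = np.polyfit(np.log(clients), np.log(run_time), 1)
print(f"scaling exponent: {slope:.2f}")  # ~1.0 for this placeholder data
```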
Clients are usually distributed among many machines. So in real life the doubling of clients on any single machine is very unlikely. If client load started becoming a bottleneck, more machines would be added to the network and that would rapidly reduce the client load.
At the end of this post is an exploration of large local networks.
AWS Performance
The table below shows the performance of different AWS EC2 VM sizes (using the test with 32 vaults and 20 clients). The order of this table changes depending on which network size / load is selected, so the choice of 32/20 is somewhat arbitrary.
Rank  Device           Time To Run (s)
1     m5.4xlarge        206
2     t2.2xlarge        300
-     i7-7700 desktop   309
3     m5.2xlarge        399
4     m5.xlarge         804
-     i7-4500U laptop  1182
5     m5.large         1628
6     t2.xlarge        2560
7     t2.large         4077
m5.4xlarge
m5.2xlarge
m5.xlarge
m5.large
t2.2xlarge
t2.xlarge
t2.large
The m5 instances performed much more consistently than the t2 instances as the load increased. The inconsistency of t2 instances is not surprising considering they're burstable performance instances. Once a t2 runs out of CPU credits it slows down to about 20% of burst speed. The effect of running out of CPU credits can be seen really clearly in the chart below at around chunk 16, where speed drops from 3s per chunk to 15s per chunk.
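A tiny sketch of how that knee can be picked out of the per-chunk timings (the numbers below are placeholders, not the actual measurements):

```python
# Placeholder per-chunk times (s): ~3s while CPU credits last, ~15s after.
chunk_times = [3.1, 2.9, 3.0, 3.2, 3.0] * 3 + [15.2, 14.8, 15.5] * 5

baseline = sum(chunk_times[:5]) / 5  # burst-speed baseline from early chunks
for i, t in enumerate(chunk_times):
    if t > 3 * baseline:  # flag the first chunk markedly slower than burst
        print(f"slowdown starts around chunk {i}: {t:.1f}s per chunk")
        break
```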
It's satisfying to see that performance tracks compute power almost exactly across the m5 instances. The m5.2xlarge is twice as powerful and twice as fast (and twice as expensive) as the m5.xlarge.
The choice of instance size really depends on the client load: m5 is better for consistently high loads, while t2 is (not surprisingly) better for intermittent loads. Since the real-world client load is hard to know, it's not yet possible to say which instance size would be the best choice. In theory, thanks to the even distribution of names for vaults, clients and chunks, the load should be fairly even and consistent.
For my future testing I'll be using m5 instances, since they provide better consistency and I don't need to wonder whether CPU credits were a factor in the results.
t2.medium (and smaller) instances took a very long time to run the test, so they were not included in the results.
Cost Efficiency
Which size is the best value, i.e. most efficient per dollar? The Cost Per Test column gives an idea of which instance size offered the most work per dollar (the calculation is sketched after the table). Pricing is available at EC2 Instance Pricing.
The cost efficiency changes depending on the client load, so the choice of 32 vaults 20 clients is somewhat arbitrary.
Rank  Device       Time (s)   $/h     Cost Per Test ($)
1     t2.2xlarge    300       0.3712  0.0309
2     m5.2xlarge    399       0.3840  0.0426
3     m5.xlarge     804       0.1920  0.0429
4     m5.large     1628       0.0960  0.0434
5     m5.4xlarge    206       0.7680  0.0439
6     t2.large     4077       0.0928  0.1051
7     t2.xlarge    2560       0.1856  0.1320
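The Cost Per Test column is just run time multiplied by the hourly price. A small sketch reproducing it from the figures above:

```python
# (time_to_run_s, on-demand $/h) pairs taken from the table above.
instances = {
    "m5.4xlarge": (206, 0.7680),
    "t2.2xlarge": (300, 0.3712),
    "m5.2xlarge": (399, 0.3840),
    "m5.xlarge": (804, 0.1920),
    "m5.large": (1628, 0.0960),
    "t2.xlarge": (2560, 0.1856),
    "t2.large": (4077, 0.0928),
}

# Cost per test = hours taken x hourly price; sort cheapest first.
for name, (time_s, per_hour) in sorted(
        instances.items(), key=lambda kv: kv[1][0] / 3600 * kv[1][1]):
    print(f"{name:<11} ${time_s / 3600 * per_hour:.4f} per test")
```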
Because the t2.2xlarge was always in 'burst mode' it was very cost effective. But if it had consumed all its CPU credits it would drop down the list, and the m5 instances would be much more cost effective.
It's also important to restate that these results only hold for this test; once network factors such as latency and bandwidth, and client factors such as intermittent demand, are included, the results may change.
Large Local Networks
I tried starting up some larger networks on my local machine and ran a single client_stress_test to see how big the network can get locally. It takes a long time to start large networks, so I'll update this later with additional tests for 2000, 4000 and 8000 node networks. I really want to find the point where network size alone doubles the run time.
It’s amazing that increasing the safenetwork size doesn’t add much load. Almost all the work of the safenetwork is spent managing client data, not managing relations between vaults.
CPU load during the stress test was around 50%, the same as in all other 1-client tests.
Some vaults failed to start in large networks, with the reason: "All 13 resource proof responses fully sent, but timed out waiting for approval from the network. This could be due to the target section experiencing churn. Terminating node." There were no sustained spikes in CPU usage and resource proof was disabled in the routing config, so I'm not sure why this happened.
Reflections
It’s fascinating to see the behaviour. I’m very interested to see how things evolve on the live network since I can’t predict what will be the optimum setup. I guess lots of measurement will be important to vault operators trying to optimise their setup. But crucially, what they measure will depend on how safecoin incentives are defined, so that’s another factor in the mix. It’s a very different beast to bitcoin mining. I can’t wait to see what the nerds do when they get their hands (vaults?!) on a live network.
Considering my local machine can write 200MB from /dev/urandom to disk in 1s, but uploading 200MB on the minimum viable local safenetwork (8 vaults, 1 client) takes 76 seconds, it gives me a lot of hope for massive improvements to vault performance (maybe 50x). Some basic accounting of (total messages x signature verification time) + (total chunks x hashing time) + ... would give an indication of what an ideal minimum might be, but I haven't done that yet! I'm not really looking for performance improvements yet since features need to be completed first, but I'll definitely be excited when that time comes. I'm just happy to have established a yardstick that can be compared against.
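For what it's worth, here's what that accounting might look like as code. Every figure below is a placeholder assumption (message counts, per-operation times and the ~1MB chunk size are all guesses, not measurements), so treat it as a template rather than a result:

```python
# All figures here are illustrative assumptions, not measurements.
total_messages = 100_000   # hypothetical messages handled during the upload
sig_verify_s = 100e-6      # hypothetical time to verify one signature
total_chunks = 200         # 200MB uploaded as ~1MB chunks (assumed size)
hash_s = 5e-3              # hypothetical time to hash one chunk

# (total messages x signature verification time) + (total chunks x hashing time)
ideal_minimum_s = total_messages * sig_verify_s + total_chunks * hash_s
print(f"ideal minimum: ~{ideal_minimum_s:.0f}s (vs 76s measured)")
```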