Profiling vault performance


Laptop vs Desktop Performance

For comparison, here’s the difference in performance between a desktop i7-7700 and a laptop i7-4500U.

The laptop exhibits behaviour the desktop didn’t - it starts slowing down with larger safenetworks. At low client loads there’s no slowdown but once the client load gets high, the larger safenetwork becomes about 30% slower than the smaller network.

The heatmaps have all been set to use a scale between 0-2000s. Anything above 2000s will be the same shade of red. This allows direct comparison of colours between charts, and shows an ‘encroaching red’ for slower machines vs ‘spreading green’ for faster machines.

i7-7700 Desktop


i7-4500U Laptop


There’s definitely some maximum combination of vaults+clients, but ‘lots of clients’ hits the limit faster than ‘lots of vaults’.

This is sort of reassuring since the safenetwork should ideally be pretty efficient at looking after itself (routing messages, verifying signatures etc) and should spend most of the time looking after user data. The characteristics of the two scaling factors (clients vs networksize) seem to suggest this is the case.

To directly address your question @andreruigrok, this behaviour is expected (for the reason above). The stress test is specifically designed to stress the network. Seeing the safenetwork increase in size without being additionally stressed is very positive. Seeing double the amount of safenetwork stress when client stress is doubled is expected. If it were a ‘normal client test’ then I’d be worried since there should be spare capacity to take that extra load, but this is a client stress test so it’s expected. If the safenetwork stress grew exponentially with client load it would be worrying, but it grows linearly so I consider that a reasonable result.

Clients are usually distributed among many machines. So in real life the doubling of clients on any single machine is very unlikely. If client load started becoming a bottleneck, more machines would be added to the network and that would rapidly reduce the client load.

At the end of this post is an exploration of large local networks.

AWS Performance

The table below shows the performance of different AWS EC2 vm sizes (using the test for 32 vaults 20 clients). The order of this table changes depending on which networksize/load is selected, so the choice of 32/20 is kinda arbitrary.

Rank  Device           Time To Run
   1  m5.4xlarge       206
   2  t2.2xlarge       300
      i7-7700 desktop  309
   3  m5.2xlarge       399
   4  m5.xlarge        804
      i7-4500U laptop  1182
   5  m5.large         1628
   6  t2.xlarge        2560
   7  t2.large         4077

The m5 instances performed much more consistently than the t2 instances as the load increased. The inconsistency of t2 instances is not surprising considering they’re burstable performance instances. Once the t2 runs out of cpu credits it slows down to about 20% of burst speed. The effect of running out of cpu credits can be seen really clearly in the chart below at around chunk 16 where speed goes from 3s per chunk to 15s per chunk.

It’s satisfying to see that performance almost exactly matches power for all m5 instances. The m5.2xlarge is twice as powerful and twice as fast (and twice as expensive) as the m5.xlarge.
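As a quick check of that claim, here is a small sketch that computes the speedup for each step up in m5 size, using the run times from the table above. Each size in the list has twice the vCPUs of the one before it (2, 4, 8, 16), so a ratio near 2 means performance is tracking power. The names `M5_TIMES` and `speedups` are just for illustration.

```python
# Time to run the 32 vault / 20 client test, in seconds, from the table above.
# Each m5 size listed here has twice the vCPUs of the previous one.
M5_TIMES = [
    ("m5.large", 1628),
    ("m5.xlarge", 804),
    ("m5.2xlarge", 399),
    ("m5.4xlarge", 206),
]


def speedups(times):
    """Speedup gained by each step up in instance size."""
    return [
        (smaller, larger, t_small / t_large)
        for (smaller, t_small), (larger, t_large) in zip(times, times[1:])
    ]


for smaller, larger, ratio in speedups(M5_TIMES):
    print(f"{smaller} -> {larger}: {ratio:.2f}x faster")
```

All three steps come out within a few percent of 2x for this particular load (32 vaults, 20 clients); other loads may behave differently.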

The choice of instance size really depends on the client load. m5 is better for consistently high loads, but t2 is better (not surprisingly) for intermittent loads. Since the real-world client load is hard to know it’s not really possible just yet to say which instance size would be the best choice. In theory due to the even distribution of names for vaults / clients / chunks the load should be fairly even and consistent.

For my future testing I’ll be using m5 instances since they provide better consistency and I don’t need to wonder whether cpu credits were a factor in the results.

t2.medium (and smaller) instances took a very long time to run the test so were not included in the results.

Cost Efficiency

Which size is the best value, ie most efficient per dollar? The Cost Per Test column gives an idea of which instance size offered the best efficiency of work per dollar. Pricing is available at EC2 Instance Pricing.

The cost efficiency changes depending on the client load, so the choice of 32 vaults 20 clients is somewhat arbitrary.

Rank  Device      Time  $/h     Cost Per Test
   1  t2.2xlarge  300   0.3712  0.0309
   2  m5.2xlarge  399   0.3840  0.0426
   3  m5.xlarge   804   0.1920  0.0429
   4  m5.large    1628  0.0960  0.0434
   5  m5.4xlarge  206   0.7680  0.0439
   6  t2.large    4077  0.0928  0.1051
   7  t2.xlarge   2560  0.1856  0.1320
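
The Cost Per Test column can be reproduced from the other two columns. This is a minimal sketch, assuming the cost is simply run time multiplied by the hourly price (no minimum billing period, no credit surcharges); the function names are made up for the example.

```python
# Instance name -> (seconds to run the 32 vault / 20 client test, $ per hour),
# copied from the table above.
RESULTS = {
    "t2.2xlarge": (300, 0.3712),
    "m5.2xlarge": (399, 0.3840),
    "m5.xlarge": (804, 0.1920),
    "m5.large": (1628, 0.0960),
    "m5.4xlarge": (206, 0.7680),
    "t2.large": (4077, 0.0928),
    "t2.xlarge": (2560, 0.1856),
}


def cost_per_test(seconds, dollars_per_hour):
    """Cost of one test run, assuming billing is proportional to run time."""
    return seconds / 3600 * dollars_per_hour


def ranked_by_cost(results):
    """Return (instance, cost) pairs, cheapest first."""
    costs = {name: cost_per_test(t, p) for name, (t, p) in results.items()}
    return sorted(costs.items(), key=lambda item: item[1])


if __name__ == "__main__":
    for rank, (name, cost) in enumerate(ranked_by_cost(RESULTS), start=1):
        print(f"{rank}  {name:<11} {cost:.4f}")
```

Swapping in the times for a different networksize/load would show how the ranking shifts.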

Because the t2.2xlarge was always in ‘burst mode’ it was very cost effective. But if it had consumed all the cpu credits it would drop down the list and the m5 instances would be much more cost effective.

It’s also important to restate that these results are only true for this test. When network factors such as latency and bandwidth, and client factors such as intermittent demand, are included these results may change.

Large Local Networks

I tried starting up some larger networks on my local machine and ran a single client_stress_test to see how big the network can get locally. It takes a long time to start large networks, so I’ll update this later with additional tests for a 2000, 4000 and 8000 node network. I really want to get a doubling of time from network size.

It’s amazing that increasing the safenetwork size doesn’t add much load. Almost all the work of the safenetwork is spent managing client data, not managing relations between vaults.

CPU load during the stress test was around 50%, same as all other 1 client tests.

There were some vaults that didn’t start in large networks for the reason “All 13 resource proof responses fully sent, but timed out waiting for approval from the network. This could be due to the target section experiencing churn. Terminating node.” There were no sustained spikes in cpu usage and resource proof was disabled in the routing config, so I’m not sure why this happened.


It’s fascinating to see the behaviour. I’m very interested to see how things evolve on the live network since I can’t predict what will be the optimum setup. I guess lots of measurement will be important to vault operators trying to optimise their setup. But crucially, what they measure will depend on how safecoin incentives are defined, so that’s another factor in the mix. It’s a very different beast to bitcoin mining. I can’t wait to see what the nerds do when they get their hands (vaults?!) on a live network.

Considering my local machine can write 200MB from /dev/urandom to disk in 1s but uploading 200MB on the minimum viable local safenetwork (8 vaults 1 client) takes 76 seconds, it gives me a lot of hope for massive improvements to the vault performance (maybe 50x). Some basic accounting of (total messages x signature verification time) + (total chunks x hashing time) + ... would give an indication of what an ideal minimum might be, but I haven’t done that yet! I’m not really looking for performance improvements yet since features need to be completed first, but I’ll definitely be excited when that time comes. I’m just happy to have established some yardstick that can be compared against.


Fantastic news, it was kind of difficult to understand for me, so thanks for the explanation, very much appreciated!


Network Effect Tests

Having previously explored the bottlenecks around CPU usage, this test adds an extra factor - network effects. The two main network effects are

  • latency - delays caused by the time it takes a message to travel from one location to another
  • bandwidth - how much data can be sent from one place to another in a given time period

Test Conditions

These tests are all performed with 8 VMs on AWS EC2 using m5.large instances.

Each VM may run multiple vaults depending on the total network size, eg total network size of 64 means each VM runs 8 vaults.

A separate VM in Ohio is used for running the client stress tests. This VM never got above 10% cpu while running the stress tests, and was usually around 1% cpu, further clarifying that the main cpu load during tests is from vaults managing client data, not the clients themselves.

Prior Tests

As a reminder and as a point of reference, here’s the results with all the vaults running on a single m5.large VM. This has no impact from network effects.


Same Datacenter - Ohio

VMs are located in the same datacenter in Ohio.

Network effects should be latency of less than 1ms on LAN and bandwidth of 2 Gbps.


Same Continent - USA

VMs are located in the following 4 regions (2 VMs per region):


Network effects should be latency of up to 70ms between the east and west coast of USA and bandwidth of 2 Gbps.



VMs are located in the following 8 countries (1 VM per country):

Network effects should be latency of up to 400ms (typically the highest latency will be between Australia and Europe) and bandwidth of 2 Gbps.

Sao Paulo



I’m not really inclined to try and deduce too much from this test. Mainly the purpose was to have a point of reference to compare future networks with. But there are some interesting things to ponder:

Computational Power

In terms of total computational power, these tests are equivalent to the test of a single m5.4xlarge machine. Each m5.large instance has 2 vCPUs, making 16 vCPUs total, and each m5.4xlarge also has 16 vCPUs.

The main difference is adding latency and bandwidth constraints. This makes the Ohio test especially interesting since bandwidth and latency effects are very small. The results for 1 4xlarge vs 8 large were nearly identical for low client loads, but at higher client loads the distributed network was about 20% faster than the single machine. I have to admit this result was pretty exciting and unexpected.

Regional Proximity Per Section

In the USA test, the 64 vault network was about 25% faster than the 8 vault network at high client loads. This is sort of counterintuitive but very interesting. I think this is due to the 8-vault-network needing to always negotiate across the latency gap to form consensus, but in larger networks consensus may be formed entirely between low latency vaults without waiting for higher latency vaults.

I think the reason this effect isn’t so evident in the global test is because the number of vaults per region is lower; the network would probably exhibit similar behaviour for larger networks where there’s more opportunity for a majority of vaults in a section to be from a nearby region. It’s just a theory though… I don’t have any robust explanation.

The impact of latency on consensus could have strong implications for the geographical centralization of sections. It could also be significant in the mechanism to allocate new safecoins.

Operator Incentives

Operators are likely to tune their performance based on a compromise between the diametric objectives of ‘supplying maximum resources per machine’ and ‘being fast enough to get rewarded’. Since the latter depends only on the speed relative to other vaults, operators could easily find themselves in a ‘race to the top’ to always have faster and faster machines/connections. I have a feeling the network may need to have some ‘magic targets’ to help reduce centralizing forces, but this is not a concrete idea for me just yet.

Client Scaling

I’m not concerned about the decline in performance as the number of clients increases because these test networks are operating on a fixed amount of computation. In reality if client load increases then vault operators can easily add more computational power. Furthermore, these tests show the extra vaults running on that extra computational power have almost no effect on performance. That’s a really cool property of the network and a credit to the teams building the routing / communications protocols.


Global Large Scale Test

This tests what happens when the network goes from small in size (few vaults) to large in size (many vaults).

This is as close to ‘the real network’ as these tests will get. It reaches a size of 1200 vaults running on 150 globally distributed VMs.

In previous tests the total available resources remained the same as the network grew, however in this test the available resources increase with network size. This behaviour is much closer to how the real network will operate.

Messages take longer to reach their destination in large networks. This test measures how much effect that has, since a large network should in theory be slower than a small network due to this increase in time needed to perform messaging. Does messaging add a lot of overhead or only a little?

Test Details

This test is performed using 8 vaults per VM (instance type is m5.large, vault version 0.17.2, same as alpha2 vaults).

VMs are located in fifteen geographical regions:

America - Virginia, Ohio, California, Oregon, Central Canada, Sao Paulo
Asia - Mumbai, Seoul, Singapore, Tokyo
Europe - Frankfurt, Ireland, London, Paris
Australia - Sydney

One VM is started in every region (8 vaults per VM gives network size of 120 vaults).

After the network is established the client stress test is run (PUT then GET 100 immutable and 100 mutable data). This is done for several different loads between 1 to 40 simultaneous clients.

Then the network is expanded by starting one more VM in each region (growing the network by 120 vaults each time). The stress test is run again for the new network size.

This is repeated until my limit of instances on AWS is reached (10 VMs per region for a maximum network size of 1200 vaults - 15 regions * 10 vms per region * 8 vaults per vm).

The clients are run on a single VM in Ohio (instance type is m5.xlarge).


Increasing the network size by a factor of ten resulted in a 25% slowdown in test time.

As the bottom part of the heatmap shows, increasing client load is no problem when the resources of the network expand in line with the network size. In prior tests, doubling client load also doubled the time to complete the test, however doubling client load in this test had very little effect on the performance of the network.

Points Of Interest

This test threw out a few interesting things along the way:

A delay of about 10s between starting vaults was enough to ensure very few joining vaults were rejected. This delay is one of the main time sinks when running this test. 1200 vaults * 10 seconds is 12,000 seconds, or about 3.3 hours. Add in time for starting VMs, running stress tests, handling churn etc and it becomes a pretty long test to run. In reality some degree of parallel startup is fine, especially for large networks, but joining was done one vault at a time to give the best chance at not losing resources during the test. And then considering the disallow rules for ageing in the real network it will be interesting to see how quickly it can grow.

This is really interesting to me since other cryptos can grow as fast as needed in certain areas (eg doubling bitcoin mining power can happen very rapidly) but not in others (eg doubling the on-chain transaction rate for bitcoin). The speed by which the network can grow is definitely part of ‘being scalable’. I would say this is a potential (but unlikely) bottleneck especially for small networks.

On all VMs, “network traffic to the Internet is limited to 5 Gbps (full duplex)” (source). If bandwidth were the bottleneck on the client, the test would take only about 1 second - 400 MB of data / (5000000000 bits per second / 8 bits per byte / 1024 bytes per KB / 1024 KB per MB), which is about 0.7 seconds.
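
The same back-of-envelope calculation as a few lines of Python. It is only a sketch: it assumes the full 5 Gbps is available to the client and ignores protocol overhead, and `transfer_seconds` is a name made up for the example.

```python
BITS_PER_BYTE = 8
BYTES_PER_MB = 1024 * 1024


def transfer_seconds(data_mb, link_bps):
    """Seconds needed to move data_mb megabytes over a link of link_bps bits/s."""
    mb_per_second = link_bps / BITS_PER_BYTE / BYTES_PER_MB
    return data_mb / mb_per_second


# 400 MB of test data over the 5 Gbps limit quoted above.
print(f"{transfer_seconds(400, 5_000_000_000):.2f} s")
```

Either way the result is well under a second, so client bandwidth is nowhere near being the limiting factor in these tests.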

It could be argued the client should be run from several geographical regions rather than on a single VM from Ohio, but since neither bandwidth nor cpu are bottlenecks for the VM running the clients, and latency averages out anyway, I think this is a satisfactory way to run the test.

There’s something strangely exciting about seeing the network perform at scale. While the tests run it only appears as a bunch of slowly scrolling logs on a computer monitor, but it represents something pretty amazing. The combination of so many little moving parts, from the underlying crypto to the network protocols to the routing protocols etc etc right up to the test code itself… it’s a bit dizzying.

Knowing that in the end it will come together and form a really simple but powerful experience for end users is unspeakably awesome.

What Next?

These tests have been really fun because they establish a yardstick by which future networks can hopefully be compared in performance. They also establish the magnitude and effect of various bottlenecks such as cpu, latency and messaging overhead. These tests have looked at what sort of performance characteristics are present. It feels like this aspect is pretty-much done (for now).

I’m keen to move on from ‘what’ and try to understand ‘why’ these performance characteristics happen. I think the baseline performance is not as good as it could be. If it can become clear where the performance is lacking it’s hopefully also clear where the easiest wins are to improving performance.

So the next tests I hope to do will be some more detailed investigation into where time is being spent by vaults.


Imagine what the early steam engine engineers used to feel when running their engines for the first time. Your experience sounds a lot like it must have been for them, except the heat and noise.

Excellent work @mav again

About the 25% slowdown for a 10 times increase in size: it’d be interesting to know (and I am not asking you to try) whether it slows down 25% for each 10 times increase or whether it slows down less each time.


This kind of detailed poking about, carried out so professionally would cost a proprietary software company huge amounts of money. Both in doing the tests, but especially spec’ing these. Amazing work and fascinating to see.


Great job mav!

Would there be any benefit to running a larger test? I also have some instances per region available on my aws account… Maybe a few other folks do too.


It’s likely the part about each section limiting new nodes to 1 at a time (which they do to try and schedule resource proof checks accordingly and scale the routing table predictably…). So as the network gets larger, more nodes “can join” as they go to different sections, but multiple nodes trying the same section can result in none getting accepted to the section. This is also one of the things the guys in routing are looking at quite closely, to try and not have that chance of 0 but always be able to take a minimum of 1 or even more at a time.


Really wonderful work, @mav! Bravo!


So much this! :raised_hands: :pray::call_me_hand:


As Viv said, “each section limiting new nodes to 1 at a time(which they do to try and schedule resource proof checks accordingly and scale the routing table predictably…)”

Not at this stage I don’t think. Maybe later it would be a good thing to try. We could set a 24h period and have as many people try to join the network as possible and run the client stress test continuously throughout. It’d be pretty chaotic but could be fun!


I would say next steps are that you need to work for MAID because the research you bring to the table every few weeks is fantastic :smile: . At the same time it speaks volumes of this project and the community that an independent highly qualified engineer like yourself interested in this project can go about and tune/scale the product and produce entirely unbiased data driven analysis of the application. Love reading this work, keep it up!


Crypto Operations

This post looks at the cryptographic operations being performed by the vault, since they should (ideally) take most of the time for the vault. All the if-thens and internal organisation of messages etc should be very light-weight.

Crypto Operations Summary

The vault binary is modified to log the exact time (precise_time_ns) whenever a cryptographic operation is started and finished. These include

  • sha3_256 hash
  • ed25519 message signing
  • ed25519 signature verification

The cryptographic operations themselves are generally highly optimised and are not a target for improving performance. But there are a few things that might allow performance improvements for vaults:

  • concurrency - can the vault do more than one thing at a time?
  • duplication - is the vault doing duplicate work that could be avoided?
  • non-crypto operations - can the internal logic and operations of the vault be improved?

The Test

This test is mainly interested in the quantity and proportion of operations rather than the actual timing of each operation, so network factors and cpu power are unrelated. It’s not about how quickly each operation happens, but when and how often and how necessary each action is.

Start 8 vaults locally (compiled to include custom logging for cryptographic operations in safe_vault, routing, crust; the version is safe_vault 0.17.2 alpha2).

Store then fetch 1 immutable data (using the client stress test with flags -i 1 -m 0)

Parse the vault logs to generate a report of the cryptographic operations that happened.

Logging is set to only happen for cryptographic operations on the single immutable data PUT and GET. The network startup and account creation are not included in the report for cryptographic operations, ie

  • start network
  • create account
  • turn logging on
  • PUT then GET immutable data
  • turn logging off

Results - General Accounting

How many cryptographic operations are being done by the vaults?

This is interesting because work on the network cannot come in amounts smaller than one chunk, so ideally the work to process one chunk should be kept as small as possible.

Cryptographic operation tally

         Hash  Sign  Verify                               
  Total   730   482     944

Vault 1   114    81     129
Vault 2    77    50     104
Vault 3    74    48      88
Vault 4    74    48      87
Vault 5   150   102     226
Vault 6    74    48      87
Vault 7    77    50      99
Vault 8    90    55     124

This is not especially illuminating by itself, but it does give an idea of the magnitude of what’s happening. About 2000 crypto operations (2156 in total) for 8 vaults to store+fetch 1 chunk is somewhat higher than I would have thought, but I went into this with no preconceptions so it is what it is.

Results - Concurrency

How much of the work being done is happening concurrently?

It seems that a lot of work can be done in parallel. Since the network itself can operate on many chunks at the same time it seems reasonable that vaults should also be able to do operations simultaneously.

But on the other hand, if the currently scheduled work depends on some other work having been done the process can turn sequential quite easily. Having dependencies on others to complete work leads to waiting, which results in a reduction in concurrency and a slowing of the overall process.

The results below show that about a third of the time is spent not doing any cryptographic operations, and another quarter of the time is spent with only one operation happening. That’s a lower degree of concurrency than I would have expected.

Only 2.2% of the time is spent with all vaults doing something cryptographic at the same time.

Time and Percent Of Time spent doing C simultaneous cryptographic processes

C   Time (ms) Percent
0  111.290613   36.3
1   81.117536   26.5
2   31.834255   10.4
3   14.067316    4.6
4     14.0439    4.6
5   17.007624    5.6
6    17.30288    5.6
7   12.956297    4.2
8    6.669473    2.2
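
For anyone wanting to build a table like this from their own logs, here is one way it could be done. This is a sketch rather than the script used for the report: it assumes each logged operation has been reduced to a (start, end) timestamp pair, and the function name and input layout are hypothetical. It walks through the start and finish events in time order and adds up how long each concurrency level lasted.

```python
from collections import defaultdict


def concurrency_histogram(intervals, start, end):
    """Time spent at each concurrency level between start and end.

    intervals is a list of (op_start, op_end) pairs, all within [start, end].
    Returns {number of simultaneous operations: total time at that level}.
    """
    events = []
    for op_start, op_end in intervals:
        events.append((op_start, 1))   # an operation begins
        events.append((op_end, -1))    # an operation finishes
    # Sort by time; at equal times finishes (-1) sort before starts (+1).
    events.sort()

    histogram = defaultdict(int)
    active = 0
    previous = start
    for time, delta in events:
        histogram[active] += time - previous
        active += delta
        previous = time
    histogram[active] += end - previous  # tail after the last event
    return {level: t for level, t in histogram.items() if t > 0}


# Toy example with three operations, two of which overlap.
toy = concurrency_histogram([(0, 10), (5, 15), (20, 25)], start=0, end=30)
for level in sorted(toy):
    print(level, toy[level], f"{100 * toy[level] / 30:.1f}%")
```

Since each vault runs on a single thread, the number of simultaneous operations across the pooled logs of all vaults should match the number of vaults busy at that moment.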

Results - Duplication

How many times is the same data hashed or signed or verified by each vault? Would it be useful to store the results of some operations for reuse in the future?

The results below show there are a lot of operations where the same data is hashed or signed or verified multiple times, and if it were cached it could reduce the time spent doing cryptographic operations. Most vaults spend about 50% (and one as much as 75%) of their total cryptography time doing duplicate work.

But it’s not a simple matter of ‘deduplicate and win’ because caching comes with some overheads. Presumably the overheads would be less than repeating the cryptographic operations again. I haven’t implemented a cache to test how much improvement might be had.

If someone were to create SAFE ASIC hardware to speed up crypto operations for their vaults it would be very wise to include a cache since many cryptographic operations by vaults are duplicates.

This is not necessarily a case of trying to reduce duplication, but to manage it more effectively. The degree of duplication depends on the messaging protocols, so maybe a better protocol (eg the gossip protocol) will reduce duplication, but that’s not the goal here. The goal is to identify whether a cache is worth the expense.

Duplicate cryptographic operations for each vault

            total dupes  % time duping                    
   Vault 1          192        58.6094
   Vault 2          112        51.9114
   Vault 3           98        47.1170
   Vault 4           98        49.8167
   Vault 5          344        74.8758
   Vault 6           98        46.3739
   Vault 7          106        45.7459
   Vault 8          129        52.1418

Note that larger data (eg a 1 MB chunk) generally takes longer to process than smaller data (eg a publickey). So if the duplicate processes are only happening on smaller data it may not be significant overall. This is why the measure is % time duplicating and not % number of duplicates.
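
A measure like % time duplicating could be derived from the logs along these lines. This is only a sketch, not the script used for the table above: the `Op` record layout is hypothetical, and it assumes `data` holds everything that determines the result (for sign and verify that would be the message together with the key or signature).

```python
from collections import namedtuple

# One logged operation: kind is "hash", "sign" or "verify"; data is the input
# bytes; duration_ns is how long it took. This record layout is hypothetical.
Op = namedtuple("Op", "kind data duration_ns")


def duplicate_share(ops):
    """Return (number of duplicate ops, percent of crypto time spent on them)."""
    seen = set()
    dupes = 0
    dupe_ns = 0
    total_ns = 0
    for op in ops:
        total_ns += op.duration_ns
        key = (op.kind, op.data)
        if key in seen:
            # Same operation on the same input has already been done once.
            dupes += 1
            dupe_ns += op.duration_ns
        else:
            seen.add(key)
    percent = 100 * dupe_ns / total_ns if total_ns else 0.0
    return dupes, percent


sample = [
    Op("hash", b"a", 10),
    Op("hash", b"a", 10),    # duplicate of the first op
    Op("verify", b"a", 30),  # same data but a different kind, not a duplicate
    Op("hash", b"b", 50),
]
print(duplicate_share(sample))
```

Weighting by duration rather than by count is what stops many small duplicates (eg publickey hashes) from looking more significant than they are.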

Results - Non-crypto Operations

The concurrency result shows 36% of time is spent not doing any cryptographic operations. What is happening during that time?

And 98% of the time there is at least one vault ‘idling’ (ie not doing cryptographic work). This seems like a lot more overhead or waiting than I would expect. I guess vaults are either a) waiting for other vaults to complete their cryptographic work or b) doing some other internal operations that are not cryptographic.

Why do vaults spend so much time ‘looking after themselves’ rather than doing work on client data? Unfortunately this test doesn’t really address it.

Hopefully the introduction of the new gossip protocol for messaging will improve concurrency and reduce time spent on non-cryptographic operations.

Personal Observations

The vault operates asynchronously but on a single thread.

To me this seems a waste of resources since some of the message processing does not depend on other messages so can be handled concurrently.

But the times where order does matter, well, they’re very important times.

Perhaps there needs to be concurrent treatment of immutable data messages and serial treatment of messages for mutable data, splits, merges etc? It adds complexity to the code (which for now is unwarranted) but it may lead to some performance improvements. I’d want a better understanding of the vault (especially the non-crypto blockages) before recommending this change for serious investigation.

For now I think it’s good for the main priority to be on naive-but-correct implementation of the core features. Optimisation is not desirable in that phase.

I couldn’t find in the code where the messages get sorted by priority. There are 3 different priorities so it’s not enough to just add to the front or the back of the queue, they do need sorting. The reason why this interests me is I wanted to see how long sorting took but I couldn’t find it to time it.

The work on the gossip protocol for messaging should hopefully improve the situation. This post is not a call to action or even a suggestion about features to investigate, it’s just a record of the current situation so that future comparisons can be made.


Another gem of a post there @mav

You talk of duplication and perhaps caching them to save operations

The question I have is

Are these operations done on storing the data then cache for when the data is read off disk?

I ask because if it is then you cannot cache these since you must verify that the data read from disk is still valid.


These operations are done in many places (search the vault / routing / crust repos for sha3 or sign_detached or verify_detached). For example, a common place that duplicate hashing happens is converting a publickey to an xorname (routing > > name_from_key).

The intention is to avoid repeating a calculation of ‘the truth’ when once is enough. The relation between data : hash (or signed_data : is_verified or data : my_signature) never changes, so why compute it many times?

If my vault is asked for the hash of some data and it’s already been calculated by my vault why should it be calculated again? I can trust myself. That relationship should be kept in memory for a while if it’s cheaper than recomputing (but only if I know there’ll be a need for it in the near future, which the test shows there often is).
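
To illustrate the idea (in Python rather than the Rust used by the vault, and not taken from the vault code), a small in-memory memoisation of a hash could look like the sketch below. The function name, the cache size of 1024 and the stand-in key are all arbitrary choices for the example.

```python
import hashlib
from functools import lru_cache


@lru_cache(maxsize=1024)
def cached_sha3_256(data: bytes) -> bytes:
    """sha3_256 digest of data, remembered for recently seen inputs."""
    return hashlib.sha3_256(data).digest()


public_key = b"\x01" * 32             # stand-in for an ed25519 public key
first = cached_sha3_256(public_key)   # computed
second = cached_sha3_256(public_key)  # served from the cache
print(cached_sha3_256.cache_info())
```

Note that this kind of cache keeps the input bytes in memory as the lookup key, so it suits small inputs like keys much better than whole 1 MB chunks. It also only ever covers data already in memory, not data read back from disk.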

What would make the data read from disk not valid? If it was validated when it was stored what leads to it becoming invalid?

Definitely I agree with you that cache cannot be allowed to compromise security.


My only concern is if the cached values persisted from when the data is stored to when it is read.

Bad disk. Dead disk. Error rates on disks are not zero and the purpose of SAFE is to check data integrity. Someone might use an old SSD that really should not be used for this.

The idea is that you never trust data that has been untouched on a storage device for a period of time.

A bit of maths
Take a 2TB WD Blue (Yes there are better drives but this is an illustration)
Non-Recoverable read errors per bits read == <1 in 10^14
So a 1MB chunk is 8 x 10^6 bits
A medium vault is 500GB which is 4 x 10^12 bits
Now for 10,000 vaults there could be 4 x 10^16 bits.
So we can expect up to 400 errors across the 10,000 vaults (if they were full), or for that amount of data to be read.
Yes, a re-read of the disk by the OS can result in a successful read, but nowhere near all the time. Also it’s not clear from the specs I am looking at whether this translates to a non-recoverable sector read or is an absolute bit error rate and ECC can fix bit errors.
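
Plugging those figures into a few lines is a quick way to double check the estimate. This is a sketch that treats the spec sheet figure as an exact per-bit rate (it is really an upper bound) and assumes every vault is full and read once; the function name is made up.

```python
BITS_PER_BYTE = 8


def expected_read_errors(bytes_per_vault, vaults, error_rate_per_bit):
    """Expected unrecoverable read errors when every vault is read once in full."""
    total_bits = bytes_per_vault * BITS_PER_BYTE * vaults
    return total_bits * error_rate_per_bit


# 500 GB vaults, 10,000 of them, at the quoted bound of 1 error per 10^14 bits.
print(expected_read_errors(500 * 10**9, 10_000, 1e-14))
```

With these inputs the expectation is in the hundreds across the whole set of vaults, which is small per vault but certainly not zero.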


Great post @mav! They always are, but this one is very nice.

I would solve this by identifying the entry point to the vault for the source data and, at the first location of the crypto operation, creating a struct that is passed on from then on. It might not always be suitable, but it should work in the naïve case. Like so:

source1: [source1],
source2: [source 2],
value: [computed value]

So another abstraction for the data handled.

I’m just assuming (haven’t studied the code now) that the data today is passed on through function calls, and operated on in multiple places along the way, and this approach would then be less of a change than introducing a cache.
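
Roughly what I mean, sketched in Python for brevity (the vault is Rust, and every name here is hypothetical). Hashing the two sources together is just a placeholder for whatever the real crypto operation would be.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Computed:
    """Source data bundled with the value computed from it."""
    source1: bytes
    source2: bytes
    value: bytes


def compute_once(source1: bytes, source2: bytes) -> Computed:
    """Do the crypto operation at the entry point and carry the result along."""
    value = hashlib.sha3_256(source1 + source2).digest()
    return Computed(source1, source2, value)


def later_stage(item: Computed) -> bytes:
    """A downstream function reuses item.value instead of hashing again."""
    return item.value


item = compute_once(b"header", b"payload")
print(later_stage(item).hex())
```

Making the struct immutable (frozen) means later stages can rely on the value still matching the sources it was computed from.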


We do memoise in some cases and perhaps not in all. It is a good point to check where memoisation is missing and should be maintained. Nice point @mav thanks again, but really thanks is not enough. We will need to make a charity donation or do something, these posts are pure gold for us.


Exactly what neo said. One of the best features of safe is enduring data integrity, but no need for fancy sas controller cards and enterprise drives or filesystems… hard error rates happen, bits rot, etc.

EDIT : I guess we can’t get around the need for ECC ram though… something to consider for farming hardware… although there is a library called SoftECC that might help…


Correct since SAFE provides the redundancy built in.