Profiling vault performance


#47

As @digipl says, there are a few factors that affect the tx rate…

  • Group size and quorum size will be larger on the real network and thus expected to be slower than my test network. However since the vaults in this test never came anywhere close to breaking a sweat the result of 100 tx/s is extremely conservative (from a vault perspective). The actual tx/s that vaults can handle would be much higher but couldn’t be tested because the launcher couldn’t handle that much data. To put out a haphazard guess: since cpu load didn’t change on the vaults during testing and is measured in 0.25% increments, there’s potential for at least a 400-fold improvement in tx rate from the vault side taking the estimate to at least 40K tx/s/group.

  • Global transaction rate increases as the network size increases (ie number of groups), so there is no upper bound on global transaction rate. This is an amazing property of the safe network and is so different to bitcoin that comparing tx rates between the networks is basically an instant red-flag for trolling (guilty!).

  • The rate will be affected by other work groups have to do such as storing immutable data, churn etc. However prioritizing certain data may help retain the high overall tx rate (if that’s considered a priority in the first place).


#48

Vaults with Disjoint Sections

It’s been a long time between tests, so let’s see how vaults perform with the new Disjoint Sections feature.

Results first: I couldn’t get the network started. But the test was still very interesting for other reasons.

Versions

Same versions as Test 12.

Modifications

Methodology

Same as prior tests:

  • Start 28 vaults on 7 pine64s
  • Time the duration to upload a large file (ubuntu-16.04-server-amd64.iso 655 MiB)
  • Repeat nine times and take median upload time

Results

The network never got started so no uploading could be done.

Observations

Resource Proof

v0.13.0 of the vault introduces another new feature besides Disjoint Sections: resource proof (ie upload speed must be at least about 6 Mbps)

Resource proof finds its way into the vault via the routing source code.

Of most interest is the verify_candidate method.

The candidate must provide a certain size of data (RESOURCE_PROOF_TARGET_SIZE = 250 MiB) in a certain amount of time (RESOURCE_PROOF_DURATION_SECS = 300 s) which ends up being a little over 6 Mbps.

This is an overly simple calculation, since there are other factors of timing to consider such as the time taken to traverse routing paths. From the source code comment: “This covers the built-in delay of the process and also allows time for the message to accumulate and be sent via four different routes”. This added complexity is very interesting from a security perspective, as it potentially allows nodes to alter the perceived performance of other nodes on their routing path. Enabling the network to ‘monitor itself’ creates many interesting considerations. The links above are a good starting point for more detail.

Private safe networks would most likely want to modify these parameters to make it faster to join the network, although at this stage the bandwidth requirement should be trivial to complete for any local network.

The implementation is quite elegent, and is easy to see how it can be extended to a cpu proof of resource.

My main doubt of the current resource proof is it uses the SHA3-256 hash function as the basis of proof (with trivial difficulty), yet the majority of current operations on the network are signature validation operations. The real-world performance of a node (especially one with custom hardware) depends on signature verification, so proving they have fast hashing isn’t necessarily going to determine how useful their ‘real’ performance will be on the network. Again, this is a slightly-too-simplistic look at things, but is a starting point in the consideraiton of resource proof. Hashes are perfectly useful as a means to determine bandwidth, but I have doubts about how long it will remain that way due to their disconnect with actual resource demands.

Proof Of 6 Mbps

I first tried running the vaults with the original 6 Mbps proof setting. The gigabit network should trivially handle this proof, and the logs showed the expected message:

Candidate XXXXXX… passed our challenge in 0 seconds.

However shortly after, log messages began showing the challenge taking some nodes 10+ seconds.

The consecutive times to pass the challenge from the first vault log were 0 0 1 2 2 4 7 5 8 6 10 12 8 31 18

This is still way under the 300s threshold, but the degree of variation seems like it can get quite large. It begs the question ‘what exactly causes it’ and ‘how far can it go’ and finally ‘can it be exploited by other nodes to their advantage’?

The variation is concerning to me, but resource proof is a complex topic and one I’ve only just started exploring. I’m sure there will be many interesting developments in the near future as the topic is explored further.

As to a reason for this delay… subjectively, there were a lot of log messages ‘Rejected XXXXXX… as a new candidate: still handling previous one.’ Unfortunately I don’t have the time to investigate this more deeply just now.

It’s tempting to draw conclusions from observations, but I think it’s important to take observations as-is and not make incorrect assumptions about the potential causes. I’m not familiar enough with resource proof to draw much meaning from this test, but find the observations interesting in their own right.

Ultimately my network never got started with 6 Mbps resource proof. The largest I saw the routing table get to was 7, which was from a sample of five attempts each given about half an hour to start.

Proof Of 27 bps

I changed the proof from

TARGET_SIZE = 250 * 1024 * 1024 = 6 Mbps

to

TARGET_SIZE = 1024 = 27 bps

The reason for lowering target size and not increasing allowed time was because I wanted the vaults to acknowledge quickly and the network be operational sooner, not simply to accept lower bandwidth.

The network would still not start. The cause isn’t clear from the logs.

Proof Of 0 bps

My expectation for 0 bps proof is still failure, since it seems that messages are not getting through regardless of the proof requirement.

0 bps resource proof also failed to start. I’m not sure how the Test 12 network was started in the first place. If there’s anything I might have overlooked that could help get the network started, I’d be interested to know.

Conclusion

I couldn’t get a network to start, so the performance of vaults with Disjoint Sections could not be tested.

The implementation of resource proof is extremely interesting. I’m looking forward to seeing how it progresses in the future.


#49

Always love to see your results and really respect your very unbiased and logical approach. I’ve been seeing commits on github improving log messages so perhaps those changes will help and stick around! There is talk about test 12b coming soon!


#50

I have a very old Intel Q6600 and I have succeeded in running a local network of about 50 nodes on it. Besides removal of check for peers on same LAN like you have done, I have modified handle_candidate_identify() method of routing/states/node.rs module:

        let (difficulty, target_size) = if self.crust_service.is_peer_hard_coded(&peer_id) ||
                                           // A way to disable resource proof in local tests but not in global network
                                           self.crust_service.has_peers_on_lan() ||
                                           self.peer_mgr.get_joining_node(&peer_id).is_some() {
            (0, 1)
        } else {
            (RESOURCE_PROOF_DIFFICULTY,
             RESOURCE_PROOF_TARGET_SIZE / (self.peer_mgr.routing_table().our_section().len() + 1))

This is my way to implement the 0 bps proof of resource. The advantage is that the same binary can be used for both local network (with 0 bps PoR) and global TEST network (with PoR as programmed by Maidsafe).

I also do not start the vaults all at once. I start the first one with RUST_LOG=info ./target/debug/safe_vaut -f and then I launch a group of 10 vaults with a 6 seconds delays between each vault with the following script:

for i in {1..10} 
do
	echo "Launching vault $i"
	./target/debug/safe_vault&
	sleep 6
done

I launch it 5 times, but each time I wait for the routing table size to reach the next multiple of ten before launching the next group. It can take several minutes before this and the CPU can be high.

Lastly, I have modified vault_with_config() method of safe_vault/vault.rs module:

        chunk_store_root.push(CHUNK_STORE_DIR);

        // To allow several vaults on same station
        use rand::{self, Rng};
        use rustc_serialize::hex::ToHex;
        chunk_store_root.push(rand::thread_rng().gen_iter().take(8).collect::<Vec<u8>>().to_hex());

The aim is that the vaults on a same station do not share the same chunk store directory, because I am afraid that there is a lock contention on the chunk file created by vaults handling the same chunk. It doesn’t explain why your network doesn’t start up but I wonder if it could explain the observed slowdown during uploads and its variability.

In conclusion, I would say that Maidsafe made it very difficult to run a local test (I mean a real one not a mock one) with following obstacles to overcome:

  • check that vaults are not on same LAN
  • sharing of a common chunk store directory
  • costly Proof of Resource

Ideally, they could add a simple -l flag in safe_vault program to allow such a use case. I would thank them a lot if they implement it.


#51

Great test and write up @mav, thanks. Also @tfa thanks for sharing your experiments. I love reading these :slight_smile:

@mav:

As to a reason for this delay… subjectively, there were a lot of log messages ‘Rejected XXXXXX… as a new candidate: still handling previous one.’ Unfortunately I don’t have the time to investigate this more deeply just now.

You may know this, but if not it could help… Only one node is tested per section at any time, so any others trying to join that fall into a given section get rejected and will keep trying until they get to be the node under test.

The smaller the network, the fewer sections, so for these tests it’s not surprising you see a lot of those messages as your nodes start up and begin clamouring “me next, me next”! :slight_smile:


#52

This is very promising. I’ll try adding your changes to my vaults.

I was initially using a delay of 3s between starting vaults, which worked well on past networks. I changed it to 60s just to be sure (26m startup time!!) but it made no difference.

Could you use chunk_store_root in safe_vault.vault.config? That’s how I’ve been managing multiple chunk stores on the same machine. Are you aware of any benefit to your technique instead of the configuration option?

I agree but I think maidsafe are focused on the right stuff at the moment. I’m in no hurry to get a ‘local network’ feature. It gives me a genuine motivator to dig into the code. Good list of obstacles though, it’s handy for future reference.


Having implemented the change to handle_candidate_identify the network still isn’t starting.

The first vault to be tested (ie the second to start) passes the challenge and is accepted into the routing table.

The second vault to be tested passes the challenge but is never accepted into the routing table. The log message that seems relevant to this is
Timed out waiting for ack - the full log message is:

TRACE 22:12:58.311444296 [routing::states::common::bootstrapped bootstrapped.rs:103] Node(8f734b..) - Timed out waiting for ack(hash : b994..) UnacknowledgedMessage { routing_msg: RoutingMessage { src: Section(name: 319d4c..), dst: Section(name: 319d4c..), content: CandidateApproval { candidate_id: PublicId(name: 319d4c..), client_auth: Client { client_name: 1452df.., proxy_node_name: 8f734b.., peer_id: PeerId(9e17..) }, sections: {Prefix(): {PublicId(name: 8f734b..), PublicId(name: d2bcb9..)}} } }, route: 2, timer_token: 16 }

Tempting as it is to chase this rabbit, I’m going to wait for a new release. The sort of testing I’m aiming for in this thread is meant to be simple to report and simple follow (eg comparison of upload timing), and since this test is getting beyond that scope I’m leaving it there :slight_frown:


#53

I suppose that I would have either to duplicate the binary file together with its config file (50 times!) or to modify the config file before each invocation which is more complex to program than what I did (and which is risky because I don’t know if the config file is only read at start up).


#54

FYI, one more obstacle added recently for running a local network:

ERROR 15:45:33.575315300 [crust::main::bootstrap mod.rs:184] Failed to Bootstrap: (FailedExternalReachability) Bootstrapee node could not establish connection to us.
INFO 15:45:33.575814800 [routing::states::bootstrapping bootstrapping.rs:210] Bootstrapping(0ebae3..) Failed to bootstrap. Terminating.

The workaround I found, was to pass constant CrustUser::Client instead of crust_user variable in the 2 calls to start_bootstrap in src/states/bootstrapping.rs, eg:

                let _ = if self.client_restriction {
                    CrustUser::Client
                } else {
                    self.crust_service.set_service_discovery_listen(true);
                    CrustUser::Node
                };
                let _ = self.crust_service.start_bootstrap(HashSet::new(), CrustUser::Client);

With all the mentioned modifications in this topic I still can run a local network on my station with current master branch, but I am afraid that one day this won’t be possible anymore.


#55

Disjoint Sections (part 2)

I managed to start a network using Test 12c versions of software. The performance of disjoint sections appears to be about 2.5 times slower than pre-disjoint-sections.

Parameters

vault 0.13.1
routing 0.28.2
launcher 0.10.1
demo_app 0.6.2
group size 8 (default)
quorum 60% (default)
removed limits such as upload volume and peers on lan
upload file is hippocampus.mp4 (96.2 MiB)

Results: Local Network

The upload on local test12c network took 50m.

The comparison with pre-disjoint-section would be with a group size of 8 and quorum of 5. Using the table from a prior post this upload would take about 21m (group 8 quorum 6). So disjoint sections appears about 2.4 times slower in this test.

The upload profile has that similar look as pre-disjoint-sections, where it starts out looking fast but ends up taking quite a bit longer, and has a sharp increase at the last one percent.

Results: Alpha1 v Test 12c

Alpha took 8.5m to upload 96.2 MiB
Test 12c took 24.1m to upload the same file.

This is about 2.8 times slower, so again not a great result for disjoint sections.

The ETA profile is familiar.

General Observations

  • The LAN routing table never got above 12 nodes. There should have been 27, so this is something that warrants further investigation.
  • The network starts very easily and much more quickly. The delay between starting each node can be drastically reduced. Before disjoint sections, I was delaying 20s between each node starting, but with disjoint sections I leave only 2s.
  • The main errors I’m seeing in the vault logs are
crust::main::active_connection active_connection.rs:171
{ code: 104, message: "Connection reset by peer" }

crust::main::connection_listener mod.rs:137
{ code: 11, message: "Resource temporarily unavailable" }

#56

Strange. Your result show an upload speed of 66,5KB/seg but I made my own upload test with12C, with directories of about 100MB, and the speed of the real network is near 300KB/seg.

Seems logical to think that the local test should have been faster than the real network.


#57

Like your tests but I have read some time ago that there is no optimization for network performance done yet.

Maybe a team member could give us some info on it because last days some topics involve networking speed and I’m starting to get curious too. (Maybe they have no idea till optimization is done)


#58

Agreed. I am dubious of the quality of the pine64s, but they have a gigabit ethernet port and quadcore cpu so they should be great. Testing with a normal network file copy shows they do function as expected. So… I can’t really explain it.

I’m verymuch considering moving away from the pine64s for testing.


#59

Having further investigated the bootstrapping and now able to reliably start the full 27 nodes (the issue was around the disconnection of peers based on whitelisting), the results are still slower than pre-disjoint-sections, but much more consistent.

Results

96.2 MiB file (hippocampus.mp4) was uploaded several times.

Upload Time (minutes)
33.6
33.1
33.2
32.8
29.4
33.2
33.7

The upload time is still slower compared to pre-disjoint-sections (around 21m), but not as bad as first tested.

There was a second factor beside network bootstrap which contributed to the previous very slow upload time (this really took me by surprise). These most recent tests were done on my normal test setup, but the prior test was done using a computer for launcher and demo_app running qubes-os, which I noticed to be quite slow in general use. I didn’t expect that slowness to affect the newtork layer, but these results indicate it may have. It amazes me how many variables there are in this testing, and how there can be no assumptions about which ones may be ignored. This also feeds my unhealthy need to be a control freak in these tests, since it can be very hard to isolate all variables. I was considering switching to cloud servers, but now I think that environment would lead to too many uncontrolled variables.

Deviation

It’s extremely pleasing to see such an improvement in consistency of upload time.

This is the previous variability test.

A normalized standard deviation is used to compare the variability (ie normalizing upload times with the average then calculating the standard deviation).

Standard Deviation (higher is more variable is worse)
-----
Pre Disjoint Sections  0.0742
    Disjoint Sections  0.0456

More reliable upload times means

  • better user experience
  • greater predictability in the software product which means
    • easier to identify areas for improvement
    • easier to measure the impact of changes

Very keen for Test15 to come out. Still feeling a bit disappointed with the slower upload times, but I have some ideas I want to test once Test15 is out. Now that consistency in upload time is better, I’m confident that improvements made to upload time will also be consistent.


#60

Are you sure that your vaults don’t share a common chunk store directory? Lock contentions on the chunk files would explain the bad performance.

Did you try my way to avoid this ? (mentioned earlier in this topic). You said that you use vault config file for that, but do you really duplicate your exe 7 times on each of your 4 stations and define a different chunk store directory in each config file?


#61

I think something must be wrong in your test. I still think that it is not normal to have an upload speed almost 6 times lower than the last real network.
And the speed of the Pine64, even lower than other computers, do not explain such differences at all.


#62

Yes, I’m sure the vaults don’t share a common chunk store directory. Each directory allocated to each vault accumulates data.

No, I didn’t. I may try it when I get a chance. I do duplicate the vault binary 28 times (4 per machine, 7 machines) and create unique configs for each vault binary. I’m confident the setup is correct, and the vault logs also indicate this is the case.

I completely agree, but trust the numbers more than my intuition. When I get time I do plan to try to identify where the most time is spent like in this previous post but right now my time is sadly not enough to do that kind of analysis. Hopefully soon my schedule will allow it.

edit: I should add, anyone could (and should!) be benchmarking testnet upload speeds using their own standard 24 MB file (because the limit is 25 MB). It’s easy to do: just record the upload time for 10 different uploads, and let us know the results. I would be very interested to see this.


#63

Disjoint Sections (part 3)

Running the same timing analysis as this prior post on the vault v0.13.1 methods turns up some useful results.

Method Timing

An important difference between this list of method timings and the prior one is this only times the file upload, whereas the prior list includes time spent bootstrapping and registering a new account. [Un]fortunately better ways to approach this testing are becoming clearer the more I do it, so some older results aren’t directly comparable.

Prior test: 650 MiB file taking 60m with group size 3 and quorum 2
 This test:  96 MiB file taking 30m with group size 8 and quorum 60%

                                        Method Time (s)  Count
                                mod.rs:105:put  1481.79    792
       maid_manager.rs:347:forward_put_request  1116.05    776
                    vault.rs:168:process_event   459.64  11294
       data_manager.rs:1171:send_group_refresh   133.86    792
      data_manager.rs:820:handle_group_refresh   110.60    792
                                mod.rs:152:has    76.73   2344
 maid_manager.rs:294:handle_put_immutable_data    43.53    776
             data_manager.rs:1150:send_refresh    32.92    792
      data_manager.rs:311:insert_pending_write    24.62    792
                 maid_manager.rs:94:handle_put    14.35    776
              maid_manager.rs:281:send_refresh    13.19    776
     data_manager.rs:930:update_pending_writes    10.07    792
        maid_manager.rs:125:handle_put_success     6.08    776
                data_manager.rs:404:handle_get     4.92     56
                       vault.rs:200:on_request     3.29   9544
                data_manager.rs:430:handle_put     2.43    776
   maid_manager.rs:187:handle_get_account_info     2.37     16
                       utils.rs:22:client_name     2.22   1568
                          mod.rs:212:do_delete     2.05    792
 data_manager.rs:961:send_gets_for_needed_data     0.95  17630
                         ... only top 20 shown

I don’t understand why there’s 792 calls to put, since it should be about 291 (3 copies * 97 chunks).

Disjoint sections adds some time for group management, but generally vaults seem to spend a lot of time waiting for other vaults to complete their computation (see below), which really extends the upload time.

Timing Results

Each vault spends some time performing computation, and some time waiting for others to finish their computation.

As seen below, the most heavily-utilized vault spends about 544s of time performing computions. So the upload time for this 30m upload could possibly be cut down to about 9m. How close we could get to that ideal depends on the specific reasons for why vaults spend time waiting.

Vault Name  CPU Time (s)
      50_4  544.41
      51_3  481.35
      50_2  416.57
      50_3  411.70
      51_2  190.96
      51_1  164.65
      50_1  164.18
      54_3  130.75
      55_1  127.37
      54_2  127.12
      53_2  125.24
      51_4  119.32
      52_1  117.83
      52_4  100.09
      52_2   92.19
      52_3   85.25
      55_2   21.34
      53_4   19.19
      53_1   16.50
      53_3   13.30
      56_3   12.64
      55_4   12.31
      56_2    9.97
      56_1    8.96
      55_3    8.69
      56_4    8.05
      54_4    7.58
      54_1    6.54

About a third of the network of 28 vaults contribute very little compute time. This small network does not seem to be ‘too small’ to do practical testing. In an overly-small network I would expect all vaults to be doing similar amounts of computation, whereas this network has a lot of ‘idle’ vaults.

Vault Storage

The number of files in each vault chunk store is given below, and is quite consistent (as expected).

30
37
39
36
32
35
40
34
41
16
16
16
28
34
30
29
16
32
40
21
37
39
30
33
24
38
29
16

The total number of files stored (excluding the lock file) is 848, which is not clear to me why this is. I would expect about 291 files (3 copies * 97 chunks)

Vault vs Launcher

The launcher and demo_app create essentially zero load on the pc they run from, which to me seems that both vaults and launcher have a lot of spare performance headspace to work with.

Where To From Here?

The next thing I plan to investigate more closely is why the total time for upload is so much more than the maximum compute time. Where does the waiting happen? And why?

Final Point

There a couple of pertinent quotes from earlier in this thread by @AndreasF worth repeating:

https://safenetforum.org/t/profiling-vault-performance/10331/36

I know, there’s a lot of room for obvious improvements here. But we have to make the network fully secure and resilient first, before we start implementing optimisations.

https://safenetforum.org/t/profiling-vault-performance/10331/43

Disjoint groups are not about performance, and at first, they may well impact it negatively: the average group will probably be at least 150% of the minimum group size and we’re not working on optimisations yet.


#64

848 is almost 8 copies per chunk.

Did you change the GROUP_SIZE: usize = 8; in safe_vault/src/lib.rs?


#65

Hi, @mav,

Thank you very much for the evaluation.

We are currently investigating the result and will keep you updated.

Regarding your concern of the number of puts and files, if the Group_Size at https://github.com/maidsafe/safe_vault/blob/master/src/lib.rs#L251 hasn’t been changed, there are shall be 8 copies for a chunk, not 3. so 97*8=776, there are also chunks for account and directory info, so the numbers seems correct.

Thanks again and will let you know if we got any progress asap.

Cheers,


#66

Ah yes, group size 8 - I spent so long testing group size 3 this difference just slipped by me. Thanks!