Profiling node performance

Maybe you’re swapping to disk: if I understand correctly your physical machines have 2GB memory shared by 4 virtual machines. That’s probably not enough. In a way you’re still using disk.

Swap was not being used. I’m going to run it again and double check, but consider:

3 copies of 686 MB file = 2058 MB
Assume roughly equal distribution of chunks across all 7 devices
2058 / 7 = 294 MB of memory per device

The test file should easily fit in memory.

Edit: I should also add that the vaults are not running in virtual machines; they run on the native OS using separate ports and chunk store directories, so each vault has access to as much or as little memory as it requires from the full 2GB.

Appreciate the thoughts; swap is certainly a contender and I will be sure to double-check it.

1 Like

Long post so I’ll put the conclusion first:

The routing event loop can be improved, which may improve upload speed.

Here’s the trail of evidence:

From previous tests, send_gets_for_needed_data is currently consuming the most time in the vault (assuming chunks are written to disk in the background).

The graph below shows there are intermittent delays. send_gets_for_needed_data completes in less than 100ms 90% of the time, but the other 10% can take up to 6s to complete.

send_gets_for_needed_data is called very often: 20,831 times. For context, that’s more than 10 times the number of put calls (2,046 puts). The high volume means it must be fast every time; otherwise those occasional slow calls really begin to add up.

Looking at the timing inside send_gets_for_needed_data shows where the problem is:

let src = Authority::ManagedNode(try!(self.routing_node.name()));
time spent: 5916.90330509
"all other lines of code in send_gets_for_needed_data combined"
time spent: 1.882290369
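
(For reference, these per-line numbers come from simple wall-clock instrumentation added in my fork. The sketch below shows the idea only — the actual logging code writes raw nanosecond timestamps to the logs and the totals are summed afterwards; expensive_call is a hypothetical stand-in.)

use std::time::Instant;

// Time a single statement with a wall clock and report how long it took.
fn main() {
    let start = Instant::now();
    let name = expensive_call(); // stand-in for self.routing_node.name()
    println!("name() took {:?}", start.elapsed());
    let _ = name;
}

// Hypothetical stand-in for the call being profiled.
fn expensive_call() -> String {
    "node-name".to_string()
}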

What’s taking so much time in routing_node.name? Here’s the line-by-line timing:

routing:node:name - create channel - 0.340465389
routing:node:name - send request   - 0.500385346
routing:node:name - receive result - 1794.10717215

Waiting for the result accounts for 99.9% of the total time executing name().

Why so much waiting for the name? It’s just a property of the node. It should be returned almost instantly.

name() returns the result asynchronously via action_sender, so it must wait for other items in the event queue to complete. This can be seen on the graph below.
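
Roughly the shape of the problem (simplified stand-ins below, not the real routing API): the caller pushes an action onto the same queue that carries network events and then blocks waiting for the reply, so even a trivial request has to wait for everything already queued ahead of it.

use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread;

// Simplified stand-ins for routing's action/event types.
enum Action {
    Name(Sender<String>), // ask for the node's name; reply on the enclosed channel
    Hop(u64),             // stand-in for a Crust hop message needing verification
}

fn event_loop(rx: Receiver<Action>) {
    let node_name = String::from("a1b2c3");
    for action in rx {
        match action {
            // Cheap: just send back a copy of a field the node already holds...
            Action::Name(reply) => { let _ = reply.send(node_name.clone()); }
            // ...but it still waits in line behind every queued hop message.
            Action::Hop(_n) => { /* signature verification, routing, acks... */ }
        }
    }
}

fn main() {
    let (tx, rx) = channel();
    let worker = thread::spawn(move || event_loop(rx));

    // A backlog of hop messages ahead of us in the queue.
    for i in 0..1000 {
        tx.send(Action::Hop(i)).unwrap();
    }

    // This is essentially the shape of name(): enqueue a request, block on the reply.
    let (reply_tx, reply_rx) = channel();
    tx.send(Action::Name(reply_tx)).unwrap();
    let name = reply_rx.recv().unwrap(); // only returns once the backlog is drained
    println!("name = {}", name);

    drop(tx);
    worker.join().unwrap();
}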

In the logs it’s also very clear. Here’s a particularly bad example of the backlog causing name to be slow to return:

21067720255750 start data_manager.rs:579:send_gets_for_needed_data
21067720267834 Name Request Happens
21067721697429 CrustEvent::Message::Hop
21067771355696 CrustEvent::Message::Hop
21067821400019 CrustEvent::Message::Hop
21067870456471 CrustEvent::Message::Hop
21067920365164 CrustEvent::Message::Hop
21067941593830 CrustEvent::Message::Hop
21067963335098 CrustEvent::Message::Hop
21067987394495 CrustEvent::Message::Hop
21068008237521 CrustEvent::Message::Hop
21068033712512 CrustEvent::Message::Hop
21068053956183 CrustEvent::Message::Hop
21068077750611 CrustEvent::Message::Hop
21068097937196 CrustEvent::Message::Hop
21068121544659 CrustEvent::Message::Hop
21068144560600 CrustEvent::Message::Hop
21068196893925 CrustEvent::Message::Hop
21068247295219 CrustEvent::Message::Hop
21068296141581 CrustEvent::Message::Hop
21068348779625 CrustEvent::Message::Hop
21068398292887 CrustEvent::Message::Hop
21068448006780 CrustEvent::Message::Hop
21068501791076 CrustEvent::Message::Hop
21068549567647 CrustEvent::Message::Hop
21068572497293 CrustEvent::Message::Hop
21068593907383 CrustEvent::Message::Hop
21068645624768 CrustEvent::Message::Hop
21068696443703 CrustEvent::Message::Hop
21068745756540 CrustEvent::Message::Hop
21068797673474 CrustEvent::Message::Hop
21068821118431 CrustEvent::Message::Hop
21068871433389 CrustEvent::Message::Hop
21068921130949 CrustEvent::Message::Hop
21068971024724 CrustEvent::Message::Hop
21069021942705 CrustEvent::Message::Hop
21069071882982 CrustEvent::Message::Hop
21069122009266 CrustEvent::Message::Hop
21069146865192 CrustEvent::Message::Hop
21069197306196 CrustEvent::Message::Hop
21072624102073 CrustEvent::Message::Hop
21072673830676 CrustEvent::Message::Hop
21072696150382 CrustEvent::Message::Hop
21072718986066 CrustEvent::Message::Hop
21072771431853 CrustEvent::Message::Hop
21072818312267 CrustEvent::Message::Hop
21072842778928 CrustEvent::Message::Hop
21072892935213 CrustEvent::Message::Hop
21072944337545 CrustEvent::Message::Hop
21072967923757 CrustEvent::Message::Hop
21072992335958 CrustEvent::Message::Hop
21073037820778 CrustEvent::Message::Hop
21073057693935 CrustEvent::Message::Hop
21073099556203 CrustEvent::Message::Hop
21073118999886 CrustEvent::Message::Hop
21073169036208 CrustEvent::Message::Hop
21073219543548 CrustEvent::Message::Hop
21073246460801 CrustEvent::Message::Hop
21073299304603 CrustEvent::Message::Hop
21073325347156 CrustEvent::Message::Hop
21073377902739 CrustEvent::Message::Hop
21073405462600 CrustEvent::Message::Hop
21073457020270 CrustEvent::Message::Hop
21073482864108 CrustEvent::Message::Hop
21073537913992 CrustEvent::Message::Hop
21073591831168 CrustEvent::Message::Hop
21073616487545 CrustEvent::Message::Hop
21073642275714 CrustEvent::Message::Hop
21073696074718 CrustEvent::Message::Hop
21073721220822 CrustEvent::Message::Hop
21073772763825 CrustEvent::Message::Hop
21073799069930 CrustEvent::Message::Hop
21073852988731 CrustEvent::Message::Hop
21073879074327 CrustEvent::Message::Hop
21073931569116 CrustEvent::Message::Hop
21073982336008 CrustEvent::Message::Hop
21074008557110 CrustEvent::Message::Hop
21074193074109 CrustEvent::Message::Hop
21074220315457 CrustEvent::Message::Hop
21074244276266 CrustEvent::Message::Hop
21074257859481 CrustEvent::Message::Hop
21074283286928 CrustEvent::Message::Hop
21074307013854 CrustEvent::Message::Hop
21074320526107 CrustEvent::Message::Hop
21074345425160 CrustEvent::Message::Hop
21074367746908 CrustEvent::Message::Hop
21074381849600 CrustEvent::Message::Hop
21074404708618 CrustEvent::Message::Hop
21074428633260 CrustEvent::Message::Hop
21074452144636 CrustEvent::Message::Hop
21074466254203 CrustEvent::Message::Hop
21074489858041 CrustEvent::Message::Hop
21074513806224 CrustEvent::Message::Hop
21074538099129 CrustEvent::Message::Hop
21074561772428 CrustEvent::Message::Hop
21074575123050 CrustEvent::Message::Hop
21074599396621 CrustEvent::Message::Hop
21074622817286 CrustEvent::Message::Hop
21074645575008 CrustEvent::Message::Hop
21074670535772 CrustEvent::Message::Hop
21074684696716 CrustEvent::Message::Hop
21074708283428 CrustEvent::Message::Hop
21074731759636 CrustEvent::Message::Hop
21074754820870 CrustEvent::Message::Hop
21074781077848 CrustEvent::Message::Hop
21074803612771 CrustEvent::Message::Hop
21074825461252 CrustEvent::Message::Hop
21074850483267 CrustEvent::Message::Hop
21074872613467 CrustEvent::Message::Hop
21074896753199 CrustEvent::Message::Hop
21074920094694 CrustEvent::Message::Hop
21074944880492 CrustEvent::Message::Hop
21074966802351 CrustEvent::Message::Hop
21074990332477 CrustEvent::Message::Hop
21075006702004 CrustEvent::Message::Hop
21075028687031 CrustEvent::Message::Hop
21075043091068 CrustEvent::Message::Hop
21075066726990 CrustEvent::Message::Hop
21075080974938 CrustEvent::Message::Hop
21075104259263 CrustEvent::Message::Hop
21075129753588 CrustEvent::Message::Hop
21075143761485 CrustEvent::Message::Hop
21075158159230 CrustEvent::Message::Hop
21075180201218 CrustEvent::Message::Hop
21075195514663 CrustEvent::Message::Hop
21075218302470 CrustEvent::Message::Hop
21075241674424 CrustEvent::Message::Hop
21075258707642 CrustEvent::Message::Hop
21075282958004 CrustEvent::Message::Hop
21075298858680 CrustEvent::Message::Hop
21075321388103 Action::Name   -----------> This is where name is evaluated
21075321630362 Name Response Happens
21075321686572 start data_manager.rs:200:needed_data
21075321647446 CrustEvent::Message::Hop
21075321712448 end data_manager.rs:200:needed_data
21075321753074 start data_manager.rs:243:print_stats
21075321766867 end data_manager.rs:243:print_stats
21075321779659 end data_manager.rs:579:send_gets_for_needed_data

Ideally, calling name() should not wait for an arbitrary number of Crust events to complete, especially if none of those events affect the value of the response.

How might this be resolved?

  • Prioritize local events over network events (see the sketch after this list).
  • Prioritize quick events over slow events (e.g. returning a property vs verifying a cryptographic signature).
  • Use more than one thread to manage the event loop.
  • Process independent events concurrently.
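
As a rough illustration of the first two points (hypothetical code, not the current routing implementation), one option is a separate high-priority queue of cheap local actions that is drained before any network event is touched:

use std::sync::mpsc::{channel, Receiver, TryRecvError};

enum LocalAction { Name, CloseGroup }   // cheap: answerable from local state
enum NetworkEvent { Hop(Vec<u8>) }      // expensive: signature verification etc.

// One iteration of a hypothetical loop: answer every pending local action
// first, then handle at most one network event before checking again.
fn poll_once(local: &Receiver<LocalAction>, network: &Receiver<NetworkEvent>) {
    loop {
        match local.try_recv() {
            Ok(action) => handle_local(action),
            Err(TryRecvError::Empty) | Err(TryRecvError::Disconnected) => break,
        }
    }
    if let Ok(event) = network.try_recv() {
        handle_network(event);
    }
}

fn handle_local(_action: LocalAction) { /* return a property; no crypto involved */ }
fn handle_network(_event: NetworkEvent) { /* verify signatures, route, ack, ... */ }

fn main() {
    let (local_tx, local_rx) = channel();
    let (_network_tx, network_rx) = channel::<NetworkEvent>();
    local_tx.send(LocalAction::Name).unwrap();
    poll_once(&local_rx, &network_rx);
}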

Message prioritization is complex, so I can’t say for sure if the suggestions above will work as intended. I plan to find out!

I’m going to try changing the event processing so name doesn’t take so long to return. There’s no avoiding the intervening Hop messages (which require signature verification), but we’ll see what happens. Stay tuned…

17 Likes

Awesome, thanks for the detailed analysis!

Use more than one thread to manage the event loop

… or do the vault stuff in the event thread instead of a separate one. That would be my preference; I don’t like that safe_vault and routing communicate via events, and I think that one event loop should be enough: Everything that’s computationally expensive shouldn’t happen in that thread anyway.
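
(Purely as an illustration of that shape — not actual routing or safe_vault code, and the names below are made up — the difference is between forwarding events over a channel to a separate vault thread versus routing calling straight into a vault handler from its own loop:)

// Current shape (roughly): routing pushes events onto a channel and a separate
// vault thread consumes them:
//     routing thread:  event_tx.send(event)
//     vault thread:    for event in event_rx { vault.handle(event) }
//
// Suggested shape: the vault implements a handler trait that routing calls
// directly from its own event loop.
trait VaultHandler {
    fn handle_request(&mut self, request: String);
}

struct Vault;

impl VaultHandler for Vault {
    fn handle_request(&mut self, request: String) {
        // cheap bookkeeping only; anything expensive is handed off elsewhere
        let _ = request;
    }
}

fn routing_event_loop<H: VaultHandler>(vault: &mut H) {
    // ...decode and verify a message, then call straight into the vault:
    vault.handle_request(String::from("Put(chunk)"));
}

fn main() {
    let mut vault = Vault;
    routing_event_loop(&mut vault);
}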

5 Likes

To me this is the most interesting thread on this forum at the moment. There is clearly an issue with performance of the vaults in the wild. I was wondering if this was deliberate as part of the test process, but the lack of response from anyone in the know when this issue has been brought up leads me to suspect otherwise. I think you are onto something. Good luck with your future experiments. Wish I had the technical nous to help out.

1 Like

From the TEST8 thread:

1 Like

You don’t say whether this is at the beginning of the testnet or later. That’s relevant because the behaviour of the testnet changed radically over the days and weeks of its existence, from quick launcher responsiveness to eventual non-responsiveness. Without that time dimension it is worthless.

The tests are being done on a private network, as I understand it, so it’s unrelated to the testnets.

2 Likes

@bluebird:

Without that time dimension it is worthless.

Hardly. Keep up the amazing and informative work @mav!

5 Likes

This comment was not referring to TEST 8.

@Krishna_Kumar just edited his comment:

Edit: Not the network/testnet right now. I was referring to testnet before alpha release to emulate delays of slow network and handle the same from launcher.

See this post:

Sorry for the confusion :slight_smile:

5 Likes

The test is not being done on testnet because there are too many uncontrolled variables. I’m doing these tests on a private network. Here’s how I’m testing:

Hardware

7 x Pine64, Quadcore, 2GB RAM, Gigabit Ethernet
64GB Kingston MicroSD
Gigabit Switch
Cat6 Cable

Software

Debian Jessie aarch64 headless
Vault 0.10.5
Routing 0.25.1
Launcher 0.7.1
Demo App 0.5.1
Group Size 3
Quorum Size 2
4 vaults per device for a total of 28 vaults

Methodology

  1. Build, distribute and start vaults, launcher and demoapp
  2. Upload 686 MB file - ubuntu-16.04-server-amd64.iso
  3. Retrieve logs
  4. Wipe all vaults and vault data
  5. Aggregate and analyze logs
  6. Customize vaults, launcher and/or demoapp based on results
  7. Repeat

Hopefully this adds clarity to the previous posts.

10 Likes

Your superlatives and back-patting are not a reasoned argument; you’re sounding no different from @bigbtc.

Please tell us how it is “amazing and informative.” No hurry, I can wait.

In case anyone thinks my criticisms are personal: I am merely stating what outsiders would say, which you’ll hear sooner or later, and I have no interest in seeing such research discontinued. Might as well hear it now rather than later when we have more publicity.

Thanks for the clarification; I was wondering whether it was real, a simulation, or whether you were dipping into the testnet. But even on a private network, the lack of a time series showing how it might evolve over time makes it largely irrelevant to the current stage of development. SAFEnet testnets nearly always start out promising and then go downhill. Does your network do that? If so, please point me to where you describe that.

This topic is about profiling vault performance. I removed several personal back-and-forth posts. Please keep on-topic.

8 Likes

Chill @bluebird, this kind of emotion leads to missing stuff. I find this info helpful and I kinda design these things; Andreas finds it helpful and he is the team leader of Routing (and a very exact thinker). Perhaps that is good enough? Then if others in the community can see benefit, it’s even better.

If you don’t, then that’s also cool. If you see an error, then perhaps be specific in helping identify it, as unemotionally as possible, and we will all gain.

It’s easy to ignore something that is very important otherwise.

16 Likes

Yes it does do that, as stated in a previous post:

There’s also a chart and logs showing the slowdown.

I understand you’re referring to longer timespans and not just the course of a single file, but my test doesn’t yet go past a single file upload, so it doesn’t make sense to test longer than that. You’re just a couple of steps ahead by trying to understand testnet problems. I’m not there yet!

Please read the posts. My previous response to you about the test environment was also covered in an earlier post. I’ve also asked repeatedly at the end of each one for suggestions about how to improve the testing, but so far I have yet to see any offered by you. If you have suggestions for how to improve the performance or testing of vaults, I’d be keen to hear them. Otherwise I’m afraid I’m a bit unclear as to the purpose of your posts.

8 Likes

As I dug further into routing, it became important to explore and understand the message-processing queue.

These stats are generated using the same test as before, uploading a single 655 MiB file (I was using 686 MB earlier; I am now switching to MiB since chunks are measured in MiB).

Routing Events

Hop messages dominate incoming routing events. Here’s the frequency of received events, measured at routing:state_machine.rs:L293:handle_event.

Total events: 844663
                            Source   Duration (s)     Count
                      Message::Hop   11623.452889    771594
           Action::NodeSendMessage    1155.635871     10202
     Message::Hop::UserMessagePart     627.514541     33307
       CrustEvent::BootstrapAccept      27.649895       689
                   Action::Timeout      21.810442       470
                 Message::Hop::Ack       8.026048       728
                   Message::Direct       3.038348       607
                      Action::Name       2.441469     22900
        CrustEvent::ConnectSuccess       1.775223       239
              CrustEvent::LostPeer       1.740036       741
                           Unknown       1.655566       939
CrustEvent::ConnectionInfoPrepared       1.652124       247
                Action::CloseGroup       0.175473      1993
        CrustEvent::ConnectFailure       0.001992         4
       CrustEvent::ListenerStarted       0.000365         1
                 Action::Terminate       0.000072         2
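
(These counts and durations are aggregated from per-event log lines; the bookkeeping amounts to something like the following sketch, rather than the exact instrumentation used:)

use std::collections::HashMap;
use std::time::{Duration, Instant};

// Accumulate a (total duration, count) pair per event label.
struct EventStats {
    totals: HashMap<String, (Duration, u64)>,
}

impl EventStats {
    fn new() -> EventStats {
        EventStats { totals: HashMap::new() }
    }

    fn record(&mut self, label: &str, elapsed: Duration) {
        let entry = self
            .totals
            .entry(label.to_string())
            .or_insert((Duration::new(0, 0), 0));
        entry.0 += elapsed;
        entry.1 += 1;
    }
}

fn main() {
    let mut stats = EventStats::new();
    let start = Instant::now();
    // ...handle_event(event) would run here...
    stats.record("Message::Hop", start.elapsed());

    for (label, &(total, count)) in &stats.totals {
        println!("{:<35} {:>15?} {:>9}", label, total, count);
    }
}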

Hop Frequency

Hops account for about 95% of all received events. Hop messages are heavy on signature verification.

Considering the file is about 655 chunks, there’s an average of about 1200 hop messages per chunk!

With a group size of 3 and quorum size of 2, how many hops should be expected? I need to get a better understanding of the architecture to answer that; maybe someone can shed some light?

Crypto Workload

My hope is to understand how much reduction in upload time may be possible if crypto signing / verification / hashing were offloaded to some mythical ‘instant’ hardware module. This may give some idea of how much non-crypto overhead there is to address.

Signature Generation
Total events: 2101191
Total duration: 1060.10999838
                            Source   Duration (s)     Count
                     HopMessageNew     677.496887   1193402
                  SignedMessageNew     382.492785    907452
                  SendNodeIdentify       0.095010       287
                SendClientIdentify       0.025316        50


Signature Verification
Total events: 806283
Total duration: 672.703389051
                            Source   Duration (s)     Count
                  HopMessageVerify     672.454186    805927
              VerifySignedPublicId       0.208050       302
        StructuredDataCheckAllKeys       0.041153        54


Hashing
Total events: 591238
Total duration: 469.917269425
                            Source   Duration (s)     Count
                  GroupMessageHash     359.539391    586913
               ImmutableDataDecode     110.364227      3930
                       PublicIdNew       0.003636        44
          HandleGetNodeNameRequest       0.003110       101
             Authority::ClientName       0.002464        69
            CalculateRelocatedName       0.002045       101
           HandleBootstrapIdentify       0.001315        43
              HandleClientIdentify       0.000965        36
                     NodeFirstName       0.000116         1

The total time spent doing crypto operations is about 36m. But this is deceptive since many of these operations will be / should be performed in parallel, so it’s difficult to compare directly with the total upload time.

Best case

Assume the best-case scenario (ie most benefit) for eliminating time spent on crypto.

This scenario happens if all crypto operations are synchronous. Thus the total amount of upload time spent on crypto would be about 36m ((1060 + 672 + 469) / 60).

There would still be at least 24m upload time (60m-36m) even if crypto was instant. So there is definitely room for improvement on the handling of messages.

Worst case

Assume the worst-case scenario (ie least benefit) for eliminating time spent on crypto.

This scenario happens if all crypto operations happen asynchronously across all 28 vaults at the same time. Thus the total amount of time spent on crypto would be about 1m 18s ((1060 + 672 + 469) / 60 / 28).

There would still be about 60m upload time for this 655 MiB file.

Crypto Conclusion

During the 60m upload there is somewhere between 1m and 36m spent on ‘unavoidable’ crypto operations. The exact amount depends on the degree of parallelism between vaults.

These stats look at the network as a whole, so there may be additional insight to be gained by looking at the load on individual vaults; e.g. some nodes have a much higher load than others, so they should be a target of further analysis.

On a related note, RFC-43 async safe_core seems to be aimed at improving processing speed, which is very exciting.

Lower Bound

For comparison, simply copying the file over the network takes about 45s (using scp). That means roughly 70 times more time is spent on ‘stuff that isn’t client upload’ due to the operation of the SAFE network.

Measuring Improvements

Performance can be improved with some simple changes. I’m not suggesting these changes be incorporated into production, simply trying to pinpoint where the bottlenecks exist and their magnitude.


When both changes were incorporated, the test was slower than expected, which seems to be due to variation between identical tests. It would be best to perform tests on a deterministic network (which would require big changes to the codebase) or to run each test multiple times and compare averages / standard deviations (see next steps below).


In summary, most of what I have investigated may not be directly applicable to production vaults, but the numbers do indicate areas that may be of interest. Such areas include:

  • save files to disk as a background process (see the sketch after this list)
  • the message architecture may be optimized (i.e. independently of the speed of the underlying crypto operations)
  • hop messages (and the signing they require) may be optimized, since they top every other message type in both volume and duration.
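
To illustrate the first point, here is a minimal sketch of handing chunk writes to a background thread so the event-handling path never blocks on disk I/O (illustrative only — the file names and types below are made up, not the vault’s chunk store code):

use std::fs::File;
use std::io::Write;
use std::sync::mpsc::channel;
use std::thread;

fn main() {
    // Channel of (file name, chunk bytes) pairs.
    let (tx, rx) = channel::<(String, Vec<u8>)>();

    // The writer thread owns all disk I/O.
    let writer = thread::spawn(move || {
        for (name, bytes) in rx {
            let mut file = File::create(&name).expect("create chunk file");
            file.write_all(&bytes).expect("write chunk");
        }
    });

    // In the vault this send would happen inside the Put handler, which can
    // then return immediately instead of waiting for the disk.
    tx.send(("chunk_0001".to_string(), vec![0u8; 1024 * 1024])).unwrap();

    drop(tx);               // close the channel so the writer can finish
    writer.join().unwrap();
}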

My next steps are:

  • upgrade to newer versions and compare performance with old versions
  • test how much variation there is by running the same test multiple times
  • consider a set of metrics that might be applied to production vaults to indicate bottlenecks under real-world loads. This will hopefully result in optimizing for real usage rather than the contrived scenarios I have been exploring so far.

11 Likes

With a group size of 3 and quorum size of 2, how many hops should be expected?

Currently routing splits every 1 MB chunk into 50 individual messages, and every group member sends a message for each chunk (either the chunk itself or its hash), which would be a factor of 150 for messages sent by a group, and a factor of 50 for messages sent by an individual node.
Then each chunk gets routed, so the number of hops would be the average route length (~ log2(network size)) times 150.
Every message gets an acknowledgement, so you’d have to roughly double that number again.

And then e.g. a Put request goes from the client to the client manager and from there to the data manager, so that would be one node message and one group message, causing 400 times the average route length in hops, I think.
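
(Putting rough numbers on that, purely as a back-of-the-envelope sketch for the 28-vault test network above:)

fn main() {
    let network_size: f64 = 28.0;             // vaults in this test network
    let route_len = network_size.log2();      // ~ average route length, per above
    let parts_per_chunk = 50.0;               // messages per 1 MB chunk
    let node_factor = parts_per_chunk;        // client -> client manager
    let group_factor = 3.0 * parts_per_chunk; // client manager group -> data manager
    let hops = (node_factor + group_factor) * route_len * 2.0; // x2 for acks
    println!("~{:.0} hops per chunk", hops);  // roughly 1900 for 28 vaults
}

That is the same ballpark as the ~1200 hop messages per chunk observed above.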

But there are also other messages sent by routing itself to learn about new nodes, establish connections etc., that have nothing to do with the data itself.

(I know, there’s a lot of room for obvious improvements here. But we have to make the network fully secure and resilient first, before we start implementing optimisations.)

7 Likes

This post is about testing variability and repeatability.

Software Versions

Vault 0.11.0
Routing 0.23.4
Launcher 0.8.0
DemoApp 0.6.0
SafeCore 0.19.0

Changes from default operation

group size: 3
quorum size: 2
upload / storage limits: extremely large
remove one-vault-per-lan restriction

Methodology

  • Load and start 28 vaults on a network of 7 pine64s.
  • Create an account using random password / secret.
  • Upload 655 MiB file via the demo app (ubuntu-16.04-server-amd64.iso).
  • Record the timing of the upload.
  • Stop and delete the vaults.
  • Reboot pine64s and repeat for a total of ten identical tests.

Results

Test  Time (m)
   1      59.5
   2      59.6
   3      54.7
   4      55.5
   5      54.6
   6      52.5
   7      65.6
   8      55.4
   9      51.6
  10      59.9

Min: 51.6
Max: 65.6
Average: 56.9
Median: 55.4
Standard Deviation: 4.2
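
(For reference, these summary figures can be reproduced directly from the table; the standard deviation appears to be the sample (n − 1) form — a quick sketch:)

fn main() {
    let times = [59.5, 59.6, 54.7, 55.5, 54.6, 52.5, 65.6, 55.4, 51.6, 59.9_f64];
    let n = times.len() as f64;
    let mean = times.iter().sum::<f64>() / n;
    let var = times.iter().map(|t| (t - mean).powi(2)).sum::<f64>() / (n - 1.0);
    let mut sorted = times.to_vec();
    sorted.sort_by(|a, b| a.partial_cmp(b).unwrap());
    println!("mean {:.1}  median ~{:.1}  std dev {:.1}", mean, sorted[4], var.sqrt());
}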


I was quite surprised by the degree of variation, considering the network is a completely isolated / controlled environment. Factors that may contribute to this variation are

  • arrangement of nodes relative to each other due to the randomized naming process
  • entry point to the network for the client due to the randomized login credentials
  • message routes and message queue length, thus processing demand and delays due to blocking
  • processing load due to the ‘heavy’ processing nodes being on different vs the same pine64
  • the Edimax ES-5800M V2 switch being used has three different priorities depending on the physical port on the switch.

Factors that probably do not contribute to variability are

  • RAM vs swap - the 2GB of RAM per pine64 is never fully consumed
  • disk speed - all devices have the same brand / model of microsd
  • network speed - network cables are the same length and brand of cat6
  • churn - there should be no network churn during the upload since vault names are the same at the start and end of the test
  • other running processes - the devices are dedicated to this test with no other processes running (except those needed to keep the OS running, of course!)

It’s not entirely clear why there is so much variation. I assume it’s mainly due to differences in the vault names, and thus the topology that messaging must negotiate, but it’s hard to know without measuring.

The main takeaway for me is that the effect of changes to the codebase should be measured using averages over multiple tests, since the error on a single test may be quite significant (much much larger than I initially thought).


I did a second test where the file was uploaded, deleted, then re-uploaded multiple times. These tests also show an unusual amount of variation. In this case the vault and client naming is identical between runs, so the messaging patterns between vaults should be very close, if not identical. Yet there was still significant variation.

In summary, there’s much less consistency in upload time than I would have expected, which must be considered when measuring the effect of changes to the codebase.

8 Likes

Is there any advantage to this variability? Can it be used in any way?