Profiling node performance

@bluebird:

Without that time dimension it is worthless.

Hardly. Keep up the amazing and informative work @mav!

5 Likes

This comment was not referring to TEST 8.

@Krishna_Kumar just edited his comment:

Edit: Not the network/testnet right now. I was referring to testnet before alpha release to emulate delays of slow network and handle the same from launcher.

See this post:

Sorry for the confusion :slight_smile:

5 Likes

The test is not being done on testnet because there are too many uncontrolled variables. Iā€™m doing these tests on a private network. Hereā€™s how Iā€™m testing:

Hardware

7 x Pine64, Quadcore, 2GB RAM, Gigabit Ethernet
64GB Kingston MicroSD
Gigabit Switch
Cat6 Cable

Software

Debian Jessie aarch64 headless
Vault 0.10.5
Routing 0.25.1
Launcher 0.7.1
Demo App 0.5.1
Group Size 3
Quorum Size 2
4 vaults per device for a total of 28 vaults

Methodology

  1. Build, distribute and start vaults, launcher and demoapp
  2. Upload 686 MB file - ubuntu-16.04-server-amd64.iso
  3. Retrieve logs
  4. Wipe all vaults and vault data
  5. Aggregate and analyze logs
  6. Customize vaults, launcher and/or demoapp based on results
  7. Repeat

Hopefully this adds clarity to the previous posts.

10 Likes

Your superlatives and back-patting are not a reasoned argument; youā€™re sounding no different from @bigbtc.

Please tell us how it is ā€œamazing and informative.ā€ No hurry, I can wait.

In case anyone thinks my criticisms personal: I am merely stating what outsiders would, that youā€™ll hear sooner or later, and have no interest in seeing such research discontinued. Might as well hear it now rather than later when we have more publicity.

Thanks for the clarification; I was wondering whether it was real, a simulation, or that you were dipping into the testnet. But even if a private network, the lack of a time series, how it might evolve over time, makes it largely irrelevant to the current stage of development. SAFEnet testnets nearly always start out promising and then always go downhill. Does your network do that? If so, please point me to where you describe that.

This topic is about Profiling vault performance. I removed several personal back and forth between people. Please keep on-topic.

8 Likes

Chill @bluebird , this kind of emotion leads to missing stuff. I find this info helpful and I kinda design these things, Andreas finds it helpful and he is the team leader of Routing (and a very exact thinker), perhaps that is good enough? then if others in the community can see benefit itā€™s even better.

if you donā€™t then thatā€™s also cool. If you see error then perhaps be specific in helping identify the error, if any and as unemotional as possible, then we will all gain.

Itā€™s easy to ignore something that is very important otherwise.

16 Likes

Yes it does do that, as stated in a previous post:

Thereā€™s also a chart and logs showing the slowdown.

I understand youā€™re referring to longer time-spans and not over the course of just a single file, but my test canā€™t pass a single file upload so it doesnā€™t make sense to test longer than that yet. Youā€™re just a couple of steps ahead by trying to understand testnet problems. Iā€™m not there yet!

Please read the posts. My previous response to you about the test environment was also covered in an earlier post. Iā€™ve also asked repeatedly at the end of each one for suggestions about how to improve the testing, but so far I have yet to see any offered by you. If you have suggestions for how to improve the performance or testing of vaults, Iā€™d be keen to hear it. Otherwise Iā€™m afraid Iā€™m a bit unclear as to the purpose of your posts.

8 Likes

As I dug further into routing it was important to explore and understand the message processing queue.

These stats are generated using the same test as before, uploading a single 655 MiB file (I was using 686 MB earlier; I am now switching to MiB since chunks are measured in MiB).

Routing Events

Hop messages dominate incoming routing events. Hereā€™s the frequency of received events, measured at routing:state_machine.rs:L293:handle_event.

Total events: 844663
                            Source   Duration (s)     Count
                      Message::Hop   11623.452889    771594
           Action::NodeSendMessage    1155.635871     10202
     Message::Hop::UserMessagePart     627.514541     33307
       CrustEvent::BootstrapAccept      27.649895       689
                   Action::Timeout      21.810442       470
                 Message::Hop::Ack       8.026048       728
                   Message::Direct       3.038348       607
                      Action::Name       2.441469     22900
        CrustEvent::ConnectSuccess       1.775223       239
              CrustEvent::LostPeer       1.740036       741
                           Unknown       1.655566       939
CrustEvent::ConnectionInfoPrepared       1.652124       247
                Action::CloseGroup       0.175473      1993
        CrustEvent::ConnectFailure       0.001992         4
       CrustEvent::ListenerStarted       0.000365         1
                 Action::Terminate       0.000072         2

Hop Frequency

Hops account for about 95% of all received events. Hop messages are heavy on signature verification.

Considering the file is about 655 chunks, thereā€™s an average of about 1200 hop messages per chunk!

With a group size of 3 and quorum size of 2, how many hops should be expected? I need to get a better understanding of the architecture to answer that; maybe someone can shed some light?

Crypto Workload

My hope is to understand how much reduction in upload time may be possible if crypto signing / verification / hashing are offloaded to some mythical ā€˜instantā€™ hardware module. This may give some idea of how much non-crypto overhead is possible to address.

Signature Generation
Total events: 2101191
Total duration: 1060.10999838
                            Source   Duration (s)     Count
                     HopMessageNew     677.496887   1193402
                  SignedMessageNew     382.492785    907452
                  SendNodeIdentify       0.095010       287
                SendClientIdentify       0.025316        50


Signature Verification
Total events: 806283
Total duration: 672.703389051
                            Source   Duration (s)     Count
                  HopMessageVerify     672.454186    805927
              VerifySignedPublicId       0.208050       302
        StructuredDataCheckAllKeys       0.041153        54


Hashing
Total events: 591238
Total duration: 469.917269425
                            Source   Duration (s)     Count
                  GroupMessageHash     359.539391    586913
               ImmutableDataDecode     110.364227      3930
                       PublicIdNew       0.003636        44
          HandleGetNodeNameRequest       0.003110       101
             Authority::ClientName       0.002464        69
            CalculateRelocatedName       0.002045       101
           HandleBootstrapIdentify       0.001315        43
              HandleClientIdentify       0.000965        36
                     NodeFirstName       0.000116         1

The total time spent doing crypto operations is about 36m. But this is deceptive since many of these operations will be / should be performed in parallel, so itā€™s difficult to compare directly with the total upload time.

Best case

Assume the best-case scenario (ie most benefit) for eliminating time spent on crypto.

This scenario happens if all crypto operations are synchronous. Thus the total amount of upload time spent on crypto would be about 36m (1060+672+469)/60.

There would still be at least 24m upload time (60m-36m) even if crypto was instant. So there is definitely room for improvement on the handling of messages.

Worst case

Assume the worst-case scenario (ie least benefit) for eliminating time spent on crypto.

This scenario happens if all crypto operations happen asynchronously across all 28 vaults at the same time. Thus the total amount of time spent on crypto would be about 1m18 (1060+672+469)/60/28.

There would still be about 60m upload time for this 655 MiB file.

Crypto Conclusion

During the 60m upload there is somewhere between 1m and 36m spent on ā€˜unavoidableā€™ crypto operations. The exact amount depends on the degree of parallelism between vaults.

These stats look at the network as a whole, so there may be additional insight gained by looking at loads for individual vaults, eg some nodes have much higher load than others, so should be target of further analysis.

On a related note, RFC-43 async safe_core seems to be aimed at improving processing speed, which is very exciting.

Lower Bound

For comparison, simply copying the file over the network takes about 45s (using scp). That means thereā€™s about 70-fold more time spent on ā€˜stuff that isnā€™t client uploadā€™ due to the operation of the safe network.

Measuring Improvements

Performance can be improved with some simple changes. Iā€™m not suggesting these changes be incorporated into production, simply trying to pinpoint where the bottlenecks exist and their magnitude.

Full Sized Image

When both changes are incorporated the test was slower than expected, which seems to be due to variation between identical tests. It would be best to perform tests on a deterministic network (which requires big changes to the codebase) or multiple times with average / standard deviation (see next steps below).


In summary, most of what I have investigated may not be directly applicable to production vaults, but the numbers do indicate areas that may be of interest. Such areas include:

  • save files to disk as a background process
  • message architecture may be optimized (ie regardless of the speed of the underlying crypto operations)
  • hop messages (and the required signing) may be optimized since theyā€™re the highest message type by both volume and duration.

My next steps are

  • upgrade to newer versions and compare performance with old versions
  • test how much variation there is by running the same test multiple times
  • consider a set of metrics that might be applied to production vaults to indicate bottlenecks with real-world loads. This will hopefully result in optimizing for real usage and not contrived scenarios as I have been exploring so far.
11 Likes

With a group size of 3 and quorum size of 2, how many hops should be expected?

Currently routing splits every 1 MB chunk into 50 individual messages, and every group member sends a message for each chunk (either the chunk itself or its hash), which would be a factor of 150 for messages sent by a group, and a factor of 50 for messages sent by an individual node.
Then each chunk gets routed, so the number of hops would be the average route length (~ log2(network size)) times 150.
Every message gets an acknowledgement, so youā€™d have to roughly double that number again.

And then e.g. a Put requests goes from the client to the client manager and from there to the data manager, so that would be one node message and one group message, causing 400 times average route length hops, I think.

But there are also other messages sent by routing itself to learn about new nodes, establish connections etc., that have nothing to do with the data itself.

(I know, thereā€™s a lot of room for obvious improvements here. But we have to make the network fully secure and resilient first, before we start implementing optimisations.)

7 Likes

This post is about testing variability and repeatability.

Software Versions

Vault 0.11.0
Routing 0.23.4
Launcher 0.8.0
DemoApp 0.6.0
SafeCore 0.19.0

Changes from default operation

group size: 3
quorum size: 2
upload / storage limits: extremely large
remove one-vault-per-lan restriction

Methodology

  • Load and start 28 vaults on a network of 7 pine64s.
  • Create an account using random password / secret.
  • Upload 655 MiB file via the demo app (ubuntu-16.04-server-amd64.iso).
  • Record the timing of the upload.
  • Stop and delete the vaults.
  • Reboot pine64s and repeat for a total of ten identical tests.

Results

Test  Time (m)
   1      59.5
   2      59.6
   3      54.7
   4      55.5
   5      54.6
   6      52.5
   7      65.6
   8      55.4
   9      51.6
  10      59.9

Min: 51.6
Max: 65.6
Average: 56.9
Median: 55.4
Standard Deviation: 4.2


I was quite surprised by the degree of variation, considering the network is a completely isolated / controlled environment. Factors that may contribute to this variation are

  • arrangement of nodes relative to each other due to the randomized naming process
  • entry point to the network for the client due to the randomized login credentials
  • message routes and message queue length, thus processing demand and delays due to blocking
  • processing load due to the ā€˜heavyā€™ processing nodes being on different vs the same pine64
  • the Edimax ES-5800M V2 switch being used has three different priorities depending on the physical port on the switch.

Factors that probably do not contribute to variability are

  • ram vs swap - the 2GB of ram per pine64 is never fully consumed
  • disk speed - all devices have the same brand / model of microsd
  • network speed - network cables are the same length and brand of cat6
  • churn - there should be no network churn during the upload since vault names are the same at the start and end of the test
  • other running processes - the devices are dedicated to this test with no other processes running (except to keep the os running of course!)

Itā€™s a little confusing why there is so much variation. I assume this is mainly due to differences in the vault names and thus the topology that messaging must negotiate, but itā€™s hard to know without measuring.

The main takeaway for me is that the effect of changes to the codebase should be measured using averages over multiple tests, since the error on a single test may be quite significant (much much larger than I initially thought).


I did a second test where the file was uploaded, deleted, then reuploaded multiple times. These tests also shows an unusual amount of variation. In this test the vault and client naming is identical between tests, so the messaging patterns between vaults should be very close if not identical. Yet there was still significant variation.

In summary, thereā€™s much less consistency in upload time than I would have expected, which must be considered when measuring the effect of changes to the codebase.

8 Likes

Is there any advantage to this variability? Can it be used in any way?

The creative in me says yes, there are ways it may be used.

However the engineer in me says no. Randomness (when required) should be taken from a source with a known quality of randomness. Since this variability may (will?!) be reduced in the future, it should not be used as a source of randomness.

Sorry for late reply, Iā€™ve been away :slight_smile: :palm_tree:

2 Likes

That makes a lot of sense to me. Along that line we are aware that SAFE has min number of nodes threshold for security- along the same line and above that threshold or above such a measureable and consistently knowable quality of randomness could this randominity be useable and potentially reliable? I understand we want to eliminate it and donā€™t want to rely on it as that could be a point of breakage or vulnerability. Still, if Ive understood correctly new approaches use noise as a channel and it makes me also wonder about noise as a clock or time signature. Say an index of the networkā€™s potentially unique noise gathered sequentially as clock. All the nodes of another network might have a tough time being in the same place at the same time. The networkā€™s biometric print across time? Could be hard for another net to spoof or estimate this to internal precision.

1 Like

This test focuses on the effect of group size and quorum size on upload time and network performance.

Summary

As usual, the results first:

Upload time increases dramatically as group size increases.

Upload time does not change significantly as quorum size increases.

Methodology

This test uses the same software versions as the last test, primarily centred around testing vault 0.11.0, only modified for additional logging.

The file uploaded for these tests is Hippocampus (downloadable from vimeo), and is 96.2 MiB.

  • Compile vault with custom group and quorum size
    • Deploy and start 28 vaults on 7 pine64s
    • Wait for network to be formed
    • Upload 96.2 MiB file using demo_app
    • Record the time to upload and the number of hop messages received by vaults
    • Stop vaults and remove old logs
    • Repeat 5 times with the same group/quorum size
  • Repeat with a new group/quorum size

The tests spanned group sizes from 4 to 14 and quorum sizes from 2 to 12.

The results below show the median from five repetitions.

Upload Times

How many minutes did it take to fully upload the 96.2 MiB file? (Iā€™ll put a dynamic table here if the forum can allow <table> tags)

Hop Messages

How many million hop messages were received by all vaults? (Iā€™ll put a dynamic table here if the forum can allow <table> tags)

Variation Due To Group Size

The effect of different group sizes can be observed by keeping the quorum size the same. There is a strong increase in upload time and the number of hop messages as the group size increases.

Variation Due To Quorum Size

There is relatively little increase in upload time as the quorum size increases, but hop count does seem to increase slightly (but not as much effect as changing group size).

Raw Data

The raw log data for all tests is available on the alpha network at http://www.mav.safenet/2016_10_10_group_size_test.7z. Download is 80 MiB, and when decompressed is about 4.8 GiB. For a more permanent link, itā€™s also available on mega.

Discussion

These results seem to match intuition, but the increase in upload time was more than I expected.

Itā€™s reassuring to see quorum size only affects security / durability of data and is not a factor in network scaling (as expected).

Iā€™m not sure how disjoint groups will perform, since there will be significant variation in group size. Not saying itā€™ll be bad, just saying I donā€™t know.

As @AndreasF pointed out in a previous post about expected number of hops, there is room for improvement on hops but security must come first. Will disjoint groups improve the hop messages situation?

Does the small size of my network (28 vaults) affect the results? I wouldnā€™t think so, since any node only sees part of the network anyhow, but Iā€™m open to speculation about this.

Asides

  1. To compare with the alpha network (group 8 quorum 5), it took 8.5m to upload the same 96.2 MiB file to the alpha network. This is an average rate of 1.55 Mbps. speedtest.net shows my connection can reach 35.16 Mbps upload, so it wasnā€™t saturated. I reckon thatā€™s pretty good performance, at least compared to the pine64 network which was about 20.5m for similar group and quorum sizes. Iā€™d say this is due to two main differences: alpha has more nodes, and the nodes are (presumably) on more powerful machines than the pine64s Iā€™m using.

    This result was a surprise to me. I expected alpha to be slower than my dedicated network. Shows that cpu performance matters when latency is low.

  2. Received hop message count includes those required for the network to start up and registering an account etc (the point being this was consistent across all tests).

  3. There was a huge variation within the results of each configuration, which is still very surprising to me. Hover over any of the cells in the tables above to see what I mean (hovering shows the data for all five tests).

  4. The last 1% (going from 99% to 100%) always takes much longer than any other. If a normal percent passes in 25s the last percent will pass in about 400s. What is happening in that last 1% that takes so long? Itā€™s very frustrating!

  5. The first 10% (going from 0% to 10%) is almost instant (300ms). I donā€™t know whether this is because the progress bar is faulty in the demo app or if the vaults are really saving 10% of the file very quickly.

  6. The amount of time for the network to start (ie populate the routing table) increased as group size increased. I didnā€™t measure it, but subjectively it was quite noticable.

  7. There appears to be a loose correlation between upload time and hop message count. This wasnā€™t the point of the test, but anyway hereā€™s the chart for anyone curious:

  8. The Disjoint Groups RFC has an interesting point in the drawbacks section related to these tests:

    The group message routing will involve more steps, either doubling the number of actual network hops, or making the number of direct messages per hop increase quadratically in the average group size!

    Although this seems to be addressed in the Test 11 update:

    we will implement the new group message routing mechanism. It will be slightly different from the one specified in the RFC, however, delivering the same level of security but without the huge increase in the number of hops or hop messages.

    Itā€™s still not clear how this will perform relative to the existing hop mechanism but it is being considered.

  9. There were some tests that did not complete and were discarded. This was either due to very slow bootstrapping of the network (mostly with large group sizes), or the demo app showing a Request Timeout error.

  10. The safe network is amazing. Seeing it come together is a real privilege.

Main Point

My main takeaway from these tests is the underlying performance characteristics of the safe network arise as a result of being primarily a messaging platform, with data storage being ā€˜merelyā€™ the end result of that. Messaging is the key. This conclusion is not surprising in hindsight, but the tests have shifted my balance of thought strongly toward messaging and away from data.

25 Likes

A fascinating read. We all owe you a big thanks for the time and work put in!

10 Likes

Awesome, thanks for the detailed analysis!
Most of the numbers are intuitive, although something weird seems to happen with quorum size 2, going from 8 to 10 nodes.

Disjoint groups are not about performance, and at first, they may well impact it negatively: the average group will probably be at least 150% of the minimum group size and weā€™re not working on optimisations yet. That RFC is about defining the web of trust of the node keys, and how to authenticate message senders.

But in many cases, those two things - the cryptographic web of trust and the actual web of TCP connections or who sends direct messages to whom - wonā€™t need to coincide, and thatā€™s where weā€™ll be able to significantly reduce the number of hop messages in the future: if A signs something for B it doesnā€™t matter what actual route that signature takes in the network. In theory, not even nodes in the same group have to be fully interconnected. The notion of ā€œconnectedā€ that the RFC really refers to is just ā€œhaving each otherā€™s public keyā€.

12 Likes

This test is around performance of vaults with structured data. I admit to being slightly provocative because it may be compared to the bitcoin transaction rate (which is currently under some heavy contention).

The test is to modify structured data as rapidly as possible. This should hopefully find the maximum transaction rate for my personal safe network (which isnā€™t especially powerful).

Motivation

SafeCoin will be implemented as a structured data, and is intended to scale very efficiently, both in cost to the end user and load to the network.

Since the cost to update structured data is zero, itā€™s worth investigating the impact of high loads on the network, and what the transaction rate of SafeCoin may be.

Setup

Using Test 11 software versions, which is required for low level apis.

  • safe_launcher 0.9.2
  • safe_core 0.22.0
  • safe_vault 0.12.1 (assumed from release date of 2016-10-19)
  • routing 0.27.1
  • custom script to modify structured data

Modifications were only to remove check for peers on LAN. Account limits donā€™t matter for this test, since updates to SD donā€™t count toward usage. Group size is 3 and Quorum size is 2.

Hardware for the network is same as all prior tests:

  • Network of 7 pine64s running 4 vaults each on gigabit ethernet
  • Client is laptop with quadcore i7-4500U @ 1.80GHz, 8GB RAM uploading on wifi at 300 Mbps

Method

The general idea is to rapidly modify structured data until an issue comes up.

The test script does the following:

  • Create a structured data containing 30 random characters
  • Measure the time taken to update that data X times with 30 new random characters
  • Repeat several times to obtain an average update rate

Results

Update Rate

How long does it take to update SD, and how many updates / second can be achieved?

Whether updating 1 time or updating 100 times in rapid succession, the rate was usually between 120 - 130 updates per second.

However there was some variation where up to 300 u/s were seen.

The good news is that increasing the load did not cause any increase in update rate (as expected).

CPU load across all vaults was negligible.

CPU load on the client was 100% of one core (presumably due to the single-threaded nature of the launcher and safe_core).

Maximum Updates and Failure Point

The maximum number of updates before an error showed was about 1000 updates in a row.

The error returned is simply ā€˜EOFā€™, so the launcher server seems to drop requests (ie not the vaults flaking out). This is good news in one way (the launcher is a relatively easy fix) but bad in others (the launcher should be more stable than this). The other aspect to this bottleneck is it makes it difficult to test the limits of vaults with respect to DOS using SD updates.

I had intended to test many concurrent requests to the launcher, but it threw errors with only a low synchronous load so didnā€™t see much point pushing it further.

Hopefully the revamped authenticator and safe_core will expose a more robust interface to the network.

Conclusions

The update rate for structured data is between 100 - 300 updates per second per user (on my private safe network).

The current bottleneck to increasing this is the launcher, which throws errors when the load becomes too high (around 1000 updates in a row).

It costs nothing to update structured data, and it can be updated at a fairly fast rate. This leads to the notion that some vault rate limiting may be required when dealing with structured data updates.

Iā€™m not terribly satisfied with this test, as it doesnā€™t really profile the vaults at all. But it satisfied a curiosity about the approximate order of magnitude to expect for structured data performance.

Asides

  • The launcher UI froze after running these tests. The server was still running but the tabs wouldnā€™t change. I assume this is from the logging tab which has a lot of work to do in a very short time. Once the UI locked it never recovered, needing to restart the launcher to make the UI work again.
  • The value of the SD was not checked, so itā€™s assumed it was updated to contain the correct value and there were no race conditions. This is a pretty big assumption, but these details were outside the scope of this test.
  • I didnā€™t test this on the live test 11 network.
  • The rates in this test are for a single user, and the overall network transaction rate would presumably be much higher with many concurrent users. I didnā€™t get into modifying the launcher to test this, mainly due to the imminent overhaul to safe_launcher > authenticator and structured_data > mutable_data.
  • Other simultaneous upload activity such as immutable data may affect this rate in the real world.
26 Likes

You are doing absolute great work for SAFE. Thank you very much :+1:.

Is it safe to say that each Disjoint Group (say between 8 and 20 nodes) can handle at least 100 Ts/sec.? And does this mean that we could scale to 10.000 Ts/sec. with 100 groups?? Like you say, it all depends on how many other structured data objects need to be handled by that group as well. but this looks promising.

7 Likes

In this test the group size is 3 and quorum size is 2 and with bigger group size we can expect that the number of SD(MD) updates will be lower because we need more messages between nodes to reach consensus.

But even if we divide the final number of transactions by 10 or 20, a single disjoint sector (group) is capable of more transactions than the whole bitcoin network. Multiply by hundreds, thousands or millions of sectors, which the network is able to support, and the network capacity is simply amazing.

10 Likes