Profiling node performance

This post is about the variability and repeatability of upload times on the test network.

Software Versions

Vault 0.11.0
Routing 0.23.4
Launcher 0.8.0
DemoApp 0.6.0
SafeCore 0.19.0

Changes from default operation

group size: 3
quorum size: 2
upload / storage limits: extremely large
remove one-vault-per-LAN restriction

Methodology

  • Load and start 28 vaults on a network of 7 pine64s.
  • Create an account using random password / secret.
  • Upload a 655 MiB file (ubuntu-16.04-server-amd64.iso) via the demo app.
  • Record the timing of the upload (a rough harness sketch follows this list).
  • Stop and delete the vaults.
  • Reboot the pine64s and repeat, for a total of ten identical tests.
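
Each iteration roughly follows the loop sketched below. This is only an illustrative sketch: start_vaults, create_account, upload_file and teardown are hypothetical placeholders for the real vault, launcher and demo-app steps, not part of any actual tooling.

```python
import csv
import time

# Hypothetical placeholders for the real steps; they only mark where each
# part of the procedure above happens.
def start_vaults():
    """Load and start the 28 vaults across the 7 pine64s."""

def create_account():
    """Create an account using a random password / secret."""

def upload_file(path):
    """Upload the file via the demo app."""

def teardown():
    """Stop and delete the vaults, then reboot the pine64s."""

def run_test(test_number, path="ubuntu-16.04-server-amd64.iso"):
    start_vaults()
    create_account()
    started = time.monotonic()
    upload_file(path)                          # the only step that is timed
    minutes = (time.monotonic() - started) / 60
    teardown()
    with open("results.csv", "a", newline="") as f:
        csv.writer(f).writerow([test_number, round(minutes, 1)])
    return minutes

if __name__ == "__main__":
    for n in range(1, 11):                     # ten identical tests
        print(f"test {n}: {run_test(n):.1f} minutes")
```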

Results

Test  Time (m)
   1      59.5
   2      59.6
   3      54.7
   4      55.5
   5      54.6
   6      52.5
   7      65.6
   8      55.4
   9      51.6
  10      59.9

Min: 51.6
Max: 65.6
Average: 56.9
Median: 55.4
Standard Deviation: 4.2
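
For reference, these summary figures can be recomputed from the table with a few lines of Python; the 4.2 value matches the sample (n-1) standard deviation, while the population standard deviation would be about 4.0.

```python
import statistics

times = [59.5, 59.6, 54.7, 55.5, 54.6, 52.5, 65.6, 55.4, 51.6, 59.9]

print(f"Min: {min(times)}")                       # 51.6
print(f"Max: {max(times)}")                       # 65.6
print(f"Average: {statistics.mean(times):.1f}")   # 56.9
print(f"Median: {statistics.median(times)}")      # 55.45 (mean of the two middle values)
print(f"Std dev: {statistics.stdev(times):.1f}")  # 4.2 (sample); pstdev gives ~4.0
```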


I was quite surprised by the degree of variation, considering the network is a completely isolated / controlled environment. Factors that may contribute to this variation are

  • arrangement of nodes relative to each other due to the randomized naming process
  • entry point to the network for the client due to the randomized login credentials
  • message routes and message queue lengths, and thus processing demand and delays due to blocking
  • processing load, depending on whether the ‘heavy’ processing nodes land on the same or on different pine64s
  • the Edimax ES-5800M V2 switch in use has three different priorities depending on which physical port a device is connected to.

Factors that probably do not contribute to variability are

  • RAM vs swap - the 2 GB of RAM per pine64 is never fully consumed
  • disk speed - all devices use the same brand / model of microSD card
  • network speed - all network cables are the same length and brand of Cat 6
  • churn - there should be no network churn during the upload, since vault names are the same at the start and end of each test
  • other running processes - the devices are dedicated to this test, with no other processes running (except those needed to keep the OS running, of course!)

It’s a little puzzling why there is so much variation. I assume it’s mainly due to differences in the vault names, and thus the topology that messages must traverse, but it’s hard to know without measuring.

The main takeaway for me is that the effect of changes to the codebase should be measured using averages over multiple tests, since the error on a single test can be quite significant (much larger than I initially thought).
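
To put rough numbers on that: a single run typically lands within about one standard deviation (~4 minutes) of the mean, and occasionally much further, whereas the standard error of a ten-run average is only about 4.2 / √10 ≈ 1.3 minutes. A small sketch of that arithmetic, using the timings above:

```python
import math
import statistics

times = [59.5, 59.6, 54.7, 55.5, 54.6, 52.5, 65.6, 55.4, 51.6, 59.9]

mean = statistics.mean(times)       # ~56.9 minutes
sd = statistics.stdev(times)        # ~4.2 minutes (sample standard deviation)
sem = sd / math.sqrt(len(times))    # ~1.3 minutes: standard error of the 10-run mean

# Roughly 95% of ten-run averages would land within ~2 * SEM of the true mean
# (ignoring the small-sample t correction), versus ~2 * SD for a single run.
print(f"ten-run average: {mean:.1f} ± {2 * sem:.1f} minutes")
print(f"single run:      {mean:.1f} ± {2 * sd:.1f} minutes")
```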


I did a second test in which the file was uploaded, deleted, then reuploaded multiple times. These runs also showed an unusual amount of variation. Here the vault and client names are identical between uploads, so the messaging patterns between vaults should be very close, if not identical, yet there was still significant variation.

In summary, there’s much less consistency in upload time than I would have expected, which must be considered when measuring the effect of changes to the codebase.
