Profiling node performance

I wanted to see if nodes get in each other's way when running 11 nodes on 8 cores, so I fired up some 16-core AWS virtual machines, which should leave some cores free (although now that the node is multithreaded, that's probably not true?!)

In summary, yes, nodes do tend to get in the way of each other. Running 1 node per vCPU seems to be about the best ratio.

To clarify some terminology, a vCPU is not necessarily one CPU core. For example, my laptop has 4 physical cores, but with hyperthreading I get 2 vCPUs per physical core, so that's a total of 8 vCPUs. These show as 8 individual CPU graphs in Task Manager.
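For reference, the vCPU count an instance reports can be checked on Linux with something like:

```bash
# Number of vCPUs (logical processors) visible to the OS
nproc

# Sockets, cores per socket and threads per core
lscpu | grep -E '^(CPU\(s\)|Thread\(s\) per core|Core\(s\) per socket|Socket\(s\))'
```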

The basic test is (rough commands sketched below):

- Run baby fleming with node v0.49.8 and 11 nodes
- Upload a 10 MiB file using sn_cli v0.29.2
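Something along these lines; the exact sn_cli subcommands and flags vary between versions, so treat this as a sketch rather than the precise invocation:

```bash
# Create a 10 MiB test file
dd if=/dev/urandom of=test-10mib.bin bs=1M count=10

# Start a local baby-fleming network (11 nodes in this setup)
safe node run-baby-fleming

# Time the upload of the test file
time safe files put ./test-10mib.bin
```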

A1 ARM Processors

Firstly, the A1 series of processors. From AWS instance types: "A1 instances are the first EC2 instances powered by AWS Graviton Processors that feature 64-bit Arm Neoverse cores and custom silicon designed by AWS."

Since MaidSafe does not put out releases for the ARM architecture, I built the code on the first VM I started, then copied those binaries to each of the other VMs for the rest of the tests.
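The build-and-copy step was roughly the following (hostnames, key path and binary names are placeholders for whatever your instances actually use):

```bash
# On the first ARM VM: build release binaries from source
cargo build --release

# Copy the node binary to each of the other ARM VMs
# ($ARM_HOSTS and the key path are placeholders)
for host in $ARM_HOSTS; do
  scp -i ~/.ssh/aws-key.pem target/release/sn_node "ubuntu@$host:~/"
done
```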

| Type | Time (s) | vCPUs | RAM (GB) |
|---|---|---|---|
| a1.medium | >600 | 1 | 2 |
| a1.large | >600 | 2 | 4 |
| a1.xlarge | 82.705 | 4 | 8 |
| a1.2xlarge | 42.559 | 8 | 16 |
| a1.4xlarge | 31.138 | 16 | 32 |
| a1.metal | 29.645 | 16 | 32 |

A tangential observation: I could not build with musl on ARM. The ring crate was throwing an error; I didn't dig into it, but maybe one to look into later.

The command to try to build for aarch64 musl was:

```bash
cargo build --release --target aarch64-unknown-linux-musl
```
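For anyone wanting to retry this, my understanding is that the rough setup is adding the Rust target and pointing cargo at a musl-capable C cross compiler (ring builds C/assembly, which is probably where it fell over); the compiler name below is indicative only and depends on which cross toolchain is installed:

```bash
# Add the aarch64 musl target to the Rust toolchain
rustup target add aarch64-unknown-linux-musl

# Point the ring build and the linker at a musl cross compiler
# (aarch64-linux-musl-gcc is a placeholder for whatever musl toolchain you have)
export CC_aarch64_unknown_linux_musl=aarch64-linux-musl-gcc
export CARGO_TARGET_AARCH64_UNKNOWN_LINUX_MUSL_LINKER=aarch64-linux-musl-gcc

cargo build --release --target aarch64-unknown-linux-musl
```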

M6G ARM Processors

From AWS, M6g instances "deliver up to 40% better price/performance over current generation M5 instances and offer a balance of compute, memory, and networking resources for a broad set of workloads."

| Type | Time (s) | vCPUs | RAM (GB) |
|---|---|---|---|
| m6g.medium | >600 | 1 | 4 |
| m6g.large | >600 | 2 | 8 |
| m6g.xlarge | 63.871 | 4 | 16 |
| m6g.2xlarge | 35.680 | 8 | 32 |
| m6g.4xlarge | 25.345 | 16 | 64 |
| m6g.8xlarge | 25.091 | 32 | 128 |
| m6g.12xlarge | 25.315 | 48 | 192 |
| m6g.16xlarge | 25.512 | 64 | 256 |
| m6g.metal | ? | 64 | 256 |

I couldn't SSH into m6g.metal for some reason, so there's no result for it.

Once we get to 16+ vCPUs the time stays pretty stable, which shows that 11 nodes on 8 or fewer vCPUs is hitting some CPU bottlenecks.
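One way to see this directly while a test is running is to watch per-process CPU usage for the node processes (this assumes the node binary is called sn_node; pidstat comes from the sysstat package):

```bash
# Per-process CPU usage for all node processes, sampled every second
pidstat -u -C sn_node 1
```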

M5 X86 Processors

From AWS: "the latest generation of General Purpose Instances powered by Intel Xeon® Platinum 8175M processors. This family provides a balance of compute, memory, and network resources, and is a good choice for many applications."

| Type | Time (s) | vCPUs | RAM (GB) |
|---|---|---|---|
| m5.large | 113.684 | 2 | 8 |
| m5.xlarge | 57.282 | 4 | 16 |
| m5.2xlarge | 30.285 | 8 | 32 |
| m5.4xlarge | 17.201 | 16 | 64 |
| m5.8xlarge | 14.951 | 32 | 128 |
| m5.12xlarge | 13.844 | 48 | 192 |
| m5.16xlarge | 13.952 | 64 | 256 |
| m5.24xlarge | 13.719 | 96 | 384 |
| m5.metal | 14.125 | 96 | 384 |

And yes, all 96 cores are used, shown here:

Improved BLS lib

Following on from this post, which says "3. Integrate a faster threshold_crypto", I thought I'd see what that's like on the fastest platform, the m5.24xlarge.

On m5.24xlarge the new lib gives 8.633s vs 13.719s for the old one, quite a lot faster.

I also happened to test m5.metal, which gave 12.422s for the new lib vs 14.125s for the old one.
