Profiling node performance

Hi, @mav,

Thank you very much for the logs; they help me a lot.
Here is my understanding of why the CPU usage time shows such a clustered pattern:
1. 50_4 is confirmed to act as the accumulator of the ClientManager (CM) group.
It handled over 35k routing messages with a total size of 470MB, and stored 35 copies.
2. 51_3 is confirmed to act as the proxy to the client, and is also a member of the CM group.
It handled over 20k routing messages with a total size of 658MB, and stored 38 copies.
3. 50_2, 50_3, 53_2, 54_2, 54_3 and 55_1 are the other members of the CM group.
They each handled over 15k routing messages, with total sizes ranging from 4.5MB to 12MB, and stored 32 - 39 copies.
4. The other vaults handled fewer than 5k routing messages, and some stored only 16 copies.

The heavy load on 50_4, together with the medium load on 50_2 and 50_3, means these nodes affect each other: they share the same disk and network adapter, and because their names are close they handle almost the same chunks at the same time. As a result, their performance is much slower than that of the other nodes.
For example, a chunk's put operation took around 0.0134s to complete on 54_2, but the same chunk took 0.54s on 50_2, 0.54s on 50_3 and 0.42s on 50_4.
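For reference, this is roughly how such per-operation timings can be collected. This is only a minimal sketch, assuming a chunk store with a put method that writes each chunk to its own file; the ChunkStore shown here is a stand-in for illustration, not the actual safe_vault implementation:

```rust
use std::collections::HashMap;
use std::fs;
use std::io::Write;
use std::path::PathBuf;
use std::time::Instant;

/// Stand-in chunk store that writes each chunk to its own file.
struct ChunkStore {
    root: PathBuf,
}

impl ChunkStore {
    fn put(&self, name: &str, data: &[u8]) -> std::io::Result<()> {
        let mut file = fs::File::create(self.root.join(name))?;
        file.write_all(data)?;
        file.sync_all() // flush to disk so the timing includes the real I/O cost
    }
}

fn main() -> std::io::Result<()> {
    let store = ChunkStore { root: std::env::temp_dir() };
    let chunk = vec![0u8; 1024 * 1024]; // 1MB dummy chunk

    // Time each put the same way the per-vault figures above are compared.
    let mut timings: HashMap<&str, f64> = HashMap::new();
    for name in ["chunk_a", "chunk_b", "chunk_c"] {
        let start = Instant::now();
        store.put(name, &chunk)?;
        timings.insert(name, start.elapsed().as_secs_f64());
    }
    for (name, secs) in &timings {
        println!("put {} took {:.4}s", name, secs);
    }
    Ok(())
}
```

On nodes that share a disk and handle the same chunks at the same time, these measured put times will stretch out together, which is the clustering seen in the figures above.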

This explains why 50_2 and 50_3 required almost the same CPU time as 50_4 and 51_3, and why the vaults sitting on machine 51 ranked higher than the others.

Regarding the chunk_store::put method taking a long time to complete, a task ([MAID-2033] - JIRA) has been raised and some work will be done soon to address this.
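To illustrate the kind of change such a task might look at, here is a hypothetical sketch (not the actual MAID-2033 plan; the Event and names here are made up for illustration): offloading the blocking disk write to a worker thread and reporting completion back over a channel, so the routing event loop is not stalled while a slow disk finishes the write.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Instant;

/// Result of a completed put, reported back to the event loop.
enum Event {
    PutDone { name: String, elapsed_secs: f64, ok: bool },
}

fn main() {
    let (tx, rx) = mpsc::channel::<Event>();

    // Hypothetical chunk payload to write.
    let name = "chunk_a".to_string();
    let data = vec![0u8; 1024 * 1024];

    // Offload the blocking write so the caller (e.g. a routing event loop)
    // is not blocked while the disk is busy.
    let tx_clone = tx.clone();
    thread::spawn(move || {
        let start = Instant::now();
        let ok = std::fs::write(std::env::temp_dir().join(&name), &data).is_ok();
        let _ = tx_clone.send(Event::PutDone {
            name,
            elapsed_secs: start.elapsed().as_secs_f64(),
            ok,
        });
    });

    // The event loop picks up the completion later instead of waiting on it.
    match rx.recv().unwrap() {
        Event::PutDone { name, elapsed_secs, ok } => {
            println!("put {} finished in {:.4}s (ok = {})", name, elapsed_secs, ok);
        }
    }
}
```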

Regarding the proxy node being a bottleneck, some ideas have been discussed. However, a solution may raise concerns around the bootstrap flow and security, so it will take some time before any conclusion can be reached.

Thanks again for your profiling work; it helps us a lot.

Cheers,
