Pre-Dev-Update Thread! Yay! :D

This ran fine overnight with these changes to churn.rs


const ADDITIONAL_NODES: u64 = 32;
const FILES_TO_PUT: i32 = 40;
const FILE_SIZE_LENGTH: usize = 1024 * 1024 * 50; // 50mb

So that's a total of 47 nodes getting 2 GB fired at them first of all. Got up today at the crack of noon and all looked fine. Saved a 250 MB dir OK and then fed it a 2 GB dir, which crashed all nodes after a couple of mins. I had no logging enabled on this :frowning:
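For anyone checking the arithmetic, here's a quick back-of-envelope sketch of what those constants imply (the 15-node baseline is just the stated 47 total minus the 32 additional nodes, so an inference on my part):

// Back-of-envelope check of what the churn.rs constants imply.
const ADDITIONAL_NODES: u64 = 32;
const FILES_TO_PUT: i32 = 40;
const FILE_SIZE_LENGTH: usize = 1024 * 1024 * 50; // 50 MiB

fn main() {
    let total_nodes = 15 + ADDITIONAL_NODES; // inferred 15-node baseline testnet
    let total_bytes = FILES_TO_PUT as usize * FILE_SIZE_LENGTH;
    println!(
        "{total_nodes} nodes, {} MiB (~2 GiB) of data PUT",
        total_bytes / (1024 * 1024)
    );
}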

It used a lot of memory, but I have seen it use much closer to 100% for a longer time.

Experiments continue…

7 Likes

Interesting.

This is good stuff thanks @southside

One issue we’re seeing with some of the CI runs is that nodes (on droplets no less) stall occasionally. I’m wondering if it’s a msg quantity / time thing. So the churn test may be too gentle atm (it puts data one after the other with a wee break), whereas your 2 GB dir would (I think) be PUT in a parallel fashion, as in the sketch below.
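To illustrate the two patterns (a hypothetical put_file helper, not the real sn_client API; assumes tokio and the futures crate):

use std::time::Duration;

// Hypothetical stand-in for one client PUT; not the real sn_client API.
async fn put_file(_name: &str) {}

// Roughly what the churn test does: one PUT at a time, with a wee break.
async fn put_sequentially(files: &[&str]) {
    for &f in files {
        put_file(f).await;
        tokio::time::sleep(Duration::from_secs(1)).await;
    }
}

// Roughly what a big dir upload does: every PUT in flight at once, so a node
// sees far more concurrent client msgs.
async fn put_in_parallel(files: &[&str]) {
    let puts = files.iter().map(|&f| put_file(f));
    futures::future::join_all(puts).await;
}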

6 Likes

I suspect similar, but because I had logging off to see if I could save memory, I can’t prove that right now - certainly I have had puts with large numbers of subdirs and files stalling. I tried a few nights back to start new nodes and kill the apparently stalled ones, but I screwed up the logging dirs. I’ll try that again more carefully if I get another stall rather than a crash.

1 Like

Yeh, I can see it, with a looot of client conns, we have a bad time. I suspect that’s the issue we’re seeing. It comes back down, buttt on one machine it may well be overload.

But aye, it’s a lot more mem than we’d want so we’ll have to look into what we can do here.

This is the genesis node while I’m running a 5 MB upload and then have 250 clients concurrently trying to dl from it.

5 Likes

I’m answering this on the phone cos I have locked up the main box. The nodes all crashed and the memory is back down, but it’s still dead slow.
Full reboot coming up. Should NOT happen on Linux…

2 Likes

The question is whether memory leaks or not.
It should not just come down; it should come down to the same value each time (maybe only after several iterations, but nevertheless).

1 Like

Well that depends what we’re looking at I guess, no? Leak or no leak, there are improvements to be had there. A 2 GB spike like that per node might well be too much for one machine running a testnet though, so it’d explain the nodes dying, I think.

But if we’re asking whether there’s a mem leak here: here’s a later chart for the same node (no further activity since the test), so I think we’re good on that front.

(heaptrack is reporting a 12 MB leak, but that’s in line with all nodes in general, so I’m not too worried about that atm.)


The spike is all about wire_msg::serialize from the heaptrack data, which as a fn itself seems pretty tight just now. That is “normally” (it seems to me) indicative of us holding on to messages for too long before we get them out. So this spike looks like we accept too many client msgs and may overwhelm the node. I think this could be something we can tweak and make configurable, so nodes at least are not dying due to this (though they may never be elders if they have a lower limit, e.g.…)
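One shape that tweak could take - a minimal sketch assuming tokio; the type, the constant name and the number are mine, not anything in sn_node:

use std::sync::Arc;
use tokio::sync::Semaphore;

// Hypothetical per-node cap on concurrently handled client msgs; name and
// value are made up for illustration.
const MAX_CONCURRENT_CLIENT_MSGS: usize = 100;

struct ClientMsgGate {
    permits: Arc<Semaphore>,
}

impl ClientMsgGate {
    fn new(limit: usize) -> Self {
        Self { permits: Arc::new(Semaphore::new(limit)) }
    }

    // Await a permit before deserialising/handling a client msg, so memory held
    // for in-flight msgs is bounded by `limit` rather than by client demand.
    async fn handle<F, Fut>(&self, work: F)
    where
        F: FnOnce() -> Fut,
        Fut: std::future::Future<Output = ()>,
    {
        let _permit = self.permits.acquire().await.expect("semaphore never closed");
        work().await;
    }
}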

We’ll see. It’s good to have a case to test against and we can see if improvements here might help things for @southside.

9 Likes

For me, first of all the program should be made to run correctly.
Only then should it be optimized to run efficiently.
Otherwise you may end up optimizing the wrong behaviour.

That is great.

Resource usage should be controlled in one way or another.
It may be good to start with a simple configurable limit and then think about whether there are more suitable solutions.
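Something as small as this would do for a first pass (a toy sketch; the env var name is made up):

use std::env;

// Toy version of a "simple configurable limit": read a cap from an env var
// (hypothetical name) and fall back to a conservative default.
fn client_msg_limit() -> usize {
    env::var("SN_MAX_CONCURRENT_CLIENT_MSGS")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(100)
}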

2 Likes

So I’ve woken up and am poking at this more. It’s looking like it is msg quantity that’s killing nodes here, and perhaps making us wildly inefficient.

@Southside try changing https://github.com/maidsafe/safe_network/blob/main/sn_client/src/api/client_builder.rs#L46 to 5. Locally the same 250s test ran a looot faster. And nodes seem happier too…

Here you have yesterday’s test on the left, this morning’s on the right. The difference was the suggestion above.

The issue is we have a clear attack vector to gum up or bring down nodes… So we’ll still need something for nodes to manage too many reqs. I also feel that the msg send func may be calling serialize way too often (once per msg recipient), when we really just need to update a few bytes in the msg header.
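Roughly the change I have in mind there (illustrative only; the types, header layout and helper names are made up, not the real wire_msg code):

struct Recipient {
    id: u64,
}

// Stand-in for the expensive wire_msg::serialize call: an 8-byte destination
// header followed by the msg bytes (hypothetical layout).
fn serialize_payload(msg: &[u8]) -> Vec<u8> {
    let mut out = vec![0u8; 8];
    out.extend_from_slice(msg);
    out
}

// Serialize once, then only rewrite the few per-recipient header bytes instead
// of re-serializing the whole msg for every recipient.
fn send_to_all(msg: &[u8], recipients: &[Recipient]) {
    let mut frame = serialize_payload(msg);
    for r in recipients {
        frame[..8].copy_from_slice(&r.id.to_le_bytes());
        send(&frame);
    }
}

// Stand-in for handing the bytes to the connection layer.
fn send(_frame: &[u8]) {}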

13 Likes

Making network DDoS resistant is important task.
Such task, however, may result in many architectural changes.
Which will move network further away from working state.

For me, reaching “working state” goal looks more important.

8 Likes

Don’t disagree. Hence why it’s good to know if the above mitigation works (and if it is the real issue @southside is coming up against).

There are various things that may reduce the impact of this spike though, many options that are not too large a job to try out (if it is what I think it is).

11 Likes

New release :dizzy:

15 Likes

Lots of green there:

13 Likes

The green is just about everywhere now!

7 Likes

I know, but the price of it is shocking!!! ← insert emoji of empty bong here

10 Likes

Looks like dev update will be interesting this week.

8 Likes

We may or may not get a new release tomorrow, but lots is happening on GitHub.

This caught my eye Connection query stability debug by maqi · Pull Request #1631 · maidsafe/safe_network · GitHub

The agogometer thrums quietly for now.

16 Likes

Is resource proof no more?

11 Likes

No longer needed. We can tell whether a node behaves using dysfunction checks. Resource proof is kinda a single check at a single point in time, whereas dysfunction tests are continuous.
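A loose way to picture the difference (a toy sketch, not the actual sn_node dysfunction module): resource proof was a one-off gate at join time, while dysfunction tracking keeps scoring behaviour over a sliding window.

use std::collections::VecDeque;
use std::time::{Duration, Instant};

// Toy "continuous" check: remember recent failed ops for a node and flag it if
// too many land inside the window. The real checks track more signals than this.
struct DysfunctionTracker {
    window: Duration,
    max_failures: usize,
    failures: VecDeque<Instant>,
}

impl DysfunctionTracker {
    fn new(window: Duration, max_failures: usize) -> Self {
        Self { window, max_failures, failures: VecDeque::new() }
    }

    fn record_failure(&mut self) {
        self.failures.push_back(Instant::now());
    }

    // Unlike a one-off resource proof at join time, this can flag a node at any
    // point in its life once failures pile up within the window.
    fn is_dysfunctional(&mut self) -> bool {
        let cutoff = Instant::now() - self.window;
        while self.failures.front().map_or(false, |t| *t < cutoff) {
            self.failures.pop_front();
        }
        self.failures.len() > self.max_failures
    }
}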

22 Likes


18 Likes