Pre-Dev-Update Thread! Yay! :D

This ran fine overnight with these changes to churn.rs


const ADDITIONAL_NODES: u64 = 32;
const FILES_TO_PUT: i32 = 40;
const FILE_SIZE_LENGTH: usize = 1024 * 1024 * 50; // 50mb

So that's a total of 47 nodes getting 2 GB fired at them first of all. Got up today at the crack of noon and all looked fine. Saved a 250 MB dir OK and then fed it a 2 GB dir, which crashed all nodes after a couple of mins. I had no logging enabled on this :frowning:
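For anyone checking the arithmetic, here's a quick back-of-envelope sketch of what those constants imply (the 15-node baseline is just the stated 47 total minus the 32 additional nodes, so an inference on my part):

// Back-of-envelope check of what the churn.rs constants imply.
const ADDITIONAL_NODES: u64 = 32;
const FILES_TO_PUT: i32 = 40;
const FILE_SIZE_LENGTH: usize = 1024 * 1024 * 50; // 50 MiB

fn main() {
    let total_nodes = 15 + ADDITIONAL_NODES; // inferred 15-node baseline testnet
    let total_bytes = FILES_TO_PUT as usize * FILE_SIZE_LENGTH;
    println!(
        "{total_nodes} nodes, {} MiB (~2 GiB) of data PUT",
        total_bytes / (1024 * 1024)
    );
}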

It used a lot of memory, but I have seen it use much closer to 100% for a longer time.

Experiments continue…

7 Likes

Interesting.

This is good stuff thanks @southside

One issue we’re seeing with some of the CI runs is that nodes (on droplets no less) stall occasionally. I’m wondering if it’s a msg quantity / time thing. So the churn test may be too gentle atm (it puts data one after the other with a wee break), whereas your 2 GB dir would (I think) be PUT in a parallel fashion, as in the sketch below.
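To illustrate the two patterns (a hypothetical put_file helper, not the real sn_client API; assumes tokio and the futures crate):

use std::time::Duration;

// Hypothetical stand-in for one client PUT; not the real sn_client API.
async fn put_file(_name: &str) {}

// Roughly what the churn test does: one PUT at a time, with a wee break.
async fn put_sequentially(files: &[&str]) {
    for &f in files {
        put_file(f).await;
        tokio::time::sleep(Duration::from_secs(1)).await;
    }
}

// Roughly what a big dir upload does: every PUT in flight at once, so a node
// sees far more concurrent client msgs.
async fn put_in_parallel(files: &[&str]) {
    let puts = files.iter().map(|&f| put_file(f));
    futures::future::join_all(puts).await;
}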

6 Likes

I suspect similar, but because I had logging off to see if I could save memory, I can’t prove that right now - certainly I have had puts with large numbers of subdirs and files stalling. I tried a few nights back to start new nodes and kill the apparently stalled ones, but I screwed up the logging dirs. I’ll try that again more carefully if I get another stall rather than a crash.

1 Like

Yeh, I can see it, with a looot of client conns, we have a bad time. I suspect that’s the issue we’re seeing. It comes back down, buttt on one machine it may well be overload.

But aye, it’s a lot more mem than we’d want so we’ll have to look into what we can do here.

This is the genesis node while I’m running a 5 MB upload and then have 250 clients concurrently trying to dl from it.

5 Likes

I’m answering this on the phone cos I have locked up the main box. The nodes all crashed and the memory is back down, but it’s still dead slow.
Full reboot coming up. Should NOT happen on Linux…

2 Likes

The question is whether memory leaks or not.
It should not just come down; it should come down to the same value each time (maybe only after several iterations, but nevertheless).

1 Like

Well that depends what we’re looking at I guess, no? Leak or no leak, there are improvements to be had there. A 2 GB spike like that per node might well be too much for one machine running a testnet though, so it’d explain the nodes dying, I think.

But if we’re asking whether there’s a mem leak here: here’s a later chart for the same node (no further activity since the test), so I think we’re good on that front.

(heaptrack is reporting a 12 MB leak, but that’s in line with all nodes in general, so I’m not too worried about that atm.)


The spike is all about wire_msg::serialize from the heaptrack data, which as a fn itself seems pretty tight just now. That is “normally” (it seems to me) indicative of us holding on to messages for too long before we get them out. So this spike looks like we accept too many client msgs and may overwhelm the node. I think this could be something we can tweak and make configurable, so nodes at least are not dying due to this (though they may never be elders if they have a lower limit, e.g.…)
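One shape that tweak could take - a minimal sketch assuming tokio; the type, the constant name and the number are mine, not anything in sn_node:

use std::sync::Arc;
use tokio::sync::Semaphore;

// Hypothetical per-node cap on concurrently handled client msgs; name and
// value are made up for illustration.
const MAX_CONCURRENT_CLIENT_MSGS: usize = 100;

struct ClientMsgGate {
    permits: Arc<Semaphore>,
}

impl ClientMsgGate {
    fn new(limit: usize) -> Self {
        Self { permits: Arc::new(Semaphore::new(limit)) }
    }

    // Await a permit before deserialising/handling a client msg, so memory held
    // for in-flight msgs is bounded by `limit` rather than by client demand.
    async fn handle<F, Fut>(&self, work: F)
    where
        F: FnOnce() -> Fut,
        Fut: std::future::Future<Output = ()>,
    {
        let _permit = self.permits.acquire().await.expect("semaphore never closed");
        work().await;
    }
}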

We’ll see. It’s good to have a case to test against and we can see if improvements here might help things for @southside.

9 Likes

For me, first of all the program should be made to run correctly.
Only then should it be optimized to run efficiently.
Otherwise you may end up optimizing the wrong behaviour.

That is great.

Resource usage should be controlled in one way or another.
It may be good to start with a simple configurable limit and then think about whether there are more suitable solutions.
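Something as small as this would do for a first pass (a toy sketch; the env var name is made up):

use std::env;

// Toy version of a "simple configurable limit": read a cap from an env var
// (hypothetical name) and fall back to a conservative default.
fn client_msg_limit() -> usize {
    env::var("SN_MAX_CONCURRENT_CLIENT_MSGS")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(100)
}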

2 Likes

So I’ve woken up and am poking at this more. It’s looking like it is msg quantity that’s killing nodes here, and perhaps making us wildly inefficient.

@Southside try changing https://github.com/maidsafe/safe_network/blob/main/sn_client/src/api/client_builder.rs#L46 to 5. Locally the same 250s test ran a looot faster. And nodes seem happier too…

Here you have yesterday’s test on the left, this morning’s on the right. The difference was the suggestion above.

The issue is we have a clear attack vector to gum up or bring down nodes… So we’ll still need something for nodes to manage too many reqs. I also feel that the msg send func may be calling serialize way too often (once per msg recipient), when we really just need to update a few bytes in the msg header.
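Roughly the change I have in mind there (illustrative only; the types, header layout and helper names are made up, not the real wire_msg code):

struct Recipient {
    id: u64,
}

// Stand-in for the expensive wire_msg::serialize call: an 8-byte destination
// header followed by the msg bytes (hypothetical layout).
fn serialize_payload(msg: &[u8]) -> Vec<u8> {
    let mut out = vec![0u8; 8];
    out.extend_from_slice(msg);
    out
}

// Serialize once, then only rewrite the few per-recipient header bytes instead
// of re-serializing the whole msg for every recipient.
fn send_to_all(msg: &[u8], recipients: &[Recipient]) {
    let mut frame = serialize_payload(msg);
    for r in recipients {
        frame[..8].copy_from_slice(&r.id.to_le_bytes());
        send(&frame);
    }
}

// Stand-in for handing the bytes to the connection layer.
fn send(_frame: &[u8]) {}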

13 Likes

Making network DDoS resistant is important task.
Such task, however, may result in many architectural changes.
Which will move network further away from working state.

For me, reaching “working state” goal looks more important.

8 Likes

Don’t disagree. Hence why it’s good to know if the above mitigation works (and if it is the real issue @southside is coming up against).

There are various things that may reduce the impact of this spike though, many options that are not too large a job to try out (if it is what I think it is).

11 Likes

New release :dizzy:

15 Likes

Lots of green there:

13 Likes

The green is just about everywhere now!

7 Likes

I know, but the price of it is shocking!!! ← insert emoji of empty bong here

10 Likes

Looks like dev update will be interesting this week.

8 Likes

We may or may not get a new release tomorrow, but lots is happening on GitHub.

This caught my eye Connection query stability debug by maqi · Pull Request #1631 · maidsafe/safe_network · GitHub

The agogometer thrums quietly for now.

16 Likes

Is resource proof no more?

11 Likes

No longer needed. We can tell whether a node behaves using dysfunction checks. Resource proof is kinda a single check at a single point in time, whereas dysfunction tests are continuous.
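A loose way to picture the difference (a toy sketch, not the actual sn_node dysfunction module): resource proof was a one-off gate at join time, while dysfunction tracking keeps scoring behaviour over a sliding window.

use std::collections::VecDeque;
use std::time::{Duration, Instant};

// Toy "continuous" check: remember recent failed ops for a node and flag it if
// too many land inside the window. The real checks track more signals than this.
struct DysfunctionTracker {
    window: Duration,
    max_failures: usize,
    failures: VecDeque<Instant>,
}

impl DysfunctionTracker {
    fn new(window: Duration, max_failures: usize) -> Self {
        Self { window, max_failures, failures: VecDeque::new() }
    }

    fn record_failure(&mut self) {
        self.failures.push_back(Instant::now());
    }

    // Unlike a one-off resource proof at join time, this can flag a node at any
    // point in its life once failures pile up within the window.
    fn is_dysfunctional(&mut self) -> bool {
        let cutoff = Instant::now() - self.window;
        while self.failures.front().map_or(false, |t| *t < cutoff) {
            self.failures.pop_front();
        }
        self.failures.len() > self.max_failures
    }
}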

22 Likes


18 Likes