[Offline] A pre-Christmas playground present

Is there more than one copy of the chunk? I guess if there’s only one, this is what’s expected to happen?

2 Likes

There are always multiple copies of each chunk, four I think, so that losing nodes won’t lose files.

5 Likes

But in this testnet, was losing only some of them a problem? Or was it that the remaining ones were in the wrong place (not relocated)? Losing all of them because enough Adults left would not be a bug, but an expected thing to happen when the network shrinks below a certain limit.

(I am asking if there are many copies in these testnets already, or if that is not implemented yet?)

2 Likes

There are already replicas. As churn happens, nodes do relocate data (make more copies), but here they did not see new Adults and did not replicate, though they should have.

7 Likes

Hmm… Maybe it is a bit silly to keep asking, as I am sure there’s going to be a deep analysis and solution in due time, but I am just wondering why the network didn’t find any of the replicas. Why does it matter that the new nodes didn’t get theirs, when there was another copy on another node? How many replicas are there in the first place?

(Just leave the answer for later if you have anything better to do :slightly_smiling_face:)

4 Likes

Four, and losing them all with no new Adults joining will lose the data.

It never is :wink:

We need to confirm whether there was a replica left for any of the data, and if so then this question is pretty important. The remaining node should certainly have replicated the data, and it should also have returned the data, but it seemed not to.

10 Likes

It is actually a combination of problems that led us to failed GETs despite replication. Primarily, nodes were timing out on connections when trying to respond to queries. This, combined with the loss of data (due to nodes dropping and not republishing), made GETs fail as the availability of the chunks reduced (remember, even if one chunk is missing, the whole file will fail to decrypt due to the nature of self-encryption). So when everything went right, i.e. connections were held and data was found, we had successful GETs; otherwise the GET failed if even one of those conditions was not met.
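
To make the self-encryption point concrete, here is a minimal sketch (the chunk store, addresses and functions below are made up for illustration, not the actual self_encryption or sn_node API): a file’s datamap references several chunks, and if the GET for any one of them fails, the whole file cannot be reassembled.

```rust
// A minimal sketch, assuming a made-up chunk store and datamap; the real
// self_encryption / sn_node APIs are not shown here. The point illustrated:
// a file GET only succeeds if every chunk it references can be fetched.
use std::collections::HashMap;

type ChunkAddr = u64;

/// Stand-in for the Adults holding chunks on the network.
fn fetch_chunk(store: &HashMap<ChunkAddr, Vec<u8>>, addr: ChunkAddr) -> Option<Vec<u8>> {
    store.get(&addr).cloned()
}

/// Reassembling a file needs every chunk in its datamap; one missing chunk
/// means the whole file cannot be decrypted.
fn fetch_file(store: &HashMap<ChunkAddr, Vec<u8>>, datamap: &[ChunkAddr]) -> Option<Vec<u8>> {
    let mut content = Vec::new();
    for &addr in datamap {
        content.extend(fetch_chunk(store, addr)?);
    }
    Some(content)
}

fn main() {
    let mut store = HashMap::new();
    store.insert(1, b"part one ".to_vec());
    store.insert(2, b"part two".to_vec());

    assert!(fetch_file(&store, &[1, 2]).is_some()); // all chunks held: GET succeeds
    store.remove(&2);                               // a node left with chunk 2
    assert!(fetch_file(&store, &[1, 2]).is_none()); // whole file now fails
}
```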

There could well be more to this than the current inferences, as we have mountains more logs to go through and I could be totally wrong here! :grinning_face_with_smiling_eyes:

13 Likes

Maybe because they crashed?
Did you check whether all the Elders were still running when the problems started appearing?

Yeps! No timeouts were logged pertaining to the Elders, so the logs show they were alive at that moment. It also turns out that most of the connection timeout messages were seen with nodes that weren’t in the D.O. setup that we hosted (meaning they were from the community), which is understandable as folks would test for some time and then stop their nodes.

9 Likes

So there is a bug when nodes exit?

1 Like

It isn’t a bug actually :slight_smile: Healthy nodes are trying to contact a node that has left, which would obviously fail, and its connections are expected to time out.

6 Likes

So files would have been found by just waiting longer, until the network with its stable nodes has time to do its business after this constant joining and leaving is over?

1 Like

I am referring to the issue with files not being getable/catable after some period of time.

1 Like

Ideally, yes. Since nodes leave with data, the replication count for the chunks they hold decrements. And since we do not have new nodes joining (which is the actual bug), the availability of those chunks stays reduced. If new nodes join successfully, the chunk count is maintained by republishing chunks to them :slight_smile:

So eventually we would be able to fetch them back as usual.
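
A rough illustration of that bookkeeping, with the replica count of four and every name below assumed rather than taken from the real code:

```rust
// A rough sketch of the bookkeeping described above; REPLICA_COUNT and the
// names are assumptions for illustration, not the actual sn_node types.
const REPLICA_COUNT: usize = 4;

struct ChunkRecord {
    holders: Vec<String>, // nodes currently holding a copy of this chunk
}

impl ChunkRecord {
    /// A node that leaves takes its copy with it, so availability drops.
    fn node_left(&mut self, name: &str) {
        self.holders.retain(|h| h != name);
    }

    /// When a new node joins (the step that was failing in this testnet),
    /// republishing to it brings the chunk back towards full replica count.
    fn node_joined(&mut self, name: &str) {
        if self.holders.len() < REPLICA_COUNT {
            self.holders.push(name.to_string());
        }
    }
}

fn main() {
    let mut chunk = ChunkRecord {
        holders: vec!["a".into(), "b".into(), "c".into(), "d".into()],
    };
    chunk.node_left("c");
    chunk.node_left("d");
    assert_eq!(chunk.holders.len(), 2); // availability stays reduced while no one joins
    chunk.node_joined("e");             // a successful join lets republishing recover it
    assert_eq!(chunk.holders.len(), 3);
}
```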

6 Likes

…and the network was “always joinable”, which made the situation worse?

(Edit: in the beginning, when nodes were not needed, but could come to “steal” a few chunks?)

1 Like

“Always-joinable” is a developmental feature that does not restrict the network from taking new nodes. Meaning, if we start a network with that feature enabled, the network will always accept nodes no matter what the storage capacity ratio is. So it should actually have helped us accept new nodes, though the bug itself seems to have prevented nodes from joining.

That is definitely another perspective on this :slight_smile: But if new nodes ideally kept joining, the network would always try to maintain its data availability (chunk count) despite chunks being “stolen”.
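
To pin down what “always-joinable” changes, here is a hypothetical sketch of the join-acceptance decision; the field names and the threshold are assumptions, not the actual sn_node configuration:

```rust
// A hypothetical sketch of the join-acceptance decision described above; the
// field names and the 50% threshold are assumptions, not the real sn_node config.
struct NetworkState {
    always_joinable: bool, // developmental feature discussed in this thread
    used_storage: u64,
    total_storage: u64,
}

impl NetworkState {
    /// Without the flag, new nodes are only accepted once storage pressure
    /// crosses some threshold; with it, joins are never restricted.
    fn should_accept_new_node(&self) -> bool {
        if self.always_joinable {
            return true;
        }
        let ratio = self.used_storage as f64 / self.total_storage as f64;
        ratio > 0.5 // purely illustrative threshold
    }
}

fn main() {
    let dev_net = NetworkState { always_joinable: true, used_storage: 1, total_storage: 100 };
    let prod_like = NetworkState { always_joinable: false, used_storage: 1, total_storage: 100 };
    assert!(dev_net.should_accept_new_node());    // accepts joins regardless of capacity
    assert!(!prod_like.should_accept_new_node()); // a mostly-empty network turns joiners away
}
```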

1 Like

Maybe there should be “teen” nodes, where any chunks they get are also held on three other Adults, meaning that if someone just joins the network and leaves after a short period it doesn’t affect the chunks that are replicated on Adults.

If a teen is reliable for, let’s say, a day, then upgrade it to an Adult.
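
A purely illustrative sketch of that teen-to-Adult promotion rule (the names and the one-day threshold are assumptions, not actual node-ageing code):

```rust
// A purely illustrative sketch of the "teen node" idea above; the names and
// the one-day threshold are assumptions, not actual node-ageing code.
use std::time::Duration;

#[derive(Debug, PartialEq)]
enum NodeRole {
    Teen,  // holds chunks, but copies also stay on Adults
    Adult, // trusted holder counted towards the replica set
}

struct Node {
    role: NodeRole,
    uptime: Duration,
}

impl Node {
    /// Promote a teen only after it has proven reliable for a while, so a
    /// node that joins and leaves quickly never affects the Adults' copies.
    fn maybe_promote(&mut self) {
        if self.role == NodeRole::Teen && self.uptime >= Duration::from_secs(24 * 60 * 60) {
            self.role = NodeRole::Adult;
        }
    }
}

fn main() {
    let mut newcomer = Node { role: NodeRole::Teen, uptime: Duration::from_secs(3_600) };
    newcomer.maybe_promote();
    assert_eq!(newcomer.role, NodeRole::Teen); // one hour is not enough to be trusted

    newcomer.uptime = Duration::from_secs(25 * 60 * 60);
    newcomer.maybe_promote();
    assert_eq!(newcomer.role, NodeRole::Adult);
}
```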

4 Likes

Aye, node ageing does help a lot in situations like these. Can do a bunch of smart things with it :slight_smile:

3 Likes

The bug here, I feel, is waiting on new nodes to replicate to. We should replicate to existing nodes. Imagine a network break like a big segmentation: we may lose a lot of nodes at once and we must replicate data quickly. If we end up replicating too much, that is better than what we have now. Waiting on new nodes is not great.
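
As a minimal sketch of that alternative, assuming a replica count of four and made-up names, replication targets could be picked from Adults already in the section instead of waiting for new joiners:

```rust
// A minimal sketch of replicating to existing Adults instead of waiting for
// new joiners; the replica count of four and all names are assumptions.
const REPLICA_COUNT: usize = 4;

/// After churn, top a chunk's holder set back up from Adults already in the
/// section, rather than waiting for brand-new nodes to join.
fn pick_replication_targets<'a>(
    current_holders: &[&'a str],
    existing_adults: &[&'a str],
) -> Vec<&'a str> {
    existing_adults
        .iter()
        .copied()
        .filter(|a| !current_holders.contains(a))
        .take(REPLICA_COUNT.saturating_sub(current_holders.len()))
        .collect()
}

fn main() {
    // A big churn event left only two of the four copies of this chunk.
    let holders = ["n1", "n2"];
    let adults = ["n1", "n2", "n3", "n4", "n5"];
    let targets = pick_replication_targets(&holders, &adults);
    assert_eq!(targets, vec!["n3", "n4"]); // replicate immediately to existing Adults
}
```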

The always-joinable feature is killing us too though, I agree. We must treat even our own community as potential attackers (in the nicest way): they will churn nodes fast as hell and upload tonnes of data. It’s all good, but we need to treat public tests as invites to our great community plus bad guys who will want to do harm.

16 Likes

Why don’t you just make a “never leavable” network :wink:?

3 Likes