[Offline] A pre-Christmas playground present

Is there more than one copy of the chunk? I guess if there’s only one, this is what’s expected to happen?

2 Likes

There are always multiple copies of each chunk, four I think, so that losing nodes won’t lose files.

5 Likes

But in this testnet, was losing only some of them a problem? Or was it that the remaining ones were in the wrong place (not relocated)? Losing all of them because enough Adults left would not be a bug, but an expected thing to happen when the network shrinks below a certain limit.

(I am asking if there are many copies in these testnets already, or if that is not implemented yet?)

2 Likes

There are already replicas. As churn happens, nodes do relocate data (make more copies), but here they did not see new Adults and did not replicate, though they should have.

7 Likes

Hmm… Maybe it is a bit silly to keep asking, as I am sure there’s going to be a deep analysis and solution in due time, but I am just wondering why the network didn’t find any of the replicas. Why does it matter that the new nodes didn’t get theirs, when there was another copy on another node? How many replicas are there in the first place?

(Just leave the answer for later if you have anything better to do :slightly_smiling_face:)

4 Likes

Four, and losing them all with no new Adults joining will lose the data.

It never is :wink:

We need to confirm whether there was a replica left for any of the data, and if so then this question is pretty important. The remaining node should certainly have replicated the data, and it should also have returned the data, but it seemed not to.

10 Likes

It is actually a combination of problems that led us to failed GETs despite replication. Primarily, nodes were timing out on connections when trying to respond to queries. This, combined with the loss of data (due to nodes dropping and not republishing), made GETs fail as the availability of the chunks reduced (remember, even if one chunk is missing, the whole file will fail to decrypt due to the nature of self-encryption). So when everything went right, i.e. connections were held and data was found, we had successful GETs; otherwise the GET failed if even one of those conditions was not met.
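
To make the self-encryption point concrete, here is a minimal sketch (the chunk store, addresses and functions below are made up for illustration, not the actual self_encryption or sn_node API): a file’s datamap references several chunks, and if the GET for any one of them fails, the whole file cannot be reassembled.

```rust
// A minimal sketch, assuming a made-up chunk store and datamap; the real
// self_encryption / sn_node APIs are not shown here. The point illustrated:
// a file GET only succeeds if every chunk it references can be fetched.
use std::collections::HashMap;

type ChunkAddr = u64;

/// Stand-in for the Adults holding chunks on the network.
fn fetch_chunk(store: &HashMap<ChunkAddr, Vec<u8>>, addr: ChunkAddr) -> Option<Vec<u8>> {
    store.get(&addr).cloned()
}

/// Reassembling a file needs every chunk in its datamap; one missing chunk
/// means the whole file cannot be decrypted.
fn fetch_file(store: &HashMap<ChunkAddr, Vec<u8>>, datamap: &[ChunkAddr]) -> Option<Vec<u8>> {
    let mut content = Vec::new();
    for &addr in datamap {
        content.extend(fetch_chunk(store, addr)?);
    }
    Some(content)
}

fn main() {
    let mut store = HashMap::new();
    store.insert(1, b"part one ".to_vec());
    store.insert(2, b"part two".to_vec());

    assert!(fetch_file(&store, &[1, 2]).is_some()); // all chunks held: GET succeeds
    store.remove(&2);                               // a node left with chunk 2
    assert!(fetch_file(&store, &[1, 2]).is_none()); // whole file now fails
}
```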

There could well be more to this than the current inferences, as we have mountains more logs to go through and I could be totally wrong here! :grinning_face_with_smiling_eyes:

13 Likes

Maybe because they crashed?
Did you check whether all the Elders were still running when the problems started appearing?

Yeps! No timeouts were logged pertaining to the Elders, so the logs show they were alive at that moment. It also turns out that most of the connection timeout messages were seen with nodes that weren’t in the D.O. setup that we hosted (meaning they were from the community), which is understandable as folks would test for some time and then stop their nodes.

9 Likes

So there is a bug when nodes exit?

1 Like

It isn’t a bug actually :slight_smile: Healthy nodes are trying to contact a node that has left, which would obviously fail, and its connections are expected to time out.

6 Likes

So files would have been found by just waiting longer, until the network with its stable nodes has time to do its business after this constant joining and leaving is over?

1 Like

I am referring to the issue with files not being getable/catable after some period of time.

1 Like

Ideally, yes. Since nodes leave with data, the replication count for the chunks they hold decrements. And since we do not have new nodes joining (which is the actual bug), the availability of those chunks stays reduced. If new nodes join successfully, the chunk count is maintained by republishing chunks to them :slight_smile:

So eventually we would be able to fetch them back as usual.
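
A rough illustration of that bookkeeping, with the replica count of four and every name below assumed rather than taken from the real code:

```rust
// A rough sketch of the bookkeeping described above; REPLICA_COUNT and the
// names are assumptions for illustration, not the actual sn_node types.
const REPLICA_COUNT: usize = 4;

struct ChunkRecord {
    holders: Vec<String>, // nodes currently holding a copy of this chunk
}

impl ChunkRecord {
    /// A node that leaves takes its copy with it, so availability drops.
    fn node_left(&mut self, name: &str) {
        self.holders.retain(|h| h != name);
    }

    /// When a new node joins (the step that was failing in this testnet),
    /// republishing to it brings the chunk back towards full replica count.
    fn node_joined(&mut self, name: &str) {
        if self.holders.len() < REPLICA_COUNT {
            self.holders.push(name.to_string());
        }
    }
}

fn main() {
    let mut chunk = ChunkRecord {
        holders: vec!["a".into(), "b".into(), "c".into(), "d".into()],
    };
    chunk.node_left("c");
    chunk.node_left("d");
    assert_eq!(chunk.holders.len(), 2); // availability stays reduced while no one joins
    chunk.node_joined("e");             // a successful join lets republishing recover it
    assert_eq!(chunk.holders.len(), 3);
}
```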

6 Likes

…and the network was “always joinable”, which made the situation worse?

(Edit: in the beginning, when nodes were not needed, but could come to “steal” a few chunks?)

1 Like

“Always-joinable” is a developmental feature that does not restrict the network from taking new nodes. Meaning, if we start a network with that feature enabled, the network will always accept nodes no matter what the storage capacity ratio is. So it should actually have helped us accept new nodes, though the bug itself seems to have prevented nodes from joining.

That is definitely another perspective on this :slight_smile: But if new nodes ideally kept joining, the network would always try to maintain its data availability (chunk count) despite chunks being “stolen”.
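
To pin down what “always-joinable” changes, here is a hypothetical sketch of the join-acceptance decision; the field names and the threshold are assumptions, not the actual sn_node configuration:

```rust
// A hypothetical sketch of the join-acceptance decision described above; the
// field names and the 50% threshold are assumptions, not the real sn_node config.
struct NetworkState {
    always_joinable: bool, // developmental feature discussed in this thread
    used_storage: u64,
    total_storage: u64,
}

impl NetworkState {
    /// Without the flag, new nodes are only accepted once storage pressure
    /// crosses some threshold; with it, joins are never restricted.
    fn should_accept_new_node(&self) -> bool {
        if self.always_joinable {
            return true;
        }
        let ratio = self.used_storage as f64 / self.total_storage as f64;
        ratio > 0.5 // purely illustrative threshold
    }
}

fn main() {
    let dev_net = NetworkState { always_joinable: true, used_storage: 1, total_storage: 100 };
    let prod_like = NetworkState { always_joinable: false, used_storage: 1, total_storage: 100 };
    assert!(dev_net.should_accept_new_node());    // accepts joins regardless of capacity
    assert!(!prod_like.should_accept_new_node()); // a mostly-empty network turns joiners away
}
```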

1 Like

Maybe there should be “teen” nodes, where any chunks they get are also held on three other Adults, meaning that if someone just joins the network and leaves after a short period it doesn’t affect the chunks that are replicated on Adults.

If a teen is reliable for, let’s say, a day, then upgrade it to an Adult.
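
A purely illustrative sketch of that teen-to-Adult promotion rule (the names and the one-day threshold are assumptions, not actual node-ageing code):

```rust
// A purely illustrative sketch of the "teen node" idea above; the names and
// the one-day threshold are assumptions, not actual node-ageing code.
use std::time::Duration;

#[derive(Debug, PartialEq)]
enum NodeRole {
    Teen,  // holds chunks, but copies also stay on Adults
    Adult, // trusted holder counted towards the replica set
}

struct Node {
    role: NodeRole,
    uptime: Duration,
}

impl Node {
    /// Promote a teen only after it has proven reliable for a while, so a
    /// node that joins and leaves quickly never affects the Adults' copies.
    fn maybe_promote(&mut self) {
        if self.role == NodeRole::Teen && self.uptime >= Duration::from_secs(24 * 60 * 60) {
            self.role = NodeRole::Adult;
        }
    }
}

fn main() {
    let mut newcomer = Node { role: NodeRole::Teen, uptime: Duration::from_secs(3_600) };
    newcomer.maybe_promote();
    assert_eq!(newcomer.role, NodeRole::Teen); // one hour is not enough to be trusted

    newcomer.uptime = Duration::from_secs(25 * 60 * 60);
    newcomer.maybe_promote();
    assert_eq!(newcomer.role, NodeRole::Adult);
}
```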

4 Likes

Aye, node ageing does help a lot in situations like these. Can do a bunch of smart things with it :slight_smile:

3 Likes

The bug here, I feel, is waiting on new nodes to replicate to. We should replicate to existing nodes. Imagine a network break like a big segmentation: we may lose a lot of nodes at once and we must replicate data quickly. If we end up replicating too much, that is better than what we have now. Waiting on new nodes is not great.
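
As a minimal sketch of that alternative, assuming a replica count of four and made-up names, replication targets could be picked from Adults already in the section instead of waiting for new joiners:

```rust
// A minimal sketch of replicating to existing Adults instead of waiting for
// new joiners; the replica count of four and all names are assumptions.
const REPLICA_COUNT: usize = 4;

/// After churn, top a chunk's holder set back up from Adults already in the
/// section, rather than waiting for brand-new nodes to join.
fn pick_replication_targets<'a>(
    current_holders: &[&'a str],
    existing_adults: &[&'a str],
) -> Vec<&'a str> {
    existing_adults
        .iter()
        .copied()
        .filter(|a| !current_holders.contains(a))
        .take(REPLICA_COUNT.saturating_sub(current_holders.len()))
        .collect()
}

fn main() {
    // A big churn event left only two of the four copies of this chunk.
    let holders = ["n1", "n2"];
    let adults = ["n1", "n2", "n3", "n4", "n5"];
    let targets = pick_replication_targets(&holders, &adults);
    assert_eq!(targets, vec!["n3", "n4"]); // replicate immediately to existing Adults
}
```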

The always-joinable feature is killing us too though, I agree. We must treat even our own community as potential attackers (in the nicest way): they will churn nodes fast as hell and upload tonnes of data. It’s all good, but we need to treat public tests as invites to our great community plus bad guys who will want to do harm.

16 Likes

Why don’t you just make a “never leavable” network :wink:?

3 Likes