Chunk distribution within sections

How even is chunk distribution within each section? It should be pretty even, right? I was fairly sure chunks would be evenly distributed within the section because of the way hashing and xor distance work.

But I was curious whether chunks with names close to the very start or very end of the section boundaries would ‘clump’ and cause higher load for those vaults (at least that’s the intuition, the reality is probably more complex because xor distance messes with the idea of ‘start’ and ‘end’ of a section).

I wrote a small simulation which uploaded data into a section, then charted the distribution of chunks within the section to see if clumping of chunks happened and if so how much.

The results showed a big difference between the most loaded and least loaded vaults. Regardless of the number of chunks or the size of the section, it was about a 4x difference between the vault with the most chunks and the least chunks. This was really surprising to me.

The distribution of chunks was also surprising. In the charts below, the x axis from left to right is vault name from smallest to largest. So I was expecting maybe to see excessive chunks close to the edges of the chart, but that’s not what happened.

So my expectation of some vaults having heavier load than others was true, but why there’s such a difference (and which particular vaults are affected) is not clear to me from the results so far.

To put these numbers into perspective: with 1M chunks (1TB) loaded into 100 vaults at a duplication factor of 8, some vaults have stored 30 GB and some 120 GB. For a perfectly even distribution we’d expect 80K chunks per vault, but it varied between 31K and 125K chunks per vault. Pretty interesting I reckon.
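To make the setup concrete, here’s a rough sketch of the kind of simulation described above (not the original code; vault and chunk names are just random integers standing in for 256-bit hashes, and the counts are scaled down for speed):

```python
import random

random.seed(1)
NUM_VAULTS = 100
NUM_CHUNKS = 20_000      # scaled down from 1M for speed
DUPLICATION = 8
NAMESPACE = 2**256       # names stand in for 256-bit hashes

vaults = [random.randrange(NAMESPACE) for _ in range(NUM_VAULTS)]
counts = {v: 0 for v in vaults}

for _ in range(NUM_CHUNKS):
    chunk = random.randrange(NAMESPACE)
    # the DUPLICATION vaults closest by xor distance store the chunk
    for v in sorted(vaults, key=lambda v: v ^ chunk)[:DUPLICATION]:
        counts[v] += 1

most, least = max(counts.values()), min(counts.values())
print(f"most loaded: {most}, least loaded: {least}, ratio: {most / least:.1f}")
```

Even at this reduced scale the most/least ratio comes out well above 1, in the same spirit as the ~4x gap reported above, because the variation is driven by the spacing of the vault names rather than by the number of chunks.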

What’s the most vaults we’d expect per section? About 400, judging by rfc-0057. And what’s the most chunks we’d expect per vault (this can be derived from the desired average bandwidth and time to relocate)? What happens when a weaker vault finds itself in a heavily loaded part of the section? It’s really very interesting to have 4x variation in load within a section.

Any doubts? Thoughts? Further ideas?


Maybe to verify the randomness of your simulation’s input data you could see whether the distribution is fairly even with a duplication factor of only 1 instead of 8, if possible?


I guess it’s important to note that vault names are random, not perfectly evenly distributed, so there will be some bunching of vault names.

Chunk names are also random.


So the 5 vaults that got >30000 chunks above were just lucky not to be close in name to other vaults?
I guess that as vaults join and leave the section, the distribution would even out somewhat.


The first graphs look like a scrambled normal distribution.

I think the exact methods for calculating the correct addresses for every operation in the network are needed to know the distribution - whether uniform or not.

I am thinking that it needs to map any distribution of input (since we cannot affect what data is uploaded, and thus the hashes of it) to a uniform output.


Yes. Lucky or unlucky, depending on your view!

It uses xor distance to find the closest 8 vaults. The distance XOR(chunk_name, vault_name) is computed for every vault in the section, and the 8 vaults with the smallest values (ie closest distance) are responsible for the chunk. If one leaves, the next closest of the remaining vaults in the section joins the group and fetches the chunk from the remaining 7. If a relocated vault comes in, it may bump one vault out of the close group.
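That close-group rule can be sketched in a few lines (my own illustration, using a toy 8-bit namespace rather than real 256-bit names):

```python
def close_group(chunk_name, vault_names, size=8):
    """The vaults responsible for a chunk: the `size` closest by xor distance."""
    return sorted(vault_names, key=lambda v: v ^ chunk_name)[:size]

# toy 8-bit addresses for 10 vaults in one section
vaults = [0x12, 0x3f, 0x47, 0x58, 0x6a, 0x81, 0x9c, 0xb3, 0xc5, 0xee]
chunk = 0x50

group = close_group(chunk, vaults)   # the 8 vaults storing this chunk

# if a member leaves, the next-closest remaining vault takes its place
leaver = group[0]
survivors = [v for v in vaults if v != leaver]
new_group = close_group(chunk, survivors)
```

Here `0x58` is closest to the chunk (`0x58 ^ 0x50 == 0x08`); when it leaves, the 9th-closest vault is promoted into the group, exactly as described above.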


Data = potential for farming attempts, so I’d go with lucky (up until reaching the resource limit of the vault).

A normal distribution and some spread make sense, but the biggest surprise to me (probably due to lack of understanding) is what seems like quantization in the distributions. Why is that?


I was not quite referring to this though. Rather: what is the hashing function that generates an xor address out of
A. data, and
B. random input (for a node address)?
And in what way is this then combined with other functions?

I know it more or less from the documentation, the source, and from translating the Google attack simulations.

What I mean is needed, for us to know the expected distribution (as well as any quantization), is the exact hashing functions used and their properties, and possibly also the exact steps, as they might introduce quantization (several xor:ing steps with various variables).

Then it can be verified.

If we want uniform rewards, we want uniform traffic; if we want uniform traffic, we want uniform hash output when generating data addresses (both self-encrypted data as well as MD addresses hashed from strings), as well as node addresses (join / relocate).

Do we have uniform hash (+ additional operations such as xoring and what not) output?
If not, it will have implications for how the farming algorithm is designed, for example.


Where do you see this? If we had the vault / data addresses as well, then that’s where it could be, I think. But it’s not in the data posted here, if I’m not mistaken (I’m on a mobile phone).


It seems that many vaults receive the same number of chunks (or practically the same).

In the plot with 1 duplicate, for instance, I discern only about 5 substantially different levels of chunks: 2, 4, 8, 16 thousand.


Nodes belonging to less populated sections are responsible for more chunks. There are 2 distributions involved, the chunk names and the node names, and their deviations compound.

This means that the relocation strategy should rebalance the sections.

Addresses beginning with a set of 0s or a set of 1s are not special (not less probable or not more probable for both nodes and chunks).

Good observation. I think this is because 2 nodes that are close in the xor space are responsible for the same set of chunks.
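This can be sketched directly: if two vaults sit next to each other in xor space, their distances to any chunk differ only in the lowest bit, so they land in almost exactly the same close groups (my own toy check, with 32-bit names standing in for real hashes):

```python
import random

random.seed(0)
NAMESPACE = 2**32
DUPLICATION = 8

# two vaults adjacent in xor space, among 30 otherwise random vaults
near_a, near_b = 0x40000000, 0x40000001
vaults = [near_a, near_b] + [random.randrange(NAMESPACE) for _ in range(30)]

stored = {v: set() for v in vaults}
for i in range(20_000):
    chunk = random.randrange(NAMESPACE)
    for v in sorted(vaults, key=lambda v: v ^ chunk)[:DUPLICATION]:
        stored[v].add(i)

# the two adjacent vaults end up holding almost the same set of chunks
overlap = len(stored[near_a] & stored[near_b]) / len(stored[near_a] | stored[near_b])
```

The overlap comes out close to 1, which would explain groups of vaults sharing practically the same chunk count in the charts.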


Uniform is not random. Continuing to add more randomness does not result in a uniform image; it results in a speckled image with no patterns. Hashes give random names, not uniform names, no matter how deep or wide you go.

The thing with random numbers is that if we take a subset of the distribution it should look equally random (eg a jagged bar chart, a speckly image). It doesn’t get more or less jagged or speckly as you increase or decrease the resolution, the zoom, or the quantity of random data. When very zoomed in (eg only 3 vaults instead of 300) it definitely becomes harder to tell.

The practical consequence of this for sections would be

  • we can’t change the chunk name, the client picks it (quite literally in the case of MutableData), so data names will always be speckled no matter how many vaults or sections exist. Maybe you could give every chunk a ‘section agreed nonce’ or something to improve the distribution, but that seems really messy.
  • we can change the name of a vault (which is essential for relocations anyhow), but the more control we want over the name (eg make it very close to the nearest vault rather than just fairly close, or vice versa), the longer it takes to generate the keypair for that name. So even if uniform spacing of names were desirable, or ‘matching’ the distribution of vault names to chunk names, it can become expensive (I talk about bruteforcing keypairs for relocation in this post on the dev forum; it’s not impossible, but for large networks it keeps getting harder and harder). And then future data may mess the distribution up anyway.

I think this nonuniformity is unavoidable.

Yes, this is definitely a strong reason for appreciating the topic, since it may impact the economic model. It may also not, eg it may allow weaker vaults to live alongside stronger vaults, which improves consensus diversity, but I think in general the implications probably do need some amount of consideration.

Agreed with what @tfa says on this. Just want to add that the charts are maybe a little visually deceptive, since the bars are all uniformly spaced, but in reality the bars would be at varying distances apart to accurately capture the random distribution of vault names. I suspect tall bars would probably have a lot of space around them and short bars would be very close to others (probably too close to easily distinguish the difference visually).

And to extend on @tfa’s use of the word compounding: would mapping an uneven (ie random) distribution of chunk names onto an uneven (ie random) distribution of vault names be expected to look more, less, or equally uneven compared with one of those distributions being uniform?


Hi @mav,
When accepting a vault into a section, we give it an interval, so it has to generate an address that will slot into the biggest gap; vault addresses should therefore be somewhat close to evenly distributed (at least that is what we are aiming for). It would be interesting to see how that impacts these simulations.
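I don’t know the exact routing implementation, but the biggest-gap idea could be sketched roughly like this (a hypothetical helper of my own, ignoring wrap-around and prefix details):

```python
def join_interval(vault_names):
    """Hypothetical sketch: the interval a joining vault must target,
    chosen as the biggest gap between existing vault names."""
    s = sorted(vault_names)
    gaps = [(s[i + 1] - s[i], s[i], s[i + 1]) for i in range(len(s) - 1)]
    _, lo, hi = max(gaps)          # widest gap wins
    return lo + 1, hi - 1          # new vault must generate a name in (lo, hi)

# e.g. with vaults at 10, 20, 100, 110 the widest gap is 20..100
print(join_interval([10, 20, 100, 110]))   # → (21, 99)
```

Repeating this at every join would gradually smooth out the spacing of vault names, in contrast to purely random naming.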

Thanks a lot for this insight, very interesting.


Yes, this is what I have been saying for a long time,
whenever it has been argued that data is evenly (i.e. uniformly) spread out in the network - i.e. when assumptions have been made about the economy, for example, based on this idea.
I have said for a long time in various discussions (last time was yesterday, in the topic about farming algos, which I thought initiated this topic) that I suspect we will see a normal distribution rather than a uniform one from generating these random values.

I.e. we cannot expect that all sections will see the same traffic. A few will have radically different values for traffic and data, and thus for store cost and farming reward.

But, since it is always stated that it will be evenly distributed (like right above here, by Jean-Philipe), I say that it has to be verified at every step, and that is done by verifying the properties of the hash function, as well as the results of the additional operations applied, like xoring.

So, what I am saying is that there is contradictory information (I believe one thing, MaidSafe says another). What we need, in order to know which it really is, is to verify.


Yes I agree. I’m very keen to see actual data from real tests again, it’s going to be fascinating!


That’s why @mav’s secure random relocation is nice because the vaults could continuously flow through the sections and never be stuck in a poor economic situation. The rate of migration could be proportional to nodal age, with youngsters migrating more often.


Oh, oops… when I originally suggested taking a look at the input data with a duplication factor of 1, the idea was to see the raw inputs of the simulation without any duplicates. So I guess I actually meant a duplication of 0. In that case I think the variations we see would be expected to relate to the initial spacing of the vault names, and we’d not expect to see such quantization.
It is interesting to see the effects of these different parameters.


I think it will look more uneven, or more exactly: the difference between the vault with the most chunks and the one with the least will be larger with the doubly uneven distribution.

Suppose as an example:

  • a duplication factor of 8 (each chunk is stored 8 times)
  • a uniform distribution of vaults, with 16 vaults per section and also well balanced sections with 8 vaults in each section half (a section half is an area in the xor space whose prefix equals the section prefix concatenated with 0 for the first half and with 1 for the second half)
  • a uniform distribution of chunk addresses

In this situation each vault will store exactly the same number of chunks, which is also the number of chunks managed in each section half (I mean the chunks having the same prefix as the section half).

Now if the chunks are not uniformly distributed anymore, then there will be more chunks in some section halves and fewer in some others, which means that some vaults will store more chunks than others.

So far, every vault in a section half still stores the same number of chunks. But if, additionally, the vaults are not uniformly distributed anymore, then there will be some section halves with fewer than 8 nodes and some others with more than 8.

Among the section halves having fewer chunks, there will be some with more than 8 vaults, which means that these vaults each manage even fewer chunks (because they no longer store all the chunks of the half).

Conversely, among the section halves having more chunks, there will be some with fewer than 8 vaults, which means that these vaults each manage even more chunks (because they have to store some chunks belonging to the other half).

In this example the doubly uneven distribution increases the difference between the vault with the most chunks and the one with the least. This is only an example and I can’t prove it mathematically, but I think it holds more generally.
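The halves argument rests on a key property of xor distance: a vault in the same half as a chunk is always closer to it than any vault in the other half, because the top bit dominates the distance. A toy check of my own, with a deliberately unbalanced section in an 8-bit namespace:

```python
import random

random.seed(2)
SPACE = 2**8     # toy namespace: the section owns the whole 8-bit space
DUP = 8

# 16 vaults, deliberately unbalanced: 10 in the 0-half, 6 in the 1-half
half0 = random.sample(range(SPACE // 2), 10)
half1 = random.sample(range(SPACE // 2, SPACE), 6)
vaults = half0 + half1

def close_group(chunk):
    return set(sorted(vaults, key=lambda v: v ^ chunk)[:DUP])

# a chunk in the 1-half: its 6 same-half vaults are all closer than any
# 0-half vault, so it must borrow 2 vaults from the crowded 0-half
chunk = SPACE // 2 + 17   # arbitrary address in the 1-half
group = close_group(chunk)
borrowed = group - set(half1)
```

Every vault in the sparse half stores every chunk of that half, plus the crowded half lends 2 vaults per chunk, matching the reasoning in the example above.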


This can be observed on the community network:

Looking at the “Mut. data” column, we see that TFA-08 and rid-00 both have 703 chunks, and likewise neo-00 and happybeing-00 with 635 chunks. (Don’t look at the “Im. data” column, because vaults store more chunks than necessary depending on the section history.)

We can check the relative proximity of these vaults (TFA-08 and rid-00 on one side, neo-00 and happybeing-00 on the other):