How even is chunk distribution within each section? It should be pretty even, right? I was fairly sure chunks would be evenly distributed in the section because of the way hashing and xor distance work.
But I was curious whether chunks with names close to the very start or very end of the section boundaries would ‘clump’ and cause higher load for those vaults (at least that’s the intuition, the reality is probably more complex because xor distance messes with the idea of ‘start’ and ‘end’ of a section).
I wrote a small simulation which uploaded data into a section, then charted the distribution of chunks within the section to see if clumping of chunks happened and if so how much.
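For anyone who wants to try this themselves, here is a minimal sketch of that kind of simulation. It is not the code behind the charts, and it makes some assumptions: names are random 64-bit integers rather than real hashes, the section covers the whole address space, and each chunk is stored on the `copies` vaults closest to its name by xor distance. The function name `simulate` and its parameters are made up for illustration.

```python
import heapq
import random


def simulate(num_vaults=100, num_chunks=10_000, copies=8, bits=64, seed=1):
    """Return chunk counts per vault, ordered by vault name (smallest first)."""
    rng = random.Random(seed)

    # Assumption: vault names are uniformly random and unique.
    names = set()
    while len(names) < num_vaults:
        names.add(rng.getrandbits(bits))
    vaults = sorted(names)

    counts = dict.fromkeys(vaults, 0)
    for _ in range(num_chunks):
        # Assumption: a random name stands in for the hash of the chunk.
        chunk = rng.getrandbits(bits)
        # Store the chunk on the `copies` vaults with the smallest xor distance.
        for vault in heapq.nsmallest(copies, vaults, key=lambda v: v ^ chunk):
            counts[vault] += 1

    return [counts[v] for v in vaults]


if __name__ == "__main__":
    counts = simulate()
    ratio = max(counts) / max(min(counts), 1)
    print("min", min(counts), "max", max(counts), "ratio", round(ratio, 2))
```

The returned list is already in vault name order, so it can be plotted directly as a bar chart with vault name on the x axis.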
The results showed a big difference between the most loaded and least loaded vaults. Regardless of the number of chunks or the size of the section, there was about a 4x difference between the vault with the most chunks and the vault with the fewest. This was really surprising to me.
Also the distribution of chunks was surprising. In the charts below, the x axis left to right is vault name from smallest to largest. So I was expecting maybe to see excessive chunks close to the edges of the chart, but that’s not what happened.
So my expectation of some vaults having heavier load than others was true, but why there’s such a difference (and why those particular vaults are affected) is not clear to me from the results so far.
To put these numbers into perspective, loading 1M chunks (1TB) into 100 vaults with 8 duplicates means some vaults have stored about 30 GB and some have stored about 120 GB. For perfectly even distribution we’d expect 80K chunks per vault (1M chunks x 8 copies / 100 vaults), but it varied between 31K and 125K chunks per vault. Pretty interesting I reckon.
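The arithmetic behind those figures, as a quick sketch. The only assumption is a chunk size of 1 MB, which follows from 1M chunks making up 1TB; the variable names are just for illustration.

```python
chunks = 1_000_000      # 1M chunks uploaded
chunk_size_mb = 1       # assumption: 1 TB / 1M chunks = 1 MB per chunk
copies = 8              # duplicates of each chunk
vaults = 100            # vaults in the section

# Perfectly even distribution: every vault holds the same share of copies.
expected_per_vault = chunks * copies // vaults
print(expected_per_vault)  # 80000


def stored_gb(chunk_count):
    """Convert a chunk count into stored gigabytes (1 GB = 1000 MB)."""
    return chunk_count * chunk_size_mb / 1000


print(stored_gb(31_000))   # 31.0, the least loaded vault
print(stored_gb(125_000))  # 125.0, the most loaded vault
print(round(125_000 / 31_000, 2))  # 4.03, the max to min ratio
```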
What’s the most vaults we’d expect per section? About 400 judging by rfc-0057. And what’s the most chunks we’d expect per vault (this can be derived from the desired average bandwidth and time to relocate)? What happens when a weaker vault finds itself in a heavily loaded part of the section? It’s really very interesting to have 4x variation in load within a section.
Any doubts? Thoughts? Further ideas?