Chunk distribution within sections

I expected a more uniform distribution over xor space with as many as 100 vaults.

Redistribution of vaults would make xor addresses less random (and a section possibly less secure?)

1 Like

Let’s compare three strategies for vault naming: uniform vault names vs random vault names vs best-fit vault names (where vaults are required to join the section with a random name within the biggest existing gap between vault names).

I’ve been using the term ‘spacing’ for the gap between two vault names.

Variation in spacing is a good measure of uniformity. Low variation of spacing means all the vaults are pretty evenly spaced out, and high variation means there’s a lot of bunching and also big gaps.

The difference is definitely noticeable, but best fit doesn’t bring the variation much closer to uniform.

Naming  Variation
Uniform     0.018
Best Fit    1.420
Random      5.340

(Nerd details: Uniform variation is not exactly 0 because you can’t exactly fit 100 vaults uniformly into a binary namespace. Variation is actually the standard deviation of all spacings divided by 10^18 using vault names of type uint64. The variation is taken from five tests to give an average variation.)
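As a rough sketch of that measure (my reading of it, not the simulation’s exact code, assuming u64 vault names as above):

```rust
// Hypothetical sketch of the spacing-variation measure described above:
// sort the u64 vault names, take the gaps between consecutive names,
// and report the standard deviation of those gaps divided by 10^18.

fn spacing_variation(names: &mut [u64]) -> f64 {
    names.sort();
    // spacings between consecutive vault names (linear subtraction)
    let gaps: Vec<f64> = names.windows(2).map(|w| (w[1] - w[0]) as f64).collect();
    let mean = gaps.iter().sum::<f64>() / gaps.len() as f64;
    let var = gaps.iter().map(|g| (g - mean).powi(2)).sum::<f64>() / gaps.len() as f64;
    var.sqrt() / 1e18
}

fn main() {
    // perfectly uniform names: essentially zero spacing variation
    let mut uniform: Vec<u64> = (0..100u64).map(|i| i * (u64::MAX / 100)).collect();
    assert!(spacing_variation(&mut uniform) < 1e-9);

    // heavily bunched names: much larger variation
    let mut bunched = vec![0u64, 1, 2, u64::MAX / 2, u64::MAX];
    assert!(spacing_variation(&mut bunched) > 0.1);
    println!("ok");
}
```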

An example vault distribution is shown below: first random vault names, then best-fit vault names. To my eye there’s not much difference, but the statistical analysis shows the difference is actually quite significant and easily detected.

In summary, the strategy to relocate between two specific vaults definitely seems to help give a more uniform spacing.


The next question is, does spacing of vaults contribute to keeping the chunk load fairly even?

From the charts below, to my eye the distribution for best-fit vault naming falls between random and uniform naming, though much closer to random than uniform. The difference in the heaviest-loaded vault for each naming strategy seems to roughly line up with the variation test.

This is a distillation of a lot of info and testing; there’s potentially a lot of detail to get into, so if you have questions I’m happy to go deeper.

Repository of tests here: https://github.com/iancoleman/safe_chunk_responsibility_simulation


Oh sorry, my wording is confusing. ‘Duplicates 1’ means 1 copy of each chunk in the section, so 1 is no duplicates…! ‘8 duplicates’ means 8 copies, not original+8. I used the word ‘duplicates’ as a form of ‘duplication’, but the translation wasn’t quite right, huh.

Super super super cool!

9 Likes

A good reason to have binary section sizes 128, 256, 512 etc.?

When you take into account vault size distributions, does this even matter? A 256TB vault should hold more chunks than a 256GB vault. If instead all vaults are a fixed size, then vault placement is more critical, but random placement for random sizes might be just fine. Does your sim take into account vault size, or placement based on size?

2 Likes

That was my initial interpretation, but then I got confused (and still am) because you guys speak of close vaults being responsible for the same chunks. Isn’t only one vault responsible for a chunk if there is no duplication?
Anyway, I don’t want to distract from the more interesting discussion about spacing of vaults.

1 Like

I don’t think we need to absolutely reduce the variation in spacing (of vaults). We just need to have a relocation strategy that tries to reduce the heterogeneity between sections.

Here is a reminder of the rules I used to get this result in my simulations (what I call network homogeneity here):

  • 20% local relocations (to the weakest neighbor section)
  • 80% distant relocations (to the weakest neighbor section of a random distant section)
  • when a node is relocated to a section it is placed in the less populated half of this section
3 Likes

I’ll add this strategy too.

Another thing to try is whether relocations improve spacing under the best-fit strategy. I’m only simulating the initial joins to a section, not doing any relocations in or out. That might even things out? Will test it when I have some time.

Also I want to work out how much work it takes to be more balanced. Random is no work, best fit is some extra work, uniform is an extreme amount of work (effectively impossible). So you’d be able to say ‘with X work you can get Y spacing’. It depends how fast you can generate new keypairs and how many you’re able to retain for later use.

Not really. Uniformity doesn’t matter, and perfect uniformity would require generating an exact public key, which is almost impossible.

Yes it matters because eventually the smaller vault will be relocated to a part of a section which puts high load on it, so larger vaults have the advantage (they can do any relocation, but small vaults can only do some relocations). Maybe it’s ok because it’s a way to gradually prune out the smaller vaults?!

No it doesn’t. It might be helpful for allowing smaller and larger vaults to coexist on the network, though.

The relocation rules are definitely an area that could be quite useful but I think it needs pretty careful consideration from a security perspective. My intuitive preference is still leaning toward random relocation rather than targeted, although @tfa seems to have a pretty good compromise.

An algorithm like Poisson disk sampling with a variable influence region based on vault size might be what you’re looking for. It would operate like random relocation, but you could tailor the distribution…

1 Like

Best fit is impossible too, because nodes don’t have global knowledge of the network, so the two adjacent nodes with the maximum distance between them cannot be selected.

In a simulation care must be taken to use an algorithm that can be implemented in the real network. This strategy can only be approximated in the real network, for example using a best fit relocation strategy in the vicinity of a random section.

Another thing: xor distance is not the same as linear metric distance. Two nodes that are close in xor space will also be close in the metric space, but the converse isn’t necessarily true; the most extreme counterexample is addresses 0x7FFF… and 0x8000…, which have maximal distance in xor space but minimal distance in the metric space. Though, I don’t know if this would impact the results of your simulations.
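That extreme counterexample is quick to check mechanically with 16-bit names:

```rust
// Verifying the 0x7FFF… / 0x8000… example with 16-bit names:
// adjacent in linear (metric) space, maximally distant in xor space.
fn main() {
    let (a, b) = (0x7FFFu16, 0x8000u16);
    assert_eq!(b - a, 1);      // linear distance: minimal
    assert_eq!(a ^ b, 0xFFFF); // xor distance: maximal
    println!("ok");
}
```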

2 Likes

Within a section you can pick the max distance, right? My understanding of how to do this is: 1) order all vaults in the section in ascending order of name, 2) calculate the distance between each pair of adjacent vaults to find the maximum distance, 3) wherever the largest distance is, tell the joining/relocating vault to generate a key between those two vault names. Am I missing something?

Do you mean you can’t pick the section within the global network that has the maximum distance? Because I’m only assuming you pick ‘any’ section and then after that you can find the best fit within that section. There may be other sections with larger spacings. I’m not concerned by picking the section to relocate to, only concerned with where to relocate within the destination section.

Note in step 2 the spacing between vaults may be calculated as subtract(biggerName, smallerName) or it may be xor(biggerName, smallerName). Do you know which is more suitable to use? I’ve got an option for both in my simulation.
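For illustration, here’s a toy sketch (not routing’s actual code) of steps 1 and 2 with both candidate spacing metrics, using u64 names:

```rust
// Toy sketch of the "find the largest gap" step, with both candidate
// spacing metrics: linear subtraction and xor. Not routing's real code.

fn largest_gap(names: &mut [u64], use_xor: bool) -> (u64, u64) {
    // step 1: order vaults by name
    names.sort();
    // step 2: find the adjacent pair with the biggest spacing
    names
        .windows(2)
        .map(|w| (w[0], w[1]))
        .max_by_key(|&(a, b)| if use_xor { a ^ b } else { b - a })
        .expect("need at least two vaults")
}

fn main() {
    let mut names = vec![100u64, 140, 200, 230];
    // linear spacings are 40, 60, 30: biggest gap is (140, 200)
    assert_eq!(largest_gap(&mut names, false), (140, 200));
    // xor spacings are 232, 68, 46: biggest gap is (100, 140)
    assert_eq!(largest_gap(&mut names, true), (100, 140));
    println!("ok");
}
```

So the choice of metric isn’t cosmetic: the two can select different pairs of vaults as the biggest gap.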

Ah yes, I’ve not been dealing with relocations to the ‘best’ section, just to any section. I’m not interested in the global network for these simulations, only looking within a section. I think finding the ‘best’ section is a separate problem to where within that section to relocate.

Doing a relocation is a two step process

  1. decide the new section
  2. decide where within that section to be relocated

I think these two steps should be completely uncoupled, even if they both aim to achieve the same goal. In my simulations I’m only dealing with step 2.

3 Likes

The distance metric is xor, so I would stay with the xor-based spacing strategy that you have now. When you plot the data, the uniformity might look odd if you plot vault locations on a linear axis based on id. Instead, you could sort the vaults by id, then start with the lowest id, followed by its nearest neighbor. You should be able to walk the nearest neighbors until all vaults are depicted. This may be rather challenging in 1D, so a 2D scatter plot might offer a better perspective.
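A sketch of that walk (hypothetical helper, assuming u64 ids):

```rust
// Hypothetical sketch of the suggested plotting order: start at the
// lowest id, then repeatedly jump to the xor-nearest unvisited vault.
fn xor_walk(mut names: Vec<u64>) -> Vec<u64> {
    names.sort();
    let mut walk = vec![names.remove(0)];
    while !names.is_empty() {
        let cur = *walk.last().unwrap();
        // index of the xor-nearest remaining vault
        let i = names
            .iter()
            .enumerate()
            .min_by_key(|&(_, &n)| cur ^ n)
            .map(|(i, _)| i)
            .unwrap();
        walk.push(names.remove(i));
    }
    walk
}

fn main() {
    // from 0b0001 the xor-nearest is 0b0011 (distance 2), not 0b0010 (distance 3)
    assert_eq!(
        xor_walk(vec![0b0001, 0b0010, 0b0011, 0b1000]),
        vec![0b0001, 0b0011, 0b0010, 0b1000]
    );
    println!("ok");
}
```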

1 Like

Taking a big step back out of the weeds (which I’m very happy to be digging around in), let’s revisit the original purpose of this topic…

Regardless of how we choose to relocate vaults or how we measure spacings, this is what seems to happen:

  • There’s about a factor of two in storage between the smallest vault in the section and the largest vault in the section, eg 300 GB smallest vs 600 GB largest.

  • There’s about 50% more storage between the first quartile (ie middle of the smallest half) and the third quartile (ie middle of the biggest half), eg 350 GB first quartile vs 525 GB third quartile.

I think this is ok. It’s definitely not an ‘even’ or ‘uniform’ load but it’s not orders of magnitude different.

Running multiple vaults will (probably) smooth this out on a per-operator basis. You (probably) get hit worst by the variation when running only one vault.

Relocations will (probably) smooth this out over time for any single vault, but may cause a vault to go from storing lots to storing little (which is manageable but inconvenient because of less rewards) or go from storing little to lots (which may cause the vault to be booted from the network if they’re already near their storage capacity).

I guess what’s interesting to me is (like bitcoin mining variability) finding what’s the rough range of expected variation and how does it affect the incentive structures and participation rates.

5 Likes

Where do 16TB drives fit in? I know we’ve discussed this before, but since network bandwidth is the limiting factor, it seems that large vaults are a benefit to the network. A 128TB vault will be commonplace.

1 Like

Nowhere in the community network where those figures come from.

Nowhere really in the early network either, since storage needs are at their minimum for the life of the network. Also, bandwidth requirements probably kill any real chance of using multiple vaults to take advantage of them.

Maybe later on, when storage needs mean larger vaults are useful, but even then you’d probably want to split them up.

And the price on them is at a premium: >$1000 in Australia at the moment.

2 Likes

Thanks a lot for this work, great to see the trend and how important this criterion is.

The current code to do that is the following.
So yes, your method is reasonably accurate. A couple of things:

  • Need to handle the upper and lower bounds (i.e. the biggest gap may be before the first vault or after the last one).
  • I was not fully accurate: with our best fit we shrink the interval to its middle third, which will likely drive it a bit closer to uniformly distributed.
/// Calculate the interval for a node joining our section to generate a key for.
pub fn calculate_relocation_interval(
    prefix: &Prefix<XorName>,
    section: &BTreeSet<XorName>,
) -> XorTargetInterval {
    let (lower_bound, upper_bound) = (prefix.lower_bound(), prefix.upper_bound());

    let (start, end) = iter::once(&lower_bound)
        .chain(section)
        .chain(iter::once(&upper_bound))
        .tuple_windows()
        .max_by(|&(x1, y1), &(x2, y2)| {
            let diff1 = y1 - x1;
            let diff2 = y2 - x2;
            diff1.cmp(&diff2)
        })
        .unwrap_or((&lower_bound, &upper_bound));

    let third_of_distance = (*end - *start) / 3;
    let new_end = *end - third_of_distance;
    XorTargetInterval(new_end - third_of_distance, new_end)
}
6 Likes

I think this maybe doesn’t do what is intended, because it uses operations in linear space instead of xor space. I’m not totally sure, but the results became quite strange when I included this ‘narrowing’ feature (consistently, across multiple tests).

Without the extra ‘narrowing’ of the relocated name:

best_fit_no_narrowing

With the extra narrowing of the range of the relocated name:

best_fit_with_narrowing

A few vaults got a huge number of chunks, and it happens quite reliably.

Consider the range of names (100, 200), which has an xordistance of 100^200=172. There are 100 possible names to pick between those two. Names in (133, 166) are the middle third, but they aren’t the ones that best break up the gap. For example, picking name 156 creates two new gaps, (100, 156) and (156, 200), of 100^156=248 and 156^200=84. So there’s now a much larger gap of 248 than the previous gap of 172. New names which would lead to a smaller gap than 172 are in the ranges (101, 127) and (192, 199); any other name leads to a larger new gap.

edit: it’s late here and been a long day so I think I’ve got this last paragraph back to front. Do we want to make the new xordistances as large as possible or as small as possible? Will revisit tomorrow… thanks Jean-Philippe for the clarifications especially the code.

edit 2:

My calculation for the space between vaults is bigName^smallName, but from the .max_by part of Jean-Philippe’s code block it looks like it should be bigName-smallName…?

7 Likes

Edit:
~~No, this is OK~~ Yes, there might be a problem, because std::ops::Sub has been implemented for XorName to redefine the - operator as a subtraction on the XorName converted to a big uint.

Note:
The piece of code highlighted by @Jean-Philippe is aimed at selecting a range around the middle of the largest interval between 2 nodes (or between the lower bound of the section and its lowest node, or the upper bound of the section and its highest node).

We cannot just take the middle of the largest interval, because an address is a public key and we cannot generate the private key from the public key. So key pairs are repeatedly generated until a public key falls inside this range.
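A toy illustration of that loop (a xorshift PRNG stands in for real key generation, and u64 names stand in for public keys):

```rust
// Toy illustration of "generate key pairs until the name falls in the
// target range". A xorshift PRNG stands in for key generation; in the
// real network the name would come from a freshly generated public key.

fn xorshift(state: &mut u64) -> u64 {
    let mut x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    x
}

/// Generate candidate names until one lands inside [start, end].
fn name_in_interval(start: u64, end: u64, seed: u64) -> (u64, u32) {
    let mut state = seed;
    let mut attempts = 0;
    loop {
        attempts += 1;
        let name = xorshift(&mut state);
        if (start..=end).contains(&name) {
            return (name, attempts);
        }
    }
}

fn main() {
    // an interval covering a third of the namespace: a few attempts expected
    let (lo, hi) = (u64::MAX / 3, 2 * (u64::MAX / 3));
    let (name, attempts) = name_in_interval(lo, hi, 42);
    assert!((lo..=hi).contains(&name));
    println!("found a name in range after {} attempts", attempts);
}
```

The expected number of attempts is roughly (size of namespace) / (size of interval), which is the ‘work’ cost mentioned earlier in the thread.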

1 Like

This is precisely what Poisson disk sampling does in real space. Randomly generate coordinates until they fall within a given bound, plus some optimizations… :checkered_flag:

3 Likes

For my own future reference, the operations on xornames are in routing/src/xorname.rs

The place in code where Sub is implemented for XorName (ie the - operation)

and also for Div (ie the / operation)

Both look to be operating in linear space to me, not xorspace. (Not to say this is ‘right or wrong’ or ‘better or worse’, just pointing it out)

There’s the function closer (which takes two xornames and finds if the first one is closer to us than the second one)

I can’t find a function xordistance that gives the distance between one xorname and another. Looks like chunk and vault placement is done relative to closer or not closer, rather than by a distance number. Just found this curious.


This test was using xordistance to measure spacing between vaults, which is not how it’s currently done in routing (thanks again Jean-Philippe for clarifying).

I changed to linear spacing plus the extra ‘narrowing’ of the new name; the results look normal again (2x max:min and 1.5x 3qtl:1qtl) with no really heavily loaded vaults. I have a chart but not much point uploading, it looks just like the others!

Main point: we can tweak the relocation mechanism but mostly it ends up with vault storage showing very similar distributions.

7 Likes

I find it very difficult to think easily about xor distance, so I did some exploring in the hope of building some intuition for this fairly unusual way of thinking.

The first thing is to build some background understanding using the normal ‘linear’ way of thinking about distance. It’ll seem obvious, but it’s important scaffolding for exploring deeper.

If there are two towns, one 150km away and one 250km away, we can easily say that they’re 100km away from each other by simple subtraction. Let’s call this separation.

We can also say if we’ve travelled 20km from the first town toward the second town there’s 80km left to go, also by simple subtraction. Let’s call this splitting.

One of the obvious properties of linear distance is that any split will always sum to the separation.

Now, moving to a very simple case of xordistance, let’s look at two towns, one at 0km and one at 255km.

The separation is 0^255=255km, exactly the same as the linear result.

Amazingly, any split also sums to 255. eg splitting at 5 gives 0^5 + 5^255 = 255.

This is really obvious from the chart below (which is the same for both linear distance and xordistance).

But things get strange in xorspace when we move to unusual intervals.

The separation between 100 and 200 is 172, 100^200=172.

And by splitting at, say, 156 we end up with larger distances than the original separation! We get 100^156=248 and 156^200=84. So by ‘driving’ a bit over ‘halfway’ between the two locations the combined distance before and after our current location is 248+84=332, more than twice the separation. This is really unintuitive and we can’t really use metaphors or intuitions from linear thinking because they break down.
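These examples are easy to verify mechanically with 8-bit names:

```rust
// Checking the splitting examples above with 8-bit names.
fn main() {
    // splitting (0, 255) anywhere: the two xor distances always sum to 255
    for split in 1u8..=254 {
        assert_eq!((0u8 ^ split) as u16 + (split ^ 255) as u16, 255);
    }
    // but splitting (100, 200) at 156 creates gaps bigger than the original
    assert_eq!(100u8 ^ 200, 172);
    assert_eq!(100u8 ^ 156, 248);
    assert_eq!(156u8 ^ 200, 84);
    println!("ok");
}
```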

The chart below shows how strange xor distances get when dealing with ‘unusual’ intervals; splitting in xor space can behave very unexpectedly.

This is important because when a chunk lands between several vaults it’s not necessarily intuitive which vaults are closest. In linear distance you just grab the visually nearest ones on a 1D axis, but that’s not necessarily true with xordistance.

So… not really any great takeaway from this, other than: maybe don’t try to think too hard about xordistance using normal Euclidean space or Cartesian coordinates, because it’s very different to linear distance.

It’d be great to have a visual language that makes xordistance more intuitive. Anyone know of something suitable?

It’d also be great to have some plain-English sentences that help break out of linear distance thinking, a bit like zen koans perhaps. One I’ve been trying is “many matching leading bits give small xordistances”, but for e.g. 100^200 it’s hard to intuit what the matching leading bits are.

12 Likes

In order to easily conceptualize xor space, you need to think of it as the mapping of a dense binary tree. A given coordinate written in binary represents the path taken from the root of the tree down a branch to a leaf. For example, a binary coordinate of 0101 would mean traversing left-right-left-right down four levels of the tree.

When you take the xor distance of two coordinates, it is a measure of the number of levels you need to traverse up the tree, from the leaves toward the root, before you reach the nearest common ancestor of those two coordinates. The xor distance metric is distance in ‘hops’ to the nearest common ancestor.
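Under that reading, w-bit names that share their first k bits have a common ancestor w−k levels up, so the hop count is just the bit length of a^b. A quick sketch with 8-bit names:

```rust
// Hops up the binary tree to the nearest common ancestor of two 8-bit
// names: 8 minus the number of shared leading bits, i.e. the bit length
// of a ^ b. (Ordering pairs by this hop count is consistent with
// ordering them by xor distance, since a higher leading bit always
// dominates the lower bits.)
fn hops_to_common_ancestor(a: u8, b: u8) -> u32 {
    8 - (a ^ b).leading_zeros()
}

fn main() {
    // names differing only in the last bit: ancestor one level up
    assert_eq!(hops_to_common_ancestor(0b0101_0000, 0b0101_0001), 1);
    // 0x7F and 0x80 share no leading bits: ancestor is the root, 8 hops up
    assert_eq!(hops_to_common_ancestor(0x7F, 0x80), 8);
    // identical names: zero hops
    assert_eq!(hops_to_common_ancestor(42, 42), 0);
    println!("ok");
}
```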

This paper gives a decent overview, there are many others out there.

Frisken, S.; Perry, R., “Simple and Efficient Traversal Methods for Quadtrees and Octrees”, Journal of Graphics Tools, Vol. 7, No. 3, May 2003.
http://www.merl.com/publications/docs/TR2002-41.pdf 1

Edit: @mav to use your two-towns-255km-apart metaphor: linear distance is what an airplane would travel in a straight line; two times the xor distance is what a car would drive through an optimal branching tree network of roads, if each ‘hop’ is 1km.

6 Likes