I expected a more uniform distribution over xor space with as many as 100 vaults.
Redistribution of vaults would make xor addresses less random (and a section possibly less secure?)
Let's compare three strategies for vault naming: uniform vault names vs random vault names vs best-fit vault names (where vaults are required to join the section with a random name within the biggest existing gap between vault names).
I've been using the term "spacing" for the gap between two vault names.
Variation in spacing is a good measure of uniformity. Low variation of spacing means all the vaults are pretty evenly spaced out, and high variation means there's a lot of bunching and also big gaps.
The difference is definitely noticeable, but best fit doesn't bring the variation much closer to uniform.
| Naming | Variation |
|---|---|
| Uniform | 0.018 |
| Best Fit | 1.420 |
| Random | 5.340 |
(Nerd details: Uniform variation is not exactly 0 because you can't exactly fit 100 vaults uniformly into a binary namespace. Variation is actually the standard deviation of all spacings divided by 10^18, using vault names of type uint64. The variation is taken from five tests to give an average variation.)
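As a rough sketch of how that variation measure could be computed (the function name `spacing_variation` is mine, not from the simulation repo), assuming vault names are plain `u64` values:

```rust
// Standard deviation of the spacings between consecutive vault names,
// divided by 1e18 as described above. Names are sorted, adjacent gaps are
// taken by linear subtraction, then the gaps' standard deviation is scaled.
fn spacing_variation(mut names: Vec<u64>) -> f64 {
    names.sort_unstable();
    let gaps: Vec<f64> = names.windows(2).map(|w| (w[1] - w[0]) as f64).collect();
    let mean = gaps.iter().sum::<f64>() / gaps.len() as f64;
    let variance = gaps.iter().map(|g| (g - mean).powi(2)).sum::<f64>() / gaps.len() as f64;
    variance.sqrt() / 1e18
}

fn main() {
    // 100 (near) uniformly spaced names give a variation close to zero,
    // while heavily bunched names give a large variation.
    let uniform: Vec<u64> = (0..100).map(|i| i * (u64::MAX / 100)).collect();
    println!("uniform variation: {}", spacing_variation(uniform));
}
```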
An example vault distribution is shown below, first random vault names, then best-fit vault names. To my eye there's not much difference, but the statistical analysis shows it's actually quite significant and easily detected.
In summary, the strategy to relocate between two specific vaults definitely seems to help give a more uniform spacing.
The next question is, does spacing of vaults contribute to keeping the chunk load fairly even?
From the charts below, my eye would guess the distribution for best-fit vault naming definitely falls between random and uniform naming. Best-fit naming seems much closer to random than uniform. The difference between the heaviest loaded vault for each naming strategy seems to roughly line up with the variation test.
This is a distillation of a lot of info and testing, thereâs potentially a lot of detail to get into so if you have questions Iâm happy to go deeper.
Repository of tests here: https://github.com/iancoleman/safe_chunk_responsibility_simulation
Oh sorry, my wording is confusing. Duplicates 1 means 1 copy of each chunk in the section, so 1 is no duplicates…! 8 duplicates means 8 copies, not original + 8. I used the word "duplicates" as a form of "duplication" but the translation wasn't quite right, huh.
Super super super cool!
A good reason to have binary section sizes 128, 256, 512 etc.?
When you take into account vault size distributions, does this even matter? A 256TB vault should have more chunks than a 256GB vault. Now if instead all vaults are of fixed size, then vault placement is more critical, but random placement for random sizes might be just fine. Does your sim take into account vault size, or placement based on size?
That was my initial interpretation, but then I got confused (and still am) because you guys speak of close vaults being responsible for the same chunks. Isn't it only one vault responsible for the chunk if there is no duplication?
Anyway, I don't want to distract from the more interesting discussion about spacing of vaults.
I donât think we need to absolutely reduce the variation in spacing (of vaults). We just need to have a relocation strategy that tries to reduce the heterogeneity between sections.
Here is a reminder of the rules I used to get this result in my simulations (what I call network homogeneity here):
I'll add this strategy too.
Another thing to try is whether relocations for the best-fit strategy improve spacing. I'm just doing the initial joins to a section and not doing any relocations in or out. That might even things out? Will test it when I have some time.
Also I want to work out how much work it takes to be more balanced. Random is no work, best fit is some extra work, uniform is extreme amounts of work (impossible). So you'd be able to say "with X work you can get Y spacing". Depends how fast you can generate new keypairs and how many you're able to retain for later use.
Not really. Uniformity doesn't matter, and perfect uniformity requires generating an exact public key, which is almost impossible.
Yes it matters, because eventually the smaller vault will be relocated to a part of a section which puts high load on it, so larger vaults have the advantage (they can do any relocation, but small vaults can only do some relocations). Maybe it's ok because it's a way to gradually prune out the smaller vaults?!
No it doesnât. Might be helpful for allowing smaller and larger vaults to coexist on the network.
The relocation rules are definitely an area that could be quite useful but I think it needs pretty careful consideration from a security perspective. My intuitive preference is still leaning toward random relocation rather than targeted, although @tfa seems to have a pretty good compromise.
An algorithm like poisson disk sampling with a variable influence region based on vault size might be what you are looking for. It would operate like random relocation, but you could tailor the distribution…
Best fit is impossible too, because the nodes don't have global knowledge of the network, so the 2 close nodes with the maximum distance between them cannot be selected.
In a simulation care must be taken to use an algorithm that can be implemented in the real network. This strategy can only be approximated in the real network, for example using a best fit relocation strategy in the vicinity of a random section.
Another thing: xor distance is not the same as linear metric distance. Two nodes that are close in xor space will also be close in the metric space, but the converse isn't necessarily true; the most extreme counterexample is addresses 0x7FFF… and 0x8000…, which have a maximal distance in xor space but a minimal distance in the metric space. Though, I don't know if this would impact the results of your simulations.
Within a section you can pick the max distance, right? My understanding of how to do this is: 1) order all vaults in the section in ascending order of name, 2) calculate the distance between each adjacent pair of vaults to find the maximum distance, 3) wherever the largest distance is, tell the joining/relocating vault to generate a key between those two vault names. Am I missing something?
Do you mean you can't pick the section within the global network that has the maximum distance? Because I'm only assuming you pick "any" section and then after that you can find the best fit within that section. There may be other sections with larger spacings. I'm not concerned by picking the section to relocate to, only concerned with where to relocate within the destination section.
Note in step 2 the spacing between vaults may be calculated as `subtract(biggerName, smallerName)` or it may be `xor(biggerName, smallerName)`. Do you know which is more suitable to use? I've got an option for both in my simulation.
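For what it's worth, here's a minimal sketch of step 2 with both spacing options side by side (`largest_gap` is my own illustrative function, not from the routing crate). It also shows the two metrics can disagree about which gap is biggest:

```rust
// Find the adjacent pair of vault names with the largest spacing, using
// either linear subtraction or xor as the measure of spacing.
fn largest_gap(names: &[u64], use_xor: bool) -> (u64, u64) {
    let mut sorted = names.to_vec();
    sorted.sort_unstable();
    let pair = sorted
        .windows(2)
        .max_by_key(|w| if use_xor { w[0] ^ w[1] } else { w[1] - w[0] })
        .expect("need at least two names");
    (pair[0], pair[1])
}

fn main() {
    let names = [120u64, 130, 200];
    // Linear spacing picks (130, 200): the gaps are 10 and 70.
    assert_eq!(largest_gap(&names, false), (130, 200));
    // Xor spacing picks (120, 130): 120^130 = 250 but 130^200 = 74.
    assert_eq!(largest_gap(&names, true), (120, 130));
}
```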
Ah yes, I've not been dealing with relocations to the "best" section, just to any section. I'm not interested in the global network for these simulations, only looking within a section. I think finding the "best" section is a separate problem from where within that section to relocate.
Doing a relocation is a two-step process
I think these two steps should be completely uncoupled, even if they both aim to achieve the same goal. In my simulations I'm only dealing with step 2.
The distance metric is xor, so I would stay with the xor-based spacing strategy that you have now. When you plot the data, the uniformity might look odd if you plot vault locations on a linear axis based on id. Instead, you could sort the vaults by id, then start with the lowest id, followed by its nearest neighbor. You should be able to walk the nearest neighbors until all vaults are depicted. This may be rather challenging in 1D, so a 2D scatter plot might offer a better perspective.
Taking a big step back out of the weeds (which I'm very happy to be digging around in), let's revisit the original purpose of this topic…
Regardless of how we choose to relocate vaults or how we measure spacings, this is what seems to happen:
There's about double the storage between the smallest vault in the section and the largest vault in the section, eg 300 GB smallest vs 600 GB largest.
There's about 50% more storage between the first quartile (ie middle of the smallest half) and the third quartile (ie middle of the biggest half), eg 350 GB first quartile vs 525 GB third quartile.
I think this is ok. It's definitely not an "even" or "uniform" load, but it's not orders of magnitude different.
Running multiple vaults will (probably) smooth this out on a per-operator basis. You (probably) get hit by the variation worst when only running one vault.
Relocations will (probably) smooth this out over time for any single vault, but may cause a vault to go from storing lots to storing little (which is manageable but inconvenient because of less rewards) or go from storing little to lots (which may cause the vault to be booted from the network if it's already near its storage capacity).
I guess what's interesting to me is (like bitcoin mining variability) finding the rough range of expected variation and how it affects the incentive structures and participation rates.
Where do 16TB drives fit in? I know we've discussed this before, but since network bandwidth is the limiting factor, it seems that large vaults are a benefit to the network. A 128TB vault will be commonplace.
Nowhere in the community network where those figures come from
Nowhere really in the early network, since storage needs are at the minimum for the life of the network. Also, bandwidth requirements probably kill any real chance of using multiple vaults to take advantage of them.
Maybe later on, when storage needs mean larger vaults are useful, but even then you'd probably want to split them up.
But the price on them is at a premium: >$1000 in Australia at the moment.
Thanks a lot for this work, great to see the trend and how important this criterion is.
The current code to do that is the following.
So yes your method is reasonably accurate. A couple of things:
```rust
/// Calculate the interval for a node joining our section to generate a key for.
pub fn calculate_relocation_interval(
    prefix: &Prefix<XorName>,
    section: &BTreeSet<XorName>,
) -> XorTargetInterval {
    let (lower_bound, upper_bound) = (prefix.lower_bound(), prefix.upper_bound());
    let (start, end) = iter::once(&lower_bound)
        .chain(section)
        .chain(iter::once(&upper_bound))
        .tuple_windows()
        .max_by(|&(x1, y1), &(x2, y2)| {
            let diff1 = y1 - x1;
            let diff2 = y2 - x2;
            diff1.cmp(&diff2)
        })
        .unwrap_or((&lower_bound, &upper_bound));
    let third_of_distance = (*end - *start) / 3;
    let new_end = *end - third_of_distance;
    XorTargetInterval(new_end - third_of_distance, new_end)
}
```
I think this maybe doesn't do what is intended because it uses operations in linear space instead of xor space. Not totally sure, but the results became quite strange when I included this "narrowing" feature (every time, for multiple tests).
Without the extra "narrowing" of the relocated name:
With the extra narrowing of the range of the relocated name:
A few vaults got a huge number of chunks, and it happens quite reliably.
Consider the range of names between (100, 200), which has an xordistance of 100^200=172. There are 99 possible names to pick strictly between those two. Names between (133, 166) are the middle third, but they aren't the ones that best break up the gap. For example, picking name 156 will create two new gaps (100, 156, 200) of 100^156=248 and 156^200=84. So there's now a much larger gap of 248 than the previous gap of 172. New names which would lead to a smaller gap than 172 are in the ranges (101, 127) and (192, 199). Any other name leads to a larger new gap.
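Just to double-check the arithmetic above (these are plain `^` operations on `u8` values):

```rust
fn main() {
    assert_eq!(100u8 ^ 200, 172); // the original separation
    assert_eq!(100u8 ^ 156, 248); // new gap below the split at 156
    assert_eq!(156u8 ^ 200, 84);  // new gap above the split at 156
    // Splitting at 156 created a gap (248) bigger than the original (172).
    assert!((100u8 ^ 156) > (100u8 ^ 200));
    println!("xor gap arithmetic checks out");
}
```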
edit: it's late here and it's been a long day, so I think I've got this last paragraph back to front. Do we want to make the new xordistances as large as possible or as small as possible? Will revisit tomorrow… thanks Jean-Philippe for the clarifications, especially the code.
edit 2: My calculation for the space between vaults is `bigName ^ smallName`, but from Jean-Philippe's code block (the part starting `.max_by`) it looks like it should be `bigName - smallName`…?
Edit:
~~No, this is OK~~ Yes, there might be a problem, because `std::ops::Sub` has been implemented for `XorName` to redefine the `-` operator as a subtraction on the `XorName` converted to a big uint.
Note:
The piece of code highlighted by @Jean-Philippe is aimed at selecting a range around the middle of the largest interval between 2 nodes (or between the lower bound of the section and its lowest node, or between the upper bound of the section and its highest node).
We cannot just take the middle of the largest interval because an address is a public key, and we cannot generate the private key from the public key. So key pairs are repeatedly generated until the public key falls inside this range.
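That retry loop might be sketched like this; a xorshift64 PRNG stands in for real keypair generation (which would derive a name from a freshly generated public key), so `name_in_range` and everything around it is illustrative, not the routing implementation:

```rust
// Simple xorshift64 PRNG standing in for "generate a keypair and take the
// name derived from the public key". Seed must be non-zero.
fn xorshift64(state: &mut u64) -> u64 {
    let mut x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    x
}

// Keep "generating keys" until the resulting name lands inside the target
// interval; return the name and how many attempts it took.
fn name_in_range(start: u64, end: u64, seed: u64) -> (u64, u32) {
    let mut state = seed;
    let mut attempts = 0;
    loop {
        attempts += 1;
        let candidate = xorshift64(&mut state);
        if (start..=end).contains(&candidate) {
            return (candidate, attempts);
        }
    }
}

fn main() {
    // Target the middle third of the whole u64 namespace.
    let (name, attempts) = name_in_range(u64::MAX / 3, 2 * (u64::MAX / 3), 42);
    assert!(name >= u64::MAX / 3 && name <= 2 * (u64::MAX / 3));
    println!("found a name in range after {} attempts", attempts);
}
```

The narrower the target interval, the more keypairs have to be generated before one fits, which is the practical cost of narrowing the range.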
This is precisely what poisson disk sampling does in real space: randomly generate coordinates until they fall within a given bound, plus some optimizations…
For my own future reference, the operations on xornames are in routing/src/xorname.rs
The place in code where `Sub` is implemented for `XorName` (ie the `-` operator), and also `Div` (ie the `/` operator).
Both look to be operating in linear space to me, not xorspace. (Not to say this is "right or wrong" or "better or worse", just pointing it out.)
There's the function `closer` (which takes two xornames and finds if the first one is closer to us than the second one). I can't find a function `xordistance` that gives the distance between one xorname and another. Looks like chunk and vault placement is done relative to closer or not closer, rather than by a distance number. Just found this curious.
This test was using xordistance to measure spacing between vaults, which is not how it's currently done in routing (thanks again Jean-Philippe for clarifying).
I changed to linear spacing plus the extra "narrowing" of the new name; the results look normal again (2x max:min and 1.5x 3qtl:1qtl) with no really heavily loaded vaults. I have a chart but not much point uploading, it looks just like the others!
Main point: we can tweak the relocation mechanism but mostly it ends up with vault storage showing very similar distributions.
I find it very difficult to think easily about xor distance, so I did some exploring and hope to build some intuition about this fairly unusual way of thinking.
The first thing to do is build some background understanding using the normal "linear" way of thinking of distance. It'll seem obvious but is important scaffolding for exploring deeper.
If there are two towns, one 150 km away and one 250 km away, we can easily say that they're 100 km away from each other by simple subtraction. Let's call this separation.
We can also say that if we've travelled 20 km from the first town toward the second town there's 80 km left to go, also by simple subtraction. Let's call this splitting.
One of the obvious properties of linear distance is that any split will always sum to the separation.
Now, moving to a very simple case of xordistance, let's look at two towns, one at 0 km and one at 255 km.
The separation is 0^255=255km, exactly the same as the linear result.
Amazingly, any split also sums to 255. eg splitting at 5 gives 0^5 + 5^255 = 255.
This is really obvious from the chart below (which is the same for both linear distance and xordistance).
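This property is easy to verify exhaustively for the (0, 255) interval, since s^255 is just the bitwise complement of s in 8 bits, ie 255 - s:

```rust
fn main() {
    // For every split point s, the two xor distances sum to the separation.
    for s in 0u32..=255 {
        assert_eq!((0 ^ s) + (s ^ 255), 255);
    }
    println!("every split of (0, 255) sums to the separation of 255");
}
```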
But things get strange in xorspace when we move to unusual intervals.
The separation between 100 and 200 is 172, 100^200=172.
And by splitting at, say, 156 we end up with larger distances than the original separation! We get 100^156=248 and 156^200=84. So by "driving" a bit over "halfway" between the two locations, the combined distance before and after our current location is 248+84=332, almost twice the separation. This is really unintuitive, and we can't really use metaphors or intuitions from linear thinking because they break down.
The chart below shows how crazy xor distances get when dealing with "unusual" intervals. Splitting in xor space can behave very strangely.
This is important because when a chunk "splits" several vaults it's not necessarily intuitive which vaults are closest. In linear distance you just grab the visually nearest ones on a 1D axis, but that's not necessarily true with xordistance.
So… not really any great takeaway from this, other than maybe don't try to think too hard about xordistance using normal Euclidean space or cartesian coordinates, because it's very different to linear distance.
It'd be great to have a visual language that makes xordistance more intuitive. Anyone know of something suitable?
It'd also be great to have some plain-English sentences that help break out of linear distance thinking, a bit like zen koans perhaps. One that I've been trying is "many matching leading bits give small xordistances", but for eg 100^200 it's hard to intuit what the leading bits are.
In order to easily conceptualize xor space you need to think about it like the mapping of a dense tree. A given linear or cartesian coordinate in binary represents the path taken from the root of the tree down a branch to a leaf. For example, a binary coordinate of 0101 would mean traversing left-right-left-right down four levels of the tree.
When you take the xor distance of two linear coordinates, it is a measure of the number of levels you need to traverse up the tree, from the leaves toward the root, before you reach the nearest common ancestor of those two coordinates. The xor distance metric is distance in "hops" to the nearest common ancestor.
This paper gives a decent overview, there are many others out there.
Frisken, S.; Perry, R., "Simple and Efficient Traversal Methods for Quadtrees and Octrees", Journal of Graphics Tools, Vol. 7, No. 3, May 2003.
http://www.merl.com/publications/docs/TR2002-41.pdf
Edit: @mav to use your two towns 255km apart metaphor, linear distance is what an airplane would travel in a straight line; two times xor distance is what a car would drive through an optimal branching tree network of roads, if each "hop" is 1 km.
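The "hops to the nearest common ancestor" view can be computed directly: for u64 names in a depth-64 binary tree, the number of levels to climb is the bit length of a ^ b. A quick sketch (`hops_to_common_ancestor` is my own helper, purely illustrative):

```rust
// Levels to climb from the leaves before reaching the nearest common
// ancestor of a and b in a depth-64 binary tree over u64 names.
fn hops_to_common_ancestor(a: u64, b: u64) -> u32 {
    64 - (a ^ b).leading_zeros()
}

fn main() {
    // Names sharing a long common prefix have a nearby common ancestor.
    assert_eq!(hops_to_common_ancestor(0b0101, 0b0100), 1);
    // 0x7FFF... and 0x8000... differ in the top bit, so the only common
    // ancestor is the root: maximal xor distance, minimal linear distance.
    assert_eq!(hops_to_common_ancestor(u64::MAX / 2, u64::MAX / 2 + 1), 64);
}
```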