Speed and Resiliency through Forward Error Correction

Speed as in lower latency,* and resiliency in a geographical sense.

* i.e. not higher bandwidth; we’re already good on that front

The problem

  1. When a small enough geographical region (e.g. a country, through political intervention) gets isolated from the rest of the network, it will lose access to a lot of files, simply because some of their blocks will have none of the 4 mandatory copies* stored within the isolated area, and will thus be inaccessible.

    * Is that still how it works?

  2. I did a small experiment recently, where I randomly placed a million nodes across the world based on population density*, and then I connected them at random. If I remember correctly, the average latency between two nodes was 45 ms, but that would probably double after adjusting for IP routing overhead and the lower speed of light in fiber or copper.

    * I used a 1/4 degree grid to select the general area, then added some noise with a 1/2 degree standard deviation.

    What this means is that each SAFE routing hop will add about 0.2 s to the round-trip latency, so we’ll have to wait a full second even if we’re only 5 hops away from the vault that hosts the block we want. The good thing is, we can request a bunch of stuff in parallel, so the bandwidth will still be quite high. Still, a one-second (on average! it can easily be 4x as much) pause feels long.
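
To make the arithmetic concrete, here’s a quick back-of-the-envelope sketch; the 0.1 s one-way per-hop figure is just the 45 ms minimum from above, roughly doubled for the routing and medium overhead:

```python
# Assumed mean one-way latency per hop, from the simulation above.
ONE_WAY_PER_HOP_S = 0.1

def roundtrip_latency(hops: int) -> float:
    """Round-trip time for a request relayed over `hops` hops."""
    return 2 * hops * ONE_WAY_PER_HOP_S

for hops in (3, 5, 7):
    print(f"{hops} hops -> ~{roundtrip_latency(hops):.1f} s round trip")
# 5 hops -> ~1.0 s, which is where the one-second figure comes from
```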

How could FEC help?

Let’s say we split up each block into 8 pieces (i.e. 125 KB each), then generate 8 more pieces in a way that any 8 of the 16 pieces are sufficient to restore the whole block. Now we can lose about 2.67x as many pieces (8 pieces vs 3 copies) without losing the block, so it’s much more likely that an isolated area has access to enough pieces to put it together.
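
For the curious, here’s a minimal sketch of how that trade-off can be checked, assuming each copy or piece is independently reachable with probability p (e.g. the chance that it happens to be stored inside the isolated area):

```python
from math import comb

def replication_availability(p: float, copies: int = 4) -> float:
    """Block survives if at least 1 of the full copies is reachable."""
    return 1 - (1 - p) ** copies

def erasure_availability(p: float, k: int = 8, n: int = 16) -> float:
    """Block survives if at least k of the n pieces are reachable."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

for p in (0.3, 0.5, 0.7, 0.9):
    print(f"p={p}: 4 copies -> {replication_availability(p):.4f}, "
          f"8-of-16 -> {erasure_availability(p):.4f}")
```

Note that 8-of-16 stores 2 MB per 1 MB block versus 4 MB for 4 full copies, so whatever resiliency it buys comes at half the storage cost; the k and n parameters make it easy to try the other combinations mentioned below.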

What may not be obvious at first is that we can also lower the average distance between a client and the destination: we just have to pick the 8 pieces out of the 16 that are closest to one of our close nodes, and request those instead of a random selection, as sketched below. (When some of them aren’t returned, we can start requesting the pieces farther away from us.)
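
Something like this, where the holders list and the latency estimates are hypothetical; it’s only meant to illustrate the selection logic:

```python
def pieces_to_request(holders, k=8):
    """Given (piece_id, estimated_latency_s) pairs for all 16 pieces,
    return the k closest to try first, plus the rest as fallbacks."""
    ranked = sorted(holders, key=lambda h: h[1])
    return ranked[:k], ranked[k:]

# hypothetical estimated latencies (seconds) to the 16 piece holders
latencies = [0.08, 0.35, 0.12, 0.61, 0.09, 0.44, 0.15, 0.52,
             0.11, 0.72, 0.19, 0.58, 0.10, 0.40, 0.22, 0.66]
primary, fallback = pieces_to_request(list(enumerate(latencies)))
# request `primary` in parallel; when some piece isn't returned,
# pull replacements from `fallback`, closest first
```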

Of course, the 8-of-16 setup is just an example. It could be 4 of 12 or 8 of 32, or any other combination.

The good

  • Lower latency, happier users.
  • Data is transferred over fewer hops, so bandwidth is better utilized.
  • Less storage is required for the same (or, actually, much higher) redundancy.
  • If part of the network gets isolated, it will still have access to much more of the data.

The bad

  • More requests are needed for the same amount of data. (But we may still end up with a similar, or not much higher, number of requests network-wide because of the lower number of hops; see the sketch below. Just to avoid confusion, these requests are for 125 KB pieces, not the 1 MB full blocks.)
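
To make the trade-off concrete, here’s a toy calculation; the hop counts are hypothetical (say 5 hops to a random full copy vs. ~3 hops to nearby pieces):

```python
# messages ~ requests x hops they are relayed over
full_block_msgs = 1 * 5       # 1 request for a 1 MB block over 5 hops
piece_msgs      = 8 * 3       # 8 requests for 125 KB pieces over ~3 hops

# bandwidth ~ payload size x hops it is carried over (MB*hops)
full_block_traffic = 1.0 * 5          # 5.0 MB*hops
piece_traffic      = 0.125 * 8 * 3    # 3.0 MB*hops

print(full_block_msgs, piece_msgs)          # 5 vs 24 messages
print(full_block_traffic, piece_traffic)    # 5.0 vs 3.0 MB*hops
```

So the request count does go up, but the bulk data travels fewer hops in total, which is where the bandwidth savings come from.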

Great stuff. What is the expected “average” number of hops being 5 based on? Can you elaborate on the calculation? Thanks.


Just a guess; it may be 3 or 7 for all I know (probably not higher than that), but 5 seemed like a conservative lower bound for the mean. Maybe @dirvine or someone else from the team can give us a more educated estimate.


What/how did you test exactly? A million nodes across the world doesn’t sound like a small experiment to me.

What we do is replicate :slight_smile: So there are 8 copies of each chunk, all geographically distributed due to xor space etc.

At the moment
You request a chunk, and 7 nodes give you a small message with the hash of the chunk and who is sending it. The sender node then sends the data.

Better

All 8 send a small message with the hash of the data. You select the first one to answer and get the whole chunk from it. Then you get the data at the fastest network speed, but at the cost of an extra message.
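
In code, the “better” scheme is essentially a race; here’s a rough sketch, where ping and get_chunk are simulated stand-ins for the real network calls:

```python
import asyncio, random

async def ping(holder, chunk_hash):
    # stand-in for the small "I hold this chunk" message
    await asyncio.sleep(random.uniform(0.05, 0.4))  # simulated RTT
    return holder

async def get_chunk(holder, chunk_hash):
    # stand-in for transferring the full 1 MB chunk
    await asyncio.sleep(0.1)
    return f"chunk {chunk_hash} from {holder}"

async def fetch_chunk(chunk_hash, holders):
    # race the 8 small replies; fetch from whoever answers first
    pings = [asyncio.create_task(ping(h, chunk_hash)) for h in holders]
    done, pending = await asyncio.wait(pings,
                                       return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()
    fastest = done.pop().result()
    return await get_chunk(fastest, chunk_hash)

print(asyncio.run(fetch_chunk("ab12cd34", [f"node{i}" for i in range(8)])))
```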


It isn’t exactly big either. You just get a population grid from e.g. Columbia Uni’s website, then use the numbers as the weights for a multinomial distribution. Whichever cell is picked for a given node, you put down the coordinates next to it. Then, you pick pairs at random, compute the great-circle distance between them, and divide it by the speed of light to get the smallest possible time it takes to send a packet between the two nodes around the globe. THE END.
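
Roughly this, in code; the population weights below are random stand-ins for the real grid counts:

```python
import numpy as np

rng = np.random.default_rng(0)
C_KM_PER_S = 299_792.458   # speed of light in vacuum
EARTH_RADIUS_KM = 6371.0

# Stand-in for the population grid: weights over a 1/4-degree cell grid.
lats = np.arange(-60, 75, 0.25)
lons = np.arange(-180, 180, 0.25)
grid_lat, grid_lon = np.meshgrid(lats, lons, indexing="ij")
weights = rng.random(grid_lat.size)   # replace with real population counts
weights /= weights.sum()

# Place nodes: population-weighted cell choice, then 1/2-degree-sd jitter.
n_nodes = 1_000_000
cells = rng.choice(grid_lat.size, size=n_nodes, p=weights)
node_lat = grid_lat.ravel()[cells] + rng.normal(0, 0.5, n_nodes)
node_lon = grid_lon.ravel()[cells] + rng.normal(0, 0.5, n_nodes)

def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in kilometres."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Random pairs; distance / c is the smallest possible one-way time.
i, j = rng.integers(0, n_nodes, (2, 100_000))
d = great_circle_km(node_lat[i], node_lon[i], node_lat[j], node_lon[j])
print(f"mean lower-bound one-way latency: "
      f"{(d / C_KM_PER_S).mean() * 1000:.1f} ms")
```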


Are there any stats on how many XOR hops there usually are between two nodes? Extra messages especially can introduce extra latency.

Based on my simulation (for when SAFE is adopted across the globe), a single hop already takes around 60-100 ms on average, and twice as long for the round trip, so too many hops can introduce a significant delay. Twice as much again if there’s an extra message exchange involved. (I can’t remember the variance, and I’m not even sure the delays were normally distributed, so I’m not sure how multiple hops fared.)

I admit latency is not important in many cases (e.g. movies, live streams, etc.), but it could be very annoying in the few cases where it is. (I’m reminded of Google’s experiment about search page load delay.)


@dirvine, are “archive” nodes included in the “8 copies”, or are they separate and thus more than 8 copies over time? Or are they part of the 8 copies, with inactive chunks moving into archive nodes over time?


As I understand it, with Datachains the archive nodes, possibly 3 in each group (sector), will be those with the longer chains.

The nodes will compete to have longer chains and, in the end, become archive nodes, because with a longer chain you have more chances to win Safecoin.

And, for the moment, I don’t know how one group (sector) can distinguish between active and inactive chunks.


I revisited this subject today in a new thread. Lots of stats. Hopefully not completely wrong :joy_cat:
