Are Erasure Codes (Storj) better than Replication for the SAFE network?


#21

Also with your nodes staying up for months means that you are targeting your nodes are a decisively different level of audience.

We totally allow people downtime to upgrade and update, but yeah, people storing data on laptops is actively harmful. It’s harmful to any distributed storage system.

I should bow out of this conversation, but I guess consider me a messenger that there are hard, physical limits lurking around here, and anyone in this space needs to take a deep look at their architecture’s resource demands with real modeling and confirm that the resources even reasonably exist.


#23

Cheers for the answers. What I am saying is that you don’t know which parts exist in the erasure code scenario, so you can ask for A & B and wait on a timeout (if say B is missing) and then have to ask for C, or you could ask for all of them and only need two. So this is where erasure codes will cause a delay (till error) or require you download more chunks than you need as you do not know what are missing.

This is an example of what I mean, you may not be able to know the 20 to download if you start losing some.

Again I think you must download more than you need or have many waits to see if a part is missing.

Yes but only 1x of a chunk and only locally to the replicant group (not geo local, network local). I imagine with erasure you get some network location to manage each group of parts and they must monitor this and get more than a chunk back or wait. Then you need to upload the missing pieces at least. So the tradeoff to save bandwidth might not be what you imagine.

I agree to an extent, but also note bandwidth speed increases faster than cpu does right now. Disk space also, but there are some big advances in disk space right now. For disk, it is very interesting. I still believe that erasure codes are much more space efficient, but I am not convinced they are bandwith efficient as the control messages and rpcs to rebuild and re-upload are not insignificant.

A huge part is how the network knows when it loses a piece or chunk and how quickly it recovers it. That time T needs to be as short as posisble and short enough that recovery is well within that time period. We try to get that at network speed, by knowing a machine has the data and knowing in a group when that machine is unreachable. Then the group adds a new member ot hold that missing chunk and all the rest close to the xor address.

Anyhow, I hope this helps you see what I am saying. It’s nice to collaborate like this, so thanks very much.


#24

No worries, chunks are max 1Mb so a file can have many. min filesize of 3k is stored in a single file, So losing a chunk means losing a part of a file.

It is related and I should have made that more clear. Thanks again


#25

It’s nice to collaborate like this, so thanks very much.

I agree! A rising tide raises all boats. We’re a small enough corner of the cloud space right now that any decentralized cloud storage successes will help all of us.

Cheers for the answers. What I am saying is that you don’t know which parts exist in the erasure code scenario, so you can ask for A & B and wait on a timeout (if say B is missing) and then have to ask for C, or you could ask for all of them and only need two. So this is where erasure codes will cause a delay (till error) or require you download more chunks than you need as you do not know what are missing.

You are right, that for best performance on reads you probably should speculatively download more to deal with potential failures. In the scenario where the RS parameters are something like 30/80 though, you don’t need to download much more. If on average, 10% of your requests fail (which is high), you only need to request on average 38 pieces to have an overall success.

But isn’t this problem of potential failure also true for replication? Perhaps the replica you choose fails? You’ll have to wait or download from two replicas to deal with that type of failure. With replication maybe the mean is better but the variance and the 95th percentile situation is worse?

In terms of data repair and maintenance, it’s okay to wait for a piece. Data repair and maintenance isn’t performance critical, so it should be fine to take the wait on timeout approach there.

Thanks for the reply!


#26

Yes, I meant it as relative efficiency, that is, higher redundancy for the same number of bytes.

This is important actually. With the significantly higher redundancy that can be reached under sensible storage budgets, most losses wouldn’t require urgent action.

Now this is something I can’t argue with :slight_smile: Firstly, it’s a more complex method and, while the algorithm would be coded only once, it would pose overhead each time a block from a chunk got lost.

I can’t argue about the computation requirements either. We seem to be getting to the end of the range of Moore’s law and erasure codes don’t lend themselves easily to parallelization.

Yes, erasure coding would include an additional layer between the stored bytes and the actual data which could add additional latency.

This could be more than offset as erasure coding could also speed up things, and this may be important if latency becomes a problem. A client could request more blocks (e.g. 5 instead the necessary 4) and use the fastest 4 to put together the chunk. That would lower the mean latency and its variation a lot at minimal extra bandwidth cost, and it’s also tuneable (e.g. by requesting 6 instead of 4 blocks). Getting the same increase without them would waste a lot of bandwidth, so something like this is not necessarily feasible without erasure coding. EDIT I just saw this is discussed under “3.4.2 Erasure codes’ effect on long tail” in the new Storj paper…


#27

100%

Not in our scheme. We store all replicants xor close to the address of the chunk (where it’s name == hash of its contents). So you ask for a chunk and the group closest will have it and guarantee it’s recovery to you. This is done by the network, so all nodes close to the address must hold those chunks. The nodes are tested on joining a group and throughout the session. Nodes that lose chunks are punished and another node will deliver the chunk, but the client does not see all of that. This close goup notion allows us to have nodes close to a node, manage the node etc. so we can punish bad behaviour as well that way.

We see that is is though, if the wait is longer than the time needed to replicate a chunk (in either scheme) then data loss can happen and we should not allow that. For us say we have 8 copies, so even if we had normal kademlia like networking with 30 min refresh, then 8 is chosen to ensure all nodes will not vanish in the refresh time. This is how kad works anyway.

We updated that to recursive based kad with constantly connected group members. So those members see a node vanish and immediately promote another in its place. Therefore our 30 min refresh is reduced to network speed. You can think our kad refresh rate is reduced to node offline time (which can be up to 2 mins with tcp and linger etc.)

As I said mind you a chunk is a file part, so we never download the whole file to replicate a part of it. Therefore we have no message complexity relating to rebuild/republish if that makes sense. The group has chunks and they do not know of which file (for security), but they know they must look after those chunks if they wish to be rewarded through farming.


#28

We must remember this issue though.

With erasure coding, the parts must be validated somehow, otherwise a hacker can give you a bad piece and working out which is bad is computationally complex. So you would need to have an additional layer again that not only splits the file with info dispersement or similar, but also hash each copy or sign it etc. So this is extra cpu work as well as larger data map. So if the file was small & you did wish a 50/30 type split then the data map or similar descriptor could easily be larger than the file as well. Then that needs stored, so again recursively would require split, but it does not recurse down. There is a lower limit where this protection is an infinite loop. So min file size would be limited by the size of the data map/file descriptor.

This is some of the complexity as you get deeper into this, but to allow file rebuild you really do not want bad pieces to be given to you (either hacked or just bit shifted)


#29

EDIT I just saw this is discussed under “3.4.2 Erasure codes’ effect on long tail” in the new Storj paper…

Yep!

Not in our scheme. We store all replicants xor close to the address of the chunk (where it’s name == hash of its contents). So you ask for a chunk and the group closest will have it and guarantee it’s recovery to you. This is done by the network, so all nodes close to the address must hold those chunks. The nodes are tested on joining a group and throughout the session. Nodes that lose chunks are punished and another node will deliver the chunk, but the client does not see all of that.

Wouldn’t this work for guaranteeing erasure share delivery as well?

We see that is is though, if the wait is longer than the time needed to replicate a chunk (in either scheme) then data loss can happen and we should not allow that. For us say we have 8 copies, so even if we had normal kademlia like networking with 30 min refresh, then 8 is chosen to ensure all nodes will not vanish in the refresh time. This is how kad works anyway.

As @JoeSmithJr pointed out:

Sure, in an erasure code scenario, let’s say you have 50 total pieces, and you need 20 of them. You don’t really need to repair until you drop to a certain threshold. Maybe 30?

This is important actually. With the significantly higher redundancy that can be reached under sensible storage budgets, most losses wouldn’t require urgent action.

We encode things so that repair action isn’t needed for weeks to months at a time. There’s no urgency to repair a file.

With erasure coding, the parts must be validated somehow, otherwise a hacker can give you a bad piece and working out which is bad is computationally complex.

It’s not so bad. It only matters with repair, which again, can happen slowly in the background, where we only need to keep a hash of the overall piece. When data is being retrieved for gets, the data is encrypted with authenticated encryption anyway, so invalid recoveries can be detected and corrected with something like Berlekamp-Welch [1], which itself is slightly more CPU, but you only really pay for when data is corrupted.

[1] https://innovation.vivint.com/introduction-to-reed-solomon-bc264d0794f8


#30

One way I can imagine this, erasure blocks would be hash-addressed and collected into chunks through public data maps, and chunks mapped into files the same as now.

Vaults would store a) the erasure block maps for the chunks closest to them in XOR space, b) the actual data for the hash-addressed erasure blocks that are closest to them in XOR space. So yes, it’s an additional layer, but not prohibitively complex.

I understand it would need to bring significant benefit to be considered, but if latency becomes a problem, it may be a solution.


#31

As a side issue/question. What then happens if there is a major event, like we’ve seen with a country cutting off the internet? eg Egypt, Pakistan. Might also occur if China changes rules in their firewall.

What about smaller scale like Japan’s earthquake cutting off significant infrastructure/datacentres for extended periods (few months). Or Yellowstone park erupting and causing major disruptions in parts of North America.

Basically the question is what happens if a significant portion of your storage nodes fail all at once. For SAFE the chunks are stored globally in a fairly random geographical locations so the effect will only cause minor issues if random probably (rare) has some chunks only located in one region or even as large as a country. And then when those nodes come back online the chunks will be recovered.

The issue I see with the extra layer will be like in NNTP where par files are used. The smaller the device you use the longer this rebuilding of the file, even with no errors, will make the file loading sluggish. And if one wants low end phones to access the network (Not as a node but as a client) then “erasure” codes or similar will require the size of the sections to be very small to enable reasonable rebuild speeds. And the rebuild time slows at a faster rate than the file size increases.


#32

How likely is it that all the copies of a chunk would be affected by such an event? In other words, how many chunks will be lost completelyif Egypt goes dark again? I believe we’d lose all 4 copies of many chunks if that would happen, less if we had 8 copies, but still quite some. Erasure coding, with much larger redundancy for the same storage cost, would shine in a situation like that.

The probability that a large file, that is made up of many chunks, would have a lost chunk is not insignificant. If it’s a movie, a few seconds will be missing. If it’s more complex data, the entire file may become useless. Moreover, if it’s private data, we can’t even count on having extra copies due to popularity.


#33

If you read what I said again then it does take into account what you said. On a global network scale the issues of some files being broken is still minor. Maybe major to those who do lose access to the file but minor on a global network scale.

Do the birthday paradox with 8 in the one portion. Yes its significant but still minor when considering the whole.

I think our difference is purely terminology

And the point was I was wanting to know what happens in stoj’s case.


#34

You tend to be more hard core on the “forever safe” side so I’m a bit confused by this.

I think you misunderstand the problem. The birthday paradox is not about the 8 in 1 side of the question but about the 1 in 10,000 side. The probability of having a chunk with all 8 copies lost has to be very small if we want to make it sufficiently unlikely that we’d have such a lost chunk in the 10,000 chunks that make up our 10 GB file.


#35

For the users of very large then its huge and some large files maybe broken, but for millions (majority) of small files <10MB they will not be broken. Still on the scale of a global network made up of the typical mix of files, it will be minor failure. There was a topic that this was discussed in and the probability is definitely not insignificant that files will be broken for the duration of the outage. The more surprising was just how small an outage can be for a file somewhere to be broken. I forget if this was countered or not. And of course the fix is that when nodes come back online there will be data recovery.

Yes very much agreed. And I agree that significant outages may cause these files to be broken during the outages. This is another reason that the farming algo must favour the home user as opposed to the large data centre farms. Still this is a discussion best to continue in the topic(s) where it was previously discussed and not take this topic off-topic

So still wanting to enquire about the issues Stoj may or may not have in such scenarios.


#36

So still wanting to enquire about the issues Stoj may or may not have in such scenarios.

We also geographically spread data out as much as possible, given configured maximum performance latencies.


#37

Possibly, the reason it works for us in this way is that all replicants are at the same address, although goe distributed. I think with erasure schemes the parts are individual, so if it is not there then you need to go to another address. Our intent is that a part is always there and never missing, so a get should always return that part. You could mix the schemes to give a similar effect, but then I am less sure that it would be sensible as the space savings with erasure will be obfuscated by the space inefficiencies of replication. It would mean repair is a simple process though, where you never need to rebuild the file.

I think this is a big difference in the network structure as opposed to coding really. So 30 of 50 parts or 20 replicants would give similar levels of loss protection. The difference is really when does the network know it has lost a part or a replicant and how quick can it resolve that? In SAFE the network should do that at network speed, but I am not sure how you folks do that? I presume there is some kind of check time of parts or whole files and that time t is short enough that you can never lose 19 peices? (sorry I don’t have that info on storj). This time verses safety (parts or replicants) is really the important part for me. Not related to erasure verses replication though :wink:

The week to months part seems to indicate you have different levels of safety? This is something that gets brought up here occasionally, pay more for more replicants etc. but our position has always been we can lose no data, so I am always intrigued when I see a level that can be “safer”. Straying off track a wee bit though, but I think it helps frame the reasons for selecting one method over another (I suspect it becomes vi/emacs though, where obviously VI wins :slight_smile: ).

Anyway, I would be glad to know a wee bit more on how Storj has come to this decision. If it turns out vi/emacs then it will be great to share the results of our experiments as we move along.


#38

Is there a summary of the pros / cons of implementing erasure coding at the client vs network level?

Seems to me (relatively uninformed) that erasure coding at network level doesn’t necessarily assure more time for repair than redundancy.


#39

Both having multiple identical copies and erasure coding are ways to achieve redundancy, just differently. Both methods can reach the same level of redundancy but at different costs

Multiple identical copies: simple to implement, needs less communication for restores, needs less computation during normal operation. However, it requires more space for a given level of redundancy.

Erasure coding. It’s more complex, it needs another layer of abstraction, and it needs more computation and communication. It can be fine tuned to provide practically any required level of redundancy for a given storage cost, and it can also be used to lower the latency till the chunk is ready at a reasonable (and tuneable) bandwidth cost.


#40

I’m really appreciating the detailed conversation in this topic.

It will be interesting to see how this is balanced with the following goal in section 2.6 of the whitepaper: “any distributed system intended for high performance applications must continuously and aggressively optimize for low latency not only on an individual process scale but also for the system’s entire architecture.”


Regarding cpu performance for erasure codes compared with network factors such as bandwidth and multi-hop latency, I found these figures from backblaze Java implementation of the same (reed solomon) algorithm:

“A Vault splits data into 17 shards, and has to calculate 3 parity shards from that, so that’s the configuration we use for performance measurements. Running in a single thread on Storage Pod hardware, our library can process incoming data at 149 megabytes per second. (This test was run on a single processor core, on a Pod with an Intel Xeon E5-1620 v2, clocked at 3.70GHz, on data not already in cache memory.)” (source)

Seems pretty fast, but the details and implementation within storj will be great to see. If it’s faster than pure redundancy for the end user then that’s the main thing.


To respond to my own prior question about the difference of client encoding vs network encoding and expand a bit on what @JoeSmithJr has already said, erasure coding seems to have some benefit when used at the network level compared with only at the client level.

Network level allows the network to coordinate in a potentially more efficient way (mainly interested in bandwidth efficiency) to restore lost data which it can’t do with client-only erasure coding.

The other benefit is there’s more ‘levers’ to tune the performance than with pure redundancy techniques, both for client downloads and the performance of network coordination.

I still haven’t fully digested the implication of erasure codes at the network level, but it’s very interesting and will continue to ruminate.


From 3.4.2 - “Erasure codes enable an enormous performance benefit, which is the ability to avoid waiting for “long-tail” response times”

Surely this also applies to pure redundancy where the fastest response is used. Selecting the fastest N of P parts doesn’t seem better than selecting the fastest 1 of P redundant chunks.

I think erasure codes are being overstated in their ability for this particular benefit.

But the expansion factor benefits tabulated in the start of 3.4 are very cool. Although I’m not sure if expansion factor is the aspect that needs the focus (in the short and / or long term).


Some other previous conversations on this forum about erasure codes (with further links within each):

Aug 2015 - Mojette Transform and SAFE
July 2017 - Preventing data loss using parity checks


#41

Nice work @mav and @JoeSmithJr

The big thing for me from an architect perspective is this really (always open to improvements).

When any bit of data is lost then the network must recover it quickly and definitely. My take has always been if you can lose a bit you can lose it all. So our scheme is not replication alone, it is chunking and replication. So each chunk can be considered the whole file in terms of loss. We lose one chunk and the file is dead. For erasure coding it could be you can lose say 20 chunks and the file may still survive.

That sounds great, however the big if really does matter.

For each piece you lose, does the network realise that and when? Lets ignore rebuild times for erasure codes and say the chunk can just be created somewhere, there is a copy somewhere we can get. So the time from loss to recover is hugely important, lets call it T (I realised after writing this that I do not refer to T again directly, but left it here as it is important)

For replication we have R replicants arranged in a common group, or network address space. In that space we have nodes that are keen to maintain their chunks (for farming) and not be penalised. A node could lose a single chunk (weird) and recover it by asking? More likely a node is lost. So we have nodes waiting to take it’s place as soon as possible (its a high priority event). The nodes to take its place already have the chunks they need.

So we set R or group size at the level where we cannot lose all holders in the time it takes to replace a single holder. This is similar to kademlia, although it uses a magic number for the selection of R (well in kad its not R it is called k, but irrelevant).

So we try and recover as fast as possible, but set the safety factor higher than we envisage the total number of copies can vanish. So R is used for a particular reason.

If we had Erasure Codes then we would have n and p, so p parts of n is the file. So the network can lose n-p parts say. So the same logic should apply is we want to not lose data at all. This would mean the architecture would need to know when we lose n-p parts as that is when the file is gone. This to me means when we lose a single p then we must start that process, whatever it is. This is the crunch part though.

So for us we have sections/groups that are responsible for the single file part and they do not know the whole file. This is for security and efficiency (mainly the latter as rainbow lists etc. etc. for the former). Therefore the group knows a part is missing and acts immediately, We cannot go faster than that and its the network minimum refresh setting (in kad talk).

In EC then when a p goes missing, there are no replicants, group members do not know or care it’s gone. However the other n holders must start acting (even if that means waiting on more p going missing). So now there is a mesh of RPCs connecting all n parts and the health of each n part. That links all holders of that file together (intake of breath time for privacy). So even if we could do that then a holder of X parts of X files will need that mesh of connections/RPCs for each part. That will mean multiplexing connections and knowing the IP address of a huge number of nodes (again security, DOS attack etc.) and that is prohibitive for us.

There is a lot more to it, but erasure codes as far as we know from the architecture side would be very heavy on comms and affect on network architecture. There could be things done like have nodes know what neighbors hold in terms of what n parts they have, but then the web gets huge, fast. So I am interested in how people solve this.

In terms of speed, then replication will be faster AFAIK, even simplify to RAID where all things are local, chunking and replication (striping) are significantly faster to read and write than erasure code data (parity).

If you compare erasure codes to whole file replication then its much faster, but we do not do that, we do chunking and replication, it is a huge difference. I showed the NHS in Scotland this, where we have a 300Mb/s disk array to store data, I said, we can compress, encrypt and store faster than 300Mb/s, or we can do all these operations on a file and write it faster and we did. We had over 900Mb/s, because of chunking. This is where the speed improvements come from, not erasure codes, but smaller file parts. This is why bittorrent is fast, it’s part. If you then have cpu complexity putting the parts together like erasure code algorithms, then that slows it down.

To be clear, we do not just put parts back, we do decrypt, decompress etc. as well as hash check each part. So we are slower than just striping, but not too much, erasure codes will be slower that striping as well, but not too much, both will be faster than whole file retrieval.

However the hash check is vital, otherwise it would be significant work to find a bad part, say a bit flipped p or chunk. If it were pure erasure codes then this part is critical in a hostile Internet, not so much in a hard drive though ;-). So I think erasure codes will require more work than just implementing a reed solomon type algorithm. I would love to see folks attack these problems though and Storj seems to be looking at least, so it will be great to see how it all pans out. We can all learn all the time.


What’s up today?