Vitalik praises Erasure Coding


So there we go, that’s an introduction to the power of erasure coding – arguably the single most underhyped set of algorithms (except perhaps SCIP) in computer science or cryptography. The ideas here essentially are to file storage what multisig is to smart contracts, allowing you to get the absolute maximum possible amount of security and redundancy out of whatever ratio of storage overhead you are willing to accept.
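As a rough sanity check on that “security per unit of overhead” claim, one can compare plain replication with a (k, n) erasure code at the same 2x storage overhead. The per-node loss probability and parameters below are illustrative assumptions, not numbers from the post:

```python
from math import comb

def survives_replication(copies, p):
    # The data survives if at least one full copy survives.
    return 1 - p ** copies

def survives_erasure(k, n, p):
    # A (k, n) code survives if at most n - k of its n pieces are lost.
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(n - k + 1))

p = 0.1  # assumed independent per-node loss probability
print(survives_replication(2, p))   # 2 full copies -> 2x overhead
print(survives_erasure(10, 20, p))  # (10, 20) code -> same 2x overhead
```

With these assumed numbers the 2-copy setup dies whenever both copies are lost (about a 1-in-100 event), while the (10, 20) code survives unless more than half of 20 independent pieces are lost at once, which is orders of magnitude less likely.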

I’m not sure it’s under-hyped (major storage vendors have products based on it, are hyping them, and those products are successful and widely known), but it’s an interesting read for those who aren’t familiar with it.

Edit: since this is my post, it’s probably OK for me to add that I thought about erasure coding as a potentially better way to protect data around the network, but concluded it wouldn’t work well because it’d need a lot more communication among the nodes.
Storage products (storage arrays) based on erasure coding have been on the market for 10+ years, but I’m not aware of decentralized storage that uses such algos (I imagine that’s because it must be difficult to do over a network).

Ethereum's Swarm (p2p storage) and Whisper (p2p messaging)

It’s a vim/emacs war: erasure coding versus replication. I fall on the replication side of the fence, the reason being: why lose data at all? Erasure coding leaks info between chunks of data, as there is a link between them (there has to be). I did have a PhD thesis on this very issue from a French university, but I cannot lay my hands on it.

It sounds nice, and it’s what RAID does really. I just think it’s a gamble of how much data you may lose; if you are losing any data, why not increase the replica count?
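For what it’s worth, the RAID analogy can be shown in a few lines: single-parity striping (as in RAID 4/5) rebuilds any one lost chunk by XOR-ing the survivors. A toy sketch, assuming fixed-size chunks:

```python
def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

# Three equal-size data chunks plus one XOR parity chunk.
chunks = [b'AAAA', b'BBBB', b'CCCC']
parity = b'\x00' * 4
for c in chunks:
    parity = xor_bytes(parity, c)

# Lose chunk 1; XOR-ing every survivor (including parity) rebuilds it.
recovered = b'\x00' * 4
for c in (chunks[0], chunks[2], parity):
    recovered = xor_bytes(recovered, c)
assert recovered == chunks[1]
```

One parity chunk tolerates exactly one loss per stripe; tolerating more simultaneous losses is where Reed-Solomon-style erasure codes come in.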

As I say, it’s a huge debate and will rage for many years with no definitive answer.


Do you mean that if “metadata” is stored alongside “chunks”, then to “rebuild” stripes the network would ask all “member vaults” that participate in a stripe to recheck themselves, which, via this peer-to-peer configuration, would reveal the locations of stripes (or chunks)?

I was thinking about a “client side” approach. In a private but inefficient implementation it could perhaps be done client-side (e.g. each client would keep one non-erasure-coded “drive” (blob) for erasure-code chunk distribution info (metadata), while actual data contents would be distributed around the network using erasure coding), but that seems inefficient as well (and very taxing on the client). However, this approach may one day (when networks are faster, etc.) become attractive for clients who want to do client-side deduplication, although I’m not even sure about that (IIRC, due to recent advances in cryptography there are already compression schemes that can deduplicate content that has already been encrypted).

Either way, nothing prevents “farmers” from putting their vaults on such protected storage today.
If distributed erasure coding becomes viable, it would be possible to enable it for new data, or even to create an option to “restripe” old-style replicas to erasure coding (incidentally, that’s exactly what I had in mind here, before Vitalik’s post came out, when I mentioned “defrag-like” data reorganization).


I decided to do a blog post to show that the answer is much simpler than all this in terms of proof of storage/retrievability etc.

In MaidSafe we already have an implementation of Shamir’s data-dispersion algorithm in place, but not for chunks. This can be used for group ownership of keys (n+p), so that group quorums are able to operate in an electronic system without manual inputs etc. There is even a wee tool I wrote in a day (so it’s crappy) that allows a group to create a private signing key, never see it, but come together and sign docs or software etc. The key only ever exists in RAM for a short time and nowhere else; that tool is here and allows all this to be done. So I see n+p as having great value, but erasure coding/striping as another issue altogether (very similar). I like n+p but prefer replication over striping; then again, I also prefer vim over emacs :smiley: if you see what I am saying.
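Since the post leans on Shamir’s scheme for group key ownership, here is a minimal sketch of k-of-n secret sharing over a prime field. This is an illustrative toy, not MaidSafe’s actual implementation; the prime and the parameters are assumptions, and a real tool would use the `secrets` module rather than `random`:

```python
import random

PRIME = 2**127 - 1  # a Mersenne prime; all arithmetic is in this field

def split(secret, k, n):
    """Split `secret` into n shares, any k of which can rebuild it."""
    # Random polynomial of degree k-1 whose constant term is the secret.
    # (Use the `secrets` module instead of `random` for real keys.)
    coeffs = [secret] + [random.randrange(PRIME) for _ in range(k - 1)]
    def f(x):
        acc = 0
        for c in reversed(coeffs):  # Horner evaluation mod PRIME
            acc = (acc * x + c) % PRIME
        return acc
    return [(x, f(x)) for x in range(1, n + 1)]

def combine(shares):
    """Lagrange-interpolate the polynomial at x=0 to recover the secret."""
    secret = 0
    for xi, yi in shares:
        num, den = 1, 1
        for xj, _ in shares:
            if xj != xi:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        # Fermat inverse: den^(PRIME-2) mod PRIME
        secret = (secret + yi * num * pow(den, PRIME - 2, PRIME)) % PRIME
    return secret

shares = split(123456789, 3, 5)  # 3-of-5: any quorum of 3 recovers the key
assert combine(shares[:3]) == 123456789
assert combine(shares[2:]) == 123456789
```

The point of the group-signing use case is that the key never has to exist in one place: each member holds only a share, and a quorum reconstructs it briefly in RAM (or, in fancier threshold-signature schemes, signs without ever reconstructing it at all).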