Self encryption and de-duplication

I have been working with IPFS and Clojure lately and that has led me to a lot of thinking about the benefits of immutable data etc…

Most of these immutable data systems save quite a bit of memory/space because they retain most of the data - when there is a change, they only create a new overarching data structure that provides a link to all of the existing data that is the same, as well as the new or changed data…

So for example if I have a Google doc and I am editing it, there is a list of every revision and I can back up to any point. But chances are if a document has been changed 50 times, there aren’t 50 full copies… There is one data structure that points to the current “chunks” and older data structures that point to chunks that were valid at previous revisions. The bulk of the chunks are likely the same… saved once and re-used up to 49 times.
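A toy sketch of that structural sharing (a hypothetical content-addressed chunk store, not any particular system’s format): chunks are stored once under their content hash, and each revision is just an ordered list of hashes.

```python
import hashlib

def chunk(data, size=4):
    """Split data into fixed-size chunks (a tiny size, for illustration)."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def store_revision(data, store):
    """Save each chunk under its content hash, reusing chunks that already
    exist. The returned 'datamap' is the ordered list of chunk hashes."""
    datamap = []
    for piece in chunk(data):
        h = hashlib.sha256(piece).hexdigest()
        store.setdefault(h, piece)  # a chunk is stored once, however many revisions reference it
        datamap.append(h)
    return datamap

store = {}
rev1 = store_revision(b"hello world, rev1", store)
rev2 = store_revision(b"hello world, rev2", store)  # only the final chunk differs
shared = len(set(rev1) & set(rev2))                 # 4 of the 5 chunks are reused
```

Because identical chunks hash identically, the de-duplication falls out of the addressing scheme: the second revision only costs one new chunk plus a new datamap.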

With self-encryption, however, it seems like this wouldn’t be very possible… If you change a comma on page 4, the document will hash out differently and create totally different chunks, causing the full document to need to be saved again.

Am I overlooking something? Is there a better way?


I don’t know of a way to get around that.
You can see that Maelstrom, for example, requires one to republish the site if a comma is added on one page, because the site’s (torrent’s) hash changes.
We don’t know if google actually uses atomic changes - maybe they don’t either.

It’s not as bad as that. I think it only affects the block containing the change and those following it. So if the change is in the first block, yes, it’s a whole new copy of every block; if it’s in the last block, only one block is added. Read the self encryption docs for the definitive version though.


I don’t know, Mark. Isn’t the process circular?


Sounds like a job for… Super @dirvine!


You could divide files into smaller chunks and self-encrypt the smaller pieces with one another in a merkle-tree type fashion – Or you could skip encryption altogether unless needed – Or you could use another encryption scheme that puts the keys in a separate Structured Data… I suspect there are a few options…

I would be pretty shocked if they didn’t. It makes a massive amount of sense to do it that way…

Self encryption needs 3 chunks.

So if you change something in a chunk but do not change the size (for example, change that full stop to a comma), the effect on the stored file is to write 3 chunks to the network and return a new datamap for the current version.

Now if you have a 100 chunk file and change chunk 50 in a way that changes the size of the file, then 51 or 52 chunks will be written, since all chunks after chunk 50 are also changed.

In my opinion, IDEs and program-editor APPs will be created that compensate for this issue. There are a number of ways this can be done, some as old as the UNIX/mainframe editors of the era when files were stored to 9-track tape or DECtape. These tapes took a long time to erase a file and rewrite it, so techniques were devised to store only the changes; recreating the text required the original plus the diffs. UNIX, from memory, had a system for this in the dark ages of slow disks/9-track tapes/DECtapes.

So fast-changing files (typically text type) will likely have APPs to reconstruct & store them in a way that minimises the changed chunks to store. This will benefit the user and the network, just as de-dup does.
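That diff-and-replay technique can be sketched with Python’s standard difflib (purely an illustration of the idea, not how any SAFE APP actually works): store the original once, store only deltas afterwards, and rebuild the latest text on demand.

```python
import difflib

original = "fn main() {\n    print(1);\n}\n".splitlines(keepends=True)
edited   = "fn main() {\n    print(2);\n}\n".splitlines(keepends=True)

# Store only the delta, not a second full copy of the file.
delta = list(difflib.ndiff(original, edited))

# Later, recreate the edited text from the original + stored diff.
rebuilt = list(difflib.restore(delta, 2))  # 2 selects the "after" side
```

Only the delta needs new chunks on each save; the original’s chunks stay untouched and de-duplicated.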

This is where SDs might be used as a companion to these fast-changing files, to store minor changes and minimise the new chunks needed.


I don’t think this is quite true – Because each chunk is encrypted with a key created from its neighbors, all of the final chunks will have been encrypted by a different ring of keys…

So even if the first 33% didn’t change, they would still be encrypted by a derivative of the hash of the last 33%, and the resulting file would be different.

Self encryption works, so I would suppose this is easily testable rather than speculating…
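In that spirit, here is a toy model that is testable. It assumes (as described later in this thread) that each chunk’s XOR pad is derived from the hashes of the chunk itself and its two predecessors, wrapping at the file boundary; the real self_encryption crate also uses AES and is more involved, so treat this as a sketch of the dependency structure only.

```python
import hashlib

def xor(data, pad):
    return bytes(a ^ b for a, b in zip(data, pad))

def self_encrypt(chunks):
    """Toy neighbour-keyed encryption: chunk i is XORed with a pad derived
    from the hashes of chunks i-2, i-1 and i (indices wrap around)."""
    hashes = [hashlib.sha256(c).digest() for c in chunks]
    n = len(chunks)
    encrypted = []
    for i, c in enumerate(chunks):
        pad = hashlib.sha256(
            hashes[(i - 2) % n] + hashes[(i - 1) % n] + hashes[i]).digest()
        encrypted.append(xor(c, pad))
    return encrypted

before = [bytes([i]) * 8 for i in range(5)]
after = list(before)
after[2] = b"XXXXXXXX"  # edit the middle chunk, keeping its size

changed = [i for i in range(5)
           if self_encrypt(before)[i] != self_encrypt(after)[i]]
# only the edited chunk and the two whose pads use its hash differ
```

Under this model the dependency is circular but local: a same-size edit to one chunk re-encrypts three chunks, not the whole file.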

I have no idea how the SAFE network itself handles such “editing a file” events. But I assume that you could write your own “file system” on top of the SAFE network, and just leave the original (potentially large) file as it is, while storing the changes in another file. As far as I remember, a chunk does not need to be 1 MB, so you should be able to store small files without wasting space. If this is not the case, you would have to accumulate many changes locally and then store them in one chunk (or store them in this strange free memory that is reserved for messaging, or similar)…

A developer certainly could hack around this by building “these are the changes” files, loading the original file and running through the updates again… But if competing technologies have filesystems that do this by default without the user even noticing, SAFE is failing to compete on this plane. GIT is seeping into everything.

This is also likely expensive for the user – they have to pay for the same file to be saved over and over again even though it is largely the same as previous iterations.

I think you could use smaller chunk blocks and run a GIT style filesystem on SAFE, but I don’t know if that is a supported option or what the overhead would be for retrieving smaller chunks instead of larger ones.

Suppose the file has 100 chunks numbered from 1 to 100 and chunk number 50 is modified. There are 2 cases depending on how it is modified:

  • If the modification is limited to chunk number 50 (for example a character is replaced by another one), then chunks 50, 51, 52 have to be re-uploaded (because the XOR pad used to compute chunk n uses the hashes of chunks n-2, n-1 and n)
  • If the modification propagates to the following chunks (for example a character is inserted), then chunks 1, 2, 50, 51 … 99, 100 have to be re-uploaded.
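Under those two rules (including the wrap-around at the start of the file), the set of chunks to re-upload can be computed directly. A hypothetical helper, using 1-based chunk numbers as in the post:

```python
def chunks_to_rewrite(n_chunks, edited, size_changed):
    """Return the 1-based chunk numbers that must be re-encrypted and PUT,
    assuming chunk i's XOR pad uses the hashes of chunks i-2, i-1 and i,
    wrapping around the ends of the file."""
    if not size_changed:
        # the edited chunk plus the two whose pads depend on its hash
        affected = {edited, edited + 1, edited + 2}
    else:
        # every chunk from the edit onward shifts, and chunks 1 and 2
        # depend (via wrap-around) on the hashes of the last two chunks
        affected = set(range(edited, n_chunks + 1)) | {1, 2}
    # fold indices past the end of the file back to the start
    return sorted((c - 1) % n_chunks + 1 for c in affected)
```

For the 100-chunk example: a same-size edit to chunk 50 rewrites chunks 50–52, while an insertion at chunk 50 rewrites chunks 1, 2 and 50–100 (53 chunks).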

These are the optimizations on PUT operations that could be done in theory, but in practice I don’t know if the current self_encryption crate is able to apply them to an existing datamap.


If you took your 100-chunk file and made it into 32 sets of 3 and one set of 4, you could just replace the set where the change took place, and the rest could remain the same. You would just need a new pointer at the top that points to all the unchanged sets as well as the new changed chunks.
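A sketch of that grouping idea (a hypothetical layout, not the actual SAFE datamap format): hash each small set of chunk hashes, so that an edit only replaces the group containing it plus the top-level pointer.

```python
import hashlib

def group_pointers(chunk_hashes, group_size=3):
    """Hash each group of chunk hashes; the top-level pointer is this list.
    Editing one chunk invalidates only its own group's entry."""
    groups = [chunk_hashes[i:i + group_size]
              for i in range(0, len(chunk_hashes), group_size)]
    return [hashlib.sha256(b"".join(g)).hexdigest() for g in groups]

# 100 chunk hashes for a made-up file
hashes = [hashlib.sha256(bytes([i])).digest() for i in range(100)]
before = group_pointers(hashes)

hashes[49] = hashlib.sha256(b"edited chunk 50").digest()  # change chunk 50
after = group_pointers(hashes)

replaced = sum(a != b for a, b in zip(before, after))  # only 1 group entry changes
```

The trade-off is extra indirection on reads: fetching the file means fetching the pointer list, then the groups, then the chunks.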

That would be my understanding anyway, hopefully a developer can correct me if I am wrong.

Wouldn’t decent version control software just build the document from the ground up? That is, every time a change is made, that change is added to the log as a new piece of data.

That seems like a desirable thing to me anyway.

Perhaps. But the point is that in most immutable data structure systems, that is built-in behavior, not an add-on that needs to be put into software…

Version control is overkill for most things. Most people want the current version; they don’t want to download the original and then all of the iterations and parse them together. With IPFS / GIT etc., if you load the latest hash you get the latest version. It doesn’t load anything that is obsolete… The obsolete is still there if you choose to download it using the obsolete hash… But where the obsolete and the current are the same, two copies are not needed. SAFE’s self encryption would probably interfere with that as it is described, unless a workaround is engineered.

The self encryption requires 3 chunks to encrypt one. If you only make a change that affects a single chunk, then only the 2 next to it are required to encrypt it again before it gets PUT (3 chunks read, one PUT).

But if you add/remove bytes from the file (at chunk 50, say), then the rest of the file (chunks 51-100) is also modified, because the bytes shift up/down the file from the point where the change was made. Thus you need to re-encrypt those 50 chunks.

This is where APPs will be written to reduce this shifting up/down for files that can support “null” space, for example program source files.

Currently SAFE will modify the file for you and return the new datamap. So if you want to work on a file and only keep the latest version, SAFE does that already. BUT it also allows you to keep the previous version’s datamap and access the previous version before the changes, as nothing was actually deleted.