Didn’t sound disrespectful at all. I just really thought I understood what exactly deduplication was. I thought that all chunks of a file were scrambled with the hash of the file, changing just one bit would radically change the hash and then it would change the chunk once encrypted.
Some more reading to do. I guess you can’t figure out seven years of a man’s work in two weeks.
It does. The thing is though, if you change a small thing in what becomes the first chunk, and that changes how the rest of the file is divided (chunked), the chunks all hash out differently. As I recall, the chunks are stored according to their hash, so they’d all change.
If your file change doesn’t affect how the file breaks up, the change will only affect that chunk and the ones whose keys are derived from its hash. So you’d retain most of the dedup but lose some of it.
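To make the boundary-shift point concrete, here’s a toy Python demo (the tiny chunk size and helper names are mine, not the network’s): inserting a byte shifts every later chunk boundary, so every chunk hashes differently, while overwriting a byte in place only changes the chunk it lands in. Note this only looks at plaintext chunk hashes; the key chaining described below means even an in-place change touches a few more encrypted chunks.

```python
import hashlib

CHUNK = 4  # tiny chunk size for the demo; real chunks are up to 1MB

def chunk_hashes(data: bytes):
    # Split into fixed-size chunks and hash each one.
    return [hashlib.sha256(data[i:i + CHUNK]).hexdigest()[:8]
            for i in range(0, len(data), CHUNK)]

base      = b"aaaabbbbccccdddd"
inserted  = b"Xaaaabbbbccccdddd"   # one byte inserted at the front
overwrite = b"aaaaXbbbccccdddd"    # one byte overwritten in place

# Insertion shifts every boundary: no chunk hash survives.
print(set(chunk_hashes(base)) & set(chunk_hashes(inserted)))   # set()

# An in-place overwrite only changes the chunk it lands in.
diff = [i for i, (x, y) in enumerate(zip(chunk_hashes(base),
                                         chunk_hashes(overwrite))) if x != y]
print(diff)   # [1]
```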
There are two separate questions in this thread: (1) the de-duplication process, and (2) recognizing files for payment.
Files smaller than 1KB are currently stored in the DataMap itself (i.e. with the directory metadata), and as such are not de-duplicated. Files larger than 1KB are split into a minimum of 3 chunks, each with a maximum size of 1MB. The encryption keys for each chunk are derived from the unencrypted hashes of the prior two chunks (with modulus).
If a single bit is flipped in a file larger than 1KB, at minimum 3 chunks change due to the encryption process. So in a 100GB file, if one bit changed, 3MB of data are definitely re-uploaded, and the rest may be (when uploading occurs is a whole other discussion). The user will only need enough SafeCoin to store 3MB of data.
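Here’s a toy sketch of that key chaining (my own simplification, not the actual self_encryption code): the key for each chunk mixes the plaintext hashes of the two preceding chunks, wrapping around at the ends (the “modulus”), so changing one chunk’s content alters its own ciphertext plus the keys of the two chunks after it.

```python
import hashlib

def chunk_keys(chunks):
    # Toy key schedule: key for chunk i mixes the plaintext hashes of the
    # two preceding chunks, wrapping around via the modulus.
    n = len(chunks)
    hashes = [hashlib.sha256(c).digest() for c in chunks]
    return [hashlib.sha256(hashes[(i - 1) % n] + hashes[(i - 2) % n]).digest()
            for i in range(n)]

def encrypted(chunks):
    # Stand-in "encryption": hash of key + plaintext, just to compare outputs.
    return [hashlib.sha256(k + c).digest()
            for k, c in zip(chunk_keys(chunks), chunks)]

original = [b"chunk-0", b"chunk-1", b"chunk-2", b"chunk-3"]
flipped  = [b"chunk-0", b"chunk-1", b"Xhunk-2", b"chunk-3"]  # one byte changed in chunk 2

changed = [i for i, (a, b) in enumerate(zip(encrypted(original),
                                            encrypted(flipped))) if a != b]
print(changed)   # [0, 2, 3] — chunk 2 itself, plus the two chunks keyed off it
```

Exactly 3 encrypted chunks differ: the modified chunk and the two whose keys were derived from its hash (with wraparound, one of those is chunk 0).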
However, as @fergish mentioned, if bit(s) were removed or added to the first chunk, then it will likely change all of the chunks. This is not guaranteed though, because patterns in data could leave some chunks identical.
As for payment for accessing files, I’m not sure about those details, or whether they’ve even been completely finalized. It’s likely this will happen at the chunk level because there are no references to files on the network (no i-nodes) - which is yet another long discussion. In other words, the algorithm would likely be “anytime these chunks are requested, notify X”. I think this is still open for discussion; I can’t think of where it’s implemented currently.
I would expect any chunk that matches, even if that chunk comes from different files, would be pointed to by multiple tables but would only be stored once. Just a guess, but I ‘think’ that is how true dedup would function.
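That content-addressed picture can be sketched in a few lines of Python (class and variable names are mine, purely illustrative): chunks are stored under their hash, so identical chunks from different files land in the same slot, while each file keeps its own map of addresses.

```python
import hashlib

class ChunkStore:
    """Toy content-addressed store: identical chunks are stored once."""
    def __init__(self):
        self.blobs = {}

    def put(self, chunk: bytes) -> str:
        addr = hashlib.sha256(chunk).hexdigest()
        self.blobs[addr] = chunk          # idempotent: same chunk, same slot
        return addr

store = ChunkStore()
file_a = [b"shared-chunk", b"only-in-a"]
file_b = [b"shared-chunk", b"only-in-b"]
map_a = [store.put(c) for c in file_a]    # each file keeps its own list of addresses
map_b = [store.put(c) for c in file_b]

print(len(store.blobs))       # 3 — the shared chunk is stored only once
print(map_a[0] == map_b[0])   # True — both maps point at the same address
```

The flip side, as discussed below, is that the store alone can’t tell which files (if any) still point at a given chunk.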
I think you are right, and this would answer my previous question about what is happening to old chunks. They would remain there because the vaults won’t know if they are part of anyone else’s file or not.
Correct, this was discussed in another post. It’s difficult to know whether anyone is still pointing to the chunk while simultaneously storing nothing user-identifiable alongside it. David has some thoughts on this, but I’m not sure if it’s something that will make it by launch.
Also - files on the network are stored with a version history, but that history will have a hard limit. So if a file is deleted, eventually it is possible that nothing is pointing to it.
It’s mainly for application developers, and it’s unlikely end-users will see this very often (although that will depend on the application). The advantage is that the developer can say “replace this version of the document”. If multiple computers try to modify the same document at the same time, only one will “win”; the other(s) will get an error. The previous versions can still be retrieved if the developer needs to inspect differences between versions and try to re-commit the desired changes.
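That “replace this version” pattern is just optimistic concurrency; a minimal sketch (my own class, not the network API) looks like this:

```python
class VersionedDoc:
    """Toy compare-and-swap update: a write must name the version it replaces."""
    def __init__(self, content: bytes):
        self.history = [content]          # older versions stay retrievable

    @property
    def version(self) -> int:
        return len(self.history) - 1

    def replace(self, expected_version: int, new_content: bytes) -> bool:
        if expected_version != self.version:
            return False                  # someone else won the race; re-fetch and retry
        self.history.append(new_content)
        return True

doc = VersionedDoc(b"v0")
v = doc.version                           # both writers read version 0
print(doc.replace(v, b"writer A"))        # True  — first writer wins
print(doc.replace(v, b"writer B"))        # False — second writer gets an error
print(doc.history)                        # [b'v0', b'writer A']
```

The losing writer can then fetch the new version, diff against its own changes, and try again, which is the re-commit flow described above.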