An associate browsed the MaidSafe.net features page and sees deduplication as a security concern. I think he is working under some misconceptions/assumptions about how the system works. Here's his specific concern…
The system having the ability to identify identical data segments negates the inherent security of those data segments. The statement makes it seem as if no one knows where the segments are, when in fact they are closely tracked and managed by the system. Any file you store that another person also stores will be identified as such, allowing authorities to identify users who store the same file simply by monitoring the deduplication process.
My guess would be that there is no "deduplication process" to monitor, because deduplication is a simple consequence of how chunks are routed to locations in XOR space based on their hashes, which are calculated before they even enter the network. Also, I believe there is no way for authorities to monitor a user's activity, short of the user's computer already being compromised, since each time a user logs on they are assigned a different XOR address.
Are my statements correct, and do they fully address his concern?
You are correct; he's making invalid assumptions about how deduplication is achieved.
Files are ‘chunked’ and then encrypted before they leave your machine, and only you have a map that can be used to relate those chunks together.
Once on the network, those chunks end up at random locations. Only someone who holds the data map for a particular file can know which chunks correspond to it, and nobody can tie a chunk on the network to a user (or file) without first having that same data map.
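To make the "no deduplication process" point concrete, here is a minimal sketch (my own illustration, not MaidSafe code) of content addressing: each chunk's network name is just a hash of its contents, so two users who store the same bytes independently derive the same names, and the second store is simply a no-op. Nothing has to compare files or track users for this to happen. The chunk size is an assumption for illustration; real self-encryption also encrypts each chunk deterministically from the file's own content before naming it.

```python
import hashlib

CHUNK_SIZE = 1024 * 1024  # 1 MiB, an illustrative chunk size


def chunk_addresses(data: bytes) -> list[str]:
    """Split data into fixed-size chunks and name each by its content hash.

    Deduplication falls out of this naming scheme: identical content
    always maps to the identical address in XOR space."""
    return [
        hashlib.sha256(data[i:i + CHUNK_SIZE]).hexdigest()
        for i in range(0, len(data), CHUNK_SIZE)
    ]


# Two users storing the same bytes derive the same addresses independently,
# so storing the file a second time changes nothing on the network --
# there is no central "deduplication process" running anywhere.
user_a = chunk_addresses(b"same file contents" * 100_000)
user_b = chunk_addresses(b"same file contents" * 100_000)
assert user_a == user_b
```

The addresses are computed client-side before anything touches the network, which is why there is no server-side matching step for anyone to monitor.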
Some of the reasons for charging even when the chunks already exist:

- If you don't charge, a free PUT tells you that the file (chunks) already exists, which reduces security.
- The complexity of "refunding" the amount makes the network more complex — something to do with having to pass the refund back up the line.
- The network has already done the work of processing the chunk, except the very last step of actually storing it.
- Charging each time doesn't favor one person over another: everyone is treated the same in storage costs.
- It improves the network economics.
Of course, this doesn't consider a technique completely separate from deduplication: writing an APP that does the self-encryption without trying to store the file, and then requesting the chunks. This tells you whether the chunks exist without needing to attempt to store them.
I suspect we can estimate the cost of checking for existing data vs. saving the data, then run with it. The main point is that it should cost something, no matter how small. People generally don't like spending money to do DoS.
I tend toward the view that if you want to PUT something, it should cost, whether or not it actually results in a new PUT.
“Profit” from this to go to general funds split between farmers, PtP whatever.
Anything, just anything that results in fewer pictures of kittens on teh intrawebs…
But absolutely no freebies for something that could get used as a DoS vector.
I agree with this. If you commit to something, then it's no loss to you that the system has already stored it for someone else.
But still (if I wasn't in a hurry) I would use an APP that checks whether the chunks already exist. It would do something like take my 20 GB file, self-encrypt the first 10 chunks, and try to retrieve them from the SAFE Network. If they don't exist, it's unlikely the rest do, so it uploads the file in the normal manner. If the first 10 exist, there's a good chance the rest do, so it encrypts more and checks, say, the last 10, and so on. Thus I never ask the network to store the data, I can't be asked to pay, and the network hasn't done 20 GB worth of attempted stores.
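The probe-before-upload idea above could be sketched roughly as follows. Everything here is hypothetical: `fetch_chunk` stands in for a network GET that a real client library would provide, and the chunk naming is a simplified stand-in for self-encryption (which would encrypt deterministically before naming). Only the strategy — check a few names at the front, then a few at the back, before committing to a paid PUT — is the point.

```python
import hashlib


def self_encrypt_names(data: bytes, chunk_size: int = 1024 * 1024) -> list[str]:
    """Stand-in for self-encryption: derive the network names the chunks
    *would* have, entirely client-side, without uploading anything."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def probably_stored(data: bytes, fetch_chunk, probe: int = 10) -> bool:
    """Probe whether a file's chunks already exist before paying to PUT.

    fetch_chunk(name) -> bytes | None is a hypothetical network GET.
    If any of the first `probe` chunk names is missing, the file almost
    certainly isn't fully stored, so the caller should upload normally.
    If they all exist, also spot-check the last `probe` names."""
    names = self_encrypt_names(data)
    for name in names[:probe]:
        if fetch_chunk(name) is None:
            return False  # at least one chunk missing -> just upload
    return all(fetch_chunk(n) is not None for n in names[-probe:])


# Usage against a fake in-memory "network" (a dict of name -> chunk):
data = bytes(range(256)) * 8192          # 2 MiB of sample data
fake_network = {n: b"chunk" for n in self_encrypt_names(data)}
print(probably_stored(data, fake_network.get))   # chunks present
print(probably_stored(data, lambda name: None))  # empty network
```

The dict-backed fake network is only there to exercise the logic; a real client would issue GETs, which are free, so the probe costs nothing and triggers no attempted store.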
I’m trying to learn about SAFE Network and I’m stuck on deduplication (which is pretty early in the FAQ). I understand that the pre-encryption hash allows a client to detect if a duplicate chunk is already stored on the network, but isn’t the stored chunk encrypted by a client-specific key? If so, then client B could not decrypt the encrypted chunk stored by client A. Also, does data deduplication create a problem when a client wants to delete a file/chunk? Is there a use-count for each chunk? I’m a programmer so feel free to get technical or point me to source code files. Thanks!
No — the self-encryption algorithm uses the data itself to form the encryption keys, not a client-specific key. This allows anyone who has the data map of the chunks to decode the file. Each chunk on its own is “impossible” to decrypt, but once you have all the chunks in sequence (the data map), it’s possible to decrypt them.
Sorry, I don’t have the link to the self-encryption write-up with me, but I’m sure a search for “self encryption” on the forum will bring up a wealth of information.
EDIT: follow this link — the post has a video in it that explains it (hopefully; I didn’t watch the vid to check).