Recognising file copies


#1

Hey, I feel that a current flaw with Gdrive and similar systems is that people often upload the same file (e.g. a photo of a troll face, the latest single, the most famous cat video), and these platforms store multiple copies of these files. Wouldn’t it be great for system efficiency, as well as an effective spam prevention mechanism (since an attack would probably be loads of copies of the same thing), if the SAFE network recognised copies of the same file and only stored a single copy, giving access to the people who “uploaded” it and removing access from those who “deleted” it? Is this even realistically possible? Obviously, if even 1 bit of the file is different, a separate copy has to be made. An advanced mechanism would also recognise files that are named differently but are otherwise 100% the same.


#2

Your proposal is already implemented in the SAFE network through self-encryption of files. Each chunk is stored under its hash-name: if the chunk is already there, it’s not saved again, but a counter is increased. For deletion, this counter has to go to zero, i.e. everyone who ever saved this file wants it deleted.
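
To make the counter idea concrete, here is a minimal sketch of a content-addressed chunk store with reference counting (illustrative only, not the actual SAFE implementation; SHA-256 stands in for whatever hash the network uses for chunk names):

```python
import hashlib

class ChunkStore:
    """Toy content-addressed store: chunks live under their hash-name, with a refcount."""

    def __init__(self):
        self.chunks = {}     # hash-name -> chunk bytes
        self.refcount = {}   # hash-name -> number of uploaders referencing it

    def put(self, chunk: bytes) -> str:
        name = hashlib.sha256(chunk).hexdigest()
        if name in self.chunks:
            # Duplicate content: keep one copy, just bump the counter.
            self.refcount[name] += 1
        else:
            self.chunks[name] = chunk
            self.refcount[name] = 1
        return name

    def delete(self, name: str) -> None:
        # The chunk is only really removed once every uploader has deleted it.
        self.refcount[name] -= 1
        if self.refcount[name] == 0:
            del self.chunks[name]
            del self.refcount[name]


store = ChunkStore()
a = store.put(b"same cat video bytes")
b = store.put(b"same cat video bytes")   # second upload: no new copy stored
assert a == b and len(store.chunks) == 1
store.delete(a)                          # one "deleter": the copy stays
store.delete(b)                          # counter hits zero: the copy is gone
assert len(store.chunks) == 0
```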


#3

Question: how do you handle identical files with different filenames? Say I uploaded awesomesong.mp3 and then uploaded the same song from a different album, only this time it was called awesome_song.mp3; how does the SAFE network tell the difference? It would be even more important for pictures, which often change file names but remain pretty much identical. Say you downloaded a picture from a social network and uploaded it, then later went through your collection, renamed all your files as something sensible, and re-uploaded all of them: the pictures are the same but the filenames are different. How does the system cope?


#4

We do not care about any of the metadata for de-duplication. It’s all based on content, so we are covered here.
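
A tiny hedged sketch of the point: the address a file is stored under comes from hashing its content only, so renaming it never produces a new copy (SHA-256 here is just a stand-in for the hash the network actually uses):

```python
import hashlib

def network_address(content: bytes) -> str:
    # Only the file's bytes feed the hash; the filename is metadata and never enters it.
    return hashlib.sha256(content).hexdigest()

audio = b"identical audio data"            # stand-in for the song's actual bytes
addr_1 = network_address(audio)            # uploaded as awesomesong.mp3
addr_2 = network_address(audio)            # re-uploaded as awesome_song.mp3
assert addr_1 == addr_2                    # same content, same address: deduplicated
```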


#5

That’s fascinating, but do you mind explaining how that works? I’m intrigued, but a bit confused.


#6

No worries, I think it is in a paper on the wiki. Oh, just checked: I will get the paper and post it here and on the wiki soon (dentist now). It is basically convergent encryption ++ :smile: A version of it is here: www.google.nl/patents/US20120210120?cl=en. It has since been updated to swap the AES and XOR steps, as well as to add optional compression (which we use).
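
While the paper is pending, here is a rough sketch of plain convergent encryption, the core idea behind the scheme (the real self-encryption adds chunking, the AES/XOR steps and the optional compression mentioned above; the SHA-256-based keystream below is purely illustrative, not the actual cipher). The key is derived from the content itself, so identical plaintexts always produce identical ciphertexts and therefore deduplicate, while anyone without the content cannot derive the key:

```python
import hashlib

def keystream(key: bytes, length: int) -> bytes:
    # Illustrative keystream only (not AES): expand the key with SHA-256 in counter mode.
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]

def convergent_encrypt(plaintext: bytes):
    key = hashlib.sha256(plaintext).digest()              # key derived from the content
    ciphertext = bytes(p ^ k for p, k in zip(plaintext, keystream(key, len(plaintext))))
    address = hashlib.sha256(ciphertext).hexdigest()      # stored under the ciphertext's hash
    return address, ciphertext, key

# Two people encrypting the same file independently get the same ciphertext and address,
# so the network only ever needs to keep one encrypted copy.
addr_a, ct_a, _ = convergent_encrypt(b"the most famous cat video")
addr_b, ct_b, _ = convergent_encrypt(b"the most famous cat video")
assert addr_a == addr_b and ct_a == ct_b
```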


#7

Maybe the following will help explain what data deduplication is and how it works.


#8

File/object name is metadata; the content is data. Suppose you have a.mp3 and b.mp3, with the latter being a copy of the former. Knowing that the two checksums match, when you PUT b.mp3 on the network, its data pointers are simply pointed at the data content of a.mp3 (actually at the chunks, if the file is larger than the chunk size, which is 1 MB).

It gets tricky if you start supporting deletion: a delete must then check that the chunks aren’t still referenced by other objects on the network.
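
A toy end-to-end illustration of that bookkeeping (the 1 MB chunk size is the one mentioned above; the data structures are invented for the example): each stored object is just a list of pointers to chunk names, and a delete only drops chunks whose reference count reaches zero.

```python
import hashlib
import os

CHUNK_SIZE = 1 * 1024 * 1024   # 1 MB, the chunk size mentioned above

chunks = {}     # chunk name -> chunk bytes
refcount = {}   # chunk name -> how many stored objects point at it
objects = {}    # object name (metadata) -> list of chunk pointers (data)

def put_object(name: str, content: bytes) -> None:
    pointers = []
    for i in range(0, len(content), CHUNK_SIZE):
        chunk = content[i:i + CHUNK_SIZE]
        chunk_name = hashlib.sha256(chunk).hexdigest()
        if chunk_name not in chunks:
            chunks[chunk_name] = chunk            # first copy: actually store it
        refcount[chunk_name] = refcount.get(chunk_name, 0) + 1
        pointers.append(chunk_name)               # otherwise: just point at the existing chunk
    objects[name] = pointers

def delete_object(name: str) -> None:
    # A delete must check that the chunks aren't still referenced by other objects.
    for chunk_name in objects.pop(name):
        refcount[chunk_name] -= 1
        if refcount[chunk_name] == 0:
            del chunks[chunk_name]
            del refcount[chunk_name]

content = os.urandom(3_000_000)          # a ~3 MB file, so it splits into 3 chunks
put_object("a.mp3", content)
put_object("b.mp3", content)             # identical copy under another name
assert len(chunks) == 3                  # still only one set of chunks stored
delete_object("a.mp3")
assert len(chunks) == 3                  # b.mp3 still points at them
delete_object("b.mp3")
assert len(chunks) == 0                  # last reference gone: chunks removed
```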

