How will MaidSafe deal with tiny files?

Hi guys,

I am not a computer expert so bear with me…

Our “computer vision” company uses Dropbox, and we recently had the experience of a developer accidentally uploading his “training set” to his Dropbox folder. He basically uploaded a couple of gigabytes of tiny images. Normally, with our fast internet connection, we’d eat 2 gig like candy, but these tiny files threw the whole thing into chaos! It simply couldn’t deal with them; it was saying “4 days and 10 hours” to synchronize. We gave it half a day before giving up and manually deleting the files, but one computer was stubborn and kept trying to upload files to the others. It was basically a big disaster.

I am wondering if MaidSafe can handle these kinds of scenarios and perhaps, the extreme example of someone malicious intentionally trying to clog the whole thing with tiny files?

IIRC (if I recall correctly) all files smaller than 1 MB will be padded to reach the one-megabyte minimum. Uploading a bunch of files in an attempt to clog the system, as you put it, will not work very well (at least once the network reaches critical mass). Safecoin will be one of a few limiting factors. The other one relevant in this scenario is that routing (the part of the system that controls how data flows throughout the network) will shift your traffic to nodes capable of handling your requests. This ensures that these small files will almost always find a home. Remember, this is a global network: anyone with the space and bandwidth can accommodate your desire to store data. Think of your data flowing along a freeway that is instantaneously torn down and rebuilt in different directions to prevent traffic jams.

The synchronization issue likely comes from the natural throttling that occurs when dealing with small files. It’s like the difference between carrying a 50-pound bag of rice to your car and carrying each grain individually, one by one. SAFE partially deals with the synchronization issue by only synchronizing the files you modify. It detects modifications on your machine before telling the network that they need to be updated. IIRC, current solutions like Dropbox query the servers to verify that no changes have occurred. This causes some of the overhead you described.
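
To make that concrete, here is a rough sketch of the “only push what changed” idea. To be clear, this is my own illustration and not MaidSafe code, and it uses Rust’s std hasher purely for brevity where a real system would use a cryptographic hash:

use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::fs;
use std::hash::Hasher;
use std::io;

/// Hash a file’s contents with the std hasher (illustration only;
/// a real system would use a cryptographic hash).
fn file_digest(path: &str) -> io::Result<u64> {
    let bytes = fs::read(path)?;
    let mut hasher = DefaultHasher::new();
    hasher.write(&bytes);
    Ok(hasher.finish())
}

/// Return only the paths whose contents changed since the last sync;
/// unchanged files generate no network traffic at all.
fn changed_files(
    paths: &[&str],
    last_known: &HashMap<String, u64>,
) -> io::Result<Vec<String>> {
    let mut changed = Vec::new();
    for &path in paths {
        let digest = file_digest(path)?;
        if last_known.get(path) != Some(&digest) {
            changed.push(path.to_string());
        }
    }
    Ok(changed)
}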

I hope that covers your questions. If I missed something, let me know. Otherwise I’m sure someone with more intimate knowledge will step in. Sorry I couldn’t be more thorough. Time for my nightly alcohol indulgence. Gotta love them screwdrivers, baby! :laughing:

7 Likes

Holy hell that was a great explanation @Tonda !!!

3 Likes

Right now I think the plan is that the filesystem you see when you mount your SAFE storage drive is a virtual one that only shows you what files exist and their metadata, and only starts to fetch a file as you open it.
So unless you NEED all those files, you won’t have a massive amount of congestion trying to download them all.
Plus, I think Dropbox has the limitation of connecting you to a single server, so you can’t expect to open a couple hundred connections and saturate your pipe, unlike with SAFE. I just wonder how it’ll balance out. SAFE will allow you to make many connections, but the small files will be padded to 1 megabyte each. That would make downloading, say, the source code of an application rather time-consuming if every source file is padded to a megabyte and there are, say, 600 source files. Instead of downloading (on average) 600 × 0.2 MB = 120 MB, you’re downloading 600 MB.

I could see a possible optimization strategy where you have several files in one chunk. This would only allow you to download files in groups of 5 in our example, but it’d be faster since you don’t have to download 0.8 megabytes of padding with each file.
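
Just to illustrate what I mean (purely my own sketch of the packing idea, not anything the devs have committed to): greedily fill a bundle with small files until the next file would push it past the 1 MB chunk ceiling, so the padding overhead is paid once per bundle instead of once per file.

/// Hypothetical bundle size matching a 1 MB chunk (illustration only).
const BUNDLE_SIZE: usize = 1024 * 1024;

/// Naive greedy packer for small files: start a new bundle whenever the
/// next file would overflow the current one. A file larger than a bundle
/// would simply get a bundle to itself in this sketch.
fn pack_into_bundles(file_sizes: &[usize]) -> Vec<Vec<usize>> {
    let mut bundles: Vec<Vec<usize>> = Vec::new();
    let mut current: Vec<usize> = Vec::new();
    let mut used = 0;
    for &size in file_sizes {
        if used + size > BUNDLE_SIZE && !current.is_empty() {
            bundles.push(current);
            current = Vec::new();
            used = 0;
        }
        current.push(size);
        used += size;
    }
    if !current.is_empty() {
        bundles.push(current);
    }
    bundles
}

fn main() {
    // 600 source files of ~0.2 MB each, as in the example above.
    let files = vec![200 * 1024; 600];
    let bundles = pack_into_bundles(&files);
    println!("{} files -> {} bundles of up to 1 MB", files.len(), bundles.len());
}

For the 600-file example above, this packs the files five to a bundle, so the download drops from roughly 600 MB of padded chunks to roughly 120 MB of packed ones.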

I’d like to ask the devs if this sort of optimization is planned?

1 Like

4 KB is the minimum as far as I know, unless that changed to 1 MB and I missed the update.

2 Likes

Wait, what kind of data is this? Messaging will allow for this size but not immutable data right!? Let us know brother. Break it down. You’re one of the ones I think of when I write intimate knowledge. :sunglasses:

1 Like

The minimum size of a chunk, in the actual implementation, is 1 KB, so the file must be bigger than 3 KB. See self_encryption:

/// MAX_CHUNK_SIZE defined as 1MB.
pub const MAX_CHUNK_SIZE: u32 = 1024 * 1024;
/// MIN_CHUNK_SIZE defined as 1KB.
pub const MIN_CHUNK_SIZE: u32 = 1024;
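
Reading those constants back, here is a rough back-of-the-envelope of how a file of a given size is treated. This is my reading of it, not the crate’s exact algorithm, so treat the numbers as estimates:

/// Values from the self_encryption constants quoted above.
const MAX_CHUNK_SIZE: u32 = 1024 * 1024;
const MIN_CHUNK_SIZE: u32 = 1024;

/// Rough estimate of how a file of `size` bytes would be handled:
/// - below 3 * MIN_CHUNK_SIZE: kept whole in the data map, no chunks;
/// - otherwise: split into at least three chunks of at most MAX_CHUNK_SIZE.
fn estimated_chunk_count(size: u64) -> u64 {
    if size < 3 * MIN_CHUNK_SIZE as u64 {
        0
    } else {
        let chunks = (size + MAX_CHUNK_SIZE as u64 - 1) / MAX_CHUNK_SIZE as u64;
        chunks.max(3)
    }
}

fn main() {
    for size in [1_000u64, 10_000, 1_000_000, 10_000_000] {
        println!("{} bytes -> {} chunks", size, estimated_chunk_count(size));
    }
}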

3 Likes

It is worth checking if very small files get stored in the datamap itself - something I recall reading a long time ago. No details I’m afraid, and it might all have been just a dream :slightly_smiling:

2 Likes

Yes, you are right, and it wasn’t a dream. A very small file (less than 3072 bytes) is not split into chunks and is not encrypted, but self_encryption still returns it as a datamap.

Rust enums can wrap different kinds of data under the same type: in our example, a datamap is either a vector of chunks or the content itself.
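
For anyone curious, the shape of that enum is roughly the following. This is paraphrased from memory of the self_encryption crate, so the exact variant and field names may differ from the real source:

/// Paraphrased descriptor for one encrypted chunk (field names approximate).
pub struct ChunkDetails {
    pub chunk_num: u32,    // position of the chunk within the file
    pub hash: Vec<u8>,     // identifier of the encrypted chunk on the network
    pub source_size: u64,  // size of the plaintext this chunk covers
}

pub enum DataMap {
    /// A normal file: a list of chunk descriptors to fetch and decrypt.
    Chunks(Vec<ChunkDetails>),
    /// A very small file (less than 3072 bytes): the bytes are stored
    /// inline, unencrypted, directly in the data map.
    Content(Vec<u8>),
    /// An empty file.
    None,
}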

2 Likes