NFS internals - Is each file being stored as a single Immutable?

I have a simple website made with the Metronic bootstrap theme that uses a lot of small js/css/image files. On my machine it has 4k files, occupying 50MB (~12.5KB per file).

Let’s say I have uploaded it to Safe. In this scenario, will the current Safe NFS implementation consume 4GB of space (4k * 1MB), or will it try to group many small files into just 1 immutable (and use some kind of map to reach them)?

PS: of course I would concatenate all my js files into a single file and do the same with the css. This is just a hypothetical scenario, an example.

1 Like

In your scenario the consumption of disk space in the network is 4 * 50MB = 200MB, but you have to pay for 3 * 4K * 1MB = 12GB worth of data (compared to a situation where all files would have a size > 3MB and a multiple of 1MB). A quick sketch of this arithmetic follows the list below.

  • each file is split into 3 chunks (because their size is > 3KB and < 3MB)
  • each chunk is stored 4 times (because backup data is not yet implemented)
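
Here is that sketch (assuming each chunk is charged as a full 1MB PUT, which is how I’m reading the pricing above; the figures are the ones from the scenario):

```rust
fn main() {
    let files: u64 = 4_000;                 // number of small files in the example
    let avg_file_size: u64 = 12_500;        // ~12.5KB each, ~50MB in total
    let copies: u64 = 4;                    // copies kept by the network
    let chunks_per_file: u64 = 3;           // files between 3KB and 3MB -> 3 chunks
    let charged_per_chunk: u64 = 1_000_000; // each chunk charged as a full 1MB (assumption)

    // Actual space consumed: the real bytes, replicated 4 times.
    let space = copies * files * avg_file_size;
    // What you pay for: 3 chunks per file, each charged as a full 1MB.
    let paid = chunks_per_file * files * charged_per_chunk;

    println!("space consumed ~ {} MB", space / 1_000_000);   // ~200 MB
    println!("paid for       ~ {} GB", paid / 1_000_000_000); // 12 GB
}
```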

Note: Space taken in the network is an approximation because chunks are encrypted and compressed.

2 Likes

Oh yeah, I totally forgot the extra copies the network makes. But is it true that each 12.5KB file (from my example) would be stored as a single immutable? Or does the NFS have some mechanism to save space by grouping small files into a single immutable?

1 Like

Take a peek at this: self_encryption/lib.rs at bca09e8ffaed9ba8127eed57c85400c40b4656ff · maidsafe/self_encryption · GitHub. That appears to basically say:

  • < 3KB takes no chunks (the content is actually embedded in the data map, which in turn is embedded in the structured data that defines the directory)
  • < 3MB always takes 3 chunks, never fewer
  • otherwise it takes one chunk per meg, always rounded up

Also remember that writing files alters the structured data for directories, which may incur costs for those writes; I am not sure.
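
A rough sketch of that rule as I read it (an illustration only, not the actual self_encryption code; the exact constants and boundaries are in the linked lib.rs):

```rust
const KB: u64 = 1024;
const MB: u64 = 1024 * KB;

/// Rough chunk count per the rule described above (a sketch, not the
/// actual self_encryption code; the exact boundaries live in lib.rs).
fn chunk_count(file_size: u64) -> u64 {
    if file_size < 3 * KB {
        0 // small content is embedded in the data map itself
    } else if file_size < 3 * MB {
        3 // always exactly 3 chunks, never fewer
    } else {
        (file_size + MB - 1) / MB // one chunk per MB, rounded up
    }
}

fn main() {
    for size in [1_000u64, 12_500, 2 * MB, 10 * MB + 1] {
        println!("{} bytes -> {} chunks", size, chunk_count(size));
    }
}
```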

Note, this impl can probably change. There is nothing that seems to “group” small files (for many reasons I can think of). I would expect nodes/vaults to have deduplication of data on their end, though, even if it won’t affect the price. It brings up an interesting point: farmers are surely going to try to use the best dedupe algorithm that doesn’t get them penalized on speed when serving/saving data, to get the most out of their disks.

3 Likes

That’s different, doesn’t the network handle de-duplication the same for all nodes?

Writes to existing SD objects incur no cost; an SD only costs to create.

They have no concept of what is being stored, and the vault does not do dedup.

Dedup is done by the ? managers checking each chunk given to them to store, to see if a chunk with that hash (address) already exists and thus has already been stored; if so, there is no need for further work on that chunk by the ? managers.
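
As a toy illustration of that check (a plain HashSet and Rust’s std hasher stand in for the real managers and the chunk’s cryptographic hash; all the names here are made up):

```rust
use std::collections::HashSet;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Stand-in for the chunk's network address: in the real network this is a
/// cryptographic hash of the (already encrypted) chunk, not DefaultHasher.
fn chunk_address(encrypted_chunk: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    encrypted_chunk.hash(&mut h);
    h.finish()
}

struct Managers {
    stored: HashSet<u64>, // addresses of chunks already stored
}

impl Managers {
    /// Returns true if the chunk actually needs storing, false if it is a
    /// duplicate and no further work is required.
    fn put(&mut self, encrypted_chunk: &[u8]) -> bool {
        self.stored.insert(chunk_address(encrypted_chunk))
    }
}

fn main() {
    let mut m = Managers { stored: HashSet::new() };
    let chunk = vec![42u8; 1024];
    println!("first put stored:  {}", m.put(&chunk)); // true
    println!("second put stored: {}", m.put(&chunk)); // false -> dedup
}
```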

The farmers do not get any choice. Also, dedup has already been performed before any chunk is stored, so it is impossible for a vault to ever get a duplicated chunk.

If you think about it, even IF no dedup was being done by the network before sending the chunk to be stored in a vault, what are the chances of EVER getting a duplicate in your vault out of the 100’s of thousands/millions of vaults? But in any case dedup prevents even the remotest possibility of a duplication of chunks in your vault.

1 Like

I think we mean different things here. I mean literal deduplication of (post-encrypted) bytes on disk. I am not talking about duplicated chunks. This is a common feature of NASs and SANs which may serve as good devices for storage here.

I am talking exactly about "literal deduplication of (post-encrypted) bytes".

That is what dedup does. How can a ?manager check unencrypted data? The client only sends encrypted chunks off to be stored, and the managers have no idea what is stored because it is self-encrypted by the client.

It’s all just bytes to the filesystem. Deduplication works at that level.

Most certainly does.

Each chunk being stored is assigned an effective address using its hash (it is already encrypted by the client), so if another chunk lives at that “address” then it’s a duplicate.

If I have a 3MB file but only one chunk’s worth (1MB) duplicates another file, then it will not dedup, because after self-encryption all 3 chunks of each file will be different.

But if I store 2 files and, say, the first 6 chunks’ worth is exactly the same, then many chunks will dedup.
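
A very simplified model of why that happens (as I understand it, real self_encryption derives its encryption keys/IVs from the hashes of the other chunks of the same file; here a simple XOR pad derived from the neighbouring chunks stands in for that):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Toy pad derived from the plaintext hashes of the neighbouring chunks.
// (Real self_encryption derives proper keys/IVs this way; XOR is illustration only.)
fn pad_from_neighbours(prev: &[u8], next: &[u8]) -> [u8; 8] {
    let mut h = DefaultHasher::new();
    prev.hash(&mut h);
    next.hash(&mut h);
    h.finish().to_le_bytes()
}

fn encrypt_chunk(chunk: &[u8], prev: &[u8], next: &[u8]) -> Vec<u8> {
    let pad = pad_from_neighbours(prev, next);
    chunk.iter().enumerate().map(|(i, b)| b ^ pad[i % 8]).collect()
}

fn main() {
    let shared = vec![7u8; 16]; // identical middle chunk appearing in two files
    let file_a = [vec![1u8; 16], shared.clone(), vec![2u8; 16]];
    let file_b = [vec![3u8; 16], shared.clone(), vec![4u8; 16]];

    let enc_a = encrypt_chunk(&file_a[1], &file_a[0], &file_a[2]);
    let enc_b = encrypt_chunk(&file_b[1], &file_b[0], &file_b[2]);

    // Same plaintext chunk, different neighbours -> different ciphertext,
    // so the network sees two distinct chunks and nothing dedups.
    println!("identical after encryption: {}", enc_a == enc_b); // false
}
```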

I think we’re just missing each other here. If half of that 1MB has something in common with another half of another 1MB, my filesystem might just reference those on disk. It’s no different with two cat pictures with some bytes in common or two byte arrays. But yes, as for safenet-level dedup, that is hash based on the entire chunk and is not related to what I mean about disk-level dedup for operating systems.
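
For what it’s worth, that kind of filesystem/NAS dedup is typically just block-level hashing under the hood; a toy version with fixed 4K blocks (std hasher standing in for whatever a real appliance uses):

```rust
use std::collections::HashMap;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const BLOCK: usize = 4096; // typical filesystem dedup granularity

// Stand-in for whatever hash a real NAS/SAN dedup engine uses.
fn block_hash(block: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    block.hash(&mut h);
    h.finish()
}

/// Writes `data` into a toy block store (hash -> refcount) and returns how
/// many 4K blocks actually had to be newly stored.
fn write_deduped(data: &[u8], store: &mut HashMap<u64, u64>) -> usize {
    let mut newly_stored = 0;
    for block in data.chunks(BLOCK) {
        let refs = store.entry(block_hash(block)).or_insert(0);
        if *refs == 0 {
            newly_stored += 1;
        }
        *refs += 1;
    }
    newly_stored
}

fn main() {
    let mut store = HashMap::new();

    // Two 16KB buffers that share their first half but differ in the second.
    let mut a = vec![0u8; 16 * 1024];
    for (i, byte) in a.iter_mut().enumerate() {
        *byte = (i / 64) as u8; // varied content so the 4K blocks differ
    }
    let mut b = a.clone();
    for byte in &mut b[8 * 1024..] {
        *byte = byte.wrapping_add(1);
    }

    println!("blocks newly stored for A: {}", write_deduped(&a, &mut store)); // 4
    println!("blocks newly stored for B: {}", write_deduped(&b, &mut store)); // 2
}
```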

First, the SAFE network works with chunks (up to 1MB in size).

Each chunk is self-encrypted before being sent out to the network to be stored.

Each chunk is indivisible content-wise. And, being encrypted, it is almost inconceivable that you will come across any pair of chunks with the 1st half (or whatever) exactly the same without the whole being exactly the same.

As such it is pointless to talk of a vault trying to “dedup” based on portions of chunks.

If you want to attempt to save storage by checking each 512-byte sector (or 4K block), then any NAS software that already has that feature will do the same. But even at the 4K level, a 1MB chunk that is encrypted (and encrypted again before finally being sent to the vault for storage) is effectively a stream of random 4K blocks, so getting a duplicate of a previously stored block is about as likely as generating random 4K blocks and having two match. But if you want to try, then go ahead.

But mathematically you might get one 4K disk storage block duplicate every 100 billion chunks sent to you. Have fun. Maybe actually 1 in 100 billion billion chunks if you are lucky.

For locally stored files, even if encrypted by the NAS, you are very likely to find duplicates. Like the backups you keep on the NAS: that OS file is the same today as it was for the last 100 backups, and it encrypts in the NAS to the same bytes. But not with already-deduped chunks that are further randomised by another encryption by the managers before being given to the vault to store. This second encryption is to prevent someone self-encrypting certain files and then looking for those chunks being stored in their vault, cached, or passed along the chain to the client.

1 Like

No, it doesn’t have any mechanism to group small files into a single immutable. The consequence is that you will have to pay more safecoins to put many small files than to merge them yourself into one big file. But the space taken in the network will be approximately the same in both cases.

1 Like

What I just said is true for files greater than 3KB. Below that limit the price is divided by 3, because only one chunk is put in the network (and you don’t benefit from encryption); a small cost sketch follows the list:

  • files < 3072 bytes cost one PUT
  • files >= 3072 bytes and < 3MB cost three PUTs
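
Putting the tiers above together with the reading of lib.rs earlier in the thread (one PUT per MB, rounded up, above 3MB), here is that sketch:

```rust
const KB: u64 = 1024;
const MB: u64 = 1024 * KB;

/// PUT cost per file, per the tiers quoted above (a sketch; real pricing may differ).
fn puts_for_file(size: u64) -> u64 {
    if size < 3 * KB {
        1 // one PUT, single small chunk
    } else if size < 3 * MB {
        3 // three PUTs, always three chunks
    } else {
        (size + MB - 1) / MB // one PUT per MB, rounded up
    }
}

fn main() {
    // The original scenario: 4,000 files of ~12.5KB each, ~50MB in total.
    let separate: u64 = 4_000 * puts_for_file(12_500);
    let merged = puts_for_file(50_000_000);
    println!("4,000 small files:    {} PUTs", separate); // 12,000
    println!("one merged 50MB file: {} PUTs", merged);   // 48
}
```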