I noticed something strange with self_encryption, and I learned some things from investigating it. This post goes into the strangeness and possible changes that might come from it.
Look at this 4.64 MiB words.txt file.
When I do a normal zip operation to create a new file words.txt.zip it becomes 1.23 MiB.
When words.txt passes through self_encryption it creates 5 chunks and a datamap, totalling 1.39 MiB. The chunks are (in order of the datamap):
size    name
320128  c73c6cc5d28dba8a19ea8f47aa694d00de0dd693dd1964f77ff9347dcb5d6291
320304  045c4bca99cda93f4c82dc23197f4bf4586d2dc577bdd0011f5129fa4aba4786
314576  9645290a4ba2e4318912fd53fe92e49a272d77c059536a5831ed602e16df1180
305664  f6fee1c01b6bf89f6276bc79031dd57b6348a550e65b27f26b77bd35f8389c28
195696  21b4c71b9a0eaf77bc09c8d824d880a2632ddf4626fdfd4423b5cff39dc6af39
---
1456368 Total - 1.39 MiB
The file is split into 1 MiB chunks, then each chunk is compressed, then each compressed chunk is encrypted. Note that each chunk ends up around 300 KiB after compression.
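The order of operations matters here. A rough sketch of the chunk-then-compress pipeline, with Python's zlib standing in for brotli and the encryption step omitted (the real self_encryption algorithm also derives keys from chunk hashes, which is not shown):

```python
import zlib

MAX_CHUNK_SIZE = 1024 * 1024  # 1 MiB, matching self_encryption's chunk size

def chunk_then_compress(data: bytes) -> list[bytes]:
    # Step 1: split the file into fixed 1 MiB chunks.
    chunks = [data[i:i + MAX_CHUNK_SIZE]
              for i in range(0, len(data), MAX_CHUNK_SIZE)]
    # Step 2: compress each chunk individually (encryption would follow).
    return [zlib.compress(c, 6) for c in chunks]

# ~4 MB of repetitive text, standing in for words.txt
data = b"example " * 500_000
compressed = chunk_then_compress(data)
print(len(compressed), "chunks:", [len(c) for c in compressed])
```

Because compression happens per chunk, the chunk count is fixed by the uncompressed size, which is exactly the behaviour observed above.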
Creating words2.txt with two concatenated copies of the list,
cat words.txt words.txt > words2.txt
self_encryption creates 10 chunks of varying sizes, doubling the number of chunks:
size    name
320128  01356d19050bb6994f77d4421e2c0dce12475bb6c97ac2e196a8bd110efc89a1
320304  d70a9bc76f72e21e0b81b6c92ea0003a7dc4ace46b172b8eb6912d06942f83d2
314576  9645290a4ba2e4318912fd53fe92e49a272d77c059536a5831ed602e16df1180
305664  f6fee1c01b6bf89f6276bc79031dd57b6348a550e65b27f26b77bd35f8389c28
313536  7d7eff882bf6a0abd09c28b6f16e4bce1e96dac93a9c0b8a0021a2d77b0ef337
316944  3909757af398e796028af60d30552d0852b166d4a771ee623bff45084751acb1
326496  1ad8dd0cf2d1e1b8f7efa9f3ab02789c4a6ab5553a578ff85208ebc69b124b0f
297824  c8dffc7cd1909a2cb1e54ba9f06739c98d102889f79866e66f34b58623adf51e
308016  ceedab3fa35e4ebec93492d119e906c0b7e51782f2384abb891585f42369ce45
88704   82b9c54d77e2b6f863c3ff0815b9e631835ec443323311eb803055d39b8129d8
---
2912192 Total - 2.78 MiB
I found it strange that there are 5 chunks for 1.23 MiB of compressed data, and 10 chunks for 2.78 MiB, averaging around 300 KiB per chunk. That seems inefficient when chunks can be up to 1 MiB.
Incidentally, there are 2 matching chunks between words.txt and words2.txt, at index 2 and 3. This level of detail is not exactly within the scope of deduplication, which operates at the file level rather than the byte level, but I found it interesting all the same. Why are chunks 2 and 3 identical but not chunks 0 and 1? We can work out why from the algorithm, but it's beyond the scope of this topic. Just wanted to observe it as a point of interest.
When should we compress data?
Compressing words.txt as a whole using the same compression as self_encryption (brotli level 6), the size becomes 1453829 bytes (1.386 MiB).
When words.txt is split into 1 MiB chunks and each chunk is individually compressed, the total is 1456368 bytes, ie 2539 bytes more, but very close in size. So there's not much size difference whether the data is compressed before or after chunking.
However, the benefit of precompressing is that it produces fewer total chunks, 3 instead of 5:
size    name
484624  ec99b22fbce573414d692a8feb09c1bbf90c8d914f0803a98f5fdcd71f1d9bc4
484624  97ff008d97849721d4c87446b928d37745d3f9bd4c71c2b2c32e14c57c00abe9
484624  e9e227969f333dd501016762d08fe27c9f9801e50faf2a4090db156d86ce5557
---
1453872 Total - 1.39 MiB
Would it be best to compress the entire data before self_encryption rather than compressing each individual chunk? This would reduce the number of chunks. Fewer chunks means fewer requests for the client to make and less node routing.
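A quick way to see the chunk-count effect, again with zlib standing in for brotli and a synthetic word list standing in for words.txt (the byte counts differ from the brotli numbers above, but the shape of the result is the same):

```python
import zlib

MAX_CHUNK_SIZE = 1024 * 1024  # 1 MiB

# Synthetic stand-in for words.txt: one word per line, ~4.2 MiB.
data = b"\n".join(b"word%06d" % i for i in range(400_000))

# Option A: compress the whole file first, then chunk the result.
whole = zlib.compress(data, 6)
chunks_a = [whole[i:i + MAX_CHUNK_SIZE]
            for i in range(0, len(whole), MAX_CHUNK_SIZE)]

# Option B: chunk first, then compress each 1 MiB chunk individually.
chunks_b = [zlib.compress(data[i:i + MAX_CHUNK_SIZE], 6)
            for i in range(0, len(data), MAX_CHUNK_SIZE)]

print("compress-then-chunk:", len(chunks_a), "chunks")
print("chunk-then-compress:", len(chunks_b), "chunks")
```

Option A's chunk count is driven by the compressed size; Option B's by the uncompressed size, so A always yields as few or fewer chunks.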
As for when to compress: the client can choose to manually compress their files before self_encrypting and uploading, which is fine, and I accept that as an option, but it doubles the compression work (precompression plus compressing each chunk). Should we maybe add file-level compression to sn_api, sn_cli, sn_client, or self_encryption? What about the browser? I feel chunk-level compression is suboptimal compared to file-level compression, and maybe we should consider removing chunk-level compression.
Should we use the highest compression available, ie change from 6 to 11?
Using 6 vs 11:
size: 1453829 vs 1243206 bytes
compress time: 0.521924085 vs 11.329777847 seconds
decompress time: 0.032754702 vs 0.031691953 seconds
This takes a fair bit longer for the client to do the compression, but since data is stored permanently and may be fetched many many times, it makes sense to do one lot of extra upfront work to improve many future stores and transfers, especially when decompression time is virtually identical.
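The same tradeoff shows up with any compressor's level knob. An illustrative timing sketch using zlib's levels 1/6/9 (not brotli's 6 vs 11, but the pattern of rising cost for shrinking size is the same):

```python
import time
import zlib

# ~4.2 MiB synthetic word list
data = b"\n".join(b"word%06d" % i for i in range(400_000))

for level in (1, 6, 9):
    t0 = time.perf_counter()
    out = zlib.compress(data, level)
    dt = time.perf_counter() - t0
    print(f"level {level}: {len(out)} bytes in {dt:.3f}s")
```

Higher levels spend more compression time for a smaller output, while decompression cost stays roughly flat regardless of the level used to compress.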
Also worth noting here a caveat about interoperability for compression levels 10-11, from brotli readme:
Rust brotli currently supports compression levels 0 - 11
They should be bitwise identical to the brotli C compression engine at compression levels 0-9
Should we allow variable compression levels, and variable compression algorithms?
Two-pass compression (eg 7zip) gives much better results. Compression algorithms will improve in the future.
Should we be storing some metadata in the datamap about this?
For example, words2.txt (two concatenated copies of words.txt) gives twice as many chunks, whereas a two-pass compression algorithm would avoid that. The same problem shows up comparing zip's one-pass compression with 7zip's two-pass compression:
words.txt.zip   1284951 bytes
words2.txt.zip  2569994 bytes - twice as big
words2.txt.7z   1228738 bytes - two pass filter
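The difference comes down to how far back the compressor can see. A small demo of the same effect using Python's stdlib: zlib's deflate has a 32 KiB window, so it cannot match the second copy of a file against the first, while lzma uses a multi-MiB dictionary (much like 7zip) and makes the second copy almost free:

```python
import lzma
import zlib

# ~2.2 MiB synthetic word list, then a doubled copy of it.
data = b"\n".join(b"word%06d" % i for i in range(200_000))
double = data + data

# deflate (32 KiB window): the doubled file compresses to ~2x the size.
z1, z2 = len(zlib.compress(data, 6)), len(zlib.compress(double, 6))

# lzma (multi-MiB dictionary): the second copy mostly collapses away.
l1, l2 = len(lzma.compress(data)), len(lzma.compress(double))

print(f"zlib: {z1} -> {z2} ({z2 / z1:.2f}x)")
print(f"lzma: {l1} -> {l2} ({l2 / l1:.2f}x)")
```

Per-chunk compression is even more limited than a small window: the compressor can never see beyond 1 MiB, so cross-chunk redundancy is always lost.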
I personally think 1) no, we should not allow variable compression levels, but 2) yes, we should allow variable compression algorithms. This increases complexity though, and I feel it's not the domain of self_encryption; it's probably better implemented at the client layer via something like a Content-Encoding header/metadata.
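As a purely hypothetical sketch of what such client-layer metadata could look like (none of these names exist in any current API; this is just the Content-Encoding idea transplanted onto a datamap):

```python
from dataclasses import dataclass, field

@dataclass
class DataMapMeta:
    # Hypothetical field: transformations applied at upload time,
    # in order, analogous to HTTP's Content-Encoding header.
    content_encoding: list[str] = field(default_factory=list)

# Uploader compressed with brotli, then self_encrypted.
meta = DataMapMeta(content_encoding=["br", "self_encryption"])

# A fetching client reverses the list: self_decrypt first,
# then brotli-decompress.
steps = list(reversed(meta.content_encoding))
print(steps)
```

This keeps self_encryption itself algorithm-agnostic while letting clients negotiate and evolve compression choices over time.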
Self_encrypting the compressed words.txt uses the medium encryptor to create three 484624-byte chunks.
Would we be better to create one chunk of MAX_CHUNK_SIZE and one remainder chunk, to have only 2 chunks?
It seems like a waste to do 3 mid-sized requests/stores when we could have 2 requests/stores, one full-size and one small.
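The arithmetic for the two layouts, using the chunk sizes from the compressed words.txt above:

```python
MAX_CHUNK_SIZE = 1024 * 1024  # 1 MiB

total = 1_453_872  # three 484624-byte chunks from the compressed words.txt

# Current behaviour: the medium encryptor splits into equal thirds.
equal = [total // 3] * 3

# Alternative: fill chunks to MAX_CHUNK_SIZE, plus one remainder chunk.
full, rem = divmod(total, MAX_CHUNK_SIZE)
packed = [MAX_CHUNK_SIZE] * full + ([rem] if rem else [])

print("equal: ", equal)   # 3 chunks
print("packed:", packed)  # 2 chunks
```

Same total bytes either way; the packed layout just trades three mid-sized stores for one full-size store and one small one. (Whether the self_encryption algorithm permits a two-chunk layout is a separate question; this is only the size arithmetic.)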
Clients can upload non-self_encrypted data and nodes have no idea if the data they hold has been self_encrypted or not. Self_encryption is unenforceable.
Do we need to consider some sort of Content-Encoding header equivalent?
I wonder if we need to better differentiate, for clients, self_encrypted data vs merely chunked data. Not all data on the network will necessarily be self_encrypted, but all data will be chunked. How will clients know whether to self_decrypt or not? This could be app-specific, but then data won't necessarily be interoperable between apps.
Just reaching toward the edges of this problem here, I’m not convinced myself of the importance of this particular question in the short-term. And this also seems outside the scope of self_encryption itself, more likely a client-layer issue.
Tied in with the concept of a two-pass compression algorithm: can we skip ahead with self_encryption, and how does the compression algorithm affect this? E.g. if we have a video that is 20 chunks and we want to skip immediately to halfway, can we read from chunk 10 and still decrypt+decompress without needing to fetch chunks 1-9?
What’s the best way to manage streaming and skipping? This needs consideration for several parts: the chunking process, the compression process, the encryption process.
This question can be answered by closely inspecting the self_encryption algorithm, but I was hoping someone who already knows could simply say yes or no. Keep in mind that the ability to read from mid-stream may be a factor in any future changes, so this stream-ability property is worth staying aware of; even if there's a simple answer today, we probably want to keep considering it as the design evolves.
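I can't answer for the encryption side, but the compression side is demonstrable: because each chunk is compressed independently, chunk N can be decompressed without touching chunks 0 to N-1, whereas a single whole-file compression stream generally must be decompressed from the start. A sketch with zlib standing in for brotli:

```python
import zlib

CHUNK = 64 * 1024  # small chunks to keep the demo quick

data = bytes(range(256)) * 5000  # ~1.22 MiB -> 20 chunks
chunks = [zlib.compress(data[i:i + CHUNK], 6)
          for i in range(0, len(data), CHUNK)]

# "Seek" straight to chunk 10: it decompresses on its own, with no
# need to fetch or decompress chunks 0-9 first.
mid = zlib.decompress(chunks[10])
assert mid == data[10 * CHUNK:11 * CHUNK]
print("chunk 10 recovered independently:", len(mid), "bytes")
```

So moving to whole-file compression (as discussed above) would trade away this per-chunk random access unless the compressor adds its own seekable framing.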
Brotli has multithreaded compression as of v3 “so multiple threads can operate in unison on a single file”.
I couldn’t find an example so haven’t tested it but maybe this is something to consider in the future? Especially if it increases the viability of higher compression ratios?
Fundamentals #19 says
Only ever use encrypted services and encrypted traffic
Services or traffic must be encrypted if they are to be used by the Safe Network.
Since self_encryption is optional (clients can upload plain blobs) should we consider adding chunk encryption at the node level?
I assume nodes can’t enforce self_encryption. Is this a fair assumption?
I also assume we can’t force nodes to encrypt data, they can run a node that stores whatever is given to them directly.
We can assume that traffic is encrypted, since a node that tries to send unencrypted traffic would be booted; sending unencrypted traffic simply isn't possible.
Maybe to achieve this goal we could have every chunk encrypted by the nodes themselves? I'm not really sure about this, but I wonder how far we take this point: does it apply only to traffic, or also to data at rest?
The original motive for this question: suppose someone uploads a plain 900K blob of a gore image (not self_encrypted, in a single chunk), and that image happens to land on my node, and I happen to open the chunks directory in my file explorer, and my file explorer happens to automatically render thumbnails. Then I will see gore that I don't want to see. This is a kind of symbolic test case that covers many other situations (law enforcement, malware, etc), and I feel it leads to fundamental #19 failing. Just looking for opinions; maybe I'm being too fundamentalist here.
But also I’m trying to acknowledge the limitations of self_encryption so we aren’t talking around different conceptual boundaries.