Self_encryption and compression, questions and thoughts

I noticed something strange with self_encryption, and I learned some things from investigating it. This post goes into the strangeness and possible changes that might come from it.

Background

Look at this 4.64 MiB words.txt file.

When I do a normal zip operation to create a new file words.txt.zip it becomes 1.23 MiB.

When words.txt passes through self_encryption it creates 5 chunks and a datamap, totalling 1.39 MiB. The chunks are (in order of the datamap):

size   name
320128 c73c6cc5d28dba8a19ea8f47aa694d00de0dd693dd1964f77ff9347dcb5d6291
320304 045c4bca99cda93f4c82dc23197f4bf4586d2dc577bdd0011f5129fa4aba4786
314576 9645290a4ba2e4318912fd53fe92e49a272d77c059536a5831ed602e16df1180
305664 f6fee1c01b6bf89f6276bc79031dd57b6348a550e65b27f26b77bd35f8389c28
195696 21b4c71b9a0eaf77bc09c8d824d880a2632ddf4626fdfd4423b5cff39dc6af39
---
1456368 Total - 1.39 MiB

The file is split into 1 MiB chunks, then each chunk is compressed, then each compressed chunk is encrypted. Note each compressed chunk ends up around 300 KiB.
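As a rough sketch of that pipeline, using stdlib stand-ins (zlib instead of brotli, and the encryption step omitted; `chunk_compress` and `MAX_CHUNK_SIZE` are illustrative names, not the real self_encryption API):

```python
import zlib

MAX_CHUNK_SIZE = 1024 * 1024  # 1 MiB pre-compression chunk size

def chunk_compress(data: bytes, level: int = 6) -> list[bytes]:
    """Split into 1 MiB chunks, then compress each chunk individually.
    (The real pipeline then encrypts each compressed chunk.)"""
    chunks = [data[i:i + MAX_CHUNK_SIZE] for i in range(0, len(data), MAX_CHUNK_SIZE)]
    return [zlib.compress(c, level) for c in chunks]

# A ~4.64 MiB input splits into 5 pre-compression chunks, as with words.txt.
data = bytes(4 * 1024 * 1024 + 650 * 1024)
print(len(chunk_compress(data)))  # 5
```

The key point is that the chunk count is fixed before compression runs, which is why compressible files end up as many under-sized chunks.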

Creating words2.txt with two concatenated copies of the list,
cat words.txt words.txt > words2.txt
self_encryption creates 10 chunks of varying sizes, doubling the chunk count:

size   name
320128 01356d19050bb6994f77d4421e2c0dce12475bb6c97ac2e196a8bd110efc89a1
320304 d70a9bc76f72e21e0b81b6c92ea0003a7dc4ace46b172b8eb6912d06942f83d2
314576 9645290a4ba2e4318912fd53fe92e49a272d77c059536a5831ed602e16df1180
305664 f6fee1c01b6bf89f6276bc79031dd57b6348a550e65b27f26b77bd35f8389c28
313536 7d7eff882bf6a0abd09c28b6f16e4bce1e96dac93a9c0b8a0021a2d77b0ef337
316944 3909757af398e796028af60d30552d0852b166d4a771ee623bff45084751acb1
326496 1ad8dd0cf2d1e1b8f7efa9f3ab02789c4a6ab5553a578ff85208ebc69b124b0f
297824 c8dffc7cd1909a2cb1e54ba9f06739c98d102889f79866e66f34b58623adf51e
308016 ceedab3fa35e4ebec93492d119e906c0b7e51782f2384abb891585f42369ce45
 88704 82b9c54d77e2b6f863c3ff0815b9e631835ec443323311eb803055d39b8129d8
---
2912192 Total - 2.78 MiB

I found it strange there are 5 chunks for 1.39 MiB of compressed data, and 10 chunks for 2.78 MiB, averaging around 300 KiB per chunk. That seems inefficient when chunks can be up to 1 MiB.

Incidentally, there are 2 matching chunks between words.txt and words2.txt, at indexes 2 and 3. This level of detail is not exactly within the scope of deduplication, which operates at the file level rather than the byte level, but I found it interesting all the same. Why are chunks 2 and 3 identical but not chunks 0 and 1? We can work out why from the algorithm but it’s beyond the scope of this topic. Just wanted to observe it as a point of interest.

Question 1

When should we compress data?

Compressing words.txt as a whole using the same compression as self_encryption (brotli level 6), the size becomes 1453829 bytes (1.386 MiB).

When words.txt is split into 1 MiB chunks and each chunk is individually compressed, it gives 1456368 bytes, ie 2539 bytes more; there’s not much size difference whether the data is compressed before or after chunking.

However the benefit of precompressing is that it produces fewer chunks in total, 3 instead of 5:

484624 ec99b22fbce573414d692a8feb09c1bbf90c8d914f0803a98f5fdcd71f1d9bc4
484624 97ff008d97849721d4c87446b928d37745d3f9bd4c71c2b2c32e14c57c00abe9
484624 e9e227969f333dd501016762d08fe27c9f9801e50faf2a4090db156d86ce5557
---
1453872 Total - 1.39 MiB
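The chunk-count difference can be reproduced with a stdlib codec (zlib standing in for brotli; the helper names are mine, not self_encryption’s):

```python
import zlib

MAX_CHUNK_SIZE = 1024 * 1024

def per_chunk(data: bytes, level: int = 6):
    """Current scheme: split first, then compress each chunk."""
    chunks = [data[i:i + MAX_CHUNK_SIZE] for i in range(0, len(data), MAX_CHUNK_SIZE)]
    total = sum(len(zlib.compress(c, level)) for c in chunks)
    return total, len(chunks)

def whole_file(data: bytes, level: int = 6):
    """Proposed scheme: compress first, then split the result."""
    blob = zlib.compress(data, level)
    return len(blob), -(-len(blob) // MAX_CHUNK_SIZE)  # ceiling division

text = b"the quick brown fox jumps over the lazy dog\n" * 120_000  # ~5 MiB
size_a, chunks_a = per_chunk(text)
size_b, chunks_b = whole_file(text)
print(chunks_a, chunks_b)  # compress-first never needs more chunks
```

For compressible data the compress-first scheme pays once in compressed size (one header instead of many) and collapses the chunk count.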

Would it be best to compress the entire data before self_encryption and not compress each individual chunk? This would reduce the number of chunks. Fewer chunks means fewer requests for the client to make and less node routing.

As for when to compress, the client can choose to manually compress their files before self_encrypting and uploading, which is fine, I accept that as an option, but it doubles the compression work (precompression plus per-chunk compression). Should we maybe add file-level compression to sn_api or sn_cli or sn_client or self_encryption? Also the browser? I feel chunk-level compression is suboptimal compared to file-level compression and maybe we should consider removing chunk-level compression.

Question 2

Should we use the highest compression available, ie change from 6 to 11?

Using 6 vs 11:

size: 1453829 vs 1243206 bytes
compress time: 0.521924085 vs 11.329777847 seconds
decompress time: 0.032754702 vs 0.031691953 seconds

This takes a fair bit longer for the client to do the compression, but since data is stored permanently and may be fetched many many times, it makes sense to do one lot of extra upfront work to improve many future stores and transfers, especially when decompression time is virtually identical.
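The shape of this tradeoff is easy to demonstrate with a stdlib codec (lzma here as a stand-in for brotli, presets 0 vs 9 instead of levels 6 vs 11): a higher preset costs compression time, shrinks the output, and leaves decompression cheap. The test data deliberately spaces its redundancy 400 KB apart, beyond the low preset’s small dictionary.

```python
import lzma, random, time

# ~1.6 MB with redundancy spaced 400 KB apart: a small dictionary (low
# preset) cannot see the repeats, a large one (high preset) can.
data = random.Random(0).randbytes(400_000) * 4

results = {}
for preset in (0, 9):
    t0 = time.perf_counter()
    blob = lzma.compress(data, preset=preset)
    compress_time = time.perf_counter() - t0
    t0 = time.perf_counter()
    lzma.decompress(blob)
    decompress_time = time.perf_counter() - t0
    results[preset] = (len(blob), compress_time, decompress_time)

for preset, (size, tc, td) in results.items():
    print(preset, size, round(tc, 3), round(td, 3))
```

The exact ratios depend on the codec and data, but the pattern (pay once at compression, decompress cheaply forever) is the same one the brotli 6-vs-11 numbers above show.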

Also worth noting here a caveat about interoperability for compression levels 10-11, from brotli readme:

Rust brotli currently supports compression levels 0 - 11. They should be bitwise identical to the brotli C compression engine at compression levels 0-9

Question 3

Should we allow variable compression levels, and variable compression algorithms?

Two-pass compression (eg 7zip) gives much better results. Compression algorithms will improve in the future.

Should we be storing some metadata in the datamap about this?

For example words2.txt (two concatenated copies of words.txt) gives twice as many chunks, whereas a two-pass compression algorithm would avoid that. Same problem shows up using zip one-pass vs 7zip two-pass compression:

words.txt.zip  1284951 bytes
words2.txt.zip 2569994 bytes - twice as big
words2.txt.7z  1228738 bytes - two-pass filter

I personally think 1) no, we should not allow variable compression levels, but 2) yes, we should allow variable compression algorithms. This increases complexity though, and I feel it’s not the domain of self_encryption; it’s probably better to implement at the client layer via something like a Content-Encoding header/metadata.
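If that route were taken, the client-layer marker could be as small as this (a hypothetical sketch; `ContentEncoding` and `StoredFile` are made-up names, not anything in sn_api or the real datamap format):

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical client-layer record: the encoding travels alongside the
# datamap address so any app can decode the blob the same way.
@dataclass(frozen=True)
class ContentEncoding:
    algorithm: str            # e.g. "br", "gzip", or "identity" for none
    level: Optional[int] = None

@dataclass
class StoredFile:
    datamap_address: bytes    # XOR address of the datamap, post-SE
    encoding: ContentEncoding

f = StoredFile(bytes(32), ContentEncoding("br", 6))
print(f.encoding.algorithm)
```

This mirrors how HTTP’s Content-Encoding header decouples transport encoding from identity, without self_encryption itself having to know about algorithms.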

Question 3b

Self_encrypting compressed words.txt uses the medium encryptor to create three 484624-byte chunks.

Would we be better to create one chunk of MAX_CHUNK_SIZE and one remainder chunk, to have only 2 chunks?

It seems like a waste to do 3 mid-sized requests/stores when we could have 2 requests/stores, one full-size and one small.
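The two splitting policies are easy to compare (`MAX_CHUNK_SIZE` and the helper names here are illustrative; the real constants live in self_encryption):

```python
MAX_CHUNK_SIZE = 1024 * 1024

def equal_split(total: int, n: int) -> list[int]:
    """Medium-encryptor style: n chunks of (near-)equal size."""
    base = total // n
    sizes = [base] * n
    sizes[-1] += total - base * n  # last chunk absorbs the remainder
    return sizes

def greedy_split(total: int) -> list[int]:
    """Alternative: full-size chunks plus one remainder chunk."""
    sizes = []
    while total > MAX_CHUNK_SIZE:
        sizes.append(MAX_CHUNK_SIZE)
        total -= MAX_CHUNK_SIZE
    sizes.append(total)
    return sizes

total = 1_453_872  # the compressed words.txt size from above
print(equal_split(total, 3))   # three ~473 KiB chunks
print(greedy_split(total))     # one 1 MiB chunk + one ~396 KiB remainder
```

Greedy packing minimises the chunk count for a given total, at the cost of less uniform chunk sizes.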

Question 4

Clients can upload non-self_encrypted data and nodes have no idea whether the data they hold has been self_encrypted or not. Self_encryption is unenforceable.

Do we need to consider some sort of Content-Encoding header equivalent?

I wonder if we need to better convey to clients the difference between self_encrypted data and plain chunked data. Not all data on the network will necessarily be self_encrypted, but all data will be chunked. How will clients know whether to self_decrypt or not? This could be app-specific, but then data won’t necessarily be interoperable between apps.

Just reaching toward the edges of this problem here, I’m not convinced myself of the importance of this particular question in the short-term. And this also seems outside the scope of self_encryption itself, more likely a client-layer issue.

Question 5

Tied in with the concept of a two-pass compression algorithm, can we skip ahead with self_encryption, and how does the compression algorithm affect this? eg if we have a video that is 20 chunks and we want to skip immediately to halfway, can we read from chunk 10 and still decrypt+decompress without needing to fetch chunks 1-9?

What’s the best way to manage streaming and skipping? This needs consideration for several parts: the chunking process, the compression process, the encryption process.

This question can be answered by closely inspecting the self_encryption algorithm, but I was hoping someone who already knows could simply say yes or no. Keep in mind the ability to read from mid-stream may be a factor for any future changes, so it’s worth being aware of this stream-ability property; even if there’s a simple answer today, we probably want to keep considering it with future changes.
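Per-chunk compression is what makes seeking plausible at all: each chunk decompresses on its own, whereas one whole-file compression stream would have to be decoded from the start. A stdlib sketch (zlib for brotli, encryption omitted; in the real library you’d also fetch neighbouring chunks for key material):

```python
import zlib

CHUNK = 1024 * 1024

def pack(data: bytes) -> list[bytes]:
    """Per-chunk compression: each chunk is independently decompressible."""
    return [zlib.compress(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]

def read_chunk(packed: list[bytes], index: int) -> bytes:
    """Random access: decompress one chunk, no predecessors needed."""
    return zlib.decompress(packed[index])

data = bytes(20 * CHUNK)           # a 20-chunk 'video'
packed = pack(data)
half = read_chunk(packed, 10)      # jump straight to the middle
assert half == data[10 * CHUNK:11 * CHUNK]
```

A single whole-file compression pass (Question 1’s proposal) would trade this random-access property away, which is exactly the tension flagged in the replies below.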

Question 6

Brotli has multithreaded compression as of v3 “so multiple threads can operate in unison on a single file”.

I couldn’t find an example so haven’t tested it but maybe this is something to consider in the future? Especially if it increases the viability of higher compression ratios?

Question 7

Fundamentals #19 says

Only ever use encrypted services and encrypted traffic
Services or traffic must be encrypted if they are to be used by the Safe Network.

Since self_encryption is optional (clients can upload plain blobs) should we consider adding chunk encryption at the node level?

I assume nodes can’t enforce self_encryption. Is this a fair assumption?

I also assume we can’t force nodes to encrypt data, they can run a node that stores whatever is given to them directly.

We can assume that traffic is encrypted, since a node that tries to send unencrypted traffic would be booted; sending it unencrypted simply isn’t possible.

Maybe to achieve this goal we could have every chunk self_encrypted by the nodes? I’m not really sure about this, but I wonder how far we take this point - is it only for traffic, or also for data at rest?

The original motive for this question was: if someone uploads a plain 900K blob of a gore image (not self_encrypted, in a single chunk), and that image happens to land in my node, and I happen to open the chunks directory in my file explorer, and my file explorer happens to automatically render thumbnails, I will see gore that I don’t want to see. This is a kind of symbolic test case that covers many other situations (law enforcement, malware etc) and I feel it leads to fundamental #19 failing. Just looking for opinions, maybe I’m being too fundamentalist here :wink:

But also I’m trying to acknowledge the limitations of self_encryption so we aren’t talking around different conceptual boundaries.

29 Likes

This is great stuff @mav !! Never thought about self-encryption being optional, but yeah, if the nodes don’t enforce it (and how would they know anyway?), then it seems it certainly is.

As for the compression stuff - I think you’ve discovered “Middle-out” :wink: Pied Piper’s not-so-proprietary compression scheme! lol But seriously, great food for thought.

8 Likes

Yes, you are right. But finding that perfect size would be computationally hard. The compression ratio depends on the type of data, so two 1 MiB files can compress to different sizes. We would have to find the perfect size of uncompressed data which leads to a 1 MiB chunk size after compression. Doing this would make us run compression multiple times per chunk before we push to the network. Micro-optimizations might be possible though, but we would need more knowledge of how to “predict” the size after compression.

Yes, that is intended, and it happens because chunks 0 and 1 are encrypted using the last and second-last chunks as keys (see the diagram in the github.com/maidsafe/self_encryption readme). This makes those chunks different after encryption for the two files (words.txt and words2.txt) and hence gives different hashes.

I will answer more of your questions in the coming posts.

11 Likes

Really nice observation. But I am worried about how de-duplication would be affected. As you said in the example of words.txt and words2.txt, the deduplication we see right now (chunks 2 and 3) might not survive the pre-compression idea you propose. But I believe this idea has a lot of merit, as it reduces the chunk count considerably.

Maybe we can add it to the config? We might say that the higher the level, the more time to compress before pushing? Might need your thoughts here @lionel.faber

Yes we can. The data map contains the keys and addresses of individual chunks, so we can stream things this way! Though we would need special chunking for video streams, and new clients for video data types.

The final question 7 is something I would need @dirvine’s help with!

Thank you @mav for this in-depth questionnaire, this would surely help self_encryption become better!

8 Likes

It’s really a trade-off between RAM/disk space and streaming. Compressing as a first step for large files means writing a temp compressed file then SE-ing it. Good to look at this though.

Yes this was back when speed was considered important, but I reckon now space efficiency wins. We should try this at least. @kanav

The issue is deduplication: if there were say 4 levels or 4 algos, then instead of 1 set of chunks we have many. Is it a problem? I am not so sure these days, but it still feels like it would be?

I agree it’s not specifically an SE issue, but it is an issue. We are looking to enforce that all data uploaded is either pointers in mutable data containers or chunks. I would love to force SE data, but it is not simple. I see it so that SE is so easy it’s the default choice, but we need to prevent bad guys uploading plain data. There are a few ways to detect that, but none rock solid yet. It’s likely we will enforce Elder-encrypted data to Adults.

Yes we can seek in SE.

Yes, to read chunk 10 we also need chunks 8 and 9, but that should be OK. Actually that affects Q1: if we did a compression pass on the whole file first it would break this (video won’t likely compress, but …)

I think we will have no choice here, to cover all bases.

9 Likes

If I’m uploading a compressed video, or a png, or a 7zip file, it would be a waste of time and power to compress it again with any algorithm. I think there should be a choice to opt out of compression, maybe even defaults on/off depending on file type (video, image, zip)? You will pay more if you upload larger files; if you’re fine with that and/or know what you’re doing then fine, right?

I feel like uploading large files (over 100 MB) would be quite uncommon, no? And for the most part, if you’re uploading a large file it’s probably going to be a video or zip, which tend to be very compressed already, so we might as well skip compression entirely then?

There are probably not many drawbacks to using one standard compression setting, just in case it does make a difference. But the actual size of files probably makes a bigger difference, especially since I guess different people are unlikely to upload the same file several times over time. So we should not be afraid to update to a better algorithm when they come.

1 Like

That is exactly what people do on various private sharing servers with movies and other things.

1 Like

Is it possible to just detect video (and audio?) files and skip compression altogether? We should be able to at least detect based on the type/extension known to the filesystem. It really seems a waste of resources to compress such files, as they tend to be very large by comparison (and already highly compressed), and adding decompression (while super fast in any case) is still an extra drag when attempting to stream.

1 Like

You are great at researching all of this stuff in detail!

1 Like

I think it’s OK as they won’t compress. We used to try and test for that in SE but the work is almost too much. So we now say if they don’t compress it’s cool.

Yea, we cannot rely on extensions. You can easily fake them and also hide other stuff in these files, ie I can make a pdf that will open as a pdf but also run other code on your computer that you won’t see. I am sure some activists are attacked this way: (child) porn is found on their computer, but it can be put there by opening a doc/pdf/even txt file that is actually an exe; although it displays what you expect, it also does other things :wink: not always good things.

tl;dr It’s easiest to just run the lot through compression; as it’s client-side, the overhead on the network is zero, and the client-side cost should be minimal.

2 Likes

Is the XOR address of the file from before its compression?
If that’s the case, then could the network evolve and recompress files over time to a preferred solution?

I’m thinking there’s “of the network” compression, the “normal” for all data that clients should expect - everybody deflates gzip for everything - and then clients doing whatever with their own files ahead of time.

re Question 1

There are common kinds of compression that modern browsers use, like Apache mod_deflate gzip. Perhaps using that for all data would work well: whatever is put to browsers they can handle directly, and other routes for data can accommodate that overhead as the normal.

re Question 2 & Question 3

Compression is a one-time action; if decompression time is identical and/or the number of chunks can be reduced, then it seems obviously worth the extra effort for the effect over time.

re Question 5

If a file is accessible for streaming, then surely it is not compressed or encrypted as one file; so you grab a chunk, the network decompresses and decrypts it to the ‘normal’ (deflated gzip of whatever), and what the user sees is that segment of the file - which surely would be readable in whatever way streaming works.

I’m guessing, but perhaps it’ll work to try this on a random snip of a video file?.. If it doesn’t just work, then worst case “video” would need to be declared at upload, so that frames can be cut in the right places to be put in chunks that can each be read as one chunk only.

re Question 7

I would have thought it’s obvious node data needs to be obscure. It’ll be much easier to defend the idea of grey goo in the nodes than any risk of specifics.

If node data is accessible directly, you’ll have people writing scripts to query it or, worse, censoring what is there and destroying content they do not want or approve of, which perhaps in theory corrupts the quality of the node in a detectable way, but is not good.

I had in mind that nodes would act like mounted encrypted drives - taking the space that is allocated as their own and writing to it; so if the mount falls over, it is not accessible. The question then is how it recovers.

So, I was expecting nodes to use a key from the network, or one unique to themselves - storing that key on the network perhaps, but I don’t know if this causes a problem with node recovery when a network has fallen over entirely. I don’t know how a node can ask the network “This is my name, can you unlock my data?” without that being hackable.

should we consider adding chunk encryption at the node level?

Yes, I think so. Do it once, do it well, despite it being a bit more effort.
Users will be lazy; the network should be robust.

If the only way to access data, is with a tool that is doing essentially the same as the network, then that can be defended. If a single node’s data is accessible without the network, fragmented as that might be, then that would be controversial?

1 Like

If upgrading the compression changes XOR URLs, I’d worry how that might affect things where ids are involved. For example on clearnet, HTTP URLs are used as identifiers in RDF triples. Would these HTTP URLs be self encrypted/compressed to create an equivalent XOR URL on SAFE? Could apps be affected if suddenly one id has multiple XOR URLs created with different compression algorithms or compression levels?

Or let’s say it was decided that a good way to create an XOR URL for an entity was to “hash” - by doing compression and self encryption - some other data, for example the Wikipedia id of the entity. Then if compression upgrades change the hash, there will be multiple ids for the same entity over time.

1 Like

I’m expecting the XOR is necessarily from before the “normal” compression like gzip, and also before any network-level compression. Worth checking that though, given the confusions you outline.

There’s a user-level risk of confusion over user-level compression of the same file… if they compress it, then as expected that would be a different XOR… and there might be reasons to want that too.

To allow network evolution the XORs would need to be relative to uploaded file, not relative to the network method.

1 Like

No, so we have:
XorAddress of the data map - only after SE
XorAddress of a chunk - after SE

So the address of data is all post SE, unless it’s mutable data and that is only going to be metadata, not actual data.

5 Likes

I do not know whether the usage of self-encryption is being disputed here, but I’d like to remind us that self-encryption’s other goal is to protect a node’s owner from being liable for the content they store on their computer.

Nodes will store all kinds of content from the network, legal or illegal, and it is all fine as long as they have no ability to view or use this information.

5 Likes

Maybe the client API could have an option to disable compression, so it could be disabled in cases where the XorAddress is supposed to be a stable and unique hash of the content.

1 Like

Been thinking about this bad-actor-not-using-SE issue/question and how to solve it.

Wondering if a semi-secret (user-specific) key encryption layer with a network-wide verification code could be used.

So:
{[{SE}+verifyCode]semi-secret-key} - sent to the node along with the semi-secret key … the node decrypts with the semi-secret key and checks for the verification code; if it’s there, all good, else an error goes back to the user. The node then discards the semi-secret key.

In this way, all data stored on network is proven encrypted.

1 Like

Nice one, but … de-duplication. It’s a nice wee problem, not huge though, we will get there for sure.

3 Likes

But if the node can decrypt the outer layer, then the XOR address can still be created from the SE layer minus the verification code. So shouldn’t it be the same as now?

Sorry, I’m a fool in these matters I suppose; I’m probably missing something obvious here and just wasting your time.

1 Like

There is an even simpler approach. Pick a few random parts of the file, whose total size is small compared to the original. Try to compress them. If that does not improve the size, skip compression for the whole file.
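That heuristic might look something like this (a sketch with zlib standing in for brotli; the sample count, window size, and threshold are arbitrary choices):

```python
import zlib, random

def probably_compressible(data: bytes, samples: int = 4, window: int = 4096,
                          threshold: float = 0.95) -> bool:
    """Compress a few random windows; skip whole-file compression
    if the samples barely shrink."""
    rng = random.Random(0)  # fixed seed only for reproducibility here
    if len(data) <= window * samples:
        return len(zlib.compress(data)) < len(data) * threshold
    total_in = total_out = 0
    for _ in range(samples):
        start = rng.randrange(len(data) - window)
        part = data[start:start + window]
        total_in += len(part)
        total_out += len(zlib.compress(part))
    return total_out < total_in * threshold

text = b"a very repetitive line of text\n" * 100_000     # compresses well
noise = random.Random(1).randbytes(2_000_000)            # already high-entropy
print(probably_compressible(text), probably_compressible(noise))
```

Sampling reads only a few KiB regardless of file size, so the check costs almost nothing compared to compressing a large already-compressed video or archive.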

4 Likes