I see what you’re getting at.
But the amount of duplication would be the same so long as people follow the same process (which self encryption would standardize anyhow).
If we have multiple algorithms and multiple compression levels then yes, that would affect deduplication, because the various combinations give different chunks for the same input. Other things get in the way of deduplication too: different encodings, slight modifications, and metadata embedded within the data itself (such as temporal context), so I'm guessing deduplication is never perfect anyhow.
File-level compression would leave us with the same amount of deduplication as chunk-level compression (provided it's used the same way by everyone, as chunk-level compression is intended to be).
Could be part of the config.
One thing to note here is that decompression does not require the level to be specified, so it's the exact same code to decompress level 6 or 11 or any other level, which simplifies reading (no need to store compression metadata):
```rust
use std::io::Cursor;

let mut decompressed = vec![];
brotli::BrotliDecompress(&mut Cursor::new(compressed), &mut decompressed).expect("failed to decompress");
```
So the compression level does not need to be stored with the data itself, and only affects the compression step.
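For completeness, here's what the compression side could look like. A minimal sketch using the brotli crate, assuming `data: &[u8]`; the quality value here is arbitrary and only the writer ever chooses it:

```rust
use std::io::Cursor;
use brotli::enc::BrotliEncoderParams;

let mut params = BrotliEncoderParams::default();
params.quality = 6; // writer-side choice only; never stored with the data

let mut compressed = vec![];
brotli::BrotliCompress(&mut Cursor::new(data), &mut compressed, &params)
    .expect("failed to compress");
```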
Just wanted to clarify this.
I thought that since brotli is a streaming compression algorithm we would be able to do self encryption in a streaming way, without needing the entire file in RAM or on disk to compress and encrypt. I'll have a closer look at how self encryption changes with full-file compression rather than chunk compression. Thanks for pointing this out.
I tried decompressing just the second half of a brotli-compressed file and it does not decompress; it looks like the whole stream is needed from the start. So it does seem that compression at the chunk level is needed to enable streaming, and full-file compression requires the full file to be fetched before we can decompress it. Bummer.
I wonder if we can do something similar to video keyframes within brotli, allowing full-file compression but with decompression able to begin from any keyframe within the bitstream?
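Short of keyframes inside the bitstream, chunk-level compression already gives us keyframe-like reset points. A sketch, assuming each chunk is compressed as its own independent brotli stream (the function name is mine):

```rust
use std::io::Cursor;
use brotli::enc::BrotliEncoderParams;

/// Compress each chunk independently, so decompression can start at any
/// chunk boundary instead of needing the whole file from the beginning.
fn compress_chunks(chunks: &[Vec<u8>]) -> std::io::Result<Vec<Vec<u8>>> {
    let params = BrotliEncoderParams::default();
    let mut out = Vec::with_capacity(chunks.len());
    for chunk in chunks {
        let mut compressed = vec![];
        brotli::BrotliCompress(&mut Cursor::new(chunk.as_slice()), &mut compressed, &params)?;
        out.push(compressed);
    }
    Ok(out)
}
```

To stream a byte range, fetch only the chunks covering it and decompress each one on its own.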
If we’re aiming for deduplication to save space, I feel that’s not a problem here and any amount of duplication for any reason is acceptable.
But if deduplication is for the purpose of organizing data, deriving statistics, or caching, then I guess the degree of duplication will affect the quality of those activities.
So I guess how strictly we pursue deduplication depends on the motive for it. I personally don't care about deduplication too much (I care a lot, but only for one very specific reason; the many other reasons for deduplication I don't care for).
Yes, being easy and efficient will make it the logical choice for uploaders. If people think they can get a better deal or better performance some other way, they will definitely migrate to it. If they can get 3 chunks instead of 5, they will do it.
We could try using magic bytes; something similar already happens in sn_api to detect the mimetype.
Not sure if this is secure or accurate enough but it’s a possible first step for investigating this path.
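Something along these lines, purely as a sketch; the signature list is illustrative and not what sn_api actually does:

```rust
/// Hypothetical pre-compression check: skip brotli for formats that are
/// already compressed, based on well-known magic-byte prefixes.
fn already_compressed(data: &[u8]) -> bool {
    const MAGIC: &[&[u8]] = &[
        b"\x1f\x8b",          // gzip
        b"PK\x03\x04",        // zip (also docx, jar, apk)
        b"\xff\xd8\xff",      // jpeg
        b"\x89PNG\r\n\x1a\n", // png (deflate-compressed internally)
        b"OggS",              // ogg audio/video
        b"\x1a\x45\xdf\xa3",  // matroska / webm
    ];
    MAGIC.iter().any(|magic| data.starts_with(magic))
}
```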
Brotli (used by self encryption) is also available in browsers and is superior: see "The current state of Brotli compression".
This is an interesting little brotli design aspect to know about:
From https://brotli.io/:

> Brotli uses a pre-defined 120 kilobyte dictionary, in addition to the dynamically populated ("sliding window") dictionary. The pre-defined dictionary contains over 13000 common words, phrases and other substrings derived from a large corpus of text and HTML documents. Using a pre-defined dictionary has been shown to increase compression where a file mostly contains commonly used words.
Brotli’s decoding memory use is limited to 16 MB. This enables decoding on mobile phones with limited resources, but makes Brotli underperform on compression benchmarks having larger files.
Depends what we mean by ‘same entity’. I consider the same file encrypted or compressed two different ways to be three different entities (the original plus two modified versions).
Yeah, simple enough. I was thinking we could use a similar idea to enforce compression at the network level, but it might be too much work:
- Client uploads chunk to node
- Node compresses the chunk
- If the chunk comes out more than 1% smaller, the client has not compressed it, so the fee is taken but the data is not stored and an error is returned (the client is punished); see the sketch after this list.
- If the node skips the compression check and relays an uncompressed chunk, the next node will run the same check; if it finds an uncompressed chunk has been relayed, the relaying node can be punished (since they signed it, they definitely did relay it, and that's verifiable).
- Nobody will upload or relay uncompressed data.
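A minimal sketch of the size check, assuming nodes recompress with default brotli settings and the 1% threshold above (the function name is mine):

```rust
use std::io::Cursor;
use brotli::enc::BrotliEncoderParams;

/// Recompress an incoming chunk; if it shrinks by more than 1%, the
/// client evidently did not compress it and the upload is rejected.
fn is_plausibly_compressed(chunk: &[u8]) -> std::io::Result<bool> {
    let mut recompressed = vec![];
    brotli::BrotliCompress(
        &mut Cursor::new(chunk),
        &mut recompressed,
        &BrotliEncoderParams::default(),
    )?;
    // Already-compressed data should not compress much further.
    Ok(recompressed.len() as f64 >= chunk.len() as f64 * 0.99)
}
```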
I guess the impact of the choice to compress or not goes beyond just the uploader though.
Forcing compression also ensures the nodes are storing an efficient representation of the data. It's almost like security: we don't allow insecure traffic on the network, and data compression is another aspect of that discipline (if not for the user then for the nodes storing the data).
Maybe I'm conflating two separate things here? And if the uploader pays more for uncompressed data, why do nodes care? They get the benefit of the fee.
I see what you’re getting at though. Don’t take away choices, give maximum freedom. That’s reasonable.
Not in the current form. It would triple the chunks. This code comment explains why:
From `medium_encryptor.rs`:

```rust
/// An encryptor for data which will be split into exactly three chunks (i.e. size is between
/// `3 * MIN_CHUNK_SIZE` and `3 * MAX_CHUNK_SIZE` inclusive). Only `close()` will actually cause
/// chunks to be stored. Until then, data is held internally in `buffer`.
```
- A 1 MiB chunk arrives at my node.
- My node does a second round of SE on that chunk.
- Since the chunk is between 3 * MIN_CHUNK_SIZE (i.e. 3 KiB) and 3 * MAX_CHUNK_SIZE (i.e. 3 MiB), my node will "split into exactly three chunks".
So every chunk from the first SE pass turns into three in the second SE pass. Triple the chunks!
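The arithmetic in miniature; the constants here just mirror the 3 KiB and 3 MiB bounds from the comment, not the real self_encryption code:

```rust
const MIN_CHUNK_SIZE: u64 = 1024;        // 1 KiB
const MAX_CHUNK_SIZE: u64 = 1024 * 1024; // 1 MiB

/// Does a blob of this size land in the medium encryptor's band?
fn splits_into_exactly_three(len: u64) -> bool {
    (3 * MIN_CHUNK_SIZE..=3 * MAX_CHUNK_SIZE).contains(&len)
}

fn main() {
    // A 1 MiB chunk from the first pass is inside the band, so the
    // second pass turns it into three chunks: n chunks become 3n.
    assert!(splits_into_exactly_three(1024 * 1024));
}
```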
The simple-but-not-perfect solution could work like this:
- Nodes generate an in-memory storage encryption key when they start, which is local and stays local. The node knows it but the operator does not and cannot (or at least the key is extremely difficult to extract from memory). A rough sketch follows this list.
- Just before writing chunks to disk, the node encrypts using the local storage encryption key.
- As soon as chunks are read from disk they're decrypted, then sent off for processing and transfer.
- The operator cannot read the chunks, nor can any third party (unless they can pull the key from memory).
- There’s no extra load for the network since this is all local.
- Operators who don’t mind the risk of storing chunks unencrypted can run their node without this feature, that’s up to them.
Some broader thoughts on compression in general…
Brotli uses compression techniques tailored to HTML content. Video encoding obviously uses a lot of tricks specific to the nature of video. 7zip selects from a variety of optimizations depending on the first pass. Optimizing compression means it must be context-specific.
I wonder if compression within a generic library like self_encryption is inappropriate? Maybe it should only be responsible for chunking and encrypting? I doubt it; the compression seems suitable to have, but it does trigger some wrong-level-of-abstraction flags.
I also noticed the self_encryption readme says compression is optional, but there's nowhere in the code to set it as optional; chunks are always compressed.
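If it were made optional, I'd expect something as small as this (hypothetical; nothing like it exists in self_encryption today):

```rust
/// Hypothetical config knob for self_encryption.
pub struct EncryptorConfig {
    /// When false, chunks are encrypted but never compressed.
    pub compress: bool,
    /// Brotli quality, used on the write path only.
    pub quality: i32,
}

impl Default for EncryptorConfig {
    fn default() -> Self {
        // Compression on by default to match today's always-on behaviour;
        // the quality value here is arbitrary.
        Self { compress: true, quality: 6 }
    }
}
```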
Sorry for a bit of a braindump but the conversation is much more interesting and diverse than I had originally imagined.