I see what you’re getting at.
But the amount of duplication would be the same so long as people follow the same process (which self encryption would standardize anyhow).
If we have multiple algorithms and multiple compression levels then yes, that would affect deduplication, because the various combinations give different chunks for the same input. Other things get in the way of deduplication too: different encodings, slight modifications, and metadata embedded within the data itself (such as temporal context), so I'm guessing deduplication is never perfect anyhow.
File-level compression would leave us with the same amount of deduplication as chunk-level compression (provided it's used the same way by everyone, as chunk-level compression is intended to be).
Could be part of the config.
One thing to note here is that decompression does not require the level to be specified, so it's the exact same code to decompress level 6 or 11 or any other level, which simplifies reading (no need to store compression metadata):
```rust
use std::io::Cursor;

let mut decompressed = vec![];
brotli::BrotliDecompress(&mut Cursor::new(compressed), &mut decompressed).expect("failed to decompress");
```
So the compression level does not need to be stored with the data itself, and only affects the compression step.
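For completeness, here's what the compression side could look like. A minimal sketch using the brotli crate, assuming `data: &[u8]`; the quality value here is arbitrary and only the writer ever chooses it:

```rust
use std::io::Cursor;
use brotli::enc::BrotliEncoderParams;

let mut params = BrotliEncoderParams::default();
params.quality = 6; // writer-side choice only; never stored with the data

let mut compressed = vec![];
brotli::BrotliCompress(&mut Cursor::new(data), &mut compressed, &params)
    .expect("failed to compress");
```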
Just wanted to clarify this.
I thought that since brotli is a streaming compression algorithm we would be able to do self encryption in a streaming way, without needing the entire file in RAM or on disk to compress and encrypt. I'll have a closer look at how self encryption changes with full-file compression rather than chunk compression. Thanks for pointing this out.
I tried decompressing just the second half of a brotli-compressed file and it does not decompress; it looks like the whole stream is needed from the start. So it does seem that compression at the chunk level is needed to enable streaming, and full-file compression requires the full file to be fetched before we can decompress it. Bummer.
I wonder if we can do something similar to video keyframes within brotli, allowing full-file compression but with decompression able to begin from any keyframe within the bitstream?
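Short of keyframes inside the bitstream, chunk-level compression already gives us keyframe-like reset points. A sketch, assuming each chunk is compressed as its own independent brotli stream (the function name is mine):

```rust
use std::io::Cursor;
use brotli::enc::BrotliEncoderParams;

/// Compress each chunk independently, so decompression can start at any
/// chunk boundary instead of needing the whole file from the beginning.
fn compress_chunks(chunks: &[Vec<u8>]) -> std::io::Result<Vec<Vec<u8>>> {
    let params = BrotliEncoderParams::default();
    let mut out = Vec::with_capacity(chunks.len());
    for chunk in chunks {
        let mut compressed = vec![];
        brotli::BrotliCompress(&mut Cursor::new(chunk.as_slice()), &mut compressed, &params)?;
        out.push(compressed);
    }
    Ok(out)
}
```

To stream a byte range, fetch only the chunks covering it and decompress each one on its own.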
If we’re aiming for deduplication to save space, I feel that’s not a problem here and any amount of duplication for any reason is acceptable.
But if deduplication is for the purpose of organizing data, deriving statistics, or caching, then I guess the degree of duplication will affect the quality of those activities.
So I guess how strictly we pursue deduplication depends on the motive for it. I personally don't care about deduplication too much (I care a lot, but only for one very specific reason; the many other reasons for deduplication I don't care for).
Yes, being easy and efficient will make it the logical choice for uploaders. If people think they can get a better deal or better performance some other way, they will definitely migrate to it. If they can get 3 chunks instead of 5, they will do it.
We could try using magic bytes; something similar already happens in sn_api to detect the mimetype.
Not sure if this is secure or accurate enough but it’s a possible first step for investigating this path.
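Something along these lines, purely as a sketch; the signature list is illustrative and not what sn_api actually does:

```rust
/// Hypothetical pre-compression check: skip brotli for formats that are
/// already compressed, based on well-known magic-byte prefixes.
fn already_compressed(data: &[u8]) -> bool {
    const MAGIC: &[&[u8]] = &[
        b"\x1f\x8b",          // gzip
        b"PK\x03\x04",        // zip (also docx, jar, apk)
        b"\xff\xd8\xff",      // jpeg
        b"\x89PNG\r\n\x1a\n", // png (deflate-compressed internally)
        b"OggS",              // ogg audio/video
        b"\x1a\x45\xdf\xa3",  // matroska / webm
    ];
    MAGIC.iter().any(|magic| data.starts_with(magic))
}
```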
Brotli (used by self encryption) is also available in browsers and is superior: see "The current state of Brotli compression".
This is an interesting little brotli design aspect to know about:
From https://brotli.io/:

> Brotli uses a pre-defined 120 kilobyte dictionary, in addition to the dynamically populated ("sliding window") dictionary. The pre-defined dictionary contains over 13000 common words, phrases and other substrings derived from a large corpus of text and HTML documents. Using a pre-defined dictionary has been shown to increase compression where a file mostly contains commonly used words.
Brotli’s decoding memory use is limited to 16 MB. This enables decoding on mobile phones with limited resources, but makes Brotli underperform on compression benchmarks having larger files.
Depends what we mean by ‘same entity’. I consider the same file encrypted or compressed two different ways to be three different entities (the original plus two modified versions).
Yeah, simple enough. I was thinking we could use a similar idea to enforce compression at the network level, but it might be too much work:
- Client uploads chunk to node
- Node compresses the chunk
- If the chunk comes out more than 1% smaller, the client has not compressed it, so the fee is taken but the data is not stored and an error is returned (the client is punished); see the sketch after this list.
- If the node skips the compression check and relays an uncompressed chunk, the next node will run the same check; if it finds an uncompressed chunk has been relayed, the relaying node can be punished (since they signed it, they definitely did relay it, and that's verifiable).
- Nobody will upload or relay uncompressed data.
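A minimal sketch of the size check, assuming nodes recompress with default brotli settings and the 1% threshold above (the function name is mine):

```rust
use std::io::Cursor;
use brotli::enc::BrotliEncoderParams;

/// Recompress an incoming chunk; if it shrinks by more than 1%, the
/// client evidently did not compress it and the upload is rejected.
fn is_plausibly_compressed(chunk: &[u8]) -> std::io::Result<bool> {
    let mut recompressed = vec![];
    brotli::BrotliCompress(
        &mut Cursor::new(chunk),
        &mut recompressed,
        &BrotliEncoderParams::default(),
    )?;
    // Already-compressed data should not compress much further.
    Ok(recompressed.len() as f64 >= chunk.len() as f64 * 0.99)
}
```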
I guess the impact of the choice to compress or not goes beyond just the uploader though.
Forcing compression also ensures the nodes are storing an efficient representation of the data. It's almost like security: we don't allow insecure traffic on the network, and data compression is another aspect of that discipline (if not for the user then for the nodes storing the data).
Maybe I'm conflating two separate things here? And if the uploader pays more for uncompressed data, why do nodes care? They get the benefit of the fee.
I see what you’re getting at though. Don’t take away choices, give maximum freedom. That’s reasonable.
Not in the current form. It would triple the chunks. This code comment explains why:
From `medium_encryptor.rs`:

```rust
/// An encryptor for data which will be split into exactly three chunks (i.e. size is between
/// `3 * MIN_CHUNK_SIZE` and `3 * MAX_CHUNK_SIZE` inclusive). Only `close()` will actually cause
/// chunks to be stored. Until then, data is held internally in `buffer`.
```
- A 1 MiB chunk arrives at my node.
- My node does a second round of SE on that chunk.
- Since the chunk is between 3 * MIN_CHUNK_SIZE (i.e. 3 KiB) and 3 * MAX_CHUNK_SIZE (i.e. 3 MiB), my node will "split into exactly three chunks".
So every chunk from the first SE pass turns into three in the second SE pass. Triple the chunks!
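The arithmetic in miniature; the constants here just mirror the 3 KiB and 3 MiB bounds from the comment, not the real self_encryption code:

```rust
const MIN_CHUNK_SIZE: u64 = 1024;        // 1 KiB
const MAX_CHUNK_SIZE: u64 = 1024 * 1024; // 1 MiB

/// Does a blob of this size land in the medium encryptor's band?
fn splits_into_exactly_three(len: u64) -> bool {
    (3 * MIN_CHUNK_SIZE..=3 * MAX_CHUNK_SIZE).contains(&len)
}

fn main() {
    // A 1 MiB chunk from the first pass is inside the band, so the
    // second pass turns it into three chunks: n chunks become 3n.
    assert!(splits_into_exactly_three(1024 * 1024));
}
```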
The simple-but-not-perfect solution could work like this:
- Nodes generate an in-memory storage encryption key when they start, which is local and stays local. The node knows it but the operator does not and cannot (or at least the key is extremely difficult to extract from memory). A rough sketch follows this list.
- Just before writing chunks to disk, the node encrypts using the local storage encryption key.
- As soon as chunks are read from disk they're decrypted, then sent off for processing and transfer.
- The operator cannot read the chunks, nor can any third party (unless they can pull the key from memory).
- There’s no extra load for the network since this is all local.
- Operators who don’t mind the risk of storing chunks unencrypted can run their node without this feature, that’s up to them.
Some broader thoughts on compression in general…
Brotli uses compression techniques tailored to HTML content. Video encoding obviously uses a lot of tricks specific to the nature of video. 7zip selects from a variety of optimizations depending on the first pass. Optimizing compression means it must be context-specific.
I wonder if compression within a generic library like self_encryption is inappropriate? Maybe it should only be responsible for chunking and encrypting? I doubt it; the compression seems suitable to have, but it does trigger some wrong-level-of-abstraction flags.
I also noticed the self_encryption readme says compression is optional, but there's nowhere in the code to set it as optional; chunks are always compressed.
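If it were made optional, I'd expect something as small as this (hypothetical; nothing like it exists in self_encryption today):

```rust
/// Hypothetical config knob for self_encryption.
pub struct EncryptorConfig {
    /// When false, chunks are encrypted but never compressed.
    pub compress: bool,
    /// Brotli quality, used on the write path only.
    pub quality: i32,
}

impl Default for EncryptorConfig {
    fn default() -> Self {
        // Compression on by default to match today's always-on behaviour;
        // the quality value here is arbitrary.
        Self { compress: true, quality: 6 }
    }
}
```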
Sorry for a bit of a braindump but the conversation is much more interesting and diverse than I had originally imagined.