Data Density Attack


#41

That would be a problem for deduplication of ImmutableData.


#42

Yes, I do agree. In this specific case I think I’m coming down (so far) on the side of the safecoin deterrant approach, although I’m certainly not convinced yet.

For example, say we take this approach and allow an attacked section to charge a massively inflated price for PUTs, but that it does have the desired effect of stopping the attacker from being able to cause a catastrophic cascade merge. What we’ve also done is incentivise someone to write a replacement for self-encryption which rather than deduplicating chunks, instead splits and encrypts a file in such a way that the resulting chunks land in low-cost sections, or at least that no chunks will land in sections which will reject them.

Losing the deduplication is an unintended consequence which hurts the network, and is probably impossible to quantify. I think it’s probably better to try and find a way which means a section can’t be pushed to its limit before relying on this approach which allows a section to run at its limit, albeit temporarily.


#43

I have no intuition on the time it takes to brute-force/generate these chunks. Can anyone here provide some kind of a guestimate? Seems like this just puts a constraint on the minimum number of sections so that the section prefixes are long enough to make this impossible at launch given current technology.

This is a really good point. Rather than some kind of crafty adversary, this thread is also concerned with the situation where the network is just very very popular and storage is filling up faster than people bring new hard disks online. (remote possibility yes, but probably as remote as brute forcing enough section prefixes right?) Maybe we could also think of this as the “wd-samsung attack” ie. not enough hdd/ssd manufacturing capacity. Probably a good problem to have… :wink:


#44

Well, I think we can take both approaches: increase-cost and indirection.
That is, the store charge will increase based on the section’s capacity (metrics can be free space left over, or data stored, or any of combination). When section is full (or nearly full), the indirection approach kicks in, i.e. the chunk will be directed to a less density section (we may need to consider how to revoke such mapping when the fully occupied section get free space again, but that can be another story).

By doing this, during most of the time (hopefully all the time), network handle normal users’ requests with normal process, doesn’t incur additional overhead or complexity.
And, the issue of when section is full can be catered as well.


#45

The more I think about it, I like @tfa’s approach to the indirection more and more, it seems like it’s just better/simpler and more natural/robust to do it for every chunk all the time. No need to worry about revoking the maps and other case by case complicatedness. Doing this procedure by default may open up the opportunity to optimize away some of the performance loss through caching or other means.

That was my thought too at first. However, a constant indirection approach reminds me of what I’ve read on the forums about the indirection that occurs when a node joins a group. Unless things have changed, doesn’t a prospective node start out at age zero, then immediately gets passed on to a different group and attains age 1? It would seem only logical for SAFE to implement an analogous method for every data chunk too. Wouldn’t this again be analogous to the to node/section-datachains and complementary chunk-datachains? Self-similar, multi-scale, holographic, fractal… all good signs of an emergent complex system.

Apologies if I’m just repeating what everyone already said, thinking out loud a bit… maybe consider this as just a crude hybridized summary of what tfa, mav, and others have proposed.

  • The first hop of the data into the network uses the hash of the MD address or hash of ImmutableD self-encrypted chunk (ID1), and from there the hash of the previous block just added to either the chunk-datachain or section/node-datachain is uniquely combined with hash1 and then rehashed to hash2 in order to determine a chunk’s final resting place (ID2).

  • The network will still be able to determine whether or not deduplication is applicable to the chunk by checking the indirection map for an existing ID1->ID2 rule.

  • The data chunks in each section would fall into one of two categories, chunks storing the section’s indirection index that maps incoming ID1 to ID2, and actual client deposited chunks that are stored via ID2 which have been redirected from other sections.

  • If you decide from the start that there will be only a single hop for all data in every section, it may be possible to do some optimizations to minimize the performance loss by (as tfa mentioned) sending the data back to the client from the ID2 location (maybe not). If the data is popular, then it is going to be sent from the location of ID2 over to the ID1 location and reside in cache anyhow… right?

  • Constant indirection will keep the data randomized at all times across all sections regardless of what the clients do, so there is really no need to try/build/test logic that tries to determine the best ratio of free space to storage space a section needs have before the nodes start their indirection defense strategy.

  • Now that all chunks are provably randomized throughout the entire network, PUT cost and GET reward is sufficient to keep storage resources under control via safecoin as ALL sections in the network will always fill up at approximately the same rate.

It’s pretty hard to attack “always random” and it’s hard to see how a “data density” adversary could ever get a foothold under this scenario. This also treats every client equally with regard to one’s propensity to launch an attack ie. “Trust no one.” Right? What did I miss? Please correct me if I’m wrong.


#46

The issue is the extra work the network has to do. Its best to limit the indirection to when its essential. Indirection is essentially two (or more) accesses for each chunk. One to initially find the chunk’s actual address then the other (simpler one) to request the actual chunk from the other section.

By doing indirection only when its essential then only a small portion of the data stored is ever redirected. It should be rare that any section is attacked in such a way (because of expense) or that a section fills up in normal use and thus only a very small number of chunks are ever redirected using indirection. Thus a faster network compared to always using indirection on every chunk.


#47

True, but we’re essentially talking about O(Log N) vs. O(2 Log N) or less considering caching. I would agree it starts to cost way too much for 2 or more levels, let’s just consider 1 and always only 1. It might also be good to brainstorm any other positive benefits that naturally arise from constant indirection. Just like the extra resources required to keep all data forever, even junk/lost/destroyed data also serves a purpose by acting as extra mud swirl to help with obfuscation so we justify it’s existence and consider it a benefit. There may also be secondary benefits for constant indirection. I think tfa also mentioned that it might help protect against malicious vaults grabbing lots of GETs on public data they could easily hash the address for. It may cost more up front, but save a lot of complexity in solving other issues. Not that I know what those are :slight_smile: Other examples/ideas?


#48

But the only thing you save by always doing it is a single line of code that tests if to do indirection or not.

The savings for only doing it for essential cases is that you save the second request to another section that has to also come to consensus that its a valid request and the subsequent delays involved. Remember coming to consensus is the major part of any chunk retrieval. Maybe 6/10ths or 8/10ths of the total work by all parts of the network. And greatest delays since the nodes coming to consensus are global. Doubling it means that it becomes 12/10ths to 16/10ths of the work for no indirection.

We might be talking of a whole 0.001% of all chunk/MD requests globally needing indirection due to critical storage conditions in a section.

So the savings are based on a global basis. If there are 100 Trillion request per period and with (single) indirection always then there are actually 200 Trillion consensus decisions being made to do that.

Now if I said I could save 99.99 trillion of those 200 trillion consensus events, would you think that is a good idea. Then you ask at what costs? I say one line of code, a decision statement to see if to indirect or directly store. Why would you not do that. The decision statement is testing if critical storage condition exists.


#49

If out of the collective actions of those extra 99.9 trillion requests emerges an effective solution to a variety of other security issues while also offering a very simple and straightforward approach, and better optimizations can be had by coding for the constant indirection case. Again, I’m arguing with respect to my own ignorance on the matter, hoping that someone else will prove my point. I didn’t say I know what those issues are that could be addressed by constant indirection or that any even exist. (Remember, I was the guy who’s first comment was essentially “bahh humbug… why fuss with indirection, SHA3-512 fixes everything, even leaky faucets…”) Now I’m looking at it from the other extreme. There’s a saying about how when the number of objectives one is required to satisfy increases greatly in number, all solutions tend to become equally optimal. Secondary information can become the deciding factor. I’m just asking the question as to whether or not constant indirection would solve issues that other people may be aware of, and do so in a simple and straightforward manner. What’s the expression? “Premature optimization is the root of all evil…” something like that. Regardless, for now I agree with you. If it turns out that devs want to do it all the time, then they can just comment out your conditional statement. :smile:


#50

Yes, I forgot that.

How about using a second indirection service used only at creation time.

Hash of chunk produces an initial ID0. The data managers for ID0:

  • generate ID1 if they haven’t already done so in the past and save the correspondence ID0 => ID1 (for ulterior creators of same chunk),

  • warn data managers for ID1 that a client will put a chunk at that address,

  • ask them to generate an ID2 and create an indirection element ID1 => ID2,

  • lastly send ID1 back to the client.

The client stores ID1 in the data map and put the chunk at this address. The rest is as before with ID1 and ID2.

Note that the data managers for ID0 are only solicited at chunk creation or recreation times. The requests for get and update operations go directly to ID1 data managers.


#51

I think that’s still vulnerable unfortunately.

If I have a vault holding the chunk under ID2, I can hash it to create ID0 then tell my malicious client to pretend to have just created this chunk. It’ll then be told by the ID0 vaults what ID1 is, and the client can then do the GET for it.


#52

Yes, you are right. I am beginning to think that this branch doesn’t go anywhere. Your proposition that first vault encrypts the data before sending it to the second one seems better.


#53

Totally back-of-envelope stuff but my 4 year old laptop can hash 1MB chunks at about 140 sha256 hashes per second (safenet uses sha3_256 so it’s a different hash but it’s comparable).

dd if=/dev/urandom of=/tmp/chunk.dat bs=1M count=1; time sha256sum /tmp/chunk.dat
1+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0151521 s, 69.2 MB/s
75a5aaadb718a890eacc30f90091e0ed93147d992032bb3e48d9917dc5763096  /tmp/chunk.dat

real	0m0.007s
user	0m0.003s
sys	0m0.003s

So that means to target a single chunk to a 7 bit prefix takes about 1s, 13 bits in 1m, 19 bits in 1h

An attacker needs to create many chunks with the same X prefix bits to have the desired density effects.

Additionally problematic is if chunks are targeted to a 20 bit prefix and the network is only at 10 bit section prefixes, it takes 10 splits before that dense data begins being spread out.

I have no preference to safecoin vs indirection just yet.


#54

But if the indirection also holds the hash of the immutable chunk then the first section that the chunk really belongs at knows that the chunk is stored and can return that the chunk store is complete. Thus dedup can work with indirection

But I still contend that indirection should only be done as the last resort if increased “PUT” cost does not fix it (naturally). Also always indirection places (jlpell) extra load on the network without need and adds extra storage without need.

So really if a simple indirection is done then dedup can still work as intended.

Section where chunk should be stored at.

  • the indirection address (maybe use one or series of the reserved tags so no collision with other user stored chunks)
  • The level of indirection if this is implemented.
  • The hash of the chunk so dedup works as intended
  • reason code for indirection (future ability for other features maybe)
  • actual tag and address of the chunk or MD

Section if multiple indirection has occurred (if implemented)

  • the indirection address (same idea as the initial section)
  • The level of indirection
  • The hash of the chunk so dedup works as intended
  • reason code for indirection (future ability for other features maybe)
  • actual tag and address of the chunk or MD

Section actually storing the chunk or MD

  • The level of indirection
  • The hash of the chunk so dedup works as intended
  • reason code for indirection (future ability for other features maybe)
  • actual tag and address of the chunk or MD
  • store the actual chunk or MD

#55

+1

+1


#56

Ahh, ok. I was afraid it would be that easy. (Sha3 is only about 1.5-2 times slower). Indirection is looking prettier and prettier, but given the hash rate of a wealthy attacker with a custom asic cluster, they could still accumulate a lot of chunks offline and then hit the network all at once with them right? I suppose it will be a while before section prefixes are greater than 160bit eh? :sweat_smile: Just a single indirection alone will decrease the power of the attack by ~4 orders of magnitude right? (128B indirection key/data vs. 1MB chunk) Even so, it would be nice to be able to eliminate the ability for an off-line attack method entirely. I can think of a few naive ways, but they’re all too complicated, client-side, and seem expensive unless routing between the client and managers could do double duty and kill two birds with one stone. Seems like this is analogous to “secure random relocation”, but for data chunks…

@neo, it seems like you could doubly impede the attacker by charging 2 PUTs if your indirection conditional is reached, therefore compounding the expense of the attack tremendously for larger indirection sequences. Maybe this is what you were thinking? (1 PUT for an indirection, and 1 PUT for the final chunk).

How about “always fuzzy indirection” where there is a probability of indirection on any given chunk/address which continuously adjusts according to the current amount of free space or other health metrics (farming rate) in the section/network. Most of the time the probability might be <0.10, but some times it would go to >0.90 if network conditions degrade. This would guard against magic numbers and always keep an attacker guessing.

It still seems like you would want to limit things to a maximum of one indirection though just to ensure a minimum level of read/write performance… no need to trigger an indirection cascade while trying to protect against a merge cascade, right? Although, a fuzzy approach might work well/better in this case too. For example, consider the case where a chunk carries its own personal indirection probability (that would be cut in half or by a factor of 4 every additional indirection hop, maybe just an incremented integer to use as an exponent for the section probability, or just update the chunk probability geometrically by multiplying it with the current section probability). Regardless, the updated chunk indirection probability would be used to determine if that particular chunk should be passed on to another section or not. If every hop costs a PUT, then a larger financial deterrent exists against abuse, but the PUT cost for any given chunk becomes non-deterministic to some degree since there still exists some finite but negligible probability that a chunk could bounce around for a long long while and cost innocent users. You might be able to estimate an average PUT cost for a single file considering network section health, but you would still want some type of max_indirects constraint to bound the max PUTS. Indirects could be cost free, but I’m not so sure you would ever want them to be since that might open up another attack vector in the case that you allow for longer indirection sequences (>1).


#57

But why? There is only a need when you get past a critical point. Sort of a relief valve that only activates when the pressure is too high. You want to only add this delays and work when its needed.

For the person storing they don’t know and are not told this has happened.

My thought was not to have it always do multiple indirections but just that the section being indirected to could also have storage issues too. So rather than have that section being overloaded too it also just passes on the chunk for storage elsewhere.

Maybe rather than storing an indirection block that 2nd section simply passes on the chunk to another section and limit this passing on for x times. When a section finally stores the chunk that section passes back to the original section the details of the storage of the chunk and the original section stores the indirection details.

Thus only one indirection but the ability for a section that is also loaded down to pass the chunk onto another.


#58

Using a simple conditional gives you that pressure relief effect, but you are essentially using a heaviside step function to determine the probability that a chunk will stay in the targeted section or be passed on. The fuzzy approach is also a pressure relief effect, but more controllable. Here you use a smooth sigmoidal to approximate the step function for a section so you can tailor how strong the pressure relief is, like a tunable valve that varies smoothly from all or nothing. Following the “trust no one” approach, the network doesn’t know which chunks are malicious, so as conditions degrade it can ratchet up the indirection effect in a more controllable fashion and ease into it. If most of the time the section indirection probabilities are low, you might only indirect one chunk out of a billion or a 100 trillion, it’s up to you. As things start to get bad it might start indirecting 1 chunk out of 1000, or 1 in 2 or all of them. The fuzzy approach still amounts to a simple one-liner conditional in code and doesn’t indirect every chunk just like you wanted.

Yes, sorry I wasn’t clear. This is what I was referring to as well. If you use a hard pressure relief, and a chunk unluckily hits some bad sections one after another then it could get passed around a while unless you put a hard stop on how many times. Also, using the smooth sigmoidal approach illustrates how you can slow this down quickly by making chunks that have already been indirected have a lesser probability of being indirected again even if a lot of sections are becoming rather unhealthy. Simple probability also puts an effective limit on the number of indirection hops, but I like having a known limit count too just to be sure.

initial_stayput_probability_for_all_chunks = 1.0;
initial_chunk_indirect_count = 0.;
max_indirect_hops=10;

while section.exists {
section.chunks_to_put = section.receive_new_chunks();
section.chunks_stayput_probability = smooth_sigmoidal(section.health.get_status());

for chunk in section.chunks_to_put {
chunk.stayput_probability = chunk.stayput_probability*section.chunks_stayput_probability;
chunk.indirect_count = chunk.indirect_count + 1;

if (chunk.stayput_probability < random.float(0., 1.) &&
(chunk.indirect_count < max_indirect_hops) {section.please_indirect_this(chunk);}
else {section.please_try_to_put_as_usual_this(chunk);} }}


#59

All you do is start indirection early meaning more chunks will be redirected than needed. If you make it the same overall number then why add complexity? If you start early then you might cause indirection when none will ever occur since its just a temporary condition approaching that trigger point but never reaching it.

It really does not help to “smoothly” or “gradually” start indirection. And will cause some unnecessary indirection. Not efficient use of resources.

Its not a motor or machine that is benefited by gradual increase. This is a network protocol where every chunk can be dealt with on a chunk by chunk basis. Gradual “startup” does nothing to help and only means you end up in the long term causing many indirections when they were never needed.

Ummm what does a limit mean? Yes I included that.

If the network is under resourced in many sections (needing more farmers) then it more than likely that multiple sections will be in this situation. We are here talking not of an attack but just transient situations where more farmers are needed and some sections are under resourced.

Thus a single try to indirect could have a reasonable probability of hitting another node also under resourced.

So allow the situation where the request to store a chunk could be sent to a few (random) sections one after the other and in effect passing the chunk along.

Obvious in this process the actual chunk is not passed along but the request to store the chunk is and when the section accepts it the chunk is stored in that section. Remember a chunk reaches the section it is stored at via many hop nodes so one more hop is not of major delays. The indirection request is more significant.

Anyhow in my mind this limit would have been 2 to 4. Anymore would likely be a hindrance because of the delays. The APP code can do a retry and the indirection would be trying other “random” sections the 2nd time. Allowing upto say 4 (or 3 or 2) attempts by the network itself to find a section to store the indirection chunk will make the network seem more stable to the users in adverse conditions on particular sections.

No, it won’t. Because you are redirecting early (gradually) you are more likely to fill up other critical sections that are also close. At least if you wait to the trigger point then you are not unnecessarily filling up other sections.

Remember that this would also be operating when no attack is occurring. You idea seems to be focused on the benefits to one section under attack that is GOING to need indirection or fail. But many times in the life of a section it may approach the trigger point and never get there because the incentive to add vaults was enough to relieve the problem. But your gradual approach potentially shifts a non-existent problem on to other sections and one of those maybe a lot closer to the trigger point.

tl;dr

Remember this is basic protocols and each chunk can be dealt with independently and you’ve not demonstrated any benefit to a gradual approaching the trigger point. Each chunk’s final location can be decided on an independent basis and all gradually getting to the trigger point does is increase the likelihood that redirection will occur, even if it never is eventually needed at that time.


#60

+1

We could define the network side max indirection attempts is to tryout all its neighbouring sections. As they will have direct connections, so the store request latency could be reduced to minimum.