Data Density Attack


There’s an attack that is expensive to mount but could cause serious and permanent problems for the network, so it’s worth discussing and seeing what it’s about.


The goal of the attacker is to cause significant disruption to the network; there is no direct gain to the attacker, which hopefully keeps the threat fairly low.


The crux of the problem is that data on the network is expected to be evenly distributed across vault names and data names. Network operations like joining and relocating have this balanced-vaults and balanced-data assumption built in (well, kind of!).

If an attacker uploads single chunks at a time (i.e. many 1 MB files), they can choose the content of those files / chunks so that the resulting names target a specific part of the network.
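As a sketch of what this targeting could look like (assuming content-addressed chunks named by SHA-256, which is an illustrative stand-in, not necessarily the network's actual hash), an attacker can simply brute-force random payloads until the name lands under a chosen prefix:

```python
import hashlib
import os

def forge_chunk(target_prefix_bits: str, max_tries: int = 1_000_000):
    """Brute-force a payload whose name (assumed here to be the SHA-256 of
    its content) starts with the given bit prefix, i.e. lands in a chosen
    section of the address space.  Returns (payload, name) or None."""
    for _ in range(max_tries):
        payload = os.urandom(32)  # a real chunk could be up to 1 MB
        name = hashlib.sha256(payload).digest()
        bits = bin(int.from_bytes(name, "big"))[2:].zfill(256)
        if bits.startswith(target_prefix_bits):
            return payload, name
    return None

# Matching a 10-bit section prefix takes ~2**10 = 1024 attempts on average.
forged = forge_chunk("0" * 10)
```

The point is that nothing in a pure content-addressing scheme stops the attacker from grinding payloads until the name falls where they want it.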

This means that one part of the network will have much more data than the rest of the network and the vaults in that part of the network will have much more work to do than other vaults.

This data imbalance is probably ok (not ideal but still tolerable). But if there is a merge between the data-heavy section and a ‘normal’ section, the ‘normal’ vaults need a lot of spare space to be able to facilitate the merge.

Any vaults without enough resources to merge will be booted off the network, meaning a high chance of cascading merges.

Furthermore, this presents a problem to the current understanding of ‘balance’ on the network. Difference in section prefix length is assumed to be unhealthy, but I think difference in data density is even less healthy. It’s probably normal for prefixes to become uneven because there will be times where data is unevenly distributed, so the structure of the network needs to compensate for that.

If one part of the network is very heavy on data because it was subject to a data density attack, then it makes sense for there to be more vaults in that part of the network to manage the load. This naturally leads to a difference in section prefix length.

My proposal is to remove ‘similarity of prefix length’ as a metric for network health and replace it with ‘similarity of data density’ per vault. The reason for desiring similarity in data density is to avoid merges of uneven data loads that may cause vaults to be kicked off the network and possibly cascade the merge.

This may have an impact on the current design work for relocating / joining / disallowing / killing vaults.


How expensive is this attack? It’s pretty expensive since the imbalance of data needs to be significant compared to all other data entering the network. The more ‘normal’ data there is stored on the network the more data the attacker requires to build up a significant imbalance.

It may take a lot of time and computational energy to generate chunks with a specific prefix. This effort becomes exponentially more difficult as the network grows (or as the attack is made more targeted), so smaller networks are more at risk than larger networks.
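To put rough numbers on that scaling (a back-of-the-envelope sketch, not a measurement), each extra bit of prefix being targeted doubles the expected number of hash attempts per chunk:

```python
# Expected hash attempts to land one chunk under an n-bit section prefix:
# each extra bit of targeting doubles the work, so cost is 2**n per chunk.
expected_attempts = {n: 2 ** n for n in (8, 16, 24, 32)}
for n, tries in expected_attempts.items():
    print(f"{n}-bit prefix: ~{tries:,} attempts per chunk")
```

A network deep enough to have 32-bit section prefixes would require billions of hashes per forged chunk, which is why small networks are the more realistic target.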

It costs safecoin to store those chunks on the network, but this should ideally be a decreasing cost as the network grows.

Ideally the attacker would also have vaults in the right part of the network to trigger a merge, which is very difficult to achieve (akin to the google attack, but easier since sibling sections are still meaningful to be part of).

The effect of the attack lasts a long time after the attack stops since chunks remain permanently. Eventually the effect is diluted by the additional upload of broadly named data, but this takes time.

The effect, if well enough targeted, would not be diluted by splits, and may actually be made worse in some cases.


The concept of a Balanced Network should aim for Resource Balance rather than the more abstract Naming Balance (I say this in the context of maidsafe/datachain_sim, which measures MaxPrefixLenDiff). This attack specifically concerns Storage Balance, but I think the point probably also applies to Bandwidth Balance, Latency Balance, etc.

The assumption that data will be evenly distributed and thus the assumption that vault names should be evenly distributed could be quite damaging.

This attack has impacts on the network understanding of spare space and requirements for how much spare space is needed before enacting incentives to bring more capacity online. Ideally the spare capacity is even across the network, but there may be times where capacity shouldn’t be evenly distributed.

This attack would indicate that random relocation is not good, and targeted / assisted relocation is necessary. I still think random relocation is overall preferable, but this attack is another point to add to the argument against it.

Concluding Thoughts

I’m not especially concerned about this attack in the real world, but cascading merges seem to be a big risk to overall network health, and it does seem to have implications for some of the current design philosophy being developed.

MaidSafe Dev Update - February 8, 2018

Would it be easier to do with MD data? How is the location for the MD data object determined?

Immutable data is based on its hash, but is MD data based on its address?

I ask because I don’t know, and if it is based on the address then this attack just became easier: one only has to store MDs with specific addresses.


Yes, it is.
However, such an address could easily be adjusted so that it is not defined by the user but given by the network.
More like: if a user wants to upload an MD, they request their account manager and are given a name range, then generate an MD name (address) within that range. This is similar to the current join/relocation procedure: you need to generate a name within the range defined by the network.
Such a range could be used to prevent this data density attack via MDs.

I remember we had explored an idea where the multiple copies of an ImmutableData are stored at multiple levels of hashing, i.e. hash(data), hash(hash(data)), …
In that way, instead of having multiple copies of an ImD in the same section, the copies are scattered across the network. This means the chance of the data density attack, and the cascading-merge concern, is further reduced (I hope to a level that can be ignored :slight_smile: ).
This approach still needs to be investigated further, especially on how to retain copies when storage nodes leave the network.
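A minimal sketch of that multi-level-hash placement (again with SHA-256 as an illustrative stand-in for the network's hash, and a hypothetical copy count):

```python
import hashlib

def copy_addresses(data: bytes, n_copies: int = 4) -> list:
    """Place the k-th copy of an immutable chunk at the k-times-iterated
    hash of its content: hash(data), hash(hash(data)), ...  The copies are
    thereby scattered across XOR space instead of clustering in one section."""
    addrs = []
    name = hashlib.sha256(data).digest()
    for _ in range(n_copies):
        addrs.append(name)
        name = hashlib.sha256(name).digest()
    return addrs

addrs = copy_addresses(b"example chunk")
```

Each copy address is derivable from the content alone, so any node can locate all replicas without extra metadata; the open question noted above is how replicas are re-established when their hosts leave.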


Currently client programs generate a hash from a user-defined name, and this hash is the address of the MD. This ensures that MDs are randomly dispersed in the XOR space. But nothing prevents someone from creating a program that directly defines the address of an MD, so he could kill a specific node by overloading it with targeted addresses. So, yes, a solution must be implemented for MDs.

But not this solution, because it removes the possibility of deriving the address of an MD from a user-defined name. This is a distinctive feature of MDs compared to Immutable Data that shouldn’t be lost.

One possibility would be that the client passes the source name to the client manager, so that the latter can check that the address is really the hash of a name. The check would be done only at creation time and the source name wouldn’t be stored in the network. The update and get operations don’t need the source name.
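That check could look something like this (a sketch under the assumption that the MD address is the SHA-256 of the source name; the function name is hypothetical):

```python
import hashlib

def validate_md_creation(source_name: bytes, claimed_address: bytes) -> bool:
    """Client-manager check at creation time: accept the PUT only if the
    claimed MD address really is the hash of the supplied source name.
    The source name itself is discarded afterwards, never stored."""
    return hashlib.sha256(source_name).digest() == claimed_address

good_addr = hashlib.sha256(b"my-app/profile").digest()
ok = validate_md_creation(b"my-app/profile", good_addr)
bad = validate_md_creation(b"my-app/profile", b"\x00" * 32)
```

Updates and gets then work on the address alone, exactly as today.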


How will the attacker generate immutable data for this attack? If the attacker wants to target a specific address space, he has to be able to generate immutable blocks with hashes in that specific range. But how will he do that? Brute force does not work well here: he would have to create huge amounts of random data, and only a very small part of that data would fit into that range. In fact he would have to generate roughly as much data as the total size of SafeNet just to have enough attack data to double the used space in that section. I do not know of any algorithm that can be used for such data generation. If you have a target hash address space of size N, and you want to generate X plaintexts with hashes in that space, in my view the complexity of that problem is approximately the same as guessing X original texts for X hashes; the smaller N is, the harder the problem. I do not know how it is with mutable data, but I guess it could be solved by generating the address as the hash of the MD’s first content plus the user’s public key.


Sorry, I should have written that sentence in more detail.
The name is not given by the network; the network only defines a name range.
The user still has the possibility to define the name (address) of an MD, it just needs to be within that range.

I don’t think the network needs to care about how the address got derived?


In theory, the hash of small-sized data can be the same as the hash of a large-sized text.
So basically, you don’t need to guess X original texts for X hashes; you only need to hash recursively: once the result hits the wanted address range, put the previous hash as the payload of the ImmutableData and the current result as its name.
It’s still expensive, but much cheaper (and with a much smaller payload) than guessing the originals.
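The hash-chain variant could be sketched like this (SHA-256 again as an illustrative stand-in; note it shrinks each candidate payload rather than the expected number of hash computations):

```python
import hashlib

def forge_by_hash_chain(target_prefix_bits: str):
    """Iterate the hash; when the running digest lands under the wanted
    prefix, the previous digest (32 bytes) becomes the chunk payload and
    the current digest its name.  The expected number of hashes is the
    same as plain brute force, but every candidate payload is tiny."""
    prev = hashlib.sha256(b"attacker seed").digest()
    while True:
        cur = hashlib.sha256(prev).digest()
        bits = bin(int.from_bytes(cur, "big"))[2:].zfill(256)
        if bits.startswith(target_prefix_bits):
            return prev, cur  # payload, name
        prev = cur

chain_payload, chain_name = forge_by_hash_chain("1" * 8)
```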


But how does that decrease the number of guesses? OK, your input will not be a 1 MB block but a few bytes, but still, if there are 100,000 groups in the SafeNetwork, only 1 out of 100,000 computations will give data that can be used for the attack. And since you did not use 1 MB blocks but just a few bytes, that immutable data will not fill the disk space in that group much. Yes, that group will receive many new immutable data items, but those will not fill the disk space fast. The only result is that this attack becomes more expensive, since every put costs money, and paying a fixed put price for a few bytes is much more expensive than paying the same price for a 1 MB block.


I haven’t measured it in detail; it was just an example that with this approach you may be able to generate one thousand 64-byte chunks in the time it takes to generate one 1 MB chunk. That may cause more trouble for the section, as there are additional messages and overheads involved in replicating the additional entries.

However, I agree it is still expensive.

Regarding the coin cost involved, yes, that’s true. And I think that’s why mav mentioned that the motivation for such an attack is not high.


OK, so if this attack is expensive and the attacker wants to cause trouble at any price, the worst-case scenario is that he makes trouble for one section of the whole network; the rest of the network will continue to work. In the worst case, an attack with lots of small puts could be handled by some limiter. But maybe I am wrong and there is some algorithm that can generate such data more easily, in which case this attack could be dangerous.


Be aware, though, that given that case (an attacker willing to pay any price), no system will survive.


But that seems counter to what is desired by people. For instance I want tag type 1000001 (mytag) and address 1 then 2 then 3 then 4 because I want to be able to calculate the address to request in my application. I don’t want to have to update an index simply because the network no longer lets me specify the full address.

An example is a series of tokens, which would work best if the address is the actual token number.

Why not simply take the hash of the tag + address to work out where to store the MD? At least targeting becomes more difficult, even if it remains simpler than the equivalent attack with immutable data.

That sounds good.

Maybe something can be done to make it more difficult to control where the MD is stored AND still allow apps to be specific as to the tag + address for an MD.


How would this work for domain names? Say a client wants to resolve a domain, he would pass the domain name into a function domain_name_to_net_addr(domain). That domain_name_to_net_addr fn has to be deterministic, otherwise the net addr would change each time a request is made, but the net addr needs to match up with the net addr generated at the PUT.
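For lookup to work, the name-to-address mapping must indeed be a pure function. A minimal sketch (assuming the address is simply the SHA-256 of the name, which is exactly the derivability that the targeting problem hinges on):

```python
import hashlib

def domain_name_to_net_addr(domain: str) -> bytes:
    """Deterministic mapping from a human-readable domain to a network
    address; the PUT and every later GET must compute the same value."""
    return hashlib.sha256(domain.encode("utf-8")).digest()

addr = domain_name_to_net_addr("example.safenet")
```

Determinism is what makes resolution possible, and it is also what lets an attacker grind candidate names until the address lands in a target range.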


An address can then still be manipulated to make the resulting hash target a particular space.

Well, I think you can always have an address-book MD chunk to map the addresses. So addresses 1, 2, 3 will be translated into XXX, YYY, ZZZ by looking that up.
Also, once you have uploaded an MD successfully, the name is taken and won’t be revoked. So you don’t need to update an index, only append to that address book each time you have a new MD to upload.
In a word, it will be up to the app to decide how to map address 1, 2, 3 (or a URL link) to the real name (address) of an MD.

Also, even if the network supports using the hash of address + tag directly, a particular app will still face the issue that the hash of a wanted address may already be occupied by another. So my understanding is that an app-level address mapping is inevitable.


It was clear to me. Constraining the range of the address implies that we cannot derive it from a user-defined name anymore.

Not being able to freely generate names and using translation tables instead will break many apps.

In my solution, yes it does, to ensure a uniform address distribution. This solution doesn’t break applications; the client layer just has to pass the source name in addition to its hash.


@tfa: doesn’t your approach of using a source name leave open the possibility that someone can manipulate the source name to make the resulting hash target a particular address space?


Yes, but then this is also true for immutable data. Generating an address with a hash is one of the network’s founding principles. It was supposed to be secure enough thanks to the random data distribution it creates.

I don’t know what to say, except that a solution must be found that doesn’t break existing apps.


Just a wild idea: maybe we can discourage targeted puts by gradually increasing the cost of a put based on the amount of data already stored in the same section by the same user?
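As a sketch of how such a deterrent might be priced (all constants here are illustrative assumptions, not proposed network parameters):

```python
def put_cost(base_cost: float, user_bytes_in_section: int,
             step: int = 100 * 1024 * 1024, factor: float = 2.0) -> float:
    """Escalating put price: the cost doubles for every `step` bytes the
    same user has already stored in the target section, making a targeted
    flood exponentially expensive while normal use stays at base price."""
    doublings = user_bytes_in_section // step
    return base_cost * (factor ** doublings)

# First puts cost the base price; after cramming 1 GiB into one section,
# with a 100 MiB step, the same user pays 2**10 = 1024x per put.
```

An attacker spreading uploads evenly (as honest users do by construction) would never hit the multiplier, so only density-skewed puts get penalised.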


I tend to agree. If there’s to be indirection, I’d think it should be handled network-side rather than client-side. For example, if a given section is struggling to cope with the volume of stored data (or client accounts for example) these could be offloaded to a quieter section with pointers held by the original section. The pointers would map individual chunks/accounts, or address ranges.

If the original section is struggling for bandwidth, the client could be redirected to the new section and deal directly there. If the issue is purely disk space and not bandwidth, the original section could just act as a proxy and not bother the client.

There are a few other ways to address this too (e.g. archiving data, storing data with an expiry time and paying more for long-living data, detecting and punishing malicious behaviour like storing many ImmutableData chunks with very similar names), but as far as I know, none of these are agreed or fleshed out.


Good call, although it’s maybe not necessary to even specify the same user. An attacker would probably just try and use multiple clients. Any significant imbalance in data distribution across the address space probably indicates an attack, so charging everyone more seems fair, since honest users won’t be hitting the affected range very often.