How does SAFE protect whistleblowers?


#1

This is from a discussion on reddit: MaidSafe is safe for whistleblowers? I said I’d ask here and get back with an answer, which I have done. Reddit and this post have been updated to reflect everything up to and including reply #18.

The question is, how could an attacker discover the identity of someone who stores something on the network, either in a public share (cf. WikiLeaks), or shared with select others (e.g. a handful of journalists, cf. Snowden)?

What Is The Aim Of The Attack?

The attacker seeks to know the identity of the uploader, and the content of the file uploaded.

Let’s equate the first with the IP of the node used to upload the file. A sophisticated whistleblower could ensure this isn’t enough, but let’s assume someone on SAFE uploads from a machine that can be linked to them.

The second requires the attacker to know that a file originated from the machine identified with the whistleblower.

So an attacker needs to know: 1) The IP address of the machine running the uploading SAFE node, and 2) That a particular plaintext file was uploaded from that machine.

1) Discovery of IP Address.

The IP is, I think, only known to the four nodes directly connected to the uploading node, so the attacker must control one of these nodes.

For a large network this will be impossible for an attacker to guarantee, but for SAFE to succeed the chances of discovery should be very, very small, or whistleblowers will easily be discouraged, or have to take additional steps (such as shielding their real IP or ensuring they are not associated with the machine used).

It seems to me that on its own, this is not enough. We also need to ensure that data passing through the four “gateway” nodes a) cannot be decrypted (I’ll take that as read), and b) cannot be linked to a known plaintext. This is part 2). Unless 2) is very hard, it seems very easy for an attacker to discourage whistleblowing by discovering even one significant target.

2) Matching Chunks To Plaintext

So I think it comes down to whether or not an attacker who has a given plaintext can use it to identify the chunks of data flowing through one of the gateway nodes, and so identify the IP of the uploading node.

UPDATE: If the attacker has an exact copy of an uploaded file, this attack is feasible. If the file is modified in the slightest, zipped into an archive for example, then the attack becomes infeasible unless the attacker can decrypt the files having somehow obtained your private key. That of course is a different attack, involving targeting of the client machines directly, since private keys are not shared beyond this.
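
To make the matching step concrete, here is a minimal sketch in Python. It assumes chunk names are derived deterministically from content (plain SHA-256 over fixed 1 KiB chunks), which is only a stand-in for SAFE's real self-encryption. The names `chunk_names` and `fraction_seen`, the chunk size and the sample document are all hypothetical and chosen for illustration.

```python
import hashlib
import zlib

CHUNK_SIZE = 1024  # assumed chunk size, for illustration only


def chunk_names(data: bytes, chunk_size: int = CHUNK_SIZE) -> set[str]:
    """Name each fixed-size chunk by the SHA-256 of its content (a simplification)."""
    return {
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    }


def fraction_seen(known_plaintext: bytes, observed_names: set[str]) -> float:
    """Fraction of the known plaintext's chunk names found among observed names."""
    expected = chunk_names(known_plaintext)
    return len(expected & observed_names) / len(expected)


# A document the attacker already holds an exact copy of.
document = b"".join(f"leaked memo, line {i}\n".encode() for i in range(2000))

# Names a gateway node would see for an unmodified upload versus a zipped one.
exact_upload = chunk_names(document)
zipped_upload = chunk_names(zlib.compress(document))

print(fraction_seen(document, exact_upload))   # every chunk matches
print(fraction_seen(document, zipped_upload))  # nothing matches
```

Under these assumptions the attacker never decrypts anything: recomputing names from the known plaintext is enough to recognise an exact copy, and any transformation before upload breaks the match.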

SAFE therefore needs to consider making it very hard for a whistleblower to upload an unmodified file onto the network without realising that this might expose them under extreme circumstances, a risk which may be avoided with a simple measure. See this post.


#2

A gateway node can’t decrypt (read) the target chunk unless it also holds the chunks adjacent to it and knows the datamap (i.e. knows which chunks are adjacent to the target and in which order).
As chunks are distributed based on their names, the chance of one node holding chunks adjacent to each other is very tiny. Not to mention it also needs to know the relations among them (which can’t be recovered from the chunks themselves).

If the node has the plaintext itself, it may know whether the chunk belongs to that plaintext or not.
However, the chunk doesn’t expose who uploaded it, so I don’t see how the uploader’s IP would get exposed,
unless the malicious node is acting as MaidManager of the uploader and also has a direct connection to it.
Given that the MaidManager is decided based on the MAID ID, and the direct connection is decided based on the PMID ID,
and that this ID pair is generated separately (so the two will be quite different), the chance of acting as both will be tiny in a large network.
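
A rough sketch of why two separately generated IDs tend to land in different parts of the network. It assumes Kademlia-style XOR closeness decides which nodes serve an ID, 512-bit IDs, and groups of four; those details, and the `closest` helper, are assumptions for illustration and not taken from the MaidSafe code.

```python
import random

ID_BITS = 512    # assumed ID width
GROUP_SIZE = 4   # assumed group size, echoing the four connected nodes above


def closest(target: int, node_ids: list[int], count: int = GROUP_SIZE) -> set[int]:
    """Return the `count` node IDs with the smallest XOR distance to `target`."""
    return set(sorted(node_ids, key=lambda node: node ^ target)[:count])


rng = random.Random(2014)  # fixed seed so the run is repeatable
network = [rng.getrandbits(ID_BITS) for _ in range(10_000)]

# Two IDs generated separately, standing in for a MAID ID and a PMID ID.
maid_id = rng.getrandbits(ID_BITS)
pmid_id = rng.getrandbits(ID_BITS)

managers = closest(maid_id, network)     # nodes that would manage the account
connections = closest(pmid_id, network)  # nodes that would be directly connected

print("nodes in both groups:", len(managers & connections))
```

With unrelated IDs the two groups are drawn from unrelated regions of the address space, so a single node sitting in both is rare unless the attacker runs a large share of the network.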


#3

Thanks for responding so quickly @qi_ma. I accept from this that decrypting is infeasible.

This still sounds worryingly like a viable attack, no? Let’s say the attacker has a lot of nodes, 10% or 50% or 75%; what are the chances of one of those nodes being able to achieve this?

For whistleblowers to feel safe the chances have to be very small indeed, or else we need to provide a whistleblower APP that ensures this.

For ordinary users the threshold will be much less, but still may be significant (I can see the headline - “SAFE is not Safe - user who uploaded xyz identified by hackers/police/private eyes/The Sun/…”).

This is important if we are making this claim, which I believe we are.


#4

This is great, thanks for setting this up happybeing.

I was wondering if we could simplify it a little to just the public case. I should not have mentioned Snowden, I really meant a wikileaks style whistleblower, who has powerful enemies (e.g. NSA)

i.e. How could an attacker discover the identity of someone who publishes a document on a public share?

The attacker, using various devious techniques, will try to deduce a set of IP addresses that are more likely to be owned by the whistleblower, than any other.

For example, we could say that …

Perfect anonymity - attacker gains no information - cannot deduce a set smaller than the set of all-possible-IPs

Strong anonymity - attacker cannot deduce a set smaller than the set of all-user-IPs

No anonymity - attacker can deduce a set of size 1.


#5

Assuming a wikileaks style disclosure, everybody has access to the plaintext.

Does this not mean that every node can ascertain with certainty, whether a chunk is derived from the plaintext or not?

And so malicious nodes can record the IPs of anyone that STOREs and LOOKUPs blacklisted chunks?


#6

In fact it is trivial for a node to ascertain whether a chunk belongs to a particular plaintext.

We should note that the attack ‘node storing wikileaks data is evil’ is simplistic, there are far more subtle scenarios that can be envisaged.

We should also note that the more chunks in the Wikileaks document, the greater the probability that one chunk will hit an evil store node.
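
As a back-of-envelope check on that last point, the sketch below assumes each chunk independently lands on an evil store node with probability equal to the evil share of the network, and ignores replicas. The function name is hypothetical. It only models where chunks end up, not whether the store node can tell who uploaded them.

```python
def p_any_chunk_on_evil_node(evil_fraction: float, chunk_count: int) -> float:
    """Chance that at least one of `chunk_count` chunks is stored by an evil node."""
    return 1.0 - (1.0 - evil_fraction) ** chunk_count


# With 1% of store nodes evil, watch the chance grow with document size.
for chunks in (1, 10, 100, 1000):
    print(chunks, round(p_any_chunk_on_evil_node(0.01, chunks), 4))
```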


#7

I don’t know this to be the case, but am concerned it doesn’t appear to be hard enough.

This is true but I don’t think it is an issue. As I noted elsewhere, there is no way for a storing node to know where anything it stores comes from, or where it goes to on access. I don’t know in detail how that is achieved, but I believed the explanation when I heard it! :slight_smile:


#8

The chance will be the square of the percentage of the network you control.
If you control 10%, the chance will be 1%; if you control 50%, the chance will be 25%.

However, two conditions must hold:
1, the plaintext must be known in advance
2, you are directly connected to the malicious node when you upload
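
The square rule can be checked with a few lines of Python. This sketch assumes the two roles (the managing node and the directly connected node) are filled independently and that each is the attacker's with probability equal to the attacker's share of the network. `p_both_roles` and `simulate` are hypothetical names.

```python
import random


def p_both_roles(fraction: float) -> float:
    """Chance that two independently chosen nodes are both the attacker's."""
    return fraction ** 2


def simulate(fraction: float, trials: int = 100_000, seed: int = 1) -> float:
    """Estimate the same chance by drawing the two roles independently."""
    rng = random.Random(seed)
    hits = sum(
        rng.random() < fraction and rng.random() < fraction
        for _ in range(trials)
    )
    return hits / trials


# The shares asked about earlier in the thread.
for share in (0.10, 0.50, 0.75):
    print(f"{share:.0%} of nodes -> {p_both_roles(share):.4f}")
```

This is a per-upload figure and only applies when both conditions listed above hold.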

If you are concerned about exposing your ID, one simple thing to do is to encrypt the plaintext with your private key.
Nobody can then link the generated chunks to the plaintext, unless they know your private key.
Though, by doing so, the benefit of de-duplication is lost.
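
A small sketch of that trade-off. The cipher here is a toy XOR keystream built from SHAKE-256, used only so the example runs with the standard library; it is not secure and is not how SAFE encrypts. The secret byte strings stand in for a key only the uploader knows, and content-hash chunk names are again a simplifying assumption.

```python
import hashlib


def chunk_names(data: bytes, chunk_size: int = 1024) -> set[str]:
    """Name each fixed-size chunk by the SHA-256 of its content (a simplification)."""
    return {
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    }


def toy_encrypt(data: bytes, key: bytes) -> bytes:
    """XOR with a SHAKE-256 keystream. A toy stand-in for real encryption, not secure."""
    keystream = hashlib.shake_256(key).digest(len(data))
    return bytes(a ^ b for a, b in zip(data, keystream))


document = b"".join(f"leaked memo, line {i}\n".encode() for i in range(2000))

# Two users store the identical file as-is: same names, so the network can
# de-duplicate, and anyone holding the file can recognise the chunks.
alice_plain = chunk_names(document)
bob_plain = chunk_names(document)

# Each user encrypts with their own secret first: the names match nothing else.
alice_private = chunk_names(toy_encrypt(document, b"alice-secret"))
bob_private = chunk_names(toy_encrypt(document, b"bob-secret"))

print(alice_plain == bob_plain)
print(alice_private.isdisjoint(bob_private))
print(alice_private.isdisjoint(alice_plain))
```

The same property that hides the link to the plaintext is what stops the network from storing one copy for both users.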

So, from the network’s point of view, getting your ID totally hidden is supported,
although it is not encouraged, and not necessary for the majority of users.


#9

Would the IP issue be resolved by using a VPN?

Instead of connecting to the SAFE Network directly from your ISP address, a whistle blower could use a VPN, connecting to their 4 close node neighbors with a VPN IP address. If the data were somehow tracked back to the uploading source, they would only discover the VPN address. The VPN IP address is shared by many other users.

This would be like using a “public” telephone to make a cryptic call. However, we cannot guarantee the VPN service is not recording traffic. Even so, the VPN would only see encrypted data being transmitted. Would it obfuscate the user’s IP address from the 4 close nodes?

Data flow would look like this.
PC ~> MaidSafe Encryption ~> VPN ~> The SAFE Network.


#10

Yes, using Tor, VPN, a cyber cafe etc to shield your IP helps a lot, but we’re just addressing how safe is SAFE for whistleblowers. As I said, I think a Snowden would take more precautions, but we should aim to protect even the least tech savvy whistleblowers. Looks like an APP would achieve this using the technique suggested by @qi_ma but I’m still not sure it is watertight. We have to consider someone like the NSA with the ability to store the hash of everything from all the nodes it is monitoring, and who could then trawl back through this after a leak to see which node uploaded the file. I think that what @qi_ma is suggesting would mean they would need to have access to the private key, which obviously would be harder for them to obtain, but as we know, they are capable of “collect it all” so it can’t be ruled out.

This highlights the difference between ordinary users and whistleblowers, but we should aim high, and certainly consider attackers who might seek to collect everyone’s private key. On a related note, it is now fairly public knowledge that there’s a spy plane hovering over London - probably hoovering up electronic communications data (WiFi, cellphone etc.).

We have to assume the NSA, GCHQ and others are doing this kind of stuff, and as widely as they can, especially when we’re considering the security of whistleblowers.


#11

Agreed,

I think a desperate entity would take the easiest route first then work their way up to more difficult methods. This would eventually lead to a takeover of all “access points.” In this case, the ISP would be targeted. I believe that would only hasten the evolution of mesh networks which is inevitable anyway. A future solution is going all the way to the hardware level.

Below is a comedy clip from the movie “The Big Hit” which expresses our situation. Warning, vulgar language in the video.


#12

Even with a VPN, the receiving VPN node can tell your IP address; it just needs to. So how do you trust the VPN recipient? You just cannot. There is a mechanism in place where the MaidManagers may not even get to see what you store. This is a small change, though, and it limits the ability to restrict users’ uploads until they have paid a safecoin (for instance). Don’t have time right now, but imagine the connection was made to the data managers. The connect request comes back to you encrypted (as all info is). You then send the chunk with the name encrypted to those exact data managers.

Then the attack would be on 1 piece of data, where an attacker may see who uploaded it, but only one at a time across the address space. The issue then is: can the attacker get a look at who Gets the data?

Gets are plain encrypted data to be cached. So an attacker may see what you are reading if they have the original file and also get a close node to you. So there are a couple of areas to consider.

TL;DR If you are uploading personal data it cannot be tracked like this or attacked. If you are uploading a copy of something then it may be caught if the attacker also has a copy of that thing. If it’s known cp or similar then you are in trouble etc.

Now the actual attack itself. This depends on you storing with an account that is known, so how did your underlying ID get exposed? (You do not get to see it, unless you look hard.) So it’s unlikely you published it. So then the attacker monitored your IP address; well, the bootstrap onwards is encrypted, so good luck with that :slight_smile: This means you will not be connecting to this attacker, but to a node somewhere (well, 4 of them). They can then see your IP, but who are you? So then it’s IP collection time (we know this is pretty easy). If they are in your country etc. then it is likely to be simple; however, in this attack the 4 nodes will be geographically spread and it may not be so simple (just saying, as it’s a difference).

So some differences, but it’s feasible an eavesdropper with a large proportion of the network could spot the upload of a single known chunk (perhaps) and track back your IP to you. Of course they may be connected to nodes across the world (it’s random). So to target you will be very difficult (birthday paradox, reversed). If we swapped upload IDs regularly (we can do this) then it gets even harder. Perhaps this should get discussed further as this is possible to improve on.


#13

I get that the chances are small and we are talking about a very powerful attacker, but that is the exact scenario for whistleblowers. Is this feasible: attacker has say 10% of the network, storing hashes of chunks passing through those manager nodes from client nodes along with the client IPs. SAFE takes off and people start using it to blow the whistle… Not one person, but tens, hundreds, thousands, tens of thousands.

To be safe, the chances that this attacker can identify just one of these per year, say, need to be extremely small for each instance of a person uploading a file. If just one whistleblower ends up in prison in a year, everyone loses confidence.
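
To put rough numbers on this, the sketch below treats every upload as an independent trial with the same per-upload chance of identification. That is a simplifying assumption, and the 1% per-upload figure is just the square rule from earlier in the thread applied to a 10% attacker. Both function names are hypothetical.

```python
def p_at_least_one(p_single: float, uploads: int) -> float:
    """Chance that at least one of `uploads` independent uploads is identified."""
    return 1.0 - (1.0 - p_single) ** uploads


def max_p_single(target: float, uploads: int) -> float:
    """Largest per-upload chance that keeps the overall chance at or below `target`."""
    return 1.0 - (1.0 - target) ** (1.0 / uploads)


uploads_per_year = 10_000  # "tens of thousands" of whistleblowers, as a round figure

# Overall chance of at least one identification if each upload carries a 1% risk.
print(p_at_least_one(0.01, uploads_per_year))

# Per-upload risk needed to keep the yearly chance of any identification under 1%.
print(max_p_single(0.01, uploads_per_year))
```

The second number is the useful one for design: under these assumptions the per-upload risk has to be orders of magnitude below the yearly target once uploads number in the thousands.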

So what are the chances in a scenario like this, with or without the more bullet-proof solution (in your first paragraph), and are they good enough for us to claim SAFE as safe for whistleblowers?

@dirvine I understand “I don’t have time now” so won’t be offended if you don’t answer this for the time being :slight_smile:


#14

Yes I would say so, unless a whistleblower sent out exact copies of what info he has. So if it was stolen info and unchanged in any way there is a chance of this attack working (I think it’s small and will get much smaller). So if a whistleblower had unique data, then no, if they zipped it up to send, then no etc. So they would need to copy a file and put that exact file up again untouched. So I would say for edge cases if you had a large bunch of files then it’s best to alter them even very slightly. Perhaps we should have a super private folder or similar that did do this exact thing? So unique data is no problem ever, but copies that some agency has maybe if they are untouched. I do not like the “alter them slightly” approach, as it’s too much for people to know. Perhaps a super secure drive could be used (perhaps a reason to use safecoin) where all data put on it is altered randomly. Could be done I suppose, worth consideration.
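
One way such a “super secure drive” might alter data is to prepend a few random bytes before chunking. The sketch below is purely illustrative: it assumes fixed 1 KiB chunks named by content hash, and `add_salt` is a hypothetical helper, not an existing SAFE feature. A reader of the file would need to strip the salt again.

```python
import hashlib
import os


def chunk_names(data: bytes, chunk_size: int = 1024) -> set[str]:
    """Name each fixed-size chunk by the SHA-256 of its content (a simplification)."""
    return {
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    }


def add_salt(data: bytes, salt_len: int = 16) -> bytes:
    """Prepend random bytes so every fixed chunk boundary shifts."""
    return os.urandom(salt_len) + data


document = b"".join(f"leaked memo, line {i}\n".encode() for i in range(2000))
original = chunk_names(document)

salted_front = chunk_names(add_salt(document))
salted_back = chunk_names(document + os.urandom(16))

print("names shared after prepending:", len(original & salted_front))
print("names shared after appending: ", len(original & salted_back))
```

In this simplified model where the alteration goes matters: prepending changes every chunk name, while appending leaves all but the last chunk recognisable. Whether real self-encryption spreads a small change further is not something this sketch shows.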

I do say though to attack like this would be a massive attack, not saying we know there are agencies who would perhaps do this, but the more they add the larger the network becomes, so it actually gets harder (many folk miss this part). To target somebody is going to be a full on attack, so back to greater than the network population again (probably by a large margin) and then it gets more interesting.

I would think an attack of this size would take out all these networks: tor/freenet/bittorrent/bitcoin/zerocash/darkphone/unseen etc. No network would survive it. All transport of all PKI keys from verisign etc. would have to be sent by post and typed in and so on. It would be pretty ferocious, but I am not saying it’s not something attempted, but it would mean true international collaboration with more than five eyes I think.

In this case then MaidSafe is the best bet as it’s encrypted from the start, but a transmission attack like this may still happen. Discovery of who sent to whom etc. would be an even bigger attack (where the real SAFE network would hardly factor in terms of numbers). It would probably be much easier to do the record-keystrokes (sounds) attacks or some of the recent read-CPU-noise etc. attacks. They are hard but can be targeted at a fraction of the cost.


#15

We should be aware that whenever the whistleblower performs a STORE, FIND_VALUE or FIND_NODE on the DHT he is leaking information about his IP address and the data he is interested in, reducing his anonymity.

We can see that he is effectively saying the following to those nodes he connects with

  • “Hi there! My IP is x.x.x.x and I am interested in chunk Y”

or even worse,

  • “Hello! My IP is x.x.x.x and I want to store chunk Z”

It is of course trivial to ascertain whether chunk Z derives from plaintext P, otherwise the system would not work!

So the best case we can hope for is that he only leaks identity information to just those nodes he connects with directly.

However, our adversary is able to monitor the RUDP traffic on every link in MaidSafe.

  • They can see which nodes are connected together at any moment in time
  • By performing timing analysis, they can track multi-hop FIND_NODE or FIND_VALUE requests back to the whistleblower who originated them
  • Perhaps the size of RUDP messages also leaks information useful to an attacker?

It does not stop there.

The adversary can also actively manipulate the handling of the DHT routing so as to increase the number of evil nodes that can monitor DHT activity. A small proportion of evil nodes can still be dangerous.

Guaranteeing anonymity is a hard problem in the face of adversaries such as this. Have a look at how I2P attempts it.

If you really want anonymity, perhaps you are creating a new Silk Road application, then the best advice might be to run MaidSafe over I2P …


#16

You caught my attention with this one. I would like to see it discussed further but not at the cost of delaying our current tasks.

No project is “born” perfect but it’s nice to know we have plans to keep polishing it. Thanks @dirvine for taking time out of your day to chat with us.


#17

unless a whistleblower sent out exact copies of what info he has

That is the exact scenario I was trying to talk about :smile:

The whistleblower publishes sensitive government documents to a public share.

  • The plaintext is known to everybody
  • Any node can ascertain whether chunk Z derives from the plaintext

#18

Need to be careful there though, http://ibnlive.in.com/news/tor-tails-privacyprotecting-flaws-show-that-no-anonymity-system-is-failsafe/487917-11.html

We need to understand what this flaw was and check we are not affected by it as well. This exposed the IP address at the end point, not only to the close nodes, by the looks of it. I see people on twitter suggesting to switch off I2P in Tails for the moment. The key is to work hard and fix what we know; the key to that is to get as large a community of eyes on the codebase and attacks. This is why the pods are so important to me, we need to have many eyes, there is no short circuit solution I am afraid.

I think just now we only need worry about close nodes (if we use them like this), but when we all get a chance to breathe I bet we can spot other areas to beef up. The point is not to make it harder, but to make it impossible for this kind of thing. Whether we get there or not, this has to be the goal. I am pretty confident though as that seems like a smaller problem than we have had to contend with so far (watch these last words :slight_smile: ) yes I am the optimist and dreamer.


#19

I agree that MaidSafe will be awesome even without the anonymity properties.

I also think there could be a way to build an anonymity layer above what you have already. Think of it as an overlay over an overlay :smile:

For now though I would suggest that MaidSafe will not be enough to protect the anonymity of persons of interest to governments. The cost of performing the attacks I’ve been talking about is actually pretty tiny given the infrastructure that is in place (see Snowden leaks).

Regarding Tor/tails etc we should note that the attack surface for someone using Tor browser is huuuge compared to I2P.

I2P does the minimum just packet routing whereas Tor inherits the classic never-ending-list of browser exploits and would be a much easier target.

It is a shame that I2P is written in Java though!


#20

I agree, but they are already published by then. I think this is OK.