That article focuses mainly on what they describe as “tagging attacks”, so I’ll focus on those for this post. Feel free to expand the surface area if you wish. Also, as an aside, some of the PDFs for the other three attacks that they link to are unavailable. I guess someone didn’t want to renew their domain registration. Another problem defeated by the SAFE Network…
Sorry for the long post. Don’t worry though, I repeat myself for clarity often enough. I swear it started out a lot smaller. Also, once again, I am not an expert. YMMV
The way we generally explain it is that Tor can try to protect against traffic analysis, where an attacker tries to learn whom to investigate.
For future comments, I want to re-emphasize this point. This is not a discussion about an attacker trying to learn whom to investigate. It is exactly about the “other things”.
However, Tor can’t protect against traffic confirmation (also known as end-to-end correlation), where an attacker tries to confirm a hypothesis by monitoring the right locations in the network and then doing the math. (Emphasis mine.)
Before we dive in (breaking my own rule above), I did want to mention that setting up such a tagging attack would be so utterly massive in its scope that it would be relatively infeasible, mainly due to the number of farmers required to serve up any meaningful bit of data. But feasibility never stopped the NSA before.
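To put a rough number on that infeasibility (a toy calculation; the 4-copy figure comes from the discussion below, and the network sizes are invented), here is the chance that an attacker controlling some fraction of the vaults holds even one replica of a given chunk:

```python
from math import comb

def p_attacker_holds_replica(n_vaults, n_attacker, replicas=4):
    """Probability that at least one of a chunk's replica holders is
    attacker-controlled, assuming uniformly random placement (an
    assumption -- real placement follows XOR-closeness, not this)."""
    # Hypergeometric: chance that none of the `replicas` holders
    # falls inside the attacker's set of vaults.
    p_none = comb(n_vaults - n_attacker, replicas) / comb(n_vaults, replicas)
    return 1 - p_none

# Controlling 10,000 of 1,000,000 vaults (1%) still only gives
# roughly a 4% chance of touching any one particular chunk:
print(p_attacker_holds_replica(1_000_000, 10_000))
```

And that is only *holding* a replica; actually answering the GET, and correlating it with the client, is a further hurdle.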
So chunks are served by farmers. Those chunks aren’t even necessarily unique to one piece of data. For instance (I don’t know what kind of file best exemplifies this), imagine a type of file that contained a huge (1MB+) header, even when compressed. If that header uses default values and many people store files of that type with different content but the same header, that header chunk will be brought down for many different reasons, without a corresponding exit of that chunk at the target client.
Also, if that chunk is popular, it will be cached on intermediate nodes. So now the “server” of that chunk has gone from 4-6 vaults to some probabilistic number of vaults, the probability depending on just how popular that chunk is. That also varies with time: one day a chunk may not be cached somewhere, the next it is, and then the day after it’s gone again.
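As a toy illustration of how caching dilutes the attacker’s view (the uniform-responder assumption and the cache counts are made up): if the attacker is watching the 4 base holders of a chunk, the chance that a given GET is actually answered by a watched vault shrinks as caches appear:

```python
def p_observed_serve(watched=4, caches=0):
    """Chance a GET for the chunk is answered by one of the `watched`
    vaults rather than by a cache, assuming the responder is chosen
    uniformly at random among all candidates (a simplification)."""
    return watched / (watched + caches)

# More caches -> the watched vaults serve a shrinking share of GETs.
for caches in (0, 5, 50, 500):
    print(f"{caches:3d} caches -> {p_observed_serve(caches=caches):.3f}")
```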
On top of all that, think about churn. As nodes enter and exit the network, the same chunk of data is copied to multiple machines. A copy may sit offline while a farmer is offline, but there are always 4 live copies available somewhere. So when a chunk is re-replicated and stored on another vault, the attacker would then have to figure out which new vault it went to before they could continue the analysis. That is, if they don’t have to start it all over again.
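A hypothetical simulation of that churn effect (the 5% per-round churn rate is a figure I invented): once any of the 4 holders the attacker has mapped goes offline and its chunk is re-replicated to an unknown vault, the map is stale:

```python
import random

def rounds_until_map_stale(replicas=4, churn_rate=0.05, seed=1):
    """Toy churn model: each round, every mapped replica holder
    leaves the network with probability `churn_rate` (a made-up
    figure).  Returns how many rounds pass before the attacker's
    map of the replica holders is first invalidated."""
    rng = random.Random(seed)
    rounds = 0
    while True:
        rounds += 1
        if any(rng.random() < churn_rate for _ in range(replicas)):
            return rounds

# With these numbers the map survives only ~5 rounds on average
# (geometric with p = 1 - 0.95**4, about 0.19 per round).
print(rounds_until_map_stale())
```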
Another (less convincing) part of this is that vaults are not serving specific content. They could hold content for any number of applications. For instance, the original Silk Road servers were hacked. If an attacker had run this type of attack before hacking them, they could have attempted to correlate the content that was put out by those particular servers.
Now what if the servers served several services?[1] The attacker wouldn’t be able to tell which service was being requested when, thus providing plausible deniability. Also, since with the SAFE Network port numbers are randomized uniformly(?), there’s no way to say: “That came out of port 80, it’s an HTTP request,” or “That came out of port 21, it must be FTP.” Data is just data. Nothing more, nothing less. Every single thing would have to be taken into account, even checks from all of the other managers that are in charge of that particular vault.
What if the known servers only hosted part of the service? Then there’s a good chance that the known nodes would not be contacted for every request, and the attacker would miss many correlations for that same service simply because it wasn’t being served by those known machines, even though the client’s request would be indistinguishable from one that those servers would eventually serve. Enough of a chance, I think, to establish plausible deniability.
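A quick sanity check of that intuition (both numbers below are invented for illustration): if the known machines hold, say, 30% of a service’s chunks, the chance that an entire 10-chunk client session is served exclusively by them (the only case the attacker can fully correlate) is vanishingly small:

```python
def p_full_session_observed(frac_known=0.3, fetches=10):
    """Chance every chunk GET in a session lands on the known
    machines, if they hold `frac_known` of the service's chunks
    and fetches are independent (both figures are invented)."""
    return frac_known ** fetches

print(p_full_session_observed())  # 0.3**10, about 6e-06
```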
The basic idea is that an adversary who controls both the first (entry) and last (exit) relay that Alice picks can modify the data flow at one end of the circuit (“tag” it), and detect that modification at the other end — thus bridging the circuit and confirming that it really is Alice talking to Bob. This attack has some limitations compared to the above attacks. First, it involves modifying data, which in most cases will break the connection; so there’s a lot more risk that he’ll be noticed. Second, the attack relies on the adversary actually controlling both relays. The passive variants can be performed by an observer like an ISP or a telco.
In general I’d say that these limitations hold, and are amplified to some extent, especially since the difficulty of controlling (hell, even knowing) which “relays” to control is orders of magnitude greater.
An interesting question here is: “If a successful GET triggers a reward to the farmer, how does the network know which farmer served up that data, and can that mechanism be exploited?” That is something I have not researched yet. Anyone else care to expand on that? If not, I guess I’ll just keep digging (I’ve got nothing but time anyway).
[1] How many services can a service server serve when a service server serves services? Several.