Stylometry & anonymity on the network

The topic about the Web of Trust and BrightID mentions a very interesting point:

BrightID … is based on a web of trust which respects anonymity outside your circles of friends

which leads me to a tangential question that always bothered me: is there a way to maintain real anonymity within public discussions outside of a circle of your friends?

There’s a quite specific attack that can circumvent one’s anonymity: stylometry, that is, statistical analysis of a person’s unique writing style. To my knowledge, it has not been proven on a large scale yet, but it seems quite plausible, especially considering the advancements in modern machine learning technology and huge amounts of data available on the clear net (not to mention the social media corporations). I suppose any actor with sufficient resources would be able to collect posts made in public and correlate them with anonymous posts, which can lead to de-anonymisation.
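To make the idea of "statistical analysis of writing style" a bit more concrete, here is a minimal sketch, assuming that style is approximated by how often a handful of common function words appear. The word list, the `profile` and `cosine` helpers and the sample texts are all hypothetical; real stylometry tools use far richer feature sets, so treat this as an illustration rather than a working attack.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny feature set; real studies use many more features.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "is", "it", "but"]

def profile(text):
    """Relative frequency of each function word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid division by zero on empty input
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine(u, v):
    """Cosine similarity between two profiles (0.0 if either one is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Made-up sample texts: one "public" post and one "anonymous" post.
public_post = "It is the case that the network is a tool, but it is to be used with care."
anonymous_post = "The point is that a tool is of use to the people, and it is in the open."
print(cosine(profile(public_post), profile(anonymous_post)))
```

A score close to 1.0 only says that two texts use these particular words in similar proportions; on its own it does not show that they share an author.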

What’s more, stylometry is not limited to human languages: it can also be applied to code, as was demonstrated by Rachel Greenstadt and Aylin Caliskan in their DEF CON 26 presentation and in their scientific publications.

I think this is a fairly important topic, one that will become even more important in the future, and it relates to what we’re aiming to achieve with the SAFE Network. There are many open questions which in my opinion deserve further discussion. How do we prevent this, and how do we make sure freedom of speech is maintained in all circumstances?

18 Likes

Could we have our own stylometry machine learning program that mixes and mashes users’ styles, like a CoinJoin-type mechanism?

6 Likes

Good point @nbaksalyar, and this shows how difficult opsec and true anonymity are.

On the other hand, people have tried stylometry attacks to unmask Satoshi by analyzing the Bitcoin whitepaper and source code, without any conclusive results. Either “Satoshi” is multiple people, or these kinds of attacks work really well for fiction (as seen with JK Rowling) and less well for code or technical documents.

There’s a program called Anonymouth that purports to be able to anonymize one’s written output to defend against these attacks. I tried it a couple years ago but couldn’t get it to run without erroring out.

6 Likes

Also be aware of other metadata: the time of day, frequency and length of your posts, etc.
I doubt, however, that much useful information can be gained from things like this.

3 Likes

I believe this kind of information was part of what led to the arrest of Dread Pirate Roberts despite him using Tor, so I wouldn’t say it’s completely useless.

7 Likes

Actually it is a very important point because there are theoretical (and most likely even practical) attacks on the Tor network with the use of such metadata. I don’t have specific links about this research at hand, but to put it simply, just given the wealth of information that your ISP has about your connections, it’s not very hard to narrow down the connection times and correlate them with the specific time anonymous posts were made – if this metadata is not hidden.
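A rough sketch of the correlation described above, assuming an observer has a log of connection sessions per subscriber and a list of times at which anonymous posts appeared. The `fraction_covered` helper, the subscriber names and all the numbers are made up for illustration; real traffic analysis is considerably more involved.

```python
def fraction_covered(post_times, sessions):
    """Share of post timestamps that fall inside any (start, end) session."""
    if not post_times:
        return 0.0
    hits = sum(
        any(start <= t <= end for start, end in sessions)
        for t in post_times
    )
    return hits / len(post_times)

# Hypothetical data: times are plain numbers (e.g. minutes since some epoch).
anonymous_posts = [100, 250, 400]
subscriber_sessions = {
    "subscriber_a": [(90, 110), (240, 260), (390, 410)],
    "subscriber_b": [(0, 50), (395, 405)],
}

# Rank subscribers by how many of the anonymous posts coincide with their sessions.
ranking = sorted(
    subscriber_sessions,
    key=lambda name: fraction_covered(anonymous_posts, subscriber_sessions[name]),
    reverse=True,
)
print(ranking)
```

The more posts there are, the fewer subscribers will happen to be online for all of them, which is why hiding or blurring this timing metadata matters.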

6 Likes

The stylometry attack would work if your sample is big enough and you are absolutely sure that the target IS within your sample group to compare with.

Otherwise the only thing you will get is the closest match within a given sample, but never the absolute certainty that it is the same person… because the real culprit might not be in the sample.

(That’s how Satoshi was “found” several times, and every time there was a new suspect it was considered that “oh this time we found it, it is closer stylistically than the previous one”… The only one who would be able to have a more reliable stylometric attack would be the NSA with the global access to private data to compare with)
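The "closest match is not proof" problem can be shown with a small sketch. It assumes each author is already reduced to a numeric style vector (the vectors, names and threshold here are invented). Without a rejection threshold, the search always names somebody, even when the real author is not among the candidates.

```python
import math

def distance(u, v):
    """Euclidean distance between two style vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def closest_author(unknown, candidates, threshold=None):
    """Return (name, distance) of the nearest candidate.

    If a threshold is given and the nearest candidate is farther away
    than that, the name is None, i.e. "probably not in the sample".
    """
    name, vec = min(candidates.items(), key=lambda kv: distance(unknown, kv[1]))
    d = distance(unknown, vec)
    if threshold is not None and d > threshold:
        return None, d
    return name, d

# Hypothetical style vectors for two suspects and one unrelated unknown text.
candidates = {"alice": [0.0, 0.0], "bob": [1.0, 1.0]}
unknown = [5.0, 5.0]

print(closest_author(unknown, candidates))                 # always names someone
print(closest_author(unknown, candidates, threshold=1.0))  # can decline to answer
```

Choosing a sensible threshold is the hard part, and it needs far more data than the two-suspect toy above.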

6 Likes

You and @nbaksalyar are correct of course, but it is not that easy, certainly on the SAFE Network I think. I was thinking more about the timing of Satoshi’s posts (which did not yield many results), but also the anime Death Note. The first 2 episodes should be enough to get an idea.

1 Like

I think it can be both easier and harder.

  • Is it a new programmer who is writing new code by following examples of tutors/others
  • Is it a seasoned programmer set in their ways
  • Is it an engineer who, like me, adapts their programming style when they (regularly) find better ways to do things
  • etc

Depending on the person, it could be easy, since they keep to the style they learnt at the beginning for each aspect of their programming. Or it could be hard, as in the case of an engineering type of person who is always learning better ways to do things and working them into their style, which is therefore always changing.

It certainly can work, especially if you can include some geographical information or the person uses anecdotes, and also when comparing posts on similar subjects.

But like you suggest, on a grand scale, with anonymity of location and of any IDs, it could prove much more difficult. Many people are products of their education and upbringing, and as the number of online users interacting increases, the result could be more of a grouping than an individual identification.

I remember being in the USA in the 80’s, when the border inspector (Canadian/USA) asked every driver and passenger to state the town they were from and used that to identify the actual county they were from. If they were suspicious, they then took the person in for further identification.

Yes, a good example of how these analysis systems can only give a confidence level for an identification.

It is never going to be as good as face recognition, since facial recognition uses detailed analysis of the face to build a database. The Chinese are testing this now: they take many images from all angles so their system can recognise a face from a camera image. Face recognition uses physical features that do not change, whereas style changes over time and with mood, emotional state, major changes in one’s life, etc., when talking about conversational writing.

Of course in this case they had access to what would be a very limited number of suspects to choose from.

So yes, in some cases it will be easy, like finding out which student in a class wrote that email to whomever. In other words, it will have applications, but I doubt it will be useful in the general sense of unmasking joe or jane average’s identity, especially if (s)he uses different IDs for each subject (i.e. forum etc.) they converse on. And of course locality is always removed by SAFE. Those things limit the ability to find the true style.

Actually I was reading about this analysis a couple of years ago and don’t have the links either. Our ISPs by law have to keep all the information about our internet usage, including email subjects, time spent on which site, and so on. One reason why VPNs are being heavily advertised in Australia.

TL;DR
At the end of the day, it is up to the person to be anonymous and this means being educated on what will unmask them and the risks of not doing certain anonymising actions.

SAFE helps in that each person is able to use different IDs for each aspect of their online experience which reduces any sort of stylistic identification. But of course if they upload selfies under each of their IDs then game over I guess.

4 Likes

Great post @neo. This is a highly nuanced attack and there are a lot of subtleties to be aware of when defending yourself against it.

One of the nice side effects of code auto-formatting tools like gofmt or rustfmt is that they remove at least one dimension of analysis for the attack, namely the formatting patterns used by the author. Much of the code then would depend on the needs of the program as opposed to the author’s personal style. This is especially true of a language like Rust which demands adherence to strict rules and restricts one’s personal expression, so to say. There might still be a few things that can give someone away though: average size of functions, size of modules, frequency of variable shadowing, etc, combined with the content of comments (a second dimension of analysis). I’m skeptical as to the reliability of this analysis.
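As a rough illustration of a feature like "average size of functions" that survives auto-formatting, here is a sketch that counts statements per function. It uses Python’s standard `ast` module purely as a stand-in (the post is about Go and Rust, for which you would need their own parsers), and the helper name and sample source are hypothetical.

```python
import ast

def avg_statements_per_function(source):
    """Mean number of statement nodes inside each function definition."""
    tree = ast.parse(source)
    sizes = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Count every statement nested in the function, minus the def itself.
            stmts = sum(isinstance(n, ast.stmt) for n in ast.walk(node)) - 1
            sizes.append(stmts)
    return sum(sizes) / len(sizes) if sizes else 0.0

sample = (
    "def f():\n"
    "    x = 1\n"
    "    return x\n"
    "\n"
    "def g():\n"
    "    return 2\n"
)
print(avg_statements_per_function(sample))
```

Because the count is taken from the parsed tree, reformatting the file does not change it, which is why a formatter removes only one dimension of analysis.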

Still, if you are paranoid, only upload obfuscated code to GitHub :wink:

2 Likes

Great summary @neo, thanks!

I agree re “it is up to the person to be anonymous and this means being educated”. Personally, I believe this can be a part of the network ethos: the whole ecosystem and apps can subtly (or not so) educate users about the importance of anonymity on the network and about best routes to anonymise oneself – similarly to e.g. how the Tor Browser disables cookies by default, which, of course, is not a silver bullet, but it’s still better than the current state of tracking on the Web.

I think this also relates to the discussion about the role of UI/UX that was brought up by @JimCollinson here: Web Apps and access control

Yes, exactly :slight_smile: There’s an interesting bit about that DEF CON presentation though: the authors explicitly state they don’t depend on the formatting at all, they analyse the parsed abstract syntax trees instead.
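To show why working on the syntax tree sidesteps formatting, here is a small sketch using Python’s standard `ast` module. It is not the feature set from the presentation, only an assumption-laden illustration: two snippets that differ in layout give identical node-type counts, while a snippet with a different structure does not.

```python
import ast
from collections import Counter

def node_type_counts(source):
    """Count how often each AST node type occurs in the parsed source."""
    return Counter(type(n).__name__ for n in ast.walk(ast.parse(source)))

# Same program, two layouts (hypothetical examples).
compact = "def f(a,b):\n  return a+b\n"
spaced = "def f( a , b ):\n\n        return a  +  b\n"

# Same behaviour, but a different way of structuring the code.
restructured = "def f(a, b):\n    total = a\n    total += b\n    return total\n"

print(node_type_counts(compact) == node_type_counts(spaced))
print(node_type_counts(compact) == node_type_counts(restructured))
```

So a formatter hides whitespace habits, but choices such as introducing a temporary variable still show up in the tree.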

4 Likes

I hope this is done.

Maybe Maidsafe or someone else can create a set of “fundamentals” for application programs. One of them could be that each program/application should educate the user, or at least link to a safesite that does the education, with the link pointing directly to the section relating to the type of app it is.

Others could be “privacy of the user is paramount”, “no tracking of the user”, “record user data in the user’s files and not in a way that it can be externally read”, etc.

1 Like