This webpage contains additional material for the paper:
Martino Trevisan, Francesca Soro, Marco Mellia, Idilio Drago, and Ricardo Morla. 2020. Does domain name encryption increase users’ privacy? SIGCOMM Comput. Commun. Rev. 50, 3 (July 2020). Available here.
Knowing domain names associated with traffic allows eavesdroppers to profile users without accessing packet payloads. Encrypting domain names transiting the network is, therefore, a key step to increase network confidentiality. The latest efforts include encrypting the TLS Server Name Indication (eSNI extension) and encrypting DNS traffic, with DNS over HTTPS (DoH) representing a prominent proposal. We show that an attacker able to observe users’ traffic relying on plain-text DNS can uncover the domain names of users relying on eSNI or DoH. By relying on large-scale network traces, we show that simplistic features and off-the-shelf machine learning models are sufficient to achieve surprisingly high precision and recall when recovering encrypted domain names. The triviality of the attack calls for further actions to protect privacy, in particular considering transient scenarios in which only a fraction of users will adopt these new privacy-enhancing technologies.
We evaluate whether an attacker who observes the traffic of users that expose the domain names they contact can uncover the domain names contacted by the remaining users, i.e., break the protection intended by DoH and eSNI. We assume that the attacker can draw on different sources to build a dataset of flow-level statistics labeled with their domain names: (i) eavesdropping on the traffic of users relying on plain-text protocols; (ii) harvesting corporate/private traffic and DNS logs; or (iii) running active experiments.
An off-the-shelf machine learning approach is sufficient to execute the attack. For the most popular domain names observed in the training set, we extract per-flow traffic features such as flow duration, byte counters, packet sizes, and packet inter-arrival times. The machine learning models yield surprisingly good results, delivering F1-scores over 0.8 for 80% of the evaluated domain names. The triviality of the attack calls for further actions to protect domain names.
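To make the kind of features mentioned above concrete, here is a minimal sketch that computes them for a single flow. It is an illustration only, not the code used in the paper: the function name `flow_features`, the input layout (a list of `(timestamp, size)` pairs), and the exact set of statistics are assumptions made for this example.

```python
from statistics import mean, pstdev


def flow_features(packets):
    """Compute simple flow-level features.

    `packets` is a hypothetical input format: a list of
    (timestamp_seconds, size_bytes) tuples in arrival order.
    """
    if not packets:
        raise ValueError("a flow needs at least one packet")
    times = [t for t, _ in packets]
    sizes = [s for _, s in packets]
    # Inter-arrival times: gaps between consecutive packets.
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return {
        "duration": times[-1] - times[0],
        "total_bytes": sum(sizes),
        "n_packets": len(sizes),
        "mean_pkt_size": mean(sizes),
        "std_pkt_size": pstdev(sizes),
        "mean_iat": mean(gaps) if gaps else 0.0,
        "max_iat": max(gaps) if gaps else 0.0,
    }


# Toy flow with three packets (made-up values).
example_flow = [(0.0, 100), (0.5, 300), (1.5, 200)]
features = flow_features(example_flow)
```

A feature dictionary like this one, paired with the domain name observed in plain text, would form one labeled training sample for the classifier.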
Dataset and code
In this GitHub repository, you can find the code that we used for the paper.
The code takes Tstat TCP log files as input and trains ML classifiers to predict the domain name each flow refers to. Given a list of Autonomous Systems (ASes), it trains the chosen ML classifier separately for each AS. A configurable subset of the clients in the logs is used for training; the remaining clients are used for testing.
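The client-based split described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the repository's implementation: the record fields `client`, `asn`, and `domain`, the function names, and the default `train_fraction` are all hypothetical, and real Tstat logs use their own column names.

```python
import random
from collections import defaultdict


def split_by_client(flows, train_fraction=0.5, seed=0):
    """Split flows so that each client lands entirely in train or in test."""
    clients = sorted({flow["client"] for flow in flows})
    random.Random(seed).shuffle(clients)  # seeded for reproducibility
    n_train = int(len(clients) * train_fraction)
    train_clients = set(clients[:n_train])
    train = [f for f in flows if f["client"] in train_clients]
    test = [f for f in flows if f["client"] not in train_clients]
    return train, test


def split_per_as(flows, train_fraction=0.5, seed=0):
    """Group flows by AS, then split each group by client."""
    groups = defaultdict(list)
    for flow in flows:
        groups[flow["asn"]].append(flow)
    return {
        asn: split_by_client(group, train_fraction, seed)
        for asn, group in groups.items()
    }


# Toy records using documentation-reserved addresses and AS numbers.
flows = [
    {"client": "192.0.2.1", "asn": 64500, "domain": "a.example"},
    {"client": "192.0.2.1", "asn": 64500, "domain": "b.example"},
    {"client": "192.0.2.2", "asn": 64500, "domain": "a.example"},
    {"client": "192.0.2.3", "asn": 64500, "domain": "b.example"},
    {"client": "192.0.2.4", "asn": 64500, "domain": "a.example"},
    {"client": "192.0.2.1", "asn": 64501, "domain": "c.example"},
    {"client": "192.0.2.2", "asn": 64501, "domain": "c.example"},
]
splits = split_per_as(flows)
```

Splitting by client rather than by flow keeps all traffic of a given user on one side, which mirrors the attack scenario: the model is trained on users who expose domain names and evaluated on users it has never seen.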
The output is a classification report detailing the performance of the classifiers across the different ASes and domains. Optionally, you can also save the trained models and the pre-processed CSV files for further analysis.
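For readers unfamiliar with the metrics in such a report, the sketch below computes precision, recall, and F1-score per domain from true and predicted labels. It is a plain-Python illustration of the standard definitions, not the report format produced by the repository; the function name `per_domain_report` and the toy labels are assumptions.

```python
def per_domain_report(y_true, y_pred):
    """Return precision, recall, F1 and support for every domain label."""
    report = {}
    for domain in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == domain and p == domain)
        fp = sum(1 for t, p in pairs if t != domain and p == domain)
        fn = sum(1 for t, p in pairs if t == domain and p != domain)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        # F1 is the harmonic mean of precision and recall.
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[domain] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": tp + fn,  # number of true flows for this domain
        }
    return report


# Toy labels (made up): one flow of a.example is misclassified as b.example.
y_true = ["a.example", "a.example", "b.example", "b.example"]
y_pred = ["a.example", "b.example", "b.example", "b.example"]
report = per_domain_report(y_true, y_pred)
```

In this toy case `a.example` has perfect precision but recall 0.5, while `b.example` has perfect recall but precision 2/3, which shows why the F1-score is a convenient single number per domain.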