Clustering of URLs from HTTP and HTTPS traces – dataset

On this page we show the results of applying LENTA to an HTTPS trace we collected from volunteers. For details on the methodology, please see the paper: Morichetta, Andrea, and Marco Mellia. “LENTA: Longitudinal Exploration for Network Traffic Analysis.” 2018 30th International Teletraffic Congress (ITC 30). Vol. 1. IEEE, 2018.

Here we report on our experiment with HTTPS traffic. Traffic was collected from a set of more than 100 volunteers during a one-month analysis on web privacy. Our data collection program was explicitly approved by the volunteers, and the project was also subject to a privacy impact assessment carried out with the data protection officer of our institution.

Users are considered individually (albeit after anonymization). Starting from the traffic generated each day by each volunteer, we run IDBSCAN to group similar URLs and offer the resulting clusters to the network analyst, who can then label each URL cluster to identify the accessed web service and so gain a better understanding of the traffic the network carries.
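The per-user, per-day grouping described above can be sketched as follows. This is only an illustration and not the LENTA implementation: it uses plain DBSCAN as a stand-in for IDBSCAN, a string distance derived from Python's `difflib` as an assumed similarity measure, and arbitrary values for `eps` and `min_pts`. The function names and the `(user, day, url)` record layout are hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def url_distance(a: str, b: str) -> float:
    """Assumed distance in [0, 1]: 0 for identical strings, larger for dissimilar ones."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def dbscan(items, eps, min_pts, dist=url_distance):
    """Plain DBSCAN (a stand-in, not IDBSCAN). Returns one label per item; -1 is noise."""
    n = len(items)
    # Neighbourhood of each point, the point itself included.
    neighbours = [
        [j for j in range(n) if dist(items[i], items[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_pts:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins the cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_pts:
                queue.extend(neighbours[j])  # core point: keep expanding
    return labels


def cluster_daily_urls(records, eps=0.4, min_pts=2):
    """Group (user, day, url) records by user and day, then cluster each group's URLs.

    Returns {(user, day): {label: [urls]}}, where label -1 collects unclustered URLs.
    """
    by_key = defaultdict(list)
    for user, day, url in records:
        by_key[(user, day)].append(url)

    result = {}
    for key, urls in by_key.items():
        unique_urls = sorted(set(urls))
        labels = dbscan(unique_urls, eps, min_pts)
        groups = defaultdict(list)
        for url, label in zip(unique_urls, labels):
            groups[label].append(url)
        result[key] = dict(groups)
    return result
```

An analyst would then inspect each returned group and attach a label to it; URLs left under label -1 did not fall into any cluster for that user and day.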

The figure above reports the categories of the most commonly accessed services. Categories are assigned after a manual inspection of the clusters. We focus on the 171 clusters that were visited by at least two users over the whole period. The figure reports the fraction of clusters that fall into each category. Almost a quarter of the clusters are related to third-party services, which include advertisement, web tracking and analytics; they are pervasive enough to affect the results of the system. Not surprisingly, the second position is held by social networks. Third are cloud services belonging to Google and some CDNs used for image storage. Each of the remaining categories accounts for no more than 5% of the overall set. Overall, we were able to easily map URLs to categories.
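The two steps behind the figure (keeping only clusters seen by at least two users, then computing the fraction of clusters per category) can be sketched as below. The dictionary layout with `category` and `users` keys and the function name are assumptions made for illustration; they are not the format of the released dataset.

```python
from collections import Counter


def category_shares(clusters, min_users=2):
    """Fraction of clusters per category, over clusters seen by >= min_users users.

    `clusters` is an iterable of dicts with hypothetical keys:
      'category': the manually assigned label,
      'users':    the (anonymized) ids of users whose traffic hit the cluster.
    """
    kept = [c for c in clusters if len(set(c["users"])) >= min_users]
    if not kept:
        return {}
    counts = Counter(c["category"] for c in kept)
    # most_common() orders categories from the largest share to the smallest.
    return {category: n / len(kept) for category, n in counts.most_common()}
```

Calling `category_shares` on the labelled clusters returns a mapping from category to its share, which is the quantity plotted in the figure; the `min_users` threshold mirrors the "at least two users" filter described in the text.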

The manual labelling is greatly simplified by the rich information offered by the URLs in each cluster. To illustrate, this table details some of the URL clusters identified by LENTA, along with the label we assigned to each of them. Little domain knowledge is needed to assign a label to a cluster (and thus to all of its URLs).