Domains per web users

This page collects the open datasets and code used in the papers:

  1. L. Vassio, D. Giordano, M. Trevisan, M. Mellia, A.P.C. da Silva. Users’ Fingerprinting Techniques from TCP Traffic. ACM SIGCOMM Workshop on Big Data Analytics and Machine Learning for Data Communication Networks, Los Angeles, USA, pp. 49-54, 2017. DOI: 10.1145/3098593.3098602
  2. L. Vassio, M. Mellia, F. Figueiredo, A.P.C. da Silva, J.M. Almeida. Mining and modeling web trajectories from passive traces. In: 2017 IEEE International Conference on Big Data (Big Data), 4016-4021, IEEE, 2017. DOI: 10.1109/BigData.2017.8258416
  3. A. Faroughi, A. Morichetta, L. Vassio, F. Figueiredo, M. Mellia, R. Javidan. Towards website domain name classification using graph based semi-supervised learning. Computer Networks, Elsevier, 188, 107865, 2021. DOI: 10.1016/j.comnet.2021.107865

L. Vassio, D. Giordano, M. Trevisan, M. Mellia, A.P.C. da Silva. Users’ Fingerprinting Techniques from TCP Traffic. ACM SIGCOMM Workshop on Big Data Analytics and Machine Learning for Data Communication Networks, Los Angeles, USA, pp. 49-54, 2017. DOI: 10.1145/3098593.3098602

The anonymized visited domains and the list of core domains used to perform the experiments are reported.

The dataset of visited domains can be downloaded from here (590MB). It consists of 4 space-separated columns with the following structure:

<clientIP> <timestamp> <domainID> <Core>

  1. <clientIP> The anonymized Client IP address
  2. <timestamp> The time at which the flow was generated, in seconds
  3. <domainID>  The anonymized Domain as an integer between 000001 and 499999
  4. <Core> A flag stating if the Domain is a Core Domain (True) or a Support Domain (False)
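
As a minimal sketch of how one line of this log could be read, the Python function below splits a record into its four fields. The names `Visit` and `parse_visit_line` are hypothetical, and the sketch assumes that the flag is written literally as True or False, that the timestamp parses as a number, and that the client IP can be treated as an opaque string; check these assumptions against the downloaded file.

```python
from typing import NamedTuple


class Visit(NamedTuple):
    """One record of the visited-domains log (hypothetical container)."""
    client_ip: str    # anonymized client identifier, kept as an opaque string
    timestamp: float  # seconds, as given in the file
    domain_id: int    # anonymized domain identifier
    is_core: bool     # True for a Core Domain, False for a Support Domain


def parse_visit_line(line: str) -> Visit:
    """Parse '<clientIP> <timestamp> <domainID> <Core>' into a Visit."""
    fields = line.split()
    if len(fields) != 4:
        raise ValueError(f"expected 4 columns, got {len(fields)}: {line!r}")
    client_ip, timestamp, domain_id, core = fields
    # Assumption: the Core flag is the literal text 'True' or 'False'.
    return Visit(client_ip, float(timestamp), int(domain_id), core == "True")
```

Zero-padded identifiers such as 000123 are converted to plain integers, so they compare equal regardless of padding.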

The list of 1000 Core Domains can be downloaded from here and consists of 2 columns:

  1. The domain
  2. Whether the domain is a Core Domain (Core) or a Support Domain (Support)
Log visited domains: download 
Labeled Core/Support domains: download

L. Vassio, M. Mellia, F. Figueiredo, A.P.C. da Silva, J.M. Almeida. Mining and modeling web trajectories from passive traces. In: 2017 IEEE International Conference on Big Data (Big Data), 4016-4021, IEEE, 2017. DOI: 10.1109/BigData.2017.8258416

Anonymized trajectories of domains and their TribeFlow models are reported.

The dataset of domain trajectories can be downloaded from here. It consists of 4 space-separated columns with the following structure:

<timestamp> <clientIP> <originDomainID> <destinationDomainID>

  1. <timestamp> The time at which the flow was generated, in seconds
  2. <clientIP> The anonymized Client IP address
  3. <originDomainID> The anonymized origin Domain as an integer
  4. <destinationDomainID> The anonymized destination Domain as an integer
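
To illustrate how these records could be used, the following Python sketch counts how often each origin-to-destination transition appears in the log. The function name `count_transitions` is hypothetical, and the sketch assumes that both domain identifiers parse as integers and that lines without exactly four columns can be skipped. It is only a simple aggregation, not the TribeFlow model used in the paper.

```python
from collections import Counter
from typing import Iterable, Tuple


def count_transitions(lines: Iterable[str]) -> Counter:
    """Count (origin, destination) pairs over trajectory records.

    Each record is expected to look like
    '<timestamp> <clientIP> <originDomainID> <destinationDomainID>'.
    """
    counts: Counter = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            # Assumption: blank or malformed lines are ignored.
            continue
        _timestamp, _client_ip, origin, destination = fields
        pair: Tuple[int, int] = (int(origin), int(destination))
        counts[pair] += 1
    return counts
```

Because the function accepts any iterable of strings, an open file handle can be passed directly, which avoids loading the whole log into memory.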

The TribeFlow models:

  1. Campus model
Log trajectories visited domains: download 
Trained campus model: download

A. Faroughi, A. Morichetta, L. Vassio, F. Figueiredo, M. Mellia, R. Javidan. Towards website domain name classification using graph based semi-supervised learning. Computer Networks, Elsevier, 188, 107865, 2021. DOI: 10.1016/j.comnet.2021.107865

The partially labeled dataset of visited domains and their categories can be downloaded from here. The file dataset.txt (70MB) consists of 4 space-separated columns with the following structure:

<userID> <hourID> <domainList> <labelList>

Where:

  1. <userID> is a unique anonymized identifier for the user in the whole dataset. UserID values are in the range [0, 2637];
  2. <hourID> is the sequential identifier of the hour of reference. HourID values are in the range [0, 935];
  3. <domainList> is a comma-separated list of domain names, enclosed in square brackets;
  4. <labelList> is a comma-separated list of categories, enclosed in square brackets. Its entries correspond by position to the entries of <domainList>. A domain without a label is marked with “[]”.

For example:

0 684 [www.deezer.com,store.steampowered.com,www.adobe.com] [Arts_and_Entertainment,[],Computer_and_Electronics]

means that user “0” in hour “684” visited 3 domains: 1) “www.deezer.com”, with corresponding label “Arts_and_Entertainment”, 2) “store.steampowered.com”, without any label, and 3) “www.adobe.com”, with label “Computer_and_Electronics”.
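
The format above can be read with a short Python sketch such as the one below. The function name `parse_labeled_line` is hypothetical, and the sketch assumes that neither list contains spaces or nested commas, as in the example line; unlabeled domains are returned with the label None.

```python
from typing import List, Optional, Tuple


def parse_labeled_line(
    line: str,
) -> Tuple[int, int, List[Tuple[str, Optional[str]]]]:
    """Parse '<userID> <hourID> <domainList> <labelList>'.

    Returns (user_id, hour_id, [(domain, label_or_None), ...]).
    """
    fields = line.split()
    if len(fields) != 4:
        raise ValueError(f"expected 4 columns, got {len(fields)}: {line!r}")
    user_id, hour_id, domains_raw, labels_raw = fields

    # Strip the enclosing square brackets, then split on commas.
    domains_inner = domains_raw[1:-1]
    labels_inner = labels_raw[1:-1]
    domains = domains_inner.split(",") if domains_inner else []
    raw_labels = labels_inner.split(",") if labels_inner else []

    # An unlabeled domain is written as '[]'; map it to None.
    labels = [None if label == "[]" else label for label in raw_labels]
    if len(domains) != len(labels):
        raise ValueError("domain and label lists differ in length")
    return int(user_id), int(hour_id), list(zip(domains, labels))
```

Applied to the example line, the function returns user 0, hour 684, and three (domain, label) pairs in which “store.steampowered.com” is paired with None.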

The Python code to reproduce the experiments can be downloaded from here. The file “code_description.rtf” in the compressed folder describes the modules and explains how to replicate the results reported in the paper.

Labeled visited domains: download
Python code for classification and parameters: download

For more details, please check the papers or contact us.