Domains per web users

This page collects the open datasets used in the papers:


Azadeh Faroughi, Andrea Morichetta, Luca Vassio, Flavio Figueiredo, Marco Mellia and Reza Javidan, Towards Website Domain Name Classification Using Graph Based Semi-supervised Learning, Currently Under Review, 2020

The partially labeled dataset with the visited domains and their categories can be downloaded from here. The file dataset.txt (70MB) is composed of 4 columns, separated by a space character, with the following structure:

<userID> <hourID> <domainList> <labelList>

Where:

  1. <userID> is a unique anonymized identifier for the user in the whole dataset. UserIDs are in the range [0, 2637];
  2. <hourID> is the sequential identifier of the hour of reference. HourIDs are in the range [0, 935];
  3. <domainList> is a list of domain names, separated by commas and enclosed in square brackets;
  4. <labelList> is a list of categories, separated by commas and enclosed in square brackets. Its entries follow the same order as <domainList>, so the i-th label belongs to the i-th domain. A domain without a label is marked with “[]”.

For example:

0 684 [www.deezer.com,store.steampowered.com,www.adobe.com] [Arts_and_Entertainment,[],Computer_and_Electronics]

means that user “0” in hour “684” visited 3 domains: 1) “www.deezer.com”, with the corresponding label “Arts_and_Entertainment”; 2) “store.steampowered.com”, without any label; and 3) “www.adobe.com”, with the label “Computer_and_Electronics”.
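
The record format above can be read with a few lines of Python. The following is a minimal sketch, not the authors' tooling: the names HourlyVisits and parse_line are hypothetical, and it assumes that neither domains nor labels contain spaces or commas, which holds for the example line but is not guaranteed by this page.

```python
from typing import List, NamedTuple, Optional, Tuple


class HourlyVisits(NamedTuple):
    """One line of dataset.txt (hypothetical container, not from the papers)."""
    user_id: int
    hour_id: int
    # (domain, label) pairs; label is None when the domain is unlabeled ("[]")
    visits: List[Tuple[str, Optional[str]]]


def _split_bracketed(field: str) -> List[str]:
    """Split '[a,b,c]' into ['a', 'b', 'c']; '[]' gives an empty list."""
    if len(field) < 2 or field[0] != "[" or field[-1] != "]":
        raise ValueError(f"expected a bracketed list, got {field!r}")
    inner = field[1:-1]
    return inner.split(",") if inner else []


def parse_line(line: str) -> HourlyVisits:
    """Parse '<userID> <hourID> <domainList> <labelList>'."""
    fields = line.split()
    if len(fields) != 4:
        raise ValueError(f"expected 4 space-separated columns, got {len(fields)}")
    user_id, hour_id, domain_field, label_field = fields

    domains = _split_bracketed(domain_field)
    # An unlabeled domain appears as the literal token "[]" inside the label list.
    labels = [None if lab == "[]" else lab for lab in _split_bracketed(label_field)]

    if len(domains) != len(labels):
        raise ValueError("domain list and label list have different lengths")
    return HourlyVisits(int(user_id), int(hour_id), list(zip(domains, labels)))


example = (
    "0 684 [www.deezer.com,store.steampowered.com,www.adobe.com] "
    "[Arts_and_Entertainment,[],Computer_and_Electronics]"
)
record = parse_line(example)
print(record.visits)
# [('www.deezer.com', 'Arts_and_Entertainment'),
#  ('store.steampowered.com', None),
#  ('www.adobe.com', 'Computer_and_Electronics')]
```

Pairing each domain with its label at parse time keeps the index correspondence between the two lists explicit, and the length check flags lines where the assumption about commas does not hold.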


Luca Vassio, Danilo Giordano, Martino Trevisan, Marco Mellia, Ana Paula Couto da Silva, Users’ Fingerprinting Techniques from TCP Traffic, ACM SIGCOMM Workshop on Big Data Analytics and Machine Learning for Data Communication Networks, Los Angeles, USA, August 2017

The anonymized visited domains and the list of core domains used to perform the experiments are reported.

The dataset with the visited domains can be downloaded from here and is composed of 4 columns:

  1. The anonymized client IP address
  2. The timestamp at which the flow was generated, in seconds
  3. The domain, anonymized as a number between 000001 and 500k
  4. A flag stating whether the domain is a Core Domain (True) or a Support Domain (False)
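
As an illustration of the four columns, here is a minimal Python sketch for reading one row. The page does not state the column separator, so the sep parameter (whitespace by default) is an assumption to adjust after inspecting the file. The names FlowRecord and parse_flow_line are hypothetical, and the sample row is made up for the example.

```python
from typing import NamedTuple, Optional


class FlowRecord(NamedTuple):
    """One row of the visited-domains file (hypothetical container)."""
    client: str       # anonymized client IP address, kept as an opaque string
    timestamp: float  # time the flow was generated, in seconds
    domain_id: str    # anonymized domain; a string so leading zeros survive
    is_core: bool     # True for a Core Domain, False for a Support Domain


def parse_flow_line(line: str, sep: Optional[str] = None) -> FlowRecord:
    """Parse one row; sep=None splits on whitespace (assumed, not documented)."""
    fields = [f.strip() for f in line.strip().split(sep)]
    if len(fields) != 4:
        raise ValueError(f"expected 4 columns, got {len(fields)}")
    client, timestamp, domain_id, flag = fields
    if flag not in ("True", "False"):
        raise ValueError(f"unexpected core-domain flag: {flag!r}")
    return FlowRecord(client, float(timestamp), domain_id, flag == "True")


# Made-up sample row, only to show the call.
row = parse_flow_line("c17 1490000000 000042 True")
print(row.domain_id, row.is_core)
```

Keeping the domain identifier as a string avoids turning 000042 into 42, which matters if the identifiers are later matched against another file.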

The list of 1000 Core Domains can be downloaded from here and is composed of 2 columns:

  1. The domain
  2. Whether the domain is a Core Domain (Core) or a Support Domain (Support)


Luca Vassio, Flavio Figueiredo, Ana Paula Couto da Silva, Marco Mellia, Jussara Almeida, Mining and Modeling Web Trajectories from Passive Traces, IEEE BigData 2017 DS4N, Boston, MA, December 2017

Anonymized trajectories of domains and their TribeFlow models are reported.

The dataset with the visited domains can be downloaded from here and is composed of 4 columns:

  1. Timestamp in seconds
  2. The anonymized client IP address
  3. The original domain, anonymized as an integer
  4. The landing domain, anonymized as an integer

The TribeFlow models:

  1. Campus model (download here)

For more details, please check the papers or contact us.