Clickstreams towards adult websites: anonymized dataset

The data are available to the community in anonymized form for further investigation at the following link.

The dataset has been used for the following paper:

Andrea Morichetta, Martino Trevisan, and Luca Vassio. Characterizing web pornography consumption from passive measurements. Passive and Active Measurement Conference (PAM) 2019. Puerto Varas, Chile, March 27-29, 2019.

From HTTP traffic monitored at a Point of Presence of an ISP in Italy, we extract user-actions towards adult websites. Users are characterized by anonymized IP and user agent.
In total, we have 58 million user-actions towards 59,989 different adult domains. The data span July 2013 to March 2018 (57 months), with an average of 13,261 different users per month.

The shared compressed archive (165 MB) contains archives for each month. The file domain.csv contains the dictionary of domain names, structured as follows:

<domainID>, <domainName>

There is a one-to-one correspondence between the domain name string <domainName> and its hash value <domainID> (an integer between 0 and 999999).
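As an illustration of how this dictionary might be loaded, the following is a minimal Python sketch. It assumes the file is plain comma-separated text with an optional space after the comma, as in the layout above, and it skips any line whose first field is not an integer (for example a header row, if one exists). The function name `load_domains` and the sample rows are hypothetical and not part of the dataset.

```python
import csv
import io


def load_domains(lines):
    """Build a {domainID: domainName} dictionary from an iterable of CSV lines."""
    domains = {}
    for row in csv.reader(lines, skipinitialspace=True):
        # Skip blank lines, malformed lines, and a possible header row.
        if len(row) < 2 or not row[0].strip().isdigit():
            continue
        domains[int(row[0])] = row[1].strip()
    return domains


# Hypothetical sample rows in the <domainID>, <domainName> layout.
sample = io.StringIO("12345, example.com\n678, example.org\n")
domains = load_domains(sample)

# With the real file, one would pass an open file handle instead:
#   with open("domain.csv", newline="") as f:
#       domains = load_domains(f)
```

Keeping the mapping in a dictionary keyed by the integer ID makes it cheap to translate the <domainID> values found in the monthly logs back to domain names.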

The files <yyyymm>_users.csv, where <yyyymm> indicates the year (2014, 2015, 2016, or 2017) and the month (1 to 12), contain the anonymized list of individual users for that month:

<userID>, <typeOS>, <typeBrowser>, <typeDevice>

where <userID> is the hash value of the user identifier (an integer between 0 and 999999), <typeOS> indicates the OS in use (e.g., Windows10), <typeBrowser> indicates the browser used to visit the page (e.g., Chrome), and <typeDevice> indicates whether the device was a PC, a smartphone, or a tablet. <typeOS>, <typeBrowser>, and <typeDevice> are extracted from the user agent of the original HTTP request.
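A monthly user file could be read along the following lines. This is a sketch under assumptions: the four fields appear in the order shown above, there may be a space after each comma, and lines that do not start with an integer are skipped. The names `User` and `load_users` are hypothetical, and the sample values (in particular the exact spelling of the OS and device labels) are invented for illustration.

```python
import csv
import io
from collections import Counter
from typing import NamedTuple


class User(NamedTuple):
    user_id: int
    os: str
    browser: str
    device: str


def load_users(lines):
    """Parse <userID>, <typeOS>, <typeBrowser>, <typeDevice> rows into User tuples."""
    users = []
    for row in csv.reader(lines, skipinitialspace=True):
        # Skip blank lines, malformed lines, and a possible header row.
        if len(row) < 4 or not row[0].strip().isdigit():
            continue
        users.append(User(int(row[0]), row[1].strip(), row[2].strip(), row[3].strip()))
    return users


# Hypothetical sample rows; the real label spellings may differ.
sample = io.StringIO("42, Windows10, Chrome, PC\n7, Android, Chrome, smartphone\n")
users = load_users(sample)

# Example aggregate: number of users per device type in this month.
device_counts = Counter(u.device for u in users)
```

Because user identifiers are derived from the anonymized IP and user agent, counts like `device_counts` describe user-agent combinations seen in a month rather than physical people.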

The logs are compressed and named <yyyymm>.csv.gz, where <yyyymm> indicates the year (2014, 2015, 2016, or 2017) and the month (1 to 12). Inside each file, each line holds the following information:

<timestamp>,<domainID>,<userID>

where <timestamp> is Unix time in seconds, with a granularity of 1 second, <domainID> is the hash value of the domain, and <userID> is the hash value of the user identifier.
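The monthly logs can be streamed without decompressing them to disk. The sketch below is one possible way to do this in Python, assuming integer timestamps and the three-field layout above. The function name `read_log`, the zero-padded file name `201401.csv.gz`, and the sample records are hypothetical; the small file is created only so that the example is self-contained.

```python
import csv
import gzip
import os
import tempfile
from collections import Counter
from datetime import datetime, timezone


def read_log(path):
    """Yield (UTC datetime, domainID, userID) tuples from a gzip-compressed log."""
    with gzip.open(path, mode="rt", newline="") as f:
        for row in csv.reader(f, skipinitialspace=True):
            if len(row) < 3:
                continue  # blank or malformed line
            try:
                ts, domain_id, user_id = int(row[0]), int(row[1]), int(row[2])
            except ValueError:
                continue  # non-numeric line, e.g. a possible header
            # Unix time counts seconds since 1970-01-01 00:00:00 UTC.
            yield datetime.fromtimestamp(ts, tz=timezone.utc), domain_id, user_id


# Build a tiny hypothetical log so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "201401.csv.gz")
    with gzip.open(path, "wt", newline="") as f:
        f.write("1388534400,12345,42\n1388534401,12345,7\n1388534460,678,42\n")
    records = list(read_log(path))

# Example aggregate: number of user-actions per domain.
actions_per_domain = Counter(domain_id for _, domain_id, _ in records)
```

The timestamps are converted to UTC here; since the vantage point is in Italy, an analysis of daily patterns would likely want to convert them to local time afterwards.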