Moreover, a sample dataset is provided. It contains the traffic generated by an automated browser visiting a large set of popular websites.
For information about this README file and this tool, please write to Martino Trevisan.
The dataset
The file dataset_awesome.csv contains the TCP flows generated by a browser that was instrumented to visit 2,500 websites; some of them were visited multiple times. The file is in CSV format and contains one row for each TCP flow, with 4 columns:
- The timestamp of the first packet of the flow (Unix epoch, in seconds)
- The server IP address
- The domain name of the server
- The name of the website the browser was visiting at that time (i.e., the originating web service)
The list of the visited websites is available in the file services_awesome.txt.
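If you want to explore the dataset programmatically, the following is a minimal Python sketch that reads the CSV with the standard library and prints the first few flows. The column labels used here are our own, matching the four fields listed above, and we assume the file has no header row.

import csv

# Illustrative labels for the four columns described above;
# dataset_awesome.csv itself is assumed to have no header row.
COLUMNS = ["timestamp", "server_ip", "domain", "service"]

with open("dataset_awesome.csv", newline="") as f:
    for i, row in enumerate(csv.reader(f)):
        print(dict(zip(COLUMNS, row)))
        if i == 4:  # stop after the first five flows
            break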
How to run the code
BoD creation
To begin, you must create the Bags of Domains (BoDs) from the provided trace. You can run:
./create_BoDs.py services_awesome.txt dataset_awesome.csv bags.json
It creates the BoD file bags.json for all the services contained in services_awesome.txt. You can manually inspect the bag file to become familiar with the BoD structure.
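For instance, a few lines of Python are enough to peek at one bag. Note that the exact JSON layout is an assumption here (a mapping from service name to its bag of domains), so adapt the keys to what you actually find in the file.

import json

with open("bags.json") as f:
    bags = json.load(f)

# Assuming the file maps each service name to its Bag of Domains;
# print one entry to check the actual structure.
service, bag = next(iter(bags.items()))
print(service, "->", bag)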
Flow classification
Then, you can run the AWESoME classifier on the same trace, providing it the BoD file just created. Just run:
./classify_flows.py dataset_awesome.csv bags.json classified_flows.csv
It classifies the input trace dataset_awesome.csv and writes the output to classified_flows.csv. The latter is a copy of the input file, with an extra column (in last position) indicating the result of the classification: the name of the service to which AWESoME attributed that flow.
You can verify that more than 95% of the flows have the correct label with:
$ echo $(awk -F, '$4==$5' classified_flows.csv | wc -l)*100/$(cat classified_flows.csv | wc -l) | bc
95
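If you prefer Python to the shell one-liner, the sketch below computes the same overall figure plus a per-service breakdown. As in the awk command above, it assumes the ground-truth service is in column 4 and the AWESoME label in column 5.

import csv
from collections import Counter

total, correct = Counter(), Counter()
with open("classified_flows.csv", newline="") as f:
    for row in csv.reader(f):
        truth, predicted = row[3], row[4]  # ground truth vs. AWESoME label
        total[truth] += 1
        if truth == predicted:
            correct[truth] += 1

overall = 100 * sum(correct.values()) / sum(total.values())
print(f"Overall accuracy: {overall:.1f}%")

# Accuracy for the ten services with the most flows
for service in sorted(total, key=total.get, reverse=True)[:10]:
    print(f"{service}: {100 * correct[service] / total[service]:.1f}%")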
Enjoy!