Moreover, a sample dataset is provided. It contains the traffic generated by an automated browser visiting a large set of popular websites.
For information about this README file and this tool, please write to Martino Trevisan.
The dataset
The file dataset_awesome.csv contains the TCP flows generated by a browser that was instrumented to visit 2,500 websites; some of them were visited multiple times. The file is in CSV format and contains one row for each TCP flow, with 4 columns:
- The timestamp of the first packet of the flow (Unix epoch, in seconds)
- The server IP address
- The domain name of the server
- The name of the website the browser was visiting at that time (i.e., the originating web service)
The list of the visited websites is available in the file services_awesome.txt.
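If you want to explore the dataset programmatically, the following is a minimal Python sketch that reads the CSV with the standard library and prints the first few flows. The column labels used here are our own, matching the four fields listed above, and we assume the file has no header row.

import csv

# Illustrative labels for the four columns described above;
# dataset_awesome.csv itself is assumed to have no header row.
COLUMNS = ["timestamp", "server_ip", "domain", "service"]

with open("dataset_awesome.csv", newline="") as f:
    for i, row in enumerate(csv.reader(f)):
        print(dict(zip(COLUMNS, row)))
        if i == 4:  # stop after the first five flows
            break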
How to run the code
BoD creation
To begin, you must create the Bags of Domains (BoDs) from the provided trace. You can run:
./create_BoDs.py services_awesome.txt dataset_awesome.csv bags.json
It creates the BoD file bags.json for all the services contained in services_awesome.txt. You can manually inspect the bag file to become familiar with the BoD structure.
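For instance, a few lines of Python are enough to peek at one bag. Note that the exact JSON layout is an assumption here (a mapping from service name to its bag of domains), so adapt the keys to what you actually find in the file.

import json

with open("bags.json") as f:
    bags = json.load(f)

# Assuming the file maps each service name to its Bag of Domains;
# print one entry to check the actual structure.
service, bag = next(iter(bags.items()))
print(service, "->", bag)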
Flow classification
Then, you can run the AWESoME classifier on the same trace, providing it the BoD file just created. Just run:
./classify_flows.py dataset_awesome.csv bags.json classified_flows.csv
It classifies the input trace dataset_awesome.csv and writes the output to classified_flows.csv. The latter is a copy of the input file, with an extra column (in last position) indicating the result of the classification: the name of the service to which AWESoME attributed that flow.
You can verify that more than 95% of the flows have the correct label with:
$ echo $(awk -F, '$4==$5' classified_flows.csv | wc -l)*100/$(cat classified_flows.csv | wc -l) | bc
95
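If you prefer Python to the shell one-liner, the sketch below computes the same overall figure plus a per-service breakdown. As in the awk command above, it assumes the ground-truth service is in column 4 and the AWESoME label in column 5.

import csv
from collections import Counter

total, correct = Counter(), Counter()
with open("classified_flows.csv", newline="") as f:
    for row in csv.reader(f):
        truth, predicted = row[3], row[4]  # ground truth vs. AWESoME label
        total[truth] += 1
        if truth == predicted:
            correct[truth] += 1

overall = 100 * sum(correct.values()) / sum(total.values())
print(f"Overall accuracy: {overall:.1f}%")

# Accuracy for the ten services with the most flows
for service in sorted(total, key=total.get, reverse=True)[:10]:
    print(f"{service}: {100 * correct[service] / total[service]:.1f}%")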
Enjoy!