Moreover, a sample dataset is provided. It contains the traffic generated by an automated browser visiting a (quite large) set of popular websites.
For information about this Readme file and this tool please write to Martino Trevisan.
dataset_awesome.csv contains the TCP flows generated by a browser that was instrumented to visit 2,500 websites. Some of them were visited multiple times. The file is in CSV format and contains one row for each TCP flow. It has 4 columns:
- The timestamp of the first packet of the flow (in seconds since the Unix epoch)
- The server IP address
- The domain name of the server
- The name of the website that the browser was visiting at that time (i.e. the originating web service)
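As an illustration of the four-column layout described above, here is a minimal Python sketch that parses such rows. The class name Flow, its field names, and the sample rows are made up for this example and do not come from the tool. The sketch also assumes the file has no header row and that columns are separated by plain commas.

```python
import csv
import io
from typing import Iterable, Iterator, NamedTuple


class Flow(NamedTuple):
    """One TCP flow; field names are hypothetical, chosen for this sketch."""
    timestamp: float   # first packet of the flow, seconds since the epoch
    server_ip: str     # server IP address
    domain: str        # domain name of the server
    service: str       # website the browser was visiting


def read_flows(lines: Iterable[str]) -> Iterator[Flow]:
    """Yield one Flow per CSV row, skipping rows with fewer than 4 columns."""
    for row in csv.reader(lines):
        if len(row) < 4:
            continue
        ts, ip, domain, service = row[:4]
        yield Flow(float(ts), ip, domain, service)


# Made-up sample rows in the documented column order (assumption: no header).
sample = io.StringIO(
    "1500000000.12,192.0.2.10,static.example.org,example.org\n"
    "1500000001.50,192.0.2.11,cdn.example.net,example.org\n"
)
flows = list(read_flows(sample))
print(flows[0].domain, "->", flows[0].service)
```

To read the real file, pass an open file object instead of the in-memory sample, for example open("dataset_awesome.csv", newline="").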
The list of the visited websites is available in the file
How to run the code
To begin, you must create the Bags of Domains (BoDs) using the provided trace. You can run:
./create_BoDs.py services_awesome.txt dataset_awesome.csv bags.json
It creates the BoD file bags.json for all the services listed in services_awesome.txt. You can manually inspect the bag file to become familiar with BoDs.
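For a first look at the bag file, a small Python sketch like the following may help. The internal layout of bags.json is not described in this Readme, so the sketch only reports the top-level JSON type and a few entries. The function name summarize_json and the example string are invented for illustration and are not real BoD data.

```python
import json


def summarize_json(text: str) -> str:
    """Describe the top-level structure of a JSON document."""
    data = json.loads(text)
    if isinstance(data, dict):
        keys = list(data)[:5]  # show at most five keys
        return f"object with {len(data)} keys, e.g. {keys}"
    if isinstance(data, list):
        return f"array with {len(data)} items"
    return f"scalar of type {type(data).__name__}"


# Made-up example; the real bags.json may be organized differently.
example = '{"example.org": ["static.example.org", "cdn.example.net"]}'
print(summarize_json(example))
```

To apply it to the real file, read it first, for example summarize_json(open("bags.json").read()).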
Then, you can run the AWESoME classifier on the same trace, passing it the BoD file just created. Just run:
./classify_flows.py dataset_awesome.csv bags.json classified_flows.csv
It classifies the input trace dataset_awesome.csv and writes the output to classified_flows.csv. The latter file is a copy of the input one, with an extra column (in last position) indicating the result of the classification. This extra column contains the name of the service to which AWESoME attributed that flow.
You can verify that more than 95% of the flows have the correct label with:
$ echo $(awk -F, '$4==$5' classified_flows.csv | wc -l)*100/$(cat classified_flows.csv | wc -l) | bc
95
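The same check can be sketched in Python. This is only a rough equivalent of the awk one-liner above: it compares column 4 (the visited website) with column 5 (the label added by the classifier) and uses integer division, as bc does by default. The function name accuracy_percent and the sample rows are made up for this example. It assumes the file has no header row.

```python
import csv
import io
from typing import Iterable


def accuracy_percent(lines: Iterable[str]) -> int:
    """Return the truncated percentage of rows where column 4 equals column 5."""
    total = 0
    correct = 0
    for row in csv.reader(lines):
        if not row:
            continue  # ignore blank lines
        total += 1
        if len(row) >= 5 and row[3] == row[4]:
            correct += 1
    # Integer division mirrors the default behaviour of bc (scale=0).
    return correct * 100 // total if total else 0


# Made-up rows: three flows labelled correctly, one labelled incorrectly.
sample = io.StringIO(
    "1500000000.1,192.0.2.10,a.example.org,example.org,example.org\n"
    "1500000000.2,192.0.2.11,b.example.org,example.org,example.org\n"
    "1500000000.3,192.0.2.12,c.example.net,example.net,example.net\n"
    "1500000000.4,192.0.2.13,d.example.net,example.net,example.org\n"
)
print(accuracy_percent(sample))
```

To run it on the classifier output, pass open("classified_flows.csv", newline="") instead of the in-memory sample.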