Published News URLs datasets and categorization scripts

We make available to the community our categorization scripts (which show, for example, how the categories or sections assigned to articles by editors are grouped together to form broad categories), as well as a sample of 80K articles categorized using this method (the Published-articles dataset in our paper).

This should let readers who know Italian gauge the diversity of sub-categories within categories such as “Ed. Columns” (“Rubriche”) and “International news” (“Esteri”). One might indeed expect these two categories to overlap with the politics category.

The data is available to the community at the following link. It has been used in the following paper:

Z. B. Houidi, G. Scavo, S. Traverso, R. Teixeira, M. Mellia, S. Ganguly, “The News We Like Are Not the News We Visit: News Categories Popularity in Usage Data”, AAAI ICWSM 2019.

The compressed archive contains two folders. The first contains all news articles (around 80 thousand) published by major news outlets (more than 500) in Italy over a period of 3 months. The second contains the categorization scripts as well as a description of how to run them on the dataset in the first folder.

Published News articles

The first folder (list_published_URLs_3months) contains a Python pickle (Published_URLs.pkl) of a dictionary whose keys are URLs and whose values are lists containing:

[title, text, outlet name, publication date, crawling date]

Example of how to explore the data:

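>>> import pickle
>>> # Open in binary mode; the name 'f' avoids shadowing the built-in input().
>>> with open('Published_URLs.pkl','rb') as f:
...     dict_urls=pickle.load(f)
...
>>> # Note: popitem() returns an arbitrary (URL, record) pair and removes it from the dictionary.
>>> dict_urls.popitem()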
(u'https://bari.ilquotidianoitaliano.com/cronaca/2018/01/news/bitonto-auto-si-ribalta-in-via-planelli-ferita-60enne-di-modugno-185281.html/',
 [u'Bitonto, auto si ribalta in via Planelli',
  u'Verso le 9 di questa mattina si \xe8 verificato un incidete in via Planelli, alle porte di Bitonto. Forse a causa dell\u2019asfalto viscido, una piccola utilitaria Ford ha iniziato a sbandare, finendo la sua corsa ribaltata sul fianco destro.\n\nNell\u2019incidente, la donna 60enne originaria di Modugno ha riportato ferite giudicate lievi dai dai sanitari del 118, intervenuti sul posto. Dopo le prime medicazioni, la guidatrice \xe8 stata trasportata al Policlinico. Per effettuare i rilievi e chiare cosa sia successo, sono intervenuti gli agenti della Polizia Locale.\n\n1 di 3\n\nprint',
  u'ilquotidianoitaliano',
  {u'$date': u'2018-01-13T12:12:21.000Z'},
  u'2018-01-13T12:00:01.480Z'])
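
The session above appears to come from Python 2 (note the u'' prefixes). For readers on Python 3, the following is a minimal sketch of how a record could be unpacked into named fields. It assumes that every record has the same shape as the single sample shown above, with the publication date stored as a {'$date': ...} dictionary and the crawling date as a plain string; this has not been checked against the whole dataset. The helper names (load_published, parse_timestamp, unpack_record) and the encoding='latin1' argument are illustrative choices of ours, not part of the released scripts.

```python
import pickle
from datetime import datetime


def load_published(path, encoding="latin1"):
    """Load the pickle under Python 3.

    Assumption: the file may have been written by Python 2, in which case
    an explicit encoding can be needed to decode byte strings.
    """
    with open(path, "rb") as f:
        return pickle.load(f, encoding=encoding)


def parse_timestamp(value):
    """Parse a timestamp such as '2018-01-13T12:00:01.480Z'.

    Accepts either the plain string or a {'$date': string} dictionary,
    the two forms seen in the sample record.
    """
    if isinstance(value, dict):
        value = value["$date"]
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")


def unpack_record(url, record):
    """Turn [title, text, outlet, publication date, crawling date] into a dict."""
    title, text, outlet, published, crawled = record
    return {
        "url": url,
        "title": title,
        "text": text,
        "outlet": outlet,
        "published": parse_timestamp(published),
        "crawled": parse_timestamp(crawled),
    }
```

With the dictionary loaded, all records could then be unpacked with, for example, [unpack_record(u, r) for u, r in dict_urls.items()], which leaves the dictionary intact, unlike popitem().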

Categorization scripts and categorized URLs

The second folder (categorizer) contains the categorization scripts as well as the set of Published_URLs described earlier, tagged with their news category (Published_URLs_tagged_CLEAN.pkl). The pickle is a dictionary whose keys are URLs and whose values are the categories.
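
As a rough sketch of how the tagged dictionary could be explored, the snippet below counts articles per category and lists the URLs in a given category. The exact type of the values is an assumption on our part: the code accepts either a single label per URL or a collection of labels. The function names are hypothetical and do not come from the released scripts.

```python
from collections import Counter


def labels_of(value):
    """Return the category labels of one entry as a list.

    Assumption: a value is either a single label or a list/tuple/set of labels.
    """
    if isinstance(value, (list, tuple, set)):
        return list(value)
    return [value]


def category_counts(tagged):
    """Count how many URLs carry each category label."""
    counts = Counter()
    for value in tagged.values():
        counts.update(labels_of(value))
    return counts


def urls_in_category(tagged, category):
    """Return the sorted URLs tagged with the given category."""
    return sorted(url for url, value in tagged.items()
                  if category in labels_of(value))
```

For instance, category_counts(tagged).most_common() would give the categories ordered by number of articles, and urls_in_category(tagged, 'Esteri') would collect the URLs to inspect for that category, assuming 'Esteri' is one of the labels used in the file.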

To run the scripts on another set of URLs, follow the instructions in Readme.txt and the comments in launch_categorizer.sh. To run the categorizer on the Published_URLs.pkl dataset above, just launch the latter script as is.