Exploring the use of Deep Natural Language Processing models to analyze documents in cross-lingual and multi-domain scenarios

Supervisors

Luca Cagliero – luca.cagliero@polito.it

PhD Student: Lorenzo Vaiani

Summary of the proposal

Deep Natural Language Processing (Deep NLP) focuses on applying both representation learning and Deep Learning to tackle NLP tasks. The main goal is to analyze large document corpora to support knowledge discovery. Examples of Deep NLP applications include extractive text summarization (i.e., generating concise summaries of large corpora), machine translation (i.e., translating a text from one language to another), sentiment analysis (i.e., extracting subjective information from a text such as opinions, moods, ratings, and feelings), and question answering (e.g., chatbot applications).

Vector representations of text are commonly used to capture semantic text relationships at the word level (e.g., Word2Vec, GloVe, FastText) or at the sentence level (e.g., BERT, XLNet, ELMo). However, existing models show limited portability across languages and domains.

The Ph.D. candidate will study, design, and develop new Deep NLP (DNLP) models tailored to cross-lingual and cross-domain scenarios. Furthermore, she/he will apply these models to specific NLP tasks such as cross-lingual and multi-domain text summarization and sentiment analysis.

Research objectives and methods

The main goal of Deep Natural Language Processing (DNLP) techniques is to exploit Deep Learning models to extract significant information from large document corpora. The underlying idea is to transform the original textual units into high-dimensional vector representations that capture most of the semantic relationships in the text.
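As a minimal illustration of how vector representations encode semantic relationships, the sketch below compares toy word vectors with cosine similarity. The three-dimensional vectors and the name TOY_EMBEDDINGS are hypothetical and hand-written for this example; real embedding models learn vectors with hundreds of dimensions from data.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for all-zero vectors
    return dot / (norm_u * norm_v)


# Hypothetical embeddings, invented for illustration only.
TOY_EMBEDDINGS = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "bank": [0.0, 0.1, 0.9],
}

if __name__ == "__main__":
    for other in ("dog", "bank"):
        score = cosine_similarity(TOY_EMBEDDINGS["cat"], TOY_EMBEDDINGS[other])
        print(f"cat vs {other}: {score:.3f}")
```

With these toy values, "cat" scores much closer to "dog" than to "bank", which is the kind of relationship learned embeddings are expected to capture.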

The current challenges are enumerated below.

  • The need for large training datasets. Deep Learning architectures usually require sufficiently large amounts of training data. When there is a lack of training data, it is necessary to opportunistically reuse pre-trained models and data from other related domains. The open question is to what extent these models and datasets are suitable and for which application contexts.
  • The need to cope with multilingual documents. Documents can be written in different languages. Models trained for specific languages are not always portable to other languages. This calls for new cross-lingual NLP solutions.
  • The need to cope with documents from different domains. Documents often range over specific domains. The covered topic/domain influences text meaning, syntax, and scope. Effectively coping with multi-domain document corpora is another open research challenge. It is particularly relevant when extracting meaningful document summaries and inferring the author's main opinions/feelings.
  • The need to generate new text. In specific NLP applications, extracting part of the existing textual content is not sufficient. For example, chatbot applications usually need to generate appropriate answers to end-user queries. The study of DNLP techniques to produce abstractive summaries of large documents is another challenging yet appealing research direction.

The main goal of the proposal is to study, develop, and test new DNLP solutions tailored to multilingual and multi-domain contexts, paying particular attention to contexts in which training data is scarce or abstractive models are preferable.

Outline of work plan

PHASE I (1st year):
– Step 1: Study of state-of-the-art Deep NLP embedding models (e.g., Word2Vec, GloVe, SIF, BERT, ELMo). Overview of state-of-the-art extractive summarization methods (e.g., SummPip, TextRankBM25, ELSA, Centroid) and sentiment analyzers (e.g., VADER).

– Step 2: Retrieval of cross-lingual document corpora from various sources (e.g., Wikipedia). Analysis of the performance of state-of-the-art contextualized embedding models. Study, design, and development of targeted model extensions to overcome specific issues (e.g., their limited applicability in cross-lingual and multi-domain scenarios).
Pursued objective: learn contextual embeddings on multilingual and multidomain data.
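To make the extractive baselines mentioned in Step 1 more concrete, here is a minimal sketch of a centroid-style extractive summarizer. It is an assumption-laden toy: sentences are represented as plain bag-of-words counts rather than learned embeddings, and the function names (tokenize, cosine, centroid_summary) are hypothetical, not taken from any of the cited methods.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(bag_a, bag_b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(count * bag_b[term] for term, count in bag_a.items() if term in bag_b)
    norm_a = math.sqrt(sum(c * c for c in bag_a.values()))
    norm_b = math.sqrt(sum(c * c for c in bag_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def centroid_summary(sentences, k=1):
    """Return the k sentences closest to the corpus centroid, in original order."""
    bags = [Counter(tokenize(s)) for s in sentences]
    # The centroid is the sum of all sentence vectors; cosine similarity
    # ignores scale, so summing and averaging give the same ranking.
    centroid = Counter()
    for bag in bags:
        centroid.update(bag)
    scores = [cosine(bag, centroid) for bag in bags]
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


if __name__ == "__main__":
    docs = [
        "Deep learning models summarize documents.",
        "Deep learning models analyze large documents.",
        "The weather is sunny today.",
    ]
    print(centroid_summary(docs, k=1))
```

Replacing the bag-of-words vectors with multilingual contextual embeddings is one way such a baseline could be extended, although the choice of representation is left open here.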

PHASE II (2nd year):
– Step 1: Study of new solutions for domain adaptation and retrofitting of embedding models (i.e., tailoring the models to specific domains). Empirical assessment on benchmark multilingual document corpora.

– Step 2: Application of the previously extended models to specific NLP tasks such as cross-lingual text summarization (extracting a summary in a target language that differs from that of the source documents) and cross-lingual sentiment analysis (exploiting end users' opinions written in a given language to improve the quality of sentiment analysis models specifically designed for other languages).
Pursued objective: develop new cross-lingual summarization and sentiment analysis methods.
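The retrofitting idea in Step 1 can be sketched as follows, assuming a graph-based formulation in which each vector is pulled towards the vectors of related terms while staying close to its original value. The relation graph, the weighting (alpha = 1 for the original vector, neighbour weights summing to 1), and the function name retrofit are assumptions made for this illustration; the proposal does not commit to a specific retrofitting algorithm.

```python
def retrofit(embeddings, neighbours, iterations=10, alpha=1.0):
    """Iteratively move each vector towards the average of its neighbours.

    embeddings: dict mapping a term to its original vector (list of floats).
    neighbours: dict mapping a term to related terms (hypothetical domain graph).
    """
    # Work on copies so the original embeddings are left untouched.
    new = {term: list(vec) for term, vec in embeddings.items()}
    for _ in range(iterations):
        for term, original in embeddings.items():
            linked = [n for n in neighbours.get(term, []) if n in new]
            if not linked:
                continue  # terms without neighbours keep their vector
            beta = 1.0 / len(linked)  # neighbour weights sum to 1
            updated = []
            for d in range(len(original)):
                neighbour_sum = sum(beta * new[n][d] for n in linked)
                updated.append((alpha * original[d] + neighbour_sum) / (alpha + 1.0))
            new[term] = updated
    return new


if __name__ == "__main__":
    # Toy vectors and relations, invented for illustration only.
    vectors = {"loan": [1.0, 0.0], "credit": [0.0, 1.0], "river": [5.0, 5.0]}
    graph = {"loan": ["credit"], "credit": ["loan"]}
    for term, vec in retrofit(vectors, graph).items():
        print(term, [round(x, 3) for x in vec])
```

In this toy run, "loan" and "credit" move closer together because the hypothetical domain graph links them, while "river" is unchanged. In a domain-adaptation setting, the graph would come from a domain-specific lexicon or ontology.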

PHASE III (3rd year):
– Step 1: Study and development of new abstractive summarization models (i.e., generate new text using Deep Learning architectures).

– Step 2: Study, design, and development of state-of-the-art abstractive summarization methods. Evaluation on benchmark data.
Pursued objective: develop new abstractive summarization methods.

Throughout the three years, the candidate will have the opportunity to attend top-quality conferences, to collaborate with researchers from different countries, and to participate in NLP research challenges (e.g., FNS 2020 http://wp.lancs.ac.uk/cfie/fns2020/).

Expected target publications

Conferences: ACM SIGIR, EMNLP, COLING, IEEE ICDM, ACM KDD
Journals: IEEE TKDE, ACM TKDD, Elsevier ESWA, IEEE TETC, Springer DMKD

Current funded projects of the proposer related to the proposal

PRIN Italian Project Bando 2017 “Entrepreneurs As Scientists: When and How Start-ups Benefit from A Scientific Approach to Decision Making”. The activity will be focused on the analysis of questionnaires, reviews, and technical reports related to training activities.

Further information about the PhD program is available on the Politecnico website.