Robustness in Language Models: From Misspelling Resistant Embedding to Fact Checkable Text Generation (public)

DIAG speaker:

Fabrizio Silvestri

Event date:

Friday, 22 May 2020 - 14:00

Location:

Meet: https://meet.google.com/jvg-snvt-qme

Contact:

Stefano Leonardi

Fabrizio Silvestri has won the selection procedure for one position of Full Professor (Professore di I fascia) at the Dipartimento di Ingegneria Informatica, Automatica e Gestionale Antonio Ruberti, Settore Concorsuale 09/H1, SSD ING-INF/05 - competition code 2019POE007, announced by Rector's Decree 2757/2019 of 19/09/2019, whose proceedings were approved by Rector's Decree no. 1272/2020 of 13/05/2020.

As part of the procedure leading to his appointment by the Department Council, Fabrizio Silvestri will give a seminar and a public lecture on his past and ongoing research activities. The seminar and the lecture will be held online on Friday, 22 May, at 14:00.

To attend the seminar and the lecture, connect to https://meet.google.com/jvg-snvt-qme

The title of the lecture will be drawn on 21 May 2020 at 13:45 at meet.google.com/cxf-pxjk-tqd in the presence of the Department Director, Prof. Tiziana Catarci.

Title: Robustness in Language Models: From Misspelling Resistant Embedding to Fact Checkable Text Generation.

Abstract

In this talk, I will review two recent results in the field of Neural Language Modelling. In the first part, I will show a technique that improves n-gram embeddings, namely FastText, and makes them more robust to misspellings. Traditional word embeddings are good at solving many downstream natural language processing (NLP) problems, such as document classification and named-entity recognition (NER). However, one of their drawbacks is a limited ability to handle out-of-vocabulary (OOV) words. Misspelling Oblivious (word) Embedding (MOE) overcomes this limitation by using a combination of self-supervised and supervised language modelling.
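As a rough illustration of why character n-gram embeddings are a natural starting point for handling misspellings, the sketch below extracts FastText-style character n-grams (with `<` and `>` as word-boundary markers) and measures how many of them a misspelled word shares with its correct form. The function names `char_ngrams` and `ngram_overlap` are hypothetical and chosen only for this example; the code does not implement MOE or its training objective, only the subword decomposition that such embeddings build on.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the set of character n-grams of `word`.

    The word is wrapped in '<' and '>' boundary markers, as in
    FastText-style subword models. The defaults (3 to 6) are an
    assumption made for this sketch.
    """
    marked = f"<{word}>"
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.add(marked[i:i + n])
    return grams


def ngram_overlap(a, b, n_min=3, n_max=6):
    """Jaccard similarity between the n-gram sets of two words."""
    ga = char_ngrams(a, n_min, n_max)
    gb = char_ngrams(b, n_min, n_max)
    union = ga | gb
    return len(ga & gb) / len(union) if union else 0.0


if __name__ == "__main__":
    # A misspelling still shares many subword units with the correct word,
    # so a subword-based model is not starting from scratch on it.
    print(ngram_overlap("embedding", "embeding"))
    print(ngram_overlap("embedding", "xyz"))
```

A word that is out of vocabulary as a whole can still be represented through the n-grams it shares with known words; the overlap score here is only a proxy for that intuition, not a measure used in the talk.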

The second part will show a mechanism to generate neural text and assess its fact-checkability. We argue that verifiability, i.e., the consistency of the generated text with factual knowledge, is a suitable metric for evaluating such text. We use an automatic fact-checking system to calculate new metrics as a function of the number of supported claims per sentence and find that sampling-based generation strategies, such as top-k, indeed lead to less verifiable text. Based on this finding, we introduce a simple and effective generation strategy for producing text that is non-repetitive and more verifiable than text produced by other methods.
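For readers unfamiliar with the top-k strategy mentioned above, the following minimal sketch shows one common formulation: keep the k highest-scoring tokens, renormalise their scores with a softmax, and sample from that restricted distribution. The function name `top_k_sample` and the toy logits are assumptions made for illustration; this is not the generation strategy proposed in the talk, and it does not include any fact-checking component.

```python
import math
import random


def top_k_sample(logits, k, rng=random):
    """Sample a token index from the k highest-scoring entries of `logits`.

    `logits` is a list of unnormalised scores, one per vocabulary item.
    `rng` is any object exposing `choices`, e.g. a `random.Random` instance.
    """
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and len(logits)")
    # Indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax restricted to the top-k entries (shifted for numerical stability).
    m = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - m) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


if __name__ == "__main__":
    toy_logits = [2.0, 0.5, 1.5, -1.0]  # hypothetical scores for 4 tokens
    rng = random.Random(0)
    print([top_k_sample(toy_logits, k=2, rng=rng) for _ in range(10)])
```

With k=1 the procedure reduces to greedy decoding; larger k admits lower-probability tokens, which is the source of the randomness that the abstract associates with less verifiable text.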

Finally, we present some lines of research inspired by these recent findings.

Research group:

Algorithms and Data Science