Cross-functional Analysis of Generalisation in Behavioural Learning

In behavioural testing, system functionalities underrepresented in the standard evaluation setting (with a held-out test set) are validated through controlled input-output pairs. Optimising performance on the behavioural tests during training (behavioural learning) would improve coverage of phenomen...


Bibliographic Details
Published in: Transactions of the Association for Computational Linguistics
Main Authors: de Araujo, Pedro Henrique Luz; Roth, Benjamin
Format: Text
Language: unknown
Published: 2023
Subjects: Computer Science - Computation and Language; Computer Science - Machine Learning
Online Access: http://arxiv.org/abs/2305.12951
https://doi.org/10.1162/tacl_a_00590
id ftarxivpreprints:oai:arXiv.org:2305.12951
record_format openpolar
spelling ftarxivpreprints:oai:arXiv.org:2305.12951 2023-10-01T03:55:03+02:00 Cross-functional Analysis of Generalisation in Behavioural Learning de Araujo, Pedro Henrique Luz Roth, Benjamin 2023-05-22 http://arxiv.org/abs/2305.12951 https://doi.org/10.1162/tacl_a_00590 unknown doi:10.1162/tacl_a_00590 Computer Science - Computation and Language Computer Science - Machine Learning text 2023 ftarxivpreprints 2023-09-03T01:06:06Z Text Beluga Beluga* ArXiv.org (Cornell University Library) Transactions of the Association for Computational Linguistics 11 1066 1081
institution Open Polar
collection ArXiv.org (Cornell University Library)
op_collection_id ftarxivpreprints
language unknown
topic Computer Science - Computation and Language
Computer Science - Machine Learning
description In behavioural testing, system functionalities underrepresented in the standard evaluation setting (with a held-out test set) are validated through controlled input-output pairs. Optimising performance on the behavioural tests during training (behavioural learning) would improve coverage of phenomena not sufficiently represented in the i.i.d. data and could lead to seemingly more robust models. However, there is the risk that the model narrowly captures spurious correlations from the behavioural test suite, leading to overestimation and misrepresentation of model performance, one of the original pitfalls of traditional evaluation. In this work, we introduce BeLUGA, an analysis method for evaluating behavioural learning considering generalisation across dimensions of different granularity levels. We optimise behaviour-specific loss functions and evaluate models on several partitions of the behavioural test suite controlled to leave out specific phenomena. An aggregate score measures generalisation to unseen functionalities (or overfitting). We use BeLUGA to examine three representative NLP tasks (sentiment analysis, paraphrase identification and reading comprehension) and compare the impact of a diverse set of regularisation and domain generalisation methods on generalisation performance. Comment: 16 pages, 1 figure. To be published in the Transactions of the Association for Computational Linguistics (TACL). This preprint is a pre-MIT Press publication version.
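The evaluation idea in the description above (train on a behavioural test suite with one functionality left out, then score the model on the held-out functionality) can be illustrated with a minimal Python sketch. This is not the BeLUGA implementation: the suite layout, the function names (`leave_one_out_partitions`, `aggregate_generalisation`, `fit_majority`), the toy sentiment cases and the use of a plain average as the aggregate score are all assumptions made for illustration, and the aggregate score defined in the paper may differ.

```python
from collections import Counter
from statistics import mean
from typing import Callable, Dict, Iterator, List, Tuple

Example = Tuple[str, int]          # (input text, expected label)
Suite = Dict[str, List[Example]]   # functionality name -> its test cases
Predictor = Callable[[str], int]


def leave_one_out_partitions(suite: Suite) -> Iterator[Tuple[str, List[Example], List[Example]]]:
    """Yield (held-out functionality, training cases, held-out cases)."""
    for held_out in suite:
        train = [ex for name, cases in suite.items() if name != held_out for ex in cases]
        yield held_out, train, list(suite[held_out])


def accuracy(predict: Predictor, cases: List[Example]) -> float:
    """Fraction of cases whose predicted label matches the expected label."""
    return mean([1.0 if predict(text) == label else 0.0 for text, label in cases])


def aggregate_generalisation(suite: Suite, fit: Callable[[List[Example]], Predictor]):
    """Train once per partition and average accuracy on the unseen functionality.

    Assumption: a plain mean stands in for the paper's aggregate score.
    """
    scores = {
        held_out: accuracy(fit(train), test)
        for held_out, train, test in leave_one_out_partitions(suite)
    }
    return mean(list(scores.values())), scores


def fit_majority(train: List[Example]) -> Predictor:
    """Hypothetical stand-in for a learner: always predict the majority training label."""
    majority = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda text: majority


# Toy sentiment suite with two hypothetical functionalities (1 = positive, 0 = negative).
toy_suite: Suite = {
    "negation": [("not good", 0), ("not bad", 1), ("not great", 0)],
    "intensifier": [("very good", 1), ("really good", 1)],
}

overall, per_functionality = aggregate_generalisation(toy_suite, fit_majority)
print(overall, per_functionality)
```

A low held-out score next to a high score on the functionalities seen during training would indicate the overfitting to the test suite that the description warns about; the majority-label learner here exists only to keep the sketch self-contained.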
format Text
author de Araujo, Pedro Henrique Luz
Roth, Benjamin
title Cross-functional Analysis of Generalisation in Behavioural Learning
publishDate 2023
url http://arxiv.org/abs/2305.12951
https://doi.org/10.1162/tacl_a_00590
genre Beluga
Beluga*
op_relation http://arxiv.org/abs/2305.12951
doi:10.1162/tacl_a_00590
op_doi https://doi.org/10.1162/tacl_a_00590
container_title Transactions of the Association for Computational Linguistics
container_volume 11
container_start_page 1066
op_container_end_page 1081
_version_ 1778523170375991296