Cross-functional Analysis of Generalisation in Behavioural Learning ...

In behavioural testing, system functionalities underrepresented in the standard evaluation setting (with a held-out test set) are validated through controlled input-output pairs. Optimising performance on the behavioural tests during training (behavioural learning) would improve coverage of phenomena not sufficiently represented in the i.i.d. data and could lead to seemingly more robust models. ...

Bibliographic Details
Main Authors: de Araujo, Pedro Henrique Luz, Roth, Benjamin
Format: Text
Language: unknown
Published: arXiv 2023
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); FOS: Computer and information sciences
Online Access: https://dx.doi.org/10.48550/arxiv.2305.12951
https://arxiv.org/abs/2305.12951
id ftdatacite:10.48550/arxiv.2305.12951
record_format openpolar
institution Open Polar
collection DataCite Metadata Store (German National Library of Science and Technology)
op_collection_id ftdatacite
language unknown
topic Computation and Language cs.CL
Machine Learning cs.LG
FOS Computer and information sciences
topic_facet Computation and Language cs.CL
Machine Learning cs.LG
FOS Computer and information sciences
description In behavioural testing, system functionalities underrepresented in the standard evaluation setting (with a held-out test set) are validated through controlled input-output pairs. Optimising performance on the behavioural tests during training (behavioural learning) would improve coverage of phenomena not sufficiently represented in the i.i.d. data and could lead to seemingly more robust models. However, there is a risk that the model narrowly captures spurious correlations from the behavioural test suite, leading to overestimation and misrepresentation of model performance, which is one of the original pitfalls of traditional evaluation. In this work, we introduce BeLUGA, an analysis method for evaluating behavioural learning that considers generalisation across dimensions of different granularity levels. We optimise behaviour-specific loss functions and evaluate models on several partitions of the behavioural test suite controlled to leave out specific phenomena. An aggregate score measures generalisation to unseen ... Note: 16 pages, 1 figure. To be published in the Transactions of the Association for Computational Linguistics (TACL). This preprint is a pre-MIT Press publication version ...
format Text
author de Araujo, Pedro Henrique Luz
Roth, Benjamin
author_facet de Araujo, Pedro Henrique Luz
Roth, Benjamin
author_sort de Araujo, Pedro Henrique Luz
title Cross-functional Analysis of Generalisation in Behavioural Learning ...
title_sort cross-functional analysis of generalisation in behavioural learning ...
publisher arXiv
publishDate 2023
url https://dx.doi.org/10.48550/arxiv.2305.12951
https://arxiv.org/abs/2305.12951
genre Beluga
Beluga*
genre_facet Beluga
Beluga*
op_relation https://dx.doi.org/10.1162/tacl_a_00590
op_rights Creative Commons Attribution 4.0 International
https://creativecommons.org/licenses/by/4.0/legalcode
cc-by-4.0
op_doi https://doi.org/10.48550/arxiv.2305.12951
10.1162/tacl_a_00590
_version_ 1778523175431176192