Description
Summary: The Conversational Machine Reading Comprehension (CMRC) task aims to answer questions in conversations and has become a hot research topic because of its wide applications. However, existing CMRC benchmarks, in which each conversation is coupled with a single static passage, are inconsistent with real scenarios, making it hard to evaluate a model's comprehension ability in such settings. In this work, we propose Orca, the first Chinese CMRC benchmark, and further provide zero-shot/few-shot settings to evaluate a model's generalization ability across diverse domains. We collect 831 hot-topic-driven conversations with 4,742 turns in total. Each turn of a conversation is paired with a response-related passage, aiming to evaluate a model's comprehension ability more reasonably. The conversation topics are collected from a social media platform and cover 33 domains, so as to stay consistent with real scenarios. Importantly, the answers in Orca are all well-annotated natural responses rather than the specific spans or short phrases used in previous datasets. We implement two strong frameworks to tackle the challenges in Orca. The results indicate substantial room for improvement for strong baselines such as ChatGPT on our CMRC benchmark. Our code and datasets are available at: https://github.com/nuochenpku/Orca. © 2023 Association for Computational Linguistics.