Computer Science > Computation and Language
Title: CrossSum: Beyond English-Centric Cross-Lingual Abstractive Text Summarization for 1500+ Language Pairs
(Submitted on 16 Dec 2021 (v1), last revised 23 May 2022 (this version, v2))
Abstract: We present CrossSum, a large-scale cross-lingual abstractive summarization dataset comprising 1.7 million article-summary samples in 1500+ language pairs. We create CrossSum by aligning identical articles written in different languages via cross-lingual retrieval from a multilingual summarization dataset. We propose a multi-stage data sampling algorithm to effectively train a cross-lingual summarization model capable of summarizing an article in any target language. We also propose LaSE, a new metric for automatically evaluating model-generated summaries, and show that it correlates strongly with ROUGE. Performance on ROUGE and LaSE indicates that pretrained models fine-tuned on CrossSum consistently outperform baseline models, even when the source and target languages are linguistically distant. To the best of our knowledge, CrossSum is the largest cross-lingual summarization dataset and the first that does not rely solely on English as the pivot language. We are releasing the dataset, alignment and training scripts, and the models to spur future research on cross-lingual abstractive summarization. The resources can be found at this https URL
Submission history
From: Rifat Shahriyar
[v1] Thu, 16 Dec 2021 11:40:36 GMT (9098kb,D)
[v2] Mon, 23 May 2022 18:44:10 GMT (13351kb,D)