Computation and Language

New submissions

New submissions for Fri, 18 Jun 21

[1]  arXiv:2106.09024 [pdf, other]
Title: Disentangling Online Chats with DAG-Structured LSTMs
Comments: 8 pages, 1 figure. Accepted at *SEM 2021
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)

Many modern messaging systems allow fast and synchronous textual communication among many users. The resulting sequence of messages hides a more complicated structure in which independent sub-conversations are interwoven with one another. This poses a challenge for any task aiming to understand the content of the chat logs or gather information from them. The ability to disentangle these conversations is then paramount to the success of many downstream tasks such as summarization and question answering. Structured information accompanying the text, such as user turns, user mentions, and timestamps, is used as a cue by the participants themselves, who need to follow the conversation, and has been shown to be important for disentanglement. DAG-LSTMs, a generalization of Tree-LSTMs that can handle directed acyclic dependencies, are a natural way to incorporate such information and its non-sequential nature. In this paper, we apply DAG-LSTMs to the conversation disentanglement task. We perform our experiments on the Ubuntu IRC dataset. We show that the novel model we propose achieves state-of-the-art results on the task of recovering reply-to relations and is competitive on other disentanglement metrics.

[2]  arXiv:2106.09063 [pdf, other]
Title: Specializing Multilingual Language Models: An Empirical Study
Subjects: Computation and Language (cs.CL)

Contextualized word representations from pretrained multilingual language models have become the de facto standard for addressing natural language tasks in many different languages, but the success of this approach is far from universal. For languages rarely or never seen by these models, directly using such models often results in suboptimal representation or use of data, motivating additional model adaptations to achieve reasonably strong performance. In this work, we study the performance, extensibility, and interaction of two such adaptations for this low-resource setting: vocabulary augmentation and script transliteration. Our evaluations on a set of three tasks in nine diverse low-resource languages yield a mixed result, upholding the viability of these approaches while raising new questions around how to optimally adapt multilingual models to low-resource settings.

[3]  arXiv:2106.09069 [pdf, other]
Title: Automatic Construction of Evaluation Suites for Natural Language Generation Datasets
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)

Machine learning approaches applied to NLP are often evaluated by summarizing their performance in a single number, for example accuracy. Since most test sets are constructed as an i.i.d. sample from the overall data, this approach overly simplifies the complexity of language and encourages overfitting to the head of the data distribution. As such, rare language phenomena or text about underrepresented groups are not equally included in the evaluation. To encourage more in-depth model analyses, researchers have proposed the use of multiple test sets, also called challenge sets, that assess specific capabilities of a model. In this paper, we develop a framework based on this idea which is able to generate controlled perturbations and identify subsets in text-to-scalar, text-to-text, or data-to-text settings. By applying this framework to the GEM generation benchmark, we propose an evaluation suite made of 80 challenge sets, demonstrate the kinds of analyses that it enables and shed light onto the limits of current generation models.

[4]  arXiv:2106.09141 [pdf, other]
Title: Probing Image-Language Transformers for Verb Understanding
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Multimodal image-language transformers have achieved impressive results on a variety of tasks that rely on fine-tuning (e.g., visual question answering and image retrieval). We are interested in shedding light on the quality of their pretrained representations -- in particular, if these models can distinguish different types of verbs or if they rely solely on nouns in a given sentence. To do so, we collect a dataset of image-sentence pairs (in English) consisting of 421 verbs that are either visual or commonly found in the pretraining data (i.e., the Conceptual Captions dataset). We use this dataset to evaluate pretrained image-language transformers and find that they fail more in situations that require verb understanding compared to other parts of speech. We also investigate what category of verbs are particularly challenging.

[5]  arXiv:2106.09174 [pdf, other]
Title: Can I Be of Further Assistance? Using Unstructured Knowledge Access to Improve Task-oriented Conversational Modeling
Comments: Presented as a DIALDOC workshop paper at ACL 2021
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Most prior work on task-oriented dialogue systems is restricted to limited coverage of domain APIs. However, users oftentimes have requests that are out of the scope of these APIs. This work focuses on responding to these beyond-API-coverage user turns by incorporating external, unstructured knowledge sources. Our approach works in a pipelined manner with knowledge-seeking turn detection, knowledge selection, and response generation in sequence. We introduce novel data augmentation methods for the first two steps and demonstrate that the use of information extracted from dialogue context improves the knowledge selection and end-to-end performance. Through experiments, we achieve state-of-the-art performance for both automatic and human evaluation metrics on the DSTC9 Track 1 benchmark dataset, validating the effectiveness of our contributions.

[6]  arXiv:2106.09204 [pdf, other]
Title: An Empirical Study on Hyperparameter Optimization for Fine-Tuning Pre-trained Language Models
Authors: Xueqing Liu, Chi Wang
Comments: To appear in ACL-IJCNLP 2021
Subjects: Computation and Language (cs.CL)

The performance of fine-tuning pre-trained language models largely depends on the hyperparameter configuration. In this paper, we investigate the performance of modern hyperparameter optimization methods (HPO) on fine-tuning pre-trained language models. First, we study and report three HPO algorithms' performances on fine-tuning two state-of-the-art language models on the GLUE dataset. We find that using the same time budget, HPO often fails to outperform grid search due to two reasons: insufficient time budget and overfitting. We propose two general strategies and an experimental procedure to systematically troubleshoot HPO's failure cases. By applying the procedure, we observe that HPO can succeed with more appropriate settings in the search space and time budget; however, in certain cases overfitting remains. Finally, we make suggestions for future work. Our implementation can be found in https://github.com/microsoft/FLAML/tree/main/flaml/nlp/.

[7]  arXiv:2106.09231 [pdf, other]
Title: Knowledgeable or Educated Guess? Revisiting Language Models as Knowledge Bases
Comments: Accepted to ACL2021 (main conference)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Previous literature shows that pre-trained masked language models (MLMs) such as BERT can achieve competitive factual knowledge extraction performance on some datasets, indicating that MLMs can potentially be a reliable knowledge source. In this paper, we conduct a rigorous study to explore the underlying predicting mechanisms of MLMs over different extraction paradigms. By investigating the behaviors of MLMs, we find that the previously reported decent performance is mainly owed to biased prompts which overfit dataset artifacts. Furthermore, incorporating illustrative cases and external contexts improves knowledge prediction mainly due to entity type guidance and golden answer leakage. Our findings shed light on the underlying predicting mechanisms of MLMs, and strongly question the previous conclusion that current MLMs can potentially serve as reliable factual knowledge bases.

[8]  arXiv:2106.09232 [pdf, other]
Title: Text2Event: Controllable Sequence-to-Structure Generation for End-to-end Event Extraction
Comments: Accepted to ACL2021 (main conference)
Subjects: Computation and Language (cs.CL)

Event extraction is challenging due to the complex structure of event records and the semantic gap between text and event. Traditional methods usually extract event records by decomposing the complex structure prediction task into multiple subtasks. In this paper, we propose Text2Event, a sequence-to-structure generation paradigm that can directly extract events from the text in an end-to-end manner. Specifically, we design a sequence-to-structure network for unified event extraction, a constrained decoding algorithm for event knowledge injection during inference, and a curriculum learning algorithm for efficient model learning. Experimental results show that, by uniformly modeling all tasks in a single model and universally predicting different labels, our method can achieve competitive performance using only record-level annotations in both supervised learning and transfer learning settings.

[9]  arXiv:2106.09233 [pdf, other]
Title: De-biasing Distantly Supervised Named Entity Recognition via Causal Intervention
Comments: Accepted to ACL2021 (main conference)
Subjects: Computation and Language (cs.CL)

Distant supervision tackles the data bottleneck in NER by automatically generating training instances via dictionary matching. Unfortunately, the learning of DS-NER is severely dictionary-biased, which suffers from spurious correlations and therefore undermines the effectiveness and the robustness of the learned models. In this paper, we fundamentally explain the dictionary bias via a Structural Causal Model (SCM), categorize the bias into intra-dictionary and inter-dictionary biases, and identify their causes. Based on the SCM, we learn de-biased DS-NER via causal interventions. For intra-dictionary bias, we conduct backdoor adjustment to remove the spurious correlations introduced by the dictionary confounder. For inter-dictionary bias, we propose a causal invariance regularizer which will make DS-NER models more robust to the perturbation of dictionaries. Experiments on four datasets and three DS-NER models show that our method can significantly improve the performance of DS-NER.
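
The backdoor adjustment mentioned in the abstract above can be illustrated with a small sketch. It only shows the general formula P(y | do(x)) = sum over d of P(y | x, d) P(d) for a confounder d. The confounder values and all probabilities below are hypothetical, and this is not the authors' model.

```python
def backdoor_adjust(p_y_given_x_d, p_d):
    """Backdoor adjustment: P(y | do(x)) = sum_d P(y | x, d) * P(d)."""
    return sum(p_y_given_x_d[d] * p_d[d] for d in p_d)

# Hypothetical confounder: whether a mention is covered by the dictionary.
p_d = {"in_dictionary": 0.7, "not_in_dictionary": 0.3}

# Hypothetical conditional probabilities that the mention is labeled an entity.
p_entity_given_mention_d = {"in_dictionary": 0.9, "not_in_dictionary": 0.4}

adjusted = backdoor_adjust(p_entity_given_mention_d, p_d)
print(f"P(entity | do(mention)) = {adjusted:.2f}")
```

Averaging over the confounder's marginal distribution, rather than its distribution conditioned on the input, is what removes the spurious path through the confounder.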

[10]  arXiv:2106.09234 [pdf, other]
Title: Denoising Distantly Supervised Named Entity Recognition via a Hypergeometric Probabilistic Model
Comments: Accepted to AAAI2021
Subjects: Computation and Language (cs.CL)

Denoising is the essential step for distant supervision based named entity recognition. Previous denoising methods are mostly based on instance-level confidence statistics, which ignore the variety of the underlying noise distribution on different datasets and entity types. This makes them difficult to adapt to high noise rate settings. In this paper, we propose Hypergeometric Learning (HGL), a denoising algorithm for distantly supervised NER that takes both noise distribution and instance-level confidence into consideration. Specifically, during neural network training, we naturally model the noise samples in each batch following a hypergeometric distribution parameterized by the noise rate. Then each instance in the batch is regarded as either correct or noisy according to its label confidence derived from the previous training step, as well as the noise distribution in this sampled batch. Experiments show that HGL can effectively denoise the weakly-labeled data retrieved from distant supervision, and therefore results in significant improvements on the trained models.
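
As a rough illustration of the batch-level noise model described above, the following sketch computes the hypergeometric probability of drawing exactly k noisy instances in a batch sampled without replacement. The population size, noise rate, and batch size are hypothetical, and this is not the authors' HGL implementation.

```python
from math import comb

def hypergeom_pmf(k, population, noisy, batch):
    """P(exactly k noisy instances in a batch drawn without replacement)."""
    lowest = max(0, batch - (population - noisy))
    highest = min(noisy, batch)
    if k < lowest or k > highest:
        return 0.0
    return comb(noisy, k) * comb(population - noisy, batch - k) / comb(population, batch)

# Hypothetical setting: 1000 training instances, 30% of them noisy, batch of 32.
POPULATION, NOISY, BATCH = 1000, 300, 32

pmf = [hypergeom_pmf(k, POPULATION, NOISY, BATCH) for k in range(BATCH + 1)]
expected_noisy = sum(k * p for k, p in enumerate(pmf))
print(f"expected noisy instances per batch: {expected_noisy:.2f}")
```

The mean of the distribution equals the batch size times the noise rate, so a per-batch noise budget follows directly from an estimated noise rate.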

[11]  arXiv:2106.09248 [pdf, other]
Title: X-FACT: A New Benchmark Dataset for Multilingual Fact Checking
Comments: ACL 2021; For data and code, see this https URL
Subjects: Computation and Language (cs.CL)

In this work, we introduce X-FACT: the largest publicly available multilingual dataset for factual verification of naturally existing real-world claims. The dataset contains short statements in 25 languages and is labeled for veracity by expert fact-checkers. The dataset includes a multilingual evaluation benchmark that measures both out-of-domain generalization, and zero-shot capabilities of the multilingual models. Using state-of-the-art multilingual transformer-based models, we develop several automated fact-checking models that, along with textual claims, make use of additional metadata and evidence from news stories retrieved using a search engine. Empirically, our best model attains an F-score of around 40%, suggesting that our dataset is a challenging benchmark for evaluation of multilingual fact-checking models.

[12]  arXiv:2106.09343 [pdf, ps, other]
Title: Lost in Interpreting: Speech Translation from Source or Interpreter?
Comments: to be published at INTERSPEECH 2021
Subjects: Computation and Language (cs.CL)

Interpreters facilitate multi-lingual meetings but the affordable set of languages is often smaller than what is needed. Automatic simultaneous speech translation can extend the set of provided languages. We investigate if such an automatic system should rather follow the original speaker, or an interpreter to achieve better translation quality at the cost of increased delay.
To answer the question, we release Europarl Simultaneous Interpreting Corpus (ESIC), 10 hours of recordings and transcripts of European Parliament speeches in English, with simultaneous interpreting into Czech and German. We evaluate quality and latency of speaker-based and interpreter-based spoken translation systems from English to Czech. We study the differences in implicit simplification and summarization of the human interpreter compared to a machine translation system trained to shorten the output to some extent. Finally, we perform human evaluation to measure information loss of each of these approaches.

[13]  arXiv:2106.09395 [pdf, other]
Title: A Self-supervised Method for Entity Alignment
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Entity alignment, aiming to identify equivalent entities across different knowledge graphs (KGs), is a fundamental problem for constructing large-scale KGs. Over the course of its development, supervision has been considered necessary for accurate alignments. Inspired by the recent progress of self-supervised learning, we explore the extent to which we can get rid of supervision for entity alignment. Existing supervised methods for this task focus on pulling each pair of positive (labeled) entities close to each other. However, our analysis suggests that the learning of entity alignment can actually benefit more from pushing sampled (unlabeled) negatives far away than pulling positive aligned pairs close. We present SelfKG by leveraging this discovery to design a contrastive learning strategy across two KGs. Extensive experiments on benchmark datasets demonstrate that SelfKG without supervision can match or achieve comparable results with state-of-the-art supervised baselines. The performance of SelfKG demonstrates self-supervised learning offers great potential for entity alignment in KGs.

[14]  arXiv:2106.09449 [pdf, other]
Title: DocNLI: A Large-scale Dataset for Document-level Natural Language Inference
Comments: ACL'21 Findings Camera-ready
Subjects: Computation and Language (cs.CL)

Natural language inference (NLI) is formulated as a unified framework for solving various NLP problems such as relation extraction, question answering, summarization, etc. It has been studied intensively in the past few years thanks to the availability of large-scale labeled datasets. However, most existing studies focus on merely sentence-level inference, which limits the scope of NLI's application in downstream NLP problems. This work presents DocNLI -- a newly-constructed large-scale dataset for document-level NLI. DocNLI is transformed from a broad range of NLP problems and covers multiple genres of text. The premises always stay in the document granularity, whereas the hypotheses vary in length from single sentences to passages with hundreds of words. Additionally, DocNLI has pretty limited artifacts which unfortunately widely exist in some popular sentence-level NLI datasets. Our experiments demonstrate that, even without fine-tuning, a model pretrained on DocNLI shows promising performance on popular sentence-level benchmarks, and generalizes well to out-of-domain NLP tasks that rely on inference at document granularity. Task-specific fine-tuning can bring further improvements. Data, code, and pretrained models can be found at https://github.com/salesforce/DocNLI.

[15]  arXiv:2106.09460 [pdf, other]
Title: DravidianCodeMix: Sentiment Analysis and Offensive Language Identification Dataset for Dravidian Languages in Code-Mixed Text
Comments: 36 pages
Subjects: Computation and Language (cs.CL)

This paper describes the development of a multilingual, manually annotated dataset for three under-resourced Dravidian languages generated from social media comments. The dataset was annotated for sentiment analysis and offensive language identification for a total of more than 60,000 YouTube comments. The dataset consists of around 44,000 comments in Tamil-English, around 7,000 comments in Kannada-English, and around 20,000 comments in Malayalam-English. The data was manually annotated by volunteer annotators and has high inter-annotator agreement as measured by Krippendorff's alpha. The dataset contains all types of code-mixing phenomena since it comprises user-generated content from a multilingual country. We also present baseline experiments to establish benchmarks on the dataset using machine learning methods. The dataset is available on GitHub (https://github.com/bharathichezhiyan/DravidianCodeMix-Dataset) and Zenodo (https://zenodo.org/record/4750858#.YJtw0SYo_0M).

[16]  arXiv:2106.09462 [pdf, ps, other]
Title: pysentimiento: A Python Toolkit for Sentiment Analysis and SocialNLP tasks
Comments: 4 pages, 2 tables. Source code at this https URL. Submitted to ASAI/JAIIO
Subjects: Computation and Language (cs.CL)

Extracting opinions from texts has attracted a lot of interest in recent years, as we are experiencing an unprecedented volume of user-generated content in social networks and other places. A problem that social researchers find in using opinion mining tools is that they are usually behind commercial APIs and unavailable for languages other than English. To address these issues, we present pysentimiento, a multilingual Python toolkit for Sentiment Analysis and other Social NLP tasks. This open-source library brings state-of-the-art models for Spanish and English in a black-box fashion, allowing researchers to easily access these techniques.

[17]  arXiv:2106.09493 [pdf, other]
Title: Scalable Approach for Normalizing E-commerce Text Attributes (SANTA)
Comments: Accepted in ECNLP workshop of ACL-IJCNLP 2021 (this https URL)
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)

In this paper, we present SANTA, a scalable framework to automatically normalize E-commerce attribute values (e.g. "Win 10 Pro") to a fixed set of pre-defined canonical values (e.g. "Windows 10"). Earlier works on attribute normalization focused on fuzzy string matching (also referred to as syntactic matching in this paper). In this work, we first perform an extensive study of nine syntactic matching algorithms and establish that 'cosine' similarity leads to the best results, showing a 2.7% improvement over the commonly used Jaccard index. Next, we argue that string similarity alone is not sufficient for attribute normalization as many surface forms require going beyond syntactic matching (e.g. "720p" and "HD" are synonyms). While semantic techniques like unsupervised embeddings (e.g. word2vec/fastText) have shown good results in word similarity tasks, we observed that they perform poorly at distinguishing between close canonical forms, as these close forms often occur in similar contexts. We propose to learn token embeddings using a twin network with triplet loss. We propose an embedding learning task leveraging raw attribute values and product titles to learn these embeddings in a self-supervised fashion. We show that providing supervision using our proposed task improves over both syntactic and unsupervised embeddings based techniques for attribute normalization. Experiments on a real-world attribute normalization dataset of 50 attributes show that the embeddings trained using our proposed approach obtain a 2.3% improvement over the best string matching and a 19.3% improvement over the best unsupervised embeddings.
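
The syntactic matching step described above can be sketched as follows. This is a minimal illustration that assumes character bigrams as the string representation; the abstract does not say which features the nine algorithms use, and the canonical list and helper names here are hypothetical.

```python
from collections import Counter
from math import sqrt

def char_ngrams(text, n=2):
    """Multiset of lowercase character n-grams of a string."""
    s = text.lower()
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def cosine_similarity(a, b, n=2):
    """Cosine similarity between n-gram count vectors of two strings."""
    va, vb = char_ngrams(a, n), char_ngrams(b, n)
    dot = sum(va[g] * vb[g] for g in va)
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def jaccard_index(a, b, n=2):
    """Jaccard index between the n-gram sets of two strings."""
    sa, sb = set(char_ngrams(a, n)), set(char_ngrams(b, n))
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def normalize(value, canonical_values, similarity=cosine_similarity):
    """Map a raw attribute value to its most similar canonical value."""
    return max(canonical_values, key=lambda c: similarity(value, c))

# Hypothetical canonical set, reusing the example value from the abstract.
canonical = ["Windows 10", "Windows 7", "macOS"]
print(normalize("Win 10 Pro", canonical))
print(cosine_similarity("720p", "HD"))
```

Under this representation "720p" and "HD" share no bigrams, so both measures score the pair at zero, which is the kind of case the abstract says requires going beyond syntactic matching.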

[18]  arXiv:2106.09502 [pdf, other]
Title: Biomedical Interpretable Entity Representations
Comments: Accepted into Findings of ACL-IJCNLP 2021
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)

Pre-trained language models induce dense entity representations that offer strong performance on entity-centric NLP tasks, but such representations are not immediately interpretable. This can be a barrier to model uptake in important domains such as biomedicine. There has been recent work on general interpretable representation learning (Onoe and Durrett, 2020), but these domain-agnostic representations do not readily transfer to the important domain of biomedicine. In this paper, we create a new entity type system and training set from a large corpus of biomedical texts by mapping entities to concepts in a medical ontology, and from these to Wikipedia pages whose categories are our types. From this mapping we derive Biomedical Interpretable Entity Representations (BIERs), in which dimensions correspond to fine-grained entity types, and values are predicted probabilities that a given entity is of the corresponding type. We propose a novel method that exploits BIER's final sparse and intermediate dense representations to facilitate model and entity type debugging. We show that BIERs achieve strong performance in biomedical tasks including named entity disambiguation and entity label classification, and we provide error analysis to highlight the utility of their interpretability, particularly in low-supervision settings. Finally, we provide our induced 68K biomedical type system, the corresponding 37 million triples of derived data used to train BIER models and our best performing model.

[19]  arXiv:2106.09558 [pdf, other]
Title: Element Intervention for Open Relation Extraction
Comments: Accepted to ACL2021 (main conference)
Subjects: Computation and Language (cs.CL)

Open relation extraction aims to cluster relation instances referring to the same underlying relation, which is a critical step for general relation extraction. Current OpenRE models are commonly trained on datasets generated from distant supervision, which often results in instability and makes the models prone to collapse. In this paper, we revisit the procedure of OpenRE from a causal view. By formulating OpenRE using a structural causal model, we identify that the above-mentioned problems stem from the spurious correlations from entities and context to the relation type. To address this issue, we conduct Element Intervention, which intervenes on the context and entities respectively to obtain their underlying causal effects. We also provide two specific implementations of the interventions based on entity ranking and context contrasting. Experimental results on unsupervised relation extraction datasets show that our methods outperform previous state-of-the-art methods and are robust across different datasets.

[20]  arXiv:2106.09572 [pdf]
Title: Topic Modeling and Progression of American Digital News Media During the Onset of the COVID-19 Pandemic
Subjects: Computation and Language (cs.CL)

Currently, the world is in the midst of a severe global pandemic, which has affected all aspects of people's lives. As a result, there is a deluge of COVID-related digital media articles published in the United States, due to the disparate effects of the pandemic. This large volume of information is difficult for audiences to consume in a reasonable amount of time. In this paper, we develop a Natural Language Processing (NLP) pipeline that is capable of automatically distilling various digital articles into manageable pieces of information, while also modelling the progression of topics discussed over time in order to aid readers in rapidly gaining holistic perspectives on pressing issues (i.e., the COVID-19 pandemic) from a diverse array of sources. We achieve these goals by first collecting a large corpus of COVID-related articles during the onset of the pandemic. Afterwards, we apply unsupervised and semi-supervised learning procedures to summarize articles, then cluster them based on their similarities using community detection methods. Next, we identify the topic of each cluster of articles using the BART algorithm. Finally, we provide a detailed digital media analysis based on the NLP-pipeline outputs and show how the conversation surrounding COVID-19 evolved over time.

[21]  arXiv:2106.09578 [pdf, other]
Title: Modeling Worlds in Text
Comments: Preprint. Under review. Benchmark can be found at this https URL
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

We provide a dataset that enables the creation of learning agents that can build knowledge graph-based world models of interactive narratives. Interactive narratives -- or text-adventure games -- are partially observable environments structured as long puzzles or quests in which an agent perceives and interacts with the world purely through textual natural language. Each individual game typically contains hundreds of locations, characters, and objects -- each with their own unique descriptions -- providing an opportunity to study the problem of giving language-based agents the structured memory necessary to operate in such worlds. Our dataset provides 24198 mappings between rich natural language observations and: (1) knowledge graphs that reflect the world state in the form of a map; (2) natural language actions that are guaranteed to cause a change in that particular world state. The training data is collected across 27 games in multiple genres and contains a further 7836 heldout instances over 9 additional games in the test set. We further provide baseline models using rules-based, question-answering, and sequence learning approaches in addition to an analysis of the data and corresponding learning tasks.

[22]  arXiv:2106.09588 [pdf, other]
Title: End-to-End Cross-Domain Text-to-SQL Semantic Parsing with Auxiliary Task
Subjects: Computation and Language (cs.CL)

In this work, we focus on two crucial components in the cross-domain text-to-SQL semantic parsing task: schema linking and value filling. To encourage the model to learn better encoding ability, we propose a column selection auxiliary task to empower the encoder with the relevance matching capability by using explicit learning targets. Furthermore, we propose two value filling methods to build the bridge from the existing zero-shot semantic parsers to real-world applications, considering that most existing parsers ignore value filling in the synthesized SQL. With experiments on Spider, our proposed framework improves over the baselines on the execution accuracy and exact set match accuracy when database contents are unavailable, and detailed analysis sheds light on future work.

[23]  arXiv:2106.09589 [pdf, other]
Title: Classifying vaccine sentiment tweets by modelling domain-specific representation and commonsense knowledge into context-aware attentive GRU
Comments: Accepted in International Joint Conference on Neural Networks (IJCNN) 2021
Subjects: Computation and Language (cs.CL)

Vaccines are an important public health measure, but vaccine hesitancy and refusal can create clusters of low vaccine coverage and reduce the effectiveness of vaccination programs. Social media provides an opportunity to estimate emerging risks to vaccine acceptance by including geographical location and detailing vaccine-related concerns. Methods for classifying social media posts, such as vaccine-related tweets, use language models (LMs) trained on general domain text. However, measuring vaccine sentiment at scale is challenging because tweets lack tonal stress and gestural cues, and additional information about the user, e.g., past tweets or social connections, may not always be available. Another challenge for LMs is the lack of commonsense knowledge that is apparent in user metadata, e.g., emoticons and positive and negative words. In this study, to classify vaccine sentiment tweets with limited information, we present a novel end-to-end framework consisting of interconnected components that use a domain-specific LM trained on vaccine-related tweets and model commonsense knowledge into a bidirectional gated recurrent network (CK-BiGRU) with context-aware attention. We further leverage syntactical, user metadata and sentiment information to capture the sentiment of a tweet. We experiment on two popular vaccine-related Twitter datasets and demonstrate that our proposed approach outperforms state-of-the-art models in identifying pro-vaccine, anti-vaccine and neutral tweets.

[24]  arXiv:2106.09650 [pdf, other]
Title: Multi-head or Single-head? An Empirical Comparison for Transformer Training
Comments: Work in progress
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)

Multi-head attention plays a crucial role in the recent success of Transformer models, which leads to consistent performance improvements over conventional attention in various applications. The popular belief is that this effectiveness stems from the ability to jointly attend to multiple positions. In this paper, we first demonstrate that jointly attending to multiple positions is not a unique feature of multi-head attention, as multi-layer single-head attention also attends to multiple positions and is more effective. Then, we suggest that the main advantage of multi-head attention is training stability, since it has fewer layers than single-head attention when attending to the same number of positions. For example, a 24-layer 16-head Transformer (BERT-large) and a 384-layer single-head Transformer have the same total attention head number and roughly the same model size, while the multi-head one is significantly shallower. Meanwhile, we show that, with recent advances in deep learning, we can successfully stabilize the training of the 384-layer Transformer. As the training difficulty is no longer a bottleneck, a substantially deeper single-head Transformer achieves consistent performance improvements without tuning hyper-parameters.
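
The head-count equivalence in the example above is simple arithmetic, sketched here for concreteness. The function name is illustrative only, and the layer and head counts are the ones quoted in the abstract.

```python
def total_attention_heads(num_layers, heads_per_layer):
    """Total number of attention heads across all layers."""
    return num_layers * heads_per_layer

multi_head = total_attention_heads(num_layers=24, heads_per_layer=16)
single_head = total_attention_heads(num_layers=384, heads_per_layer=1)

# Both configurations have the same number of heads; only the depth differs.
depth_ratio = 384 // 24
print(multi_head, single_head, depth_ratio)
```

With the head budget held fixed, the single-head model is 16 times deeper, which is why the abstract treats training stability as the deciding factor.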

[25]  arXiv:2106.09685 [pdf, other]
Title: LoRA: Low-Rank Adaptation of Large Language Models
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

The dominant paradigm of natural language processing consists of large-scale pre-training on general domain data and adaptation to particular tasks or domains. As we pre-train larger models, conventional fine-tuning, which retrains all model parameters, becomes less feasible. Using GPT-3 175B as an example, deploying many independent instances of fine-tuned models, each with 175B parameters, is extremely expensive. We propose Low-Rank Adaptation, or LoRA, which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks. For GPT-3, LoRA can reduce the number of trainable parameters by 10,000 times and the computation hardware requirement by 3 times compared to full fine-tuning. LoRA performs on par with or better than fine-tuning in model quality on both GPT-3 and GPT-2, despite having fewer trainable parameters, a higher training throughput, and no additional inference latency. We also provide an empirical investigation into rank-deficiency in language model adaptations, which sheds light on the efficacy of LoRA. We release our implementation in GPT-2 at https://github.com/microsoft/LoRA .
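
The idea of a frozen weight plus a trainable low-rank update can be sketched in a few lines. This is a toy illustration in plain Python and not the released implementation: the matrix sizes, the rank, the initialization scale, and all helper names are assumptions made for the example.

```python
import random

def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def lora_weight(w_frozen, b, a, scale=1.0):
    """Effective weight W + scale * (B @ A); only B and A would be trained."""
    delta = matmul(b, a)
    return [[w + scale * d for w, d in zip(rw, rd)] for rw, rd in zip(w_frozen, delta)]

def trainable_params(d_out, d_in, rank):
    """Parameters in B (d_out x rank) plus A (rank x d_in)."""
    return rank * (d_out + d_in)

random.seed(0)
D_OUT, D_IN, RANK = 8, 8, 2  # hypothetical toy sizes

w = [[random.gauss(0, 1) for _ in range(D_IN)] for _ in range(D_OUT)]      # frozen
a = [[random.gauss(0, 0.01) for _ in range(D_IN)] for _ in range(RANK)]    # trainable
b = [[0.0] * RANK for _ in range(D_OUT)]  # trainable; zero init leaves W unchanged at start

assert lora_weight(w, b, a) == w
print(trainable_params(D_OUT, D_IN, RANK), "trainable vs", D_OUT * D_IN, "frozen")
```

Because the update can be added into the frozen weight once training is done, a merged model has the same shape as the original, which is consistent with the abstract's claim of no additional inference latency.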

[26]  arXiv:2106.09700 [pdf, other]
Title: Scientific Language Models for Biomedical Knowledge Base Completion: An Empirical Study
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)

Biomedical knowledge graphs (KGs) hold rich information on entities such as diseases, drugs, and genes. Predicting missing links in these graphs can boost many important applications, such as drug design and repurposing. Recent work has shown that general-domain language models (LMs) can serve as "soft" KGs, and that they can be fine-tuned for the task of KG completion. In this work, we study scientific LMs for KG completion, exploring whether we can tap into their latent knowledge to enhance biomedical link prediction. We evaluate several domain-specific LMs, fine-tuning them on datasets centered on drugs and diseases that we represent as KGs and enrich with textual entity descriptions. We integrate the LM-based models with KG embedding models, using a router method that learns to assign each input example to either type of model and provides a substantial boost in performance. Finally, we demonstrate the advantage of LM models in the inductive setting with novel scientific entities. Our datasets and code are made publicly available.

Cross-lists for Fri, 18 Jun 21

[27]  arXiv:2106.09216 (cross-list from eess.AS) [pdf, other]
Title: Layer Pruning on Demand with Intermediate CTC
Comments: Interspeech 2021
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)

Deploying an end-to-end automatic speech recognition (ASR) model on mobile/embedded devices is a challenging task, since the device's computational power and energy consumption requirements change dynamically in practice. To overcome this issue, we present a training and pruning method for ASR based on connectionist temporal classification (CTC), which allows the model depth to be reduced at run-time without any extra fine-tuning. To achieve this goal, we adopt two regularization methods, intermediate CTC and stochastic depth, to train a model whose performance does not degrade much after pruning. We present an in-depth analysis of layer behaviors using singular vector canonical correlation analysis (SVCCA), and efficient strategies for finding layers which are safe to prune. Using the proposed method, we show that a Transformer-CTC model can be pruned to various depths on demand, improving the real-time factor from 0.005 to 0.002 on GPU, while each pruned sub-model maintains the accuracy of an individually trained model of the same depth.

[28]  arXiv:2106.09488 (cross-list from eess.AS) [pdf, ps, other]
Title: Scaling Laws for Acoustic Models
Comments: Submitted to Interspeech 2021
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)

There is a recent trend in machine learning to increase model quality by growing models to sizes previously thought to be unreasonable. Recent work has shown that autoregressive generative models with cross-entropy objective functions exhibit smooth power-law relationships, or scaling laws, that predict model quality from model size, training set size, and the available compute budget. These scaling laws allow one to choose nearly optimal hyper-parameters given constraints on available training data, model parameter count, or training computation budget. In this paper, we demonstrate that acoustic models trained with an auto-predictive coding loss behave as if they are subject to similar scaling laws. We extend previous work to jointly predict loss due to model size, to training set size, and to the inherent "irreducible loss" of the task. We find that the scaling laws accurately match model performance over two orders of magnitude in both model size and training set size, and make predictions about the limits of model performance.

[29]  arXiv:2106.09532 (cross-list from eess.AS) [pdf, other]
Title: ASR Adaptation for E-commerce Chatbots using Cross-Utterance Context and Multi-Task Language Modeling
Comments: Accepted at ACL-IJCNLP 2021 Workshop on e-Commerce and NLP (ECNLP)
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)

Automatic Speech Recognition (ASR) robustness toward slot entities is critical in e-commerce voice assistants that involve monetary transactions and purchases. Along with effective domain adaptation, it is intuitive that cross-utterance contextual cues play an important role in disambiguating domain-specific content words from speech. In this paper, we investigate various techniques to improve contextualization, content word robustness and domain adaptation of a Transformer-XL neural language model (NLM) to rescore ASR N-best hypotheses. To improve contextualization, we utilize turn-level dialogue acts along with cross-utterance context carry-over. Additionally, to adapt our domain-general NLM towards e-commerce on-the-fly, we use embeddings derived from a masked LM finetuned on in-domain data. Finally, to improve robustness towards in-domain content words, we propose a multi-task model that can jointly perform content word detection and language modeling tasks. Compared to a non-contextual LSTM LM baseline, our best performing NLM rescorer results in a content WER reduction of 19.2% on an e-commerce audio test set and a slot labeling F1 improvement of 6.4%.

[30]  arXiv:2106.09545 (cross-list from eess.AS) [pdf, other]
Title: STAN: A stuttering therapy analysis helper
Journal-ref: Demo presented at 2021 IEEE Spoken Language Technology Workshop (SLT)
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)

Stuttering is a complex speech disorder identified by repetitions, prolongations of sounds, syllables or words and blocks while speaking. Specific stuttering behaviour differs strongly, thus needing personalized therapy. Therapy sessions require a high level of concentration by the therapist. We introduce STAN, a system to aid speech therapists in stuttering therapy sessions. Such an automated feedback system can lower the cognitive load on the therapist and thereby enable a more consistent therapy as well as allowing analysis of stuttering over the span of multiple therapy sessions.

[31]  arXiv:2106.09553 (cross-list from cs.LG) [pdf, other]
Title: Do Large Scale Molecular Language Representations Capture Important Structural Information?
Comments: 17 pages, 3 figures
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Biomolecules (q-bio.BM)

Predicting chemical properties from the structure of a molecule is of great importance in many applications including drug discovery and material design. Machine learning based molecular property prediction holds the promise of enabling accurate predictions at much lower complexity than, for example, Density Functional Theory (DFT) calculations. Features extracted from molecular graphs, using graph neural nets in a supervised manner, have emerged as strong baselines for such tasks. However, the vast chemical space together with the limited availability of labels makes supervised learning challenging, calling for learning a general-purpose molecular representation. Recently, transformer-based language models pre-trained (PTLMs) on large unlabeled corpora have produced state-of-the-art results in many downstream natural language processing tasks. Inspired by this development, here we present molecular embeddings obtained by training an efficient transformer encoder model, referred to as MoLFormer. This model employs a linear attention mechanism and highly parallelized training on 1D SMILES sequences of 1.1 billion unlabeled molecules from the PubChem and ZINC datasets. Experiments show that the learned molecular representation performs competitively, when compared to existing graph-based and fingerprint-based supervised learning baselines, on the challenging tasks of predicting properties of QM8 and QM9 molecules. Further task-specific fine-tuning of the MoLFormer representation improves performance on several of those property prediction benchmarks. These results provide encouraging evidence that large-scale molecular language models can capture sufficient structural information to be able to accurately predict quantum chemical properties and beyond.

[32]  arXiv:2106.09608 (cross-list from cs.LG) [pdf, other]
Title: Learning Knowledge Graph-based World Models of Textual Environments
Comments: Preprint. Under review
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

World models improve a learning agent's ability to efficiently operate in interactive and situated environments. This work focuses on the task of building world models of text-based game environments. Text-based games, or interactive narratives, are reinforcement learning environments in which agents perceive and interact with the world using textual natural language. These environments contain long, multi-step puzzles or quests woven through a world that is filled with hundreds of characters, locations, and objects. Our world model learns to simultaneously: (1) predict changes in the world caused by an agent's actions when representing the world as a knowledge graph; and (2) generate the set of contextually relevant natural language actions required to operate in the world. We frame this task as a Set of Sequences generation problem by exploiting the inherent structure of knowledge graphs and actions, and we introduce both a transformer-based multi-task architecture and a loss function to train it. A zero-shot ablation study on never-before-seen textual worlds shows that our methodology significantly outperforms existing textual world modeling techniques and demonstrates the importance of each of our contributions.

Replacements for Fri, 18 Jun 21

[33]  arXiv:1909.03405 (replaced) [pdf, other]
Title: Symmetric Regularization based BERT for Pair-wise Semantic Reasoning
Comments: 8 pages, 3 figures, 6 tables
Subjects: Computation and Language (cs.CL)
[34]  arXiv:1911.01212 (replaced) [pdf, other]
Title: Scrambled Translation Problem: A Problem of Denoising UNMT
Comments: Accepted by MT Summit 2021
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
[35]  arXiv:1911.08782 (replaced) [pdf, other]
Title: Joint Emotion Label Space Modelling for Affect Lexica
Comments: Computer Speech and Language journal, to appear
Subjects: Computation and Language (cs.CL)
[36]  arXiv:2004.03974 (replaced) [pdf, other]
Title: Pre-training is a Hot Topic: Contextualized Document Embeddings Improve Topic Coherence
Comments: Updated version. Published as a conference paper at ACL-IJCNLP 2021
Subjects: Computation and Language (cs.CL)
[37]  arXiv:2010.10811 (replaced) [pdf, other]
Title: STN4DST: A Scalable Dialogue State Tracking based on Slot Tagging Navigation
Subjects: Computation and Language (cs.CL)
[38]  arXiv:2104.01989 (replaced) [pdf, ps, other]
Title: Dr-Vectors: Decision Residual Networks and an Improved Loss for Speaker Recognition
Comments: To appear in Interspeech 2021
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
[39]  arXiv:2104.05544 (replaced) [pdf, ps, other]
Title: Investigating Methods to Improve Language Model Integration for Attention-based Encoder-Decoder ASR Models
Comments: accepted to Interspeech 2021
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS); Machine Learning (stat.ML)
[40]  arXiv:2104.10507 (replaced) [pdf, ps, other]
Title: On Sampling-Based Training Criteria for Neural Language Modeling
Comments: Accepted at INTERSPEECH 2021
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS); Machine Learning (stat.ML)
[41]  arXiv:2106.00248 (replaced) [pdf, other]
Title: Volta at SemEval-2021 Task 9: Statement Verification and Evidence Finding with Tables using TAPAS and Transfer Learning
Comments: 9 pages, accepted at SemEval-2021 co-located with ACL-IJCNLP 2021
Subjects: Computation and Language (cs.CL)
[42]  arXiv:2106.01072 (replaced) [src]
Title: Evidence-based Factual Error Correction
Comments: Uploaded as a new paper in error. Please see the replacement of arxiv paper 2012.15788v2 for this version: arXiv:2012.15788
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
[43]  arXiv:2106.02569 (replaced) [pdf, other]
Title: Neural semi-Markov CRF for Monolingual Word Alignment
Comments: Accepted to ACL 2021
Subjects: Computation and Language (cs.CL)
[44]  arXiv:2106.05365 (replaced) [pdf, other]
Title: DESCGEN: A Distantly Supervised Dataset for Generating Abstractive Entity Descriptions
Journal-ref: ACL-IJCNLP 2021
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
[45]  arXiv:2106.05580 (replaced) [pdf, other]
Title: AGGGEN: Ordering and Aggregating while Generating
Comments: Correct the first citation in the Zero-shot Few-shot scenarios paragraph in Section 7
Journal-ref: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics (ACL2021)
Subjects: Computation and Language (cs.CL)
[46]  arXiv:2106.08616 (replaced) [pdf, other]
Title: Out-of-Scope Intent Detection with Self-Supervision and Discriminative Training
Comments: Published as long oral paper in ACL 2021
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
[47]  arXiv:1911.10088 (replaced) [pdf, other]
Title: Optimizing Data Usage via Differentiable Rewards
Comments: Accepted at ICML 2020
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
[48]  arXiv:2012.00096 (replaced) [pdf, other]
Title: Multi-Modal Detection of Alzheimer's Disease from Speech and Text
Comments: 9 pages, 3 figures, Accepted in BIOKDD 2021
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
[49]  arXiv:2012.12352 (replaced) [pdf, other]
Title: Seeing past words: Testing the cross-modal capabilities of pretrained V&L models on counting tasks
Comments: Paper accepted for publication at MMSR 2021; 13 pages, 3 figures, 7 Tables
Journal-ref: Proceedings of the 1st Workshop on Multimodal Semantic Representations (MMSR), 2021, Groningen, Netherlands (Online), Association for Computational Linguistics, p. 32--44
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
[50]  arXiv:2104.03416 (replaced) [pdf, ps, other]
Title: Pushing the Limits of Non-Autoregressive Speech Recognition
Comments: Accepted to Interspeech 2021
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
[51]  arXiv:2105.07071 (replaced) [pdf, other]
Title: Listen with Intent: Improving Speech Recognition with Audio-to-Intent Front-End
Comments: To appear in Interspeech 2021
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
[52]  arXiv:2106.08846 (replaced) [pdf, other]
Title: Algorithm to Compilation Co-design: An Integrated View of Neural Network Sparsity
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Systems and Control (eess.SY)