
Information Retrieval

New submissions

[ total of 6 entries: 1-6 ]

New submissions for Mon, 13 Jul 20

[1]  arXiv:2007.05186 [pdf, other]
Title: BISON:BM25-weighted Self-Attention Framework for Multi-Fields Document Search
Comments: 8 pages, 2 figures
Subjects: Information Retrieval (cs.IR)

Recent breakthroughs in natural language processing have advanced information retrieval from keyword matching to semantic vector search. To map queries and documents into semantic vectors, self-attention models are widely used. However, typical self-attention models, such as the Transformer, lack prior knowledge to distinguish the importance of different tokens, which has been shown to play a critical role in information retrieval tasks. In addition, when WordPiece tokenization is applied, a rare word may be split into several tokens. How to translate word-level prior knowledge into WordPiece tokens becomes a new challenge for semantic representation generation. Moreover, web documents usually have multiple fields. Because the fields are heterogeneous, simply combining them is not a good choice. In this paper, we propose a novel BM25-weighted Self-Attention framework (BISON) for web document search. By leveraging BM25 as prior weights, BISON learns weighted attention scores jointly with the query matrix Q and key matrix K. We also present an efficient whole-word weight-sharing solution to mitigate the prior-knowledge discrepancy between words and WordPiece tokens. Furthermore, BISON effectively combines multiple fields by placing different fields into different segments. We demonstrate that BISON is more efficient at capturing the topical and semantic representations of both queries and documents. Intrinsic evaluation and experiments conducted on public data sets show BISON to be a general framework for document ranking tasks. It outperforms BERT and other modern models while retaining the same model complexity as BERT.
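The abstract does not give BISON's formulas, so the following is only a minimal sketch of the two ideas it names, under stated assumptions. The function names `share_word_weights` and `weighted_attention` are hypothetical, the prior weights stand in for BM25-style scores computed elsewhere, and multiplying the prior into the exponentiated score before normalization is an assumed combination rule, not necessarily the one used in the paper.

```python
import math


def share_word_weights(word_pieces, word_weights):
    """Give every WordPiece token the prior weight of its whole word.

    word_pieces: one list of sub-tokens per word (hypothetical input format).
    word_weights: one prior weight per word, e.g. a BM25-style score.
    """
    tokens, weights = [], []
    for pieces, weight in zip(word_pieces, word_weights):
        for piece in pieces:
            tokens.append(piece)
            weights.append(weight)
    return tokens, weights


def weighted_attention(queries, keys, prior):
    """Scaled dot-product attention rescaled by a positive prior per key.

    Assumption: attention[i][j] is proportional to
    prior[j] * exp(q_i . k_j / sqrt(d)), normalized over j.
    """
    d = len(keys[0])
    rows = []
    for q in queries:
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        top = max(logits)  # subtract the max for numerical stability
        unnorm = [p * math.exp(l - top) for p, l in zip(prior, logits)]
        total = sum(unnorm)
        rows.append([u / total for u in unnorm])
    return rows


if __name__ == "__main__":
    # A common word and a rare word split into three pieces (toy data).
    tokens, prior = share_word_weights(
        [["the"], ["em", "##bed", "##ding"]], [0.1, 2.0]
    )
    vectors = [[0.0, 0.0] for _ in tokens]  # identical keys isolate the prior
    for token, row in zip(tokens, weighted_attention(vectors, vectors, prior)):
        print(token, [round(x, 3) for x in row])
```

With identical key vectors the dot products cancel, so the attention each token receives depends only on its prior, which makes the effect of the shared word-level weight easy to see.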

Cross-lists for Mon, 13 Jul 20

[2]  arXiv:2007.05039 (cross-list from cs.CY) [pdf, other]
Title: On the Social and Technical Challenges of Web Search Autosuggestion Moderation
Comments: 17 pages, 4 images displayed within 3 LaTeX figures
Subjects: Computers and Society (cs.CY); Information Retrieval (cs.IR)

Past research shows that users benefit from systems that support them in their writing and exploration tasks. The autosuggestion feature of Web search engines is an example of such a system: it helps users formulate their queries by offering a list of suggestions as they type. Autosuggestions are typically generated by machine learning (ML) systems trained on a corpus of search logs and document representations. Such automated methods can be prone to issues that result in problematic suggestions that are biased, racist, sexist or in other ways inappropriate. While current search engines have become increasingly proficient at suppressing such problematic suggestions, persistent issues remain. In this paper, we reflect on past efforts and on why certain issues still linger by covering explored solutions along a prototypical pipeline for identifying, detecting, and addressing problematic autosuggestions. To showcase their complexity, we discuss several dimensions of problematic suggestions, difficult issues along the pipeline, and why our discussion applies to the increasing number of applications beyond web search that implement similar textual suggestion features. By outlining persistent social and technical challenges in moderating web search suggestions, we provide a renewed call for action.

[3]  arXiv:2007.05163 (cross-list from cs.CL) [pdf, other]
Title: Handling Collocations in Hierarchical Latent Tree Analysis for Topic Modeling
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Topic modeling has been one of the most active research areas in machine learning in recent years. Hierarchical latent tree analysis (HLTA) was recently proposed for hierarchical topic modeling and has shown superior performance over state-of-the-art methods. However, the models used in HLTA have a tree structure and cannot appropriately represent the different meanings of multiword expressions that share the same word. Therefore, we propose a method for extracting and selecting collocations as a preprocessing step for HLTA. The selected collocations are replaced with single tokens in the bag-of-words model before running HLTA. Our empirical evaluation shows that the proposed method led to better performance of HLTA on three of the four data sets tested.
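The replacement step described in this abstract can be illustrated with a small sketch. It assumes the collocations have already been selected and are supplied as a set of word pairs; the abstract does not describe the extraction and selection method, so none is implemented here. The names `merge_collocations` and `bag_of_words` and the underscore joiner are hypothetical choices made for illustration.

```python
from collections import Counter


def merge_collocations(tokens, collocations, joiner="_"):
    """Replace each selected two-word collocation with a single token.

    tokens: list of word strings for one document.
    collocations: set of (first_word, second_word) pairs, assumed to have
    been selected beforehand.
    Scans left to right and merges greedily, so pairs never overlap.
    """
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in collocations:
            merged.append(joiner.join(pair))
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def bag_of_words(tokens):
    """Count token occurrences; word order is discarded."""
    return Counter(tokens)


if __name__ == "__main__":
    doc = ["neural", "network", "training", "on", "a", "social", "network"]
    selected = {("neural", "network"), ("social", "network")}
    print(bag_of_words(merge_collocations(doc, selected)))
```

After merging, the two uses of "network" are counted as separate tokens, so a downstream model no longer has to assign both meanings to one shared word.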

[4]  arXiv:2007.05302 (cross-list from cs.CL) [pdf, other]
Title: Topic Modeling on User Stories using Word Mover's Distance
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Requirements elicitation has recently been complemented with crowd-based techniques, which continuously involve large, heterogeneous groups of users who express their feedback through a variety of media. Crowd-based elicitation has great potential for engaging with (potential) users early on but also results in large sets of raw and unstructured feedback. Consolidating and analyzing this feedback is a key challenge for turning it into sensible user requirements. In this paper, we focus on topic modeling as a means to identify topics within a large set of crowd-generated user stories and compare three approaches: (1) a traditional approach based on Latent Dirichlet Allocation, (2) a combination of word embeddings and principal component analysis, and (3) a combination of word embeddings and Word Mover's Distance. We evaluate the approaches on a publicly available set of 2,966 user stories written and categorized by crowd workers. We found that a combination of word embeddings and Word Mover's Distance is most promising. Depending on the word embeddings we use in our approaches, we manage to cluster the user stories in two ways: one that is closer to the original categorization and another that allows new insights into the dataset, e.g. to find potentially new categories. Unfortunately, no measure exists to rate the quality of our results objectively. Still, our findings provide a basis for future work towards analyzing crowd-sourced user stories.
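To give a feel for the distance used in approach (3), here is a minimal sketch of the relaxed Word Mover's Distance, a known lower bound on the exact distance in which each word sends all of its mass to the nearest word of the other text. It is not the exact Word Mover's Distance, which requires solving an optimal transport problem, and it is not the authors' pipeline. The toy embeddings and the function names are assumptions made for illustration, and both token lists are assumed to be non-empty.

```python
import math
from collections import Counter


def nbow(tokens):
    """Normalized bag-of-words: each word's share of the text's mass."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def one_sided_cost(src, dst, emb):
    """Cost when every source word moves all its mass to its nearest target word."""
    return sum(
        mass * min(euclidean(emb[word], emb[other]) for other in dst)
        for word, mass in src.items()
    )


def relaxed_wmd(tokens_a, tokens_b, emb):
    """Relaxed Word Mover's Distance: the larger of the two one-sided costs."""
    a, b = nbow(tokens_a), nbow(tokens_b)
    return max(one_sided_cost(a, b, emb), one_sided_cost(b, a, emb))


if __name__ == "__main__":
    # Hypothetical 2-d embeddings; real ones would come from a trained model.
    emb = {"login": [0.0, 0.0], "signin": [0.0, 1.0], "export": [3.0, 4.0]}
    print(relaxed_wmd(["login", "signin"], ["login"], emb))
    print(relaxed_wmd(["login"], ["export"], emb))
```

Pairwise distances computed this way could then be passed to any clustering method that accepts a distance matrix; the abstract does not say which one the authors used.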

Replacements for Mon, 13 Jul 20

[5]  arXiv:2005.11490 (replaced) [pdf, other]
Title: Summarizing and Exploring Tabular Data in Conversational Search
Comments: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2020), 2020
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)

[6]  arXiv:2006.06251 (replaced) [pdf, other]
Title: Performance in the Courtroom: Automated Processing and Visualization of Appeal Court Decisions in France
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
