We gratefully acknowledge support from
the Simons Foundation and member institutions.

Information Retrieval

New submissions

[ total of 11 entries: 1-11 ]
[ showing up to 2000 entries per page: fewer | more ]

New submissions for Thu, 8 Dec 22

[1]  arXiv:2212.03464 [pdf]
Title: An automated approach to extracting positive and negative clinical research results
Subjects: Information Retrieval (cs.IR)

Failure is common in clinical trials since the successful failures presented in negative results always indicate the ways that should not be taken. In this paper, we proposed an automated approach to extracting positive and negative clinical research results by introducing a PICOE (Population, Intervention, Comparation, Outcome, and Effect) framework to represent randomized controlled trials (RCT) reports, where E indicates the effect between a specific I and O. We developed a pipeline to extract and assign the corresponding statistical effect to a specific I-O pair from natural language RCT reports. The extraction models achieved a high degree of accuracy for ICO and E descriptive words extraction through two rounds of training. By defining a threshold of p-value, we find in all Covid-19 related intervention-outcomes pairs with statistical tests, negative results account for nearly 40%. We believe that this observation is noteworthy since they are extracted from the published literature, in which there is an inherent risk of reporting bias, preferring to report positive results rather than negative results. We provided a tool to systematically understand the current level of clinical evidence by distinguishing negative results from the positive results.

[2]  arXiv:2212.03641 [pdf, other]
Title: THREAT/crawl: a Trainable, Highly-Reusable, and Extensible Automated Method and Tool to Crawl Criminal Underground Forums
Authors: Michele Campobasso, Luca Allodi (Eindhoven University of Technology)
Comments: To be published in the Proceedings of the 17th Symposium on Electronic Crime Research (APWG eCrime 2022). Source code of the implemented solution available at this https URL
Subjects: Information Retrieval (cs.IR); Cryptography and Security (cs.CR)

Collecting data on underground criminal communities is highly valuable both for security research and security operations. Unfortunately these communities live within a constellation of diverse online forums that are difficult to infiltrate, may adopt crawling monitoring countermeasures, and require the development of ad-hoc scrapers for each different community, making the endeavour increasingly technically challenging, and potentially expensive. To address this problem we propose THREAT/crawl, a method and prototype tool for a highly reusable crawler that can learn a wide range of (arbitrary) forum structures, can remain under-the-radar during the crawling activity and can be extended and configured at the user will. We showcase THREAT/crawl capabilities and provide prime evaluation of our prototype against a range of active, live, underground communities.

[3]  arXiv:2212.03725 [pdf, other]
Title: Learning-To-Embed: Adopting Transformer based models for E-commerce Products Representation Learning
Comments: 8 pages, 2 figures
Subjects: Information Retrieval (cs.IR)

Learning low-dimensional representation for large number of products present in an e-commerce catalogue plays a vital role as they are helpful in tasks like product ranking, product recommendation, finding similar products, modelling user-behaviour etc. Recently, a lot of tasks in the NLP field are getting tackled using the Transformer based models and these deep models are widely applicable in the industries setting to solve various problems. With this motivation, we apply transformer based model for learning contextual representation of products in an e-commerce setting. In this work, we propose a novel approach of pre-training transformer based model on a users generated sessions dataset obtained from a large fashion e-commerce platform to obtain latent product representation. Once pre-trained, we show that the low-dimension representation of the products can be obtained given the product attributes information as a textual sentence. We mainly pre-train BERT, RoBERTa, ALBERT and XLNET variants of transformer model and show a quantitative analysis of the products representation obtained from these models with respect to Next Product Recommendation(NPR) and Content Ranking(CR) tasks. For both the tasks, we collect an evaluation data from the fashion e-commerce platform and observe that XLNET model outperform other variants with a MRR of 0.5 for NPR and NDCG of 0.634 for CR. XLNET model also outperforms the Word2Vec based non-transformer baseline on both the downstream tasks. To the best of our knowledge, this is the first and novel work for pre-training transformer based models using users generated sessions data containing products that are represented with rich attributes information for adoption in e-commerce setting. These models can be further fine-tuned in order to solve various downstream tasks in e-commerce, thereby eliminating the need to train a model from scratch.

[4]  arXiv:2212.03760 [pdf, other]
Title: Pivotal Role of Language Modeling in Recommender Systems: Enriching Task-specific and Task-agnostic Representation Learning
Comments: 14 pages, 5 figures, 9 tables
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Recent studies have proposed a unified user modeling framework that leverages user behavior data from various applications. Most benefit from utilizing users' behavior sequences as plain texts, representing rich information in any domain or system without losing generality. Hence, a question arises: Can language modeling for user history corpus help improve recommender systems? While its versatile usability has been widely investigated in many domains, its applications to recommender systems still remain underexplored. We show that language modeling applied directly to task-specific user histories achieves excellent results on diverse recommendation tasks. Also, leveraging additional task-agnostic user histories delivers significant performance benefits. We further demonstrate that our approach can provide promising transfer learning capabilities for a broad spectrum of real-world recommender systems, even on unseen domains and services.

Cross-lists for Thu, 8 Dec 22

[5]  arXiv:2212.03459 (cross-list from cs.SE) [pdf, other]
Title: You Don't Know Search: Helping Users Find Code by Automatically Evaluating Alternative Queries
Comments: 11 pages, 5 figures
Subjects: Software Engineering (cs.SE); Information Retrieval (cs.IR)

Tens of thousands of engineers use Sourcegraph day-to-day to search for code and rely on it to make progress on software development tasks. We face a key challenge in designing a query language that accommodates the needs of a broad spectrum of users. Our experience shows that users express different and often contradictory preferences for how queries should be interpreted. These preferences stem from users with differing usage contexts, technical experience, and implicit expectations from using prior tools. At the same time, designing a code search query language poses unique challenges because it intersects traditional search engines and full-fledged programming languages. For example, code search queries adopt certain syntactic conventions in the interest of simplicity and terseness but invariably risk encoding implicit semantics that are ambiguous at face-value (a single space in a query could mean three or more semantically different things depending on surrounding terms). Users often need to disambiguate intent with additional syntax so that a query expresses what they actually want to search. This need to disambiguate is one of the primary frustrations we've seen users experience with writing search queries in the last three years. We share our observations that lead us to a fresh perspective where code search behavior can straddle seemingly ambiguous queries. We develop Automated Query Evaluation (AQE), a new technique that automatically generates and adaptively runs alternative query interpretations in frustration-prone conditions. We evaluate AQE with an A/B test across more than 10,000 unique users on our publicly-available code search instance. Our main result shows that relative to the control group, users are on average 22% more likely to click on a search result at all on any given day when AQE is active.

[6]  arXiv:2212.03533 (cross-list from cs.CL) [pdf, other]
Title: Text Embeddings by Weakly-Supervised Contrastive Pre-training
Comments: 17 pages
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)

This paper presents E5, a family of state-of-the-art text embeddings that transfer well to a wide range of tasks. The model is trained in a contrastive manner with weak supervision signals from our curated large-scale text pair dataset (called CCPairs). E5 can be readily used as a general-purpose embedding model for any tasks requiring a single-vector representation of texts such as retrieval, clustering, and classification, achieving strong performance in both zero-shot and fine-tuned settings. We conduct extensive evaluations on 56 datasets from the BEIR and MTEB benchmarks. For zero-shot settings, E5 is the first model that outperforms the strong BM25 baseline on the BEIR retrieval benchmark without using any labeled data. When fine-tuned, E5 obtains the best results on the MTEB benchmark, beating existing embedding models with 40x more parameters.

[7]  arXiv:2212.03721 (cross-list from cs.CL) [pdf, other]
Title: Intent Recognition in Conversational Recommender Systems
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Any organization needs to improve their products, services, and processes. In this context, engaging with customers and understanding their journey is essential. Organizations have leveraged various techniques and technologies to support customer engagement, from call centres to chatbots and virtual agents. Recently, these systems have used Machine Learning (ML) and Natural Language Processing (NLP) to analyze large volumes of customer feedback and engagement data. The goal is to understand customers in context and provide meaningful answers across various channels. Despite multiple advances in Conversational Artificial Intelligence (AI) and Recommender Systems (RS), it is still challenging to understand the intent behind customer questions during the customer journey. To address this challenge, in this paper, we study and analyze the recent work in Conversational Recommender Systems (CRS) in general and, more specifically, in chatbot-based CRS. We introduce a pipeline to contextualize the input utterances in conversations. We then take the next step towards leveraging reverse feature engineering to link the contextualized input and learning model to support intent recognition. Since performance evaluation is achieved based on different ML models, we use transformer base models to evaluate the proposed approach using a labelled dialogue dataset (MSDialogue) of question-answering interactions between information seekers and answer providers.

Replacements for Thu, 8 Dec 22

[8]  arXiv:2208.05327 (replaced) [pdf, other]
Title: Fast Offline Policy Optimization for Large Scale Recommendation
Comments: Thirty-Seventh AAAI Conference on Artificial Intelligence (AAAI-23)
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
[9]  arXiv:2211.17171 (replaced) [pdf, other]
Title: CDSM: Cascaded Deep Semantic Matching on Textual Graphs Leveraging Ad-hoc Neighbor Selection
Subjects: Information Retrieval (cs.IR)
[10]  arXiv:2205.13255 (replaced) [pdf, other]
Title: Active Labeling: Streaming Stochastic Gradients
Comments: 38 pages (9 main pages), 9 figures
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (stat.ML)
[11]  arXiv:2206.00085 (replaced) [pdf, other]
Title: Semantically-enhanced Topic Recommendation System for Software Projects
Comments: To appear in the Empirical Software Engineering Journal
Subjects: Software Engineering (cs.SE); Information Retrieval (cs.IR); Machine Learning (cs.LG)
[ total of 11 entries: 1-11 ]
[ showing up to 2000 entries per page: fewer | more ]

Disable MathJax (What is MathJax?)

Links to: arXiv, form interface, find, cs, recent, 2212, contact, help  (Access key information)