Title: Description-Based Text Similarity

Abstract: Identifying texts with a given semantics is central to many information-seeking scenarios. Similarity search over vector embeddings appears to be central to this ability, yet the similarity reflected in current text embeddings is corpus-driven, and is inconsistent and sub-optimal for many use cases. What, then, is a good notion of similarity for effective retrieval of text?
We identify the need to search for texts based on abstract descriptions of their content, and the corresponding notion of description-based similarity. We demonstrate the inadequacy of current text embeddings and propose an alternative model that performs significantly better when used in standard nearest-neighbor search. The model is trained using positive and negative pairs sourced through prompting an LLM, demonstrating how data from LLMs can be used for creating new capabilities not immediately possible using the original model.
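
Illustrative sketch (not taken from the paper): the abstract refers to standard nearest-neighbor search over embeddings. The snippet below shows one common form of that search, ranking corpus vectors by cosine similarity to a query vector. The function names, the choice of cosine similarity, and the toy 3-dimensional vectors are all assumptions made for illustration; in practice the vectors would come from a text encoder, which is not shown here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbors(query_vec, corpus_vecs, k=1):
    """Return the indices of the k corpus vectors most similar to the query."""
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(corpus_vecs)]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]


# Hypothetical, hand-made vectors standing in for encoded corpus texts.
corpus_vecs = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.2],
    [0.1, 0.0, 1.0],
]
# Hypothetical vector standing in for an encoded description (the query).
description_vec = [0.1, 0.9, 0.1]

top = nearest_neighbors(description_vec, corpus_vecs, k=2)
print(top)
```

With these toy vectors the second corpus entry is by far the closest to the query. Under the setting the abstract describes, the quality of such a ranking depends entirely on whether the embedding space places a description near the texts it describes.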
Comments: A preprint
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Cite as: arXiv:2305.12517 [cs.CL]
  (or arXiv:2305.12517v4 [cs.CL] for this version)

Submission history

From: Shauli Ravfogel
[v1] Sun, 21 May 2023 17:14:31 GMT (7351kb,D)
[v2] Sun, 22 Oct 2023 17:38:42 GMT (2396kb,D)
[v3] Thu, 25 Apr 2024 08:30:17 GMT (1603kb,D)
[v4] Fri, 26 Apr 2024 08:04:59 GMT (1603kb,D)