Title: CODEC: Complex Document and Entity Collection

Abstract: CODEC is a document and entity ranking benchmark that focuses on complex research topics. We target essay-style information needs of social science researchers, e.g. "How has the UK's Open Banking Regulation benefited Challenger Banks?". CODEC includes 42 topics developed by researchers and a new focused web corpus with semantic annotations including entity links. This resource includes expert judgments on 17,509 documents and entities (416.9 per topic) from diverse automatic and interactive manual runs. The manual runs include 387 query reformulations, providing data for query performance prediction and automatic rewriting evaluation.
CODEC includes analysis of state-of-the-art systems, including dense retrieval and neural re-ranking. The results show the topics are challenging with headroom for document and entity ranking improvement. Query expansion with entity information shows significant gains in document ranking, demonstrating the resource's value for evaluating and improving entity-oriented search. We also show that the manual query reformulations significantly improve document ranking and entity ranking performance. Overall, CODEC provides challenging research topics to support the development and evaluation of entity-centric search methods.
Comments: 10 pages, SIGIR 2022 Preprint
Subjects: Information Retrieval (cs.IR)
ACM classes: H.3.3
Cite as: arXiv:2205.04546 [cs.IR]
  (or arXiv:2205.04546v2 [cs.IR] for this version)

Submission history

From: Iain Mackie
[v1] Mon, 9 May 2022 20:40:53 GMT (5281kb,D)
[v2] Tue, 17 May 2022 11:09:14 GMT (5281kb,D)
