Title: Out-of-Category Document Identification Using Target-Category Names as Weak Supervision

Abstract: Identifying outlier documents, whose content differs from that of the majority of documents in a corpus, plays an important role in managing large text collections. However, because they lack explicit information about the inlier (or target) distribution, existing unsupervised outlier detectors are likely to produce unreliable results, depending on the density or diversity of the outliers in the corpus. To address this challenge, we introduce a new task, out-of-category detection, which aims to distinguish documents according to their semantic relevance to the inlier (or target) categories, using the category names as weak supervision. In practice, this task is widely applicable: the scope of the target categories can be set flexibly according to users' interests, and only the target-category names are required as minimal guidance. In this paper, we present an out-of-category detection framework that effectively measures how confidently each document belongs to one of the target categories based on its category-specific relevance score. Our framework adopts a two-step approach: (i) it first generates pseudo-category labels for all unlabeled documents by exploiting the word-document similarity encoded in a text embedding space; (ii) it then trains a neural classifier on the pseudo-labels and computes the confidence from its target-category prediction. Experiments on real-world datasets demonstrate that our framework achieves the best detection performance compared with all baseline methods in various scenarios specifying different target categories.
Comments: ICDM 2021. 10 pages, 4 figures
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as: arXiv:2111.12796 [cs.IR]
  (or arXiv:2111.12796v1 [cs.IR] for this version)
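
The two-step approach described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes hand-made 3-dimensional vectors in place of a learned text embedding space, two hypothetical target-category names ("sports" and "politics"), and a linear softmax classifier standing in for the neural classifier. All function and variable names are illustrative.

```python
import numpy as np

# Hypothetical target-category names. Each row of cat_emb stands in for
# the embedding of one name in a shared word/document embedding space.
categories = ["sports", "politics"]
cat_emb = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0]])

# Made-up document embeddings. Rows 0-1 lean towards "sports",
# rows 2-3 towards "politics", rows 4-5 are mostly about something else.
doc_emb = np.array([[1.0, 0.1, 0.0],
                    [0.9, 0.0, 0.1],
                    [0.1, 1.0, 0.0],
                    [0.0, 0.9, 0.1],
                    [0.3, 0.2, 0.9],
                    [0.2, 0.3, 0.9]])


def cosine_similarity(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def softmax(z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def train_classifier(x, y, n_classes, lr=0.5, epochs=200):
    """Fit a linear softmax classifier with plain gradient descent.

    A deliberately simple stand-in for the neural classifier; the
    learning rate and epoch count are arbitrary choices for this toy.
    """
    w = np.zeros((x.shape[1], n_classes))
    onehot = np.eye(n_classes)[y]
    for _ in range(epochs):
        grad = x.T @ (softmax(x @ w) - onehot) / len(x)
        w -= lr * grad
    return w


# Step (i): pseudo-label every document with its most similar category name.
pseudo = cosine_similarity(doc_emb, cat_emb).argmax(axis=1)

# Step (ii): train on the pseudo-labels, then use the highest predicted
# class probability as the document's confidence score.
w = train_classifier(doc_emb, pseudo, len(categories))
conf = softmax(doc_emb @ w).max(axis=1)

# Documents with the lowest confidence are out-of-category candidates.
ranking = np.argsort(conf)
print("pseudo-labels:", [categories[i] for i in pseudo])
print("documents ordered from least to most confident:", ranking.tolist())
```

In this toy setup the two off-topic documents still receive a pseudo-label, since step (i) labels every document, but the classifier's confidence in them stays lower than for the documents that clearly match a category name. A real system would choose a threshold on the confidence, or rank by it, to separate out-of-category documents.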

Submission history

From: Dongha Lee
[v1] Wed, 24 Nov 2021 21:01:25 GMT (3304kb,D)
