Title: Corpus-Based Paraphrase Detection Experiments and Review

Abstract: Paraphrase detection is important for a number of applications, including plagiarism detection, authorship attribution, question answering, text summarization, and text mining in general. In this paper, we give a performance overview of various types of corpus-based models, especially deep learning (DL) models, on the task of paraphrase detection. We report the results of eight models (LSI, TF-IDF, Word2Vec, Doc2Vec, GloVe, FastText, ELMo, and USE) evaluated on three publicly available corpora: the Microsoft Research Paraphrase Corpus, the Clough and Stevenson corpus, and the Webis Crowd Paraphrase Corpus 2011. Through a large number of experiments, we determined the most appropriate choices for text pre-processing, hyper-parameters, sub-model selection where alternatives exist (e.g., Skip-gram vs. CBOW), distance measures, and the semantic similarity/paraphrase detection threshold. Our findings, and those of other researchers who have used deep learning models, show that DL models are very competitive with traditional state-of-the-art approaches and have potential that should be further developed.
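As a rough illustration of the pipeline the abstract outlines (vectorize two texts, apply a distance measure, compare against a threshold), the following is a minimal sketch using TF-IDF and cosine similarity in plain Python. It is not the authors' implementation: the function names, the tokenizer, the smoothed IDF formula, and the threshold of 0.5 are all assumptions chosen for illustration, and the IDF is computed over just the two input texts rather than a full corpus.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (assumed pre-processing)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) for a small list of documents.

    Uses a smoothed IDF, log((1 + n) / (1 + df)) + 1, which is one common
    variant; the paper may use a different weighting.
    """
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [
        {t: tf * idf[t] for t, tf in Counter(toks).items()}
        for toks in tokenized
    ]


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def is_paraphrase(a, b, threshold=0.5):
    """Label a pair as a paraphrase if similarity reaches the threshold.

    The default threshold is an arbitrary placeholder, not a value
    reported in the paper.
    """
    u, v = tfidf_vectors([a, b])
    return cosine_similarity(u, v) >= threshold
```

In practice the threshold would be tuned on labelled pairs from a corpus, and the TF-IDF vectors could be swapped for embeddings from any of the other models listed in the abstract while keeping the same similarity-and-threshold step.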
Comments: 25 pages, 7 figures, 4 tables
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Journal reference: Information (Switzerland), Vol. 11, Issue 5, p. 241, 2020, MDPI AG
DOI: 10.3390/INFO11050241
Cite as: arXiv:2106.00145 [cs.CL]
  (or arXiv:2106.00145v1 [cs.CL] for this version)

Submission history

From: Tedo Vrbanec
[v1] Mon, 31 May 2021 23:29:24 GMT (1450kb)