Computer Science > Information Retrieval
Title: Character 3-gram Mover's Distance: An Effective Method for Detecting Near-duplicate Japanese-language Recipes
(Submitted on 11 Dec 2019 (v1), last revised 21 Dec 2019 (this version, v2))
Abstract: On user-generated recipe websites, users post their original recipes. Some recipes, however, are very similar to other recipes in major components such as the cooking instructions. We refer to such recipes as "near-duplicate recipes". In this study, we propose a method that extends the "Word Mover's Distance", which calculates distances between texts based on word embeddings, to character 3-gram embeddings. Using a corpus of over 1.21 million recipes, we learned word embeddings and character 3-gram embeddings with a Skip-Gram model with negative sampling and with fastText, and used them to extract candidate pairs of near-duplicate recipes. We then annotated these candidates and evaluated the proposed method against a comparison method. Our results demonstrated that the proposed method successfully detected near-duplicate recipes that the comparison method did not detect.
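The idea of measuring text distance over character 3-grams rather than words can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on two loudly stated assumptions: `toy_embedding` is a hypothetical stand-in that assigns each 3-gram a deterministic pseudo-random vector (the paper learns embeddings from the recipe corpus), and the distance computed is the relaxed lower bound of the Word Mover's Distance, in which every 3-gram sends all of its mass to its nearest counterpart, rather than the exact optimal-transport solution. All function names here are illustrative.

```python
import math
import random
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of text (3-grams by default)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def toy_embedding(gram, dim=8):
    """Hypothetical stand-in for a learned embedding.

    Produces a deterministic pseudo-random vector per n-gram. A real system
    would look up vectors trained on a corpus instead.
    """
    rng = random.Random(gram)  # seeding with the string makes this repeatable
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def relaxed_mover_distance(a, b, n=3, embed=toy_embedding):
    """Relaxed mover's distance between two texts over character n-grams.

    This is a lower bound on the exact mover's distance, not the exact value.
    """
    bag_a = Counter(char_ngrams(a, n))
    bag_b = Counter(char_ngrams(b, n))
    if not bag_a or not bag_b:
        raise ValueError("both texts must contain at least one n-gram")
    vec = {g: embed(g) for g in set(bag_a) | set(bag_b)}

    def one_way(src, dst):
        total = sum(src.values())
        # Each source n-gram moves its normalized mass to the nearest target n-gram.
        return sum(
            (count / total) * min(math.dist(vec[g], vec[h]) for h in dst)
            for g, count in src.items()
        )

    # The tighter of the two one-sided relaxations.
    return max(one_way(bag_a, bag_b), one_way(bag_b, bag_a))
```

Calling `relaxed_mover_distance` on two identical strings returns 0.0, because every 3-gram finds itself in the other text. With the toy vectors the distance between different strings carries no semantic meaning, so substituting trained embeddings through the `embed` parameter is what would make the scores useful for ranking candidate pairs.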
Submission history
From: Masaki Oguni
[v1] Wed, 11 Dec 2019 08:27:57 GMT (280kb)
[v2] Sat, 21 Dec 2019 11:44:40 GMT (280kb)