We gratefully acknowledge support from
the Simons Foundation and member institutions.
Full-text links:

Download:

Current browse context:

cs.LG

Change to browse by:

References & Citations

DBLP - CS Bibliography

Bookmark

(what is this?)
CiteULike logo BibSonomy logo Mendeley logo del.icio.us logo Digg logo Reddit logo ScienceWISE logo

Computer Science > Machine Learning

Title: The Klarna Product Page Dataset: A Realistic Benchmark for Web Representation Learning

Abstract: This paper tackles the under-explored problem of DOM tree element representation learning. We advance the field of machine learning-based web automation and hope to spur further research regarding this crucial area with two contributions. First, we adapt several popular Graph-based Neural Network models and apply them to embed elements in website DOM trees. Second, we present a large-scale and realistic dataset of webpages. By providing this open-access resource, we lower the entry barrier to this area of research. The dataset contains $51,701$ manually labeled product pages from $8,175$ real e-commerce websites. The pages can be rendered entirely in a web browser and are suitable for computer vision applications. This makes it substantially richer and more diverse than other datasets proposed for element representation learning, classification and prediction on the web. Finally, using our proposed dataset, we show that the embeddings produced by a Graph Convolutional Neural Network outperform representations produced by other state-of-the-art methods in a web element prediction task.
Comments: 17 pages, 5 figures, 5 tables, under review
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
MSC classes: 68T07
Cite as: arXiv:2111.02168 [cs.LG]
  (or arXiv:2111.02168v2 [cs.LG] for this version)

Submission history

From: Stefan Magureanu [view email]
[v1] Wed, 3 Nov 2021 12:13:52 GMT (222kb,D)
[v2] Tue, 9 Nov 2021 15:17:14 GMT (223kb,D)

Link back to: arXiv, form interface, contact.