Title: Analysis of the Web Graph Aggregated by Host and Pay-Level Domain

Abstract: This paper analyzes the web as a graph aggregated by host and by pay-level domain (PLD). The publicly available web graph datasets were released by the Common Crawl Foundation and are based on a web crawl performed from May to July 2017. The host graph has $\sim$1.3 billion nodes and $\sim$5.3 billion arcs. The PLD graph has $\sim$91 million nodes and $\sim$1.1 billion arcs. We study the distributions of degree and of the sizes of strongly and weakly connected components (SCC/WCC), focusing on power-law detection using statistical methods. The statistical plausibility of the power-law model is compared with that of several alternative distributions. While there is no evidence of power-law tails at the host level, they emerge under PLD aggregation for the indegree, SCC size, and WCC size distributions. Finally, we analyze distance-related features by studying the cumulative distributions of shortest path lengths and estimate the diameters of the graphs.
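
As a rough illustration of the quantities named in the abstract, the Python sketch below computes indegrees and weakly connected component sizes from a list of arcs, and estimates a power-law tail exponent. The toy arc list, the function names, the cutoff `xmin`, and the choice of estimator (the commonly used approximate maximum-likelihood formula for discrete data) are all assumptions made for illustration. The abstract does not state which tools or fitting procedure the paper uses, and a full analysis would also select `xmin` and compare against alternative distributions.

```python
import math
from collections import Counter


def indegrees(arcs):
    """Count incoming arcs per node from (source, target) pairs."""
    return Counter(target for _, target in arcs)


def wcc_sizes(arcs):
    """Sizes of weakly connected components (arc direction ignored),
    computed with a simple union-find structure."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in arcs:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv

    sizes = Counter(find(x) for x in list(parent))
    return sorted(sizes.values(), reverse=True)


def powerlaw_alpha(values, xmin):
    """Approximate discrete MLE of the tail exponent for values >= xmin:
    alpha ~ 1 + n / sum(ln(x_i / (xmin - 0.5)))."""
    tail = [x for x in values if x >= xmin]
    if not tail:
        raise ValueError("no observations at or above xmin")
    log_sum = sum(math.log(x / (xmin - 0.5)) for x in tail)
    return 1.0 + len(tail) / log_sum


# Hypothetical toy graph: nodes stand in for hosts or PLDs.
toy_arcs = [
    ("a.com", "b.com"),
    ("c.com", "b.com"),
    ("b.com", "a.com"),
    ("d.org", "e.org"),
]

if __name__ == "__main__":
    print(indegrees(toy_arcs))
    print(wcc_sizes(toy_arcs))
    print(powerlaw_alpha([1, 1, 2, 3, 10], xmin=1))
```

On graphs of the size reported here (billions of arcs), in-memory Python dictionaries would not be practical; the sketch only shows what is being measured.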
Subjects: Social and Information Networks (cs.SI)
Journal reference: Complex Networks and Their Applications VII. COMPLEX NETWORKS 2018. Studies in Computational Intelligence, vol 813. Springer, Cham
DOI: 10.1007/978-3-030-05414-4_2
Cite as: arXiv:1802.05435 [cs.SI]
  (or arXiv:1802.05435v3 [cs.SI] for this version)

Submission history

From: Agostino Funel
[v1] Thu, 15 Feb 2018 08:52:02 GMT (1398kb,D)
[v2] Sun, 18 Feb 2018 11:35:47 GMT (1398kb,D)
[v3] Tue, 6 Mar 2018 10:25:23 GMT (1398kb,D)