BUbiNG

Paolo Boldi, Andrea Marino, Massimo Santini, Sebastiano Vigna

doi:10.1145/3160017

Paolo Boldi

Dipartimento di Informatica, Università degli Studi di Milano, Milano, MI, Italy

Andrea Marino

Dipartimento di Informatica, Università degli Studi di Milano, Milano, MI, Italy

Massimo Santini

Dipartimento di Informatica, Università degli Studi di Milano, Milano, MI, Italy

Sebastiano Vigna

Dipartimento di Informatica, Università degli Studi di Milano, Milano, MI, Italy

Bibliographic Information

Other Title

Massive Crawling for the Masses

Description

Although web crawlers have been around for twenty years by now, there is virtually no freely available, open-source crawling software that guarantees high throughput, overcomes the limits of single-machine systems, and, at the same time, scales linearly with the amount of resources available. This article aims at filling this gap, through the description of BUbiNG, our next-generation web crawler built upon the authors’ experience with UbiCrawler [9] and on the last ten years of research on the topic. BUbiNG is an open-source Java fully distributed crawler; a single BUbiNG agent, using sizeable hardware, can crawl several thousand pages per second respecting strict politeness constraints, both host- and IP-based. Unlike existing open-source distributed crawlers that rely on batch techniques (like MapReduce), BUbiNG job distribution is based on modern high-speed protocols to achieve very high throughput.

Journal

ACM Transactions on the Web

ACM Transactions on the Web 12 (2), 1-26, 2018-05-31

Association for Computing Machinery (ACM)

Details

CRID

1360016866569255552

DOI

10.1145/3160017

ISSN

1559114X

15591131

Web Site

https://dl.acm.org/doi/pdf/10.1145/3160017

Data Source

- Crossref