-
- Paolo Boldi
- Dipartimento di Informatica, Università degli Studi di Milano, Milano, MI, Italy
-
- Andrea Marino
- Dipartimento di Informatica, Università degli Studi di Milano, Milano, MI, Italy
-
- Massimo Santini
- Dipartimento di Informatica, Università degli Studi di Milano, Milano, MI, Italy
-
- Sebastiano Vigna
- Dipartimento di Informatica, Università degli Studi di Milano, Milano, MI, Italy
Bibliographic Information
- Other Title
-
- Massive Crawling for the Masses
Description
Although web crawlers have been around for twenty years by now, there is virtually no freely available, open-source crawling software that guarantees high throughput, overcomes the limits of single-machine systems, and, at the same time, scales linearly with the amount of resources available. This article aims at filling this gap, through the description of BUbiNG, our next-generation web crawler built upon the authors’ experience with UbiCrawler [9] and on the last ten years of research on the topic. BUbiNG is an open-source Java fully distributed crawler; a single BUbiNG agent, using sizeable hardware, can crawl several thousand pages per second respecting strict politeness constraints, both host- and IP-based. Unlike existing open-source distributed crawlers that rely on batch techniques (like MapReduce), BUbiNG job distribution is based on modern high-speed protocols to achieve very high throughput.
Journal
-
- ACM Transactions on the Web
-
ACM Transactions on the Web 12 (2), 1-26, 2018-05-31
Association for Computing Machinery (ACM)
Details
-
- CRID
- 1360016866569255552
-
- DOI
- 10.1145/3160017
-
- ISSN
- 1559-114X
- 1559-1131
-
- Data Source
-
- Crossref