A scheduler whose requests can be counted for monitoring.
Scheduler is the part of the crawler that manages URLs.
Implement the Scheduler interface to manage the URLs to fetch and to remove duplicate URLs.
BloomFilterDuplicateRemover, intended for a very large number of URLs.
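
The idea behind a Bloom-filter duplicate remover can be sketched as follows. This is a minimal, self-contained illustration, not the library's BloomFilterDuplicateRemover: the class name SimpleBloomDuplicateRemover, the method isDuplicate, the FNV-style hash, and the sizing parameters are all assumptions made for the example. A Bloom filter trades a small chance of false positives (a new URL reported as already seen) for a fixed memory footprint.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

// Hypothetical sketch of a Bloom-filter duplicate check; names are illustrative.
class SimpleBloomDuplicateRemover {
    private final BitSet bits;
    private final int size;       // number of bits in the filter
    private final int hashCount;  // number of bit positions probed per URL

    SimpleBloomDuplicateRemover(int size, int hashCount) {
        if (size <= 0 || hashCount <= 0) {
            throw new IllegalArgumentException("size and hashCount must be positive");
        }
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    /**
     * Returns true if the URL was probably seen before.
     * Otherwise records the URL in the filter and returns false.
     */
    synchronized boolean isDuplicate(String url) {
        byte[] data = url.getBytes(StandardCharsets.UTF_8);
        int h1 = fnv1a(data, 0x811C9DC5);
        int h2 = fnv1a(data, 0x01000193) | 1; // odd step for double hashing
        boolean allSet = true;
        for (int i = 0; i < hashCount; i++) {
            int index = Math.floorMod(h1 + i * h2, size);
            if (!bits.get(index)) {
                allSet = false;
                bits.set(index);
            }
        }
        return allSet;
    }

    // FNV-1a style hash with a caller-supplied starting value.
    private static int fnv1a(byte[] data, int seed) {
        int hash = seed;
        for (byte b : data) {
            hash ^= (b & 0xff);
            hash *= 0x01000193;
        }
        return hash;
    }
}
```

A caller would invoke isDuplicate(url) before pushing and skip the URL when it returns true. Because false positives are possible, a small fraction of new URLs may be dropped; a larger bit count lowers that rate.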
Removes duplicate URLs and pushes only URLs that have not been seen before.
Stores URLs and a cursor in files so that a Spider can resume from where it stopped after a shutdown.
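
One way to picture the file-plus-cursor approach is the sketch below. It is an assumption-laden illustration rather than the library's implementation: the class name FileBackedUrlQueue, the file names urls.txt and cursor.txt, and the one-URL-per-line format are all hypothetical. URLs are appended to one file, and a second file records how many of them have already been handed out, so a new instance pointed at the same directory continues from that position.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of a resumable URL queue; file layout is an assumption.
class FileBackedUrlQueue {
    private final Path urlFile;     // one URL per line, append-only
    private final Path cursorFile;  // number of URLs already polled
    private final List<String> urls = new ArrayList<>();
    private final Set<String> seen = new HashSet<>();
    private int cursor;

    FileBackedUrlQueue(Path dir) {
        this.urlFile = dir.resolve("urls.txt");
        this.cursorFile = dir.resolve("cursor.txt");
        try {
            Files.createDirectories(dir);
            // Reload state left behind by a previous run, if any.
            if (Files.exists(urlFile)) {
                for (String line : Files.readAllLines(urlFile, StandardCharsets.UTF_8)) {
                    if (!line.isEmpty() && seen.add(line)) {
                        urls.add(line);
                    }
                }
            }
            if (Files.exists(cursorFile)) {
                String text = new String(Files.readAllBytes(cursorFile), StandardCharsets.UTF_8).trim();
                cursor = text.isEmpty() ? 0 : Math.min(Integer.parseInt(text), urls.size());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Appends the URL to the URL file unless it has been pushed before. */
    synchronized void push(String url) {
        if (!seen.add(url)) {
            return;
        }
        urls.add(url);
        try {
            Files.write(urlFile, (url + System.lineSeparator()).getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the next URL and persists the advanced cursor, or null if none is left. */
    synchronized String poll() {
        if (cursor >= urls.size()) {
            return null;
        }
        String url = urls.get(cursor++);
        try {
            Files.write(cursorFile, Integer.toString(cursor).getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return url;
    }
}
```

Writing the cursor on every poll keeps the sketch simple but costs one file write per URL; a real implementation might flush periodically instead, at the price of re-fetching a few URLs after a crash.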
Basic Scheduler implementation.
Stores the URLs to fetch in a LinkedBlockingQueue and removes duplicate URLs with a HashMap.
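
The queue-plus-hash-lookup pattern described above can be sketched roughly as follows. The class name SimpleQueueScheduler and the methods push, poll, pendingCount, and totalCount are hypothetical, and URLs are plain strings here rather than the library's request objects. The sketch also swaps the HashMap for a concurrent hash set, which is an assumption made so that the example is safe to call from several threads.

```java
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch of an in-memory scheduler with duplicate removal.
class SimpleQueueScheduler {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    /** Enqueues the URL only if it has not been pushed before; returns true if enqueued. */
    boolean push(String url) {
        if (!seen.add(url)) {
            return false; // duplicate, drop it
        }
        queue.add(url);
        return true;
    }

    /** Returns the next URL to fetch, or null if nothing is waiting. */
    String poll() {
        return queue.poll();
    }

    /** Number of URLs still waiting to be fetched. */
    int pendingCount() {
        return queue.size();
    }

    /** Number of distinct URLs ever pushed. */
    int totalCount() {
        return seen.size();
    }
}
```

The two counters illustrate the kind of numbers a monitorable scheduler could expose: how many requests remain and how many distinct requests have been accepted in total.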
A Redis scheduler with priority support.
Uses Redis as the URL scheduler for distributed crawlers.
Copyright © 2017. All rights reserved.