6.2 Customized Scheduler
The scheduler is the WebMagic component that manages URLs. In general, it has two roles:
- manage the queue of URLs waiting to be crawled
- filter out duplicate URLs
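
To make these two roles concrete, here is a minimal sketch in plain Java. It is not WebMagic's implementation: the class `SimpleUrlScheduler` and its `push`/`poll` methods are hypothetical names chosen for illustration, and it stores plain URL strings rather than WebMagic request objects.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Hypothetical class for illustration only; it is not part of WebMagic.
class SimpleUrlScheduler {
    // Role 1: URLs that are waiting to be crawled, in first-in-first-out order.
    private final Queue<String> pending = new ArrayDeque<>();
    // Role 2: every URL pushed so far, used to reject duplicates.
    private final Set<String> seen = new HashSet<>();

    /** Queues the URL unless it was pushed before; returns true if it was queued. */
    public synchronized boolean push(String url) {
        if (!seen.add(url)) {
            return false; // already seen, so it is filtered out
        }
        pending.add(url);
        return true;
    }

    /** Returns the next URL to crawl, or null when nothing is waiting. */
    public synchronized String poll() {
        return pending.poll();
    }
}
```

Pushing the same URL twice queues it only once, and `poll` hands URLs back in the order they were accepted.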
WebMagic ships with several common schedulers. If you only want to run a simple spider locally, you do not need to customize the scheduler, but it is still worth knowing what is available.
Class | Description | Remark |
---|---|---|
DuplicateRemovedScheduler | An abstract class that provides some template methods | Extend it to implement your own scheduler |
QueueScheduler | Uses an in-memory queue to store URLs | |
PriorityScheduler | Uses an in-memory priority queue to store URLs | Uses more memory than QueueScheduler, but it is required for request.priority to take effect |
FileCacheQueueScheduler | Uses files to store URLs, so that after the program exits and restarts it can continue crawling the URLs saved in the files | You must set the file path. It creates two files, .urls.txt and .cursor.txt |
RedisScheduler | Uses Redis to store the queue, so that multiple machines can crawl together as a distributed system | Requires Redis to be installed and running |
In version 0.5.1, I rebuilt the scheduler. Duplicate removal was extracted into an independent interface, `DuplicateRemover`, so you can set a different `DuplicateRemover` for the same scheduler. There are two implementations for removing duplicates.
Class | Description |
---|---|
HashSetDuplicateRemover | Uses a HashSet to remove duplicates; exact, but needs a lot of memory |
BloomFilterDuplicateRemover | Uses a Bloom filter to remove duplicates; needs much less memory, but may wrongly treat a few new URLs as duplicates and skip them |
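
The trade-off in the table can be seen in a toy Bloom filter. The sketch below is a simplified illustration in plain Java, not the code behind `BloomFilterDuplicateRemover`; the class `TinyBloomFilter`, its constructor parameters, and the hashing scheme are all assumptions made for the example.

```java
import java.util.BitSet;

// Hypothetical illustration; not WebMagic's BloomFilterDuplicateRemover.
class TinyBloomFilter {
    private final BitSet bits;
    private final int size;      // number of bits in the filter
    private final int hashCount; // number of bit positions checked per URL

    TinyBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    /**
     * Records the URL and reports whether it looked like a duplicate.
     * A true result may be a false positive: a new URL whose bit
     * positions were all set earlier by other URLs.
     */
    boolean isDuplicate(String url) {
        int h1 = url.hashCode();
        int h2 = (h1 >>> 16) | 1; // odd step for simple double hashing
        boolean allSet = true;
        for (int i = 0; i < hashCount; i++) {
            int index = Math.floorMod(h1 + i * h2, size);
            if (!bits.get(index)) {
                allSet = false;
                bits.set(index);
            }
        }
        return allSet;
    }
}
```

The filter stores only bits, never the URLs themselves, which is where the memory saving comes from. In the degenerate case `new TinyBloomFilter(1, 1)`, every URL maps to the same single bit, so the second distinct URL is already reported as a duplicate. A real filter is sized so that such collisions are rare.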
All default schedulers use `HashSetDuplicateRemover` to remove duplicates (except `RedisScheduler`). If you have a large number of URLs to crawl, we recommend using `BloomFilterDuplicateRemover`. For example:
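```java
import us.codecraft.webmagic.scheduler.BloomFilterDuplicateRemover;
import us.codecraft.webmagic.scheduler.QueueScheduler;

// `spider` is an existing Spider instance.
spider.setScheduler(new QueueScheduler()
        .setDuplicateRemover(new BloomFilterDuplicateRemover(10000000)) // 10000000 is the estimated number of URLs
);
```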
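
To get a feel for what the estimated URL count means for memory, the standard Bloom filter sizing formula m = -n ln(p) / (ln 2)^2 gives the number of bits needed for n items at false-positive rate p. The sketch below applies it in plain Java. The helper `BloomSizing` is hypothetical, and the 1% false-positive rate is an assumed value for the example, not a documented WebMagic setting.

```java
// Hypothetical helper for illustration; not part of WebMagic.
class BloomSizing {
    /** Bits needed for n items at false-positive rate p: m = -n * ln(p) / (ln 2)^2. */
    static long optimalBits(long expectedUrls, double falsePositiveRate) {
        double ln2 = Math.log(2);
        return (long) Math.ceil(-expectedUrls * Math.log(falsePositiveRate) / (ln2 * ln2));
    }

    public static void main(String[] args) {
        // 10 million URLs at an assumed 1% false-positive rate.
        long bits = optimalBits(10_000_000L, 0.01);
        System.out.printf("%d bits (about %.1f MB)%n", bits, bits / 8.0 / 1_000_000);
    }
}
```

Under these assumptions the filter needs roughly 96 million bits, about 12 MB. If the estimate is set much lower than the real number of URLs, the false-positive rate rises and more new URLs are skipped.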