Package | Description |
---|---|
us.codecraft.webmagic | Main class "Spider" and models. |
us.codecraft.webmagic.downloader | The Downloader downloads web pages and stores them in Page objects. |
us.codecraft.webmagic.downloader.selenium | |
us.codecraft.webmagic.handler | |
us.codecraft.webmagic.model | Page model and annotations used to customize a crawler. |
us.codecraft.webmagic.pipeline | The Pipeline handles persistence and offline processing for the crawler. |
us.codecraft.webmagic.proxy | |
us.codecraft.webmagic.samples.pipeline | |
us.codecraft.webmagic.samples.scheduler | |
us.codecraft.webmagic.scheduler | The Scheduler manages the URLs to crawl. |
us.codecraft.webmagic.scheduler.component | Components of the scheduler. |
Modifier and Type | Class and Description |
---|---|
class | Spider: Entry point of a crawler. A spider contains four modules: Downloader, Scheduler, PageProcessor and Pipeline. Every module is a field of Spider. |
Modifier and Type | Method and Description |
---|---|
Task | Site.toTask() |
Modifier and Type | Method and Description |
---|---|
Page | PhantomJSDownloader.download(Request request, Task task) |
Page | HttpClientDownloader.download(Request request, Task task) |
Page | Downloader.download(Request request, Task task): Downloads a web page and stores it in a Page object. |
protected Page | HttpClientDownloader.handleResponse(Request request, String charset, org.apache.http.HttpResponse httpResponse, Task task) |
Modifier and Type | Method and Description |
---|---|
Page | SeleniumDownloader.download(Request request, Task task) |
Modifier and Type | Method and Description |
---|---|
void | CompositePipeline.process(ResultItems resultItems, Task task) |
RequestMatcher.MatchOther | SubPipeline.processResult(ResultItems resultItems, Task task): Processes the page: extracts the URLs to fetch, then extracts and stores the data. |
Modifier and Type | Class and Description |
---|---|
class | OOSpider<T>: The spider for the page model extractor. In webmagic, a POJO that holds the extracted result is called a "page model". |
Modifier and Type | Method and Description |
---|---|
void | ConsolePageModelPipeline.process(Object o, Task task) |
Modifier and Type | Method and Description |
---|---|
void | JsonFilePageModelPipeline.process(Object o, Task task) |
void | FilePageModelPipeline.process(Object o, Task task) |
void | MultiPagePipeline.process(ResultItems resultItems, Task task) |
void | JsonFilePipeline.process(ResultItems resultItems, Task task) |
void | ResultItemsCollectorPipeline.process(ResultItems resultItems, Task task) |
void | Pipeline.process(ResultItems resultItems, Task task): Processes the extracted results. |
void | FilePipeline.process(ResultItems resultItems, Task task) |
void | ConsolePipeline.process(ResultItems resultItems, Task task) |
void | PageModelPipeline.process(T t, Task task) |
void | CollectorPageModelPipeline.process(T t, Task task) |
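
Every pipeline listed above shares the `process(ResultItems resultItems, Task task)` shape. The following is a minimal sketch of a custom pipeline under that contract. It does not depend on webmagic: `Task`, `ResultItems` and `Pipeline` here are simplified, hypothetical stand-ins that only mirror the signatures in the table, and `CollectingPipeline` and `PipelineSketch` are names invented for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PipelineSketch {
    // Hypothetical stand-in for webmagic's Task, reduced to an identifier.
    interface Task {
        String getUUID();
    }

    // Hypothetical stand-in for ResultItems: an ordered map of extracted fields.
    static final class ResultItems {
        private final Map<String, Object> fields = new LinkedHashMap<>();

        ResultItems put(String key, Object value) {
            fields.put(key, value);
            return this;
        }

        Map<String, Object> getAll() {
            return fields;
        }
    }

    // Same method shape as Pipeline.process(ResultItems, Task) in the table.
    interface Pipeline {
        void process(ResultItems resultItems, Task task);
    }

    // Keeps one formatted line per extracted field in memory instead of writing it out.
    static final class CollectingPipeline implements Pipeline {
        private final List<String> lines = new ArrayList<>();

        @Override
        public void process(ResultItems resultItems, Task task) {
            for (Map.Entry<String, Object> entry : resultItems.getAll().entrySet()) {
                lines.add(task.getUUID() + " " + entry.getKey() + "=" + entry.getValue());
            }
        }

        List<String> getLines() {
            return lines;
        }
    }

    static List<String> demo() {
        Task task = () -> "example.com";
        CollectingPipeline pipeline = new CollectingPipeline();
        pipeline.process(new ResultItems().put("title", "Hello").put("views", 3), task);
        return pipeline.getLines();
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

The `Task` argument lets one pipeline instance serve several crawlers and still tell their results apart, which is why the sketch prefixes each line with the task identifier.
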
Modifier and Type | Method and Description |
---|---|
Proxy | SimpleProxyProvider.getProxy(Task task) |
Proxy | ProxyProvider.getProxy(Task task): Gets a proxy for the task according to some strategy. |
void | SimpleProxyProvider.returnProxy(Proxy proxy, Page page, Task task) |
void | ProxyProvider.returnProxy(Proxy proxy, Page page, Task task): Returns the proxy to the provider when a download completes. |
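
The `getProxy` / `returnProxy` pair above leaves the selection strategy open. As an illustration only, here is a sketch of one possible strategy, round-robin rotation. `Task`, `Proxy`, `Page` and `ProxyProvider` are hypothetical stand-ins that mirror the signatures in the table, and `RoundRobinProxyProvider` is an invented name, not a webmagic class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

class ProxySketch {
    // Hypothetical stand-ins for the webmagic types named in the table.
    interface Task {
        String getUUID();
    }

    static final class Proxy {
        final String host;
        final int port;

        Proxy(String host, int port) {
            this.host = host;
            this.port = port;
        }
    }

    static final class Page {
        // Placeholder: a real page would carry the download result.
    }

    // Same method shapes as ProxyProvider in the table.
    interface ProxyProvider {
        Proxy getProxy(Task task);

        void returnProxy(Proxy proxy, Page page, Task task);
    }

    // Hands out proxies in a fixed rotation; safe to call from several threads.
    static final class RoundRobinProxyProvider implements ProxyProvider {
        private final List<Proxy> proxies;
        private final AtomicInteger next = new AtomicInteger();

        RoundRobinProxyProvider(List<Proxy> proxies) {
            if (proxies.isEmpty()) {
                throw new IllegalArgumentException("at least one proxy is required");
            }
            this.proxies = new ArrayList<>(proxies);
        }

        @Override
        public Proxy getProxy(Task task) {
            // floorMod keeps the index valid even after the counter overflows.
            int index = Math.floorMod(next.getAndIncrement(), proxies.size());
            return proxies.get(index);
        }

        @Override
        public void returnProxy(Proxy proxy, Page page, Task task) {
            // Nothing to release in this sketch; a smarter provider could
            // inspect the page here and demote proxies that keep failing.
        }
    }

    static String demo() {
        ProxyProvider provider = new RoundRobinProxyProvider(
                Arrays.asList(new Proxy("10.0.0.1", 8080), new Proxy("10.0.0.2", 8080)));
        Task task = () -> "example.com";
        StringBuilder hosts = new StringBuilder();
        for (int i = 0; i < 3; i++) {
            Proxy proxy = provider.getProxy(task);
            hosts.append(proxy.host).append(' ');
            provider.returnProxy(proxy, new Page(), task);
        }
        return hosts.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```
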
Modifier and Type | Method and Description |
---|---|
void | OneFilePipeline.process(ResultItems resultItems, Task task) |
Modifier and Type | Method and Description |
---|---|
Request | DelayQueueScheduler.poll(Task task) |
void | LevelLimitScheduler.push(Request request, Task task) |
void | DelayQueueScheduler.push(Request request, Task task) |
Modifier and Type | Method and Description |
---|---|
protected String | RedisScheduler.getItemKey(Task task) |
int | RedisScheduler.getLeftRequestsCount(Task task) |
int | FileCacheQueueScheduler.getLeftRequestsCount(Task task) |
int | QueueScheduler.getLeftRequestsCount(Task task) |
int | PriorityScheduler.getLeftRequestsCount(Task task) |
int | MonitorableScheduler.getLeftRequestsCount(Task task) |
protected String | RedisScheduler.getQueueKey(Task task) |
protected String | RedisScheduler.getSetKey(Task task) |
int | RedisScheduler.getTotalRequestsCount(Task task) |
int | FileCacheQueueScheduler.getTotalRequestsCount(Task task) |
int | BloomFilterDuplicateRemover.getTotalRequestsCount(Task task) |
int | QueueScheduler.getTotalRequestsCount(Task task) |
int | PriorityScheduler.getTotalRequestsCount(Task task) |
int | MonitorableScheduler.getTotalRequestsCount(Task task) |
boolean | RedisScheduler.isDuplicate(Request request, Task task) |
boolean | BloomFilterDuplicateRemover.isDuplicate(Request request, Task task) |
Request | RedisScheduler.poll(Task task) |
Request | RedisPriorityScheduler.poll(Task task) |
Request | FileCacheQueueScheduler.poll(Task task) |
Request | Scheduler.poll(Task task): Gets a URL to crawl. |
Request | QueueScheduler.poll(Task task) |
Request | PriorityScheduler.poll(Task task) |
void | Scheduler.push(Request request, Task task): Adds a URL to fetch. |
void | DuplicateRemovedScheduler.push(Request request, Task task) |
protected void | RedisScheduler.pushWhenNoDuplicate(Request request, Task task) |
protected void | RedisPriorityScheduler.pushWhenNoDuplicate(Request request, Task task) |
protected void | FileCacheQueueScheduler.pushWhenNoDuplicate(Request request, Task task) |
void | QueueScheduler.pushWhenNoDuplicate(Request request, Task task) |
void | PriorityScheduler.pushWhenNoDuplicate(Request request, Task task) |
protected void | DuplicateRemovedScheduler.pushWhenNoDuplicate(Request request, Task task) |
void | RedisScheduler.resetDuplicateCheck(Task task) |
void | RedisPriorityScheduler.resetDuplicateCheck(Task task) |
void | BloomFilterDuplicateRemover.resetDuplicateCheck(Task task) |
Modifier and Type | Method and Description |
---|---|
int | HashSetDuplicateRemover.getTotalRequestsCount(Task task) |
int | DuplicateRemover.getTotalRequestsCount(Task task): Gets the total request count for monitoring. |
boolean | HashSetDuplicateRemover.isDuplicate(Request request, Task task) |
boolean | DuplicateRemover.isDuplicate(Request request, Task task): Checks whether the request is a duplicate. |
void | HashSetDuplicateRemover.resetDuplicateCheck(Task task) |
void | DuplicateRemover.resetDuplicateCheck(Task task): Resets the duplicate check. |
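
The method names in the scheduler tables (`push`, `pushWhenNoDuplicate`, `poll`, `isDuplicate`) suggest how a scheduler and a duplicate remover fit together: `push` asks the remover first and enqueues only unseen requests. The sketch below illustrates that reading with simplified, hypothetical stand-ins for `Task`, `Request`, `Scheduler` and `DuplicateRemover`. It is single-threaded and in-memory, and it is not the library's implementation; `HashSetRemover` and `DedupQueueScheduler` are invented names.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

class SchedulerSketch {
    // Hypothetical stand-ins mirroring the signatures in the tables.
    interface Task {
        String getUUID();
    }

    static final class Request {
        private final String url;

        Request(String url) {
            this.url = url;
        }

        String getUrl() {
            return url;
        }
    }

    interface DuplicateRemover {
        boolean isDuplicate(Request request, Task task);

        void resetDuplicateCheck(Task task);

        int getTotalRequestsCount(Task task);
    }

    interface Scheduler {
        void push(Request request, Task task);

        Request poll(Task task);
    }

    // Remembers every URL it has been asked about; not thread-safe.
    static final class HashSetRemover implements DuplicateRemover {
        private final Set<String> seen = new HashSet<>();

        @Override
        public boolean isDuplicate(Request request, Task task) {
            // Set.add returns false when the URL was already present.
            return !seen.add(request.getUrl());
        }

        @Override
        public void resetDuplicateCheck(Task task) {
            seen.clear();
        }

        @Override
        public int getTotalRequestsCount(Task task) {
            return seen.size();
        }
    }

    // FIFO scheduler that drops requests the remover has already seen.
    static final class DedupQueueScheduler implements Scheduler {
        private final DuplicateRemover remover = new HashSetRemover();
        private final Queue<Request> queue = new ArrayDeque<>();

        @Override
        public void push(Request request, Task task) {
            if (!remover.isDuplicate(request, task)) {
                pushWhenNoDuplicate(request, task);
            }
        }

        void pushWhenNoDuplicate(Request request, Task task) {
            queue.add(request);
        }

        @Override
        public Request poll(Task task) {
            return queue.poll(); // null when the queue is empty
        }

        int getLeftRequestsCount(Task task) {
            return queue.size();
        }

        int getTotalRequestsCount(Task task) {
            return remover.getTotalRequestsCount(task);
        }
    }

    static String demo() {
        Task task = () -> "example.com";
        DedupQueueScheduler scheduler = new DedupQueueScheduler();
        scheduler.push(new Request("http://example.com/a"), task);
        scheduler.push(new Request("http://example.com/b"), task);
        scheduler.push(new Request("http://example.com/a"), task); // duplicate, dropped
        int total = scheduler.getTotalRequestsCount(task);
        int left = scheduler.getLeftRequestsCount(task);
        Request first = scheduler.poll(task);
        return "total=" + total + " left=" + left + " first=" + first.getUrl();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Keeping the duplicate check behind its own interface means the queue logic stays the same whether the "seen" set lives in a `HashSet`, a Bloom filter or an external store.
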
Copyright © 2017. All rights reserved.