Package | Description |
---|---|
us.codecraft.webmagic | Main class "Spider" and models. |
us.codecraft.webmagic.downloader | Downloader is the part that downloads web pages and stores them in a Page object. |
us.codecraft.webmagic.downloader.selenium | |
us.codecraft.webmagic.handler | |
us.codecraft.webmagic.monitor | |
us.codecraft.webmagic.samples.scheduler | |
us.codecraft.webmagic.scheduler | Scheduler is the part that manages URLs. |
us.codecraft.webmagic.scheduler.component | Components of the scheduler. |
us.codecraft.webmagic.utils | Static utilities for WebMagic. |
Modifier and Type | Field and Description |
---|---|
protected List<Request> | Spider.startRequests |
Modifier and Type | Method and Description |
---|---|
Request | Request.addCookie(String name, String value) |
Request | Request.addHeader(String name, String value) |
Request | ResultItems.getRequest() |
Request | Page.getRequest()<br>Get the request of the current page. |
Request | Request.putExtra(String key, Object value) |
Request | Request.setBinaryContent(boolean binaryContent) |
Request | Request.setCharset(String charset) |
Request | Request.setExtras(Map<String,Object> extras) |
Request | Request.setMethod(String method) |
Request | Request.setPriority(long priority)<br>Set the priority of the request for sorting. Requires a scheduler that supports priority. |
Request | Request.setUrl(String url) |
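
The note on `Request.setPriority` says that priority only takes effect with a scheduler that supports it. The following is a minimal, self-contained sketch of what such ordering could look like. It does not use WebMagic: `PrioritySchedulerSketch` and `Req` are hypothetical stand-ins, and the rule that larger values are served first (with ties in insertion order) is an assumption made for illustration, not something this page states.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

public class PrioritySchedulerSketch {
    /** Hypothetical stand-in for a crawl request; NOT WebMagic's Request class. */
    static final class Req {
        final String url;
        final long priority;
        final long seq; // insertion order, used only to break ties

        Req(String url, long priority, long seq) {
            this.url = url;
            this.priority = priority;
            this.seq = seq;
        }
    }

    // Assumption: larger priority values are served first; ties keep insertion order.
    private final PriorityQueue<Req> queue = new PriorityQueue<>(
            Comparator.comparingLong((Req r) -> r.priority)
                    .reversed()
                    .thenComparingLong(r -> r.seq));
    private long nextSeq = 0;

    /** Queues a URL with the given priority. */
    public void push(String url, long priority) {
        queue.add(new Req(url, priority, nextSeq++));
    }

    /** Returns the next URL to crawl, or null when the queue is empty. */
    public String poll() {
        Req next = queue.poll();
        return next == null ? null : next.url;
    }

    public static void main(String[] args) {
        PrioritySchedulerSketch scheduler = new PrioritySchedulerSketch();
        scheduler.push("a", 0);
        scheduler.push("b", 5);
        scheduler.push("c", 5);
        scheduler.push("d", -1);
        // Prints b, c, a, d (one per line).
        for (String url = scheduler.poll(); url != null; url = scheduler.poll()) {
            System.out.println(url);
        }
    }
}
```

With a plain FIFO queue the same four pushes would come back as a, b, c, d, which is why the priority value is ignored unless the scheduler sorts on it.
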
Modifier and Type | Method and Description |
---|---|
List<Request> | Page.getTargetRequests() |
Modifier and Type | Method and Description |
---|---|
Spider | Spider.addRequest(Request... requests)<br>Add URLs, with their request information, to crawl. |
void | Page.addTargetRequest(Request request)<br>Add a request to fetch. |
Page | SimpleHttpClient.get(Request request) |
<T> T | SimpleHttpClient.get(Request request, Class<T> clazz) |
void | SpiderListener.onError(Request request) |
protected void | Spider.onError(Request request) |
void | SpiderListener.onSuccess(Request request) |
protected void | Spider.onSuccess(Request request) |
ResultItems | ResultItems.setRequest(Request request) |
void | Page.setRequest(Request request) |
Modifier and Type | Method and Description |
---|---|
Spider | Spider.startRequest(List<Request> startRequests)<br>Set the start requests of the Spider. These take precedence over the startUrls of Site. |
Modifier and Type | Method and Description |
---|---|
HttpClientRequestContext | HttpUriRequestConverter.convert(Request request, Site site, Proxy proxy) |
Page | PhantomJSDownloader.download(Request request, Task task) |
Page | HttpClientDownloader.download(Request request, Task task) |
Page | Downloader.download(Request request, Task task)<br>Downloads a web page and stores it in a Page object. |
protected String | PhantomJSDownloader.getPage(Request request) |
protected Page | HttpClientDownloader.handleResponse(Request request, String charset, org.apache.http.HttpResponse httpResponse, Task task) |
protected void | AbstractDownloader.onError(Request request) |
protected void | AbstractDownloader.onSuccess(Request request) |
Modifier and Type | Method and Description |
---|---|
Page | SeleniumDownloader.download(Request request, Task task) |
Modifier and Type | Method and Description |
---|---|
boolean | RequestMatcher.match(Request page)<br>Check whether to process the page. Please DO NOT change page status in this method. |
boolean | PatternRequestMatcher.match(Request request) |
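
The contract for `RequestMatcher.match` is that it only decides whether a page should be processed and leaves its state untouched. A minimal sketch of a matcher in that spirit, using only `java.util.regex`, might look like the following. `PatternMatcherSketch` is a hypothetical stand-in that matches on a plain URL string; it is not the `PatternRequestMatcher` class listed above, and the example URLs are made up.

```java
import java.util.regex.Pattern;

public class PatternMatcherSketch {
    private final Pattern pattern;

    public PatternMatcherSketch(String regex) {
        // Compile once; the compiled pattern is immutable and reusable.
        this.pattern = Pattern.compile(regex);
    }

    /** Pure check: reads the URL and changes no state. */
    public boolean match(String url) {
        return pattern.matcher(url).matches();
    }

    public static void main(String[] args) {
        PatternMatcherSketch matcher =
                new PatternMatcherSketch("https://example\\.com/post/\\d+");
        System.out.println(matcher.match("https://example.com/post/42")); // true
        System.out.println(matcher.match("https://example.com/about"));   // false
    }
}
```

Because `match` has no side effects, it can be called any number of times, and in any order, without affecting what is crawled.
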
Modifier and Type | Method and Description |
---|---|
void | SpiderMonitor.MonitorSpiderListener.onError(Request request) |
void | SpiderMonitor.MonitorSpiderListener.onSuccess(Request request) |
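
This page does not say what `MonitorSpiderListener` does inside its `onSuccess` and `onError` callbacks. One plausible use for a listener of this shape is counting outcomes, sketched below under that assumption. `CountingListenerSketch` is a hypothetical class that takes a URL string instead of a `Request`; it does not implement WebMagic's `SpiderListener` interface.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CountingListenerSketch {
    // Atomic counters, since crawler callbacks may arrive from several threads.
    private final AtomicInteger successCount = new AtomicInteger();
    private final AtomicInteger errorCount = new AtomicInteger();

    /** Called once per request that completed successfully. */
    public void onSuccess(String url) {
        successCount.incrementAndGet();
    }

    /** Called once per request that failed. */
    public void onError(String url) {
        errorCount.incrementAndGet();
    }

    public int getSuccessCount() {
        return successCount.get();
    }

    public int getErrorCount() {
        return errorCount.get();
    }

    public static void main(String[] args) {
        CountingListenerSketch listener = new CountingListenerSketch();
        listener.onSuccess("https://example.com/a");
        listener.onSuccess("https://example.com/b");
        listener.onError("https://example.com/c");
        System.out.println(listener.getSuccessCount() + " ok, "
                + listener.getErrorCount() + " failed"); // 2 ok, 1 failed
    }
}
```
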
Modifier and Type | Method and Description |
---|---|
Request | DelayQueueScheduler.poll(Task task) |
Modifier and Type | Method and Description |
---|---|
void | LevelLimitScheduler.push(Request request, Task task) |
void | DelayQueueScheduler.push(Request request, Task task) |
Modifier and Type | Method and Description |
---|---|
Request | RedisScheduler.poll(Task task) |
Request | RedisPriorityScheduler.poll(Task task) |
Request | FileCacheQueueScheduler.poll(Task task) |
Request | Scheduler.poll(Task task)<br>Get a URL to crawl. |
Request | QueueScheduler.poll(Task task) |
Request | PriorityScheduler.poll(Task task) |
Modifier and Type | Method and Description |
---|---|
protected String | BloomFilterDuplicateRemover.getUrl(Request request) |
boolean | RedisScheduler.isDuplicate(Request request, Task task) |
boolean | BloomFilterDuplicateRemover.isDuplicate(Request request, Task task) |
protected boolean | DuplicateRemovedScheduler.noNeedToRemoveDuplicate(Request request) |
void | Scheduler.push(Request request, Task task)<br>Add a URL to fetch. |
void | DuplicateRemovedScheduler.push(Request request, Task task) |
protected void | RedisScheduler.pushWhenNoDuplicate(Request request, Task task) |
protected void | RedisPriorityScheduler.pushWhenNoDuplicate(Request request, Task task) |
protected void | FileCacheQueueScheduler.pushWhenNoDuplicate(Request request, Task task) |
void | QueueScheduler.pushWhenNoDuplicate(Request request, Task task) |
void | PriorityScheduler.pushWhenNoDuplicate(Request request, Task task) |
protected void | DuplicateRemovedScheduler.pushWhenNoDuplicate(Request request, Task task) |
protected boolean | DuplicateRemovedScheduler.shouldReserved(Request request) |
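
The method names `push`, `isDuplicate`, and `pushWhenNoDuplicate` suggest a template in which `push` filters out already-seen requests and delegates the actual queuing to `pushWhenNoDuplicate`. The sketch below illustrates that reading with an in-memory set and queue. How these methods actually cooperate in WebMagic is inferred from the names only, and `DedupSchedulerSketch` is a hypothetical, single-threaded stand-in that works on URL strings rather than `Request` and `Task` objects.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

public class DedupSchedulerSketch {
    private final Set<String> seen = new HashSet<>();
    private final Queue<String> queue = new ArrayDeque<>();

    /** Returns true if the URL was seen before; otherwise records it as seen. */
    public boolean isDuplicate(String url) {
        return !seen.add(url);
    }

    /** Storage hook: a subclass could queue the URL somewhere else instead. */
    protected void pushWhenNoDuplicate(String url) {
        queue.add(url);
    }

    /** Adds a URL to fetch, silently dropping duplicates. */
    public void push(String url) {
        if (!isDuplicate(url)) {
            pushWhenNoDuplicate(url);
        }
    }

    /** Returns the next URL to crawl in FIFO order, or null when empty. */
    public String poll() {
        return queue.poll();
    }

    public static void main(String[] args) {
        DedupSchedulerSketch scheduler = new DedupSchedulerSketch();
        scheduler.push("https://example.com/a");
        scheduler.push("https://example.com/b");
        scheduler.push("https://example.com/a"); // duplicate, dropped
        System.out.println(scheduler.poll()); // https://example.com/a
        System.out.println(scheduler.poll()); // https://example.com/b
        System.out.println(scheduler.poll()); // null
    }
}
```

Splitting the duplicate check from the storage step is what would let several schedulers share one `push` while differing only in where they keep pending requests.
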
Modifier and Type | Method and Description |
---|---|
protected String | HashSetDuplicateRemover.getUrl(Request request) |
boolean | HashSetDuplicateRemover.isDuplicate(Request request, Task task) |
boolean | DuplicateRemover.isDuplicate(Request request, Task task)<br>Check whether the request is a duplicate. |
Modifier and Type | Method and Description |
---|---|
static List<Request> | UrlUtils.convertToRequests(Collection<String> urls) |
static List<Request> | RequestUtils.from(String exp) |
Modifier and Type | Method and Description |
---|---|
static List<String> | UrlUtils.convertToUrls(Collection<Request> requests) |
Copyright © 2017. All rights reserved.