4.4 Crawler configuration, start and stop
4.4.1 Spider
Spider
is crawler start entrance. Before starting the crawlers, we need to use a PageProcessor
create a Spider object, and then use the run()
to start. While other components of the Spider (Downloader, Scheduler, Pipeline) can be set by a set method.
Method | Description | Examples |
---|---|---|
create(PageProcessor) | Create Spider | spider.create(new GithubRepoProcessor()) |
addUrl(String…) | Add initial URL | spider.addUrl("http://webmagic.io/docs/") |
addRequest(Request...) | Add initial Request | spider.addRequest("http://webmagic.io/docs/") |
thread(n) | n threads open | spider.thread(5) |
run() | starts, blocking the current thread of execution | spider.run() |
start()/runAsync() | asynchronous start, continue with the current thread | spider.start() |
stop() | stop crawler | spider.stop() |
test(String) | crawl a page to test | spider.test("http://webmagic.io/docs/") |
addPipeline(Pipeline) | add a Pipeline, a Spider can have multiple Pipeline | spider.addPipeline(new ConsolePipeline()) |
setScheduler(Scheduler) | Settings Scheduler, a Spider must have at a Scheduler | spider.setScheduler(new RedisScheduler()) |
setDownloader(Downloader) | Settings Downloader, a Spider must have at a Downloader | spider.setDownloader(new SeleniumDownloader()) |
get(String) | synchronous calls, and direct access to the results | ResultItems result = spider.get("http://webmagic.io/docs/") |
getAll(String…) | synchronous calls, and direct access to a bunch of results | List<ResultItems> results = spider .getAll("http://webmagic.io/docs/", "http://webmagic.io/xxx") |
4.4.2 Site
The site itself, some configuration information, such as encoding, HTTP headers, timeout, retry strategies, agents, etc., can be configured by setting Site
object.
Method | Description | Examples |
---|---|---|
setCharset(String) | set the encoding | site.setCharset("utf-8") |
setUserAgent(String) | Settings UserAgent | site.setUserAgent("Spider") |
setTimeOut(int) | set the timeout in milliseconds | site.setTimeOut(3000) |
setRetryTimes(int) | Settings retries | site.setRetryTimes(3) |
setCycleRetryTimes(int) | Setting cycle retries | site.setCycleRetryTimes(3) |
addCookie(String,String) | add a cookie | site.addCookie("dotcomt_user","code4craft") |
setDomain(String) | set up the domain name, the domain name to be set later, addCookie only take effect | site.setDomain("github.com") |
addHeader(String,String) | add a addHeader | site.addHeader("Referer","https://github.com") |
setHttpProxy(HttpHost) | Http proxy settings | site.setHttpProxy(new HttpHost("127.0.0.1",8080)) |
Wherein the loop retry cycleRetry version 0.3.0 is added mechanism.
This mechanism will fail to download url back into the tail of the queue retry until the number of retries to ensure that no leakage grasping for some reason the network page.