6.1 Customize Pipeline
When extraction finishes, we use a `Pipeline` to persist the extracted results. We can also customize the `Pipeline` to perform other common tasks. In this chapter we introduce `Pipeline` and use two examples to explain how to customize it.
6.1.1 Introduction to Pipeline
The `Pipeline` interface is defined as follows:
public interface Pipeline {
    // ResultItems holds the extracted results; it is a map-like structure.
    // Data stored via page.putField(key, value) can be read with resultItems.get(key).
    public void process(ResultItems resultItems, Task task);
}
As we can see, `Pipeline` persists the data extracted by the `PageProcessor`. We could also do this work inside the `PageProcessor` itself, so why use a `Pipeline`? There are several reasons:
- To separate modules. Page extraction and data persistence are the two stages of a spider. Separating them makes the code structure clearer, and also lets us move persistence into another thread or even onto another server.
- The function of a `Pipeline` is more stable, so it is easy to turn it into a reusable component. Extraction logic differs greatly from page to page, but persistence is almost always the same, such as saving to a file or to a database, and is common to almost all pages. WebMagic ships with many common `Pipeline` implementations, such as writing to the console, saving to a file, or saving to a file in JSON format.
In WebMagic, a `Spider` can have many `Pipeline`s; calling `Spider.addPipeline()` adds one. All of them will be invoked for each page. For example, you can use:
spider.addPipeline(new ConsolePipeline()).addPipeline(new FilePipeline())
This writes the data to the console and saves it to a file at the same time.
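The chaining above can be sketched in plain Java. The classes below are simplified stand-ins for WebMagic's `Pipeline` and `ResultItems` (the real interface also takes a `Task` and a richer result object); this only illustrates how each registered pipeline receives the same extracted result, in order:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-ins for WebMagic's Pipeline and ResultItems, only to
// illustrate the dispatch; names and signatures here are not WebMagic's.
public class PipelineChainSketch {

    interface Pipeline {
        void process(Map<String, Object> resultItems, List<String> out);
    }

    // The spider calls every registered pipeline, in order, with the same result.
    static List<String> runAll(List<Pipeline> pipelines, Map<String, Object> resultItems) {
        List<String> out = new ArrayList<>();
        for (Pipeline p : pipelines) {
            p.process(resultItems, out);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Pipeline> pipelines = new ArrayList<>();
        // a console-like pipeline and a file-like pipeline, added one after another
        pipelines.add((res, out) -> res.forEach((k, v) -> out.add(k + ":\t" + v)));
        pipelines.add((res, out) -> out.add("file: would persist " + res.size() + " fields"));

        Map<String, Object> resultItems = new LinkedHashMap<>();
        resultItems.put("author", "code4craft");
        resultItems.put("name", "webmagic");

        runAll(pipelines, resultItems).forEach(System.out::println);
    }
}
```

Because every pipeline sees the same result, the order in which you call `addPipeline()` is the order in which they run.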
6.1.2 Output the result to the console
When we introduced `PageProcessor`, we used `GithubRepoPageProcessor` as an example. Here is a snippet of its code:
public void process(Page page) {
    page.addTargetRequests(page.getHtml().links().regex("(https://github\\.com/\\w+/\\w+)").all());
    page.addTargetRequests(page.getHtml().links().regex("(https://github\\.com/\\w+)").all());
    // save the author; the data will finally be stored in ResultItems
    page.putField("author", page.getUrl().regex("https://github\\.com/(\\w+)/.*").toString());
    page.putField("name", page.getHtml().xpath("//h1[@class='entry-title public']/strong/a/text()").toString());
    if (page.getResultItems().get("name") == null) {
        // when skip is set, this page will not be processed by the Pipeline
        page.setSkip(true);
    }
    page.putField("readme", page.getHtml().xpath("//div[@id='readme']/tidyText()"));
}
Now we want to write the result to the console. `ConsolePipeline` can do this:
public class ConsolePipeline implements Pipeline {

    @Override
    public void process(ResultItems resultItems, Task task) {
        System.out.println("get page: " + resultItems.getRequest().getUrl());
        // iterate over all results and print them; "author", "name" and "readme" are the keys, the extracted results are the values
        for (Map.Entry<String, Object> entry : resultItems.getAll().entrySet()) {
            System.out.println(entry.getKey() + ":\t" + entry.getValue());
        }
    }
}
Referring to this example, you can customize your own `Pipeline`: get the data from `ResultItems` and process it however you like.
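As a minimal illustration of such a custom `Pipeline`, here is a self-contained sketch that simply collects every page's results in memory, e.g. for tests or for post-processing after the crawl finishes. A plain `Map` stands in for `ResultItems`, and the class name is only illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of a custom pipeline that keeps all extracted results in memory.
// A plain Map stands in for WebMagic's ResultItems here.
public class CollectSketch {

    static class CollectPipeline {
        private final List<Map<String, Object>> collected = new ArrayList<>();

        // Analogous to Pipeline.process(ResultItems, Task): store each page's fields.
        public void process(Map<String, Object> resultItems) {
            collected.add(resultItems);
        }

        public List<Map<String, Object>> getCollected() {
            return collected;
        }
    }

    public static void main(String[] args) {
        CollectPipeline pipeline = new CollectPipeline();
        pipeline.process(Map.of("name", "webmagic"));
        pipeline.process(Map.of("name", "jobhunter"));
        System.out.println(pipeline.getCollected().size()); // prints 2
    }
}
```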
6.1.3 Persist the result to MySQL
First, let us introduce an example project, jobhunter. It is a WebMagic-based spider integrated with the Spring framework to crawl job information. It also shows how to use MyBatis to persist the data into a MySQL database.
In Java we have many ways to save data to a database, such as JDBC, DbUtils, spring-jdbc and MyBatis. These tools can all do the same thing, but their complexity differs. If we use JDBC, we simply get the data from `ResultItems` and save it.
If we use an ORM framework to persist the data, we face a problem: such frameworks all need a well-defined model rather than a key-value `ResultItems`. Taking MyBatis as an example, we define a DAO with MyBatis-Spring:
public interface JobInfoDAO {

    @Insert("insert into JobInfo (`title`,`salary`,`company`,`description`,`requirement`,`source`,`url`,`urlMd5`) values (#{title},#{salary},#{company},#{description},#{requirement},#{source},#{url},#{urlMd5})")
    public int add(LieTouJobInfo jobInfo);
}
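Before calling such a DAO, a custom `Pipeline` has to turn the key-value results into the typed model. Below is a self-contained sketch of that mapping step, using a plain `Map` in place of `ResultItems` and a trimmed-down `JobInfo` with a few of the columns from the insert statement above; the mapping code is illustrative, not jobhunter's actual implementation:

```java
import java.util.Map;

// Sketch: bridging the key-value results to the typed model a DAO needs.
// JobInfo carries a few of the JobInfo table's columns; a plain Map stands
// in for ResultItems.
public class ResultMappingSketch {

    static class JobInfo {
        String title;
        String salary;
        String company;
        String url;
    }

    // In a real Pipeline this would read from resultItems.get(key).
    static JobInfo fromResultItems(Map<String, Object> items) {
        JobInfo info = new JobInfo();
        info.title = (String) items.get("title");
        info.salary = (String) items.get("salary");
        info.company = (String) items.get("company");
        info.url = (String) items.get("url");
        return info;
    }

    public static void main(String[] args) {
        JobInfo info = fromResultItems(Map.of(
                "title", "Java Engineer",
                "salary", "20k",
                "company", "Acme",
                "url", "https://example.com/job/1"));
        // a real Pipeline would now hand the object to the DAO, e.g. jobInfoDAO.add(info)
        System.out.println(info.title + " @ " + info.company);
    }
}
```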
All we need to do is implement a `Pipeline` that connects `ResultItems` and `LieTouJobInfo`.
Annotation mode
In annotation mode, WebMagic provides a `PageModelPipeline`:
public interface PageModelPipeline<T> {

    // receives the fully extracted and populated object
    public void process(T t, Task task);
}
Now we can define a `JobInfoDaoPipeline` to do the persistence:
@Component("JobInfoDaoPipeline")
public class JobInfoDaoPipeline implements PageModelPipeline<LieTouJobInfo> {

    @Resource
    private JobInfoDAO jobInfoDAO;

    @Override
    public void process(LieTouJobInfo lieTouJobInfo, Task task) {
        // call the MyBatis DAO to save the result
        jobInfoDAO.add(lieTouJobInfo);
    }
}
Basic Pipeline mode
We have finished the work of saving the data! But how do we use the original `Pipeline` interface? It's very easy: if you want to save an object, assemble the data into an object while extracting it from the page.
public void process(Page page) {
    page.addTargetRequests(page.getHtml().links().regex("(https://github\\.com/\\w+/\\w+)").all());
    page.addTargetRequests(page.getHtml().links().regex("(https://github\\.com/\\w+)").all());
    GithubRepo githubRepo = new GithubRepo();
    githubRepo.setAuthor(page.getUrl().regex("https://github\\.com/(\\w+)/.*").toString());
    githubRepo.setName(page.getHtml().xpath("//h1[@class='entry-title public']/strong/a/text()").toString());
    githubRepo.setReadme(page.getHtml().xpath("//div[@id='readme']/tidyText()").toString());
    if (githubRepo.getName() == null) {
        // skip this page
        page.setSkip(true);
    } else {
        page.putField("repo", githubRepo);
    }
}
In the `Pipeline`, you can then use
GithubRepo githubRepo = (GithubRepo) resultItems.get("repo");
to get the object back.
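The cast is needed because `ResultItems` stores its values as `Object`. A tiny self-contained sketch of the round trip, with a plain `Map` standing in for `ResultItems`:

```java
import java.util.HashMap;
import java.util.Map;

// Why the cast works: the result store holds values as Object, so a
// pipeline retrieving "repo" must cast back to the concrete type.
// A plain Map stands in for WebMagic's ResultItems here.
public class CastSketch {

    static class GithubRepo {
        String name;
    }

    public static void main(String[] args) {
        // page.putField("repo", githubRepo) ends up in a key-value store:
        Map<String, Object> resultItems = new HashMap<>();
        GithubRepo repo = new GithubRepo();
        repo.name = "webmagic";
        resultItems.put("repo", repo);

        // in the Pipeline, the value comes back as Object and must be cast:
        GithubRepo restored = (GithubRepo) resultItems.get("repo");
        System.out.println(restored.name); // prints webmagic
    }
}
```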
`PageModelPipeline` is also implemented on top of the original `Pipeline` interface, working together with the `PageProcessor`: it uses the class name as the key and the extracted object as the value. For details, see `ModelPipeline`.
6.1.4 Pipelines already provided by WebMagic
WebMagic can write results to the console, save data to a file, or save it to a file in JSON format:
Class | Description | Remark |
---|---|---|
ConsolePipeline | Write the result to the console | The extracted result should implement toString() |
FilePipeline | Save the result to a file | The extracted result should implement toString() |
JsonFilePipeline | Save the result to a file in JSON format | |
ConsolePageModelPipeline | (Annotation mode) Write the result to the console | |
FilePageModelPipeline | (Annotation mode) Save the result to a file | |
JsonFilePageModelPipeline | (Annotation mode) Save the result to a file in JSON format | Fields to be saved must have getter methods |