5.7 AfterExtractor

Sometimes annotation mode cannot cover every need, and we have to write some code ourselves. In that case, implement the AfterExtractor interface.

public interface AfterExtractor {

    public void afterProcess(Page page);
}

The afterProcess method is called at the end of extraction, once all fields have been initialized, so it is the place to handle special logic. For example, the following uses JFinal ActiveRecord to persist blog posts crawled by WebMagic:

// TargetUrl means that only URLs matching the following pattern are extracted into the model object
// The regular expression syntax is slightly modified here: '.' does not need to be escaped by default, and '*' is automatically replaced with '.*', because URLs written this way are easier to read
// Extends JFinal's Model
// Implementing the AfterExtractor interface lets you run other operations after the properties are filled
@TargetUrl("http://my.oschina.net/flashsword/blog/*")
public class OschinaBlog extends Model<OschinaBlog> implements AfterExtractor {

    // Fields annotated with ExtractBy are extracted and filled automatically
    // The default syntax is XPath
    @ExtractBy("//title")
    private String title;

    // ExtractBy can also use other syntaxes such as Css or Regex
    @ExtractBy(value = "div.BlogContent", type = ExtractBy.Type.Css)
    private String content;

    // With multi = true, the extracted results can be a List
    @ExtractBy(value = "//div[@class='BlogTags']/a/text()", multi = true)
    private List<String> tags;

    @Override
    public void afterProcess(Page page) {
        // A JFinal attribute is actually a Map entry rather than a field; that is fine, just fill it in
        this.set("title", title);
        this.set("content", content);
        this.set("tags", StringUtils.join(tags, ","));
        // Persist the record
        save();
    }

    public static void main(String[] args) {
        C3p0Plugin c3p0Plugin = new C3p0Plugin("jdbc:mysql://", "blog", "password");
        c3p0Plugin.start();
        ActiveRecordPlugin activeRecordPlugin = new ActiveRecordPlugin(c3p0Plugin);
        activeRecordPlugin.addMapping("blog", OschinaBlog.class);
        activeRecordPlugin.start();
        // Start WebMagic
        OOSpider.create(Site.me().addStartUrl("http://my.oschina.net/flashsword/blog/145796"), OschinaBlog.class).run();
    }
}
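
The call order described above (fill the fields first, then invoke afterProcess) can be sketched without WebMagic or JFinal on the classpath. This is only an illustration under stated assumptions: `Page`, `Blog`, `finish` and `demo` are hypothetical stand-ins written for this example, and the `attrs` map merely imitates a Map-backed record such as a JFinal Model. It is not how WebMagic itself is implemented.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class AfterExtractorSketch {

    // Hypothetical stand-in for WebMagic's Page; it only carries the URL here.
    static final class Page {
        final String url;
        Page(String url) { this.url = url; }
    }

    // Same shape as the AfterExtractor interface shown earlier.
    interface AfterExtractor {
        void afterProcess(Page page);
    }

    // A model whose fields are assumed to be filled before the hook runs.
    static final class Blog implements AfterExtractor {
        String title;
        List<String> tags;
        // Imitates a Map-backed record (assumption, not JFinal's real Model).
        final Map<String, Object> attrs = new LinkedHashMap<>();

        @Override
        public void afterProcess(Page page) {
            // All fields are already populated at this point.
            attrs.put("title", title);
            attrs.put("tags", String.join(",", tags));
            attrs.put("url", page.url);
        }
    }

    // Hypothetical final step of extraction: fire the hook only if the model opts in.
    static <T> T finish(T model, Page page) {
        if (model instanceof AfterExtractor) {
            ((AfterExtractor) model).afterProcess(page);
        }
        return model;
    }

    static Map<String, Object> demo() {
        Blog blog = new Blog();
        blog.title = "Hello";                          // pretend the framework extracted this
        blog.tags = Arrays.asList("java", "crawler");  // and this
        return finish(blog, new Page("http://example.com/blog/1")).attrs;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Models that do not implement the hook pass through `finish` unchanged, which is why the interface can stay optional.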


That concludes the introduction to annotation mode. In WebMagic, annotation mode is implemented entirely as an extension built on webmagic-core's PageProcessor and Pipeline; if you are interested, take a look at the code.
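
To give a rough idea of how an annotation layer can sit on top of a processor and a pipeline, here is a minimal sketch. Everything in it is an assumption made for illustration: the `ExtractByRegex` annotation, the simplified `PageProcessor` and `Pipeline` interfaces, and `ModelProcessor` are hypothetical and much smaller than WebMagic's real classes, which also support XPath and CSS selectors.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnnotationModeSketch {

    // Hypothetical annotation: the value is a regex whose first group is the field value.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface ExtractByRegex {
        String value();
    }

    // Simplified, hypothetical contracts: a processor turns HTML into a model,
    // a pipeline consumes the model.
    interface PageProcessor { Object process(String html); }
    interface Pipeline { void process(Object model); }

    static final class Blog {
        @ExtractByRegex("<title>(.*?)</title>")
        String title;
    }

    // A processor that builds the model by reading the annotations on its fields.
    static final class ModelProcessor implements PageProcessor {
        private final Class<?> modelClass;

        ModelProcessor(Class<?> modelClass) { this.modelClass = modelClass; }

        @Override
        public Object process(String html) {
            try {
                Object model = modelClass.getDeclaredConstructor().newInstance();
                for (Field field : modelClass.getDeclaredFields()) {
                    ExtractByRegex rule = field.getAnnotation(ExtractByRegex.class);
                    if (rule == null) {
                        continue; // not an extracted field
                    }
                    Matcher m = Pattern.compile(rule.value()).matcher(html);
                    if (m.find()) {
                        field.setAccessible(true);
                        field.set(model, m.group(1));
                    }
                }
                return model;
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Cannot build " + modelClass, e);
            }
        }
    }

    static List<Object> demo() {
        List<Object> stored = new ArrayList<>();
        Pipeline pipeline = stored::add; // a pipeline that just collects models
        PageProcessor processor = new ModelProcessor(Blog.class);
        pipeline.process(processor.process("<html><title>Hello WebMagic</title></html>"));
        return stored;
    }

    public static void main(String[] args) {
        Blog blog = (Blog) demo().get(0);
        System.out.println(blog.title);
    }
}
```

The point of the sketch is the layering: the annotated class carries only extraction rules, while the generic processor and pipeline do the work, so the core crawler does not need to know about any particular model.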

The implementation of this part is still fairly complex. If you find problems in the details of the code, feedback is welcome.
