Basic combination of list pages and detail pages

We begin with a simple example. In this example there is a set of list pages, presented in paginated form; by traversing the pagination we can find every target page.

1 Example introduction

We use the author's Sina blog, http://blog.sina.com.cn/flashsword20, as the example. The ultimate targets are the blog's article pages, from which we crawl the title, content, date, and other information. We also need to crawl the article links and other information from the list pages in order to reach every article of the blog.

List page

The list page URL format is "http://blog.sina.com.cn/s/articlelist_1487828712_0_1.html", where the final "1" in "0_1" is the variable page number.

Article page

The article page URL format is "http://blog.sina.com.cn/s/blog_58ae76e80100g8au.html", where "58ae76e80100g8au" is the variable part.

2 Discovering article URLs

In this crawl, the article URLs are what we ultimately care about, so the crawler's first step is to find the addresses of all articles in the blog.

We can use the regular expression http://blog\\.sina\\.com\\.cn/s/blog_\\w+\\.html (written here as a Java string literal) as a coarse URL filter. The complication is that this pattern is too broad and may also match articles from other blogs, so we must restrict URL extraction to a specific region of the list page.

Here we use the xpath //div[@class="articleList"] to select that region, then use links() or the xpath //a/@href to get all links inside it, and finally apply the regular expression http://blog\\.sina\\.com\\.cn/s/blog_\\w+\\.html to filter out links such as "edit" or "more". Thus we can write:

page.addTargetRequests(page.getHtml().xpath("//div[@class=\"articleList\"]").links().regex("http://blog\\.sina\\.com\\.cn/s/blog_\\w+\\.html").all());

At the same time, we need to add the URLs of all the list pages we find to the queue of URLs to be downloaded:

page.addTargetRequests(page.getHtml().links().regex("http://blog\\.sina\\.com\\.cn/s/articlelist_1487828712_0_\\d+\\.html").all());
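
To see what the two regular expressions keep and what they drop, independent of WebMagic, here is a minimal sketch that uses only java.util.regex. The class name, the helper keepMatching, and the third sample link are hypothetical and exist only for illustration. The sketch also requires the whole link to match, which is an assumption of the sketch and not necessarily how WebMagic's regex() applies the pattern.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class LinkFilterSketch {
    // The two patterns from the text, as Java string literals ("\\." is the regex "\.").
    static final Pattern POST =
            Pattern.compile("http://blog\\.sina\\.com\\.cn/s/blog_\\w+\\.html");
    static final Pattern LIST =
            Pattern.compile("http://blog\\.sina\\.com\\.cn/s/articlelist_1487828712_0_\\d+\\.html");

    // Hypothetical helper: keep only the links that match the pattern in full.
    static List<String> keepMatching(List<String> links, Pattern pattern) {
        List<String> kept = new ArrayList<>();
        for (String link : links) {
            if (pattern.matcher(link).matches()) {
                kept.add(link);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> links = Arrays.asList(
                "http://blog.sina.com.cn/s/blog_58ae76e80100g8au.html",      // article page
                "http://blog.sina.com.cn/s/articlelist_1487828712_0_2.html", // list page
                "http://blog.sina.com.cn/flashsword20");                     // neither (sample)
        System.out.println("articles: " + keepMatching(links, POST));
        System.out.println("lists:    " + keepMatching(links, LIST));
    }
}
```

Each pattern keeps exactly one of the three sample links; the blog home page matches neither, so it would not be queued by either call.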

3 Content Extraction

Extracting information from the article page is comparatively simple: write the corresponding xpath expressions and extract the fields.

page.putField("title", page.getHtml().xpath("//div[@class='articalTitle']/h2"));
page.putField("content", page.getHtml().xpath("//div[@id='articlebody']//div[@class='articalContent']"));
page.putField("date",
        page.getHtml().xpath("//div[@id='articlebody']//span[@class='time SG_txtc']").regex("\\((.*)\\)"));
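
The date field chains a regular expression after the xpath, so that only the text inside the parentheses is kept. The sketch below reproduces that step with plain java.util.regex. The sample span markup and the helper name textInParentheses are assumptions made for illustration, and the sketch takes the first capture group explicitly rather than relying on whatever group WebMagic's regex() returns by default.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DateExtractSketch {
    // Same pattern as in the putField("date", ...) call: capture what is inside "(...)".
    static final Pattern IN_PARENS = Pattern.compile("\\((.*)\\)");

    // Hypothetical helper: returns the text between the parentheses, or null if there is none.
    static String textInParentheses(String text) {
        Matcher m = IN_PARENS.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Assumed shape of the element selected by the xpath; the date is made up.
        String span = "<span class=\"time SG_txtc\">(2013-07-21 10:23:45)</span>";
        System.out.println(textInParentheses(span)); // prints 2013-07-21 10:23:45
    }
}
```

Because ".*" is greedy, the capture runs from the first "(" to the last ")" on the line, which is harmless as long as the element contains a single parenthesized date.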

4 Distinguishing list pages from target pages

We have now defined how to process list pages and target pages; what remains is to tell them apart when processing. In this example that is easy: list pages and target pages have different URL formats, so we can distinguish them directly by URL.

// List page
if (page.getUrl().regex(URL_LIST).match()) {
    page.addTargetRequests(page.getHtml().xpath("//div[@class=\"articleList\"]").links().regex(URL_POST).all());
    page.addTargetRequests(page.getHtml().links().regex(URL_LIST).all());
} else {
    // Article page
    page.putField("title", page.getHtml().xpath("//div[@class='articalTitle']/h2"));
    page.putField("content", page.getHtml().xpath("//div[@id='articlebody']//div[@class='articalContent']"));
    page.putField("date",
            page.getHtml().xpath("//div[@id='articlebody']//span[@class='time SG_txtc']").regex("\\((.*)\\)"));
}
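
The snippet above refers to URL_LIST and URL_POST without showing their definitions. A reasonable assumption is that they hold the two regular expressions from section 2. The following sketch, which uses only java.util.regex, defines them that way and mirrors the same if-else decision; the class name, the enum, and the classify helper are hypothetical.

```java
import java.util.regex.Pattern;

class PageTypeSketch {
    // Assumed definitions: the two regular expressions introduced in section 2.
    static final String URL_LIST =
            "http://blog\\.sina\\.com\\.cn/s/articlelist_1487828712_0_\\d+\\.html";
    static final String URL_POST =
            "http://blog\\.sina\\.com\\.cn/s/blog_\\w+\\.html";

    enum PageType { LIST, POST }

    // Mirrors the if-else above: a URL matching URL_LIST is a list page,
    // and everything else is treated as an article page.
    static PageType classify(String url) {
        return Pattern.matches(URL_LIST, url) ? PageType.LIST : PageType.POST;
    }

    public static void main(String[] args) {
        System.out.println(classify("http://blog.sina.com.cn/s/articlelist_1487828712_0_1.html"));
        System.out.println(classify("http://blog.sina.com.cn/s/blog_58ae76e80100g8au.html"));
    }
}
```

Treating every non-list URL as an article page is safe here only because the link filters queue nothing except list pages and article pages.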

For the complete code of this example, see SinaBlogProcessor.java.

5 Summary

In this example, we used two main techniques:

Discovering links from a specified region of the page, and filtering them with a regular expression.
Handling two kinds of pages in one PageProcessor, using the page URL to decide how each should be processed.

Some users have reported that distinguishing pages with if-else is inconvenient (#issue83). WebMagic plans to add SubPageProcessor in a future version, 0.5.0, to address this.
