4.5 Introduction to extraction tools
WebMagic's extraction relies mainly on Jsoup and on Xsoup, a tool I developed myself.
4.5.1 Jsoup
Jsoup is a simple HTML parser that supports finding elements with CSS selectors. To develop WebMagic, I analyzed the Jsoup source code in detail; see the Jsoup study notes for the specific articles.
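As a minimal sketch of what CSS-selector lookup with Jsoup looks like, the example below parses an HTML string and collects link targets. The class name `JsoupSelectDemo`, the method `navLinks`, and the sample HTML are made up for illustration; only `Jsoup.parse`, `select`, and `attr` come from Jsoup itself, and the Jsoup library is assumed to be on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

public class JsoupSelectDemo {

    // Hypothetical helper: returns the href of every <a> that has an href
    // attribute and sits inside a <div class="nav">.
    static List<String> navLinks(String html) {
        Document doc = Jsoup.parse(html);
        // CSS selector: descendant <a> elements with an href, under div.nav
        Elements links = doc.select("div.nav a[href]");
        List<String> hrefs = new ArrayList<>();
        for (Element link : links) {
            hrefs.add(link.attr("href"));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        // Illustrative input only; not taken from a real page.
        String html = "<div class='nav'><a href='/docs'>Docs</a><a href='/blog'>Blog</a></div>"
                + "<div class='footer'><a href='/legal'>Legal</a></div>";
        System.out.println(navLinks(html)); // prints [/docs, /blog]
    }
}
```

The link in the footer is skipped because the selector is scoped to `div.nav`.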
4.5.2 Xsoup
Xsoup is an XPath parser that I developed on top of Jsoup.
WebMagic previously used HtmlCleaner as its XPath parser, which caused some problems in practice. The main ones were that the location of XPath errors was reported inaccurately and that the code was poorly structured, which made it hard to customize. I eventually implemented Xsoup to better fit the needs of crawler development. Encouragingly, tests show that Xsoup is more than twice as fast as HtmlCleaner.
Xsoup now supports the XPath syntax commonly used in crawlers. The table below lists the syntax supported so far:
Name | Expression | Support |
---|---|---|
nodename | nodename | yes |
direct child | / | yes |
descendant at any depth | // | yes |
attribute filter | [@key=value] | yes |
nth child | tag[n] | yes |
attribute value | /@key | yes |
wildcard in tagname | /* | yes |
wildcard in attribute | /[@*] | yes |
function | function() | partial |
or | a \| b | yes since 0.2.0 |
parent in path | . or .. | no |
predicates | price>35 | no |
predicates logic | @class=a or @class=b | yes since 0.2.0 |
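To make a few rows of the table concrete, here is a short sketch that evaluates `//`, `/@key`, and `text()` expressions against a Jsoup document. It assumes the Xsoup library (package `us.codecraft.xsoup`) and Jsoup are on the classpath; the class name `XsoupSelectDemo`, its helper methods, and the sample HTML are illustrative only.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import us.codecraft.xsoup.Xsoup;

import java.util.List;

public class XsoupSelectDemo {

    // Illustrative input: one link and a two-cell table row.
    static final String HTML =
            "<html><div><a href='https://github.com'>github.com</a></div>"
            + "<table><tr><td>a</td><td>b</td></tr></table></html>";

    // "//a" matches <a> at any depth; "/@href" extracts the attribute value.
    static String firstHref(String html) {
        Document document = Jsoup.parse(html);
        return Xsoup.compile("//a/@href").evaluate(document).get();
    }

    // "//tr/td" walks direct children; "text()" returns each cell's text.
    static List<String> cellTexts(String html) {
        Document document = Jsoup.parse(html);
        return Xsoup.compile("//tr/td/text()").evaluate(document).list();
    }

    public static void main(String[] args) {
        System.out.println(firstHref(HTML));
        System.out.println(cellTexts(HTML));
    }
}
```

`get()` returns the first match as a string, while `list()` returns all matches, which is usually what a crawler wants when a page repeats the same structure.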
I have also defined several XPath functions that are convenient for crawlers. Note, however, that these functions are not part of the XPath standard.
Expression | Description | XPath1.0 |
---|---|---|
text(n) | text of the n-th direct child text node; 0 means all | text() only |
allText() | all text, including that of direct and indirect children | not supported |
tidyText() | all text of direct and indirect children, with some tags replaced by line breaks so the plain text displays more cleanly | not supported |
html() | inner HTML, not including the element's own tag | not supported |
outerHtml() | outer HTML, including the element's own tag | not supported |
regex(@attr,expr,group) | regular expression extraction; @attr and group are both optional, and group defaults to 0 | not supported |
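The difference between `text()` and `allText()` is easiest to see on nested markup. The sketch below is a rough illustration under the same assumptions as before (Xsoup and Jsoup on the classpath); the class `XsoupFunctionsDemo`, its `eval` helper, and the sample HTML are hypothetical, and exact whitespace in the results may vary between versions, so no output is shown.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import us.codecraft.xsoup.Xsoup;

public class XsoupFunctionsDemo {

    // Illustrative input: a div with its own text plus a nested link.
    static final String HTML =
            "<html><body><div id='test'>aaa"
            + "<div><a href='https://github.com'>github.com</a></div>"
            + "</div></body></html>";

    // Hypothetical helper: evaluates one XPath expression and returns the first match.
    static String eval(String xpath) {
        Document document = Jsoup.parse(HTML);
        return Xsoup.compile(xpath).evaluate(document).get();
    }

    public static void main(String[] args) {
        // text(): only the text that belongs directly to the matched div.
        System.out.println(eval("//div[@id='test']/text()"));
        // allText(): text of the matched div and of all its descendants.
        System.out.println(eval("//div[@id='test']/allText()"));
    }
}
```

Here `text()` should yield only the outer div's own text, while `allText()` should also include the link text from the nested element.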
4.5.3 Saxon
Saxon is a powerful XPath parser that supports XPath 2.0 syntax. webmagic-saxon is an experimental integration of Saxon, but so far it seems that few crawler developers use the advanced syntax of XPath 2.0.