4.5 Introduction to the extraction tools

WebMagic's extraction relies mainly on Jsoup and on Xsoup, a tool I developed myself.

4.5.1 Jsoup

Jsoup is a simple HTML parser that supports finding elements with CSS selectors. While developing WebMagic, I analyzed the Jsoup source code in detail; for the write-ups, see my Jsoup study notes.
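As a rough illustration of CSS-selector lookup with Jsoup, here is a minimal sketch. The HTML snippet, the class name `post`, and the helper `extractLinks` are made up for this example and are not part of WebMagic; only `Jsoup.parse`, `select`, and `attr` are actual Jsoup calls.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupSelectorSketch {
    // Sample markup invented for this illustration.
    static final String HTML =
            "<html><body><div class=\"post\">"
            + "<a href=\"https://example.com/a\">first</a>"
            + "<a href=\"https://example.com/b\">second</a>"
            + "</div></body></html>";

    // Hypothetical helper: collect the href of every link inside div.post.
    static List<String> extractLinks(String html) {
        Document doc = Jsoup.parse(html);
        // CSS selector: <a> elements with an href attribute under div.post
        Elements links = doc.select("div.post a[href]");
        List<String> hrefs = new ArrayList<>();
        for (Element link : links) {
            hrefs.add(link.attr("href"));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        System.out.println(extractLinks(HTML));
    }
}
```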

4.5.2 Xsoup

Xsoup is an XPath parser that I developed on top of Jsoup.

WebMagic previously used HtmlCleaner as its XPath parser, but this caused several problems in practice. The main ones were that XPath errors could not be located accurately, and that the code was poorly structured and therefore hard to customize. I eventually wrote Xsoup to better fit the needs of crawler development. Encouragingly, tests show that Xsoup is more than twice as fast as HtmlCleaner.

Xsoup now supports the XPath syntax most commonly used in crawlers. The following table lists the syntax supported so far:

Name                   Expression             Support
nodename               nodename               yes
immediate parent       /                      yes
parent                 //                     yes
attribute              [@key=value]           yes
nth child              tag[n]                 yes
attribute              /@key                  yes
wildcard in tagname    /*                     yes
wildcard in attribute  /[@*]                  yes
function               function()             partial
or                     a | b                  yes, since 0.2.0
parent in path         . or ..                no
predicates             price>35               no
predicates logic       @class=a or @class=b   yes, since 0.2.0
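A few of the expressions in the table can be tried out with a short sketch like the one below. It assumes the `us.codecraft.xsoup.Xsoup` entry point with `compile(...).evaluate(...)` and the `get()` and `list()` accessors; the sample HTML and the helper names `firstHref` and `cellTexts` are invented for illustration.

```java
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import us.codecraft.xsoup.Xsoup;

public class XsoupSyntaxSketch {
    // Sample markup invented for this illustration.
    static final String HTML =
            "<html><div><a href='https://github.com'>github.com</a></div>"
            + "<table><tr><td>a</td><td>b</td></tr></table></html>";

    // "//" step followed by attribute selection ("/@key" in the table).
    static String firstHref(Document doc) {
        return Xsoup.compile("//a/@href").evaluate(doc).get();
    }

    // "/" steps followed by the text() function; list() returns every match.
    static List<String> cellTexts(Document doc) {
        return Xsoup.compile("//tr/td/text()").evaluate(doc).list();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        System.out.println(firstHref(doc));
        System.out.println(cellTexts(doc));
    }
}
```

Here `get()` returns only the first match as a string, while `list()` returns all matches, which is usually what a crawler wants when a page repeats the same structure.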

I have also defined several XPath functions that are particularly convenient for crawlers. Note, however, that these functions are not part of the XPath standard.

Expression               Description                                                                          XPath 1.0
text(n)                  text of the n-th direct child text node; 0 returns all of them                       text() only
allText()                all text, from both direct and indirect children                                     not supported
tidyText()               all text from direct and indirect children, with some tags replaced by line breaks   not supported
                         so that the plain text reads more cleanly
html()                   inner HTML, excluding the element's own tag                                          not supported
outerHtml()              HTML including the element's own tag                                                 not supported
regex(@attr,expr,group)  regular-expression extraction; @attr and group are optional, group defaults to 0     not supported
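The sketch below tries a few of these non-standard functions. It assumes the same `Xsoup.compile(...).evaluate(...).get()` API as Xsoup's published examples; the sample HTML and the helper `select` are hypothetical. The exact whitespace of the results depends on Jsoup's output settings, so no expected output is shown. `tidyText()` and `regex(...)` are left out.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import us.codecraft.xsoup.Xsoup;

public class XsoupFunctionsSketch {
    // Sample markup invented for this illustration.
    static final String HTML =
            "<html><body>aaa<div>"
            + "<a href='https://github.com'>github.com</a>"
            + "</div></body></html>";

    // Hypothetical helper: evaluate one XPath and return the first match.
    static String select(String xpath) {
        Document doc = Jsoup.parse(HTML);
        return Xsoup.compile(xpath).evaluate(doc).get();
    }

    public static void main(String[] args) {
        // Direct text of the <a> element.
        System.out.println(select("//a/text()"));
        // All text under <body>, including text of nested elements.
        System.out.println(select("//body/allText()"));
        // Inner HTML of <a>, without the <a> tag itself.
        System.out.println(select("//a/html()"));
        // HTML of <a>, including the <a> tag itself.
        System.out.println(select("//a/outerHtml()"));
    }
}
```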

4.5.3 Saxon

Saxon is a powerful XPath parser that supports XPath 2.0 syntax. webmagic-saxon is an experimental integration of Saxon, but so far it appears that few crawler developers use the advanced syntax of XPath 2.0.
