4.5 Introduction to extraction tools

WebMagic's extraction mainly relies on Jsoup and on Xsoup, a tool I developed myself.

4.5.1 Jsoup

Jsoup is a simple HTML parser that also supports finding elements with CSS selectors. In order to develop WebMagic, I analyzed the Jsoup source code in detail; for the individual write-ups, see the Jsoup study notes.
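As a minimal sketch of CSS-selector lookup with Jsoup, the following parses an HTML string and collects link targets. It assumes the Jsoup library is on the classpath; the class name `JsoupSelectorSketch`, the helper `contentLinks`, and the sample HTML are made up for illustration and are not part of WebMagic.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupSelectorSketch {

    // Hypothetical helper: returns the href of every <a> element that sits
    // inside a <div class="content">.
    static List<String> contentLinks(String html) {
        Document doc = Jsoup.parse(html);
        List<String> links = new ArrayList<>();
        // "div.content a[href]" is a CSS selector: <a> elements carrying an
        // href attribute, nested anywhere under a div with class "content".
        for (Element a : doc.select("div.content a[href]")) {
            links.add(a.attr("href"));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<html><body>"
                + "<div class='content'><a href='https://github.com'>GitHub</a></div>"
                + "<div class='footer'><a href='https://example.com'>Other</a></div>"
                + "</body></html>";
        System.out.println(contentLinks(html)); // prints [https://github.com]
    }
}
```

Only the link inside the `content` div is returned, because the selector restricts the match to that subtree.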

4.5.2 Xsoup

Xsoup is an XPath parser that I developed on top of Jsoup.

WebMagic previously used HtmlCleaner as its XPath parser, but it caused some problems in use. The main ones were that the positions it reported for XPath errors were inaccurate, and that its code structure was poorly organized and hard to customize. I eventually implemented Xsoup myself so that it fits the needs of crawler development better. Gratifyingly, tests show that Xsoup is more than twice as fast as HtmlCleaner.

Xsoup has now been developed far enough to support the XPath syntax that crawlers commonly use. The following table lists part of the syntax supported so far:

Name	Expression	Supported
nodename	nodename	yes
direct child	/	yes
descendant	//	yes
attribute predicate	[@key=value]	yes
nth child	tag[n]	yes
attribute value	/@key	yes
wildcard in tagname	/*	yes
wildcard in attribute	/[@*]	yes
function	function()	partial
or	a | b	yes since 0.2.0
current or parent node in path	. or ..	no
predicates	price>35	no
predicates logic	@class=a or @class=b	yes since 0.2.0
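A few of the expressions in the table can be tried out with a short sketch like the one below. It assumes the Xsoup and Jsoup libraries are on the classpath and uses Xsoup's `Xsoup.compile(...).evaluate(...)` entry point; the class name `XsoupSketch`, the helper methods, and the sample HTML are made up for illustration.

```java
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import us.codecraft.xsoup.Xsoup;

public class XsoupSketch {

    static final String HTML =
            "<html><div><a href='https://github.com'>github.com</a></div>"
            + "<table><tr><td>a</td><td>b</td></tr></table></html>";

    // Hypothetical helper: "//a/@href" combines the descendant axis (//)
    // with attribute extraction (/@key); get() returns the first match.
    static String firstLink(Document doc) {
        return Xsoup.compile("//a/@href").evaluate(doc).get();
    }

    // Hypothetical helper: "//tr/td/text()" walks direct children (/) and
    // applies the text() function; list() returns every match.
    static List<String> cellTexts(Document doc) {
        return Xsoup.compile("//tr/td/text()").evaluate(doc).list();
    }

    public static void main(String[] args) {
        // Xsoup evaluates XPath against a document parsed by Jsoup.
        Document doc = Jsoup.parse(HTML);
        System.out.println(firstLink(doc)); // prints https://github.com
        System.out.println(cellTexts(doc)); // prints [a, b]
    }
}
```

Compiling an expression once and evaluating it against many documents is the natural fit for a crawler, where the same XPath is applied to every downloaded page.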

I have also defined several XPath functions that are very convenient for crawlers. Note, however, that these functions are not part of the XPath standard.

Expression	Description	XPath 1.0
text(n)	text of the n-th direct child text node; 0 selects all	text() only
allText()	text of all direct and indirect children	not supported
tidyText()	text of all direct and indirect children, with some tags replaced by line breaks so that the plain text reads more cleanly	not supported
html()	inner HTML, excluding the element's own tag	not supported
outerHtml()	outer HTML, including the element's own tag	not supported
regex(@attr,expr,group)	@attr and group are both optional; group defaults to 0	not supported

4.5.3 Saxon

Saxon is a powerful XPath parser that supports XPath 2.0 syntax. webmagic-saxon is an experimental integration of Saxon, but so far it appears that few crawler developers use the advanced syntax of XPath 2.0.
