Flexible Java Web Crawler Framework

WebMagic is a scalable crawler framework that covers the whole crawler lifecycle: downloading, URL management, content extraction, and persistence.

Spider.create(new GithubRepoPageProcessor())
    .addUrl("https://github.com/code4craft")
    .addPipeline(new ConsolePipeline())
    .run();

Core Features

🚀

Simple & Easy

A simple core with high flexibility. Build powerful crawlers with just a few lines of code.

🔧

Highly Flexible

Modular design for easy extension. Supports custom downloaders, processors, and pipelines.

⚡

Multi-thread Support

Built-in multi-threading support for better performance and distributed deployment

🎯

Annotation Driven

Annotation-based POJO crawler configuration for cleaner and more elegant code

🔍

Powerful Extraction

Supports XPath, CSS selectors, regular expressions, and more for content extraction.

🔌

Easy Integration

Supports multiple data persistence solutions.

Quick Start

1. Add Maven Dependency

<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-core</artifactId>
    <version>1.0.3</version>
</dependency>
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-extension</artifactId>
    <version>1.0.3</version>
</dependency>

2. Write PageProcessor

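import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;

public class GithubRepoPageProcessor implements PageProcessor {
    // Crawl settings: retry failed downloads up to 3 times, sleep 100 ms between requests
    private Site site = Site.me().setRetryTimes(3).setSleepTime(100);

    @Override
    public void process(Page page) {
        // Queue every repository link found on the page for later crawling
        page.addTargetRequests(page.getHtml().links()
            .regex("(https://github\\.com/\\w+/\\w+)").all());
        // Extract the repository owner from the page URL
        page.putField("author", page.getUrl()
            .regex("https://github\\.com/(\\w+)/.*").toString());
        // Extract the repository name from the page heading
        page.putField("name", page.getHtml()
            .xpath("//h1[@class='public']/strong/a/text()").toString());
    }

    @Override
    public Site getSite() {
        return site;
    }
}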
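
The two regular expressions in the processor above can be tried out without starting a crawl. The following is a minimal plain-Java sketch that uses only `java.util.regex`, not WebMagic itself. It assumes the selector returns the first capture group of the first match, and the names `GithubRegexSketch`, `REPO_LINK`, `AUTHOR`, and `firstGroup` are hypothetical helpers introduced for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GithubRegexSketch {
    // Same pattern the processor uses to pick repository links out of a page
    static final Pattern REPO_LINK =
        Pattern.compile("(https://github\\.com/\\w+/\\w+)");
    // Same pattern the processor uses to pull the owner out of the page URL
    static final Pattern AUTHOR =
        Pattern.compile("https://github\\.com/(\\w+)/.*");

    // Returns capture group 1 of the first match, or null when nothing matches
    // (assumed here to mirror what the selector's toString() would yield)
    static String firstGroup(Pattern pattern, String input) {
        Matcher matcher = pattern.matcher(input);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String repoUrl = "https://github.com/code4craft/webmagic";
        String profileUrl = "https://github.com/code4craft";

        System.out.println(firstGroup(AUTHOR, repoUrl));      // code4craft
        System.out.println(firstGroup(AUTHOR, profileUrl));   // null
        System.out.println(firstGroup(REPO_LINK, repoUrl + "/issues"));
    }
}
```

Two things stand out when running this. The profile URL used as the start URL has no second path segment, so the author pattern does not match it and that field would be empty for the seed page. Also, `\w` does not match a hyphen, so a repository name containing one would be captured only up to the hyphen; a character class such as `[\w-]+` is one possible adjustment if that matters for your targets.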

3. Start Spider

Spider.create(new GithubRepoPageProcessor())
    .addUrl("https://github.com/code4craft")
    .addPipeline(new ConsolePipeline())
    .thread(5)
    .run();
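
To give a rough feel for what running a crawl with several threads involves, here is a minimal plain-Java sketch of the general idea: a fixed pool of workers processes a frontier of URLs while a shared "seen" set prevents revisiting pages. This is not WebMagic's implementation and does not use its API. `MiniCrawlSketch`, `crawl`, `linksOf`, and the in-memory link graph are hypothetical names for illustration, and no network access takes place.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class MiniCrawlSketch {

    // Hypothetical stand-in for "download the page and extract its links"
    static List<String> linksOf(Map<String, List<String>> graph, String url) {
        return graph.getOrDefault(url, List.of());
    }

    // Visits everything reachable from the seed, one frontier at a time
    static Set<String> crawl(Map<String, List<String>> graph, String seed, int threads)
            throws Exception {
        Set<String> seen = new HashSet<>();
        seen.add(seed);
        List<String> frontier = List.of(seed);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            while (!frontier.isEmpty()) {
                // One task per URL in the current frontier, run by the worker pool
                List<Callable<List<String>>> tasks = new ArrayList<>();
                for (String url : frontier) {
                    tasks.add(() -> linksOf(graph, url));
                }
                // Deduplication happens on the coordinating thread only
                List<String> next = new ArrayList<>();
                for (Future<List<String>> result : pool.invokeAll(tasks)) {
                    for (String link : result.get()) {
                        if (seen.add(link)) {
                            next.add(link);
                        }
                    }
                }
                frontier = next;
            }
        } finally {
            pool.shutdown();
        }
        return new TreeSet<>(seen); // sorted for stable output
    }

    public static void main(String[] args) throws Exception {
        Map<String, List<String>> graph = Map.of(
            "/a", List.of("/b", "/c"),
            "/b", List.of("/a", "/c"),
            "/c", List.of("/d"));
        System.out.println(crawl(graph, "/a", 5)); // [/a, /b, /c, /d]
    }
}
```

The sketch processes the frontier in rounds to keep termination simple. A real crawler would typically feed workers from a continuously updated queue and add politeness delays, which is the kind of work the framework takes care of for you.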

Documentation & Resources

📚 Chinese Docs

Complete Chinese documentation and API reference


📖 English Docs

Complete English documentation and guides


🔍 JavaDoc

Detailed API reference documentation


💬 Community Support

GitHub Issues and community discussions
