ACHE
Status: active
ACHE is a focused web crawler. It collects web pages that satisfy specific criteria, e.g., pages that belong to a given domain or that contain a user-specified pattern, using page classifiers and intelligent link prioritization.
## How It Differs
ACHE differs from generic crawlers in that it uses page classifiers to distinguish between relevant and irrelevant pages in a given domain. A page classifier can be:
- A simple regular expression
- A machine-learning-based classification model
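
To make the simpler of the two classifier types concrete, here is a minimal Python sketch of a regex-based page classifier. It is an illustration of the idea only, not ACHE's own code or configuration format: the class `RegexPageClassifier`, its `is_relevant` method, the example pattern, and the sample pages are all hypothetical names chosen for this sketch.

```python
import re
from dataclasses import dataclass


@dataclass
class RegexPageClassifier:
    """Hypothetical classifier: a page is relevant if the pattern matches its text."""

    pattern: str

    def __post_init__(self):
        # Compile once so the same pattern can be reused across many pages.
        self._regex = re.compile(self.pattern, re.IGNORECASE)

    def is_relevant(self, page_text: str) -> bool:
        # search() matches anywhere in the text, not only at the start.
        return self._regex.search(page_text) is not None


# Sample data (made up for illustration).
pages = {
    "http://example.com/a": "Latest news about the Ebola outbreak",
    "http://example.com/b": "Recipes for a quick weeknight dinner",
    "http://example.com/c": "WHO publishes new ebola guidance",
}

classifier = RegexPageClassifier(r"\bebola\b")
relevant = [url for url, text in pages.items() if classifier.is_relevant(text)]
print(relevant)
```

A machine-learning classifier would plausibly expose the same yes/no decision, replacing the pattern match with a trained model's prediction.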
## Features
- **Focused Crawling**: Only collect relevant pages
- **Intelligent Prioritization**: Automatically learn how to prioritize links
- **Flexible Classifiers**: From regex to ML models
- **Efficient**: Avoid retrieval of irrelevant content
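
The prioritization feature can be pictured as a crawl frontier that always hands out the most promising link first. The Python sketch below assumes a fixed keyword-count score on anchor text as a stand-in for the link priorities that ACHE learns automatically; the `Frontier` class, the `score_link` function, and the sample links are hypothetical and do not reflect ACHE's implementation.

```python
import heapq
import itertools


class Frontier:
    """Hypothetical crawl frontier: highest-scoring unseen link is popped first."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = itertools.count()  # tie-breaker keeps insertion order

    def add(self, url, score):
        if url in self._seen:
            return  # never enqueue the same URL twice
        self._seen.add(url)
        # heapq is a min-heap, so negate the score to pop the largest first.
        heapq.heappush(self._heap, (-score, next(self._counter), url))

    def pop(self):
        _, _, url = heapq.heappop(self._heap)
        return url

    def __len__(self):
        return len(self._heap)


def score_link(anchor_text, keywords):
    # Stand-in scoring rule (an assumption): count keyword hits in the anchor text.
    return sum(1 for word in anchor_text.lower().split() if word in keywords)


keywords = {"focused", "crawler"}
links = [
    ("http://example.com/about", "About us"),
    ("http://example.com/focused", "Focused crawler tutorial"),
    ("http://example.com/crawl", "Web crawler basics"),
]

frontier = Frontier()
for url, anchor in links:
    frontier.add(url, score_link(anchor, keywords))

order = [frontier.pop() for _ in range(len(frontier))]
print(order)
```

Because low-scoring links sink to the back of the queue, a crawler with a limited budget spends most of its requests on pages likely to be relevant.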
## Links
- GitHub: [github.com/VIDA-NYU/ache](https://github.com/VIDA-NYU/ache)