ACHE

ACHE is a focused web crawler that collects web pages satisfying specific criteria using page classifiers and intelligent link prioritization.

ACHE is a focused web crawler. It collects web pages that satisfy some specific criteria, e.g., pages that belong to a given domain or that contain a user-specified pattern.

## How It Differs

ACHE differs from generic crawlers in that it uses page classifiers to distinguish between relevant and irrelevant pages in a given domain. A page classifier can be:

- A simple regular expression
- A machine-learning based classification model

## Features

- **Focused Crawling**: Only collect relevant pages
- **Intelligent Prioritization**: Automatically learn how to prioritize links
- **Flexible Classifiers**: From regex to ML models
- **Efficient**: Avoid retrieval of irrelevant content

## Links

- GitHub: [github.com/VIDA-NYU/ache](https://github.com/VIDA-NYU/ache)
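
## Illustrative Sketch

To make the idea of a regex page classifier combined with link prioritization concrete, here is a minimal Python sketch. It is not ACHE's code, API, or configuration format. The pattern, the `toy_fetch` function, the `TOY_WEB` data, and the fixed two-level priority rule are all assumptions made for illustration. In particular, ACHE is described as learning how to prioritize links, while this sketch swaps in a simple hand-written heuristic: links found on relevant pages are visited first.

```python
import heapq
import re
from typing import Callable, List, Tuple

# Hypothetical user-specified pattern (an assumption, not an ACHE default).
RELEVANT = re.compile(r"\bweb crawl(er|ing)\b", re.IGNORECASE)


def is_relevant(page_text: str) -> bool:
    """Regex page classifier: a page is relevant if the pattern occurs in it."""
    return RELEVANT.search(page_text) is not None


def focused_crawl(
    seed: str,
    fetch: Callable[[str], Tuple[str, List[str]]],
    max_pages: int = 10,
) -> List[str]:
    """Crawl from `seed` and return the URLs classified as relevant.

    `fetch(url)` must return (page_text, outgoing_links).
    """
    frontier = [(0, seed)]  # min-heap of (priority, url); lower is visited first
    seen = {seed}
    relevant: List[str] = []
    visited = 0
    while frontier and visited < max_pages:
        _, url = heapq.heappop(frontier)
        text, links = fetch(url)
        visited += 1
        hit = is_relevant(text)
        if hit:
            relevant.append(url)
        # Fixed heuristic (assumption): links from relevant pages get priority 0,
        # links from irrelevant pages get priority 1.
        priority = 0 if hit else 1
        for link in links:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (priority, link))
    return relevant


# Toy in-memory "web" standing in for real HTTP requests (hypothetical data).
TOY_WEB = {
    "a": ("An intro to web crawling", ["b", "c"]),
    "b": ("Cooking recipes", ["d"]),
    "c": ("A focused web crawler in practice", ["e"]),
    "d": ("More recipes", []),
    "e": ("Web crawling at scale", []),
}


def toy_fetch(url: str) -> Tuple[str, List[str]]:
    return TOY_WEB[url]


if __name__ == "__main__":
    print(focused_crawl("a", toy_fetch, max_pages=4))
```

With a budget of four pages, the sketch fetches `a`, `b`, `c`, and `e` and never retrieves `d`, because `d` is linked only from an irrelevant page and sits at the back of the queue. That is the sense in which a focused crawler avoids retrieving irrelevant content.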