Web crawler (beta) FAQ
The Elastic Enterprise Search web crawler is a beta feature. Beta features are subject to change and are not covered by the support SLA of generally available (GA) features. Elastic plans to promote this feature to GA in a future release.
View frequently asked questions about the Enterprise Search web crawler:
See Web crawler (beta) reference for detailed technical information about the web crawler.
We also welcome your feedback.
What functionality is supported?
- Crawling publicly accessible HTTP/HTTPS websites
- Support for crawling multiple domains per engine
- Robots meta tag support
- Robots "nofollow" support
  Includes robots meta tags set to "nofollow" and links with rel="nofollow" attributes.
- Basic content extraction
  The web crawler will extract content for a predefined, unconfigurable set of fields from each page it visits.
- "Entry points"
  Entry points allow customers to specify where the web crawler begins crawling each domain.
- "Crawl rules"
  Crawl rules allow customers to control whether each URL the web crawler encounters will be visited and indexed.
- Logging of each crawl
  Logs cover an entire crawl, which encompasses all domains in an engine.
- User interfaces for managing domains, entry points, and crawl rules
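The "nofollow" behavior described above can be sketched in a few lines. The following is an illustrative example only, not the crawler's actual implementation: it uses Python's standard-library HTML parser to separate links a nofollow-aware crawler would follow from those it would skip, honoring both a page-level robots meta tag and per-link rel="nofollow" attributes.

```python
from html.parser import HTMLParser

class NofollowChecker(HTMLParser):
    """Collect links, separating those a nofollow-aware crawler would skip."""

    def __init__(self):
        super().__init__()
        self.page_nofollow = False  # set by <meta name="robots" content="nofollow">
        self.followable = []
        self.nofollow = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            if "nofollow" in (attrs.get("content") or "").lower():
                self.page_nofollow = True
        elif tag == "a" and attrs.get("href"):
            rels = (attrs.get("rel") or "").lower().split()
            if self.page_nofollow or "nofollow" in rels:
                self.nofollow.append(attrs["href"])
            else:
                self.followable.append(attrs["href"])

html = """
<html><head><meta name="robots" content="index"></head><body>
<a href="/a">follow me</a>
<a href="/b" rel="nofollow">do not follow</a>
</body></html>
"""
checker = NofollowChecker()
checker.feed(html)
```

After feeding the sample page, `/a` lands in `followable` and `/b` in `nofollow`; a page-level robots "nofollow" meta tag would route every link to `nofollow`.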
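To make the crawl-rules idea above concrete, here is a minimal sketch of how a rule list can decide whether a URL path is visited. The rule shapes (allow/deny policies, "begins"/"contains"/regex matching, first match wins, allow by default) are assumptions for illustration; they are not the product's actual rule syntax.

```python
import re

def url_allowed(path, rules):
    """Evaluate hypothetical crawl rules: first matching rule decides;
    a path that matches no rule is visited and indexed by default."""
    for policy, kind, pattern in rules:
        if kind == "begins":
            matched = path.startswith(pattern)
        elif kind == "contains":
            matched = pattern in path
        else:  # "regex"
            matched = re.search(pattern, path) is not None
        if matched:
            return policy == "allow"
    return True  # no rule matched: allow

# Illustrative rule list, evaluated top to bottom.
rules = [
    ("deny", "begins", "/private"),
    ("allow", "regex", r"^/blog/\d{4}/"),
    ("deny", "contains", "draft"),
]
```

With these rules, `/private/report` is denied by the first rule, `/blog/2021/post` is allowed by the second, and `/about` matches nothing and falls through to the default allow.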
What functionality is not supported?
- Automatic or scheduled crawling
  Start crawls manually from the UI, or use the crawler API to start a crawl on demand.
- Single-page app (SPA) support
  The crawler cannot currently crawl pages that are pure JavaScript single-page apps.
- Configurable content extraction
  Content extraction is currently limited to an unconfigurable, predefined set of fields.
- Crawling private websites or websites behind authentication
- Sitemap support
  The web crawler currently has no knowledge of sitemaps and cannot use them to identify pages to visit.
- robots.txt support
  The web crawler does not currently adhere to robots.txt rules. The crawler only honors robots meta tags set to "nofollow" and links with rel="nofollow" attributes.
- Crawl persistence
  If a crawl is unexpectedly stopped before it finishes, it cannot resume where it left off. You can restart the crawl from the beginning; the crawler will not duplicate documents it has already indexed.
- Extracting content from files
  Currently, the web crawler only extracts content from HTML; it does not extract content from files.