Web crawler (beta) FAQ
The Elastic Enterprise Search web crawler is a beta feature. Beta features are subject to change and are not covered by the support SLA of generally available (GA) features. Elastic plans to promote this feature to GA in a future release.
View frequently asked questions about the Enterprise Search web crawler below.
See Web crawler (beta) reference for detailed technical information about the web crawler.
We also welcome your feedback.
What functionality is supported?
- Crawling publicly accessible HTTP/HTTPS websites
- Support for crawling multiple domains per engine
- Robots meta tag support
- Robots "nofollow" support: Includes robots meta tags set to "nofollow" and links with rel="nofollow" attributes. See the example after this list.
- Basic content extraction: The web crawler extracts content for a predefined, unconfigurable set of fields from each page it visits.
- Entry points: Specify where the web crawler begins crawling each domain.
- Crawl rules: Control whether each URL the web crawler encounters will be visited and indexed.
- Logging of each crawl: Logs cover an entire crawl, which encompasses all domains in an engine.
- User interfaces for managing domains, entry points, and crawl rules
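For instance, the crawler's "nofollow" handling responds to standard HTML signals like the ones below. This markup is illustrative only; the page and URL are hypothetical.

```html
<!-- Page-level: a robots meta tag asking crawlers not to follow any links on this page -->
<meta name="robots" content="nofollow">

<!-- Link-level: a rel="nofollow" attribute asking crawlers not to follow this one link -->
<a href="https://example.com/private-report" rel="nofollow">Private report</a>
```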
What functionality is not supported?
- Automatic or scheduled crawling: Start crawls manually from the UI, or use the web crawler API to start a crawl on demand. See the sketch after this list.
- Single-page app (SPA) support: The crawler cannot currently crawl pages that are pure JavaScript single-page apps.
- Configurable content extraction: Content extraction is currently limited to an unconfigurable, predefined set of fields.
- Crawling private websites or websites behind authentication
- Sitemap support: The web crawler currently has no knowledge of sitemaps and cannot use them to discover pages to visit.
- robots.txt support: The web crawler does not currently adhere to robots.txt rules. It honors only robots meta tags set to "nofollow" and links with rel="nofollow" attributes.
- Crawl persistence: If a crawl is unexpectedly stopped before it finishes, it cannot resume where it left off. You can restart the crawl from the beginning; the crawler will not duplicate documents it has already indexed.
- Extracting content from files: Currently, the web crawler extracts content from HTML only.
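As an illustration of starting a crawl on demand, the sketch below POSTs a new crawl request to the App Search crawler API. The host, engine name, and environment variable are placeholder assumptions, and the endpoint path should be verified against the Web crawler (beta) API reference for your version.

```python
import os

import requests

# Placeholder assumptions: adjust the host and engine name for your deployment.
ENTERPRISE_SEARCH_HOST = "http://localhost:3002"
ENGINE_NAME = "my-engine"
# A private API key with write access to the engine.
API_KEY = os.environ["APP_SEARCH_PRIVATE_API_KEY"]

# Request a new crawl for the engine. The path below follows the beta crawler
# API; check it against the Web crawler (beta) API reference.
response = requests.post(
    f"{ENTERPRISE_SEARCH_HOST}/api/as/v0/engines/{ENGINE_NAME}/crawler/crawl_requests",
    headers={"Authorization": f"Bearer {API_KEY}"},
)
response.raise_for_status()
print(response.json())  # For example, the new crawl request's id and status.
```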