IMPORTANT: No additional bug fixes or documentation updates
will be released for this version. For the latest information, see the
current release documentation.
Web crawler schema
The web crawler indexes search documents using the following schema. All fields are strings or arrays of strings.
- additional_urls: The URLs of additional pages with the same content.
- body_content: The content of the page's <body> tag with all HTML tags removed. Truncated to crawler.extraction.body_size.limit.
- domains: The domains in which this content appears.
- headings: The text of the page's HTML headings (h1-h6 elements). Limited by crawler.extraction.headings_count.limit.
- id: The unique identifier for the page.
- last_crawled_at: The date and time when the page was last crawled.
- links: Links found on the page. Limited by crawler.extraction.indexed_links_count.limit.
- meta_description: The page's description, taken from the <meta name="description"> tag. Truncated to crawler.extraction.description_size.limit.
- meta_keywords: The page's keywords, taken from the <meta name="keywords"> tag. Truncated to crawler.extraction.keywords_size.limit.
- title: The title of the page, taken from the <title> tag. Truncated to crawler.extraction.title_size.limit.
- url: The URL of the page.
- url_host: The hostname or IP address from the page's URL.
- url_path: The full pathname from the page's URL.
- url_path_dir1: The first segment of the pathname from the page's URL.
- url_path_dir2: The second segment of the pathname from the page's URL.
- url_path_dir3: The third segment of the pathname from the page's URL.
- url_port: The port number from the page's URL (as a string).
- url_scheme: The scheme of the page's URL.
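To illustrate how the url_* fields relate to one another, the following sketch decomposes a URL into fields shaped like those above. This is an illustrative approximation using Python's standard library, not the crawler's actual extraction code; details such as whether path segments keep a leading slash may differ in real indexed documents.

```python
from urllib.parse import urlsplit

def url_fields(url):
    """Approximate the url_* schema fields for a page URL.

    Illustrative sketch only; the web crawler's real extraction
    logic may normalize values differently.
    """
    parts = urlsplit(url)
    # Path segments, ignoring empty strings from leading/trailing slashes.
    segments = [s for s in parts.path.split("/") if s]
    fields = {
        "url": url,
        "url_scheme": parts.scheme,
        "url_host": parts.hostname,
        "url_path": parts.path,
        # The schema stores the port as a string.
        "url_port": str(parts.port) if parts.port else "",
    }
    # Only the first three segments have dedicated fields.
    for i, seg in enumerate(segments[:3], start=1):
        fields[f"url_path_dir{i}"] = seg
    return fields

print(url_fields("https://example.com:8080/blog/2022/my-post"))
```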
In addition to these predefined fields, you can also extract custom fields via meta tags and attributes.
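Putting the schema together, a crawled page might produce a document shaped like the following. All field names come from the schema above; every value is invented for illustration, and real documents will vary with the crawled page and configuration.

```python
# Hypothetical document as the web crawler might index it.
# Field names follow the schema above; all values are invented
# for illustration. Every field is a string or a list of strings.
example_document = {
    "id": "624eb3f4c0cc6b0001337ff4",
    "url": "https://example.com/blog/2022/my-post",
    "url_scheme": "https",
    "url_host": "example.com",
    "url_port": "443",
    "url_path": "/blog/2022/my-post",
    "url_path_dir1": "blog",
    "url_path_dir2": "2022",
    "url_path_dir3": "my-post",
    "title": "My Post",
    "meta_description": "A short example post.",
    "meta_keywords": "example, blog",
    "headings": ["My Post", "Introduction"],
    "body_content": "My Post Introduction Hello, world.",
    "links": ["https://example.com/about"],
    "domains": ["https://example.com"],
    "additional_urls": ["https://www.example.com/blog/2022/my-post"],
    "last_crawled_at": "2022-10-01T12:00:00Z",
}

# Sanity check: every value is a string or a list of strings,
# matching the schema's type constraint.
assert all(
    isinstance(v, str)
    or (isinstance(v, list) and all(isinstance(s, str) for s in v))
    for v in example_document.values()
)
```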