IMPORTANT: No additional bug fixes or documentation updates
will be released for this version. For the latest information, see the
current release documentation.
Web crawler schema
The web crawler indexes search documents using the following schema. All fields are strings or arrays of strings.
- additional_urls: The URLs of additional pages with the same content.
- body_content: The content of the page’s <body> tag with all HTML tags removed. Truncated to crawler.extraction.body_size.limit.
- domains: The domains in which this content appears.
- full_html: The full HTML of the page in string form. This is disabled by default. If the setting is disabled, the document will not have a full_html field at all.
- headings: The text of the page’s HTML headings (h1-h6 elements). Limited by crawler.extraction.headings_count.limit.
- id: The unique identifier for the page.
- last_crawled_at: The date and time when the page was last crawled.
- links: Links found on the page. Limited by crawler.extraction.indexed_links_count.limit.
- meta_description: The page’s description, taken from the <meta name="description"> tag. Truncated to crawler.extraction.description_size.limit.
- meta_keywords: The page’s keywords, taken from the <meta name="keywords"> tag. Truncated to crawler.extraction.keywords_size.limit.
- title: The title of the page, taken from the <title> tag. Truncated to crawler.extraction.title_size.limit.
- url: The URL of the page.
- url_host: The hostname or IP from the page’s URL.
- url_path: The full pathname from the page’s URL.
- url_path_dir1: The first segment of the pathname from the page’s URL.
- url_path_dir2: The second segment of the pathname from the page’s URL.
- url_path_dir3: The third segment of the pathname from the page’s URL.
- url_port: The port number from the page’s URL (as a string).
- url_scheme: The scheme of the page’s URL.
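To make the relationship between the url_* fields concrete, the following Python sketch splits a URL into values shaped like those fields. It is an illustration only, not the crawler's own implementation. The function name url_fields, the fallback to the scheme's well-known port when the URL has none, and the choice to emit path segments without slashes are all assumptions made for this example.

```python
from urllib.parse import urlsplit


def url_fields(url: str) -> dict:
    """Split a URL into url_* style string fields (illustrative sketch)."""
    parts = urlsplit(url)

    # Assumption: when the URL has no explicit port, fall back to the
    # well-known port for the scheme. The crawler's behavior may differ.
    default_ports = {"http": 80, "https": 443}
    port = parts.port or default_ports.get(parts.scheme)

    fields = {
        "url": url,
        "url_scheme": parts.scheme,
        "url_host": parts.hostname or "",
        "url_path": parts.path,
    }
    if port is not None:
        # The schema stores the port as a string, not a number.
        fields["url_port"] = str(port)

    # Non-empty path segments, e.g. "/blog/2023/post" -> ["blog", "2023", "post"].
    # Assumption: segments are stored without leading or trailing slashes.
    segments = [segment for segment in parts.path.split("/") if segment]
    for index, segment in enumerate(segments[:3], start=1):
        fields[f"url_path_dir{index}"] = segment

    return fields


example = url_fields("https://example.com:8443/blog/2023/search/intro.html")
```

In this sketch a URL with fewer than three path segments simply omits the unused url_path_dirN keys. Whether the crawler omits them or stores empty values is not stated here.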
In addition to these predefined fields, you can also extract custom fields via meta tags and attributes.
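When working with crawled documents in application code, it can help to separate the predefined fields from any custom ones. The Python sketch below does this for a document held as a plain dictionary and checks that every value is a string or a list of strings. The helper name split_fields, the sample document, and the custom field author are hypothetical. That custom fields follow the same string-or-array rule as the predefined fields is an assumption.

```python
# Names of the predefined fields listed in the schema above.
PREDEFINED_FIELDS = frozenset({
    "additional_urls", "body_content", "domains", "full_html", "headings",
    "id", "last_crawled_at", "links", "meta_description", "meta_keywords",
    "title", "url", "url_host", "url_path", "url_path_dir1",
    "url_path_dir2", "url_path_dir3", "url_port", "url_scheme",
})


def split_fields(document: dict) -> tuple:
    """Return (predefined, custom) field dictionaries for one document."""
    predefined, custom = {}, {}
    for name, value in document.items():
        is_string = isinstance(value, str)
        is_string_list = isinstance(value, list) and all(
            isinstance(item, str) for item in value
        )
        # Assumption: custom fields obey the same type rule as predefined ones.
        if not (is_string or is_string_list):
            raise TypeError(f"{name!r} must be a string or a list of strings")
        target = predefined if name in PREDEFINED_FIELDS else custom
        target[name] = value
    return predefined, custom


# Hypothetical document; "author" stands in for a custom field.
sample = {
    "title": "Getting started",
    "url": "https://example.com/docs/start",
    "url_port": "443",
    "headings": ["Getting started", "Next steps"],
    "author": "Jane Doe",
}
predefined, custom = split_fields(sample)
```

Note that url_port is checked as a string here, matching the schema, so passing an integer port raises a TypeError.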