IMPORTANT: No additional bug fixes or documentation updates will be released for this version. For the latest information, see the current release documentation.

Web crawler schema

The web crawler indexes search documents using the following schema. All fields are strings or arrays of strings.

additional_urls: The URLs of additional pages with the same content.
body_content: The content of the page’s <body> tag with all HTML tags removed. Truncated to crawler.extraction.body_size.limit.
domains: The domains in which this content appears.
full_html: The full HTML of the page, as a string. This field is disabled by default; when it is disabled, documents have no full_html field at all.
headings: The text of the page’s HTML headings (h1 through h6 elements). Limited by crawler.extraction.headings_count.limit.
id: The unique identifier for the page.
last_crawled_at: The date and time when the page was last crawled.
links: Links found on the page. Limited by crawler.extraction.indexed_links_count.limit.
meta_description: The page’s description, taken from the <meta name="description"> tag. Truncated to crawler.extraction.description_size.limit.
meta_keywords: The page’s keywords, taken from the <meta name="keywords"> tag. Truncated to crawler.extraction.keywords_size.limit.
title: The title of the page, taken from the <title> tag. Truncated to crawler.extraction.title_size.limit.
url: The URL of the page.
url_host: The hostname or IP address from the page’s URL.
url_path: The full pathname from the page’s URL.
url_path_dir1: The first segment of the pathname from the page’s URL.
url_path_dir2: The second segment of the pathname from the page’s URL.
url_path_dir3: The third segment of the pathname from the page’s URL.
url_port: The port number from the page’s URL (as a string).
url_scheme: The scheme of the page’s URL.
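
To illustrate how the url_* fields relate to a page's URL, here is a minimal Python sketch that splits a URL into those fields. It is not the crawler's own implementation: the function name url_fields and the DEFAULT_PORTS table are hypothetical, and falling back to the scheme's well-known port when the URL omits one is an assumption made for the example. Every value is emitted as a string, matching the schema above.

```python
from urllib.parse import urlsplit

# Assumption for illustration: use the scheme's well-known port
# when the URL does not specify one explicitly.
DEFAULT_PORTS = {"http": 80, "https": 443}


def url_fields(url: str) -> dict[str, str]:
    """Split a URL into url_* style fields (hypothetical helper)."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)

    fields = {
        "url": url,
        "url_scheme": parts.scheme,
        "url_host": parts.hostname or "",
        "url_path": parts.path or "/",
    }
    if port is not None:
        # The schema stores the port as a string, not a number.
        fields["url_port"] = str(port)

    # Keep only non-empty path segments, then take the first three.
    segments = [s for s in parts.path.split("/") if s]
    for i, segment in enumerate(segments[:3], start=1):
        fields[f"url_path_dir{i}"] = segment
    return fields


if __name__ == "__main__":
    for name, value in url_fields("https://example.com/blog/2023/post.html").items():
        print(f"{name}: {value}")
```

In this sketch, a URL with fewer than three path segments simply omits the remaining url_path_dir fields; whether the crawler does the same is not stated here.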

In addition to these predefined fields, you can also extract custom fields via meta tags and attributes.
