IMPORTANT: No additional bug fixes or documentation updates
will be released for this version. For the latest information, see the
current release documentation.
Web crawler schema
The web crawler indexes search documents using the following schema. All fields are strings or arrays of strings.
- additional_urls: The URLs of additional pages with the same content.
- body_content: The content of the page's <body> tag with all HTML tags removed. Truncated to crawler.extraction.body_size.limit.
- domains: The domains in which this content appears.
- headings: The text of the page's HTML headings (h1-h6 elements). Limited by crawler.extraction.headings_count.limit.
- id: The unique identifier for the page.
- last_crawled_at: The date and time when the page was last crawled.
- links: Links found on the page. Limited by crawler.extraction.indexed_links_count.limit.
- meta_description: The page's description, taken from the <meta name="description"> tag. Truncated to crawler.extraction.description_size.limit.
- meta_keywords: The page's keywords, taken from the <meta name="keywords"> tag. Truncated to crawler.extraction.keywords_size.limit.
- title: The title of the page, taken from the <title> tag. Truncated to crawler.extraction.title_size.limit.
- url: The URL of the page.
- url_host: The hostname or IP address from the page's URL.
- url_path: The full pathname from the page's URL.
- url_path_dir1: The first segment of the pathname from the page's URL.
- url_path_dir2: The second segment of the pathname from the page's URL.
- url_path_dir3: The third segment of the pathname from the page's URL.
- url_port: The port number from the page's URL (as a string).
- url_scheme: The scheme of the page's URL.
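To illustrate how the url_* fields relate to one another, the following sketch decomposes a URL into fields shaped like those above. This is an illustrative approximation using Python's standard library, not the crawler's actual extraction code; details such as whether path segments keep a leading slash may differ in real indexed documents.

```python
from urllib.parse import urlsplit

def url_fields(url):
    """Approximate the url_* schema fields for a page URL.

    Illustrative sketch only; the web crawler's real extraction
    logic may normalize values differently.
    """
    parts = urlsplit(url)
    # Path segments, ignoring empty strings from leading/trailing slashes.
    segments = [s for s in parts.path.split("/") if s]
    fields = {
        "url": url,
        "url_scheme": parts.scheme,
        "url_host": parts.hostname,
        "url_path": parts.path,
        # The schema stores the port as a string.
        "url_port": str(parts.port) if parts.port else "",
    }
    # Only the first three segments have dedicated fields.
    for i, seg in enumerate(segments[:3], start=1):
        fields[f"url_path_dir{i}"] = seg
    return fields

print(url_fields("https://example.com:8080/blog/2022/my-post"))
```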
In addition to these predefined fields, you can also extract custom fields via meta tags and attributes.
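Putting the schema together, a crawled page might produce a document shaped like the following. All field names come from the schema above; every value is invented for illustration, and real documents will vary with the crawled page and configuration.

```python
# Hypothetical document as the web crawler might index it.
# Field names follow the schema above; all values are invented
# for illustration. Every field is a string or a list of strings.
example_document = {
    "id": "624eb3f4c0cc6b0001337ff4",
    "url": "https://example.com/blog/2022/my-post",
    "url_scheme": "https",
    "url_host": "example.com",
    "url_port": "443",
    "url_path": "/blog/2022/my-post",
    "url_path_dir1": "blog",
    "url_path_dir2": "2022",
    "url_path_dir3": "my-post",
    "title": "My Post",
    "meta_description": "A short example post.",
    "meta_keywords": "example, blog",
    "headings": ["My Post", "Introduction"],
    "body_content": "My Post Introduction Hello, world.",
    "links": ["https://example.com/about"],
    "domains": ["https://example.com"],
    "additional_urls": ["https://www.example.com/blog/2022/my-post"],
    "last_crawled_at": "2022-10-01T12:00:00Z",
}

# Sanity check: every value is a string or a list of strings,
# matching the schema's type constraint.
assert all(
    isinstance(v, str)
    or (isinstance(v, list) and all(isinstance(s, str) for s in v))
    for v in example_document.values()
)
```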