Web crawler schema

The web crawler indexes search documents using the following schema. All fields are strings or arrays of strings.

additional_urls
The URLs of additional pages with the same content.
body_content
The content of the page’s <body> tag with all HTML tags removed. Truncated to crawler.extraction.body_size.limit.
domains
The domains in which this content appears.
full_html
The full HTML of the page, as a string. Extraction of this field is disabled by default; when it is disabled, documents have no full_html field at all.
headings
The text of the page’s HTML headings (h1 through h6 elements). Limited by crawler.extraction.headings_count.limit.
id
The unique identifier for the page.
last_crawled_at
The date and time when the page was last crawled.
links
Links found on the page. Limited by crawler.extraction.indexed_links_count.limit.
meta_description
The page’s description, taken from the <meta name="description"> tag. Truncated to crawler.extraction.description_size.limit.
meta_keywords
The page’s keywords, taken from the <meta name="keywords"> tag. Truncated to crawler.extraction.keywords_size.limit.
title
The title of the page, taken from the <title> tag. Truncated to crawler.extraction.title_size.limit.
url
The URL of the page.
url_host
The hostname or IP address from the page’s URL.
url_path
The full pathname from the page’s URL.
url_path_dir1
The first segment of the pathname from the page’s URL.
url_path_dir2
The second segment of the pathname from the page’s URL.
url_path_dir3
The third segment of the pathname from the page’s URL.
url_port
The port number from the page’s URL (as a string).
url_scheme
The scheme of the page’s URL.
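
To make the relationship between the url field and the url_* fields concrete, here is a minimal Python sketch that derives them from a single URL using the standard library. It is an illustration, not the crawler's actual implementation: the function name url_fields is hypothetical, and two behaviors are assumptions of this sketch rather than documented facts, namely that a missing port is filled in from the scheme's default and that the final path segment (for example a file name) counts as a url_path_dir value.

```python
from urllib.parse import urlsplit

# Assumption of this sketch: when the URL names no port, fall back to the
# scheme's well-known default. The crawler's real behavior may differ.
DEFAULT_PORTS = {"http": 80, "https": 443}


def url_fields(url: str) -> dict:
    """Derive schema-style url_* fields from a URL (hypothetical helper)."""
    parts = urlsplit(url)
    path = parts.path or "/"
    # Non-empty path segments, in order: "/a/b/c" -> ["a", "b", "c"].
    segments = [segment for segment in path.split("/") if segment]
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)

    fields = {
        "url": url,
        "url_scheme": parts.scheme,
        "url_host": parts.hostname,
        # The schema stores every field as a string, including the port.
        "url_port": str(port) if port is not None else None,
        "url_path": path,
    }
    # Only the first three segments get their own field.
    for index, segment in enumerate(segments[:3], start=1):
        fields[f"url_path_dir{index}"] = segment

    # Omit fields that could not be determined.
    return {key: value for key, value in fields.items() if value is not None}


if __name__ == "__main__":
    for name, value in url_fields("https://example.com/blog/2024/post.html").items():
        print(f"{name}: {value}")
```

In this sketch, a URL with fewer than three path segments simply produces fewer url_path_dir fields instead of empty strings; check a real indexed document to confirm how the crawler handles that case.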

In addition to these predefined fields, you can also extract custom fields via meta tags and attributes.
