Optimizing web content for the web crawler

This documentation explains how to optimize web content for the crawler. To do this, you must be able to access and modify HTML, robots.txt files, or sitemap source files.

If you can’t access these files, manage crawls in Kibana.

These optimization techniques are similar to the search engine optimization (SEO) techniques used for other web crawlers and robots. For example, you can embed instructions for the web crawler within your HTML content, or prevent the crawler from following links and indexing the content of certain webpages. Use these tools to manage webpage discovery and content extraction.

Discovery concerns which webpages and files from crawled domains get indexed.

Extraction concerns how content is indexed and mapped to fields in Elasticsearch documents.

HTML elements and attributes

The following sections describe crawler instructions you can embed within HTML elements and attributes.

Canonical URL link tags

A canonical URL link tag is an HTML element you can embed within pages that duplicate the content of other pages. The canonical URL link tag specifies the canonical URL for that content. See Duplicate document handling for detailed information about managing duplicate content using the Kibana UI.

The canonical URL is stored on the document in the url field, while the additional_urls field contains all other URLs where the crawler discovered the same content. If your site contains pages that duplicate the content of other pages, use canonical URL link tags to explicitly manage which URL is stored in the url field of the indexed document.

Template:

<link rel="canonical" href="{CANONICAL_URL}">

Example:

<link rel="canonical" href="https://example.com/categories/dresses/starlet-red-medium">
Robots meta tags

Robots meta tags are HTML elements you can embed within pages to prevent the crawler from following links or indexing content. These tags are related to crawl rules; see Crawl rules for detailed information.

Template:

<meta name="robots" content="{DIRECTIVES}">

Supported directives:

noindex
    The web crawler will not index the page. If you want to index some, but not all, content on a page, see Data attributes for inclusion and exclusion.

nofollow
    The web crawler will not follow links from the page. The web crawler logs a url_discover_denied event for each link.

    The directive does not prevent the web crawler from indexing the page.

Currently, content deletion (purge and process crawls) does not honor the noindex and nofollow directives: pages that were previously indexed and now carry these directives are not removed from the engine at the end of a crawl. To manually remove obsolete content, create the appropriate crawl rules to exclude the pages and run a process crawl.

Examples:

<meta name="robots" content="noindex">
<meta name="robots" content="nofollow">
<meta name="robots" content="noindex, nofollow">
Data attributes for inclusion and exclusion

Inject HTML data attributes into your web pages to instruct the web crawler to include or exclude particular sections from extracted content. For example, use this feature to exclude navigation and footer content when crawling, or to exclude sections of content only intended for screen readers.

These attributes work as follows:

  • For all pages that contain HTML tags with a data-elastic-include attribute, the crawler will only index content within those tags.
  • For all pages that contain HTML tags with a data-elastic-exclude attribute, the crawler will exclude the content within those tags from extraction. You can nest data-elastic-include and data-elastic-exclude tags.
  • The web crawler will still crawl any links that appear inside excluded sections, as long as the configured crawl rules allow them (the final example below illustrates this).
Examples

A simple content exclusion rule example:

<body>
  <p>This is your page content, which will be indexed by the web crawler.</p>
  <div data-elastic-exclude>Content in this div will be excluded from the search index</div>
</body>

In this more complex example with nested exclusion and inclusion rules, the web crawler will only extract "test1 test3 test5 test7" from the page.

<body>
  test1
  <div data-elastic-exclude>
    test2
    <p data-elastic-include>
      test3
      <span data-elastic-exclude>
        test4
        <span data-elastic-include>test5</span>
      </span>
    </p>
    test6
  </div>
  test7
</body>
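
Exclusion applies to content extraction only. In the following hypothetical example, the text inside the excluded div is not indexed, but the web crawler can still discover and follow the link inside it, as long as the domain's crawl rules allow that path:

<body>
  <p>This page content will be indexed.</p>
  <div data-elastic-exclude>
    This navigation text will be excluded from the search index.
    <a href="/about">The crawler can still follow this link.</a>
  </div>
</body>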
Meta tags and data attributes to extract custom fields

The web crawler extracts a predefined set of fields (url, body content, and so on) from each page it visits. View your documents to see the full schema. Using meta tags and data attributes, you can also extract custom fields from your HTML pages.

Template:

<head>
  <meta class="elastic" name="{FIELD_NAME}" content="{FIELD_VALUE}">
</head>
<body>
  <div data-elastic-name="{FIELD_NAME}">{FIELD_VALUE}</div>
</body>

The crawled document for this example

<head>
  <meta class="elastic" name="product_price" content="99.99">
</head>
<body>
  <h1 data-elastic-name="product_name">Printer</h1>
</body>

will include 2 additional fields.

{
  "product_price": "99.99",
  "product_name": "Printer"
}

You can specify multiple class="elastic" and data-elastic-name tags.

Template:

<head>
  <meta class="elastic" name="{FIELD_NAME_1}" content="{FIELD_VALUE_1}">
  <meta class="elastic" name="{FIELD_NAME_2}" content="{FIELD_VALUE_2}">
</head>
<body>
  <div data-elastic-name="{FIELD_NAME_1}">{FIELD_VALUE_1}</div>
  <div data-elastic-name="{FIELD_NAME_2}">{FIELD_VALUE_2}</div>
</body>

{FIELD_NAME} must conform to field name rules:

  • Must contain a lowercase letter and may only contain lowercase letters, numbers, and underscores.
  • Must not contain whitespace or have a leading underscore.
  • Must not contain more than 64 characters.
  • Must not be any of the following reserved words:

    • id
    • engine_id
    • search_index_id
    • highlight
    • any
    • all
    • none
    • or
    • and
    • not
    • additional_urls
    • body_content
    • domains
    • headings
    • last_crawled_at
    • links
    • meta_description
    • meta_keywords
    • title
    • url
    • url_host
    • url_path
    • url_path_dir1
    • url_path_dir2
    • url_path_dir3
    • url_port
    • url_scheme
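
For example, hypothetical field names like product_price_usd or author_name follow these rules, while names like ProductPrice (uppercase), _price (leading underscore), product price (whitespace), or title (reserved word) do not. A sketch using meta tags, with placeholder values:

<head>
  <!-- Valid: lowercase letters, numbers, and underscores only -->
  <meta class="elastic" name="product_price_usd" content="99.99">
  <!-- Invalid: "title" is a reserved word -->
  <meta class="elastic" name="title" content="Printer">
</head>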

If you cannot customize the HTML source code for the webpages you want to crawl, refer to Custom fields using proxy for information about extracting custom fields using a proxy server.

Nofollow links

Nofollow links are HTML links that instruct the crawler not to follow the URL.

The web crawler will not follow links that include rel="nofollow" (that is, it will not add those links to the crawl queue). The web crawler logs a url_discover_denied event for each link.

The rel="nofollow" attribute does not prevent the web crawler from indexing the page in which the link appears.

Template:

<a rel="nofollow" href="{LINK_URL}">{LINK_TEXT}</a>

Example:

<a rel="nofollow" href="/admin/categories">Edit this category</a>

robots.txt files

It is impossible to configure the web crawler to ignore or work around a domain’s robots.txt file. Remember this if you’re crawling a domain you don’t control.

A domain may have a robots.txt file. This is a plain text file that provides instructions to web crawlers. The instructions within the file, also called directives, communicate which paths within that domain are disallowed (and allowed) for crawling.

You can also use a robots.txt file to specify sitemaps for a domain. See Sitemaps.

Most web crawlers automatically fetch and parse the robots.txt file for each domain they crawl. If you already publish a robots.txt file for other web crawlers, be aware the web crawler will fetch this file and honor the directives within it. You may want to add, remove, or update the robots.txt file for each of your domains.

Example: add a robots.txt file to a domain

To add a robots.txt file to the domain https://shop.example.com:

  1. Determine which paths within the domain you’d like to exclude.
  2. Create a robots.txt file with the appropriate directives from the Robots exclusion standard. For instance:

    User-agent: *
    Disallow: /cart
    Disallow: /login
    Disallow: /account
  3. Publish the file, with filename robots.txt, at the root of the domain: https://shop.example.com/robots.txt.

The next time the web crawler visits the domain, it will fetch and parse the robots.txt file.

The web crawler will crawl only those paths that are allowed by the crawl rules for the domain and the directives within the robots.txt file for the domain.

See Crawl rules for detailed information.

Non-standard extensions

The Elastic web crawler does not support all nonstandard extensions to the robots exclusion standard. The following table summarizes which of these directives the crawler supports:

Directive                Support
Crawl-delay directive    Not supported
Sitemap directive        Supported
Host directive           Not supported

Sitemaps

A sitemap is an XML file, associated with a domain, that informs web crawlers about pages within that domain. XML elements within the sitemap identify specific URLs that are available for crawling. Each domain may have one or more sitemaps.

If you already publish sitemaps for other web crawlers, the web crawler can use the same sitemaps. To make your sitemaps discoverable, specify them within robots.txt files.

Sitemaps are related to entry points; see Entry points for details. You can choose to submit URLs to the web crawler using sitemaps, entry points, or a combination of both.

You may prefer using sitemaps over entry points for any of the following reasons:

  • You have already been publishing sitemaps for other web crawlers.
  • You don’t have access to the web crawler UI in Kibana.
  • You prefer the sitemap file interface over the Kibana UI.

Use sitemaps to inform the web crawler of pages you think are important, or pages that are isolated and not linked from other pages. However, be aware the web crawler will visit only those pages from the sitemap that are allowed by the domain’s crawl rules and robots.txt file directives.

Sitemap discovery and management

To add a sitemap to a domain, you can specify it within a robots.txt file. At the start of each crawl, the web crawler fetches and processes each domain’s robots.txt file and each sitemap specified within those robots.txt files.

Sitemap format and technical specification

The sitemaps standard defines the format and technical specification for sitemaps. Refer to the standard for the required and optional elements, character escaping, and other technical considerations and examples.

The web crawler does not process optional metadata defined by the standard. The web crawler extracts a list of URLs from each sitemap and ignores all other information.
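
For example, from a sitemap entry like the following, the web crawler would extract only the URL in the <loc> element; the optional elements (shown here with placeholder values) would be ignored:

<url>
  <loc>https://shop.example.com/products/1/</loc>
  <lastmod>2024-01-01</lastmod>
  <changefreq>weekly</changefreq>
  <priority>0.8</priority>
</url>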

There is no guarantee that pages (and their respective linked pages) will be indexed in the order they appear in the sitemap, because crawls are run asynchronously.

Ensure each URL within your sitemap matches the exact domain (defined here as scheme + host + port) for your site. Different subdomains (like www.example.com and blog.example.com) and different schemes (like http://example.com and https://example.com) require separate sitemaps.

The web crawler also supports sitemap index files. Refer to Using sitemap index files within the sitemap standard for sitemap index file details and examples.
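
For reference, a minimal sitemap index file, following that standard, might look like this (the sitemap filenames are hypothetical):

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://shop.example.com/sitemap-products.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://shop.example.com/sitemap-categories.xml</loc>
  </sitemap>
</sitemapindex>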

Manage sitemaps

Example: Add a sitemap via robots.txt

To add a sitemap to the domain https://shop.example.com:

  1. Determine which pages within the domain you’d like to include. Ensure these paths are allowed by the domain’s crawl rules and the directives within the domain’s robots.txt file.
  2. Create a sitemap file with the appropriate elements from the sitemap standard. For instance:

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>https://shop.example.com/products/1/</loc>
      </url>
      <url>
        <loc>https://shop.example.com/products/2/</loc>
      </url>
      <url>
        <loc>https://shop.example.com/products/3/</loc>
      </url>
    </urlset>
  3. Publish the file on your site, for example, at the root of the domain: https://shop.example.com/sitemap.xml.
  4. Create or modify the robots.txt file for the domain, located at https://shop.example.com/robots.txt. Anywhere within the file, add a Sitemap directive that provides the location of the sitemap. For instance:

    Sitemap: https://shop.example.com/sitemap.xml
  5. Publish the new or updated robots.txt file.

The next time the web crawler visits the domain, it will fetch and parse the robots.txt file and the sitemap.

Alternatively, you can manage the sitemaps for a domain through the Kibana UI, where you can view, add, edit, and delete sitemaps. Use the UI to add custom sitemap definitions that do not live on the domain and are used only by your crawler. See Entry points and sitemaps.