Elastic Inference Service

Elastic Inference Service (EIS) enables you to leverage AI-powered search as a service without deploying a model in your environment. With EIS, you don't need to add, configure, and scale machine learning nodes to provide the infrastructure and resources that machine learning inference requires. Instead, you can use machine learning models for ingest, search, and chat independently of your Elasticsearch infrastructure.

You can use EIS with your self-managed cluster through Cloud Connect. For details, refer to EIS for self-managed clusters.

  • Your Elastic deployment or project comes with an Elastic Managed LLM connector that uses a default LLM. This connector is used in Agent Builder, the AI Assistant, Attack Discovery, Automatic Import, and Search Playground. For the list of available models, refer to the documentation.

  • You can use ELSER to perform semantic search as a service (ELSER on EIS).

  • You can use the jina-embeddings-v3 multilingual dense vector embedding model to perform semantic search through the Elastic Inference Service.
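Each of these capabilities is backed by an inference endpoint. As a quick check, you can list the inference endpoints configured in your deployment or project, which may include EIS-backed endpoints such as .elser-2-elastic (a minimal sketch using the standard inference API):

    GET _inference/_all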

This table lists the models supported by Elastic Inference Service.

Note

The Inference Regions column shows the regions where inference requests are processed and where data is sent.

Models supported by Elastic Inference Service
| Author | Name | ID | Release Status | Input Modalities | Output Modalities | EOL Date | Data Retention Period (Days) | Data Used To Train Models? | Model Card | Author Terms | Provider Terms | Inference Regions |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Anthropic | Claude Sonnet 3.7 | anthropic-claude-3.7-sonnet | Generally Available | Text | Text | 2026-04-28 | 0 | No | Claude 3.7 Sonnet System Card | Anthropic terms | AWS terms | US |
| Anthropic | Claude Sonnet 4.5 | anthropic-claude-4.5-sonnet | Generally Available | Text | Text | 2026-09-29 | 0 | No | Claude Sonnet 4.5 System Card | Anthropic terms | AWS terms | US |
| Elastic | ELSER v2 | elser_model_2 | Generally Available | Text | Embedding | — | 0 | No | ELSER docs | Elastic Terms | Elastic Terms | US |
| Jina | Embeddings v3 | jina-embeddings-v3 | Generally Available | Text | Embedding | — | 0 | No | jina-embeddings-v3 | Jina.ai terms | Elastic Terms | US |
| Jina | Reranker v2 | jina-reranker-v2-base-multilingual | Generally Available | Text | Text | — | 0 | No | jina-reranker-v2-base-multilingual | Jina.ai terms | Elastic Terms | US |
| Jina | Reranker v3 | jina-reranker-v3 | Generally Available | Text | Text | — | 0 | No | jina-reranker-v3 | Jina.ai terms | Elastic Terms | US |
Important
  • Elastic does not guarantee the availability of supported models.
  • Use of the Elastic Inference Service requires that customers have read and agreed to the applicable terms of the model providers. Use of a model constitutes a contract between the customer and the model provider.
  • “AI Models” means the third-party generative artificial intelligence models accessed via API through the Service and listed on the OpenRouter website.
  • “AI Model Provider” means the provider of the applicable AI Model.
  • Availability. OpenRouter does not guarantee availability of the AI Models and provides Customer access to the AI Models only on an as-available basis. AI Model uptime and performance are described in the applicable AI Model Terms, and Customer is responsible for (a) reviewing the AI Model Terms to understand the availability and data practices of each AI Model Provider, and (b) agreeing to the AI Model Terms prior to using the Service.

Elastic Inference Service is currently available in a single region: AWS us-east-1. All inference requests sent through EIS are routed to this region, regardless of where your Elasticsearch deployment or Serverless project is hosted.

Depending on the model being used, request processing may involve Elastic inference infrastructure and, in some cases, trusted third-party model providers. For example, ELSER requests are processed entirely within Elastic inference infrastructure in AWS us-east-1. Other models, such as large language models or third-party embedding models, may involve additional processing by their respective model providers, which can operate in different cloud platforms or regions.

The service enforces rate limits on an ongoing basis. Exceeding a limit results in HTTP 429 responses from the server until the sliding window advances and part of the limit resets.

| Model | Requests/minute | Tokens/minute (ingest) | Tokens/minute (search) | Notes |
|---|---|---|---|---|
| Claude Sonnet 3.7 | 400 | - | - | No rate limit on tokens |
| Elastic Managed LLM | 400 | - | - | No rate limit on tokens. Renamed to Claude Sonnet 3.7 in later versions |
| Claude Sonnet 4.5 | 400 | - | - | No rate limit on tokens |
| ELSER | 6,000 | 6,000,000 | 600,000 | Both request and token limits apply; whichever is reached first takes effect |
| jina-embeddings-v3 | 6,000 | 6,000,000 | 600,000 | Both request and token limits apply; whichever is reached first takes effect |

All models on EIS incur a charge per million tokens. The pricing details are available on our Pricing page.

This pricing model differs from that of machine learning nodes, which are billed through consumed VCUs.

EIS is billed per million tokens used:

  • For chat models, input and output tokens are billed. Longer conversations with extensive context or detailed responses will consume more tokens.
  • For embeddings models, only input tokens are billed.

Tokens are the fundamental units that language models process for both input and output. Tokenizers convert text into numerical data by segmenting it into subword units. A token can be a complete word, part of a word, or a punctuation mark, depending on the model's trained tokenizer and the frequency patterns in its training data.

For example, the sentence It was the best of times, it was the worst of times. contains 52 characters but would tokenize into approximately 14 tokens with a typical word-based approach, though the exact count varies by tokenizer.

To track your token consumption:

  1. Navigate to Billing and subscriptions > Usage in the Elastic Cloud Console.
  2. Look for line items where the Billing dimension is set to "Inference".

ELSER on EIS enables you to use the ELSER model on GPUs without having to manage your own ML nodes. We expect higher ingest throughput than ML nodes and equivalent search latency. We will continue to benchmark, remove limitations, and address concerns.

You can now use semantic_text with the new ELSER endpoint on EIS. To learn how to use the .elser-2-elastic inference endpoint, refer to Using ELSER on EIS.

Semantic Search with semantic_text is a detailed tutorial on using the semantic_text field with the ELSER endpoint on EIS instead of the default endpoint. It is a great way to get started and try the new endpoint.
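For example, a minimal index mapping that routes a semantic_text field to the EIS-hosted ELSER endpoint might look like the following sketch (the index name my-index and field name content are placeholders):

    PUT my-index
    {
      "mappings": {
        "properties": {
          "content": {
            "type": "semantic_text",
            "inference_id": ".elser-2-elastic"
          }
        }
      }
    }

Documents ingested into this index generate their sparse embeddings through EIS rather than through ML nodes in your cluster.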

You can use the jina-embeddings-v3 model through Elastic Inference Service. Running the model on EIS means it runs on GPUs, without the need to manage infrastructure and model resources yourself.

Create an inference endpoint that references the jina-embeddings-v3 model in the model_id field.

    PUT _inference/text_embedding/eis-jina-embeddings-v3
    {
      "service": "elastic",
      "service_settings": {
        "model_id": "jina-embeddings-v3"
      }
    }

The created inference endpoint uses the model for inference operations on the Elastic Inference Service. You can reference the inference_id of the endpoint in index mappings for the semantic_text field type, text_embedding inference tasks, or search queries.
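For example, a sketch of a mapping and a semantic query that reference the endpoint (the index name, field name, and query text are placeholders):

    PUT my-jina-index
    {
      "mappings": {
        "properties": {
          "content": {
            "type": "semantic_text",
            "inference_id": "eis-jina-embeddings-v3"
          }
        }
      }
    }

    GET my-jina-index/_search
    {
      "query": {
        "semantic": {
          "field": "content",
          "query": "multilingual semantic search"
        }
      }
    }

At search time, the semantic query embeds the query text with the same endpoint, so query and document vectors come from the same model.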