With the Elasticsearch open inference API, you can perform inference outside Elasticsearch using Hugging Face's Inference Endpoints. This allows you to use Hugging Face's scalable infrastructure, including the ability to perform inference on GPUs and AI accelerators. The capability to use generated embeddings from Hugging Face was introduced as the first open inference API integration in Elasticsearch 8.11, and since then we've been hard at work updating it with more powerful features that allow you to get better results with less effort.
With the integration of the semantic_text field, documents are natively chunked and stored with their embeddings, and all stored embeddings are compressed with scalar quantization by default in the Elasticsearch vector database. Retrieving these embeddings with retrievers makes search composable when using multiple models hosted on Hugging Face (or any other service accessible through the open inference API), enabling multiple types of embeddings within a single document. All of these features save developers time by removing the need to write custom logic, and let them build fun gen AI apps more quickly!
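As a quick illustration of that composability, here is a minimal sketch of an index that stores two kinds of embeddings for the same document by mapping two semantic_text fields, each backed by its own inference endpoint. The index name and the endpoint IDs my_dense_endpoint and my_sparse_endpoint are hypothetical placeholders, and client is assumed to be an already-configured Elasticsearch Python client.

# Sketch: two semantic_text fields, each backed by a different
# (hypothetical) inference endpoint, so one document carries both
# dense and sparse embeddings.
client.indices.create(
    index="multi-embedding-index",
    mappings={
        "properties": {
            "dense_field": {
                "type": "semantic_text",
                "inference_id": "my_dense_endpoint",
            },
            "sparse_field": {
                "type": "semantic_text",
                "inference_id": "my_sparse_endpoint",
            },
            "text_field": {
                "type": "text",
                # copy the raw text into both semantic_text fields
                "copy_to": ["dense_field", "sparse_field"],
            },
        }
    },
)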
Disambiguation
This blog post uses the term "inference endpoint" in two different ways:
- Hugging Face's Inference Endpoints service, and
- Elasticsearch's open inference API inference endpoint objects.
Hugging Face's Inference Endpoints service provides compute instances that run Hugging Face Transformers models, while Elasticsearch inference endpoint objects store the configuration Elasticsearch uses to access the Hugging Face Inference Endpoints service.
What is the Elasticsearch open inference API?
The open inference API is your gateway to performing inference in Elasticsearch. It allows you to use machine learning models and services outside of Elasticsearch without having to write any messy glue code. All you need to do is supply an API key and create an inference endpoint object. With the Elasticsearch open inference API, you can perform inference on LLMs using the completion task, generate dense or sparse text embeddings using the text_embedding or sparse_embedding tasks, or rank documents using the rerank task.
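To make the task types a bit more concrete, here is a small sketch that creates a completion endpoint and runs a single completion against it. The OpenAI service, the model ID, and the endpoint name are illustrative assumptions rather than something this post sets up; the text_embedding case is covered step by step below.

# Sketch: a completion endpoint (service and model chosen for illustration)
client.inference.put(
    task_type="completion",
    inference_id="my_completion_endpoint",
    body={
        "service": "openai",
        "service_settings": {
            "api_key": "<OPENAI_API_KEY>",
            "model_id": "gpt-4o-mini",
        },
    },
)

# Run a completion through the endpoint
completion = client.inference.inference(
    inference_id="my_completion_endpoint",
    input="Summarize the benefits of semantic search in one sentence.",
)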
What is the Hugging Face Inference Endpoint service?
Hugging Face's Inference Endpoints service allows you to deploy and run Hugging Face Transformers models in the cloud. Check out Hugging Face's guide to creating your own endpoint: https://huggingface.co/docs/inference-endpoints/guides/create_endpoint.
- Make sure to set the Task to match the model you are deploying and the field type you will be mapping in Elasticsearch.
- Make sure to copy/take note of the Endpoint URL.
- Create a User Access Token (also known as an API key) to authenticate your requests to the endpoint at https://huggingface.co/settings/tokens. For better security, choose a Fine-grained Access Token so you grant only the required scope.
- Make sure to securely copy/take note of the API key (access token); you can use it to quickly verify the endpoint, as shown in the sketch after this list.
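Before wiring the endpoint into Elasticsearch, it can be handy to sanity-check it directly with the URL and token you just noted. The sketch below assumes a text-embeddings endpoint created with the sentence-embeddings task; the exact response shape depends on the model and task you chose.

import requests

# Placeholders for the values copied from the Hugging Face UI above
HF_ENDPOINT_URL = "<URL_TO_HUGGING_FACE_ENDPOINT>"
HF_API_KEY = "<HF_API_KEY>"

response = requests.post(
    HF_ENDPOINT_URL,
    headers={
        "Authorization": f"Bearer {HF_API_KEY}",
        "Content-Type": "application/json",
    },
    json={"inputs": "a quick test sentence"},
)
response.raise_for_status()
print(response.json())  # for an embedding model, typically a list of floats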
How to use Hugging Face Inference Endpoints with the Elasticsearch open inference API
To use the Hugging Face Inference Endpoints service with the open inference API, there are three steps you need to follow:
- Create an Inference Endpoint service in Hugging Face with the model you want to use
- Create an Inference Endpoint object in Elasticsearch using the open inference API, supplying your Hugging Face API key
- Perform inference using the Inference Endpoint object, or configure an index to use semantic_text to automatically embed your documents.
Note: you can perform these same steps using cURL, any other HTTP client, or one of our other language clients.
Step 1: Create an Inference Endpoint service in Hugging Face
See https://ui.endpoints.huggingface.co for more information on how to create an Inference Endpoint service in Hugging Face.
Step 2: Create an Inference Endpoint object in Elasticsearch
client.inference.put(
    task_type="text_embedding",
    inference_id="my_hf_endpoint",
    body={
        "service": "hugging_face",
        "service_settings": {
            "api_key": "<HF_API_KEY>",
            "url": "<URL_TO_HUGGING_FACE_ENDPOINT>",
        },
    },
)
Note: the task_type is set to text_embedding (dense vector embedding) because the model we deployed to our Hugging Face inference endpoint service was a dense text embedding model (multilingual-e5-small). We also had to select the sentence-embeddings configuration when we created the endpoint in Hugging Face.
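If you want to confirm what was stored, you can fetch the endpoint configuration back; a minimal sketch (the exact fields returned can vary by Elasticsearch version):

# Retrieve the stored inference endpoint configuration
endpoint_config = client.inference.get(inference_id="my_hf_endpoint")
print(endpoint_config)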
Step 3: Perform inference using the Inference Endpoint object to access your Hugging Face inference endpoint service
dense_embedding = client.inference.inference(
    inference_id="my_hf_endpoint",
    input="this is the raw text of my document!",
)
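To peek at what came back, you can read the embedding out of the response. This is a small sketch that assumes the standard text_embedding response shape ({"text_embedding": [{"embedding": [...]}]}).

# Inspect the returned dense embedding (assumes the standard response shape)
embedding = dense_embedding["text_embedding"][0]["embedding"]
print(f"embedding dimensions: {len(embedding)}")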
Step 4: Ingest your dataset into an index with semantic text while taking advantage of native chunking
By using semantic_text fields, we can improve the speed of ingestion while taking advantage of native chunking. To do so, we need to create an index with a text field (into which we will insert our raw document text) alongside a semantic_text field that we copy the text to. When we ingest data into this index by inserting it into the text_field, the data will automatically be copied to the semantic_text field, and the documents will be natively chunked, allowing us to easily perform semantic searches.
client.indices.create(
    index="hf-semantic-text-index",
    mappings={
        "properties": {
            "infer_field": {
                "type": "semantic_text",
                "inference_id": "my_hf_endpoint",
            },
            "text_field": {
                "type": "text",
                "copy_to": "infer_field",
            },
        }
    },
)
from elasticsearch import helpers

documents = load_my_dataset()

docs = []
for doc in documents:
    docs.append({
        "_index": "hf-semantic-text-index",
        "_source": {"text_field": doc["text"]},
    })
    # Flush a batch once it reaches 100 documents
    if len(docs) >= 100:
        helpers.bulk(client, docs)
        docs = []

# Index whatever remains in the final partial batch
if docs:
    helpers.bulk(client, docs)
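Once the bulk requests have finished, a quick sanity check is to refresh the index and count what landed; a minimal sketch:

# Make newly ingested documents searchable, then verify the count
client.indices.refresh(index="hf-semantic-text-index")
doc_count = client.count(index="hf-semantic-text-index")["count"]
print(f"indexed {doc_count} documents")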
Step 5: Perform semantic search with semantic text
query = "Is it really this easy to perform semantic search?"
semantic_search_results = client.search(
    index="hf-semantic-text-index",
    query={"semantic": {"field": "infer_field", "query": query}},
)
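To see the results, iterate over the hits; this small sketch assumes the standard search response shape and prints each score with a preview of the original text.

# Print the top hits with their scores and a preview of the raw text
for hit in semantic_search_results["hits"]["hits"]:
    print(f'{hit["_score"]:.3f}  {hit["_source"]["text_field"][:80]}')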
Step 6: Rerank with Cohere to get even better results
An advantage of using Elasticsearch as your vector database is our continuously expanding support for innovative third-party capabilities. For example, it’s possible to improve your top hits by combining semantic search over embeddings created using Hugging Face models with Cohere’s reranking capabilities. To use Cohere reranking, you’ll need a Cohere API key.
client.inference.put(
    task_type="rerank",
    inference_id="my_cohere_rerank_endpoint",
    body={
        "service": "cohere",
        "service_settings": {
            "api_key": "<COHERE_API_KEY>",
            "model_id": "rerank-english-v3.0",
        },
        "task_settings": {
            "top_n": 100,
            "return_documents": True,
        },
    },
)
reranked_search_results = client.search(
    index="hf-semantic-text-index",
    retriever={
        "text_similarity_reranker": {
            "retriever": {
                "standard": {
                    "query": {
                        "semantic": {
                            "field": "infer_field",
                            "query": query,
                        }
                    }
                }
            },
            "field": "text_field",
            "inference_id": "my_cohere_rerank_endpoint",
            "inference_text": query,
            "rank_window_size": 100,
        }
    },
)
Use Hugging Face Inference Endpoints service with Elastic today!
Try out this notebook to get started with our Hugging Face Inference Endpoints Integration: Index millions of documents with GPU-accelerated inference using Hugging Face and Elasticsearch
Elasticsearch has native integrations to industry-leading Gen AI tools and providers. Check out our webinars on going Beyond RAG Basics, or building production-ready apps with the Elastic Vector Database.
To build the best search solutions for your use case, start a free cloud trial or try Elastic on your local machine now.