Understanding BSI IT Grundschutz: A recipe for GenAI powered search on your (private) PDF treasure

Ever wondered how to make years of collected documents searchable by meaning, not just keywords? Dealing with PDFs full of valuable information can be challenging, especially when it comes to chunking and creating searchable data across multiple languages.

This blog post will guide you through transforming your document collection into an AI-powered semantic search system. We'll explore how Elastic offers an end-to-end solution for RAG: ingesting PDFs, processing them into chunks, vectorizing the content, and providing an interactive playground to query and interact with your data. Discover how to make your information not just more accessible, but truly conversational. You will be able to create summaries of relevant information, understand the relevance score behind each answer, and gain full transparency into why exactly an answer was generated.

semantic_text field type

Semantic search enhances data discovery by understanding word meanings and context. Elastic's semantic_text field type simplifies semantic search implementation by handling the complexities behind the scenes, including inference and chunking, and it benefits from continuous improvements. It's particularly effective for paragraph-based texts: chunking is applied automatically, with customizable options planned for the future.

For complex documents requiring different chunking strategies - such as those with intricate layouts, embedded images, or unique formatting - consider using tools like Apache Tika, Unstructured, or Tesseract OCR to pre-process the content.

Prerequisites

To implement the techniques described in this blog, you'll need:

  1. An Elastic cloud deployment optimized for Vector Search
    OR
    An on-premise Elasticsearch Cluster with:

    • Data nodes configured for search (1:15 ratio)
    • ML nodes with at least 8 GB RAM
    • Elastic release 8.15 or later
  2. Access to a Generative AI service:

    • OpenAI or AWS Bedrock (used in this example)
      OR
    • For on-premise setups: An OpenAI-compatible SDK (e.g., localai.io or LM Studio) to access a locally hosted LLM
  3. PDF files you would like to access with Generative AI powered search.

This setup will provide the necessary infrastructure to process and interact with your data effectively.

Configurations in Elastic

To begin, we'll load and start our multi-language Embedding Model. For this blog, we're using the compact E5 model, which is readily available for download and deployment in your Elastic Cloud environment.

Follow these steps:

  1. Open Kibana
  2. Navigate to Analytics > Machine Learning > Trained Models
  3. Select "Model Management" and click on the "Add Trained Models" button
[Image: the Trained Models view in Kibana]

Select the E5-small model (ID: .multilingual-e5-small) from the list and click "Download". After a few minutes, when the model is loaded, deploy it by clicking the start/deploy symbol.
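
If you prefer the Dev Console, you can also check there whether the model has been downloaded and started. A minimal check, using the model ID from above, is the trained model stats API, which reports the deployment state:

GET _ml/trained_models/.multilingual-e5-small/_stats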

Next, navigate to the Dev Console to set up the inference service:

  1. Open the Dev Console
  2. Create an inference service pointing to our embedding model with the following command:
PUT _inference/text_embedding/my-e5-model
{
  "service": "elasticsearch",
  "service_settings": {
    "num_allocations": 1,
    "num_threads": 1,
    "model_id": ".multilingual-e5-small"
  }
}
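
To confirm that the inference endpoint works, you can send a short test sentence through it and check that an embedding is returned (the input text below is arbitrary):

POST _inference/text_embedding/my-e5-model
{
  "input": "Ein kurzer Testsatz über Informationssicherheit"
}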

With the inference service running, let's create a mapping for our destination index, incorporating the semantic_text field type:

PUT grundschutz-embeddings
{
  "mappings": {
    "properties": {
      "semantic_text": { 
        "type": "semantic_text", 
        "inference_id": "my-e5-model" 
      },
      "attachment.content": { 
        "type": "text",
        "copy_to": "semantic_text" 
      }
    }
  }
}

Finally, we create an ingest pipeline to deal with the binary content of the PDF documents:

PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "remove_binary": true
      }
    }
  ]
}

The pipeline incorporates the attachment processor, which uses Apache Tika to extract textual information from binary content such as PDFs and other document formats. Combined, the index mapping with the semantic_text field and this ingest pipeline allow us to index PDFs, create chunks, and generate embeddings in a single step (see the next section).
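
Before loading the whole document set, you can verify the complete chain with a single test document. The base64 string below simply encodes the text "Hello Grundschutz" (any base64-encoded file works, and the document ID is arbitrary); indexing it through the pipeline should produce an extracted attachment.content field and, via copy_to, the chunks and embeddings for the semantic_text field:

PUT grundschutz-embeddings/_doc/pipeline-test?pipeline=attachment
{
  "data": "SGVsbG8gR3J1bmRzY2h1dHo="
}

GET grundschutz-embeddings/_doc/pipeline-test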

With this setup in place, we're now ready to begin indexing documents.

Pushing and Indexing “IT Grundschutz” to Elastic

For this example, we'll use PDFs from the German Federal Office for Information Security (BSI)'s IT Baseline Protection ("Grundschutz Kompendium").

German critical infrastructure (KRITIS) organizations and federal authorities are required to build their security environments based on these BSI standards. However, the structure of these PDFs often makes it challenging to align existing measures with the guidelines. Traditional keyword searches often fall short when trying to find specific explanations or topics within these documents.

These documents offer an ideal test case for semantic search capabilities, challenging the system with technical content in a non-English language.

The ingest pipeline handles the PDF content extraction, while the semantic_text mapping lets Elastic manage inference and chunking automatically.

Data loading can be done in various ways. Elastic’s Python Client simplifies the process and integrates smoothly.

For classic Elastic deployments, either on-premise or managed by Elastic, the package can be installed with:

python -m pip install elasticsearch

The Python code required to transform and load your PDFs can look like this:

import os
import base64
from elasticsearch import Elasticsearch, helpers
from getpass import getpass

es = Elasticsearch(hosts=getpass("Host Address: "), api_key=getpass("Elastic API Key: "))
pipeline_id = 'attachment'  # the pipeline we have created before
pdf_dir = '/yourfolder/grundschutz'  # the folder with the documents 

# Function to convert PDF file to base64-encoded binary

def convert_pdf_to_base64(file_path):
    with open(file_path, "rb") as file:
        return base64.b64encode(file.read()).decode('utf-8')

# Function to generate actions for the bulk API
def generate_actions(pdf_dir):
    for filename in os.listdir(pdf_dir):
        if filename.lower().endswith(".pdf"):
            file_path = os.path.join(pdf_dir, filename)
            base64_encoded = convert_pdf_to_base64(file_path)
            yield {
                "_index": "grundschutz-embeddings",
                "_source": {
                    "data": base64_encoded
                }
            }
       

# Use the helpers.bulk() function to index documents in bulk
helpers.bulk(es, generate_actions(pdf_dir), pipeline=pipeline_id)

print("Finished indexing PDF documents.")

The index is set up for semantic, full-text, or hybrid search. Chunks break down large PDFs into manageable pieces, which simplifies indexing and allows for more precise searches of specific sections or topics within the document. By storing these chunks as nested objects, all related information from a single PDF is kept together within one document, ensuring comprehensive and contextually relevant search results. Everything is now ready to search:

[Image: the indexed documents in Discover]
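
If you want to try a query directly in the Dev Console before moving on to Playground, a semantic query against the semantic_text field is a good starting point. The question below is only an example and, thanks to the multilingual model, can be phrased in German, English, or another supported language:

GET grundschutz-embeddings/_search
{
  "query": {
    "semantic": {
      "field": "semantic_text",
      "query": "Welche Anforderungen gelten für die Netzwerksegmentierung?"
    }
  }
}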

Prototyping search with Playground

In this step, the aim is to see if what's been created meets practical standards. This involves using Elastic's new “Playground” functionality, which is described in more detail in this blog.

To get started with Playground, first set up connectors to one or more LLMs. You can choose from public options like OpenAI, AWS Bedrock, or Google Gemini, or use an LLM hosted privately. Testing with multiple connectors can help compare performance and pick the best one for the job.

In our use case, the information is publicly available, so using public LLM APIs doesn't pose a risk. In this example, connectors have been set up for OpenAI and AWS Bedrock:

[Image: the configured LLM connectors]

Navigate to the playground and select the index created earlier by clicking on “Add data sources” and “Save and continue”:

[Image: adding the index as a data source in Playground]

Now we are ready to start asking questions (in this case, in German) and to select one of the LLMs available through the connectors:

[Image: selecting an LLM connector in Playground]

Review the answers to your questions and the documents added to the context window.

[Image: a question asked in German and the generated answer]

Since we used a multilingual embedding model to create the embeddings, it's a good idea to try interacting with the information in different languages.

[Image: the same question asked in English]

The code example provided in Playground can be leveraged further for building the search application for your users.

Summary and conclusions

We now have a straightforward approach to test the quality of a potential use case and to develop a prototype in a reasonable amount of time. Along the way, you've seen how Elastic's capabilities work and how to apply them effectively:

  • Flexibility: Easily load and use embedding models without getting bogged down in chunking and structure.

  • Speed and performance: Elastic allows for rapid development and deployment, making it easy to iterate and refine solutions efficiently. The technology is designed for speed, enabling exceptionally fast processing and quick turnaround times.

  • Transparency: See exactly how answers are derived using Playground.

[Image: the documents retrieved as context for an answer]

  • Unified Document Storage: Store structured and unstructured information together in one document, allowing easy access to key details like the original document name or author for your search application.

[Image: a stored document combining structured and unstructured fields]

So, dive in, start building your own search experience and understand how RAG might help you to gain more relevance and transparency in your Chatbot. Stay updated on the latest from Elastic by following our Search Labs.
