Building a Multimodal RAG system with Elasticsearch: The story of Gotham City

Learn how to build a Multimodal Retrieval-Augmented Generation (RAG) system that integrates text, audio, image, and depth data to provide richer, contextualized information retrieval.

In this blog, you'll learn how to build a Multimodal RAG (Retrieval-Augmented Generation) pipeline using Elasticsearch. We'll explore how to leverage ImageBind to generate embeddings for various data types, including text, images, audio, and depth maps. You'll also discover how to efficiently store and retrieve these embeddings in Elasticsearch using dense_vector and k-NN search. Finally, we'll integrate a large language model (LLM) to analyze retrieved evidence and generate a comprehensive final report.

How does the pipeline work?

  1. Collecting clues → Images, audio, texts, and depth maps from the crime scene in Gotham.
  2. Generating embeddings → Each file is converted into a vector using the ImageBind multimodal model.
  3. Indexing in Elasticsearch → The vectors are stored for efficient retrieval.
  4. Searching by similarity → Given a new clue, the most similar vectors are retrieved.
  5. The LLM analyzes the evidence → A GPT-4 model synthesizes the response and identifies the suspect!

Technologies used

  • ImageBind → Generates unified embeddings for various modalities.
  • Elasticsearch → Enables fast and efficient vector search.
  • LLM (GPT-4, OpenAI) → Analyzes the evidence and generates a final report.

Who is this blog for?

  • Elastic users interested in multimodal vector search.
  • Developers looking to understand Multimodal RAG in practice.
  • Anyone searching for scalable solutions for analyzing data from multiple sources.

Prerequisites: Setting up the environment

To solve the crime in Gotham City, you need to set up your technology environment. Follow this step-by-step guide:

1. Technical requirements

Component           Specification
Operating system    Linux, macOS, or Windows
Python              3.10 or later
RAM                 Minimum 8GB (16GB recommended)
GPU                 Optional, but recommended for ImageBind

2. Setting up the project

All investigation materials are available on GitHub, and we'll be using Jupyter Notebook (Google Colab) for this interactive crime-solving experience. Follow these steps to get started:

Setting up with Jupyter Notebook (Google Colab)

1. Access the notebook

2. Clone the repository

# Clone the repository with the multimodal RAG code
!git clone https://github.com/elastic/elasticsearch-labs.git
# Navigate to the project directory
%cd elasticsearch-labs/supporting-blog-content/building-multimodal-rag-with-elasticsearch-gotham

3. Install dependencies

# Install PyTorch and related libraries
!pip install "torch>=2.1.0" "torchvision>=0.16.0" "torchaudio>=2.1.0"

# Install vision processing libraries
!pip install opencv-python-headless pillow numpy

# Install the specific ImageBind fork
!pip install git+https://github.com/hkchengrex/ImageBind.git

# Install Elasticsearch and environment management
!pip install elasticsearch python-dotenv

4. Configure credentials

# Input your credentials securely
import getpass
ELASTICSEARCH_URL = input("Enter the Elasticsearch endpoint url: ")
ELASTICSEARCH_API_KEY = getpass.getpass("Enter the Elasticsearch API key: ")
OPENAI_API_KEY = getpass.getpass("Enter the OpenAI API key: ")
# Configure environment variables
import os
os.environ["ELASTICSEARCH_API_KEY"] = ELASTICSEARCH_API_KEY
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
os.environ["ELASTICSEARCH_URL"] = ELASTICSEARCH_URL

Note: The ImageBind model (~2GB) will be downloaded automatically on the first run.
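Before diving in, it is worth confirming that the Elasticsearch credentials actually work. Here is a quick sanity check, assuming the environment variables set above:

# Optional connection check using the credentials configured above
import os
from elasticsearch import Elasticsearch

es = Elasticsearch(
    os.environ["ELASTICSEARCH_URL"],
    api_key=os.environ["ELASTICSEARCH_API_KEY"]
)
print(es.info()["version"]["number"])  # prints the cluster version if the connection succeeds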

Now that everything is set up, let's dive into the details and solve the crime!

Introduction: The crime in Gotham City

On a rainy night in Gotham City, a shocking crime shakes the city. Commissioner Gordon needs your help to unravel the mystery. Clues are scattered across different formats: blurred images, mysterious audio, encrypted texts, and even depth maps. Are you ready to use the most advanced AI technology to solve the case?

In this blog, you will be guided step by step through building a Multimodal RAG (Retrieval-Augmented Generation) system that unifies different types of data (images, audio, texts, and depth maps) into a single search space. We will use ImageBind to generate multimodal embeddings, Elasticsearch to store and retrieve these embeddings, and a Large Language Model (LLM) to analyze the evidence and generate a final report.

Fundamentals: Multimodal RAG architecture

What is a Multimodal RAG?

The rise of Multimodal Retrieval-Augmented Generation (RAG) is revolutionizing the way we interact with AI models. Traditionally, RAG systems work exclusively with text, retrieving relevant information from databases before generating responses. However, the world is not limited to text—images, videos, and audio also carry valuable knowledge. This is why multimodal architectures are gaining prominence, allowing AI systems to combine information from different formats for richer and more precise responses.

Three main approaches for Multimodal RAG

To implement a Multimodal RAG, three strategies are commonly used. Each approach has its own advantages and limitations, depending on the use case:

1. Shared vector space

Data from different modalities are mapped into a common vector space using multimodal models like ImageBind. This allows text queries to retrieve images, videos, and audio without explicit format conversion.

Advantages:

  • Enables cross-modal retrieval without requiring explicit format conversion.
  • Provides a fluid integration between different modalities, allowing direct retrieval across text, image, audio, and video.
  • Scalable for diverse data types, making it useful for large-scale retrieval applications.

Disadvantages:

  • Training requires large multimodal datasets, which may not always be available.
  • The shared embedding space may introduce semantic drift, where relationships between modalities are not perfectly preserved.
  • Bias in multimodal models can impact retrieval accuracy, depending on the dataset distribution.

2. Single grounded modality

All modalities are converted to a single format, usually text, before retrieval. For example, images are described through automatically generated captions, and audio is transcribed into text.

Advantages:

  • Simplifies retrieval, as everything is converted into a uniform text representation.
  • Works well with existing text-based search engines, eliminating the need for specialized multimodal infrastructure.
  • Can improve interpretability since retrieved results are in a human-readable format.

Disadvantages:

  • Loss of information: Certain details (e.g., spatial relationships in images, tone in audio) may not be fully captured in text descriptions.
  • Dependent on captioning/transcription quality: Errors in automatic annotations can reduce retrieval effectiveness.
  • Not optimal for purely visual or auditory queries since the conversion process might remove essential context.

3. Separate retrieval

Maintains distinct models for each modality. The system performs separate searches for each data type and later merges the results.

Advantages:

  • Allows custom optimization per modality, improving retrieval accuracy for each type of data.
  • Less reliance on complex multimodal models, making it easier to integrate existing retrieval systems.
  • Provides fine-grained control over ranking and re-ranking as results from different modalities can be combined dynamically.

Disadvantages:

  • Requires fusion of results, making the retrieval and ranking process more complex.
  • May generate inconsistent responses if different modalities return conflicting information.
  • Higher computational cost since independent searches are performed for each modality, increasing processing time.

Our choice: Shared vector space with ImageBind

Among these approaches, we chose shared vector space, a strategy that aligns perfectly with the need for efficient multimodal searches. Our implementation is based on ImageBind, a model capable of representing multiple modalities (text, image, audio, depth, and more) in a common vector space. This allows us to:

  • Perform cross-modal searches between different media formats without needing to convert everything to text.
  • Use highly expressive embeddings to capture relationships between different modalities.
  • Ensure scalability and efficiency, storing optimized embeddings for fast retrieval in Elasticsearch.

By adopting this approach, we built a robust multimodal search pipeline, where a text query can directly retrieve images or audio without additional pre-processing. This method expands practical applications from intelligent search in large repositories to advanced multimodal recommendation systems.

The following figure illustrates the data flow within the Multimodal RAG pipeline, highlighting the indexing, retrieval, and response generation process based on multimodal data:

How does the embedding space work?

Traditionally, text embeddings come from language models (e.g., BERT, GPT). Now, with native multimodal models like Meta AI’s ImageBind, we have a backbone that generates vectors for multiple modalities:

  • Text: Sentences and paragraphs are transformed into vectors of the same dimension.
  • Images (vision): Pixels are mapped into the same dimensional space used for text.
  • Audio: Sound signals are converted into embeddings comparable to images and text.
  • Depth Maps: Depth data is processed and also results in vectors.

Thus, any clue (text, image, audio, depth) can be compared to any other using vector similarity metrics like cosine similarity. If a laughing audio sample and an image of a suspect's face are “close” in this space, we can infer some correlation (e.g., the same identity).
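To make "close" concrete, here is a minimal sketch of cosine similarity computed with NumPy over two stand-in vectors (random values, not real ImageBind embeddings):

# Cosine similarity between two 1024-dimensional vectors (illustrative values only)
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

audio_vec = np.random.rand(1024)  # stand-in for an audio embedding
image_vec = np.random.rand(1024)  # stand-in for an image embedding
print(cosine_similarity(audio_vec, image_vec))  # values closer to 1.0 mean higher similarity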

Stage 1 - Collecting crime scene clues

Before analyzing the evidence, we need to collect it. The crime in Gotham left traces that may be hidden in images, audio, texts, and even depth data. Let's organize these clues to feed into our system.

What do we have?

Commissioner Gordon sent us the following files containing evidence collected from the crime scene in four different modalities:

Clue descriptions by modality

a) Images (3 photos)

  • crime_scene1.jpg, crime_scene2.jpg → Photos taken from the crime scene. They show suspicious traces on the ground.
  • suspect_spotted.jpg → Security camera image showing a silhouette running away from the scene.

b) Audio (1 recording)

  • joker_laugh.wav → A microphone near the crime scene captured a sinister laugh.

c) Text (2 messages)

  • riddle.txt, note2.txt → Some mysterious notes were found at the location, possibly left by the criminal.

d) Depth (2 depth maps)

  • depth_suspect.png → A security camera with a depth sensor captured a suspect in a nearby alley.
  • jdancing-depth.png → A security camera with a depth sensor captured a suspect going down the subway station.

These pieces of evidence are in different formats and cannot be analyzed directly in the same way. We need to transform them into embeddings—numerical vectors that will allow cross-modal comparison.

File organization

Before starting processing, we need to ensure that all clues are properly organized in the data/ directory so the pipeline runs smoothly.

Expected directory structure:

data/
├── images/
│   ├── crime_scene1.jpg
│   ├── suspect_spotted.jpg
│   ...
├── audios/
│   ├── joker_laugh.wav
│   ...
├── texts/
│   ├── riddle.txt
│   ... 
├── depths/
│   ├── depth_suspect.png

Code to verify clue organization

Before proceeding, let's ensure that all required files are in the correct location.

import os

# Base directory for clues
data_dir = "data"

# List of expected files
evidences = {
    "images": ["crime_scene1.jpg", "crime_scene2.jpg", "joker_alley.jpg"],
    "audios": ["joker_laugh.wav"],
    "texts": ["riddle.txt", "note2.txt"],
    "depths": ["depth_suspect.png", "jdancing-depth.png"]
}

# Create directories if they don't exist and warn about missing files
missing_files = False
for category, files in evidences.items():
    category_path = os.path.join(data_dir, category)
    os.makedirs(category_path, exist_ok=True)
    for file in files:
        file_path = os.path.join(category_path, file)
        if not os.path.exists(file_path):
            print(f"Warning: {file} not found in {category_path}.")
            missing_files = True

if not missing_files:
    print("All files are correctly organized!")

Running the file

python stages/01-stage/files_check.py

Expected output (if all files are correct):

All files are correctly organized!

Expected output (if any file is missing):

Warning: joker_laugh.wav not found in data/audios/
Warning: depth_suspect.png not found in data/depths/

This script helps prevent errors before we start generating embeddings and indexing them into Elasticsearch.

Stage 2 - Organizing the evidence

Generating embeddings with ImageBind

To unify the clues, we need to transform them into embeddings—vector representations that capture the meaning of each modality. We will use ImageBind, a model by Meta AI that generates embeddings for different data types (images, audio, text, and depth maps) within a shared vector space.

How does ImageBind work?

To compare different types of evidence (images, audio, text, and depth maps), we need to transform them into numerical vectors using ImageBind. This model allows any type of input to be converted into the same embedding format, enabling cross-modal searches between modalities.

Below is an optimized code (src/embedding_generator.py) to generate embeddings for any type of input using the appropriate processors for each modality:

import logging

import torch
from imagebind import data
from imagebind.models import imagebind_model

logger = logging.getLogger(__name__)


class EmbeddingGenerator:
    """Class for generating multimodal embeddings using ImageBind."""

    def __init__(self):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model = self._load_model()

    def _load_model(self):
        """Loads the ImageBind model and sets it to inference mode."""
        model = imagebind_model.imagebind_huge(pretrained=True)
        model.eval()
        model.to(self.device)
        return model

    def generate_embedding(self, input_data, modality):
        """Generates an embedding for the given modality."""
        processors = {
            "vision": lambda x: data.load_and_transform_vision_data(x, self.device),
            "audio": lambda x: data.load_and_transform_audio_data(x, self.device),
            "text": lambda x: data.load_and_transform_text(x, self.device),
            "depth": self.process_depth  # custom depth preprocessing, defined in src/embedding_generator.py
        }
        try:
            # Input type verification
            if not isinstance(input_data, list):
                raise ValueError(f"Input data must be a list. Received: {type(input_data)}")

            # Convert input data to a tensor format that the model can process
            # For images: [batch_size, channels, height, width]
            # For audio: [batch_size, channels, time]
            # For text: [batch_size, sequence_length]
            inputs = {modality: processors[modality](input_data)}

            with torch.no_grad():
                embedding = self.model(inputs)[modality]

            return embedding.squeeze(0).cpu().numpy()
        except Exception as e:
            logger.error(f"Error generating {modality} embedding: {str(e)}", exc_info=True)
            raise

A tensor is a fundamental data structure in machine learning and deep learning, especially when working with models like ImageBind. In our context:

inputs = {modality: processors[modality](input_data)}

Here, the tensor represents the input data (image, audio, or text) converted into a mathematical format that the model can process. Specifically:

  • For images: The tensor represents the image as a multidimensional matrix of numerical values (pixels organized by height, width, and color channels).
  • For audio: The tensor represents sound waves as a sequence of amplitudes over time.
  • For text: The tensor represents words or tokens as numerical vectors.
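If you want to inspect one of these tensors yourself, the preprocessing helpers can be called directly. The sketch below assumes the same imagebind.data helpers used by EmbeddingGenerator and one of the crime scene images from Stage 1; the exact shape depends on the model's preprocessing:

# Inspect the tensor produced for an image (shape is illustrative)
import torch
from imagebind import data

device = "cuda" if torch.cuda.is_available() else "cpu"
vision_tensor = data.load_and_transform_vision_data(["data/images/crime_scene1.jpg"], device)
print(vision_tensor.shape)  # e.g., torch.Size([1, 3, 224, 224]) -> [batch, channels, height, width]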

Testing embedding generation:

Let's test our embedding generation with the following code. Save it as stages/02-stage/test_embedding_generation.py and execute it with this command:

python stages/02-stage/test_embedding_generation.py

The core of the script:

# Generate an embedding for one crime scene photo (the input must be a list)
generator = EmbeddingGenerator()
image_embedding = generator.generate_embedding(["data/images/crime_scene1.jpg"], "vision")
print(image_embedding.shape)

Expected output:

(1024,)

Now, the image has been transformed into a 1024-dimensional vector.
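The same generator handles the other modalities. For example, assuming the audio file from Stage 1 is in place, you can generate audio and text embeddings the same way and confirm they share the embedding space:

# Embeddings for other modalities using the same EmbeddingGenerator
audio_embedding = generator.generate_embedding(["data/audios/joker_laugh.wav"], "audio")
text_embedding = generator.generate_embedding(["Why so serious?"], "text")
print(audio_embedding.shape, text_embedding.shape)  # both should be (1024,)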

Stage 3 - Storage and search in Elasticsearch

Now that we have generated the embeddings for the evidence, we need to store them in a vector database to enable efficient searches. For this, we will use Elasticsearch, which supports dense vectors (dense_vector) and allows similarity searches.

This step consists of two main processes:

  • Indexing the embeddings → Stores the generated vectors in Elasticsearch.
  • Similarity search → Retrieves the most similar records to a new piece of evidence.

Indexing the evidence in Elasticsearch

Each piece of evidence processed by ImageBind (image, audio, text, or depth) is converted into a 1024-dimensional vector. We need to store these vectors in Elasticsearch to enable future searches.

The following code (src/elastic_manager.py) creates an index in Elasticsearch and configures the mapping to store the embeddings.

from elasticsearch import Elasticsearch, helpers
...

class ElasticsearchManager:
    """Manages multimodal operations in Elasticsearch"""

    def __init__(self):
        load_dotenv()  # Load variables from .env
        self.es = self._connect_elastic()
        self.index_name = "multimodal_content"
        self._setup_index()

    def _connect_elastic(self):
        """Connects to Elasticsearch"""
        return Elasticsearch(
            os.getenv("ELASTICSEARCH_URL"),  # Elasticsearch endpoint
            api_key=os.getenv("ELASTICSEARCH_API_KEY")
        )

    def _setup_index(self):
        """Sets up the index if it doesn't exist"""
        if not self.es.indices.exists(index=self.index_name):
            mapping = {
                "mappings": {
                    "properties": {
                        "embedding": {
                            "type": "dense_vector",
                            "dims": 1024,
                            "index": True,
                            "similarity": "cosine"
                        },
                        "modality": {"type": "keyword"},
                        "content": {"type": "binary"},
                        "description": {"type": "text"},
                        "metadata": {"type": "object"},
                        "content_path": {"type": "text"}
                    }
                }
            }
            self.es.indices.create(index=self.index_name, body=mapping)

    def index_content(self, embedding, modality, content=None, description="", metadata=None, content_path=None):
        """Indexes multimodal content"""
        doc = {
            "embedding": embedding.tolist(),
            "modality": modality,
            "description": description,
            "metadata": metadata or {},
            "content_path": content_path
        }
        if content:
            doc["content"] = base64.b64encode(content).decode() if isinstance(content, bytes) else content
        return self.es.index(index=self.index_name, document=doc)

    def search_similar(self, query_embedding, modality=None, k=5):
        """Searches for similar contents"""
        query = {
            "knn": {
                "field": "embedding",
                "query_vector": query_embedding.tolist(),
                "k": k,
                "num_candidates": 100,
                "filter": [{"term": {"modality": modality}}] if modality else []
            }
        }
        try:
            response = self.es.search(
                index=self.index_name,
                query=query,
                size=k
            )
            # Return both source data and score for each hit
            return [{
                **hit["_source"],
                "score": hit["_score"]
            } for hit in response["hits"]["hits"]]
        except Exception as e:
            print(f"Error processing search_evidence: {str(e)}")
            return "Error generating search evidence"

Running the indexing

Now, let's index a piece of evidence to test the process.

# Example: Indexing an image from the crime scene
generator = EmbeddingGenerator()
es_manager = ElasticsearchManager()  # credentials are read from the environment variables
image_embedding = generator.generate_embedding(["data/images/crime_scene1.jpg"], "vision")
response = es_manager.index_content(
    embedding=image_embedding,
    modality="vision",
    description="Photo of the crime scene with suspicious traces",
    content_path="data/images/crime_scene1.jpg"
)
print(response["result"])  # "created" on first insertion

Expected output in Elasticsearch (summary of the indexed document):

{
  "embedding": [0.12, -0.53, 0.89, ...],
  "modality": "vision",
  "description": "Photo of the crime scene with suspicious traces",
  "content_path": "data/images/crime_scene1.jpg"
}

To index all multimodal evidence, please execute the following Python command:

python stages/03-stage/index_all_modalities.py
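Conceptually, that script loops over the collected files, generates an embedding per modality, and calls index_content for each one. Here is a simplified sketch of that loop (the actual script in the repository may differ in file lists and descriptions):

# Simplified indexing loop (illustrative; see the repository script for the full version)
from embedding_generator import EmbeddingGenerator
from elastic_manager import ElasticsearchManager

generator = EmbeddingGenerator()
es_manager = ElasticsearchManager()

evidence_files = {
    "vision": ["data/images/crime_scene1.jpg", "data/images/crime_scene2.jpg"],
    "audio": ["data/audios/joker_laugh.wav"],
    "text": ["data/texts/riddle.txt", "data/texts/note2.txt"],
    "depth": ["data/depths/depth_suspect.png", "data/depths/jdancing-depth.png"],
}

for modality, files in evidence_files.items():
    for file_path in files:
        if modality == "text":
            # For text, embed the file contents rather than the file path
            with open(file_path) as f:
                embedding = generator.generate_embedding([f.read()], modality)
        else:
            embedding = generator.generate_embedding([file_path], modality)

        es_manager.index_content(
            embedding=embedding,
            modality=modality,
            description=f"Evidence file: {file_path}",
            content_path=file_path,
        )
        print(f"Indexed {file_path} as {modality}")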

Now, the evidence is stored in Elasticsearch and is ready to be retrieved when needed.

Verifying the indexing process

After running the indexing script, let's verify if all our evidence was correctly stored in Elasticsearch. You can use Kibana's Dev Tools to run some verification queries:

1. First, check if the index was created:

GET _cat/indices/multimodal_content?v

2. Then, verify the document count per modality:

GET multimodal_content/_search
{
  "size": 0,
  "aggs": {
    "modalities": {
      "terms": {
        "field": "modality"
      }
    }
  }
}

3. Finally, examine the indexed document structure:

GET multimodal_content/_search
{
  "size": 1,
  "query": {
    "match_all": {}
  }
}

Expected results:

  • An index named `multimodal_content` should exist.
  • Around 7 documents distributed across different modalities (vision, audio, text, depth).
  • Each document should contain: embedding, modality, description, metadata, and content_path fields.

This verification step ensures that our evidence database is properly set up before we proceed with the similarity searches.

Searching for similar evidence in Elasticsearch

Now that the evidence has been indexed, we can perform searches to find the most similar records to a new clue. This search uses vector similarity to return the closest records in the embedding space.

The following code performs this search.

def search_similar_evidence(self, query_embedding, k=5, modality=None):
    """Performs a kNN search to find the most similar clues."""
    knn_query = {
        "field": "embedding",
        "query_vector": query_embedding.tolist(),
        "k": k,
        "num_candidates": 100
    }
    query_body = {"knn": knn_query}
    if modality:
        query_body = {
            "bool": {
                "must": [
                    query_body,
                    {"term": {"modality": modality}}
                ]
            }
        }
    try:
        results = self.es.search(
            index=self.index_name,
            query=query_body,
            _source_includes=["description", "modality", "content_path"],
            size=k
        )
    except Exception as e:
        print(f"Error processing search_evidence: {str(e)}")
        return "Error generating search evidence"
    return results["hits"]["hits"]

Testing the search - Using audio as a query for multimodal results

Now, let's test the search for evidence using a suspicious audio file. We need to generate an embedding for the file in the same way and search for similar embeddings:

python stages/03-stage/search_by_audio.py

# Initialize classes
generator = EmbeddingGenerator()
es_manager = ElasticsearchManager()

# Generate embedding for a suspicious audio clip
audio_embedding = generator.generate_embedding(["data/audios/mysterious_laugh.wav"], "audio")

# Search for similar evidence in Elasticsearch
similar_evidences = es_manager.search_similar_evidence(audio_embedding, k=3)

# Display the retrieved results
print("\n🔎 Similar evidence found:\n")
for i, evidence in enumerate(similar_evidences, start=1):
    description = evidence['_source']['description']
    modality = evidence['_source']['modality']
    score = evidence['_score']
    content_path = evidence['_source'].get('content_path', 'N/A')
    print(f"{i}. {description} ({modality})")
    print(f"   Similarity: {score:.4f}")
    print(f"   File path: {content_path}\n")

Expected output in the terminal:

🔎 Similar evidence found:
1. A sinister laugh captured near the crime scene (audio)
Similarity: 0.9985
File path: data/audios/joker_laugh.wav
2. The Joker with green hair, white face paint, and a sinister smile in an urban night setting. (vision)
Similarity: 0.6068
File path: data/images/joker_laughing.png
3. Suspect dancing (vision)
Similarity: 0.5591
File path: data/images/jdancing.png

Now, we can analyze the retrieved evidence and determine its relevance to the case.

Beyond audio - Exploring multimodal searches

Reversing the roles: Any modality can be a "question"

In our Multimodal RAG system, every modality is a potential search query. Let's go beyond the audio example and explore how other data types can initiate investigations.

1. Searching by text (deciphering the criminal’s note)

Scenario: You found an encrypted text message and want to find related evidence.

python stages/03-stage/search_by_text.py

# Generate embedding from text
text = "Why so serious?"
embedding_text = generator.generate_embedding([text], "text")

# Search for related evidence
similar_evidences = es_manager.search_similar(
    query_embedding=embedding_text,
    k=3
)

Expected results:

🔎 Similar evidence found:

1. Mysterious note found at the location (text)
   Similarity: 0.7639
   File path: data/texts/riddle.txt

2. The Joker with green hair, white face paint, and a sinister smile in an urban night setting. (vision)
   Similarity: 0.7161
   File path: data/images/joker_laughing.png

3. Why so serious (text)
   Similarity: 0.7132
   File path: data/texts/note2.txt

2. Image search (tracking the suspicious crime scene)

Scenario: A new crime scene (crime_scene2.jpg) needs to be compared with other evidence.

python stages/03-stage/search_by_image.py

# Generate embedding for a suspicious image
vision_embedding = generator.generate_embedding(["data/images/crime_scene2.jpg"], "vision")

# Search for similar evidence in Elasticsearch
similar_evidences = es_manager.search_similar(
    query_embedding=vision_embedding,
    k=3
)

Output:

🔎 Similar evidence found:

1. Photo of the crime scene: A dark, rain-soaked alley is filled with playing cards, while a sinister graffiti of the Joker laughing stands out on the brick wall. (vision)
   Similarity: 0.8258
   File path: data/images/crime_scene1.jpg

2. The Joker with green hair, white face paint, and a sinister smile in an urban night setting. (vision)
   Similarity: 0.6897
   File path: data/images/joker_laughing.png

3. Suspect dancing (vision)
   Similarity: 0.6588
   File path: data/images/jdancing.png

3. Depth map search (3D pursuit)

Scenario: A depth map (jdancing-depth.png) may reveal the suspect's escape pattern.

python stages/03-stage/search_by_depth.py

# Generate embedding for a suspicious depth map
depth_embedding = generator.generate_embedding(["data/depths/jdancing-depth.png"], "depth")

# Search for similar evidence in Elasticsearch, restricted to images
similar_evidences = es_manager.search_similar(
    query_embedding=depth_embedding,
    modality="vision",
    k=3
)

Output:

🔎 Similar evidence found:

1. The Joker with green hair, white face paint, and a sinister smile in an urban night setting. (vision)
   Similarity: 0.5329
   File path: data/images/joker_laughing.png

2. Photo of the crime scene: A dark, rain-soaked alley is filled with playing cards, while a sinister graffiti of the Joker laughing stands out on the brick wall. (vision)
   Similarity: 0.5053
   File path: data/images/crime_scene1.jpg

3. Suspect dancing (vision)
   Similarity: 0.4859
   File path: data/images/jdancing.png

Why does this matter?

Each modality reveals unique connections:

  • Text → Linguistic patterns of the suspect.
  • Images → Recognition of locations and objects.
  • Depth → 3D scene reconstruction.

Now, we have a structured evidence database in Elasticsearch, enabling us to store and retrieve multimodal evidence efficiently.

Summary of what we've done:

  • Stored multimodal embeddings in Elasticsearch.
  • Performed similarity searches, finding evidence related to new clues.
  • Tested the search using a suspicious audio file, ensuring the system works correctly.

Next step: We will use an LLM (Large Language Model) to analyze the retrieved evidence and generate a final report.

Stage 4 - Connecting the dots with the LLM

Now that the evidence has been indexed in Elasticsearch and can be retrieved by similarity, we need an LLM (Large Language Model) to analyze it and generate a final report to send to Commissioner Gordon. The LLM will be responsible for identifying patterns, connecting clues, and suggesting a possible suspect based on the retrieved evidence.

For this task, we will use GPT-4 Turbo, formulating a detailed prompt so that the model can interpret the results efficiently.

LLM integration

To integrate the LLM into our system, we created the LLMAnalyzer class (src/llm_analyzer.py), which receives the retrieved evidence from Elasticsearch and generates a forensic report using this evidence as the prompt context.

import os
import logging

from openai import OpenAI
from dotenv import load_dotenv

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class LLMAnalyzer:
    """Evidence analyzer using GPT-4"""

    def __init__(self):
        load_dotenv()
        self.client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

    def analyze_evidence(self, evidence_results):
        """
        Analyzes multimodal search results and generates a report

        Args:
            evidence_results: Dict with results by modality
                {
                    'vision': [...],
                    'audio': [...],
                    'text': [...],
                    'depth': [...]
                }
        """
        # Format evidence for the prompt
        evidence_summary = self._format_evidence(evidence_results)

        # Final prompt
        prompt = f"""
You are a highly experienced forensic detective specializing in multimodal evidence analysis. Your task is to analyze the collected evidence (audio, images, text, depth maps) and conclusively determine the **prime suspect** responsible for the Gotham Central Bank case.
---
### **Collected Evidence:**
{evidence_summary}
### **Task:**
1. **Analyze all the evidence** and identify cross-modal connections.
2. **Determine the exact identity of the criminal** based on behavioral patterns, visual/auditory/textual clues, and symbolic markers.
3. **Justify your conclusion** by explaining why this suspect is definitively responsible.
4. **Assign a confidence score (0-100%)** to your conclusion.
---
### **Final Output Format (Strictly Follow This Format):**
- **Prime Suspect:** [Full Name or Alias]
- **Evidence Supporting Conclusion:** [Detailed breakdown of visual, auditory, textual, and behavioral evidence]
- **Behavioral Patterns:** [Key actions, motives, and criminal signature]
- **Confidence Level:** [0-100%]
- **Next Steps (if any):** [What additional evidence would further confirm the identity? If none, state "No further evidence required."]
If there is **insufficient evidence**, specify exactly what is missing and suggest what additional data would be needed for a conclusive identification.
This report must be **direct and definitive**--avoid speculation and provide a final, actionable determination of the suspect's identity.
"""
        try:
            response = self.client.chat.completions.create(
                model="gpt-4-turbo-preview",
                messages=[
                    {
                        "role": "system",
                        "content": "You are a forensic detective specialized in multimodal evidence analysis."
                    },
                    {"role": "user", "content": prompt}
                ],
                temperature=0.5,
                max_tokens=1000
            )
            report = response.choices[0].message.content

            logger.info("\n📋 Forensic Report Generated:")
            logger.info("=" * 50)
            logger.info(report)
            logger.info("=" * 50)
            return report
        except Exception as e:
            logger.error(f"Error generating report: {str(e)}")
            return None

Temperature setting in LLM analysis:

For our forensic analysis system, we use a moderate temperature of 0.5. This balanced setting was chosen because:

  • It represents a middle ground between deterministic (too rigid) and highly random outputs;
  • At 0.5, the model maintains enough structure to provide logical and justifiable forensic conclusions;
  • This setting allows the model to identify patterns and make connections while staying within reasonable forensic analysis parameters;
  • It balances the need for consistent, reliable outputs with the ability to generate insightful analysis.

This moderate temperature setting helps ensure our forensic analysis is both reliable and insightful, avoiding both overly rigid and overly speculative conclusions.

Running the evidence analysis

Now that we have the LLM integration, we need a script that connects all system components. This script will:

  • Search for similar evidence in Elasticsearch.
  • Analyze the retrieved evidence using the LLM to generate a final report.

Code: Evidence analysis script

python stages/04-stage/rag_crime_analyze.py
import sys
import os
sys.path.append(os.path.join(os.path.dirname(os.path.dirname(__file__)), 'src'))

from embedding_generator import EmbeddingGenerator
from elastic_manager import ElasticsearchManager
from llm_analyzer import LLMAnalyzer

import json
import logging
from dotenv import load_dotenv

# Setup logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Load environment variables
load_dotenv()

# Initialize classes
generator = EmbeddingGenerator()
es_manager = ElasticsearchManager()

llm = LLMAnalyzer()
logger.info("✅ All components initialized successfully")
    
try:
    evidence_data = {}
    
    # Get data for each modality
    test_files = {
        'vision': 'data/images/crime_scene2.jpg',
        'audio': 'data/audios/joker_laugh.wav',
        'text': 'Why so serious?',
        'depth': 'data/depths/jdancing-depth.png'
    }
    
    logger.info("🔍 Collecting evidence...")
    for modality, test_input in test_files.items():
        try:
            if modality == 'text':
                embedding = generator.generate_embedding([test_input], modality)
            else:
                embedding = generator.generate_embedding([str(test_input)], modality)
            
            results = es_manager.search_similar(embedding, k=2)
            if results:
                evidence_data[modality] = results
                logger.info(f"✅ Data retrieved for {modality}: {len(results)} results")
            else:
                logger.warning(f"⚠️ No results found for {modality}")
                
        except Exception as e:
            logger.error(f"❌ Error retrieving {modality} data: {str(e)}")
    
    if not evidence_data:
        raise ValueError("No evidence data found in Elasticsearch!")
    
    # Test forensic report generation
    logger.info("\n📝 Generating forensic report...")
    report = llm.analyze_evidence(evidence_data)
    
    if report:
        logger.info("✅ Forensic report generated successfully")
        logger.info("\n📊 Report Preview:")
        logger.info("+" * 50)
        logger.info(report)
        logger.info("+" * 50)
    else:
        raise ValueError("Failed to generate forensic report")
        
except Exception as e:
    logger.error(f"❌ Error in analysis : {str(e)}")

Expected LLM output

**Prime Suspect:** The Joker
**Evidence Supporting Conclusion:**
- **Visual Evidence:**
- The photo of the crime scene with playing cards scattered around and the graffiti of the Joker laughing matches the Joker's known calling cards and thematic elements. The similarity score of 0.83 indicates a high likelihood that these elements are directly associated with the Joker.
- The image of the Joker with green hair, white face paint, and a sinister smile in an urban night setting, although with a lower similarity score of 0.69, still supports the presence or recent activity of the Joker in areas consistent with the crime scene's characteristics.
- **Auditory Evidence:**
- The captured sinister laugh with a similarity score of 1.00 perfectly matches known audio profiles of the Joker, making it a direct auditory signature of his presence at or near the crime scene.
- Despite the lower similarity score of 0.61, the second audio piece further corroborates the Joker's involvement through thematic consistency.
- **Textual Evidence:**
- The mysterious note found at the location, with a similarity score of 0.76, likely contains thematic or direct references to the Joker's modus operandi or signature phrases, further implicating him in the crime.
- The similarity score of 0.72 for the Joker's description in textual evidence reinforces the thematic connection to the crime scene.
- **Depth Evidence:**
- Depth sensor capture of the suspect with a similarity score of 0.77 suggests a physical presence matching the Joker's known dimensions or characteristic movements.
- The lower similarity score of 0.53 in the second depth evidence still contributes to the overall pattern of evidence pointing towards the Joker, albeit with less certainty.
**Behavioral Patterns:**
- The Joker is known for his theatrical crimes, often leaving behind a signature trail of chaos, including playing cards, sinister laughter, and thematic graffiti. These elements are not only consistent with his known criminal signature but also directly observed at the crime scene.
- His motives often include creating chaos, drawing attention to his acts, and challenging his arch-nemesis, Batman, making a high-profile bank heist fitting within his behavioral patterns.
**Confidence Level:** 95%
**Next Steps:** No further evidence required.
The combination of visual, auditory, textual, and depth evidence strongly points to the Joker as the prime suspect. The thematic consistency across multiple modes of evidence, combined with known behavioral patterns and criminal signature, leaves little doubt regarding his involvement. While there is always a small margin of uncertainty in forensic analysis, the evidence at hand provides a compelling case against the Joker with a high degree of confidence.

Conclusion: Case solved

With all the clues gathered and analyzed, the Multimodal RAG system has identified a suspect: The Joker.

By combining images, audio, text, and depth maps into a shared vector space using ImageBind, the system was able to detect connections that would have been impossible to identify manually. Elasticsearch ensured fast and efficient searches, while the LLM synthesized the evidence into a clear and conclusive report.

However, the true power of this system goes beyond Gotham City. The Multimodal RAG architecture opens doors to numerous real-world applications:

  • Urban surveillance: Identifying suspects based on images, audio, and sensor data.
  • Forensic analysis: Correlating evidence from multiple sources to solve complex crimes.
  • Multimedia recommendation: Creating recommendation systems that understand multimodal contexts (e.g., suggesting music based on images or text).
  • Social media trends: Detecting trending topics across different data formats.

Now that you’ve learned how to build a Multimodal RAG system, why not test it with your own clues?

Share your discoveries with us and help the community advance in the field of multimodal AI!

Special thanks

I would like to thank Adrian Cole for his valuable contribution and review during the process of defining the deployment architecture of this code.

References

Elasticsearch has native integrations to industry-leading Gen AI tools and providers. Check out our webinars on going beyond RAG basics, or on building production-ready apps with the Elastic Vector Database.

To build the best search solutions for your use case, start a free cloud trial or try Elastic on your local machine now.
