We want to build an application that returns images related to a given text query.

Approach 1: Direct ML Inference per Request

  • The simplest implementation:
    • Load CLIP model (text + image encoders)
    • For every user query, compute the text embedding
    • Compute image embeddings on-the-fly by running the vision model for every candidate image
    • Compute cosine similarity between text vector and each image vector
    • Return sorted top-K images.

Code:

{:ok, %{model: text_model, params: text_params}} =
  Bumblebee.load_model({:hf, "openai/clip-vit-base-patch32"},
    module: Bumblebee.Text.ClipText,
    architecture: :for_embedding
  )

{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "openai/clip-vit-base-patch32"})

inputs = Bumblebee.apply_tokenizer(tokenizer, ["yellow flowers on a table"])

# The :for_embedding architecture outputs the projected CLIP text embedding
%{embedding: embedding} = Axon.predict(text_model, text_params, inputs)

# L2-normalize so cosine similarity reduces to a dot product
text_vec = Nx.divide(embedding, Nx.LinAlg.norm(embedding, axes: [-1], keep_axes: true))
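
At query time, the naive version then embeds every candidate image and scores it against the text vector. A minimal sketch, assuming a candidate_paths list and an embed_image/1 helper that runs the CLIP vision encoder and returns an L2-normalized {512} tensor:

# Drop the batch dimension: {1, 512} -> {512}
query = Nx.squeeze(text_vec)

top_k =
  candidate_paths
  |> Enum.map(fn path ->
    image_vec = embed_image(path)
    # For L2-normalized vectors, cosine similarity is just the dot product
    {path, Nx.to_number(Nx.dot(query, image_vec))}
  end)
  |> Enum.sort_by(fn {_path, score} -> score end, :desc)
  |> Enum.take(10)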

Limitations:

  1. Image inference per request is too slow
    • CLIP vision model is heavy (ViT)
    • Computing embeddings for 10k images per query → seconds to minutes
  2. No caching / reuse of embeddings
    • Pure waste of computation for static images
  3. Similarity search via linear scan
    • Brute-force cosine similarity: O(N) per query

Approach 2: Precomputing Image Embeddings

  • Compute and store image embeddings offline
  • For each request:
    • Compute text embedding
    • Compare against pre-computed embeddings
    • No repeated image inference

Code:

{:ok, %{model: vision_model, params: vision_params}} =
  Bumblebee.load_model({:hf, "openai/clip-vit-base-patch32"},
    module: Bumblebee.Vision.ClipVision,
    architecture: :for_embedding
  )

{:ok, featurizer} = Bumblebee.load_featurizer({:hf, "openai/clip-vit-base-patch32"})

defmodule MediaSearch.ImageEmbedder do
  def embed_image(path, model, params, featurizer) do
    {:ok, img} = StbImage.read_file(path)
    inputs = Bumblebee.apply_featurizer(featurizer, img)

    # The :for_embedding architecture outputs the projected CLIP image embedding
    %{embedding: embedding} = Axon.predict(model, params, inputs)

    embedding
    |> Nx.divide(Nx.LinAlg.norm(embedding, axes: [-1], keep_axes: true))
    |> Nx.to_flat_list()
  end
end
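
With this helper, the offline step stacks all image vectors into a single matrix, and each query becomes one matrix-vector product instead of per-image inference. A minimal sketch, reusing text_vec from Approach 1 and assuming images is a list of maps with :id and :path keys:

image_ids = Enum.map(images, & &1.id)

# Computed once offline; shape {N, 512}
embedding_matrix =
  images
  |> Enum.map(&MediaSearch.ImageEmbedder.embed_image(&1.path, vision_model, vision_params, featurizer))
  |> Nx.tensor()

# Query time: score every image against the text embedding at once
scores = Nx.dot(embedding_matrix, Nx.squeeze(text_vec))

top_k =
  scores
  |> Nx.to_flat_list()
  |> Enum.zip(image_ids)
  |> Enum.sort_by(fn {score, _id} -> score end, :desc)
  |> Enum.take(10)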

Limitations:

  1. Linear scan still too slow
    • For N images, each query is O(N)
    • 25k images = noticeable lag
    • 100k+ images = unusable.
  2. Works only if dataset is small
    • Real-world production needs instant search (< 100ms)

Approach 3: ANN Search (HNSW) for Vector Indexing

  • Use HNSWLib to index image embeddings
  • Approximate k-NN queries run in roughly logarithmic time instead of a linear scan
  • Works well for 100k–10M embeddings

Code:

# count is the expected maximum number of vectors in the index
{:ok, index} =
  HNSWLib.Index.new(:cosine, 512, count, ef_construction: 200)

Enum.each(image_embeddings, fn {id, vector} ->
  HNSWLib.Index.add_items(index, Nx.tensor(vector), ids: Nx.tensor([id]))
end)

HNSWLib.Index.save_index(index, "priv/clip_index.ann")

Query Code:

{:ok, idx} = HNSWLib.Index.load_index(:cosine, 512, "priv/clip_index.ann")
{:ok, labels, dists} = HNSWLib.Index.knn_query(idx, text_vec, k: 10)
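
The returned labels are the ids supplied to add_items, so mapping them back to image records is a plain lookup. A small sketch, assuming images_by_id is an in-memory map from id to image path:

results =
  labels
  |> Nx.to_flat_list()
  |> Enum.map(fn id -> Map.fetch!(images_by_id, id) end)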

Limitations:

  1. Model inference still slow per request
    • Text embedding still requires CLIP model forward pass
    • If load is high (thousands of users), inference becomes the bottleneck
  2. No batching
    • Each request executes a separate inference pass
    • GPUs/CPUs under-utilized
  3. No automatic reindex pipeline
    • When new images arrive, their embeddings must be computed and added to the index by hand

Approach 4: Use Nx.Serving for Batching + High-Speed Inference

  • Use Nx.Serving to batch inference across concurrent requests
  • Serves the model behind a named, supervised process interface
  • Automatically batches concurrent requests within a configurable time window (batch_timeout) up to a maximum batch size
  • Can give roughly 5×–25× higher throughput, depending on hardware and batch size

Code:

def serving_text_model() do
  {:ok, model_info} =
    Bumblebee.load_model({:hf, "openai/clip-vit-base-patch32"},
      module: Bumblebee.Text.ClipText,
      architecture: :for_embedding
    )

  {:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "openai/clip-vit-base-patch32"})

  # Serving that tokenizes, runs the text encoder, and L2-normalizes the embedding
  Bumblebee.Text.text_embedding(model_info, tokenizer,
    output_attribute: :embedding,
    embedding_processor: :l2_norm
  )
end

children = [
  {Nx.Serving,
   serving: serving_text_model(),
   name: MediaSearch.TextServing,
   batch_size: 16,
   batch_timeout: 100}
]

Supervisor.start_link(children, strategy: :one_for_one)

Query Code:

%{embedding: text_vec} =
  Nx.Serving.batched_run(MediaSearch.TextServing, "cat wearing sunglasses")
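
Combined with the HNSW index from Approach 3, the whole query path collapses into one function. A sketch (the returned ids still go through the id-to-image lookup shown earlier):

defmodule MediaSearch.Search do
  def search(query_text, index, k \\ 10) do
    %{embedding: text_vec} = Nx.Serving.batched_run(MediaSearch.TextServing, query_text)
    {:ok, labels, _dists} = HNSWLib.Index.knn_query(index, text_vec, k: k)
    Nx.to_flat_list(labels)
  end
end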

Advantages:

  • High throughput inference
  • Great for web APIs
  • Runs on CPU or GPU automatically

Limitations:

  1. Still no ingestion pipeline
    • Images still need manual embedding + indexing
  2. Still a single serving per model
    • As the system grows, you need supervision trees, distributed load balancing, etc.

Approach 5: Add Broadway for Streaming Image Ingestion

Use Broadway to build a scalable ingestion pipeline:

  • Step 1: Receive messages (S3 event / HTTP upload / Kafka)
  • Step 2: Download image
  • Step 3: Compute embedding via Nx.Serving (batched)
  • Step 4: Insert into HNSW index
  • Step 5: Persist new index segments
[S3/Kafka/HTTP uploads] → Broadway → 
  Download → 
  Nx.Serving Vision Model → 
  HNSW Index Add → 
  Save Incremental Index

Code:

defmodule MediaSearch.Ingestor do
  use Broadway

  alias Broadway.Message

  def start_link(_opts) do
    Broadway.start_link(__MODULE__,
      name: __MODULE__,
      producer: [module: {BroadwaySQS.Producer, queue_url: "..."}],
      processors: [default: [max_demand: 10]],
      batchers: [embed: [batch_size: 32]]
    )
  end

  @impl true
  def handle_message(_processor, msg, _context) do
    # Assumes the message data has already been decoded into a map;
    # download/1 (not shown) fetches the image and returns a local path
    path = download(msg.data["image_url"])

    msg
    |> Message.update_data(&Map.put(&1, "local_path", path))
    |> Message.put_batcher(:embed)
  end

  @impl true
  def handle_batch(:embed, msgs, _batch_info, _context) do
    images =
      Enum.map(msgs, fn msg ->
        {:ok, img} = StbImage.read_file(msg.data["local_path"])
        img
      end)

    # One batched forward pass through the vision serving for the whole batch
    results = Nx.Serving.batched_run(MediaSearch.VisionServing, images)

    # The HNSW index handle is assumed to be stored at startup, e.g. in :persistent_term;
    # in practice you would also pass ids: so results can be mapped back to images
    index = :persistent_term.get(MediaSearch.Index)

    Enum.each(results, fn %{embedding: embedding} ->
      HNSWLib.Index.add_items(index, embedding)
    end)

    msgs
  end
end
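
The pipeline assumes a MediaSearch.VisionServing process started under the application supervisor, analogous to the text serving in Approach 4. A minimal sketch, assuming a Bumblebee version that ships Bumblebee.Vision.image_embedding/3 and the :for_embedding CLIP architecture (option names mirror the text embedding serving):

def serving_vision_model() do
  {:ok, model_info} =
    Bumblebee.load_model({:hf, "openai/clip-vit-base-patch32"},
      module: Bumblebee.Vision.ClipVision,
      architecture: :for_embedding
    )

  {:ok, featurizer} = Bumblebee.load_featurizer({:hf, "openai/clip-vit-base-patch32"})

  Bumblebee.Vision.image_embedding(model_info, featurizer,
    output_attribute: :embedding,
    embedding_processor: :l2_norm
  )
end

children = [
  {Nx.Serving,
   serving: serving_vision_model(),
   name: MediaSearch.VisionServing,
   batch_size: 32,
   batch_timeout: 100}
]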

Advantages:

  • Handles continuous ingestion
  • Batches embedding work → much higher throughput
  • Can scale consumers to multiple nodes
  • Proper backpressure

Limitations:

  1. Index updates may block queries
    • HNSW is mutable; writes compete with reads
    • High write volume → lock contention
  2. Need sharding / partitioning for very large datasets
    • For more than ~10M embeddings you likely need a multi-index architecture

Approach 6: Sharded Vector Index + Metadata Store

For very large datasets:

  • Use multiple HNSW indices (shards)
  • Partition by:
    • hash of image id
    • content type
    • date bucket
  • Run queries against multiple shards and merge results

Add a metadata DB (Postgres) to store:

  • image_id → path/url
  • tags, filters
  • versioned embedding info
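
A minimal sketch of the metadata table as an Ecto schema (field names are illustrative):

defmodule MediaSearch.Image do
  use Ecto.Schema

  schema "images" do
    field :path, :string
    field :url, :string
    field :tags, {:array, :string}
    field :embedding_version, :integer
    field :shard, :integer

    timestamps()
  end
end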

Architecture:

                ┌───────────────┐
                │ Text Encoder  │ (Nx.Serving)
                └───────┬───────┘
                        │
                 text embedding
                        │
        ┌───────────────┴────────────────┐
        │             Router             │
        │ decides which shards to query  │
        └───────┬───────────┬───────────┘
                │           │
           HNSW Shard 1   HNSW Shard 2  ...  
                │           │
                └─────── merge top-K
                        │
                 Postgres metadata lookup
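
The fan-out and merge step in the router can be sketched with Task.async_stream over the shard index handles. Here shard_indexes is assumed to be a list of loaded HNSWLib.Index handles and ids are assumed to be globally unique across shards:

defmodule MediaSearch.Router do
  @k 10

  # Query every shard concurrently and merge the per-shard results into a global top-K.
  # The surviving ids then go through the Postgres metadata lookup.
  def search(shard_indexes, text_vec) do
    shard_indexes
    |> Task.async_stream(fn index ->
      {:ok, labels, dists} = HNSWLib.Index.knn_query(index, text_vec, k: @k)
      Enum.zip(Nx.to_flat_list(labels), Nx.to_flat_list(dists))
    end)
    |> Enum.flat_map(fn {:ok, pairs} -> pairs end)
    # cosine distance: smaller means more similar
    |> Enum.sort_by(fn {_id, dist} -> dist end)
    |> Enum.take(@k)
  end

  # At ingestion time, a shard for a new image can be picked by hashing its id
  def shard_for(image_id, shard_count), do: rem(:erlang.phash2(image_id), shard_count)
end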

Advantages:

  • Massive scalability
  • Parallel search
  • Blue/green indexing possible

Limitations:

  1. Cross-shard coordination increases latency
  2. Requires index rebalancing strategies
  3. Deployment complexity higher

Summary

  • Query-time path: user query → Nx.Serving text encoder → vector → shard router → multi-shard HNSW search → merge → metadata → result

  • Ingestion-time path: image upload → Broadway → download → Nx.Serving vision encoder → HNSW add → persist → notify