We want to build an application that returns images related to a given text query.

Approach 1: Direct ML Inference per Request

  • The simplest implementation:
    • Load CLIP model (text + image encoders)
    • For every user query, compute the text embedding
    • Compute image embeddings on-the-fly by running the vision model for every candidate image
    • Compute cosine similarity between text vector and each image vector
    • Return sorted top-K images.

Code:

{:ok, %{model: text_model, params: text_params}} =
  Bumblebee.load_model({:hf, "openai/clip-vit-base-patch32"},
    module: Bumblebee.Text.ClipText,
    architecture: :for_embedding
  )

{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "openai/clip-vit-base-patch32"})

inputs = Bumblebee.apply_tokenizer(tokenizer, ["yellow flowers on a table"])

# The :for_embedding architecture outputs the projected CLIP text embedding
%{embedding: embedding} = Axon.predict(text_model, text_params, inputs)

# L2-normalize so cosine similarity reduces to a dot product
text_vec = Nx.divide(embedding, Nx.LinAlg.norm(embedding, axes: [-1], keep_axes: true))
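
At query time, the naive version then embeds every candidate image and scores it against the text vector. A minimal sketch, assuming a candidate_paths list and an embed_image/1 helper that runs the CLIP vision encoder and returns an L2-normalized {512} tensor:

# Drop the batch dimension: {1, 512} -> {512}
query = Nx.squeeze(text_vec)

top_k =
  candidate_paths
  |> Enum.map(fn path ->
    image_vec = embed_image(path)
    # For L2-normalized vectors, cosine similarity is just the dot product
    {path, Nx.to_number(Nx.dot(query, image_vec))}
  end)
  |> Enum.sort_by(fn {_path, score} -> score end, :desc)
  |> Enum.take(10)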

Limitations:

  1. Image inference per request is too slow
    • CLIP vision model is heavy (ViT)
    • Computing embeddings for 10k images per query → seconds to minutes
  2. No caching / reuse of embeddings
    • Pure waste of computation for static images
  3. Similarity search via linear scan
    • Brute-force cosine similarity: O(N) per query

Approach 2: Precomputing Image Embeddings

  • Compute and store image embeddings offline
  • For each request:
    • Compute text embedding
    • Compare against pre-computed embeddings
    • No repeated image inference

Code:

{:ok, %{model: vision_model, params: vision_params}} =
  Bumblebee.load_model({:hf, "openai/clip-vit-base-patch32"},
    module: Bumblebee.Vision.ClipVision,
    architecture: :for_embedding
  )

{:ok, featurizer} = Bumblebee.load_featurizer({:hf, "openai/clip-vit-base-patch32"})

defmodule MediaSearch.ImageEmbedder do
  def embed_image(path, model, params, featurizer) do
    {:ok, img} = StbImage.read_file(path)
    inputs = Bumblebee.apply_featurizer(featurizer, img)

    # The :for_embedding architecture outputs the projected CLIP image embedding
    %{embedding: embedding} = Axon.predict(model, params, inputs)

    embedding
    |> Nx.divide(Nx.LinAlg.norm(embedding, axes: [-1], keep_axes: true))
    |> Nx.to_flat_list()
  end
end
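
With this helper, the offline step stacks all image vectors into a single matrix, and each query becomes one matrix-vector product instead of per-image inference. A minimal sketch, reusing text_vec from Approach 1 and assuming images is a list of maps with :id and :path keys:

image_ids = Enum.map(images, & &1.id)

# Computed once offline; shape {N, 512}
embedding_matrix =
  images
  |> Enum.map(&MediaSearch.ImageEmbedder.embed_image(&1.path, vision_model, vision_params, featurizer))
  |> Nx.tensor()

# Query time: score every image against the text embedding at once
scores = Nx.dot(embedding_matrix, Nx.squeeze(text_vec))

top_k =
  scores
  |> Nx.to_flat_list()
  |> Enum.zip(image_ids)
  |> Enum.sort_by(fn {score, _id} -> score end, :desc)
  |> Enum.take(10)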

Limitations:

  1. Linear scan still too slow
    • For N images, each query is O(N)
    • 25k images = noticeable lag
    • 100k+ images = unusable.
  2. Works only if dataset is small
    • Real-world production needs instant search (< 100ms)

Approach 3: ANN Search (HNSW) for Vector Indexing

  • Use HNSWLib to index image embeddings
  • Approximate k-NN queries run in roughly logarithmic time instead of a linear scan
  • Works well for 100k–10M embeddings

Code:

# count is the expected maximum number of vectors in the index
{:ok, index} =
  HNSWLib.Index.new(:cosine, 512, count, ef_construction: 200)

Enum.each(image_embeddings, fn {id, vector} ->
  HNSWLib.Index.add_items(index, Nx.tensor(vector), ids: Nx.tensor([id]))
end)

HNSWLib.Index.save_index(index, "priv/clip_index.ann")

Query Code:

{:ok, idx} = HNSWLib.Index.load_index(:cosine, 512, "priv/clip_index.ann")
{:ok, labels, dists} = HNSWLib.Index.knn_query(idx, text_vec, k: 10)
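
The returned labels are the ids supplied to add_items, so mapping them back to image records is a plain lookup. A small sketch, assuming images_by_id is an in-memory map from id to image path:

results =
  labels
  |> Nx.to_flat_list()
  |> Enum.map(fn id -> Map.fetch!(images_by_id, id) end)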

Limitations:

  1. Model inference still slow per request
    • Text embedding still requires CLIP model forward pass
    • If load is high (thousands of users), inference becomes the bottleneck
  2. No batching
    • Each request executes a separate inference pass
    • GPUs/CPUs under-utilized
  3. No automatic reindex pipeline
    • When new images arrive, their embeddings must be computed and added to the index by hand

Approach 4: Use Nx.Serving for Batching + High-Speed Inference

  • Use Nx.Serving to batch inference across concurrent requests
  • Serves the model behind a named, supervised process interface
  • Automatically batches concurrent requests within a configurable time window (batch_timeout) up to a maximum batch size
  • Can give roughly 5×–25× higher throughput, depending on hardware and batch size

Code:

def serving_text_model() do
  {:ok, model_info} =
    Bumblebee.load_model({:hf, "openai/clip-vit-base-patch32"},
      module: Bumblebee.Text.ClipText,
      architecture: :for_embedding
    )

  {:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "openai/clip-vit-base-patch32"})

  # Serving that tokenizes, runs the text encoder, and L2-normalizes the embedding
  Bumblebee.Text.text_embedding(model_info, tokenizer,
    output_attribute: :embedding,
    embedding_processor: :l2_norm
  )
end

children = [
  {Nx.Serving,
   serving: serving_text_model(),
   name: MediaSearch.TextServing,
   batch_size: 16,
   batch_timeout: 100}
]

Supervisor.start_link(children, strategy: :one_for_one)

Query Code:

%{embedding: text_vec} =
  Nx.Serving.batched_run(MediaSearch.TextServing, "cat wearing sunglasses")
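
Combined with the HNSW index from Approach 3, the whole query path collapses into one function. A sketch (the returned ids still go through the id-to-image lookup shown earlier):

defmodule MediaSearch.Search do
  def search(query_text, index, k \\ 10) do
    %{embedding: text_vec} = Nx.Serving.batched_run(MediaSearch.TextServing, query_text)
    {:ok, labels, _dists} = HNSWLib.Index.knn_query(index, text_vec, k: k)
    Nx.to_flat_list(labels)
  end
end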

Advantages:

  • High throughput inference
  • Great for web APIs
  • Runs on CPU or GPU automatically

Limitations:

  1. Still no ingestion pipeline
    • Images still need manual embedding + indexing
  2. Still a single serving per model
    • As the system grows, you need supervision trees, distributed load balancing, etc.

Approach 5: Add Broadway for Streaming Image Ingestion

Use Broadway to build a scalable ingestion pipeline:

  • Step 1: Receive messages (S3 event / HTTP upload / Kafka)
  • Step 2: Download image
  • Step 3: Compute embedding via Nx.Serving (batched)
  • Step 4: Insert into HNSW index
  • Step 5: Persist new index segments
[S3/Kafka/HTTP uploads] → Broadway → 
  Download → 
  Nx.Serving Vision Model → 
  HNSW Index Add → 
  Save Incremental Index

Code:

defmodule MediaSearch.Ingestor do
  use Broadway

  alias Broadway.Message

  def start_link(_opts) do
    Broadway.start_link(__MODULE__,
      name: __MODULE__,
      producer: [module: {BroadwaySQS.Producer, queue_url: "..."}],
      processors: [default: [max_demand: 10]],
      batchers: [embed: [batch_size: 32]]
    )
  end

  @impl true
  def handle_message(_processor, msg, _context) do
    # Assumes the message data has already been decoded into a map;
    # download/1 (not shown) fetches the image and returns a local path
    path = download(msg.data["image_url"])

    msg
    |> Message.update_data(&Map.put(&1, "local_path", path))
    |> Message.put_batcher(:embed)
  end

  @impl true
  def handle_batch(:embed, msgs, _batch_info, _context) do
    images =
      Enum.map(msgs, fn msg ->
        {:ok, img} = StbImage.read_file(msg.data["local_path"])
        img
      end)

    # One batched forward pass through the vision serving for the whole batch
    results = Nx.Serving.batched_run(MediaSearch.VisionServing, images)

    # The HNSW index handle is assumed to be stored at startup, e.g. in :persistent_term;
    # in practice you would also pass ids: so results can be mapped back to images
    index = :persistent_term.get(MediaSearch.Index)

    Enum.each(results, fn %{embedding: embedding} ->
      HNSWLib.Index.add_items(index, embedding)
    end)

    msgs
  end
end
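
The pipeline assumes a MediaSearch.VisionServing process started under the application supervisor, analogous to the text serving in Approach 4. A minimal sketch, assuming a Bumblebee version that ships Bumblebee.Vision.image_embedding/3 and the :for_embedding CLIP architecture (option names mirror the text embedding serving):

def serving_vision_model() do
  {:ok, model_info} =
    Bumblebee.load_model({:hf, "openai/clip-vit-base-patch32"},
      module: Bumblebee.Vision.ClipVision,
      architecture: :for_embedding
    )

  {:ok, featurizer} = Bumblebee.load_featurizer({:hf, "openai/clip-vit-base-patch32"})

  Bumblebee.Vision.image_embedding(model_info, featurizer,
    output_attribute: :embedding,
    embedding_processor: :l2_norm
  )
end

children = [
  {Nx.Serving,
   serving: serving_vision_model(),
   name: MediaSearch.VisionServing,
   batch_size: 32,
   batch_timeout: 100}
]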

Advantages:

  • Handles continuous ingestion
  • Batches embedding work → much higher throughput
  • Can scale consumers to multiple nodes
  • Proper backpressure

Limitations:

  1. Index updates may block queries
    • HNSW is mutable; writes compete with reads
    • High write volume → lock contention
  2. Need sharding / partitioning for very large datasets
    • For more than ~10M embeddings you likely need a multi-index architecture

Approach 6: Sharded Vector Index + Metadata Store

For very large datasets:

  • Use multiple HNSW indices (shards)
  • Partition by:
    • hash of image id
    • content type
    • date bucket
  • Run queries against multiple shards and merge results

Add a metadata DB (Postgres) to store:

  • image_id → path/url
  • tags, filters
  • versioned embedding info
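
A minimal sketch of the metadata table as an Ecto schema (field names are illustrative):

defmodule MediaSearch.Image do
  use Ecto.Schema

  schema "images" do
    field :path, :string
    field :url, :string
    field :tags, {:array, :string}
    field :embedding_version, :integer
    field :shard, :integer

    timestamps()
  end
end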

Architecture:

                ┌───────────────┐
                │ Text Encoder  │ (Nx.Serving)
                └───────┬───────┘
                        │
                 text embedding
                        │
        ┌───────────────┴────────────────┐
        │             Router             │
        │ decides which shards to query  │
        └───────┬───────────┬───────────┘
                │           │
           HNSW Shard 1   HNSW Shard 2  ...  
                │           │
                └─────── merge top-K
                        │
                 Postgres metadata lookup
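
The fan-out and merge step in the router can be sketched with Task.async_stream over the shard index handles. Here shard_indexes is assumed to be a list of loaded HNSWLib.Index handles and ids are assumed to be globally unique across shards:

defmodule MediaSearch.Router do
  @k 10

  # Query every shard concurrently and merge the per-shard results into a global top-K.
  # The surviving ids then go through the Postgres metadata lookup.
  def search(shard_indexes, text_vec) do
    shard_indexes
    |> Task.async_stream(fn index ->
      {:ok, labels, dists} = HNSWLib.Index.knn_query(index, text_vec, k: @k)
      Enum.zip(Nx.to_flat_list(labels), Nx.to_flat_list(dists))
    end)
    |> Enum.flat_map(fn {:ok, pairs} -> pairs end)
    # cosine distance: smaller means more similar
    |> Enum.sort_by(fn {_id, dist} -> dist end)
    |> Enum.take(@k)
  end

  # At ingestion time, a shard for a new image can be picked by hashing its id
  def shard_for(image_id, shard_count), do: rem(:erlang.phash2(image_id), shard_count)
end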

Advantages:

  • Massive scalability
  • Parallel search
  • Blue/green indexing possible

Limitations:

  1. Cross-shard coordination increases latency
  2. Requires index rebalancing strategies
  3. Deployment complexity higher

Summary

  • Query-time path: user query → Nx.Serving text encoder → vector → shard router → multi-shard HNSW search → merge → metadata → result

  • Ingestion-time path: image upload → Broadway → download → Nx.Serving vision encoder → HNSW add → persist → notify