We want to build an application that, given a text query, returns the most relevant images.
Approach 1: Direct ML Inference per Request
- The simplest implementation:
- Load CLIP model (text + image encoders)
- For every user query, compute the text embedding
- Compute image embeddings on-the-fly by running the vision model for every candidate image
- Compute cosine similarity between text vector and each image vector
- Return sorted top-K images.
# Load the CLIP text encoder and tokenizer
{:ok, %{model: text_model, params: text_params}} =
  Bumblebee.load_model({:hf, "openai/clip-vit-base-patch32"},
    module: Bumblebee.Text.ClipText,
    architecture: :base
  )

{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "openai/clip-vit-base-patch32"})

# Tokenize the query and run a forward pass through the text encoder
tokens = Bumblebee.apply_tokenizer(tokenizer, ["yellow flowers on a table"])
outputs = Axon.predict(text_model, text_params, tokens)

# L2-normalize the pooled output so cosine similarity reduces to a dot product
pooled = outputs.pooled_state
norm = pooled |> Nx.multiply(pooled) |> Nx.sum(axes: [-1], keep_axes: true) |> Nx.sqrt()
embedding = Nx.divide(pooled, norm)
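With the query embedding in hand, Approach 1 ranks candidates by brute-force cosine similarity over image embeddings computed on the fly. A minimal sketch of that ranking step, assuming a hypothetical image_embeddings list of {image_id, vector} pairs whose vectors are already L2-normalized Nx tensors:
# With unit-length vectors, cosine similarity reduces to a dot product.
query = Nx.squeeze(embedding)

top_k =
  image_embeddings
  |> Enum.map(fn {id, vector} -> {id, query |> Nx.dot(vector) |> Nx.to_number()} end)
  |> Enum.sort_by(fn {_id, score} -> score end, :desc)
  |> Enum.take(10)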
Limitations:
- Image inference per request is too slow
- CLIP vision model is heavy (ViT)
- Computing embeddings for 10k images per query → seconds to minutes
- No caching / reuse of embeddings
- Pure waste of computation for static images
- Similarity search via linear scan
- Brute-force cosine similarity: O(N) per query
Approach 2: Precomputing Image Embeddings
- Compute and store image embeddings offline
- For each request:
- Compute text embedding
- Compare against pre-computed embeddings
- No repeated image inference
# Load the CLIP vision encoder and its featurizer (done once, offline)
{:ok, %{model: vision_model, params: vision_params}} =
  Bumblebee.load_model({:hf, "openai/clip-vit-base-patch32"},
    module: Bumblebee.Vision.ClipVision,
    architecture: :base
  )

{:ok, featurizer} = Bumblebee.load_featurizer({:hf, "openai/clip-vit-base-patch32"})

def embed_image(path, model, params, featurizer) do
  {:ok, img} = StbImage.read_file(path)
  features = Bumblebee.apply_featurizer(featurizer, img)

  # Run the vision encoder and L2-normalize the pooled output
  pooled = Axon.predict(model, params, features).pooled_state
  norm = pooled |> Nx.multiply(pooled) |> Nx.sum(axes: [-1], keep_axes: true) |> Nx.sqrt()

  pooled
  |> Nx.divide(norm)
  |> Nx.to_flat_list()
end
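The offline step is then just a batch job that maps embed_image/4 over the whole catalog and persists the vectors. A minimal sketch, assuming a hypothetical images list of {image_id, local_path} pairs and a plain term file as the store:
# Hypothetical batch job: `images` is a list of {image_id, local_path} pairs.
image_embeddings =
  Enum.map(images, fn {id, path} ->
    {id, embed_image(path, vision_model, vision_params, featurizer)}
  end)

# Persist the vectors so queries never have to touch the vision model again
File.write!("priv/image_embeddings.bin", :erlang.term_to_binary(image_embeddings))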
Limitations:
- Linear scan still too slow
- For N images, each query is O(N)
- 25k images = noticeable lag
- 100k+ images = unusable.
- Works only if dataset is small
- Real-world production needs instant search (< 100ms)
Approach 3: ANN Search (HNSW) for Vector Indexing
- Use HNSWLib to index image embeddings
- k-NN queries become logarithmic instead of linear
- Works well for 100k–10M embeddings
# Build the index once, offline (cosine space, 512-dimensional CLIP embeddings)
{:ok, index} =
  HNSWLib.Index.new(:cosine, 512, count, ef_construction: 200)

Enum.each(image_embeddings, fn {id, vector} ->
  # ids must line up with whatever image ids the metadata store uses
  HNSWLib.Index.add_items(index, Nx.tensor(vector), ids: Nx.tensor([id], type: :u64))
end)

HNSWLib.Index.save_index(index, "priv/clip_index.ann")
Query Code:
{:ok, idx} = HNSWLib.Index.load_index(:cosine, 512, "priv/clip_index.ann")

# Returns the labels (image ids) and cosine distances of the 10 nearest neighbors
{:ok, labels, distances} = HNSWLib.Index.knn_query(idx, text_vec, k: 10)
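Tying the two snippets together, the full query path for this approach might look like the sketch below; embed_text/1 stands in for the CLIP text-encoding code from Approach 1, and lookup_image_path/1 is a hypothetical id-to-URL lookup.
# Sketch: embed_text/1 (hypothetical) wraps the CLIP text encoder from Approach 1
# and returns an L2-normalized f32 tensor of shape {512}.
def search(index, query_text, k \\ 10) do
  text_vec = embed_text(query_text)

  {:ok, labels, _distances} = HNSWLib.Index.knn_query(index, text_vec, k: k)

  labels
  |> Nx.to_flat_list()
  |> Enum.map(&lookup_image_path/1)  # hypothetical id -> path/url lookup
end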
Limitations:
- Model inference still slow per request
- Text embedding still requires CLIP model forward pass
- If load is high (thousands of users), inference becomes the bottleneck
- No batching
- Each request executes a separate inference pass
- GPUs/CPUs under-utilized
- No automatic reindex pipeline
- When new images arrive, their embeddings must be computed and the index updated by hand
Approach 4: Use Nx.Serving for Batching + High-Speed Inference
- Use Nx.Serving to batch inference across concurrent requests
- Serves model via GenServer-like interface
- Automatically batches input requests within a time window
- Gives 5×–25× throughput improvements depending on backend
def serving_text_model() do
  {:ok, model_info} =
    Bumblebee.load_model({:hf, "openai/clip-vit-base-patch32"},
      module: Bumblebee.Text.ClipText,
      architecture: :base
    )

  {:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "openai/clip-vit-base-patch32"})

  # The serving tokenizes, runs the encoder, and L2-normalizes the embedding.
  # Compiling for a fixed batch size is what allows batching of concurrent requests
  # (EXLA is assumed as the compiler backend).
  Bumblebee.Text.text_embedding(model_info, tokenizer,
    embedding_processor: :l2_norm,
    compile: [batch_size: 32, sequence_length: 64],
    defn_options: [compiler: EXLA]
  )
end

children = [
  # batch_timeout is the window (in ms) for merging concurrent requests into one batch
  {Nx.Serving, name: MediaSearch.TextServing, serving: serving_text_model(), batch_timeout: 100}
]

Supervisor.start_link(children, strategy: :one_for_one)
Query Code:
# batched_run merges concurrent callers into a single batched forward pass
%{embedding: embedding} =
  Nx.Serving.batched_run(MediaSearch.TextServing, "cat wearing sunglasses")
Advantages:
- High throughput inference
- Great for web APIs
- Runs on CPU or GPU automatically, depending on the configured backend
Limitations:
- Still no ingestion pipeline
- Images still need manual embedding + indexing
- Still a single serving instance per model
- If the system grows, you need supervision trees, distributed load balancing, etc.
Approach 5: Add Broadway for Streaming Image Ingestion
Use Broadway to build a scalable ingestion pipeline:
- Step 1: Receive messages (S3 event / HTTP upload / Kafka)
- Step 2: Download image
- Step 3: Compute embedding via Nx.Serving (batched)
- Step 4: Insert into HNSW index
- Step 5: Persist new index segments
[S3/Kafka/HTTP uploads] → Broadway →
Download →
Nx.Serving Vision Model →
HNSW Index Add →
Save Incremental Index
Code:
defmodule MediaSearch.Ingestor do
  use Broadway

  def start_link(_opts) do
    Broadway.start_link(__MODULE__,
      name: __MODULE__,
      producer: [module: {BroadwaySQS.Producer, queue_url: "..."}],
      processors: [default: [max_demand: 10]],
      batchers: [embed: [batch_size: 32]]
    )
  end

  def handle_message(_, msg, _) do
    # Assumes the queue payload is JSON containing an "image_url" field
    %{"image_url" => url} = Jason.decode!(msg.data)
    path = download(url)

    msg
    |> Broadway.Message.put_data(%{url: url, local_path: path})
    |> Broadway.Message.put_batcher(:embed)
  end

  def handle_batch(:embed, msgs, _, _) do
    images =
      Enum.map(msgs, fn msg ->
        {:ok, img} = StbImage.read_file(msg.data.local_path)
        img
      end)

    # One batched forward pass over the whole batch of images
    results = Nx.Serving.batched_run(MediaSearch.VisionServing, images)

    # MediaSearch.Index is assumed to be a process that owns the HNSW index
    # and serializes writes to it
    Enum.each(results, fn %{embedding: vector} -> MediaSearch.Index.add(vector) end)

    msgs
  end
end
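The batch handler above assumes a MediaSearch.VisionServing process started under the application supervisor. A minimal sketch of that serving, mirroring serving_text_model/0 (the batch size and EXLA compiler are assumptions):
def serving_vision_model() do
  {:ok, model_info} =
    Bumblebee.load_model({:hf, "openai/clip-vit-base-patch32"},
      module: Bumblebee.Vision.ClipVision,
      architecture: :base
    )

  {:ok, featurizer} = Bumblebee.load_featurizer({:hf, "openai/clip-vit-base-patch32"})

  Bumblebee.Vision.image_embedding(model_info, featurizer,
    embedding_processor: :l2_norm,
    compile: [batch_size: 32],
    defn_options: [compiler: EXLA]
  )
end

# Started alongside the text serving:
# {Nx.Serving, name: MediaSearch.VisionServing, serving: serving_vision_model(), batch_timeout: 100}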
Advantages:
- Handles continuous ingestion
- Batches embedding work → much faster
- Can scale consumers to multiple nodes
- Proper backpressure
Limitations:
- Index updates may block queries
- HNSW is mutable; writes compete with reads
- High write volume → lock contention
- Need sharding / partitioning for very large datasets
- Beyond ~10M embeddings, a multi-index architecture is likely required
Approach 6: Sharded Vector Index + Metadata Store
For very large datasets:
- Use multiple HNSW indices (shards)
- Partition by:
- hash of image id
- content type
- date bucket
- Run queries against multiple shards and merge results
Add a metadata DB (Postgres) to store:
- image_id → path/url
- tags, filters
- versioned embedding info
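A minimal Ecto schema for that metadata table could look like the sketch below; the field names and the embedding_version column are illustrative, not a fixed design.
defmodule MediaSearch.Image do
  use Ecto.Schema

  schema "images" do
    # the integer primary key doubles as the HNSW label (image_id)
    field :url, :string
    field :tags, {:array, :string}
    # records which model/version produced the stored embedding
    field :embedding_version, :string

    timestamps()
  end
end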
Architecture:
        ┌───────────────┐
        │ Text Encoder  │  (Nx.Serving)
        └───────┬───────┘
                │
         text embedding
                │
┌───────────────┴────────────────┐
│             Router             │
│ decides which shards to query  │
└───────┬───────────────┬────────┘
        │               │
  HNSW Shard 1    HNSW Shard 2 ...
        │               │
        └───────┬───────┘
          merge top-K
                │
     Postgres metadata lookup
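A minimal sketch of the router's fan-out step, assuming shards is a list of already-loaded HNSWLib index references; the concurrency and merge policy are illustrative:
defmodule MediaSearch.ShardRouter do
  # Queries every shard concurrently and merges per-shard results into a single
  # global top-k. hnswlib's :cosine space reports distances (1 - similarity),
  # so smaller is better and an ascending sort is correct.
  def search(shards, text_vec, k \\ 10) do
    shards
    |> Task.async_stream(
      fn index ->
        {:ok, labels, dists} = HNSWLib.Index.knn_query(index, text_vec, k: k)
        Enum.zip(Nx.to_flat_list(labels), Nx.to_flat_list(dists))
      end,
      timeout: 5_000
    )
    |> Enum.flat_map(fn {:ok, results} -> results end)
    |> Enum.sort_by(fn {_id, dist} -> dist end)
    |> Enum.take(k)
  end
end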
Advantages:
- Massive scalability
- Parallel search
- Blue/green indexing possible
Limitations:
- Cross-shard coordination increases latency
- Requires index rebalancing strategies
- Deployment complexity higher
Summary
- Query-time path: user query → Nx.Serving text encoder → vector → shard router → multi-shard HNSW search → merge → metadata lookup → result
- Ingestion-time path: image upload → Broadway → download → Nx.Serving vision encoder → HNSW add → persist → notify
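Putting it all together, the application's supervision tree might look roughly like this sketch; serving_text_model/0 and serving_vision_model/0 are the functions shown earlier, and MediaSearch.Index / MediaSearch.Repo are hypothetical owners of the HNSW shards and the Postgres metadata.
defmodule MediaSearch.Application do
  use Application

  def start(_type, _args) do
    children = [
      # Query-time path: batched CLIP text encoder
      {Nx.Serving,
       name: MediaSearch.TextServing, serving: serving_text_model(), batch_timeout: 100},
      # Ingestion-time path: batched CLIP vision encoder feeding the Broadway pipeline
      {Nx.Serving,
       name: MediaSearch.VisionServing, serving: serving_vision_model(), batch_timeout: 100},
      MediaSearch.Ingestor,
      # Hypothetical: HNSW shard owner and Ecto repo for metadata
      MediaSearch.Index,
      MediaSearch.Repo
    ]

    Supervisor.start_link(children, strategy: :one_for_one)
  end
end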