Inverted Index

Used for Search, Built on top of text documents. It’s basically a map where keys → terms and Values → Arrays of document Ids. Simply in terms of Go map[string][]int (An Array of map of strings)

Problem Context & Requirements

Log File Search

which have unique characteristics

Append-only: Log files are immutable once written
Exact search only: Looking for complete keyword matches (e.g., “error1”)
No document removal: Documents are never deleted individually
Continuous stream: Log files are constantly being written to
Memory/disk efficiency: Must be small and lightweight for server deployment

Performance Trade-offs

Prioritizes: Low memory usage and disk space over search speed
Acceptable: Slower searches in exchange for efficiency
Critical: Predictable resource usage as index grows

Key Design Decisions

Memory Efficiency
- Disk-resident index: Never fully loaded into memory
- Streaming writes: Data split into pieces and flushed frequently to avoid memory buildup
- Predictable usage: Memory consumption doesn’t grow with index size
Storage Optimization
- Data compression: Both terms and timestamps are compressed
- Immutable writes: Append-only operations for efficient sequential I/O
- Chunked data: Small segments allow reading only necessary parts
- Per-file indexing: One index per log file (deleted when source file is removed)
Enhanced Features
- Timestamp-based search: Replaces document IDs with timestamps for date-range queries
- Two-column index: Supports queries like “find today’s docs containing ‘error‘“

Compression Technologies

Integer compression: Uses established algorithms (references Daniel Lemire’s work)[
String compression: Employs Finite State Transducers (FST) used by Lucene
- FST creates a tree where strings share prefixes and suffixes
- Significantly reduces storage space

API Design

Put(term, timestamps): Writes data (requires sorted input)
ReadTerms(): Fast term-only searches
ReadTimestamps(terms, min, max): Value retrieval with time boundaries
Close(): Finalizes and writes FST structure

File Format Architecture

The sophisticated file structure optimizes for different query types

[term1_index_len][term1_index][term1_compressed_segments]
[...more terms...]
[termN_index_len][termN_index][termN_compressed_segments]
[fst][fst_len]

Key benefits:

Single-pass writing: Can be built sequentially
Efficient term searches: Only requires reading FST from file tail
Selective value reading: Mini-indexes allow reading only relevant segments
Range query optimization: Binary search through segments reduces I/O

Segmentation Strategy

Problem: Reading entire value lists for range queries wastes I/O
Solution: Split timestamp lists into compressed segments like (1,2,3), (4,5,6), (98,99,100)
Optimization: Each term has a small index of segment minimums and offsets for binary search

Sadiq's Knowledge Vaults

Explorer