Used for Search, Built on top of text documents. It’s basically a map where keys → terms and Values → Arrays of document Ids. Simply in terms of Go map[string][]int (An Array of map of strings)
Problem Context & Requirements
Log File Search
which have unique characteristics
- Append-only: Log files are immutable once written
- Exact search only: Looking for complete keyword matches (e.g., “error1”)
- No document removal: Documents are never deleted individually
- Continuous stream: Log files are constantly being written to
- Memory/disk efficiency: Must be small and lightweight for server deployment
Performance Trade-offs
- Prioritizes: Low memory usage and disk space over search speed
- Acceptable: Slower searches in exchange for efficiency
- Critical: Predictable resource usage as index grows
Key Design Decisions
- Memory Efficiency
- Disk-resident index: Never fully loaded into memory
- Streaming writes: Data split into pieces and flushed frequently to avoid memory buildup
- Predictable usage: Memory consumption doesn’t grow with index size
- Storage Optimization
- Data compression: Both terms and timestamps are compressed
- Immutable writes: Append-only operations for efficient sequential I/O
- Chunked data: Small segments allow reading only necessary parts
- Per-file indexing: One index per log file (deleted when source file is removed)
- Enhanced Features
- Timestamp-based search: Replaces document IDs with timestamps for date-range queries
- Two-column index: Supports queries like “find today’s docs containing ‘error‘“
Compression Technologies
- Integer compression: Uses established algorithms (references Daniel Lemire’s work)[
- String compression: Employs Finite State Transducers (FST) used by Lucene
- FST creates a tree where strings share prefixes and suffixes
- Significantly reduces storage space
API Design
Put(term, timestamps): Writes data (requires sorted input)ReadTerms(): Fast term-only searchesReadTimestamps(terms, min, max): Value retrieval with time boundariesClose(): Finalizes and writes FST structure
File Format Architecture
The sophisticated file structure optimizes for different query types
[term1_index_len][term1_index][term1_compressed_segments]
[...more terms...]
[termN_index_len][termN_index][termN_compressed_segments]
[fst][fst_len]Key benefits:
- Single-pass writing: Can be built sequentially
- Efficient term searches: Only requires reading FST from file tail
- Selective value reading: Mini-indexes allow reading only relevant segments
- Range query optimization: Binary search through segments reduces I/O
Segmentation Strategy
- Problem: Reading entire value lists for range queries wastes I/O
- Solution: Split timestamp lists into compressed segments like
(1,2,3), (4,5,6), (98,99,100) - Optimization: Each term has a small index of segment minimums and offsets for binary search