Used for Search, Built on top of text documents. It’s basically a map where keys terms and Values Arrays of document Ids. Simply in terms of Go map[string][]int (An Array of map of strings)

Problem Context & Requirements

which have unique characteristics

  • Append-only: Log files are immutable once written
  • Exact search only: Looking for complete keyword matches (e.g., “error1”)
  • No document removal: Documents are never deleted individually
  • Continuous stream: Log files are constantly being written to
  • Memory/disk efficiency: Must be small and lightweight for server deployment
Performance Trade-offs
  • Prioritizes: Low memory usage and disk space over search speed
  • Acceptable: Slower searches in exchange for efficiency
  • Critical: Predictable resource usage as index grows

Key Design Decisions

  1. Memory Efficiency
    • Disk-resident index: Never fully loaded into memory
    • Streaming writes: Data split into pieces and flushed frequently to avoid memory buildup
    • Predictable usage: Memory consumption doesn’t grow with index size
  2. Storage Optimization
    • Data compression: Both terms and timestamps are compressed
    • Immutable writes: Append-only operations for efficient sequential I/O
    • Chunked data: Small segments allow reading only necessary parts
    • Per-file indexing: One index per log file (deleted when source file is removed)
  3. Enhanced Features
    • Timestamp-based search: Replaces document IDs with timestamps for date-range queries
    • Two-column index: Supports queries like “find today’s docs containing ‘error‘“

Compression Technologies

  • Integer compression: Uses established algorithms (references Daniel Lemire’s work)[
  • String compression: Employs Finite State Transducers (FST) used by Lucene
    • FST creates a tree where strings share prefixes and suffixes
    • Significantly reduces storage space

API Design

  • Put(term, timestamps): Writes data (requires sorted input)
  • ReadTerms(): Fast term-only searches
  • ReadTimestamps(terms, min, max): Value retrieval with time boundaries
  • Close(): Finalizes and writes FST structure

File Format Architecture

The sophisticated file structure optimizes for different query types

[term1_index_len][term1_index][term1_compressed_segments]
[...more terms...]
[termN_index_len][termN_index][termN_compressed_segments]
[fst][fst_len]

Key benefits:

  • Single-pass writing: Can be built sequentially
  • Efficient term searches: Only requires reading FST from file tail
  • Selective value reading: Mini-indexes allow reading only relevant segments
  • Range query optimization: Binary search through segments reduces I/O

Segmentation Strategy

  • Problem: Reading entire value lists for range queries wastes I/O
  • Solution: Split timestamp lists into compressed segments like (1,2,3), (4,5,6), (98,99,100)
  • Optimization: Each term has a small index of segment minimums and offsets for binary search