B trees are great, but when the data is coming is too large or when it’s a write heavy database, the O(logn) write speed might not cut it and the disks we use to store are not very good at Random access(finding and writing data on random places on the disk), and in case of B trees we need to swap and move around a lot of data and that’s why there will be a lot of Random I/O making this writes slow
To solve this problem we use a famous technique, and the flow will be like this: batch write the incoming values into a temp storage flush it into database after sometime
Now let’s optimize this flow

  • The temp storage - Mem-Table - It’s stores with the keys sorted
  • Parallel to this, write the data in the Mem-Table to WAL using direct-io(extremely fast write by bypassing the OS) to be secure from crashes
  • Now Mem-Table is nearly full we write it to SSTable(Sorted String Table), this is immutable and it consists of many segments which have keys in sorted order, and in the next insertion from mem-table we create another segment file and write it into it
  • Now when we need to search anything we follow the following process
    • First we search in mem-table(we’ve fresh data here), if yes return, if not go search in SSTable
    • Now in SSTable before we use to do binary search, but it is also a overhead when we’ve large data, we moved to file segregation parts search, it means we sort and segregate the segment files into parts and have a index file which has the starting key of the segment file and the offset, this tells us which part has what range of data, and from this we go to that part and do the Binary search and get the data
    • This is ok, but we need more speed, and here we use Bloom filter, here rather parsing the indexes even if it’s not present and wasting our IO threads, we keep the values in bloom filter and ask it, if the value i need is present in the segment file or not, if no then ignore and move on, if yes then go and search
  • The read and write optimizations are solved, but as the data is coming continuously the storage(segment files) piles up quickly, so we’ve a background process running which performs “compaction” process, it takes two segment files and merge them removing duplicate and redundant data