<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://awj.dev/feed.xml" rel="self" type="application/atom+xml" /><link href="https://awj.dev/" rel="alternate" type="text/html" /><updated>2025-10-03T21:37:20+00:00</updated><id>https://awj.dev/feed.xml</id><title type="html">Adam Jones</title><subtitle>Me writing about things that I hopefully know something useful about.</subtitle><entry><title type="html">Understanding database indices by (poorly) implementing one</title><link href="https://awj.dev/ruby/databases/performance/2023/05/20/understanding-database-indexes-by-implementing-one.html" rel="alternate" type="text/html" title="Understanding database indices by (poorly) implementing one" /><published>2023-05-20T20:59:12+00:00</published><updated>2023-05-20T20:59:12+00:00</updated><id>https://awj.dev/ruby/databases/performance/2023/05/20/understanding-database-indexes-by-implementing-one</id><content type="html" xml:base="https://awj.dev/ruby/databases/performance/2023/05/20/understanding-database-indexes-by-implementing-one.html"><![CDATA[<p>There’s a lot of misconceptions about database indices. These exist, in part, because people are missing the context needed to imagine how a database uses them. There’s <em>a lot</em> to learn to establish that context. Too much for one blog post. But, we can try to bootstrap off what’s already familiar to help develop a better understanding.</p>

<p>To do that, we’re going to implement a fake database index in Ruby. It will be <em>woefully</em> incomplete, but still should be enough to give an idea of what’s happening.</p>

<h2 id="warning">Warning</h2>

<p>What you’ll see here is not, <em>actually</em>, how database indices work. It’s an extremely crude approximation. I try to call out where and how that approximation isn’t valid. If you encounter anything in an actual database that doesn’t match up with what you see here, I encourage you to take that as an opportunity to dive in and learn more.</p>

<h1 id="making-things-very-simple">Making things <em>very</em> simple</h1>

<p>We’re going to build our fake index out of the humble Ruby <code class="language-plaintext highlighter-rouge">Hash</code>. Those are pretty familiar, right? Store data by key and value, then you can later retrieve the value by providing the key. If you don’t have a key, you’re basically just working with a more expensive variant of an <code class="language-plaintext highlighter-rouge">Array</code>. Ironically, under the hood a Ruby <code class="language-plaintext highlighter-rouge">Hash</code> uses a lot of the same concepts and data structures as database indices. Anyways, this will be our substitute for writing actual data structure code.</p>

<p>We’ll only support <em>unique</em> indices. It’s possible, but messy, for us to support non-unique ones. I just don’t think it’s going to teach much you won’t already learn here. We <em>will</em> support composite indices, and will get into covering queries that only use some of the index columns.</p>

<p>Probably the biggest query-time thing separating our index from a real one will be lack of support for range queries. So no <code class="language-plaintext highlighter-rouge">WHERE X &gt; 0</code> style queries for our index. We’re ignoring this because hashes don’t make it easy to do efficiently, and I don’t think implementing it will tell you much that direct value lookups don’t. Real database indices <em>absolutely</em> are able to handle these for many different data types.</p>

<h2 id="the-index-class">The Index class</h2>

<p>We’ll start with a class named <code class="language-plaintext highlighter-rouge">Index</code>, which will be the core of our code here. We will “implement” different SQL queries as Ruby code written in terms of this <code class="language-plaintext highlighter-rouge">Index</code> class.</p>

<p>We use <code class="language-plaintext highlighter-rouge">Index.declare</code> to create an (empty) index on a list of columns. Then we can add data to it by looping through the data and calling <code class="language-plaintext highlighter-rouge">Index#add</code>.</p>

<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Allow us to efficiently answer questions about a large amount of data based on</span>
<span class="c1"># specific column(s) in it.</span>
<span class="k">class</span> <span class="nc">Index</span>
  <span class="c1"># The column this index is handling.</span>
  <span class="nb">attr_reader</span> <span class="ss">:column</span>

  <span class="c1"># The columns that come *after* this one in the index.</span>
  <span class="c1">#</span>
  <span class="c1"># If this list is empty, we're at the "end" of the index column list and</span>
  <span class="c1"># should store row ids as our Hash values.</span>
  <span class="c1">#</span>
  <span class="c1"># If it is *not* empty, we make an `Index` class that deals with those</span>
  <span class="c1"># columns and use it as our Hash value.</span>
  <span class="nb">attr_reader</span> <span class="ss">:subsequent_columns</span>

  <span class="c1"># The Hash that represents actual index content. I'm avoiding calling this</span>
  <span class="c1"># `data` because it's *not* the actual data we're indexing. Confusing</span>
  <span class="c1"># terminology.</span>
  <span class="nb">attr_reader</span> <span class="ss">:content</span>

  <span class="k">def</span> <span class="nf">initialize</span><span class="p">(</span><span class="n">column</span><span class="p">,</span> <span class="n">subsequent_columns</span> <span class="o">=</span> <span class="p">[])</span>
    <span class="vi">@column</span> <span class="o">=</span> <span class="n">column</span>
    <span class="vi">@subsequent_columns</span> <span class="o">=</span> <span class="n">subsequent_columns</span>
    <span class="vi">@content</span> <span class="o">=</span> <span class="p">{}</span>
  <span class="k">end</span>

  <span class="c1"># Are we the final column of the index? If so, our answers should be data id</span>
  <span class="c1"># values instead of another `Index`</span>
  <span class="k">def</span> <span class="nf">leaf?</span>
    <span class="vi">@subsequent_columns</span><span class="p">.</span><span class="nf">empty?</span>
  <span class="k">end</span>

  <span class="c1"># "Index" a piece of data. It's assumed that this data is functionally a Hash</span>
  <span class="c1"># that contains at least `:id` and whatever value we hvae for `column`.</span>
  <span class="k">def</span> <span class="nf">add</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
    <span class="n">value</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="n">column</span><span class="p">]</span>
    <span class="k">if</span> <span class="n">leaf?</span>
      <span class="vi">@content</span><span class="p">[</span><span class="n">value</span><span class="p">]</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="ss">:id</span><span class="p">]</span>
    <span class="k">else</span>
      <span class="c1"># If we are *not* the final column, create a new Index to represent the</span>
      <span class="c1"># slice of data that all shares the same value for our `column`. This</span>
      <span class="c1"># index should use the *next* subsequent column, and needs to know about</span>
      <span class="c1"># the *rest* of the subsequent columns in case it too is not the final</span>
      <span class="c1"># one.</span>
      <span class="vi">@content</span><span class="p">[</span><span class="n">value</span><span class="p">]</span> <span class="o">||=</span> <span class="no">Index</span><span class="p">.</span><span class="nf">new</span><span class="p">(</span><span class="n">subsequent_columns</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">subsequent_columns</span><span class="p">.</span><span class="nf">drop</span><span class="p">(</span><span class="mi">1</span><span class="p">))</span>
      <span class="vi">@content</span><span class="p">[</span><span class="n">value</span><span class="p">].</span><span class="nf">add</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
    <span class="k">end</span>
  <span class="k">end</span>
<span class="k">end</span>
</code></pre></div></div>
<h1 id="what-we-can-learn-about-database-indices">What we can learn about database indices</h1>

<p>Surprisingly, even with just this much code we can draw an important and useful inference about working with indices. The “natural flow” of accessing this data is going to be along the path dictated by the columns. Our index also can’t answer questions involving columns that weren’t indexed.</p>

<p>It’s easy to imagine navigating this in column order, but <em>other</em> orders seem like a bigger challenge. Databases are full of clever optimizations that can <em>sometimes</em> make out-of-order usage possible, but generally speaking you want things to happen in-order.</p>

<h1 id="sample-data">Sample data</h1>

<p>To play with this, we’ll work on sample data taken from the <a href="https://www.census.gov/data/tables/time-series/demo/popest/2020s-total-cities-and-towns.html#v2022">US Census Bureau City and Town Population Totals</a>. This is a list of ~20k cities in the US with their estimated population.</p>

<p>For the purposes of this post, I have <a href="https://awj.dev/static/city_populations_2022.csv">cleaned it up</a> into a CSV, with state names extracted.</p>

<p>We’re going to assume here that the combination of the <code class="language-plaintext highlighter-rouge">city</code> and <code class="language-plaintext highlighter-rouge">state</code> columns makes a record unique. That isn’t <em>strictly</em> true for this data, but again it makes it easier to work with.</p>

<h1 id="harnass-code">Harnass code</h1>

<p>The following code is enough to get us started in an IRB session. It assumes the above code snippet is available locally as <code class="language-plaintext highlighter-rouge">./index.rb</code>, and the CSV can be found at <code class="language-plaintext highlighter-rouge">./city_populations_2022.csv</code>.</p>

<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">require</span> <span class="s2">"csv"</span>

<span class="nb">load</span> <span class="s2">"index.rb"</span>

<span class="c1"># Load the CSV, converting integer values as we go</span>
<span class="n">csv</span> <span class="o">=</span> <span class="no">CSV</span><span class="p">.</span><span class="nf">read</span><span class="p">(</span><span class="s2">"./city_populations_2022.csv"</span><span class="p">,</span> <span class="ss">headers: </span><span class="kp">true</span><span class="p">,</span> <span class="ss">converters: </span><span class="p">[</span><span class="ss">:integer</span><span class="p">,</span> <span class="ss">:all</span><span class="p">,</span> <span class="ss">:all</span><span class="p">,</span> <span class="ss">:all</span><span class="p">,</span> <span class="ss">:integer</span><span class="p">])</span>

<span class="c1"># Store our CSV in an Array where the values are hashes of the row</span>
<span class="c1"># data. This will simulate the actual database table.</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">csv</span><span class="p">.</span><span class="nf">map</span><span class="p">(</span><span class="o">&amp;</span><span class="ss">:to_h</span><span class="p">);</span> <span class="kp">nil</span>

<span class="c1"># Declare an index on state and city, in that order</span>
<span class="n">index</span> <span class="o">=</span> <span class="no">Index</span><span class="p">.</span><span class="nf">declare</span><span class="p">(</span><span class="s2">"state"</span><span class="p">,</span> <span class="s2">"city"</span><span class="p">)</span>

<span class="n">data</span><span class="p">.</span><span class="nf">each</span> <span class="k">do</span> <span class="o">|</span><span class="n">row</span><span class="o">|</span>
  <span class="n">index</span><span class="p">.</span><span class="nf">add</span><span class="p">(</span><span class="n">row</span><span class="p">)</span>
<span class="k">end</span><span class="p">;</span> <span class="kp">nil</span>
</code></pre></div></div>

<p>If we were to discard the index class and <em>just</em> look at things as nested hashes, our index would look like this:</p>

<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span>
  <span class="s2">"California"</span> <span class="o">=&gt;</span> <span class="p">{</span>
    <span class="s2">"Los Angeles"</span> <span class="o">=&gt;</span> <span class="mi">1444</span> <span class="c1"># 1444 is the row id for this city</span>
  <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<h2 id="finding-a-row-by-state-and-city">Finding a row by state and city</h2>

<p>We’ll start out simple: given a city and state, look up the row. We’ll try it out with Los Angeles, California. In SQL, this would be: <code class="language-plaintext highlighter-rouge">SELECT * FROM populations WHERE state = 'California' AND city = 'Los Angeles'</code></p>

<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">state</span> <span class="o">=</span> <span class="n">index</span><span class="p">.</span><span class="nf">content</span><span class="p">[</span><span class="s2">"California"</span><span class="p">]</span>
<span class="n">city</span> <span class="o">=</span> <span class="n">state</span><span class="p">.</span><span class="nf">content</span><span class="p">[</span><span class="s2">"Los Angeles"</span><span class="p">]</span>

<span class="c1"># Our `id` values don't exactly correspond to Array offsets, so we have to do this.</span>
<span class="n">data</span><span class="p">[</span><span class="n">city</span> <span class="o">-</span> <span class="mi">1</span><span class="p">]</span>
</code></pre></div></div>

<h1 id="what-we-can-learn-about-database-indices-1">What we can learn about database indices</h1>

<h4 id="index-ordering">Index ordering</h4>

<p>Notice how we’re starting the lookup with the <code class="language-plaintext highlighter-rouge">state</code>? That’s because it’s the “beginning” of the index.</p>

<p>Imagine if we tried to start with the <code class="language-plaintext highlighter-rouge">city</code> first. What would that code look like? It would have to dig through <em>every value</em> in the <code class="language-plaintext highlighter-rouge">state</code> index to get at cities, then work its way backwards.</p>

<p>Often, your database effectively can’t do this. There’s too much data involved, and simply keeping track of everything you’ve looked at could cause problems. Plus “examine the entire index” isn’t going to be a fast operation. It might pursue this strategy if you give it no better option, but you <em>really</em> want to give it better options.</p>
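<p>To make that concrete, here’s a sketch (in terms of the <code class="language-plaintext highlighter-rouge">index</code> we built above) of what a “city first” lookup would be forced to do:</p>

<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Sketch: looking up a city without knowing its state. With no state to
# start from, we have to probe every state's sub-index.
matches = []
states_examined = 0

index.content.each do |state_name, cities|
  states_examined += 1
  row_id = cities.content["Los Angeles"]
  matches.push([state_name, row_id]) if row_id
end; nil

# We touched every state entry in the index to answer a one-city question.
[matches, states_examined]
</code></pre></div></div>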

<h4 id="row-lookup">Row lookup</h4>

<p>Notice how, to return the data, we had to go to our “table” that is stored in <code class="language-plaintext highlighter-rouge">data</code>? That’s called a “row lookup”. Real databases almost always store the index data and row data separately, so row lookups have additional overhead that we want to be careful with.</p>

<p>Often, optimizing SQL queries is a process of trying to avoid any more row lookups than strictly necessary.</p>

<h2 id="finding-the-total-population-of-a-state">Finding the total population of a state</h2>

<p>Ok, now let’s try another likely task: finding the total population of a state. We’ll go with Idaho this time. In SQL this would look like <code class="language-plaintext highlighter-rouge">SELECT sum(population) FROM populations WHERE state = 'Idaho'</code>.</p>

<p>At first glance it might not look like our index is helpful here, but it still is. Here’s code to get this <em>without</em> the index:</p>

<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sum</span> <span class="o">=</span> <span class="mi">0</span>

<span class="c1"># Let's keep track of how many times we had to go fetch a row. This is</span>
<span class="c1"># important, because row lookups are expensive.</span>
<span class="n">rows_examined</span> <span class="o">=</span> <span class="mi">0</span>

<span class="c1"># Notice: we are visiting *every* row in the data. If we had millions or</span>
<span class="c1"># billions of rows, this would be really bad.</span>
<span class="n">data</span><span class="p">.</span><span class="nf">each</span> <span class="k">do</span> <span class="o">|</span><span class="n">row</span><span class="o">|</span>
  <span class="n">rows_examined</span> <span class="o">+=</span> <span class="mi">1</span>
  <span class="k">next</span> <span class="k">unless</span> <span class="n">row</span><span class="p">[</span><span class="s2">"state"</span><span class="p">]</span> <span class="o">==</span> <span class="s2">"Idaho"</span>
  
  <span class="n">sum</span> <span class="o">+=</span> <span class="n">row</span><span class="p">[</span><span class="s2">"population"</span><span class="p">]</span>
<span class="k">end</span><span class="p">;</span> <span class="kp">nil</span>

<span class="p">[</span><span class="n">sum</span><span class="p">,</span> <span class="n">rows_examined</span><span class="p">]</span> <span class="c1"># =&gt; [1302154, 19692]</span>
</code></pre></div></div>

<p>So we got our sum, and it was <em>probably</em> fast on your computer (reminder: this is a tiny amount of data), but we had to look at every single row in the data. Usually, “we have to look at every row in the entire table” is one of the absolute <em>worst</em> things you can see your database doing.</p>

<p>So how can we use our index? We don’t have a ready list of “the names of every city in Idaho”, so we can’t just plug that in as keys once we get to the <code class="language-plaintext highlighter-rouge">Idaho</code> index. But, we <em>do</em> have the ability to traverse a <code class="language-plaintext highlighter-rouge">Hash</code> by <em>values</em>. So we can still use our index to help us get to the state of Idaho, then crawl through its contents to find the total population.</p>

<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sum</span> <span class="o">=</span> <span class="mi">0</span>

<span class="c1"># Again, we're tracking rows</span>
<span class="n">rows_examined</span> <span class="o">=</span> <span class="mi">0</span>

<span class="n">state</span> <span class="o">=</span> <span class="n">index</span><span class="p">.</span><span class="nf">content</span><span class="p">[</span><span class="s1">'Idaho'</span><span class="p">]</span>
<span class="n">state</span><span class="p">.</span><span class="nf">content</span><span class="p">.</span><span class="nf">values</span><span class="p">.</span><span class="nf">each</span> <span class="k">do</span> <span class="o">|</span><span class="n">row_id</span><span class="o">|</span>
  <span class="n">city</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="n">row_id</span> <span class="o">-</span> <span class="mi">1</span><span class="p">]</span>
  <span class="n">rows_examined</span> <span class="o">+=</span> <span class="mi">1</span>
  <span class="n">sum</span> <span class="o">+=</span> <span class="n">city</span><span class="p">[</span><span class="s2">"population"</span><span class="p">]</span>
<span class="k">end</span><span class="p">;</span> <span class="kp">nil</span>

<span class="p">[</span><span class="n">sum</span><span class="p">,</span> <span class="n">rows_examined</span><span class="p">]</span> <span class="c1"># =&gt; [1302154, 199]</span>
</code></pre></div></div>
<p>So now we have the <em>same</em> sum, but we looked at roughly 1% of the rows. That’s a <em>huge</em> win.</p>

<h1 id="what-we-can-learn-about-database-indices-2">What we can learn about database indices</h1>

<p>Databases don’t just use indices for cases where they have every single relevant key. It’s a data structure that they can dig through, and that can help significantly.</p>

<p>Sometimes they do this by “skipping over” intermediate keys to get to the final rows, like what we did here. It’s worth noticing that this was only possible because our index was defined as <code class="language-plaintext highlighter-rouge">(state, city)</code>. If it had been <code class="language-plaintext highlighter-rouge">(city, state)</code>, we would have had to examine every single city name to see if it was in the state. That’s usually still <em>better</em> than crawling every row of the data, but it’s nowhere near as good as what we just experienced.</p>

<p>When you’re defining a composite index, it’s <em>really</em> important to think about the cases where you might end up querying only some of those columns. Getting the column order right will maximize the value you get out of the database’s work in maintaining the index.</p>
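<p>As a sketch, we can actually build that worse <code class="language-plaintext highlighter-rouge">(city, state)</code> index (purely for illustration) and watch how much of it we have to touch:</p>

<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical (city, state) index, built just to show the bad ordering.
city_first = Index.declare("city", "state")
data.each do |row|
  city_first.add(row)
end; nil

idaho_ids = []
keys_examined = 0

# The state we want is buried one level down, so every city key has to
# be examined on the way.
city_first.content.each_value do |states|
  keys_examined += 1
  row_id = states.content["Idaho"]
  idaho_ids.push(row_id) if row_id
end; nil

[idaho_ids.length, keys_examined]
</code></pre></div></div>

<p>There are only ~50 state keys but thousands of distinct city names, so the key count alone tells you which ordering wins for state-scoped queries.</p>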

<h2 id="a-new-index-for-even-faster-population-totals">A new index for even faster population totals</h2>

<p>Let’s say this kind of population query is extremely important, and we’ve found the above “only accessing 1% of the rows” to <em>still</em> be too slow for our needs. What can an index do for us?</p>

<p>We’ve done more or less everything we can with our existing index. If our system supported non-unique indices, we could make an index on just <code class="language-plaintext highlighter-rouge">state</code> that would allow us to directly jump into rows, but it wouldn’t change the number of rows we’re looking at.</p>
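<p>For illustration only (our <code class="language-plaintext highlighter-rouge">Index</code> class doesn’t support this), a non-unique index is often pictured as keys mapping to <em>lists</em> of row ids:</p>

<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Sketch of a non-unique index on just "state": each key holds an Array
# of row ids instead of a nested Index.
state_only = Hash.new do |hash, key|
  hash[key] = []
end

data.each do |row|
  state_only[row["state"]].push(row["id"])
end; nil

# Jumping to the row ids is now direct...
state_only["Idaho"].length
</code></pre></div></div>

<p>We’d land directly on 199 row ids, but that’s still 199 row lookups to get the populations.</p>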

<p>Let’s build <em>another</em> index, one that extends our previous index with population values. So it would look like <code class="language-plaintext highlighter-rouge">(state, city, population)</code>. Here’s how:</p>

<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">population_index</span> <span class="o">=</span> <span class="no">Index</span><span class="p">.</span><span class="nf">declare</span><span class="p">(</span><span class="s2">"state"</span><span class="p">,</span> <span class="s2">"city"</span><span class="p">,</span> <span class="s2">"population"</span><span class="p">)</span>

<span class="n">data</span><span class="p">.</span><span class="nf">each</span> <span class="k">do</span> <span class="o">|</span><span class="n">row</span><span class="o">|</span>
  <span class="n">population_index</span><span class="p">.</span><span class="nf">add</span><span class="p">(</span><span class="n">row</span><span class="p">)</span>
<span class="k">end</span><span class="p">;</span> <span class="kp">nil</span>
</code></pre></div></div>

<p>Because <code class="language-plaintext highlighter-rouge">state+city</code> was already unique, <code class="language-plaintext highlighter-rouge">state+city+population</code> is also going to be.</p>

<p>Here’s a sketch of it as a Hash:</p>

<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span>
  <span class="s2">"California"</span> <span class="o">=&gt;</span> <span class="p">{</span>
    <span class="s2">"Los Angeles"</span> <span class="o">=&gt;</span> <span class="p">{</span>
      <span class="c1"># NOTE: This "population" Hash will always be a single key (the</span>
      <span class="c1"># population) pointing to the row id.</span>
      <span class="mi">3898767</span> <span class="o">=&gt;</span> <span class="mi">1444</span>
    <span class="p">}</span>
  <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This index can give us our population total <em>without touching a single row</em>!</p>

<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sum</span> <span class="o">=</span> <span class="mi">0</span>

<span class="n">state</span> <span class="o">=</span> <span class="n">population_index</span><span class="p">.</span><span class="nf">content</span><span class="p">[</span><span class="s2">"Idaho"</span><span class="p">]</span>

<span class="n">state</span><span class="p">.</span><span class="nf">content</span><span class="p">.</span><span class="nf">values</span><span class="p">.</span><span class="nf">each</span> <span class="k">do</span> <span class="o">|</span><span class="n">city</span><span class="o">|</span>
  <span class="n">sum</span> <span class="o">+=</span> <span class="n">city</span><span class="p">.</span><span class="nf">content</span><span class="p">.</span><span class="nf">keys</span><span class="p">.</span><span class="nf">sum</span>
<span class="k">end</span><span class="p">;</span> <span class="kp">nil</span>

<span class="n">sum</span> <span class="c1"># =&gt; 1302154</span>
</code></pre></div></div>

<p>Notice how <code class="language-plaintext highlighter-rouge">data</code> is not even mentioned in this code. We’re answering queries <em>just</em> from the index content!</p>

<h1 id="what-we-can-learn-about-database-indices-3">What we can learn about database indices</h1>

<p>Since our index reflects the underlying data, we can use the index contents <em>in place of</em> the actual data. Databases use this trick <em>a lot</em>, and it’s an incredibly effective optimization.</p>

<p>It’s generally safe to assume that your data on disk isn’t organized in a way that makes any particular lookup effective. Earlier, when we read 199 rows to get our data, it’s safe to assume that none of those rows lived next to each other in a way that allowed the operating system to avoid doing 199 disk reads.</p>

<p>By comparison, even when the index is serialized to disk, all of the relevant bits of information live closer together. It’s very likely that reading the disk block that gave us one relevant <code class="language-plaintext highlighter-rouge">city</code> <em>also</em> happened to load and cache other cities we needed. Plus our index data is a lot smaller/denser than the actual row data. So even digging everything up off the disk involved fewer disk reads.</p>

<p>When trying to look up actual city records, the same “skip over a column” trick that we did in the last section can work here. So it’s possible to go from <code class="language-plaintext highlighter-rouge">(state, city, population)</code> to the city record even with just <code class="language-plaintext highlighter-rouge">state</code> and <code class="language-plaintext highlighter-rouge">city</code>. This index could handily serve every query we’ve seen so far.</p>
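<p>Here’s a sketch of that “skip over a column” lookup against the new index:</p>

<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code>state = population_index.content["California"]
city  = state.content["Los Angeles"]

# The population level of a unique index only ever holds one entry, so we
# can "skip over" it by grabbing its single value: the row id.
row_id = city.content.values.first

data[row_id - 1]
</code></pre></div></div>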

<h2 id="finding-the-total-population-of-every-state">Finding the total population of EVERY state</h2>

<p>Now we’re going to try to handle this query: <code class="language-plaintext highlighter-rouge">SELECT state, sum(population) FROM populations GROUP BY state</code>.</p>

<p>STOP! Before you read further, I want you to think about how you’d solve this. You have three options now:</p>

<ul>
  <li>Walk through all the data rows</li>
  <li>Try to use the original index</li>
  <li>Try to use the population index</li>
</ul>

<p>That act of “deciding how to get at the data” is called <em>Query Planning</em>. It’s an important part of how databases work. Get deep enough into database performance and you’re going to have to become intimately familiar with your database’s query plan explanations. Examining that output is a key way to help debug slow queries and figure out what changes need to happen to make them not-slow queries.</p>

<p>In this case we have only three options, and it’s (probably) relatively easy to pick which one will be “the best”. But, let’s think them through in a rough approximation of how a query planner might look at this.</p>

<p>If we assign a “cost” to data and index reads, we can weigh our options by “total cost”:</p>

<ul>
  <li>Walk all the rows: 20,000 data reads + 0 index reads</li>
  <li>Use the original index: 20,000 data reads + 20,000 index reads</li>
  <li>Use the population index: 0 data reads + 20,000 index reads (reminder: all needed data is in the index)</li>
</ul>

<p>It’s generally accurate to assume that index reads are cheaper than data reads. So we’d want to “weigh” data reads higher. The actual process inside a real database is much more complicated, but here we’ll just assume data reads are 5x as expensive.</p>

<p>That gives us total costs of 100,000, 120,000, and 20,000 respectively, which means we should go with the last option.</p>

<p><em>Sidenote: to make it accessible, this cost calculation is wildly naive. Real databases track a lot more information than “how many rows are there”, and have more detailed insights about both the characteristics of the data and the specifics of how it is stored. Imagine you spent years refining this concept to fix every case where your cost predictions were wrong, and you’re closer to how databases actually work.</em></p>
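<p>That back-of-the-envelope weighing, written out as code (the 5x factor is just our made-up number from above):</p>

<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code>DATA_READ_COST  = 5
INDEX_READ_COST = 1

plans = {
  walk_all_rows:    { data_reads: 20_000, index_reads: 0 },
  original_index:   { data_reads: 20_000, index_reads: 20_000 },
  population_index: { data_reads: 0,      index_reads: 20_000 }
}

costs = plans.transform_values do |plan|
  plan[:data_reads] * DATA_READ_COST + plan[:index_reads] * INDEX_READ_COST
end
# costs: walk_all_rows 100,000; original_index 120,000; population_index 20,000

costs.min_by do |_plan, cost|
  cost
end
# [:population_index, 20000]
</code></pre></div></div>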

<p>So, how do we find per-state populations? Once again, the code comes out fairly straightforward:</p>

<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">result</span> <span class="o">=</span> <span class="p">{}</span>

<span class="n">population_index</span><span class="p">.</span><span class="nf">content</span><span class="p">.</span><span class="nf">each</span> <span class="k">do</span> <span class="o">|</span><span class="n">state</span><span class="p">,</span> <span class="n">cities</span><span class="o">|</span>
  <span class="n">result</span><span class="p">[</span><span class="n">state</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span>

  <span class="n">cities</span><span class="p">.</span><span class="nf">content</span><span class="p">.</span><span class="nf">values</span><span class="p">.</span><span class="nf">each</span> <span class="k">do</span> <span class="o">|</span><span class="n">population</span><span class="o">|</span>
    <span class="n">result</span><span class="p">[</span><span class="n">state</span><span class="p">]</span> <span class="o">+=</span> <span class="n">population</span><span class="p">.</span><span class="nf">content</span><span class="p">.</span><span class="nf">keys</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
  <span class="k">end</span>
<span class="k">end</span><span class="p">;</span> <span class="kp">nil</span>

<span class="n">result</span>
</code></pre></div></div>

<h1 id="what-we-can-learn-about-database-indices-4">What we can learn about database indices</h1>

<p>Again, we’re using the index as a <em>source</em> of information rather than just a way to <em>get to</em> information. It’s worth reiterating this because it comes up so often in real world scenarios.</p>

<p>We also, in our simulated query planning, saw a case where using an index was <em>slower</em> than reading the entire table. In practice, these scenarios are rare, but they can happen. Sometimes you’ll look at a query plan and wonder why the database is ignoring an index, only to find that you were missing a detail that makes the index <em>slower</em>. In this case that detail was “we’re asking to read the entire table”, but it’s definitely not the only one.</p>

<p>We’re also, in our <code class="language-plaintext highlighter-rouge">result</code> hash, just getting a glimpse into result buffering. Again it’s worth imagining what we would do in a scenario where there was so much data that just storing this <code class="language-plaintext highlighter-rouge">result</code> in memory wouldn’t work.</p>

<h2 id="wrapping-up">Wrapping up</h2>

<p>Hopefully this helps a little to demystify database indices. As I mentioned at the start, the <em>actual</em> data structures vary significantly from what we’re using here, but I’ve tried to keep the reasoning and thought processes consistent.</p>

<p>Despite the inaccuracies, you can get pretty far using this “wandering through nested hashes” view of how database indices work. Perhaps the best extension to your mental model would be imagining a hash that also includes the ability to do inequality comparisons on keys. Like if a <code class="language-plaintext highlighter-rouge">Hash#lookup</code> method existed that took a Ruby <code class="language-plaintext highlighter-rouge">Range</code> as an argument and could efficiently give you the values where keys were inside of that range.</p>
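<p>To make that idea concrete, here’s a toy sketch. The <code>lookup</code> method is hypothetical (no such method exists on Ruby’s <code>Hash</code>), and the linear scan only demonstrates the semantics; a real index keeps its keys ordered, e.g. in a B-tree, so it can find a range without scanning every entry.</p>

```ruby
# Toy sketch of the hypothetical Hash#lookup described above. The linear
# scan here is only to demonstrate the semantics; a real index keeps its
# keys ordered so it can find a range without touching every entry.
class RangeReadableHash < Hash
  def lookup(range)
    select { |key, _value| range.cover?(key) }.values
  end
end

populations = RangeReadableHash.new
populations["Austin"]   = 961_855
populations["Boston"]   = 675_647
populations["Portland"] = 652_503

populations.lookup("A".."C")  # => [961855, 675647]
```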

<p>If all of this has you interested in what the internals of an index <em>actually</em> look like, you can start by studying <a href="https://en.wikipedia.org/wiki/B-tree#:~:text=In%20computer%20science%2C%20a%20B,with%20more%20than%20two%20children.">B-trees</a>. They’re probably the most commonly used data structure for this purpose. Many databases support alternative index types based around different data structures, which is where you really start getting deep into the benefits and drawbacks of each one.</p>

<p>If you’d like to know more about query plans and how databases go about picking an algorithm to look up the data, that unfortunately gets pretty specific to the database involved. If you’re using <a href="https://dev.mysql.com/doc/refman/8.0/en/execution-plan-information.html">MySQL</a> or <a href="https://www.postgresql.org/docs/current/using-explain.html">PostgreSQL</a>, I’ve linked to the relevant sections of their documentation. Because databases are attempting to generate the best possible plan out of (potentially) a huge number of choices in a tiny fraction of a second, query planning gets hairy and detailed fast.</p>

<p>If this has piqued your interest in how to effectively use indices, <a href="https://use-the-index-luke.com/">Use the index, Luke</a> is a fantastic resource. It even includes an introduction to B-trees and resources tailored to multiple database types.</p>

<p>Chances are good that <em>somewhere</em> in your app, you’ve got a cache. Probably several.</p>

<p>Chances are <em>also</em> pretty good that at least some of that caching is <em>causing invalid results</em>. There’s even an outside possibility it’s making things <em>slower</em>. In this post I’ll talk through how to think about caching to be sure it’s worth the pains it can cause.</p>

<h1 id="first-applicable-scope">First, applicable scope</h1>
<p>I’m <em>only</em> talking about read-through caches. Ones where we update the cache synchronously when reads come up empty/expired. It’s one of the most common forms of caching.</p>

<p>The standard <a href="https://guides.rubyonrails.org/caching_with_rails.html#low-level-caching">Rails low-level cache</a> (i.e. <code class="language-plaintext highlighter-rouge">Rails.cache.fetch</code>) is an example.</p>
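<p>As a sketch of the mechanics (this is an illustrative stand-in imitating the shape of <code>Rails.cache.fetch</code>, not Rails internals):</p>

```ruby
# A minimal sketch of read-through caching, imitating the shape of
# Rails.cache.fetch. All names here are illustrative, not Rails internals.
class ReadThroughCache
  Entry = Struct.new(:value, :expires_at)

  def initialize
    @store = {}
  end

  # On a hit, return the cached value. On a miss (empty or expired),
  # synchronously run the block and cache its result before returning it.
  def fetch(key, expires_in: 300)
    entry = @store[key]
    return entry.value if entry && Time.now < entry.expires_at

    value = yield
    @store[key] = Entry.new(value, Time.now + expires_in)
    value
  end
end

cache = ReadThroughCache.new
cache.fetch("user/5/profile") { "expensive query result" }  # miss: runs the block
cache.fetch("user/5/profile") { "never runs" }              # hit: cached value
```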

<p>Other caching techniques might change the thought process, so if you’re looking at one of those be careful using this logic.</p>

<h1 id="good-reasons-for-caching">Good reasons for caching</h1>
<p>Before we get into “is a cache worthwhile” questions, let’s talk through the reasons you might be adding caching.</p>

<h2 id="efficiency">Efficiency</h2>
<p>This is <em>the big one</em>. You’re doing work to obtain results that don’t change, so reusing the results avoids repeating that work.</p>

<p>The goal is to improve user experiences by making things faster.</p>

<p>Although making things faster is generally desirable, it’s important to qualify (<em>not</em> quantify) this improvement. For most use cases, optimizing a 10ms request into a 1ms request isn’t particularly useful. 10ms already <em>feels</em> fast, so users won’t notice that it’s 10x faster.</p>

<p>Thankfully, there’s been some study in this area.</p>

<p>First off, we’ll look to <a href="https://www.nngroup.com/articles/response-times-3-important-limits/">Jakob Nielsen</a> for tiers of user perception on waiting for a task:</p>

<ul>
  <li>&lt; 0.1s - feels instantaneous</li>
  <li>&lt; 1.0s - keep your flow of thought</li>
  <li>&lt; 10.0s - keep your focus</li>
</ul>

<p>A cache that moves your user from one tier into another is helping immensely, even if the overall improvement doesn’t seem impressive.</p>

<p>Second, we’ll look at <a href="https://neilpatel.com/blog/loading-time/">Neil Patel</a> (as referred to us by <a href="https://support.google.com/analytics/answer/4589209?hl=en">Google Analytics help</a>) to see a 7% bounce rate increase for every 1s of load time.</p>

<p>Combining these, we can regard 1s increments as valuable changes, &gt;10s request times as extremely problematic, and getting under 1s and especially 0.1s as huge improvements. Moving about inside those 1s increments, and especially underneath 0.1s, is less valuable than crossing a boundary.</p>

<h2 id="insulating-a-data-store">Insulating a data store</h2>
<p>This is also a common reason for caching. Here your goal isn’t speed, but availability. You’re trying to protect your data store from workloads it can’t handle by avoiding some of that work.</p>

<p>This requires different thought and planning. Scenarios where your cache isn’t available are a <em>system</em> problem, not a user experience one. For example, if your frontend servers lose their in-memory caches during a deploy, the data store will be on its own until the caches refill. If it can’t handle that load spike, you go down.</p>

<p>It’s not uncommon to <em>initially</em> deploy a cache for efficiency reasons, only to have request growth turn it into an “insulate the data store” cache.</p>

<p>For this caching goal, hit rate is more important than raw performance. The overhead of reaching across a network to a dedicated caching server is usually acceptable compared to your app servers needing to refill in-memory caches.</p>

<h1 id="bad-reasons-for-caching">Bad reasons for caching</h1>
<p>There are a few common cases where people <em>think</em> they should be caching, when it’s somewhere between a Band-Aid and harmful.</p>

<h2 id="very-slow-queries">Very slow queries</h2>
<p>Read-through caching cannot “fix” queries that run longer than your app server’s timeout. If your query times are approaching your timeouts, at best a cache will make the error intermittent. That’s better than nothing, but not good.</p>

<p>In these cases, first focus on speeding up the queries. Often queries are either over-fetching data or filtering against columns that aren’t properly indexed.</p>

<h2 id="slow-view-fragments">Slow view fragments</h2>
<p>Similar to the above about queries, sometimes people turn to caching to solve problems with request timeouts during view rendering. Again, this isn’t going to actually solve the problem, just (maybe) push it off for a bit.</p>

<p>Usually, those slow view renderings are database queries in disguise. Look for N+1 query behavior. Are you loading too much data in a page?</p>

<h2 id="it-might-be-slow">It “might” be slow</h2>
<p>Often people slap caching on things they <em>think</em> will be slow. Equally often, they’re wrong. Unless you’ve done analysis to back up the idea, this is premature optimization. You’re ponying up for the costs of caching without being sure they’re worth paying.</p>

<h1 id="the-costs-of-caching">The costs of caching</h1>
<p>Although adding a cache could increase your server costs, money isn’t the most important cost. The most important cost of caching lies with developers and users, not servers. Caches are <em>very</em> hard to reason about, which makes them easy to get wrong, which can cause all kinds of havoc.</p>

<p>In a system without caching, every result simply is what the source of truth says. A user changes a record, that gets committed to the database, the next request shows the new data. Easy peasy.</p>

<p>When caching is involved, stale caches mean what your database tells you isn’t always what users see. This can undermine trust, and in worst cases cause incorrect behavior.</p>

<p>Beyond user-facing consequences, caching means <em>more</em> code. Every data change <em>also</em> needs to worry about caching. Invalidating <em>just the relevant</em> cache keys is a hard problem. Get it wrong and your hit rates plummet, or you serve stale data, or maybe both at once.</p>

<p>Ultimately, we have to weigh the benefits against these costs. Unfortunately, it’s a difficult comparison. The benefits are easily quantifiable and the costs are highly subjective.</p>

<h1 id="the-caching-equation">The caching equation</h1>
<p>When trying to answer “is a cache making things faster”, there’s a relatively simple formula to use. Keep in mind that our cache system is its own data store, with its own access time.</p>

<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">hit_rate</span> <span class="c1"># % of cache lookups with validly cached data</span>
<span class="n">miss_rate</span> <span class="c1"># 100 - hit_rate</span>
<span class="n">lookup_time</span> <span class="c1"># time (ms) to consult the cache (hit or miss)</span>
<span class="n">query_time</span> <span class="c1"># time(ms) to run the query</span>
<span class="n">fractional_hit_cost</span> <span class="o">=</span> <span class="n">hit_rate</span> <span class="o">*</span> <span class="n">lookup_time</span>
<span class="n">fractional_miss_cost</span> <span class="o">=</span> <span class="n">miss_rate</span> <span class="o">*</span> <span class="p">(</span><span class="n">lookup_time</span> <span class="o">+</span> <span class="n">query_time</span><span class="p">)</span>
<span class="n">cached_query_time</span> <span class="o">=</span> <span class="n">fractional_hit_cost</span> <span class="o">+</span> <span class="n">fractional_miss_cost</span>
</code></pre></div></div>

<p>The difference between <code class="language-plaintext highlighter-rouge">cached_query_time</code> and <code class="language-plaintext highlighter-rouge">query_time</code> is how much caching is helping (or possibly hurting) you.</p>

<p>The beauty is that you can plug all this into a spreadsheet and play with the numbers. Fill in what you think the hit rate would be, along with lookup and query times, and you’re able to tweak numbers to explore the benefits before doing any coding. What happens if your hit rate is half what you expect? Or if reading from the cache is twice as expensive? Feeling out the edges of this new system before introducing it is very useful.</p>
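<p>For example, with rates expressed as fractions rather than percentages (the numbers here are made up for illustration):</p>

```ruby
# The caching equation as a function, with rates as fractions (0.0-1.0).
# Example inputs are invented; plug in your own measurements.
def cached_query_time(hit_rate:, lookup_time:, query_time:)
  miss_rate = 1.0 - hit_rate
  hit_rate * lookup_time + miss_rate * (lookup_time + query_time)
end

# A 90% hit rate on a 2ms cache in front of a 50ms query:
cached_query_time(hit_rate: 0.9, lookup_time: 2, query_time: 50)  # => ~7ms
# The same cache with only a 20% hit rate barely helps:
cached_query_time(hit_rate: 0.2, lookup_time: 2, query_time: 50)  # => ~42ms
```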

<h1 id="what-is-a-good-hit-rate">What is a “good” hit rate</h1>
<p>This is a pretty natural question as part of caching decision making, but I think it’s the wrong question.</p>

<p>A better question would be “is caching worth it”, and the hit rate is only part of that.</p>

<p>If the <em>benefit</em> you’re looking for is in user experience, then will caching move users between those 0.1/1/10 second thresholds? How much developer complexity and user uncertainty is it adding?</p>

<p>If the <em>benefit</em> you want is insulating your data store, how effectively is it doing that? How much load are you eliminating? Are you protected in high usage scenarios? A high hit rate cache of an inexpensive query may not be as useful to you as a lower hit rate for an expensive one.</p>

<p>This isn’t to say that hit rates aren’t important; they definitely can be. It’s just that hard and fast numbers aren’t the best way to evaluate this problem.</p>

<h1 id="when-should-i-cache">When should I cache?</h1>
<p>That’s the real question, huh. Unfortunately, like most other complicated things in this discipline, the answer is “it depends”. Hopefully now you’re able to say that too.</p>

<p>You have to weigh the potential user and system benefits against the developer costs in getting caching right, and the business costs when you get it wrong. Those things aren’t reducible to simple math problems, but it definitely helps to go into the situation with them in mind.</p>]]></content><author><name></name></author><category term="caching" /><category term="rails" /><category term="performance" /><summary type="html"><![CDATA[Or: how to talk yourself out of one of the hardest problems in computers]]></summary></entry><entry><title type="html">Understanding MySQL Multiversion Concurrency Control</title><link href="https://awj.dev/database/mysql/2018/09/16/understanding-mysql-multiversion-concurrency-control.html" rel="alternate" type="text/html" title="Understanding MySQL Multiversion Concurrency Control" /><published>2018-09-16T18:02:55+00:00</published><updated>2018-09-16T18:02:55+00:00</updated><id>https://awj.dev/database/mysql/2018/09/16/understanding-mysql-multiversion-concurrency-control</id><content type="html" xml:base="https://awj.dev/database/mysql/2018/09/16/understanding-mysql-multiversion-concurrency-control.html"><![CDATA[<p>MySQL, under the InnoDB storage engine, allows writes and reads of the same row to not interfere with each other. This is one of those features that we use so often it kind of gets taken for granted, but if you think about how you would build such a thing it’s a lot more detailed than it seems. Here, I am going to talk through how that is implemented, as well as some ramifications of the design.</p>

<h1 id="allowing-concurrent-change">Allowing Concurrent Change</h1>
<p>Unsurprisingly, given the title of this post, MySQL’s mechanism for allowing you to simultaneously read and write from the same row is called “Multiversion Concurrency Control”. They (of course) <a href="https://dev.mysql.com/doc/refman/8.0/en/innodb-multi-versioning.html">have documentation</a> on it, but that dives into internal technical details pretty fast.</p>

<p>Instead, let’s talk about it at a little higher level. This concept has been around for a long time (the best I can do hunting down an origin is a <a href="https://dspace.mit.edu/handle/1721.1/16279">thesis</a> from 1979). The overall answer for allowing concurrent reads and writes is pretty simple: writes create new versions of rows, reads see the version that was current when they started.</p>

<h1 id="version-tracking">Version tracking</h1>
<p>Obviously if we’re going to keep track of versions, we need something to differentiate them. This tool needs to distinguish one version from another, but ideally it would also make it easy to decide which version a read operation should see.</p>

<p>In MySQL, this “version enabling thing” is a transaction id. Every transaction gets one. Even your one-shot update queries in the console get one. These ids are incremented in a way that allows MySQL to determine that one transaction started before another. Every table under InnoDB essentially has a “hidden column” that stores the transaction id of the last write operation to change the row. So, in addition to the columns you may have updated, a write operation <em>also</em> marks the row with its transaction id. This allows read operations to know if they can use the row data, or if it has been changed and they need to consult an older version.</p>

<h1 id="reading-older-version">Reading older versions</h1>
<p>For the cases where your read operation hits on rows that have been changed, you’ll need an older version of the data. The transaction id comes into play here too, but there’s more info needed. Every time MySQL writes data into a row, it <em>also</em> writes an entry into the rollback segment. This is a data structure that stores “undo logs” used to restore the row to its previous state. It’s called the “rollback segment” because it is the tool used to handle rolling back transactions.</p>

<p>The rollback segment stores undo logs for each row in the database. Every row has <em>another</em> hidden column that stores the location of the latest undo log entry, which would restore the row to its state prior to the last write. When these entries are created, they are marked with the <em>outgoing</em> transaction id. By walking the undo log for a row and finding the latest entry from a transaction <em>before</em> the reading one, the database can identify the correct data to present.</p>
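<p>Here’s a heavily simplified sketch of that visibility rule. Real InnoDB stores diffs in the rollback segment rather than full row copies, and visibility also accounts for commit state, not just transaction ids; this only illustrates the “walk back until a version predates your transaction” idea.</p>

```ruby
# A heavily simplified sketch of MVCC read visibility. Writes stamp the row
# with their transaction id and push the prior version onto an undo chain;
# a reader walks that chain until it finds a version from a transaction at
# or before its own. (Real InnoDB stores diffs, not full copies, and also
# considers commit state, not just ids.)
Version = Struct.new(:trx_id, :data)

class Row
  def initialize(trx_id, data)
    @versions = [Version.new(trx_id, data)]  # newest version first
  end

  def write(trx_id, data)
    @versions.unshift(Version.new(trx_id, data))
  end

  def read(trx_id)
    @versions.find { |v| v.trx_id <= trx_id }&.data
  end
end

row = Row.new(10, "v1")
row.write(12, "v2")
row.read(11)  # => "v1" -- a reader from transaction 11 can't see 12's write
row.read(13)  # => "v2"
```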

<h1 id="handling-deletes">Handling deletes</h1>
<p>Deletion is handled by a marker in the row to indicate a record was deleted. Delete operations <em>also</em> set the row’s transaction id to their transaction id, so the process above can present a pre-delete version of the row to read operations that started before the delete.</p>

<h2 id="when-are-versions-deleted">When are versions deleted</h2>
<p>MySQL obviously cannot keep a record of every change that happens in the database for all time. It doesn’t need to, though. Undo logs can be removed as soon as the last transaction that could possibly want them completes.</p>

<p>Similarly, rows that have been marked as deleted can be outright abandoned once the oldest active transaction is one that started after the deletion. These rows and undo logs are physically removed, reclaiming their disk space, by a “purge” operation that runs in its own background thread.</p>

<h1 id="what-about-indexes">What about indexes</h1>
<p>So, to recap, MySQL handles versions by keeping the row constantly up to date and storing diffs for as long as currently running queries need them. That’s only half the story, though: indexes need to support consistent reads as well. Primary key indexes work much like the above description for actual database rows. Secondary indexes are a little different.</p>

<p>MySQL handles this in two ways: pages of index entries are marked by the last transaction id to write in them, and individual index entries have delete markers. When an update changes an indexed column, three things happen: the current index entry is delete marked, a new entry for the updated value is written, and that new entry’s index page is marked with the transaction id.</p>

<p>Read operations that find non-delete-marked entries in pages that predate their transaction id use that index data directly. If the operation finds either a delete marker or a newer transaction id, it looks up the row and traverses the undo log to find the appropriate value to use.</p>

<p>Similar to the purging of deleted rows from expired transactions, delete-marked index entries are also eventually reclaimed. Because there is always a fresh new entry to work with <em>somewhere</em> in the index, MySQL can be a little more aggressive at cleaning those up.</p>

<h1 id="what-do-i-do-with-this-information">What do I do with this information?</h1>
<p>So, given the above, what can we take home to make our lives better? A few things. Keep in mind with all of this that database performance can be very difficult to analyze. Each point below is just one potential piece of the story of what could be happening with your data.</p>

<ul>
<li>Big transactions are painful: Long-running transactions don’t just tie up a connection; they force the database to preserve history for longer. If that transaction is reading through a large swath of the database, subsequent writes will force it to read the rollback log, which may be in a different page of memory or even on disk.</li>
  <li>Multi-statement transactions need to commit quickly: This is another variety of “big transactions are painful”, but it’s worth calling out. MySQL does not “kill” active transactions. If you open a transaction, query out data, then spend two hours in application code before committing, MySQL will faithfully preserve undo history for two hours. Every moment of an open transaction forces more undo history. Commit as quickly as you can and do your processing outside of transactions whenever possible.</li>
  <li>Writes make index scans less useful. The whole point of an index is to answer questions about your data without actually looking at your data. Delete markers on index entries, and transaction stamps on index pages, force the database to read your data. Think carefully about using composite indexes with columns you aren’t querying. Your queries will pay the price for updates to those columns anyways.</li>
  <li>Rapid fire writes magnify the penalties for reads. If you have a lot of data to write, especially to the <em>same row</em>, write it in chunks instead of one-by-one. Each write generates a transaction id, relevant undo logs, and makes a mess of secondary indexes. Chunking writes together increases the chances that reads will find valid index data, and lowers the size of undo logs they have to wade through. There’s an opposite extreme where one big write might have too much data, so it’s important to look for the happy middle ground here.</li>
<li>“Hot” rows are hot for all columns, not just updated ones. A row that stores a frequently updated counter forces more row transaction id updates and undo log entries. Queries that start before the counter is incremented, even if they don’t use the counter, still have to traverse undo logs to recover the row state from when they started. That same logic applies to frequently updated timestamps that aggregate change times across a row’s relations. If possible, batch those updates beforehand or consider storing them in a separate table you can join to when needed.</li>
  <li>Consider separating reporting from direct/application use reads. Reporting queries tend to scan large sections of the database. They take a long time and thus force the preservation and consumption of more undo history. Most application behavior is more direct: it knows specific records to retrieve and goes straight for them. If you’re already using read replicas, consider dedicating one to reporting so that your application queries don’t pay the undo storage penalties of reporting.</li>
</ul>
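<p>The “write it in chunks” advice can be as simple as batching rows before handing them to your data layer. <code>bulk_update</code> below is a hypothetical stand-in for whatever multi-row write your stack provides (e.g. a single multi-row <code>UPDATE</code>); the point is the shape, one statement (and one transaction id’s worth of undo history) per batch instead of one per row:</p>

```ruby
# Sketch of chunked writes. `bulk_update` is a hypothetical stand-in for a
# multi-row write in your data layer; batching means one statement per
# chunk instead of one statement per row.
def bulk_update(rows)
  # imagine: a single multi-row UPDATE built from `rows`
  rows.size
end

updates = (1..2_000).map { |id| [id, id * 2] }

# 4 statements (and transactions) instead of 2,000 row-by-row writes
batches = updates.each_slice(500).map { |batch| bulk_update(batch) }
batches.size  # => 4
```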

<h1 id="final-thoughts">Final thoughts</h1>
<p>It’s worth noting that this is not the only way to implement MVCC. PostgreSQL handles this task by storing the minimum and maximum transaction ids where a row should be visible. Under that scheme, updating a row sets the maximum transaction id on it and creates an entirely new row entry by copying the original and performing the update in that copy. This avoids the need for undo logs, but at the cost of copying all row data for each update.</p>

<p>The point here is that, for cases where you are really trying to push database performance, understanding a few of the bigger details of the internals can pay off. Many of the takeaways I listed are only going to be applicable in extreme use cases, but in those cases knowing how the database goes about versioning data can make understanding performance problems easier.</p>]]></content><author><name></name></author><category term="database" /><category term="mysql" /><summary type="html"><![CDATA[In which we figure out how MySQL writes new data without showing it in old queries.]]></summary></entry><entry><title type="html">Understanding the Elasticsearch Percolator</title><link href="https://awj.dev/elasticsearch/2018/04/24/understanding-the-elasticsearch-percolator.html" rel="alternate" type="text/html" title="Understanding the Elasticsearch Percolator" /><published>2018-04-24T18:02:55+00:00</published><updated>2018-04-24T18:02:55+00:00</updated><id>https://awj.dev/elasticsearch/2018/04/24/understanding-the-elasticsearch-percolator</id><content type="html" xml:base="https://awj.dev/elasticsearch/2018/04/24/understanding-the-elasticsearch-percolator.html"><![CDATA[<p>Elasticsearch is a powerful, feature-packed tool. Their <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html">documentation</a> is great, but some pieces are a bit … out there. Beyond that, some of the functionality has changed significantly over the years, so third-party explanations might no longer be accurate.</p>

<p>One fantastic feature that is both unusual and has changed a lot is percolation. I’m going to try to explain that feature, in the context of its current implementation (version 6.2.4). You’ll need a basic understanding of Elasticsearch, specifically <a href="https://www.elastic.co/guide/en/elasticsearch/reference/6.2/mapping.html">mappings</a> and <a href="https://www.elastic.co/guide/en/elasticsearch/reference/6.2/search-request-body.html">search</a>.</p>

<h1 id="the-concept">The Concept</h1>
<p>The normal workflow for Elasticsearch is to store documents (as JSON data) in an index, and execute searches (also JSON data) to ask the index about those documents.</p>

<p>Succinctly, percolation reverses that. You store searches and use documents to ask the index about those searches. That’s true, but it’s not particularly actionable information. How percolators are structured has evolved over the years, to the point where we can give a more useful explanation.</p>

<p>Now, percolation revolves around the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/6.2/percolator.html">percolator</a> mapping field type. This is like any other field type, except that it expects you to assign a search document as the value. When you store data, the index processes this search document into an executable form and saves it for later.</p>

<p>The <a href="https://www.elastic.co/guide/en/elasticsearch/reference/6.2/query-dsl-percolate-query.html">Percolate Query</a> takes one or more documents and limits results to ones whose stored searches match at least one document. When searching, the percolate query works like any other query element.</p>

<h1 id="in-depth">In Depth</h1>
<p>Under the hood, this is implemented in about the way you would expect: indexes with percolate fields keep a hidden (in-memory) index. Documents listed in your percolate queries are first put in that index, then a normal query is executed against that index to see if the original percolate-field-bearing document matches.</p>

<p>An important point to remember is that this hidden index gets its mappings from the original percolator index. So indexes used for percolate queries need to have mappings appropriate for the original data and the query document data.</p>

<p>This introduces a bit of a management problem, in that your index data and the percolate query documents could use the same field in different ways. A simple answer to that is to use the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/6.2/object.html">object type</a> to isolate the percolate-relevant mappings from normal document mappings.</p>

<p>Assuming the queries you are using were originally written for another index of actual documents, it makes the most sense to isolate the data going directly into the percolate index and give the root level over to mapping definitions for percolate query documents.</p>

<p>Also, because percolate fields are parsed into searches and saved at index time, you likely will need to reindex percolate documents after upgrading to take advantage of any optimizations to the system.</p>

<h1 id="an-example">An Example</h1>
<p>In my opinion, percolator examples are one of the prime contributors to making the tool hard to understand. They tend to be too simple, to the point where it’s hard to distinguish the parts.</p>

<p>In this example, we’re going to build out an index of saved term and price searches for toys. The idea behind it is that users should be able to put in a search term and a max price, then get notified as soon as something matching that term goes below this price. Users should also be able to turn these notifications on and off.</p>

<p>The mapping below implements a percolate index to support this feature. Fields related to the saved search itself are in the <code class="language-plaintext highlighter-rouge">search</code> object, while fields related to the original toys live at the root level of the mappings.</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"mappings"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"_doc"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="nl">"properties"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
        </span><span class="nl">"search"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
          </span><span class="nl">"properties"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
            </span><span class="nl">"query"</span><span class="p">:</span><span class="w">   </span><span class="p">{</span><span class="w"> </span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"percolator"</span><span class="w"> </span><span class="p">},</span><span class="w">
            </span><span class="nl">"user_id"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"integer"</span><span class="w"> </span><span class="p">},</span><span class="w">
            </span><span class="nl">"enabled"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"boolean"</span><span class="w"> </span><span class="p">}</span><span class="w">
          </span><span class="p">},</span><span class="w">
        </span><span class="p">},</span><span class="w">
        </span><span class="nl">"price"</span><span class="p">:</span><span class="w">       </span><span class="p">{</span><span class="w"> </span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"float"</span><span class="w"> </span><span class="p">},</span><span class="w">
        </span><span class="nl">"description"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"text"</span><span class="w"> </span><span class="p">}</span><span class="w">
      </span><span class="p">}</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>Here is what a document that represents a stored search would look like:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"_id"</span><span class="p">:</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w">
  </span><span class="nl">"search"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"user_id"</span><span class="p">:</span><span class="w"> </span><span class="mi">5</span><span class="p">,</span><span class="w">
    </span><span class="nl">"enabled"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span><span class="w">
    </span><span class="nl">"query"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="nl">"bool"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
        </span><span class="nl">"filter"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
          </span><span class="p">{</span><span class="w"> 
            </span><span class="nl">"match"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> 
              </span><span class="nl">"description"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"query"</span><span class="p">:</span><span class="w"> </span><span class="s2">"nintendo switch"</span><span class="w"> </span><span class="p">}</span><span class="w">
            </span><span class="p">}</span><span class="w">
          </span><span class="p">},</span><span class="w">
          </span><span class="p">{</span><span class="w"> </span><span class="nl">"range"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"price"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"lte"</span><span class="p">:</span><span class="w"> </span><span class="mf">300.00</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="p">}</span><span class="w">
        </span><span class="p">]</span><span class="w">
      </span><span class="p">}</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>Note that we only store actual data inside the <code class="language-plaintext highlighter-rouge">search</code> object field. The top-level mappings for <code class="language-plaintext highlighter-rouge">price</code> and <code class="language-plaintext highlighter-rouge">description</code> exist only so the percolator can understand the fields that the stored queries reference.</p>

<p>At query time, we want to use both the plain object fields and the “special” percolator field. This query checks which of a user’s currently-enabled searches match the document:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"query"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"bool"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="nl">"filter"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
        </span><span class="p">{</span><span class="w">
          </span><span class="nl">"percolate"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
            </span><span class="nl">"field"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="s2">"search.query"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"document"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
              </span><span class="nl">"description"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="s2">"Nintendo Switch"</span><span class="p">,</span><span class="w">
              </span><span class="nl">"price"</span><span class="p">:</span><span class="w"> </span><span class="mf">250.00</span><span class="w">
            </span><span class="p">}</span><span class="w">
          </span><span class="p">}</span><span class="w">
        </span><span class="p">},</span><span class="w">
        </span><span class="p">{</span><span class="w"> </span><span class="nl">"term"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"search.enabled"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="p">},</span><span class="w">
        </span><span class="p">{</span><span class="w"> </span><span class="nl">"term"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"search.user_id"</span><span class="p">:</span><span class="w"> </span><span class="mi">5</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="p">}</span><span class="w">
      </span><span class="p">]</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>Note that it combines percolate matching of the document against the stored queries with regular term queries, which limit the stored searches we evaluate to those that are enabled and belong to the given user.</p>
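<p>Incidentally, the document being percolated does not have to be provided inline. A percolate query can instead reference a document that already exists in the cluster by naming its index and id. A hedged sketch, where the <code class="language-plaintext highlighter-rouge">products</code> index name and the id <code class="language-plaintext highlighter-rouge">42</code> are placeholder values:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
  "query": {
    "percolate": {
      "field": "search.query",
      "index": "products",
      "id": "42"
    }
  }
}
</code></pre></div></div>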

<h1 id="some-additional-thoughts">Some Additional Thoughts</h1>
<p>Because resolving a percolate filter means running every candidate stored query against the document, you may need to pay extra attention to shards and replicas for a percolate index. Each additional shard reduces the number of search-bearing documents, and therefore the number of stored queries, that any one node has to evaluate.</p>
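<p>Note that the primary shard count is fixed when the index is created. A hedged, console-style sketch of creating a percolator-bearing index with extra primary shards, where the index name and the counts are placeholder values you would tune for your own workload:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code>PUT /stored-searches
{
  "settings": {
    "number_of_shards": 4,
    "number_of_replicas": 1
  }
}
</code></pre></div></div>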

<p>Percolate queries also have an option to fetch the document to test from another index inside the cluster. That fetch takes the form of a literal GET request, so there’s not much benefit in trying to keep shards from the two indices on the same nodes.</p>]]></content><author><name></name></author><category term="elasticsearch" /><summary type="html"><![CDATA[In which we flip search on its head and save *searches* then search with *documents*.]]></summary></entry></feed>