<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://awj.dev/feed.xml" rel="self" type="application/atom+xml" /><link href="https://awj.dev/" rel="alternate" type="text/html" /><updated>2025-10-03T21:37:20+00:00</updated><id>https://awj.dev/feed.xml</id><title type="html">Adam Jones</title><subtitle>Me writing about things that I hopefully know something useful about.</subtitle><entry><title type="html">Understanding database indices by (poorly) implementing one</title><link href="https://awj.dev/ruby/databases/performance/2023/05/20/understanding-database-indexes-by-implementing-one.html" rel="alternate" type="text/html" title="Understanding database indices by (poorly) implementing one" /><published>2023-05-20T20:59:12+00:00</published><updated>2023-05-20T20:59:12+00:00</updated><id>https://awj.dev/ruby/databases/performance/2023/05/20/understanding-database-indexes-by-implementing-one</id><content type="html" xml:base="https://awj.dev/ruby/databases/performance/2023/05/20/understanding-database-indexes-by-implementing-one.html"><![CDATA[<p>There’s a lot of misconceptions about database indices. These exist, in part, because people are missing the context needed to imagine how a database uses them. There’s <em>a lot</em> to learn to establish that context. Too much for one blog post. But, we can try to bootstrap off what’s already familiar to help develop a better understanding.</p>

<p>To do that, we’re going to implement a fake database index in Ruby. It will be <em>woefully</em> incomplete, but still should be enough to give an idea of what’s happening.</p>

<h2 id="warning">Warning</h2>

<p>What you’ll see here is not, <em>actually</em>, how database indices work. It’s an extremely crude approximation. I try to call out where and how that approximation isn’t valid. If you encounter anything in an actual database that doesn’t match up with what you see here, I encourage you to take that as an opportunity to dive in and learn more.</p>

<h1 id="making-things-very-simple">Making things <em>very</em> simple</h1>

<p>We’re going to build our fake index out of the humble Ruby <code class="language-plaintext highlighter-rouge">Hash</code>. Those are pretty familiar, right? Store data by key and value, then you can later retrieve the value by providing the key. If you don’t have a key, you’re basically just working with a more expensive variant of an <code class="language-plaintext highlighter-rouge">Array</code>. Ironically, under the hood a Ruby <code class="language-plaintext highlighter-rouge">Hash</code> uses a lot of the same concepts and data structures as database indices. Anyways, this will be our substitute for writing actual data structure code.</p>

<p>We’ll only support <em>unique</em> indices. It’s possible, but messy, for us to support non-unique ones. I just don’t think it’s going to teach much you won’t already learn here. We <em>will</em> support composite indices, and will get into covering queries that only use some of the index columns.</p>

<p>Probably the biggest query-time thing separating our index from a real one will be lack of support for range queries. So no <code class="language-plaintext highlighter-rouge">WHERE X &gt; 0</code> style queries for our index. We’re ignoring this because hashes don’t make it easy to do efficiently, and I don’t think implementing it will tell you much that direct value lookups don’t. Real database indices <em>absolutely</em> are able to handle these for many different data types.</p>

<h2 id="the-index-class">The Index class</h2>

<p>We’ll start with a class named <code class="language-plaintext highlighter-rouge">Index</code>, which will be the core of our code here. We will “implement” different SQL queries as Ruby code written in terms of this <code class="language-plaintext highlighter-rouge">Index</code> class.</p>

<p>We use <code class="language-plaintext highlighter-rouge">Index.declare</code> to create an (empty) index on a list of columns. Then we can add data to it by looping through the data and calling <code class="language-plaintext highlighter-rouge">Index#add</code>.</p>

<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Allow us to efficiently answer questions about a large amount of data based on</span>
<span class="c1"># specific column(s) in it.</span>
<span class="k">class</span> <span class="nc">Index</span>
  <span class="c1"># The column this index is handling.</span>
  <span class="nb">attr_reader</span> <span class="ss">:column</span>

  <span class="c1"># The columns that come *after* this one in the index.</span>
  <span class="c1">#</span>
  <span class="c1"># If this list is empty, we're at the "end" of the index column list and</span>
  <span class="c1"># should store row ids as our Hash values.</span>
  <span class="c1">#</span>
  <span class="c1"># If it is *not* empty, we make an `Index` class that deals with those</span>
  <span class="c1"># columns and use it as our Hash value.</span>
  <span class="nb">attr_reader</span> <span class="ss">:subsequent_columns</span>

  <span class="c1"># The Hash that represents actual index content. I'm avoiding calling this</span>
  <span class="c1"># `data` because it's *not* the actual data we're indexing. Confusing</span>
  <span class="c1"># terminology.</span>
  <span class="nb">attr_reader</span> <span class="ss">:content</span>

  <span class="k">def</span> <span class="nf">initialize</span><span class="p">(</span><span class="n">column</span><span class="p">,</span> <span class="n">subsequent_columns</span> <span class="o">=</span> <span class="p">[])</span>
    <span class="vi">@column</span> <span class="o">=</span> <span class="n">column</span>
    <span class="vi">@subsequent_columns</span> <span class="o">=</span> <span class="n">subsequent_columns</span>
    <span class="vi">@content</span> <span class="o">=</span> <span class="p">{}</span>
  <span class="k">end</span>

  <span class="c1"># Are we the final column of the index? If so, our answers should be data id</span>
  <span class="c1"># values instead of another `Index`</span>
  <span class="k">def</span> <span class="nf">leaf?</span>
    <span class="vi">@subsequent_columns</span><span class="p">.</span><span class="nf">empty?</span>
  <span class="k">end</span>

  <span class="c1"># "Index" a piece of data. It's assumed that this data is functionally a Hash</span>
  <span class="c1"># that contains at least `:id` and whatever value we hvae for `column`.</span>
  <span class="k">def</span> <span class="nf">add</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
    <span class="n">value</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="n">column</span><span class="p">]</span>
    <span class="k">if</span> <span class="n">leaf?</span>
      <span class="vi">@content</span><span class="p">[</span><span class="n">value</span><span class="p">]</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="ss">:id</span><span class="p">]</span>
    <span class="k">else</span>
      <span class="c1"># If we are *not* the final column, create a new Index to represent the</span>
      <span class="c1"># slice of data that all shares the same value for our `column`. This</span>
      <span class="c1"># index should use the *next* subsequent column, and needs to know about</span>
      <span class="c1"># the *rest* of the subsequent columns in case it too is not the final</span>
      <span class="c1"># one.</span>
      <span class="vi">@content</span><span class="p">[</span><span class="n">value</span><span class="p">]</span> <span class="o">||=</span> <span class="no">Index</span><span class="p">.</span><span class="nf">new</span><span class="p">(</span><span class="n">subsequent_columns</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">subsequent_columns</span><span class="p">.</span><span class="nf">drop</span><span class="p">(</span><span class="mi">1</span><span class="p">))</span>
      <span class="vi">@content</span><span class="p">[</span><span class="n">value</span><span class="p">].</span><span class="nf">add</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
    <span class="k">end</span>
  <span class="k">end</span>
<span class="k">end</span>
</code></pre></div></div>
<h1 id="what-we-can-learn-about-database-indices">What we can learn about database indices</h1>

<p>Surprisingly, even with just this much code we can draw an important and useful inference about working with indices. The “natural flow” of accessing this data is going to be along the path dictated by the columns. Our index also can’t answer questions involving columns that weren’t indexed.</p>

<p>It’s easy to imagine navigating this in column order, but <em>other</em> orders seem like a bigger challenge. Databases are full of clever optimizations that can <em>sometimes</em> make out-of-order usage possible, but generally speaking you want things to happen in-order.</p>

<h1 id="sample-data">Sample data</h1>

<p>To play with this, we’ll work on sample data taken from the <a href="https://www.census.gov/data/tables/time-series/demo/popest/2020s-total-cities-and-towns.html#v2022">US Census Bureau City and Town Population Totals</a>. This is a list of ~20k cities in the US with their estimated population.</p>

<p>For the purposes of this post, I have <a href="https://awj.dev/static/city_populations_2022.csv">cleaned it up</a> into a CSV, with state names extracted.</p>

<p>We’re going to assume here that the combination of the <code class="language-plaintext highlighter-rouge">city</code> and <code class="language-plaintext highlighter-rouge">state</code> columns makes a record unique. That isn’t <em>strictly</em> true for this data, but again it makes it easier to work with.</p>

<h1 id="harnass-code">Harnass code</h1>

<p>The following code is enough to get us started in an IRB session. It assumes the above code snippet is available locally as <code class="language-plaintext highlighter-rouge">./index.rb</code>, and the CSV can be found at <code class="language-plaintext highlighter-rouge">./city_populations_2022.csv</code>.</p>

<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">require</span> <span class="s2">"csv"</span>

<span class="nb">load</span> <span class="s2">"index.rb"</span>

<span class="c1"># Load the CSV, converting integer values as we go</span>
<span class="n">csv</span> <span class="o">=</span> <span class="no">CSV</span><span class="p">.</span><span class="nf">read</span><span class="p">(</span><span class="s2">"./city_populations_2022.csv"</span><span class="p">,</span> <span class="ss">headers: </span><span class="kp">true</span><span class="p">,</span> <span class="ss">converters: </span><span class="p">[</span><span class="ss">:integer</span><span class="p">,</span> <span class="ss">:all</span><span class="p">,</span> <span class="ss">:all</span><span class="p">,</span> <span class="ss">:all</span><span class="p">,</span> <span class="ss">:integer</span><span class="p">])</span>

<span class="c1"># Store our CSV in an Array where the values are hashes of the row</span>
<span class="c1"># data. This will simulate the actual database table.</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">csv</span><span class="p">.</span><span class="nf">map</span><span class="p">(</span><span class="o">&amp;</span><span class="ss">:to_h</span><span class="p">);</span> <span class="kp">nil</span>

<span class="c1"># Declare an index on state and city, in that order</span>
<span class="n">index</span> <span class="o">=</span> <span class="no">Index</span><span class="p">.</span><span class="nf">declare</span><span class="p">(</span><span class="s2">"state"</span><span class="p">,</span> <span class="s2">"city"</span><span class="p">)</span>

<span class="n">data</span><span class="p">.</span><span class="nf">each</span> <span class="k">do</span> <span class="o">|</span><span class="n">row</span><span class="o">|</span>
  <span class="n">index</span><span class="p">.</span><span class="nf">add</span><span class="p">(</span><span class="n">row</span><span class="p">)</span>
<span class="k">end</span><span class="p">;</span> <span class="kp">nil</span>
</code></pre></div></div>

<p>If we were to discard the index class and <em>just</em> look at things as nested hashes, our index would look like this:</p>

<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span>
  <span class="s2">"California"</span> <span class="o">=&gt;</span> <span class="p">{</span>
    <span class="s2">"Los Angeles"</span> <span class="o">=&gt;</span> <span class="mi">1444</span> <span class="c1"># 1444 is the row id for this city</span>
  <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<h2 id="finding-a-row-by-state-and-city">Finding a row by state and city</h2>

<p>We’ll start out simple: given a city and state, look up the row. We’ll try it out with Los Angeles, California. In SQL, this would be: <code class="language-plaintext highlighter-rouge">SELECT * FROM populations WHERE state = 'California' AND city = 'Los Angeles'</code></p>

<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">state</span> <span class="o">=</span> <span class="n">index</span><span class="p">.</span><span class="nf">content</span><span class="p">[</span><span class="s2">"California"</span><span class="p">]</span>
<span class="n">city</span> <span class="o">=</span> <span class="n">state</span><span class="p">.</span><span class="nf">content</span><span class="p">[</span><span class="s2">"Los Angeles"</span><span class="p">]</span>

<span class="c1"># Our `id` values don't exactly correspond to Array offsets, so we have to do this.</span>
<span class="n">data</span><span class="p">[</span><span class="n">city</span> <span class="o">-</span> <span class="mi">1</span><span class="p">]</span>
</code></pre></div></div>

<h1 id="what-we-can-learn-about-database-indices-1">What we can learn about database indices</h1>

<h4 id="index-ordering">Index ordering</h4>

<p>Notice how we’re starting the lookup with the <code class="language-plaintext highlighter-rouge">state</code>? That’s because it’s the “beginning” of the index.</p>

<p>Imagine if we tried to start with the <code class="language-plaintext highlighter-rouge">city</code> first. What would that code look like? It would have to dig through <em>every value</em> in the <code class="language-plaintext highlighter-rouge">state</code> index to get at cities, then work its way backwards.</p>

<p>Often, your database effectively can’t do this. There’s too much data involved, and simply keeping track of everything you’ve looked at could cause problems. Plus “examine the entire index” isn’t going to be a fast operation. It might pursue this strategy if you give it no better option, but you <em>really</em> want to give it better options.</p>
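<p>To make that concrete, here’s a sketch (in terms of the <code class="language-plaintext highlighter-rouge">index</code> we built above) of what a “city first” lookup would be forced to do:</p>

<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Sketch: looking up a city without knowing its state. With no state to
# start from, we have to probe every state's sub-index.
matches = []
states_examined = 0

index.content.each do |state_name, cities|
  states_examined += 1
  row_id = cities.content["Los Angeles"]
  matches.push([state_name, row_id]) if row_id
end; nil

# We touched every state entry in the index to answer a one-city question.
[matches, states_examined]
</code></pre></div></div>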

<h4 id="row-lookup">Row lookup</h4>

<p>Notice how, to return the data, we had to go to our “table” that is stored in <code class="language-plaintext highlighter-rouge">data</code>? That’s called a “row lookup”. Real databases almost always store the index data and row data separately, so row lookups have additional overhead that we want to be careful with.</p>

<p>Often, optimizing SQL queries is a process of trying to avoid any more row lookups than strictly necessary.</p>

<h2 id="finding-the-total-population-of-a-state">Finding the total population of a state</h2>

<p>Ok, now let’s try another likely task: finding the total population of a state. We’ll go with Idaho this time. In SQL this would look like <code class="language-plaintext highlighter-rouge">SELECT sum(population) FROM populations WHERE state = 'Idaho'</code>.</p>

<p>At first glance it might not look like our index is helpful here, but it still is. Here’s code to get this <em>without</em> the index:</p>

<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sum</span> <span class="o">=</span> <span class="mi">0</span>

<span class="c1"># Let's keep track of how many times we had to go fetch a row. This is</span>
<span class="c1"># important, because row lookups are expensive.</span>
<span class="n">rows_examined</span> <span class="o">=</span> <span class="mi">0</span>

<span class="c1"># Notice: we are visiting *every* row in the data. If we had millions or</span>
<span class="c1"># billions of rows, this would be really bad.</span>
<span class="n">data</span><span class="p">.</span><span class="nf">each</span> <span class="k">do</span> <span class="o">|</span><span class="n">row</span><span class="o">|</span>
  <span class="n">rows_examined</span> <span class="o">+=</span> <span class="mi">1</span>
  <span class="k">next</span> <span class="k">unless</span> <span class="n">row</span><span class="p">[</span><span class="s2">"state"</span><span class="p">]</span> <span class="o">==</span> <span class="s2">"Idaho"</span>
  
  <span class="n">sum</span> <span class="o">+=</span> <span class="n">row</span><span class="p">[</span><span class="s2">"population"</span><span class="p">]</span>
<span class="k">end</span><span class="p">;</span> <span class="kp">nil</span>

<span class="p">[</span><span class="n">sum</span><span class="p">,</span> <span class="n">rows_examined</span><span class="p">]</span> <span class="c1"># =&gt; [1302154, 19692]</span>
</code></pre></div></div>

<p>So we got our sum, and it was <em>probably</em> fast on your computer (reminder: this is a tiny amount of data), but we had to look at every single row in the data. Usually, “we have to look at every row in the entire table” is one of the absolute <em>worst</em> things you can see your database doing.</p>

<p>So how can we use our index? We don’t have a ready list of “the names of every city in Idaho”, so we can’t just plug that in as keys once we get to the <code class="language-plaintext highlighter-rouge">Idaho</code> index. But, we <em>do</em> have the ability to traverse a <code class="language-plaintext highlighter-rouge">Hash</code> by <em>values</em>. So we can still use our index to help us get to the state of Idaho, then crawl through its contents to find the total population.</p>

<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sum</span> <span class="o">=</span> <span class="mi">0</span>

<span class="c1"># Again, we're tracking rows</span>
<span class="n">rows_examined</span> <span class="o">=</span> <span class="mi">0</span>

<span class="n">state</span> <span class="o">=</span> <span class="n">index</span><span class="p">.</span><span class="nf">content</span><span class="p">[</span><span class="s1">'Idaho'</span><span class="p">]</span>
<span class="n">state</span><span class="p">.</span><span class="nf">content</span><span class="p">.</span><span class="nf">values</span><span class="p">.</span><span class="nf">each</span> <span class="k">do</span> <span class="o">|</span><span class="n">row_id</span><span class="o">|</span>
  <span class="n">city</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="n">row_id</span> <span class="o">-</span> <span class="mi">1</span><span class="p">]</span>
  <span class="n">rows_examined</span> <span class="o">+=</span> <span class="mi">1</span>
  <span class="n">sum</span> <span class="o">+=</span> <span class="n">city</span><span class="p">[</span><span class="s2">"population"</span><span class="p">]</span>
<span class="k">end</span><span class="p">;</span> <span class="kp">nil</span>

<span class="p">[</span><span class="n">sum</span><span class="p">,</span> <span class="n">rows_examined</span><span class="p">]</span> <span class="c1"># =&gt; [1302154, 199]</span>
</code></pre></div></div>
<p>So now we have the <em>same</em> sum, but we looked at roughly 1% of the rows. That’s a <em>huge</em> win.</p>

<h1 id="what-we-can-learn-about-database-indices-2">What we can learn about database indices</h1>

<p>Databases don’t just use indices for cases where they have every single relevant key. It’s a data structure that they can dig through, and that can help significantly.</p>

<p>Sometimes they do this by “skipping over” intermediate keys to get to the final rows, like what we did here. It’s worth noticing that this was only possible because our index was defined as <code class="language-plaintext highlighter-rouge">(state, city)</code>. If it had been <code class="language-plaintext highlighter-rouge">(city, state)</code>, we would have had to examine every single city name to see if it was in the state. That’s usually still <em>better</em> than crawling every row of the data, but it’s nowhere near as good as what we just experienced.</p>

<p>When you’re defining a composite index, it’s <em>really</em> important to think about the cases where you might end up querying only some of those columns. Getting the column order right will maximize the value you get out of the database’s work in maintaining the index.</p>
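<p>As a sketch, we can actually build that worse <code class="language-plaintext highlighter-rouge">(city, state)</code> index (purely for illustration) and watch how much of it we have to touch:</p>

<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical (city, state) index, built just to show the bad ordering.
city_first = Index.declare("city", "state")
data.each do |row|
  city_first.add(row)
end; nil

idaho_ids = []
keys_examined = 0

# The state we want is buried one level down, so every city key has to
# be examined on the way.
city_first.content.each_value do |states|
  keys_examined += 1
  row_id = states.content["Idaho"]
  idaho_ids.push(row_id) if row_id
end; nil

[idaho_ids.length, keys_examined]
</code></pre></div></div>

<p>There are only ~50 state keys but thousands of distinct city names, so the key count alone tells you which ordering wins for state-scoped queries.</p>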

<h2 id="a-new-index-for-even-faster-population-totals">A new index for even faster population totals</h2>

<p>Let’s say this kind of population query is extremely important, and we’ve found the above “only accessing 1% of the rows” to <em>still</em> be too slow for our needs. What can an index do for us?</p>

<p>We’ve done more or less everything we can with our existing index. If our system supported non-unique indices, we could make an index on just <code class="language-plaintext highlighter-rouge">state</code> that would allow us to directly jump into rows, but it wouldn’t change the number of rows we’re looking at.</p>
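<p>For illustration only (our <code class="language-plaintext highlighter-rouge">Index</code> class doesn’t support this), a non-unique index is often pictured as keys mapping to <em>lists</em> of row ids:</p>

<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Sketch of a non-unique index on just "state": each key holds an Array
# of row ids instead of a nested Index.
state_only = Hash.new do |hash, key|
  hash[key] = []
end

data.each do |row|
  state_only[row["state"]].push(row["id"])
end; nil

# Jumping to the row ids is now direct...
state_only["Idaho"].length
</code></pre></div></div>

<p>We’d land directly on 199 row ids, but that’s still 199 row lookups to get the populations.</p>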

<p>Let’s build <em>another</em> index, one that extends our previous index with population values. So it would look like <code class="language-plaintext highlighter-rouge">(state, city, population)</code>. Here’s how:</p>

<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">population_index</span> <span class="o">=</span> <span class="no">Index</span><span class="p">.</span><span class="nf">declare</span><span class="p">(</span><span class="s2">"state"</span><span class="p">,</span> <span class="s2">"city"</span><span class="p">,</span> <span class="s2">"population"</span><span class="p">)</span>

<span class="n">data</span><span class="p">.</span><span class="nf">each</span> <span class="k">do</span> <span class="o">|</span><span class="n">row</span><span class="o">|</span>
  <span class="n">population_index</span><span class="p">.</span><span class="nf">add</span><span class="p">(</span><span class="n">row</span><span class="p">)</span>
<span class="k">end</span><span class="p">;</span> <span class="kp">nil</span>
</code></pre></div></div>

<p>Because <code class="language-plaintext highlighter-rouge">state+city</code> was already unique, <code class="language-plaintext highlighter-rouge">state+city+population</code> is also going to be.</p>

<p>Here’s a sketch of it as a Hash:</p>

<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span>
  <span class="s2">"California"</span> <span class="o">=&gt;</span> <span class="p">{</span>
    <span class="s2">"Los Angeles"</span> <span class="o">=&gt;</span> <span class="p">{</span>
      <span class="c1"># NOTE: This "population" Hash will always be a single key (the</span>
      <span class="c1"># population) pointing to the row id.</span>
      <span class="mi">3898767</span> <span class="o">=&gt;</span> <span class="mi">1444</span>
    <span class="p">}</span>
  <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This index can give us our population total <em>without touching a single row</em>!</p>

<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sum</span> <span class="o">=</span> <span class="mi">0</span>

<span class="n">state</span> <span class="o">=</span> <span class="n">population_index</span><span class="p">.</span><span class="nf">content</span><span class="p">[</span><span class="s2">"Idaho"</span><span class="p">]</span>

<span class="n">state</span><span class="p">.</span><span class="nf">content</span><span class="p">.</span><span class="nf">values</span><span class="p">.</span><span class="nf">each</span> <span class="k">do</span> <span class="o">|</span><span class="n">city</span><span class="o">|</span>
  <span class="n">sum</span> <span class="o">+=</span> <span class="n">city</span><span class="p">.</span><span class="nf">content</span><span class="p">.</span><span class="nf">keys</span><span class="p">.</span><span class="nf">sum</span>
<span class="k">end</span><span class="p">;</span> <span class="kp">nil</span>

<span class="n">sum</span> <span class="c1"># =&gt; 1302154</span>
</code></pre></div></div>

<p>Notice how <code class="language-plaintext highlighter-rouge">data</code> is not even mentioned in this code. We’re answering queries <em>just</em> from the index content!</p>

<h1 id="what-we-can-learn-about-database-indices-3">What we can learn about database indices</h1>

<p>Since our index reflects the underlying data, we can use the index contents <em>in place of</em> the actual data. Databases use this trick <em>a lot</em>, and it’s an incredibly effective optimization.</p>

<p>It’s generally safe to assume that your data on disk isn’t organized in a way that makes any particular lookup effective. Earlier, when we read 199 rows to get our data, it’s safe to assume that none of those rows lived next to each other in a way that allowed the operating system to avoid doing 199 disk reads.</p>

<p>By comparison, even when the index is serialized to disk, all of the relevant bits of information live closer together. It’s very likely that reading the disk block that gave us one relevant <code class="language-plaintext highlighter-rouge">city</code> <em>also</em> happened to load and cache other cities we needed. Plus our index data is a lot smaller/denser than the actual row data. So even digging everything up off the disk involved fewer disk reads.</p>

<p>When trying to look up actual city records, the same “skip over a column” trick that we did in the last section can work here. So it’s possible to go from <code class="language-plaintext highlighter-rouge">(state, city, population)</code> to the city record even with just <code class="language-plaintext highlighter-rouge">state</code> and <code class="language-plaintext highlighter-rouge">city</code>. This index could handily serve every query we’ve seen so far.</p>
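<p>Here’s a sketch of that “skip over a column” lookup against the new index:</p>

<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code>state = population_index.content["California"]
city  = state.content["Los Angeles"]

# The population level of a unique index only ever holds one entry, so we
# can "skip over" it by grabbing its single value: the row id.
row_id = city.content.values.first

data[row_id - 1]
</code></pre></div></div>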

<h2 id="finding-the-total-population-of-every-state">Finding the total population of EVERY state</h2>

<p>Now we’re going to try to handle this query: <code class="language-plaintext highlighter-rouge">SELECT state, sum(population) FROM populations GROUP BY state</code>.</p>

<p>STOP! Before you read further, I want you to think about how you’d solve this. You have three options now:</p>

<ul>
  <li>Walk through all the data rows</li>
  <li>Try to use the original index</li>
  <li>Try to use the population index</li>
</ul>

<p>That act of “deciding how to get at the data” is called <em>Query Planning</em>. It’s an important part of how databases work. Get deep enough into database performance and you’re going to have to become intimately familiar with your database’s query plan explanations. Examining that output is a key way to help debug slow queries and figure out what changes need to happen to make them not-slow queries.</p>

<p>In this case we have only three options, and it’s (probably) relatively easy to pick which one will be “the best”. But, let’s think them through in a rough approximation of how a query planner might look at this.</p>

<p>If we assign a “cost” to data and index reads, we can weigh our options by “total cost”:</p>

<ul>
  <li>Walk all the rows: 20,000 data reads + 0 index reads</li>
  <li>Use the original index: 20,000 data reads + 20,000 index reads</li>
  <li>Use the population index: 0 data reads + 20,000 index reads (reminder: all needed data is in the index)</li>
</ul>

<p>It’s generally accurate to assume that index reads are cheaper than data reads. So we’d want to “weigh” data reads higher. The actual process inside a real database is much more complicated, but here we’ll just assume data reads are 5x as expensive.</p>

<p>That gives us total costs of 100,000, 120,000, and 20,000 respectively, which means we should go with the last option.</p>

<p><em>Sidenote: to make it accessible, this cost calculation is wildly naive. Real databases track a lot more information than “how many rows are there”, and have more detailed insights about both the characteristics of the data and the specifics of how it is stored. Imagine you spent years refining this concept to fix every case where your cost predictions were wrong, and you’re closer to how databases actually work.</em></p>
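<p>That back-of-the-envelope weighing, written out as code (the 5x factor is just our made-up number from above):</p>

<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code>DATA_READ_COST  = 5
INDEX_READ_COST = 1

plans = {
  walk_all_rows:    { data_reads: 20_000, index_reads: 0 },
  original_index:   { data_reads: 20_000, index_reads: 20_000 },
  population_index: { data_reads: 0,      index_reads: 20_000 }
}

costs = plans.transform_values do |plan|
  plan[:data_reads] * DATA_READ_COST + plan[:index_reads] * INDEX_READ_COST
end
# costs: walk_all_rows 100,000; original_index 120,000; population_index 20,000

costs.min_by do |_plan, cost|
  cost
end
# [:population_index, 20000]
</code></pre></div></div>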

<p>So, how do we find per-state populations? Once again, the code comes out fairly straightforward:</p>

<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">result</span> <span class="o">=</span> <span class="p">{}</span>

<span class="n">population_index</span><span class="p">.</span><span class="nf">content</span><span class="p">.</span><span class="nf">each</span> <span class="k">do</span> <span class="o">|</span><span class="n">state</span><span class="p">,</span> <span class="n">cities</span><span class="o">|</span>
  <span class="n">result</span><span class="p">[</span><span class="n">state</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span>

  <span class="n">cities</span><span class="p">.</span><span class="nf">content</span><span class="p">.</span><span class="nf">values</span><span class="p">.</span><span class="nf">each</span> <span class="k">do</span> <span class="o">|</span><span class="n">population</span><span class="o">|</span>
    <span class="n">result</span><span class="p">[</span><span class="n">state</span><span class="p">]</span> <span class="o">+=</span> <span class="n">population</span><span class="p">.</span><span class="nf">content</span><span class="p">.</span><span class="nf">keys</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
  <span class="k">end</span>
<span class="k">end</span><span class="p">;</span> <span class="kp">nil</span>

<span class="n">result</span>
</code></pre></div></div>

<h1 id="what-we-can-learn-about-database-indices-4">What we can learn about database indices</h1>

<p>Again, we’re using the index as a <em>source</em> of information rather than just a way to <em>get to</em> information. It’s worth reiterating this because it comes up so often in real world scenarios.</p>

<p>We also, in our simulated query planning, saw a case where using an index was <em>slower</em> than reading the entire table. In practice, these scenarios are rare, but they can happen. Sometimes you’ll look at a query plan and wonder why the database is ignoring an index, only to find that you were missing a detail that makes the index <em>slower</em>. In this case that detail was “we’re asking to read the entire table”, but it’s definitely not the only one.</p>

<p>We’re also, in our <code class="language-plaintext highlighter-rouge">result</code> hash, just getting a glimpse into result buffering. Again it’s worth imagining what we would do in a scenario where there was so much data that just storing this <code class="language-plaintext highlighter-rouge">result</code> in memory wouldn’t work.</p>

<h2 id="wrapping-up">Wrapping up</h2>

<p>Hopefully this helps a little to demystify database indices. As I mentioned at the start, the <em>actual</em> data structures vary significantly from what we’re using here, but I’ve tried to keep the reasoning and thought processes consistent.</p>

<p>Despite the inaccuracies, you can get pretty far using this “wandering through nested hashes” view of how database indices work. Perhaps the best extension to your mental model would be imagining a hash that also includes the ability to do inequality comparisons on keys. Like if a <code class="language-plaintext highlighter-rouge">Hash#lookup</code> method existed that took a Ruby <code class="language-plaintext highlighter-rouge">Range</code> as an argument and could efficiently give you the values where keys were inside of that range.</p>
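<p>To make that idea concrete, here’s a toy sketch. The <code>lookup</code> method is hypothetical (no such method exists on Ruby’s <code>Hash</code>), and the linear scan only demonstrates the semantics; a real index keeps its keys ordered, e.g. in a B-tree, so it can find a range without scanning every entry.</p>

```ruby
# Toy sketch of the hypothetical Hash#lookup described above. The linear
# scan here is only to demonstrate the semantics; a real index keeps its
# keys ordered so it can find a range without touching every entry.
class RangeReadableHash < Hash
  def lookup(range)
    select { |key, _value| range.cover?(key) }.values
  end
end

populations = RangeReadableHash.new
populations["Austin"]   = 961_855
populations["Boston"]   = 675_647
populations["Portland"] = 652_503

populations.lookup("A".."C")  # => [961855, 675647]
```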

<p>If all of this has you interested in what the internals of an index <em>actually</em> look like, you can start by studying <a href="https://en.wikipedia.org/wiki/B-tree#:~:text=In%20computer%20science%2C%20a%20B,with%20more%20than%20two%20children.">B-trees</a>. They’re probably the most commonly used data structure for this purpose. Many databases support alternative index types based around different data structures, which is where you really start getting deep into the benefits and drawbacks of each one.</p>

<p>If you’d like to know more about query plans and how databases go about picking an algorithm to look up the data, that unfortunately gets pretty specific to the database involved. If you’re using <a href="https://dev.mysql.com/doc/refman/8.0/en/execution-plan-information.html">MySQL</a> or <a href="https://www.postgresql.org/docs/current/using-explain.html">PostgreSQL</a>, I’ve linked to the relevant sections of their documentation. Because databases are attempting to generate the best possible plan out of (potentially) a huge number of choices in a tiny fraction of a second, query planning gets hairy and detailed fast.</p>

<p>If this has piqued your interest in how to effectively use indices, <a href="https://use-the-index-luke.com/">Use the index, Luke</a> is a fantastic resource. It even includes an introduction to B-trees and resources tailored to multiple database types.</p>

<p>Chances are good that <em>somewhere</em> in your app, you’ve got a cache. Probably several.</p>

<p>Chances are <em>also</em> pretty good that at least some of that caching is <em>causing invalid results</em>. There’s even an outside possibility it’s making things <em>slower</em>. In this post I’ll talk through how to think about caching to be sure it’s worth the pains it can cause.</p>

<h1 id="first-applicable-scope">First, applicable scope</h1>
<p>I’m <em>only</em> talking about read-through caches. Ones where we update the cache synchronously when reads come up empty/expired. It’s one of the most common forms of caching.</p>

<p>The standard <a href="https://guides.rubyonrails.org/caching_with_rails.html#low-level-caching">Rails low-level cache</a> (i.e. <code class="language-plaintext highlighter-rouge">Rails.cache.fetch</code>) is an example.</p>
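<p>As a sketch of the mechanics (this is an illustrative stand-in imitating the shape of <code>Rails.cache.fetch</code>, not Rails internals):</p>

```ruby
# A minimal sketch of read-through caching, imitating the shape of
# Rails.cache.fetch. All names here are illustrative, not Rails internals.
class ReadThroughCache
  Entry = Struct.new(:value, :expires_at)

  def initialize
    @store = {}
  end

  # On a hit, return the cached value. On a miss (empty or expired),
  # synchronously run the block and cache its result before returning it.
  def fetch(key, expires_in: 300)
    entry = @store[key]
    return entry.value if entry && Time.now < entry.expires_at

    value = yield
    @store[key] = Entry.new(value, Time.now + expires_in)
    value
  end
end

cache = ReadThroughCache.new
cache.fetch("user/5/profile") { "expensive query result" }  # miss: runs the block
cache.fetch("user/5/profile") { "never runs" }              # hit: cached value
```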

<p>Other caching techniques might change the thought process, so if you’re looking at one of those be careful using this logic.</p>

<h1 id="good-reasons-for-caching">Good reasons for caching</h1>
<p>Before we get into “is a cache worthwhile” questions, let’s talk through the reasons you might be adding caching.</p>

<h2 id="efficiency">Efficiency</h2>
<p>This is <em>the big one</em>. You’re doing work to obtain results that don’t change, so reusing the results avoids repeating that work.</p>

<p>The goal is to improve user experiences by making things faster.</p>

<p>Although making things faster is generally desirable, it’s important to qualify (<em>not</em> quantify) this improvement. For most use cases, optimizing a 10ms request into a 1ms request isn’t particularly useful. 10ms already <em>feels</em> fast, so users won’t notice that it’s 10x faster.</p>

<p>Thankfully, there’s been some study in this area.</p>

<p>First off, we’ll look to <a href="https://www.nngroup.com/articles/response-times-3-important-limits/">Jakob Nielsen</a> for tiers of user perception on waiting for a task:</p>

<ul>
  <li>&lt; 0.1s - feels instantaneous</li>
  <li>&lt; 1.0s - keep your flow of thought</li>
  <li>&lt; 10.0s - keep your focus</li>
</ul>

<p>A cache that moves your user from one tier into another is helping immensely, even if the overall improvement doesn’t seem impressive.</p>

<p>Second, we’ll look at <a href="https://neilpatel.com/blog/loading-time/">Neil Patel</a> (as referred to us by <a href="https://support.google.com/analytics/answer/4589209?hl=en">Google Analytics help</a>) to see a 7% bounce rate increase for every 1s of load time.</p>

<p>Combining these, we can regard 1s increments as valuable changes, &gt;10s request times as extremely problematic, and getting under 1s and especially 0.1s as huge improvements. Moving about inside those 1s increments, and especially underneath 0.1s, is less valuable than crossing a boundary.</p>

<h2 id="insulating-a-data-store">Insulating a data store</h2>
<p>This is also a common reason for caching. Here your goal isn’t speed, but availability. You’re trying to protect your data store from workloads it can’t handle by avoiding some of that work.</p>

<p>This requires different thought and planning. Scenarios where your cache isn’t available are a <em>system</em> problem, not a user experience one. For example, if your frontend servers lose their in-memory caches during a deploy, the data store will be on its own until the caches refill. If it can’t handle that load spike, you go down.</p>

<p>It’s not uncommon to <em>initially</em> deploy a cache for efficiency reasons, only to have request growth turn it into an “insulate the data store” cache.</p>

<p>For this caching goal, hit rate is more important than raw performance. The overhead of reaching across a network to a dedicated caching server is usually acceptable compared to your app servers needing to refill in-memory caches.</p>

<h1 id="bad-reasons-for-caching">Bad reasons for caching</h1>
<p>There are a few common cases where people <em>think</em> they should be caching, when it’s somewhere between a Band-Aid and harmful.</p>

<h2 id="very-slow-queries">Very slow queries</h2>
<p>Read-through caching cannot “fix” queries that run longer than your app server’s timeout. If your query times are approaching your timeouts, at best a cache will make the error intermittent. That’s better than nothing, but not good.</p>

<p>In these cases, first focus on speeding up the queries. Often queries are either over-fetching data or filtering against columns that aren’t properly indexed.</p>

<h2 id="slow-view-fragments">Slow view fragments</h2>
<p>Similar to the above about queries, sometimes people turn to caching to solve problems with request timeouts during view rendering. Again, this isn’t going to actually solve the problem, just (maybe) push it off for a bit.</p>

<p>Usually, those slow view renderings are database queries in disguise. Look for N+1 query behavior. Are you loading too much data in a page?</p>

<h2 id="it-might-be-slow">It “might” be slow</h2>
<p>Often people slap caching on things they <em>think</em> will be slow. Equally often, they’re wrong. Unless you’ve done analysis to back up the idea, this is premature optimization. You’re ponying up for the costs of caching without being sure they’re worth paying.</p>

<h1 id="the-costs-of-caching">The costs of caching</h1>
<p>Although adding a cache could increase your server costs, money isn’t the most important cost. The most important cost of caching lies with developers and users, not servers. Caches are <em>very</em> hard to reason about, which makes them easy to get wrong, which can cause all kinds of havoc.</p>

<p>In a system without caching, every result simply is what the source of truth says. A user changes a record, that gets committed to the database, the next request shows the new data. Easy peasy.</p>

<p>When caching is involved, stale caches mean what your database tells you isn’t always what users see. This can undermine trust, and in worst cases cause incorrect behavior.</p>

<p>Beyond user-facing consequences, caching means <em>more</em> code. Every data change <em>also</em> needs to worry about caching. Invalidating <em>just the relevant</em> cache keys is a hard problem. Get it wrong and your hit rates plummet, or you serve stale data, or maybe both at once.</p>

<p>Ultimately, we have to weigh the benefits against these costs. Unfortunately, it’s a difficult comparison. The benefits are easily quantifiable and the costs are highly subjective.</p>

<h1 id="the-caching-equation">The caching equation</h1>
<p>When trying to answer “is a cache making things faster”, there’s a relatively simple formula to use. Keep in mind that our cache system is its own data store, with its own access time.</p>

<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">hit_rate</span> <span class="c1"># % of cache lookups with validly cached data</span>
<span class="n">miss_rate</span> <span class="c1"># 100 - hit_rate</span>
<span class="n">lookup_time</span> <span class="c1"># time (ms) to consult the cache (hit or miss)</span>
<span class="n">query_time</span> <span class="c1"># time(ms) to run the query</span>
<span class="n">fractional_hit_cost</span> <span class="o">=</span> <span class="n">hit_rate</span> <span class="o">*</span> <span class="n">lookup_time</span>
<span class="n">fractional_miss_cost</span> <span class="o">=</span> <span class="n">miss_rate</span> <span class="o">*</span> <span class="p">(</span><span class="n">lookup_time</span> <span class="o">+</span> <span class="n">query_time</span><span class="p">)</span>
<span class="n">cached_query_time</span> <span class="o">=</span> <span class="n">fractional_hit_cost</span> <span class="o">+</span> <span class="n">fractional_miss_cost</span>
</code></pre></div></div>

<p>The difference between <code class="language-plaintext highlighter-rouge">cached_query_time</code> and <code class="language-plaintext highlighter-rouge">query_time</code> is how much caching is helping (or possibly hurting) you.</p>

<p>The beauty is that you can plug all this into a spreadsheet and play with the numbers. Fill in what you think the hit rate would be, along with lookup and query times, and you’re able to tweak numbers to explore the benefits before doing any coding. What happens if your hit rate is half what you expect? Or if reading from the cache is twice as expensive? Feeling out the edges of this new system before introducing it is very useful.</p>
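<p>For example, with rates expressed as fractions rather than percentages (the numbers here are made up for illustration):</p>

```ruby
# The caching equation as a function, with rates as fractions (0.0-1.0).
# Example inputs are invented; plug in your own measurements.
def cached_query_time(hit_rate:, lookup_time:, query_time:)
  miss_rate = 1.0 - hit_rate
  hit_rate * lookup_time + miss_rate * (lookup_time + query_time)
end

# A 90% hit rate on a 2ms cache in front of a 50ms query:
cached_query_time(hit_rate: 0.9, lookup_time: 2, query_time: 50)  # => ~7ms
# The same cache with only a 20% hit rate barely helps:
cached_query_time(hit_rate: 0.2, lookup_time: 2, query_time: 50)  # => ~42ms
```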

<h1 id="what-is-a-good-hit-rate">What is a “good” hit rate</h1>
<p>This is a pretty natural question as part of caching decision making, but I think it’s the wrong question.</p>

<p>A better question would be “is caching worth it”, and the hit rate is only part of that.</p>

<p>If the <em>benefit</em> you’re looking for is in user experience, then will caching move users between those 0.1/1/10 second thresholds? How much developer complexity and user uncertainty is it adding?</p>

<p>If the <em>benefit</em> you want is insulating your data store, how effectively is it doing that? How much load are you eliminating? Are you protected in high usage scenarios? A high hit rate cache of an inexpensive query may not be as useful to you as a lower hit rate for an expensive one.</p>

<p>This isn’t to say that hit rates aren’t important; they definitely can be. It’s just that hard and fast numbers aren’t the best way to evaluate this problem.</p>

<h1 id="when-should-i-cache">When should I cache?</h1>
<p>That’s the real question, huh. Unfortunately, like most other complicated things in this discipline, the answer is “it depends”. Hopefully now you’re able to say that too.</p>

<p>You have to weigh the potential user and system benefits against the developer costs in getting caching right, and the business costs when you get it wrong. Those things aren’t reducible to simple math problems, but it definitely helps to go into the situation with them in mind.</p>]]></content><author><name></name></author><category term="caching" /><category term="rails" /><category term="performance" /><summary type="html"><![CDATA[Or: how to talk yourself out of one of the hardest problems in computers]]></summary></entry><entry><title type="html">Understanding MySQL Multiversion Concurrency Control</title><link href="https://awj.dev/database/mysql/2018/09/16/understanding-mysql-multiversion-concurrency-control.html" rel="alternate" type="text/html" title="Understanding MySQL Multiversion Concurrency Control" /><published>2018-09-16T18:02:55+00:00</published><updated>2018-09-16T18:02:55+00:00</updated><id>https://awj.dev/database/mysql/2018/09/16/understanding-mysql-multiversion-concurrency-control</id><content type="html" xml:base="https://awj.dev/database/mysql/2018/09/16/understanding-mysql-multiversion-concurrency-control.html"><![CDATA[<p>MySQL, under the InnoDB storage engine, allows writes and reads of the same row to not interfere with each other. This is one of those features that we use so often it kind of gets taken for granted, but if you think about how you would build such a thing it’s a lot more detailed than it seems. Here, I am going to talk through how that is implemented, as well as some ramifications of the design.</p>

<h1 id="allowing-concurrent-change">Allowing Concurrent Change</h1>
<p>Unsurprisingly, given the title of this post, MySQL’s mechanism for allowing you to simultaneously read and write from the same row is called “Multiversion Concurrency Control”. They (of course) <a href="https://dev.mysql.com/doc/refman/8.0/en/innodb-multi-versioning.html">have documentation</a> on it, but that dives into internal technical details pretty fast.</p>

<p>Instead, let’s talk about it at a little higher level. This concept has been around for a long time (the best I can do hunting down an origin is a <a href="https://dspace.mit.edu/handle/1721.1/16279">thesis</a> from 1979). The overall answer for allowing concurrent reads and writes is pretty simple: writes create new versions of rows, reads see the version that was current when they started.</p>

<h1 id="version-tracking">Version tracking</h1>
<p>Obviously if we’re going to keep track of versions, we need something to differentiate them. This tool needs to distinguish one version from another, but ideally it would also make it easy to decide which version a read operation should see.</p>

<p>In MySQL, this “version enabling thing” is a transaction id. Every transaction gets one. Even your one-shot update queries in the console get one. These ids are incremented in a way that allows MySQL to determine that one transaction started before another. Every table under InnoDB essentially has a “hidden column” that stores the transaction id of the last write operation to change the row. So, in addition to the columns you may have updated, a write operation <em>also</em> marks the row with its transaction id. This allows read operations to know if they can use the row data, or if it has been changed and they need to consult an older version.</p>

<h1 id="reading-older-version">Reading older versions</h1>
<p>For the cases where your read operation hits on rows that have been changed, you’ll need an older version of the data. The transaction id comes into play here too, but there’s more info needed. Every time MySQL writes data into a row, it <em>also</em> writes an entry into the rollback segment. This is a data structure that stores “undo logs” used to restore the row to its previous state. It’s called the “rollback segment” because it is the tool used to handle rolling back transactions.</p>

<p>The rollback segment stores undo logs for each row in the database. Every row has <em>another</em> hidden column that stores the location of the latest undo log entry, which would restore the row to its state prior to the last write. When these entries are created, they are marked with the <em>outgoing</em> transaction id. By walking the undo log for a row and finding the latest entry from a transaction <em>before</em> the reading one, the database can identify the correct data to present.</p>
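<p>Here’s a heavily simplified sketch of that visibility rule. Real InnoDB stores diffs in the rollback segment rather than full row copies, and visibility also accounts for commit state, not just transaction ids; this only illustrates the “walk back until a version predates your transaction” idea.</p>

```ruby
# A heavily simplified sketch of MVCC read visibility. Writes stamp the row
# with their transaction id and push the prior version onto an undo chain;
# a reader walks that chain until it finds a version from a transaction at
# or before its own. (Real InnoDB stores diffs, not full copies, and also
# considers commit state, not just ids.)
Version = Struct.new(:trx_id, :data)

class Row
  def initialize(trx_id, data)
    @versions = [Version.new(trx_id, data)]  # newest version first
  end

  def write(trx_id, data)
    @versions.unshift(Version.new(trx_id, data))
  end

  def read(trx_id)
    @versions.find { |v| v.trx_id <= trx_id }&.data
  end
end

row = Row.new(10, "v1")
row.write(12, "v2")
row.read(11)  # => "v1" -- a reader from transaction 11 can't see 12's write
row.read(13)  # => "v2"
```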

<h1 id="handling-deletes">Handling deletes</h1>
<p>Deletion is handled by a marker in the row to indicate a record was deleted. Delete operations <em>also</em> set the row’s transaction id to their transaction id, so the process above can present a pre-delete version of the row to read operations that started before the delete.</p>

<h2 id="when-are-versions-deleted">When are versions deleted</h2>
<p>MySQL obviously cannot keep a record of every change that happens in the database for all time. It doesn’t need to, though. Undo logs can be removed as soon as the last transaction that could possibly want them completes.</p>

<p>Similarly, rows that have been marked as deleted can be outright abandoned once the oldest active transaction is one that started after the deletion. These rows and undo logs are physically removed, reclaiming their disk space, by a “purge” operation that runs in its own background thread.</p>

<h1 id="what-about-indexes">What about indexes</h1>
<p>So, to recap, MySQL handles versions by keeping the row constantly up to date and storing diffs for as long as currently running queries need them. That’s only half the story, though: indexes need to support consistent reads as well. Primary key indexes work much like the above description for actual database rows. Secondary indexes are a little different.</p>

<p>MySQL handles this in two ways: pages of index entries are marked by the last transaction id to write in them, and individual index entries have delete markers. When an update changes an indexed column, three things happen: the current index entry is delete marked, a new entry for the updated value is written, and that new entry’s index page is marked with the transaction id.</p>

<p>Read operations that find non-delete-marked entries in pages that predate their transaction id use that index data directly. If the operation finds either a delete marker or a newer transaction id, it looks up the row and traverses the undo log to find the appropriate value to use.</p>

<p>Similar to the purging of deleted rows from expired transactions, delete-marked index entries are also eventually reclaimed. Because there is always a fresh new entry to work with <em>somewhere</em> in the index, MySQL can be a little more aggressive at cleaning those up.</p>

<h1 id="what-do-i-do-with-this-information">What do I do with this information?</h1>
<p>So, given the above, what can we take home to make our lives better? A few things. Keep in mind with all of this that database performance can be very difficult to analyze. Each point below is just one potential piece of the story of what could be happening with your data.</p>

<ul>
<li>Big transactions are painful: Long-running transactions don’t just tie up a connection; they force the database to preserve history for longer. If that transaction is reading through a large swath of the database, subsequent writes will force it to read the rollback log, which may be in a different page of memory or even on disk.</li>
  <li>Multi-statement transactions need to commit quickly: This is another variety of “big transactions are painful”, but it’s worth calling out. MySQL does not “kill” active transactions. If you open a transaction, query out data, then spend two hours in application code before committing, MySQL will faithfully preserve undo history for two hours. Every moment of an open transaction forces more undo history. Commit as quickly as you can and do your processing outside of transactions whenever possible.</li>
  <li>Writes make index scans less useful. The whole point of an index is to answer questions about your data without actually looking at your data. Delete markers on index entries, and transaction stamps on index pages, force the database to read your data. Think carefully about using composite indexes with columns you aren’t querying. Your queries will pay the price for updates to those columns anyways.</li>
  <li>Rapid fire writes magnify the penalties for reads. If you have a lot of data to write, especially to the <em>same row</em>, write it in chunks instead of one-by-one. Each write generates a transaction id, relevant undo logs, and makes a mess of secondary indexes. Chunking writes together increases the chances that reads will find valid index data, and lowers the size of undo logs they have to wade through. There’s an opposite extreme where one big write might have too much data, so it’s important to look for the happy middle ground here.</li>
<li>“Hot” rows are hot for all columns, not just updated ones. A row that stores a frequently updated counter forces more row transaction id updates and undo log entries. Queries that start before the counter is incremented, even if they don’t use the counter, still have to traverse undo logs to recover the row state from when they started. That same logic applies to frequently updated timestamps that aggregate change times across a row’s relations. If possible, batch those updates beforehand or consider storing them in a separate table you can join to when needed.</li>
  <li>Consider separating reporting from direct/application use reads. Reporting queries tend to scan large sections of the database. They take a long time and thus force the preservation and consumption of more undo history. Most application behavior is more direct: it knows specific records to retrieve and goes straight for them. If you’re already using read replicas, consider dedicating one to reporting so that your application queries don’t pay the undo storage penalties of reporting.</li>
</ul>
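<p>The “write it in chunks” advice can be as simple as batching rows before handing them to your data layer. <code>bulk_update</code> below is a hypothetical stand-in for whatever multi-row write your stack provides (e.g. a single multi-row <code>UPDATE</code>); the point is the shape, one statement (and one transaction id’s worth of undo history) per batch instead of one per row:</p>

```ruby
# Sketch of chunked writes. `bulk_update` is a hypothetical stand-in for a
# multi-row write in your data layer; batching means one statement per
# chunk instead of one statement per row.
def bulk_update(rows)
  # imagine: a single multi-row UPDATE built from `rows`
  rows.size
end

updates = (1..2_000).map { |id| [id, id * 2] }

# 4 statements (and transactions) instead of 2,000 row-by-row writes
batches = updates.each_slice(500).map { |batch| bulk_update(batch) }
batches.size  # => 4
```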

<h1 id="final-thoughts">Final thoughts</h1>
<p>It’s worth noting that this is not the only way to implement MVCC. PostgreSQL handles this task by storing the minimum and maximum transaction ids where a row should be visible. Under that scheme, updating a row sets the maximum transaction id on it and creates an entirely new row entry by copying the original and performing the update in that copy. This avoids the need for undo logs, but at the cost of copying all row data for each update.</p>

<p>The point here is that, for cases where you are really trying to push database performance, understanding a few of the bigger details of the internals can pay off. Many of the takeaways I listed are only going to be applicable in extreme use cases, but in those cases knowing how the database goes about versioning data can make understanding performance problems easier.</p>]]></content><author><name></name></author><category term="database" /><category term="mysql" /><summary type="html"><![CDATA[In which we figure out how MySQL writes new data without showing it in old queries.]]></summary></entry><entry><title type="html">Understanding the Elasticsearch Percolator</title><link href="https://awj.dev/elasticsearch/2018/04/24/understanding-the-elasticsearch-percolator.html" rel="alternate" type="text/html" title="Understanding the Elasticsearch Percolator" /><published>2018-04-24T18:02:55+00:00</published><updated>2018-04-24T18:02:55+00:00</updated><id>https://awj.dev/elasticsearch/2018/04/24/understanding-the-elasticsearch-percolator</id><content type="html" xml:base="https://awj.dev/elasticsearch/2018/04/24/understanding-the-elasticsearch-percolator.html"><![CDATA[<p>Elasticsearch is a powerful, feature-packed tool. Their <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html">documentation</a> is great, but some pieces are a bit … out there. Beyond that, some of the functionality has changed significantly over the years, so third-party explanations might no longer be accurate.</p>

<p>One fantastic feature that is both unusual and has changed a lot is percolation. I’m going to try to explain that feature, in the context of its current implementation (version 6.2.4). You’ll need a basic understanding of Elasticsearch, specifically <a href="https://www.elastic.co/guide/en/elasticsearch/reference/6.2/mapping.html">mappings</a> and <a href="https://www.elastic.co/guide/en/elasticsearch/reference/6.2/search-request-body.html">search</a>.</p>

<h1 id="the-concept">The Concept</h1>
<p>The normal workflow for Elasticsearch is to store documents (as JSON data) in an index, and execute searches (also JSON data) to ask the index about those documents.</p>

<p>Succinctly, percolation reverses that. You store searches and use documents to ask the index about those searches. That’s true, but it’s not particularly actionable information. How percolators are structured has evolved over the years, to the point where we can give a more useful explanation.</p>

<p>Now, percolation revolves around the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/6.2/percolator.html">percolator</a> mapping field type. This is like any other field type, except that it expects you to assign a search document as the value. When you store data, the index processes this search document into an executable form and saves it for later.</p>

<p>The <a href="https://www.elastic.co/guide/en/elasticsearch/reference/6.2/query-dsl-percolate-query.html">Percolate Query</a> takes one or more documents and limits results to ones whose stored searches match at least one document. When searching, the percolate query works like any other query element.</p>

<h1 id="in-depth">In Depth</h1>
<p>Under the hood, this is implemented in about the way you would expect: indexes with percolate fields keep a hidden (in-memory) index. Documents listed in your percolate queries are first put in that index, then a normal query is executed against that index to see if the original percolate-field-bearing document matches.</p>

<p>An important point to remember is that this hidden index gets its mappings from the original percolator index. So indexes used for percolate queries need to have mappings appropriate for the original data and the query document data.</p>

<p>This introduces a bit of a management problem, in that your index data and the percolate query documents could use the same field in different ways. A simple answer to that is to use the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/6.2/object.html">object type</a> to isolate the percolate-relevant mappings from normal document mappings.</p>

<p>Assuming the queries you are using were originally written for another index of actual documents, it makes the most sense to isolate the data going directly into the percolate index and give the root level over to mapping definitions for percolate query documents.</p>

<p>Also, because percolate fields are parsed into searches and saved at index time, you likely will need to reindex percolate documents after upgrading to take advantage of any optimizations to the system.</p>

<h1 id="an-example">An Example</h1>
<p>In my opinion, percolator examples are one of the prime contributors to making the tool hard to understand. They tend to be too simple, to the point where it’s hard to distinguish the parts.</p>

<p>In this example, we’re going to build out an index of saved term and price searches for toys. The idea behind it is that users should be able to put in a search term and a max price, then get notified as soon as something matching that term goes below this price. Users should also be able to turn these notifications on and off.</p>

<p>The mapping below implements a percolate index to support this feature. Fields related to the saved search itself are in the <code class="language-plaintext highlighter-rouge">search</code> object, while fields related to the original toys live at the root level of the mappings.</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"mappings"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"_doc"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="nl">"properties"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
        </span><span class="nl">"search"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
          </span><span class="nl">"properties"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
            </span><span class="nl">"query"</span><span class="p">:</span><span class="w">   </span><span class="p">{</span><span class="w"> </span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"percolator"</span><span class="w"> </span><span class="p">},</span><span class="w">
            </span><span class="nl">"user_id"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"integer"</span><span class="w"> </span><span class="p">},</span><span class="w">
            </span><span class="nl">"enabled"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"boolean"</span><span class="w"> </span><span class="p">}</span><span class="w">
          </span><span class="p">},</span><span class="w">
        </span><span class="p">},</span><span class="w">
        </span><span class="nl">"price"</span><span class="p">:</span><span class="w">       </span><span class="p">{</span><span class="w"> </span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"float"</span><span class="w"> </span><span class="p">},</span><span class="w">
        </span><span class="nl">"description"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"text"</span><span class="w"> </span><span class="p">}</span><span class="w">
      </span><span class="p">}</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>Here is what a document that represents a stored search would look like:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"_id"</span><span class="p">:</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w">
  </span><span class="nl">"search"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"user_id"</span><span class="p">:</span><span class="w"> </span><span class="mi">5</span><span class="p">,</span><span class="w">
    </span><span class="nl">"enabled"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span><span class="w">
    </span><span class="nl">"query"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="nl">"bool"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
        </span><span class="nl">"filter"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
          </span><span class="p">{</span><span class="w"> 
            </span><span class="nl">"match"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> 
              </span><span class="nl">"description"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"query"</span><span class="p">:</span><span class="w"> </span><span class="s2">"nintendo switch"</span><span class="w"> </span><span class="p">}</span><span class="w">
            </span><span class="p">}</span><span class="w">
          </span><span class="p">},</span><span class="w">
          </span><span class="p">{</span><span class="w"> </span><span class="nl">"range"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"price"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"lte"</span><span class="p">:</span><span class="w"> </span><span class="mf">300.00</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="p">}</span><span class="w">
        </span><span class="p">]</span><span class="w">
      </span><span class="p">}</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>Note that we only store actual data inside the <code class="language-plaintext highlighter-rouge">search</code> object field. The top-level mappings for <code class="language-plaintext highlighter-rouge">price</code> and <code class="language-plaintext highlighter-rouge">description</code> exist only so the percolator can understand the fields that the stored queries reference.</p>

<p>At query time, we want to use both the plain object fields and the “special” percolator field. This query checks which of a user’s currently-enabled searches match the document:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"query"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"bool"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="nl">"filter"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
        </span><span class="p">{</span><span class="w">
          </span><span class="nl">"percolate"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
            </span><span class="nl">"field"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="s2">"search.query"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"document"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
              </span><span class="nl">"description"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="s2">"Nintendo Switch"</span><span class="p">,</span><span class="w">
              </span><span class="nl">"price"</span><span class="p">:</span><span class="w"> </span><span class="mf">250.00</span><span class="w">
            </span><span class="p">}</span><span class="w">
          </span><span class="p">}</span><span class="w">
        </span><span class="p">},</span><span class="w">
        </span><span class="p">{</span><span class="w"> </span><span class="nl">"term"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"search.enabled"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="p">},</span><span class="w">
        </span><span class="p">{</span><span class="w"> </span><span class="nl">"term"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"search.user_id"</span><span class="p">:</span><span class="w"> </span><span class="mi">5</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="p">}</span><span class="w">
      </span><span class="p">]</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>Note that it combines percolate matching of the document against the stored queries with regular term queries, which limit the stored searches we evaluate to those that are enabled and belong to the given user.</p>
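<p>Incidentally, the document being percolated does not have to be provided inline. A percolate query can instead reference a document that already exists in the cluster by naming its index and id. A hedged sketch, where the <code class="language-plaintext highlighter-rouge">products</code> index name and the id <code class="language-plaintext highlighter-rouge">42</code> are placeholder values:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
  "query": {
    "percolate": {
      "field": "search.query",
      "index": "products",
      "id": "42"
    }
  }
}
</code></pre></div></div>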

<h1 id="some-additional-thoughts">Some Additional Thoughts</h1>
<p>Because resolving a percolate filter means running every candidate stored query against the document, you may need to pay extra attention to shards and replicas for a percolate index. Each additional shard reduces the number of search-bearing documents, and therefore the number of stored queries, that any one node has to evaluate.</p>
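<p>Note that the primary shard count is fixed when the index is created. A hedged, console-style sketch of creating a percolator-bearing index with extra primary shards, where the index name and the counts are placeholder values you would tune for your own workload:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code>PUT /stored-searches
{
  "settings": {
    "number_of_shards": 4,
    "number_of_replicas": 1
  }
}
</code></pre></div></div>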

<p>Percolate queries also have an option to fetch the document to test from another index inside the cluster. That fetch takes the form of a literal GET request, so there’s not much benefit in trying to keep shards from the two indices on the same nodes.</p>]]></content><author><name></name></author><category term="elasticsearch" /><summary type="html"><![CDATA[In which we flip search on its head and save *searches* then search with *documents*.]]></summary></entry></feed>