JB's BLOG

Directed Acyclic Word Graphs

Part 6 - Performance enhancements

Goals

In this article we're going to explore how to improve the performance of the Lookup function. We've created a benchmark suite in the previous article and we're going to use that to measure the effects of our changes.

We're going to proceed with the first 4 basic steps in performance enhancement.

  1. Algorithm refinement
  2. Loop hoisting
  3. Memory access reduction
  4. Branch reduction

After that, it's time to dig into assembly code and make some weirder optimizations that might not fit on this list.

  1. Starting point

    Here is the very naive implementation of the Wagner-Fischer Algorithm as described in part 4 of this series.

    Dawg01.cs

    public IEnumerable<SuggestItem> Lookup(string word, uint maxEdits)
    {
        var builder = new StringBuilder(word.Length + (int)maxEdits);
        builder.Append(new string(' ', word.Length + (int)maxEdits));
        var results = new List<SuggestItem>();
    
        var matrix = new int[word.Length + maxEdits + 1][];
        for(var i = 0; i < matrix.Length; i++)
        {
            matrix[i] = new int[word.Length + 1];
        }
    
        for (var i = 0; i < matrix.Length; i++)
        {
            matrix[i][0] = i;
        }
    
        for(var i = 0; i < matrix[0].Length; i++)
        {
            matrix[0][i] = i;
        }
    
        Recurse(_graph.RootNodeIndex, 0);
        return results;
    
        void Recurse(int currentNode, int depth)
        {
            Debug.Assert(depth <= word.Length + maxEdits);
            if (depth == word.Length + maxEdits)
            {
                return;
            }
    
            var firstChild = _graph.FirstChildEdgeIndex[currentNode];
            var lastChild = _graph.FirstChildEdgeIndex[currentNode + 1];
    
            for (var childEdge = firstChild; childEdge < lastChild; childEdge++)
            {
                var currentCharacter = _graph.EdgeCharacters[childEdge];
                builder[depth] = currentCharacter;
                for (var i = 1; i < matrix[depth].Length; i++)
                {
                    var targetCharacter = word[i - 1];
                    var cost = 1;
                    if (currentCharacter == targetCharacter)
                    {
         cost = 0;
                    }
    
                    matrix[depth + 1][i] = Min(
         matrix[depth][i] + 1,
         matrix[depth + 1][i - 1] + 1,
         matrix[depth][i - 1] + cost);
    
                    if (depth > 0 && i > 1 
         && word[i-1] == builder[depth - 1]
         && word[i-2] == builder[depth])
                    {
         matrix[depth + 1][i] = Math.Min(
             matrix[depth + 1][i],
             matrix[depth - 1][i - 2] + 1);
                    }
                }
    
                var nextNode = _graph.EdgeToNodeIndex[childEdge];
                if (nextNode < 0 && matrix[depth + 1][word.Length] <= maxEdits)
                {
                    results.Add(new SuggestItem(builder.ToString(0, depth + 1), 0));
                }
    
                Recurse(Math.Abs(nextNode), depth + 1);
            }
        }
    
        static int Min(int a, int b, int c)
        {
            return Math.Min(a, Math.Min(b, c));
        }
    }
    

    Benchmark result: 29.75 s

  2. Algorithm refinement - part 1

    The first implementation is painfully slow. A very quick fix we can make is to stop processing when the current graph path cannot possibly generate any valid words. The entries in the algorithm's matrix are monotonically increasing, so once we've reached a row where all entries are beyond our maxEdits value, we can stop traversing down that path.

    A simple way to do so is checking each value as we write it to the matrix. By starting with a flag set to false, and setting it to true if we encounter a valid entry, we can check the result once we've done processing the loop and potentially exit early.

    diff Dawg01.cs Dawg02.cs

    @@
      for (var childEdge = firstChild; childEdge < lastChild; childEdge++)
      {
    +     var any = false;
          var currentCharacter = _graph.EdgeCharacters[childEdge];
          builder[depth] = currentCharacter;
          for (var i = 1; i < matrix[depth].Length; i++)
    @@
                      matrix[depth + 1][i],
                      matrix[depth - 1][i - 2] + 1);
              }
    +
    +         if (matrix[depth + 1][i] <= maxEdits)
    +         {
    +             any = true;
    +         }
    +     }
    +
    +     if (!any)
    +     {
    +         continue;
          }
    
          var nextNode = _graph.EdgeToNodeIndex[childEdge];
    

    1.661 s

  3. Algorithm refinement - part 2

    Again, for our purposes we don't need words that can't possibly fit within the maxEdit distance. The matrix contains many entries representing such alignments and word lengths. To compute only the relevant entries, we can calculate start and end points for valid entries per row. The result is a set of diagonal stripes in the matrix, with entries outside left uncomputed.

    For this to work, we also need to add values before and after the stripe. These values are retrieved during computation for valid entries, so we can simply populate them with maxEdits to represent an invalid transition.

    diff Dawg02.cs Dawg03.cs

    @@
      {
          matrix[i] = new int[word.Length + 1];
          matrix[i][0] = i;
    +     if (i > maxEdits)
    +     {
    +         matrix[i][i - maxEdits - 1] = (int)maxEdits;
    +     }
    +
    +     var stripeEnd = i + maxEdits + 1;
    +     if (stripeEnd <= word.Length)
    +     {
    +         matrix[i][stripeEnd] = (int)maxEdits;
    +     }
      }
    
      for(var i = 0; i < matrix[0].Length; i++)
    @@
              var any = false;
              var currentCharacter = _graph.EdgeCharacters[childEdge];
              builder[depth] = currentCharacter;
    -         for (var i = 1; i < matrix[depth].Length; i++)
    +         var from = Math.Max(depth - (int)maxEdits + 1, 1);
    +         var to = Math.Min(word.Length + 1, depth + maxEdits + 2);
    +         for (var i = from; i < to; i++)
              {
                  var targetCharacter = word[i - 1];
                  var cost = 1;
    @@
              }
    
              var nextNode = _graph.EdgeToNodeIndex[childEdge];
    -         if (nextNode < 0 && matrix[depth + 1][word.Length] <= maxEdits)
    +         if (nextNode < 0
    +             && depth >= word.Length - maxEdits - 1
    +             && matrix[depth + 1][word.Length] <= maxEdits)
              {
                  results.Add(new SuggestItem(builder.ToString(0, depth + 1), 0));
              }
    

    1.313 s

  4. Loop hoisting - part 1

    The values to and from can be calculated outside the loop. This will move some arithmetic and conditionals from inside to outside.

    diff Dawg03.cs Dawg04.cs

    @@
      var firstChild = _graph.FirstChildEdgeIndex[currentNode];
      var lastChild = _graph.FirstChildEdgeIndex[currentNode + 1];
    
    +  var from = Math.Max(depth - (int)maxEdits + 1, 1);
    +  var to = Math.Min(word.Length + 1, depth + maxEdits + 2);
    +
      for (var childEdge = firstChild; childEdge < lastChild; childEdge++)
      {
          var any = false;
          var currentCharacter = _graph.EdgeCharacters[childEdge];
          builder[depth] = currentCharacter;
    -     var from = Math.Max(depth - (int)maxEdits + 1, 1);
    -     var to = Math.Min(word.Length + 1, depth + maxEdits + 2);
          for (var i = from; i < to; i++)
          {
              var targetCharacter = word[i - 1];
    

    1.311 s

  5. Memory access reduction - part 1

    The current implementation can perform multiple reads and writes to the same entry while populating it. By storing the result in a temporary variable and doing our computations on it, we save some time. This temporary variable can also be carried into the next iteration of the loop, further reducing the number of reads.

    Carrying it through iterations means we need an initial value before the loop begins. If you recall the previous step, we stored a precalculated value before and after each stripe. Instead of reading the pre-stripe value, we can calculate it on the fly from existing local variables. Now we can also omit the population of these prefix entries.

    diff Dawg04.cs Dawg05.cs

    @@
                 {
      matrix[i] = new int[word.Length + 1];
      matrix[i][0] = i;
    - if (i > maxEdits)
    - {
    -     matrix[i][i - maxEdits - 1] = (int)maxEdits;
    - }
    
      var stripeEnd = i + maxEdits + 1;
      if (stripeEnd <= word.Length)
    @@
          var any = false;
          var currentCharacter = _graph.EdgeCharacters[childEdge];
          builder[depth] = currentCharacter;
    +     var calculatedCost = depth + 1;
          for (var i = from; i < to; i++)
          {
              var targetCharacter = word[i - 1];
    @@
                  cost = 0;
              }
    
    -         matrix[depth + 1][i] = Min(
    +         calculatedCost = Min(
                  matrix[depth][i] + 1,
    -             matrix[depth + 1][i - 1] + 1,
    +             calculatedCost + 1,
                  matrix[depth][i - 1] + cost);
    
              if (depth > 0 && i > 1 
                  && word[i-1] == builder[depth - 1]
                  && word[i-2] == builder[depth])
              {
    -             matrix[depth + 1][i] = Math.Min(
    -   matrix[depth + 1][i],
    +             calculatedCost = Math.Min(
    +   calculatedCost,
        matrix[depth - 1][i - 2] + 1);
              }
    
    -         if (matrix[depth + 1][i] <= maxEdits)
    +         if (calculatedCost <= maxEdits)
              {
                  any = true;
              }
    +
    +         matrix[depth + 1][i] = calculatedCost;
          }
    
          if (!any)
    @@
          var nextNode = _graph.EdgeToNodeIndex[childEdge];
          if (nextNode < 0
              && depth >= word.Length - maxEdits - 1
    -         && matrix[depth + 1][word.Length] <= maxEdits)
    +         && calculatedCost <= maxEdits)
          {
              results.Add(new SuggestItem(builder.ToString(0, depth + 1), 0));
          }
    

    1.265 s

  6. Memory access reduction - part 2

    We've just seen that the inner loop can re-use values in the current row from previous iterations. We can continue this application of rolling local variables by also storing values read from the previous row.

    This rolling over from one variable to another is faster than re-reading from memory and the operation can often be free due to register renaming. TODO: add a citation for register renaming.

    While we're at it. some small miscellaneous values can also be read outside of the loop and stored locally for re-use.

    diff Dawg05.cs Dawg06.cs

    @@
      var currentCharacter = _graph.EdgeCharacters[childEdge];
      builder[depth] = currentCharacter;
      var calculatedCost = depth + 1;
    + var previousRowEntry = matrix[depth][from - 1];
      for (var i = from; i < to; i++)
      {
          var targetCharacter = word[i - 1];
    @@
              cost = 0;
          }
    
    +     var previousRowPreviousEntry = previousRowEntry;
    +     previousRowEntry = matrix[depth][i];
    +
          calculatedCost = Min(
    -         matrix[depth][i] + 1,
    +         previousRowEntry + 1,
              calculatedCost + 1,
    -         matrix[depth][i - 1] + cost);
    +         previousRowPreviousEntry + cost);
    
          if (depth > 0 && i > 1 
              && word[i-1] == builder[depth - 1]
    

    1.190 s

  7. Loop hoisting - part 2

    Now that the code is a little bit easier to read we can notice some more memory reads that can be hoisted out of the loop. builder[depth] is simply currentCharacter. builder[depth-1] is another loop invariant that we can pull out of both loops.

    diff Dawg06.cs Dawg07.cs

    @@
    
      var from = Math.Max(depth - (int)maxEdits + 1, 1);
      var to = Math.Min(word.Length + 1, depth + maxEdits + 2);
    + var previousCharacter = depth > 0 ? builder[depth - 1] : ' ';
    
      for (var childEdge = firstChild; childEdge < lastChild; childEdge++)
      {
    @@
                  previousRowPreviousEntry + cost);
    
              if (depth > 0 && i > 1 
    -             && word[i-1] == builder[depth - 1]
    -             && word[i-2] == builder[depth])
    +             && word[i-1] == previousCharacter
    +             && word[i-2] == currentCharacter)
              {
                  calculatedCost = Math.Min(
       calculatedCost,
    

    1.030 s

  8. Memory access reduction - part 3

    The check for a transposition looks like another good spot to clean up. word[i-1] can be replaced with targetCharacter. word[i-2] can be replaced by introducing another rolling local variable We can now also remove some boundary checks from the if statement.

    diff Dawg07.cs Dawg08.cs

    @@
              builder[depth] = currentCharacter;
              var calculatedCost = depth + 1;
              var previousRowEntry = matrix[depth][from - 1];
    +         var targetCharacter = from > 1 ? word[from - 1] : ' ';
              for (var i = from; i < to; i++)
              {
    - var targetCharacter = word[i - 1];
    + var previousTargetCharacter = targetCharacter;
    + targetCharacter = word[i - 1];
      var cost = 1;
      if (currentCharacter == targetCharacter)
      {
    @@
          calculatedCost + 1,
          previousRowPreviousEntry + cost);
    
    - if (depth > 0 && i > 1 
    -     && word[i-1] == previousCharacter
    -     && word[i-2] == currentCharacter)
    + if (targetCharacter == previousCharacter
    +     && previousTargetCharacter == currentCharacter)
      {
          calculatedCost = Math.Min(
              calculatedCost,
    

    945.5 ms

  9. Algorithm refinement - part 3

    This code is looking better and better. With all those array accesses out of the way, the next most complicated looking step is the Min calculation. Let's take a step back and think about what it's trying to solve.

    The best possible case is the one corresponding to a character match. When this happens, we don't need to compare previousRowPreviousEntry against adjacent matrix entries to obtain the best result. By switching to an if-else conditional instead of a Math.Min, we can avoid some unnecessary work.

    Conversely in the case of a transposition, previousRowPreviousEntry cannot possibly be better than the diagonal from the row before it.

    And finally, we can pull the common +1 out of the Min call and only apply it once.

    diff Dawg08.cs Dawg09.cs

    @@
      {
          var previousTargetCharacter = targetCharacter;
          targetCharacter = word[i - 1];
    -     var cost = 1;
    -     if (currentCharacter == targetCharacter)
    -     {
    -         cost = 0;
    -     }
    
          var previousRowPreviousEntry = previousRowEntry;
          previousRowEntry = matrix[depth][i];
    
    -     calculatedCost = Min(
    -         previousRowEntry + 1,
    -         calculatedCost + 1,
    -         previousRowPreviousEntry + cost);
    -
    -     if (targetCharacter == previousCharacter
    -         && previousTargetCharacter == currentCharacter)
    +     if (currentCharacter == targetCharacter)
    +     {
    +         calculatedCost = previousRowPreviousEntry;
    +     }
    +     else
          {
    -         calculatedCost = Math.Min(
    +         if (targetCharacter == previousCharacter
    +             && previousTargetCharacter == currentCharacter)
    +         {
    +             previousRowPreviousEntry = matrix[depth - 1][i - 2];
    +         }
    +
    +         calculatedCost = Min(
    +             previousRowEntry,
                  calculatedCost,
    -             matrix[depth - 1][i - 2] + 1);
    +             previousRowPreviousEntry) + 1;
          }
    
          if (calculatedCost <= maxEdits)
    

    916.4 ms

  10. Memory access reduction - part 4

    Taking a look at the variables in the inner-loop, I see maxEdits is only used once. Any variable we can remove from the loop will reduce pressure on the registers and eliminate loads and stores to the stack.

    Currently, entries in the matrix represent the edit distance between strings, going from 0 to maxEdits. If we instead prepoulate our matrix with values from -maxEdits to 0, we can then compare against 0 to see if a word is valid. Simply subtracting maxEdits from each prepopulated entry will suffice.

    diff Dawg09.cs Dawg10.cs

    @@
      for (var i = 0; i < matrix.Length; i++)
      {
          matrix[i] = new int[word.Length + 1];
    -     matrix[i][0] = i;
    +     matrix[i][0] = i - (int)maxEdits;
    
          var stripeEnd = i + maxEdits + 1;
          if (stripeEnd <= word.Length)
          {
    -         matrix[i][stripeEnd] = (int)maxEdits;
    +         matrix[i][stripeEnd] = 0;
          }
      }
    
      for (var i = 0; i < matrix[0].Length; i++)
      {
    -     matrix[0][i] = i;
    +     matrix[0][i] = i - (int)maxEdits;
      }
    
      Recurse(_graph.RootNodeIndex, 0);
    @@
               previousRowPreviousEntry) + 1;
                  }
    
    -             if (calculatedCost <= maxEdits)
    +             if (calculatedCost <= 0)
                  {
           any = true;
                  }
    @@
              var nextNode = _graph.EdgeToNodeIndex[childEdge];
              if (nextNode < 0
                  && depth >= word.Length - maxEdits - 1
    -             && calculatedCost <= maxEdits)
    +             && calculatedCost <= 0)
              {
                  results.Add(new SuggestItem(builder.ToString(0, depth + 1), 0));
              }
    

    861.1 ms

  11. Time to break out the big guns - Sharplab

    At this point I tried a few things but no matter what I did the results were not encouraging. We're now going to look under the hood and see what the compiler sees. Using Sharplab, we can see our code at different levels of compilation.

    1. C# (with abstractions and language features expanded)
    2. IL (Intermediate Language)
    3. JIT Asm (Native Asm Code)

    Instead of a diff, this time I present a minimum viable example that can be used as input for sharplab.

    Sharplab Dawg.cs

    @@
    using System;
    using System.Collections.Generic;
    using System.Text;
    
    namespace PerformanceTesting
    {
        public sealed class CompressedSparseRowGraph
        {
            public readonly int RootNodeIndex;
            public readonly uint[] FirstChildEdgeIndex;
            public readonly int[] EdgeToNodeIndex;
            public readonly char[] EdgeCharacters;
            public readonly ushort[] ReachableTerminalNodes;
        }
    
        public sealed class SuggestItem
        {
            public readonly string Term;
            public readonly ulong Count;
    
            public SuggestItem(string term, ulong count)
            {
                Term = term;
                Count = count;
            }
        }
    
        public sealed class Dawg
        {
            private readonly CompressedSparseRowGraph _graph;
    
            public Dawg(CompressedSparseRowGraph graph)
            {
                _graph = graph;
            }
    
            public IEnumerable<SuggestItem> Lookup(string word, uint maxEdits)
            {
                var builder = new StringBuilder(word.Length + (int)maxEdits);
                builder.Append(new string(' ', word.Length + (int)maxEdits));
                var results = new List<SuggestItem>();
    
                var matrix = new int[word.Length + maxEdits + 1][];
                for (var i = 0; i < matrix.Length; i++)
                {
                    matrix[i] = new int[word.Length + 1];
                    matrix[i][0] = i - (int)maxEdits;
    
                    var stripeEnd = i + maxEdits + 1;
                    if (stripeEnd <= word.Length)
                    {
                        matrix[i][stripeEnd] = 0;
                    }
                }
    
                for (var i = 0; i < matrix[0].Length; i++)
                {
                    matrix[0][i] = i - (int)maxEdits;
                }
    
                Recurse(_graph.RootNodeIndex, 0);
                return results;
    
                void Recurse(int currentNode, int depth)
                {
                    if (depth == word.Length + maxEdits)
                    {
                        return;
                    }
    
                    var firstChild = _graph.FirstChildEdgeIndex[currentNode];
                    var lastChild = _graph.FirstChildEdgeIndex[currentNode + 1];
    
                    var from = Math.Max(depth - (int)maxEdits + 1, 1);
                    var to = Math.Min(word.Length + 1, depth + maxEdits + 2);
                    var previousCharacter = depth > 0 ? builder[depth - 1] : ' ';
    
                    for (var childEdge = firstChild; childEdge < lastChild; childEdge++)
                    {
                        var any = false;
                        var currentCharacter = _graph.EdgeCharacters[childEdge];
                        builder[depth] = currentCharacter;
                        var calculatedCost = depth + 1;
                        var previousRowEntry = matrix[depth][from - 1];
                        var targetCharacter = from > 1 ? word[from - 1] : ' ';
                        for (var i = from; i < to; i++)
                        {
                            var previousTargetCharacter = targetCharacter;
                            targetCharacter = word[i - 1];
    
                            var previousRowPreviousEntry = previousRowEntry;
                            previousRowEntry = matrix[depth][i];
    
                            if (currentCharacter == targetCharacter)
                            {
                                calculatedCost = previousRowPreviousEntry;
                            }
                            else
                            {
                                if (targetCharacter == previousCharacter
                                    && previousTargetCharacter == currentCharacter)
                                {
                                    previousRowPreviousEntry = matrix[depth - 1][i - 2];
                                }
    
                                calculatedCost = Min(
                                    previousRowEntry,
                                    calculatedCost,
                                    previousRowPreviousEntry) + 1;
                            }
    
                            if (calculatedCost <= 0)
                            {
                                any = true;
                            }
    
                            matrix[depth + 1][i] = calculatedCost;
                        }
    
                        if (!any)
                        {
                            continue;
                        }
    
                        var nextNode = _graph.EdgeToNodeIndex[childEdge];
                        if (nextNode < 0
                            && depth >= word.Length - maxEdits - 1
                            && calculatedCost <= 0)
                        {
                            results.Add(new SuggestItem(builder.ToString(0, depth + 1), 0));
                        }
    
                        Recurse(Math.Abs(nextNode), depth + 1);
                    }
                }
    
                static int Min(int a, int b, int c)
                {
                    return Math.Min(a, Math.Min(b, c));
                }
            }
        }
    }
    
  12. Decompiled C# code

    To make observing results easier, we'll first make our code match up with the decompiled C# version. Select C# on the Results drop-down in Sharplab and you'll notice quite a few differences.

    Most of the required changes involve the Recurse function.

    • Create a struct that will hold the closure variables when Recurse is called.
    • Make the Recurse function static local.
    • Add the closure struct as an argument to Recurse.

    diff Dawg10.cs Dawg11.cs

    @@
    private struct ClosureVariable
    {
        public ClosureVariable(string word, uint maxEdits, Dawg11 @this, int[][] matrix, StringBuilder builder, List<SuggestItem> results) : this()
        {
            this.word = word;
            this.maxEdits = maxEdits;
            _this = @this;
            this.matrix = matrix;
            this.builder = builder;
            this.results = results;
        }
    
        public string word;
        public uint maxEdits;
        public Dawg11 _this;
        public int[][] matrix;
        public StringBuilder builder;
        public List<SuggestItem> results;
    }
    
    public IEnumerable<SuggestItem> Lookup(string word, uint maxEdits)
    {
        var builder = new StringBuilder(word.Length + (int)maxEdits);
        builder.Append(new string(' ', word.Length + (int)maxEdits));
        var results = new List<SuggestItem>();
    
        var matrix = new int[word.Length + maxEdits + 1][];
        for (var i = 0; i < matrix.Length; i++)
        {
            matrix[i] = new int[word.Length + 1];
            matrix[i][0] = i - (int)maxEdits;
    
            var stripeEnd = i + maxEdits + 1;
            if (stripeEnd <= word.Length)
            {
                matrix[i][stripeEnd] = 0;
            }
        }
    
        for (var i = 0; i < matrix[0].Length; i++)
        {
            matrix[0][i] = i - (int)maxEdits;
        }
    
        var closure = new ClosureVariable(word, maxEdits, this, matrix, builder, results);
        Recurse(_graph.RootNodeIndex, 0, ref closure);
        return results;
    
        static void Recurse(int currentNode, int depth, ref ClosureVariable closure)
        {
            if (depth == closure.word.Length + closure.maxEdits)
            {
                return;
            }
    
            var firstChild = closure._this._graph.FirstChildEdgeIndex[currentNode];
            var lastChild = closure._this._graph.FirstChildEdgeIndex[currentNode + 1];
    
            var from = Math.Max(depth - (int)closure.maxEdits + 1, 1);
            var to = Math.Min(closure.word.Length + 1, depth + closure.maxEdits + 2);
            var previousCharacter = depth > 0 ? closure.builder[depth - 1] : ' ';
    
            for (var childEdge = firstChild; childEdge < lastChild; childEdge++)
            {
                var any = false;
                var currentCharacter = closure._this._graph.EdgeCharacters[childEdge];
                closure.builder[depth] = currentCharacter;
                var calculatedCost = depth + 1;
                var previousRowEntry = closure.matrix[depth][from - 1];
                var targetCharacter = from > 1 ? closure.word[from - 1] : ' ';
                for (var i = from; i < to; i++)
                {
                    var previousTargetCharacter = targetCharacter;
                    targetCharacter = closure.word[i - 1];
    
                    var previousRowPreviousEntry = previousRowEntry;
                    previousRowEntry = closure.matrix[depth][i];
    
                    if (currentCharacter == targetCharacter)
                    {
                        calculatedCost = previousRowPreviousEntry;
                    }
                    else
                    {
                        if (targetCharacter == previousCharacter
                            && previousTargetCharacter == currentCharacter)
                        {
                            previousRowPreviousEntry = closure.matrix[depth - 1][i - 2];
                        }
    
                        calculatedCost = Min(
                            previousRowEntry,
                            calculatedCost,
                            previousRowPreviousEntry) + 1;
                    }
    
                    if (calculatedCost <= 0)
                    {
                        any = true;
                    }
    
                    closure.matrix[depth + 1][i] = calculatedCost;
                }
    
                if (!any)
                {
                    continue;
                }
    
                var nextNode = closure._this._graph.EdgeToNodeIndex[childEdge];
                if (nextNode < 0
                    && depth >= closure.word.Length - closure.maxEdits - 1
                    && calculatedCost <= 0)
                {
                    closure.results.Add(new SuggestItem(closure.builder.ToString(0, depth + 1), 0));
                }
    
                Recurse(Math.Abs(nextNode), depth + 1, ref closure);
            }
        }
    
        static int Min(int a, int b, int c)
        {
            return Math.Min(a, Math.Min(b, c));
        }
    }
    

    When we put this new code into Sharplab, the results are pretty similar. We'll use this as our starting point for further refinements.

    It's a bit slower, but I promise you that's because our struct constructor is slower than just setting non-readonly values as they come. That's all in the Lookup function and the assembly code for Recurse has very minimal changes, all in favor of this new version. We will catch up soon.

    915.5 ms

  13. Using language features

    One thing I noticed immediately is that the ClosureVariable struct could be made readonly.

    diff Dawg11.cs Dawg12.cs

    @@
    - private struct ClosureVariable
    + private readonly struct ClosureVariable
      {
    -     public ClosureVariable(string word, uint maxEdits, Dawg11 @this, int[][] matrix, StringBuilder builder, List<SuggestItem> results) : this()
    +     public ClosureVariable(string word, uint maxEdits, Dawg12 @this, int[][] matrix, StringBuilder builder, List<SuggestItem> results) : this()
          {
            this.word = word;
            this.maxEdits = maxEdits;
    @@
            this.results = results;
          }
    
    -     public string word;
    -     public uint maxEdits;
    -     public Dawg11 _this;
    -     public int[][] matrix;
    -     public StringBuilder builder;
    -     public List<SuggestItem> results;
    +     public readonly string word;
    +     public readonly uint maxEdits;
    +     public readonly Dawg12 _this;
    +     public readonly int[][] matrix;
    +     public readonly StringBuilder builder;
    +     public readonly List<SuggestItem> results;
    

    Before running the benchmark I was curious to see how it would look in Sharplab. In IL there are some attributes marking the struct and fields as readonly, and no other changes. In JIT Asm there are no changes at all. Oh well, at least it's semantically correct so I'll leave the changes in.

    914.3 ms

  14. Memory access reduction - part 5

    Reducing indirection

    We have some places with very high levels of indirection, ex: closure._this._graph.FirstChildEdgeIndex. To simplify this, we'll instead add those variables to the closure struct directly. Once done we can also remove the _this member from ClosureVariable.

    diff Dawg12.cs Dawg13.cs

    @@
    - public ClosureVariable(string word, uint maxEdits, Dawg12 @this, int[][] matrix, StringBuilder builder, List<SuggestItem> results) : this()
    + public ClosureVariable(string word, uint maxEdits, CompressedSparseRowGraph graph, int[][] matrix, StringBuilder builder, List<SuggestItem> results) : this()
      {
          this.word = word;
          this.maxEdits = maxEdits;
    -     _this = @this;
    +     this.graph = graph;
          this.matrix = matrix;
          this.builder = builder;
          this.results = results;
    @@
    
      public readonly string word;
      public readonly uint maxEdits;
    - public readonly Dawg12 _this;
    + public readonly CompressedSparseRowGraph graph;
      public readonly int[][] matrix;
      public readonly StringBuilder builder;
      public readonly List<SuggestItem> results;
    @@
          matrix[0][i] = i - (int)maxEdits;
      }
    
    - var closure = new ClosureVariable(word, maxEdits, this, matrix, builder, results);
    + var closure = new ClosureVariable(word, maxEdits, _graph, matrix, builder, results);
      Recurse(_graph.RootNodeIndex, 0, ref closure);
      return results;
    
    @@
              return;
          }
    
    -     var firstChild = closure._this._graph.FirstChildEdgeIndex[currentNode];
    -     var lastChild = closure._this._graph.FirstChildEdgeIndex[currentNode + 1];
    +     var firstChild = closure.graph.FirstChildEdgeIndex[currentNode];
    +     var lastChild = closure.graph.FirstChildEdgeIndex[currentNode + 1];
    
          var from = Math.Max(depth - (int)closure.maxEdits + 1, 1);
          var to = Math.Min(closure.word.Length + 1, depth + closure.maxEdits + 2);
    @@
          for (var childEdge = firstChild; childEdge < lastChild; childEdge++)
          {
              var any = false;
    -         var currentCharacter = closure._this._graph.EdgeCharacters[childEdge];
    +         var currentCharacter = closure.graph.EdgeCharacters[childEdge];
              closure.builder[depth] = currentCharacter;
              var calculatedCost = depth + 1;
              var previousRowEntry = closure.matrix[depth][from - 1];
    @@
       continue;
              }
    
    -         var nextNode = closure._this._graph.EdgeToNodeIndex[childEdge];
    +         var nextNode = closure.graph.EdgeToNodeIndex[childEdge];
              if (nextNode < 0
       && depth >= closure.word.Length - closure.maxEdits - 1
       && calculatedCost <= 0)
    

    892.4 ms

  15. Memory access reduction - part 6

    Reducing indirection

    Our two dimensional array matrix is another place where we can try to remove some unnecessary lookups. If we store a reference to the arrays that we care about, we don't have to make a 2-d access every single time.

    diff Dawg13.cs Dawg14.cs

    @@
      var to = Math.Min(closure.word.Length + 1, depth + closure.maxEdits + 2);
      var previousCharacter = depth > 0 ? closure.builder[depth - 1] : ' ';
    
    + var previousRow = closure.matrix[depth];
    + var currentRow = closure.matrix[depth + 1];
    +
      for (var childEdge = firstChild; childEdge < lastChild; childEdge++)
      {
          var any = false;
          var currentCharacter = closure.graph.EdgeCharacters[childEdge];
          closure.builder[depth] = currentCharacter;
          var calculatedCost = depth + 1;
    -     var previousRowEntry = closure.matrix[depth][from - 1];
    +     var previousRowEntry = previousRow[from - 1];
          var targetCharacter = from > 1 ? closure.word[from - 1] : ' ';
          for (var i = from; i < to; i++)
          {
    @@
              targetCharacter = closure.word[i - 1];
    
              var previousRowPreviousEntry = previousRowEntry;
    -         previousRowEntry = closure.matrix[depth][i];
    +         previousRowEntry = previousRow[i];
    
              if (currentCharacter == targetCharacter)
              {
    @@
                  any = true;
              }
    
    -         closure.matrix[depth + 1][i] = calculatedCost;
    +         currentRow[i] = calculatedCost;
          }
    

    TODO to TODO. This is a step I'd tried before breaking out Sharplab but it was a regression at the time. Now that we've cleaned up some other parts of the code, I'm willing to bet reduction in register pressure is allowing this change to be positive.

    810.5 ms

  16. Algorithm refinement - part 4

    It was at this point that I realized I made a mistake. The targetCharacter initialization was off by 1. It should have been loading the previous char from the array.

    I have unit tests, they all passed. To try to make sense of this I changed the line to assign an invalid character instead. All the tests were still passing. So either my tests are bad or the first character being checked for a transposition doesn't matter. Hint: My tests are fine, see if you can figure it out before reading more.

    The culprit here is our matrix striping. For the early rows where the stripe is not full-width, there is no possible transposition in the first entry because it's before the start of the word. For the others rows, because the width of the stripe is maxEdits from the diagonal in both directions, the very first entry can only be valid if there is a character match. So there are potential transpositions that we are not acknowledging, but their entry value would have been too high anyways.

    diff Dawg14.cs Dawg15.cs

    @@
    - var targetCharacter = from > 1 ? closure.word[from - 1] : ' ';
    + var targetCharacter = (char)0;
    

    771.4 ms

  17. Branch reduction - part 1

    Examining the generated Asm, I notice that the Math.Min are implemented with two jumps and 2 or 3 registers. It is an easy change to have it use 2 registers every time and only 1 jump. I've applied this in a few places. While it seems like there are other instances that could be changed, I tried and found them to be regressions. Hopefully after a few more changes we can go back and try again.

    diff Dawg15.cs Dawg16.cs

    @@
          var closure = new ClosureVariable(word, maxEdits, _graph, matrix, builder, results);
          Recurse(_graph.RootNodeIndex, 0, ref closure);
          return results;
    + }
    
    -     static void Recurse(int currentNode, int depth, ref ClosureVariable closure)
    + private static void Recurse(int currentNode, int depth, ref ClosureVariable closure)
    
    @@
          var firstChild = closure.graph.FirstChildEdgeIndex[currentNode];
          var lastChild = closure.graph.FirstChildEdgeIndex[currentNode + 1];
    
    -         var from = Math.Max(depth - (int)closure.maxEdits + 1, 1);
    +     var from = depth - (int)closure.maxEdits;
    +     if (from < 0)
    +     {
    +         from = 0;
    +     }
    +
    +     from++;
    +
          var to = Math.Min(closure.word.Length + 1, depth + closure.maxEdits + 2);
    -         var previousCharacter = depth > 0 ? closure.builder[depth - 1] : ' ';
    +     var previousCharacter = depth > 0 ? closure.builder[depth - 1] : (char)0;
    
    @@
                  }
                  else
                  {
    +                 if (previousRowEntry < calculatedCost)
    +                 {
    +                     calculatedCost = previousRowEntry;
    +                 }
    +
                      if (targetCharacter == previousCharacter
                          && previousTargetCharacter == currentCharacter)
                      {
                          previousRowPreviousEntry = closure.matrix[depth - 1][i - 2];
                      }
    
    -                     calculatedCost = Min(
    -                         previousRowEntry,
    -                         calculatedCost,
    -                         previousRowPreviousEntry) + 1;
    +                 if (previousRowPreviousEntry < calculatedCost)
    +                 {
    +                     calculatedCost = previousRowPreviousEntry;
    +                 }
    +
    +                 calculatedCost++;
                  }
    
                  if (calculatedCost <= 0)
    @@
              var nextNode = closure.graph.EdgeToNodeIndex[childEdge];
    -             if (nextNode < 0
    -                 && depth >= closure.word.Length - closure.maxEdits - 1
    +         if (nextNode < 0)
    +         {
    +             nextNode = -nextNode;
    +             if (depth >= closure.word.Length - closure.maxEdits - 1
                      && calculatedCost <= 0)
                  {
                      closure.results.Add(new SuggestItem(closure.builder.ToString(0, depth + 1), 0));
                  }
    -             Recurse(Math.Abs(nextNode), depth + 1, ref closure);
    -         }
    -     }
    -
    -     static int Min(int a, int b, int c)
    -     {
    -         return Math.Min(a, Math.Min(b, c));
    -     }
    +         Recurse(nextNode, depth + 1, ref closure);
    +     }
      }
    

    719.5 ms

  18. Reading assembly code

    At this point, anything I try seems to make the benchmark take longer. I've been avoiding this, but it's time to parse the Asm output and make sense of it.

    To make this easier, I'm annotating the ClosureVariable with the StructLayoutAttribute so that I can more readily identify which field is being used.

    Sharplab Dawg.cs v2

    using System;
    using System.Collections.Generic;
    using System.Text;
    using System.Runtime.InteropServices;
    
    namespace PerformanceTesting
    {
        public readonly struct SuggestItem
        {
            public readonly string Term;
            public readonly ulong Count;
    
            public SuggestItem(string term, ulong count)
            {
                Term = term;
                Count = count;
            }
        }
    
        public sealed class CompressedSparseRowGraph
        {
            public readonly char[] EdgeCharacters;
            public readonly uint[] FirstChildEdgeIndex;
            public readonly int[] EdgeToNodeIndex;
            public readonly int RootNodeIndex;
            public readonly ushort[] ReachableTerminalNodes;
            public readonly ulong[] WordCounts;
        }
    
        public sealed class Dawg
        {
            private readonly CompressedSparseRowGraph _graph;
    
            public Dawg(CompressedSparseRowGraph graph)
            {
                _graph = graph;
            }
    
            [StructLayout(LayoutKind.Explicit)]
            private readonly struct ClosureVariable
            {
                public ClosureVariable(string word, uint maxEdits, CompressedSparseRowGraph graph, int[][] matrix, StringBuilder builder, List<SuggestItem> results) : this()
                {
                    this.word = word;
                    this.maxEdits = maxEdits;
                    this.graph = graph;
                    this.matrix = matrix;
                    this.builder = builder;
                    this.results = results;
                }
    
                [FieldOffset(0x00)]
                public readonly string word;
                [FieldOffset(0x08)]
                public readonly uint maxEdits;
                [FieldOffset(0x10)]
                public readonly CompressedSparseRowGraph graph;
                [FieldOffset(0x18)]
                public readonly int[][] matrix;
                [FieldOffset(0x20)]
                public readonly StringBuilder builder;
                [FieldOffset(0x28)]
                public readonly List<SuggestItem> results;
            }
    
            public IEnumerable<SuggestItem> Lookup(string word, uint maxEdits)
            {
                var builder = new StringBuilder(word.Length + (int)maxEdits);
                builder.Append(new string(' ', word.Length + (int)maxEdits));
                var results = new List<SuggestItem>();
    
                var matrix = new int[word.Length + maxEdits + 1][];
                for (var i = 0; i < matrix.Length; i++)
                {
                    matrix[i] = new int[word.Length + 1];
                    matrix[i][0] = i - (int)maxEdits;
    
                    var stripeEnd = i + maxEdits + 1;
                    if (stripeEnd <= word.Length)
                    {
                        matrix[i][stripeEnd] = 0;
                    }
                }
    
                for (var i = 0; i < matrix[0].Length; i++)
                {
                    matrix[0][i] = i - (int)maxEdits;
                }
    
                var closure = new ClosureVariable(word, maxEdits, _graph, matrix, builder, results);
                Recurse(_graph.RootNodeIndex, 0, ref closure);
                return results;
            }
    
            static void Recurse(int currentNode, int depth, ref ClosureVariable closure)
            {
                if (depth == closure.word.Length + closure.maxEdits)
                {
                    return;
                }
    
                var firstChild = closure.graph.FirstChildEdgeIndex[currentNode];
                var lastChild = closure.graph.FirstChildEdgeIndex[currentNode + 1];
    
                var from = depth - (int) closure.maxEdits;
                if (from < 0)
                {
                    from = 0;
                }
    
                from++;
    
                var to = Math.Min(closure.word.Length + 1, depth + closure.maxEdits + 2);
                var previousCharacter = depth > 0 ? closure.builder[depth - 1] : (char) 0;
    
                var previousRow = closure.matrix[depth];
                var currentRow = closure.matrix[depth + 1];
    
                for (var childEdge = firstChild; childEdge < lastChild; childEdge++)
                {
                    var any = false;
                    var currentCharacter = closure.graph.EdgeCharacters[childEdge];
                    closure.builder[depth] = currentCharacter;
                    var calculatedCost = depth + 1;
                    var previousRowEntry = previousRow[from - 1];
                    var targetCharacter = (char) 0;
                    for (var i = from; i < to; i++)
                    {
                        var previousTargetCharacter = targetCharacter;
                        targetCharacter = closure.word[i - 1];
    
                        var previousRowPreviousEntry = previousRowEntry;
                        previousRowEntry = previousRow[i];
    
                        if (currentCharacter == targetCharacter)
                        {
                            calculatedCost = previousRowPreviousEntry;
                        }
                        else
                        {
                            if (previousRowEntry < calculatedCost)
                            {
                                calculatedCost = previousRowEntry;
                            }
    
                            if (targetCharacter == previousCharacter
                                && previousTargetCharacter == currentCharacter)
                            {
                                previousRowPreviousEntry = closure.matrix[depth - 1][i - 2];
                            }
    
                            if (previousRowPreviousEntry < calculatedCost)
                            {
                                calculatedCost = previousRowPreviousEntry;
                            }
    
                            calculatedCost++;
                        }
    
                        if (calculatedCost <= 0)
                        {
                            any = true;
                        }
    
                        currentRow[i] = calculatedCost;
                    }
    
                    if (!any)
                    {
                        continue;
                    }
    
                    var nextNode = closure.graph.EdgeToNodeIndex[childEdge];
                    if (nextNode < 0)
                    {
                        nextNode = -nextNode;
                        if (depth >= closure.word.Length - closure.maxEdits - 1
                            && calculatedCost <= 0)
                        {
                            closure.results.Add(new SuggestItem(closure.builder.ToString(0, depth + 1), 0));
                        }
                    }
    
                    Recurse(nextNode, depth + 1, ref closure);
                }
            }
        }
    }
    

    Here is the Asm along with my translation back to the C# code.

    Sharplab Asm output

    // preamble
    L0000: push r15
    L0002: push r14
    L0004: push r13
    L0006: push r12
    L0008: push rdi
    L0009: push rsi
    L000a: push rbp
    L000b: push rbx
    L000c: sub rsp, 0x58
    
    // Store 0s on the stack, memory is used when adding a result.
    L0010: xor eax, eax
    L0012: mov [rsp+0x38], rax
    L0017: mov [rsp+0x40], rax
    
    // Move function parameters into different registers
    L001c: mov edi, edx
    L001e: mov rsi, r8
    
    // if (depth == closure.word.Length + closure.maxEdits)
    // {
    //     return;
    // }
        // Load address of word from closure
        L0021: mov rdx, [rsi]
        // Load length from word
        L0024: mov edx, [rdx+8]
        // Expand length into 64 bit
        L0027: movsxd rdx, edx
        // Load maxEdits
        L002a: mov eax, [rsi+8]
        // Copy maxEdits
        L002d: mov r8d, eax
        // closure.word.Length + closure.maxEdits
        L0030: add rdx, r8
        // Copy and expand depth into 64 bit
        L0033: movsxd r8, edi
        // if (depth == closure.word.Length + closure.maxEdits)
        L0036: cmp rdx, r8
        L0039: jne short L004c
        // return;
        L003b: add rsp, 0x58
        L003f: pop rbx
        L0040: pop rbp
        L0041: pop rsi
        L0042: pop rdi
        L0043: pop r12
        L0045: pop r13
        L0047: pop r14
        L0049: pop r15
        L004b: ret
    
    // var firstChild = closure.graph.FirstChildEdgeIndex[currentNode];
    // var lastChild = closure.graph.FirstChildEdgeIndex[currentNode + 1];
        // Load the address of graph from closure
        L004c: mov rdx, [rsi+0x10]
        // Load the address of FirstChildEdgeIndex from graph
        L0050: mov rdx, [rdx+0x10]
        // Copy FirstChildEdgeIndex
        L0054: mov r8, rdx
        L0057: cmp ecx, [r8+8] // Boundary check on FirstChildEdgeIndex vs currentNode
        L005b: jae L0334 // IndexOutOfRangeException 
    
        // Copy currentNode
        L0061: movsxd r9, ecx
        // var firstChild = closure.graph.FirstChildEdgeIndex[currentNode];
        L0064: mov ebx, [r8+r9*4+0x10]
        // currentNode + 1
        L0069: lea r8d, [rcx+1]
        L006d: cmp r8d, [rdx+8] // Boundary check on FirstChildEdgeIndex vs currentNode + 1
        L0071: jae L0334 // IndexOutOfRangeException
    
        // currentNode + 1 // Isn't this unnecessary?
        L0077: inc ecx
        // Expand currentNode + 1 into 64 bits
        L0079: movsxd rcx, ecx
        // var lastChild = closure.graph.FirstChildEdgeIndex[currentNode + 1];
        L007c: mov ebp, [rdx+rcx*4+0x10]
    
    // var from = depth - (int)closure.maxEdits;
    // if (from < 0)
    // {
    //     from = 0;
    // }
    // from++;
        // Copy depth
        L0080: mov r14d, edi
        // var from = depth - (int)closure.maxEdits;
        L0083: sub r14d, eax
        // if (from < 0)
        L0086: test r14d, r14d
        L0089: jge short L008e
        // from = 0;
        L008b: xor r14d, r14d
        // from++;
        L008e: inc r14d
    
    // var to = Math.Min(closure.word.Length + 1, depth + closure.maxEdits + 2);
        L0091: mov rcx, [rsi] // word
        // Load word.Length
        L0094: mov ecx, [rcx+8]
        // word.length + 1
        L0097: inc ecx
        // Expand word.length + 1 into 64 bit
        L0099: movsxd r15, ecx
        // Expand depth into 64 bit
        L009c: movsxd rcx, edi
        // Copy maxEdits
        L009f: mov edx, eax
        // to = depth + closure.maxEdits + 2
        L00a1: lea r12, [rcx+rdx+2]
        L00a6: cmp r15, r12
        // Math.Min
        L00a9: jle short L00ad
        L00ab: jmp short L00b0
        // to = word.length + 1
        L00ad: mov r12, r15
    
    // var previousCharacter = depth > 0 ? closure.builder[depth - 1] : (char)0;
        // depth > 0 ? 
        L00b0: test edi, edi
        L00b2: jg short L00b9
        // var previousCharacter = (char)0;
        L00b4: xor r15d, r15d
        L00b7: jmp short L00ca
        // Load address of builder from closure
        L00b9: mov rcx, [rsi+0x20] // builder
        // depth - 1
        L00bd: lea edx, [rdi-1]
        L00c0: cmp [rcx], ecx // I don't know what this does
        L00c2: call System.Text.StringBuilder.get_Chars(Int32)
        // var previousCharacter = closure.builder[depth - 1];
        L00c7: mov r15d, eax
        // Expand previousCharacter into 64 bit
        L00ca: movzx r15d, r15w
        // Store previousCharacter on the stack
        L00ce: mov [rsp+0x50], r15d
    
    // var previousRow = closure.matrix[depth];
    // var currentRow = closure.matrix[depth + 1];
        // Load address of matrix from closure
        L00d3: mov rcx, [rsi+0x18] // matrix
        // Copy of matrix
        L00d7: mov rdx, rcx
        L00da: cmp edi, [rdx+8] // Boundary check on matrix.Length vs depth
        L00dd: jae L0334 // IndexOutOfRangeException 
        // Expand depth into 64 bit
        L00e3: movsxd r8, edi
        // var previousRow = closure.matrix[depth];
        L00e6: mov r13, [rdx+r8*8+0x10]
        // Depth + 1
        L00eb: lea eax, [rdi+1]
        L00ee: cmp eax, [rcx+8] // Boundary check on matrix.Length vs depth + 1
        L00f1: jae L0334 // IndexOutOfRangeException 
        // Store depth + 1 on the stack
        L00f7: mov [rsp+0x34], eax // depth + 1
        // Expand depth + 1 into 64 bit
        L00fb: movsxd rdx, eax
        // var currentRow = closure.matrix[depth + 1];
        L00fe: mov r9, [rcx+rdx*8+0x10]
        // Store currentRow on the stack
        L0103: mov [rsp+0x28], r9 // currentRow
    
    // Store lastChild on the stack
    L0108: mov [rsp+0x54], ebp // lastChild
    
    // for (var childEdge = firstChild; childEdge < lastChild; childEdge++)
        L010c: cmp ebx, ebp
        L010e: jae L0323
    
    // *****  *****  *****  *****
    // Begin outer loop
    // *****  *****  *****  *****
    
    // var any = false;
        L0114: xor r10d, r10d
        // Store any on the stack
        L0117: mov [rsp+0x4c], r10d
    
    // var currentCharacter = closure.graph.EdgeCharacters[childEdge];
        // Load the address of graph from closure
        L011c: mov rcx, [rsi+0x10] // graph
        // Load the address of EdgeCharacters from graph
        L0120: mov rcx, [rcx+8]
        L0124: cmp ebx, [rcx+8] // Boundary check on EdgeCharacters.Length vs childEdge
        L0127: jae L0334 // IndexOutOfRangeException 
        // Expand childEdge into 64 bits
        L012d: movsxd rdx, ebx
        // var currentCharacter = closure.graph.EdgeCharacters[childEdge];
        L0130: movzx r11d, word ptr [rcx+rdx*2+0x10]
    
    // closure.builder[depth] = currentCharacter;
        // Load the address of builder from closure
        L0136: mov rcx, [rsi+0x20] // builder
        // Copy depth into function parameter
        L013a: mov edx, edi
        // Store currentCharacter on the stack
        L013c: mov [rsp+0x48], r11d // currentCharacter
        // Copy currentCharacter into function parameter
        L0141: mov r8d, r11d
        L0144: cmp [rcx], ecx // I don't know what this does
        L0146: call System.Text.StringBuilder.set_Chars(Int32, Char)
    
    //var calculatedCost = depth + 1;
        // Load depth + 1 from the stack
        L014b: mov eax, [rsp+0x34] // depth + 1
        // Copy calculatedCost
        L014f: mov ecx, eax
    
    // var previousRowEntry = previousRow[from - 1];
        // from -1
        L0151: lea r8d, [r14-1]
        L0155: mov edx, [r13+8] // previousRow.Length
        L0159: cmp r8d, edx // Boundary check on previousRow vs from - 1
        L015c: jae L0334 // IndexOutOfRangeException 
        // Expand into 64 bits
        L0162: movsxd r8, r8d
        // var previousRowEntry = previousRow[from - 1];
        // Is actually previousRowPreviousEntry
        L0165: mov r8d, [r13+r8*4+0x10] // previousRow
    
    // var targetCharacter = (char)0;
        // Is actually setting previousTargetCharacter
        L016a: xor r9d, r9d
    
    
    // for (var i = from; i < to; i++)
        // i = from
        L016d: mov r10d, r14d
        // i = from // Expanded to 64 bit
        L0170: movsxd r11, r14d
        // i < to
        L0173: cmp r11, r12
        L0176: jge L0248
    
    // *****  *****  *****  *****
    // Begin inner loop
    // *****  *****  *****  *****
    
    // targetCharacter = closure.word[i - 1];
        // Load the address of word from closure
        L017c: mov r11, [rsi] // word
        // i - 1
        L017f: lea ebp, [r10-1]
        L0183: cmp ebp, [r11+8] // Boundary check on word vs i - 1
        L0187: jae L0334 // IndexOutOfRangeException
        // Expand i - 1 into 64 bit
        L018d: movsxd rbp, ebp
        // targetCharacter = closure.word[i - 1];
        L0190: movzx r11d, word ptr [r11+rbp*2+0xc]
    
    // previousRowEntry = previousRow[i];
        L0196: cmp r10d, edx // Boundary check on previousRow vs i
        L0199: jae L0334 // IndexOutOfRangeException
        // Expand i into 64 bit
        L019f: movsxd rbp, r10d
        // previousRow[i]
        L01a2: mov eax, [r13+rbp*4+0x10] // previousRow
    
    // if (currentCharacter == targetCharacter)
        // Load currentCharacter from the stack
        L01a7: mov r15d, [rsp+0x48] // currentCharacter
        L01ac: cmp r15d, r11d
        L01af: jne short L01b6
        L01b1: mov ecx, r8d // previousRowPreviousEntry
        L01b4: jmp short L0206
    // else
        // if (previousRowEntry < calculatedCost)
        //     calculatedCost = previousRowEntry;
            L01b6: cmp eax, ecx
            L01b8: jge short L01bc
            L01ba: mov ecx, eax
    
        // if (targetCharacter == previousCharacter && previousTargetCharacter == currentCharacter)
        //     previousRowPreviousEntry = closure.matrix[depth - 1][i - 2];
            L01bc: cmp r11d, [rsp+0x50] // targetCharacter vs previousCharacter
            L01c1: jne short L01fc
            L01c3: cmp r9d, r15d // previousTargetCharacter vs currentCharacter
            L01c6: jne short L01fc
            // Load address of matrix from closure
            L01c8: mov r8, [rsi+0x18] // matrix
            // depth - 1
            L01cc: lea r9d, [rdi-1]
            L01d0: cmp r9d, [r8+8] // Boundary check matrix vs depth - 1
            L01d4: jae L0334 // IndexOutOfRangeException
            // depth - 1
            L01da: lea r9d, [rdi-1] // This seems useless
            // Expand into 64 bit
            L01de: movsxd r9, r9d
            // Load address of closure.matrix[depth - 1] array
            L01e1: mov r8, [r8+r9*8+0x10]
            // i - 2
            L01e6: lea r9d, [r10-2]
            L01ea: cmp r9d, [r8+8] // Boundary check closure.matrix[depth - 1] vs i - 2
            L01ee: jae L0334 // IndexOutOfRangeException
            // Expand into 64 bit
            L01f4: movsxd r9, r9d
            // previousRowPreviousEntry = closure.matrix[depth - 1][i - 2];
            L01f7: mov r8d, [r8+r9*4+0x10] // previousRowPreviousEntry
    
        // if (previousRowPreviousEntry < calculatedCost)
        //     calculatedCost = previousRowPreviousEntry;
            L01fc: cmp r8d, ecx
            L01ff: jge short L0204
            L0201: mov ecx, r8d
    
        // calculatedCost++;
        L0204: inc ecx
    
    
    // if (calculatedCost <= 0)
    //     any = true;
        L0206: test ecx, ecx
        L0208: jg short L0215
        // Write true into register
        L020a: mov r8d, 1
        // any = true;
        L0210: mov [rsp+0x4c], r8d // any
    
    // currentRow[i] = calculatedCost;
        // Load the address of currentRow
        L0215: mov r9, [rsp+0x28]
        L021a: cmp r10d, [r9+8] // Boundary check on currentRow
        L021e: jae L0334 // IndexOutOfRangeException
        L0224: mov [rsp+0x28], r9 // Is this useless?
        // currentRow[i] = calculatedCost;
        L0229: mov [r9+rbp*4+0x10], ecx
    
    // i++;
        L022e: inc r10d
    // i < to
        // Copy and expand to 64 bits
        L0231: movsxd rbp, r10d
        L0234: cmp rbp, r12 // used in L0242
    
    // Store currentCharacter to the stack
    // This is useless
    L0237: mov [rsp+0x48], r15d // currentCharacter
    
    // var previousRowPreviousEntry = previousRowEntry;
        L023c: mov r8d, eax
    // var previousTargetCharacter = targetCharacter;
        L023f: mov r9d, r11d
    // *****  *****  *****  *****
    L0242: jl L017c // Iterate inner loop
    // *****  *****  *****  *****
    
    // if (!any)
    L0248: cmp dword ptr [rsp+0x4c], 0
    // continue;
    L024d: je L0311
    
    // var nextNode = closure.graph.EdgeToNodeIndex[childEdge];
        // Load address of graph from closure
        L0253: mov r8, [rsi+0x10] // graph
        // Load address of EdgeToNodeIndex from graph
        L0257: mov r8, [r8+0x18]
        L025b: cmp ebx, [r8+8] // Boundary check on EdgeToNodeIndex vs childEdge
        L025f: jae L0334 // IndexOutOfRangeException
        // Expand childEdge to 64 bit
        L0265: movsxd rdx, ebx
        // closure.graph.EdgeToNodeIndex[childEdge]
        L0268: mov ebp, [r8+rdx*4+0x10]
    
    //if (nextNode < 0)
        L026d: test ebp, ebp
        L026f: jge L0303
    
    // nextNode = -nextNode;
        L0275: neg ebp
    
    // if (depth >= closure.word.Length - closure.maxEdits - 1 && calculatedCost <= 0)
        // Load address of word from closure
        L0277: mov r8, [rsi] // word
        // Load word.Length
        L027a: mov r8d, [r8+8]
        // Expand to 64 bit
        L027e: movsxd r8, r8d
        // Load maxEdits from graph
        L0281: mov edx, [rsi+8] // maxEdits
        // closure.word.Length - closure.maxEdits
        L0284: sub r8, rdx
        // closure.word.Length - closure.maxEdits - 1
        L0287: dec r8
        // expand to 64 bit
        L028a: movsxd rdx, edi
        // depth >= closure.word.Length - closure.maxEdits - 1
        L028d: cmp r8, rdx
        L0290: jg short L0303
        // calculatedCost <= 0
        L0292: test ecx, ecx
        L0294: jg short L0303
        // closure.results.Add(new SuggestItem(closure.builder.ToString(0, depth + 1), 0));
            // Load address of results from closure
            L0296: mov r15, [rsi+0x28] // results
            // Load address of builder from closure
            L029a: mov rcx, [rsi+0x20] // builder
            // Load depth + 1 from the stack
            L029e: mov r8d, [rsp+0x34] // depth + 1
            // Set function argument to 0
            L02a3: xor edx, edx
            L02a5: cmp [rcx], ecx // I don't know what this does
            // closure.builder.ToString(0, depth + 1)
            L02a7: call System.Text.StringBuilder.ToString(Int32, Int32)
    
            // I think this is moving the result of the ToString into the function argument slot
            L02ac: mov rdx, rax
            // I don't really care what's going on down here.
            // closure.results.Add(new SuggestItem(...)
            L02af: xor eax, eax
            L02b1: inc dword ptr [r15+0x14]
            L02b5: mov rcx, [r15+8]
            L02b9: mov r8d, [r15+0x10]
            L02bd: mov r9d, [rcx+8]
            L02c1: cmp r9d, r8d
            L02c4: jbe short L02ea
            L02c6: lea eax, [r8+1]
            L02ca: mov [r15+0x10], eax
            L02ce: movsxd rax, r8d
            L02d1: shl rax, 4
            L02d5: lea r15, [rcx+rax+0x10]
            L02da: mov rcx, r15
            L02dd: call 0x00007ff9730aa9c0
            L02e2: xor ecx, ecx
            L02e4: mov [r15+8], rcx
            L02e8: jmp short L0303
            L02ea: lea rcx, [rsp+0x38]
            L02ef: mov [rcx], rdx
            L02f2: mov [rcx+8], rax
            L02f6: mov rcx, r15
            L02f9: lea rdx, [rsp+0x38]
            L02fe: call System.Collections.Generic.List`1[[PerformanceTesting.SuggestItem, _]].AddWithResize(PerformanceTesting.SuggestItem)
    
    // Load depth + 1 from the stack
    L0303: mov edx, [rsp+0x34] // depth + 1
    // Prepare function parameter nextNode
    L0307: mov ecx, ebp
    // Prepare function parameter closure
    L0309: mov r8, rsi
    // Recurse
    L030c: call PerformanceTesting.Dawg11.<Lookup>g__Recurse|3_0(Int32, Int32, ClosureVariable ByRef)
    
    // childEdge++
    L0311: inc ebx
    L0313: mov ebp, [rsp+0x54] // lastChild
    L0317: cmp ebx, ebp
    L0319: mov [rsp+0x54], ebp
    // *****  *****  *****  *****
    L031d: jb L0114 // Iterate outer loop
    // *****  *****  *****  *****
    
    // return;
    L0323: add rsp, 0x58
    L0327: pop rbx
    L0328: pop rbp
    L0329: pop rsi
    L032a: pop rdi
    L032b: pop r12
    L032d: pop r13
    L032f: pop r14
    L0331: pop r15
    L0333: ret
    // IndexOutOfRangeException 
    L0334: call 0x00007ff9731fc590
    L0339: int3
    
  19. Memory access reduction - part 7

    Reducing indirection

    closure.graph.EdgeCharacters and closure.graph.EdgeToNodeIndex are retrieved inside the loops. It's easy enough to store them directly in closure instead.

    diff Dawg16.cs Dawg17.cs

    @@
      {
          this.word = word;
          this.maxEdits = maxEdits;
    -     this.graph = graph;
          this.matrix = matrix;
          this.builder = builder;
          this.results = results;
    +
    +     EdgeCharacters = graph.EdgeCharacters;
    +     FirstChildEdgeIndex = graph.FirstChildEdgeIndex;
    +     EdgeToNodeIndex = graph.EdgeToNodeIndex;
      }
    
      public readonly string word;
      public readonly uint maxEdits;
    - public readonly CompressedSparseRowGraph graph;
      public readonly int[][] matrix;
      public readonly StringBuilder builder;
      public readonly List<SuggestItem> results;
    +
    + public readonly char[] EdgeCharacters;
    + public readonly uint[] FirstChildEdgeIndex;
    + public readonly int[] EdgeToNodeIndex;
    }
    
    @@
    - var firstChild = closure.graph.FirstChildEdgeIndex[currentNode];
    - var lastChild = closure.graph.FirstChildEdgeIndex[currentNode + 1];
    + var firstChild = closure.FirstChildEdgeIndex[currentNode];
    + var lastChild = closure.FirstChildEdgeIndex[currentNode + 1];
    
    @@
    -     var currentCharacter = closure.graph.EdgeCharacters[childEdge];
    +     var currentCharacter = closure.EdgeCharacters[childEdge];
    
    @@
    -     var nextNode = closure.graph.EdgeToNodeIndex[childEdge];
    +     var nextNode = closure.EdgeToNodeIndex[childEdge];
    

    718.0 ms

  20. Removing redundant instructions - part 1

    Once we reach the loop, depth is only accessed with an offset. The most common of these is +1, so why not just do that work once?

    diff Dawg17.cs Dawg18.cs

    @@
      var previousRow = closure.matrix[depth];
    - var currentRow = closure.matrix[depth + 1];
    + ++depth;
    + var currentRow = closure.matrix[depth];
    
      for (var childEdge = firstChild; childEdge < lastChild; childEdge++)
      {
          var any = false;
          var currentCharacter = closure.EdgeCharacters[childEdge];
    -     closure.builder[depth] = currentCharacter;
    -     var calculatedCost = depth + 1;
    +     closure.builder[depth - 1] = currentCharacter;
    +     var calculatedCost = depth;
          var previousRowEntry = previousRow[from - 1];
          var targetCharacter = (char)0;
          for (var i = from; i < to; i++)
    @@
                  if (targetCharacter == previousCharacter
                      && previousTargetCharacter == currentCharacter)
                  {
    -                 previousRowPreviousEntry = closure.matrix[depth - 1][i - 2];
    +                 previousRowPreviousEntry = closure.matrix[depth - 2][i - 2];
                  }
    
                  if (previousRowPreviousEntry < calculatedCost)
    @@
          if (nextNode < 0)
          {
              nextNode = -nextNode;
    -         if (depth >= closure.word.Length - closure.maxEdits - 1
    +         if (depth >= closure.word.Length - closure.maxEdits
                  && calculatedCost <= 0)
              {
    -             closure.results.Add(new SuggestItem(closure.builder.ToString(0, depth + 1), 0));
    +             closure.results.Add(new SuggestItem(closure.builder.ToString(0, depth), 0));
              }
          }
    
    -     Recurse(nextNode, depth + 1, ref closure);
    +     Recurse(nextNode, depth, ref closure);
      }
    

    Ok, this one I just can't explain. The resulting assembly successfully re-uses the incremented depth while the old version cache it and has to store/load it when used. It might be either an alignment issue, a cpu pipeline mismatch, or something weirder. I would use Intel VTune, but I'm not aware of a C# capable equivalent for AMD.

    755.68 ms

  21. Memory access reduction - part 8

    Reducing boundary checks.

    Not far from the top I noticed two consecutive boundary checks on FirstChildEdgeIndex. I tried swapping them to do the currentNode + 1 check first and eliminate the second check - it didn't work. Still a positive change in the Asm so I kept it.

    I also noticed maxEdits was being cast to an int or even causing int + uint -> long conversions, so I changed it from uint to int in ClosureVariable.

    diff Dawg18.cs Dawg19.cs

    @@
      public ClosureVariable(string word, uint maxEdits, CompressedSparseRowGraph graph, int[][] matrix, StringBuilder builder, List<SuggestItem> results) : this()
      {
          this.word = word;
    -     this.maxEdits = maxEdits;
    +     this.maxEdits = (int)maxEdits;
          this.matrix = matrix;
    @@
      public readonly string word;
    - public readonly uint maxEdits;
    + public readonly int maxEdits;
      public readonly int[][] matrix;
    @@
          return;
      }
    
    - var firstChild = closure.FirstChildEdgeIndex[currentNode];
    - var lastChild = closure.FirstChildEdgeIndex[currentNode + 1];
    + var fIndex = closure.FirstChildEdgeIndex;
    + var firstChild = fIndex[currentNode];
    + var lastChild = fIndex[currentNode + 1];
    
    - var from = depth - (int)closure.maxEdits;
    + var from = depth - closure.maxEdits;
      if (from < 0)
      {
          from = 0;
    @@
    
    - var to = Math.Min(closure.word.Length + 1, depth + closure.maxEdits + 2);
    + var to = (long)Math.Min(closure.word.Length + 1, depth + closure.maxEdits + 2);
      var previousCharacter = depth > 0 ? closure.builder[depth - 1] : (char)0;
    

    Well now I'm really confused. Another regression when the assembly has only gotten better.

    779.3 ms

  22. Memory access reduction - part 9

    Reducing boundary checks, for real this time.

    Hint: pointers don't check for out of range boundary conditions.

    While I was at it, I converted matrix to a one-dimensional array. It makes much more sense when we're using pointers.

    diff Dawg19.cs Dawg20.cs

    @@
          public readonly string word;
          public readonly int maxEdits;
    -     public readonly int[][] matrix;
    -     public readonly StringBuilder builder;
    +     public readonly int* matrix;
    +     public readonly char* builder;
          public readonly List<SuggestItem> results;
    
    -     public readonly char[] EdgeCharacters;
    -     public readonly uint[] FirstChildEdgeIndex;
    -     public readonly int[] EdgeToNodeIndex;
    +     public readonly char* EdgeCharacters;
    +     public readonly uint* FirstChildEdgeIndex;
    +     public readonly int* EdgeToNodeIndex;
      }
    
      public IEnumerable<SuggestItem> Lookup(string word, uint maxEdits)
      {
    -     var builder = new StringBuilder(word.Length + (int)maxEdits);
    -     builder.Append(new string(' ', word.Length + (int)maxEdits));
    +     var builderLength = word.Length + (int)maxEdits;
    +     var builder = stackalloc char[builderLength];
    +     for (var i = 0; i < builderLength; i++)
    +     {
    +         builder[i] = ' ';
    +     }
    +
          var results = new List<SuggestItem>();
    
    -     var matrix = new int[word.Length + maxEdits + 1][];
    -     for (var i = 0; i < matrix.Length; i++)
    +     var rowLength = word.Length + 1;
    +     var rowCount = rowLength + (int)maxEdits;
    +     var matrix = stackalloc int[rowLength * rowCount];
    +     for (var i = 0; i < rowCount; i++)
          {
    -         matrix[i] = new int[word.Length + 1];
    -         matrix[i][0] = i - (int)maxEdits;
    +
    +         matrix[i * rowLength] = i - (int)maxEdits;
    
              var stripeEnd = i + maxEdits + 1;
              if (stripeEnd <= word.Length)
              {
    -             matrix[i][stripeEnd] = 0;
    +             matrix[i * rowLength + stripeEnd] = 0;
              }
          }
    
    -     for (var i = 0; i < matrix[0].Length; i++)
    +     for (var i = 0; i < rowLength; i++)
          {
    -         matrix[0][i] = i - (int)maxEdits;
    +         matrix[i] = i - (int)maxEdits;
          }
    
          var closure = new ClosureVariable(word, maxEdits, _graph, matrix, builder, results);
    @@
          var to = (long)Math.Min(closure.word.Length + 1, depth + closure.maxEdits + 2);
          var previousCharacter = depth > 0 ? closure.builder[depth - 1] : (char)0;
    
    -     var previousRow = closure.matrix[depth];
    +     var rowLength = closure.word.Length + 1;
    +     var previousRow = closure.matrix + depth * rowLength;
    +     var currentRow = previousRow + rowLength;
          ++depth;
    -     var currentRow = closure.matrix[depth];
    
          for (var childEdge = firstChild; childEdge < lastChild; childEdge++)
          {
    @@
                      if (targetCharacter == previousCharacter
                          && previousTargetCharacter == currentCharacter)
                      {
    -                     previousRowPreviousEntry = closure.matrix[depth - 2][i - 2];
    +                     previousRowPreviousEntry = previousRow[i - closure.word.Length - 3];
                      }
    
                      if (previousRowPreviousEntry < calculatedCost)
    @@
                  if (depth >= closure.word.Length - closure.maxEdits
                      && calculatedCost <= 0)
                  {
    -                 closure.results.Add(new SuggestItem(closure.builder.ToString(0, depth), 0));
    +                 closure.results.Add(new SuggestItem(new string(closure.builder, 0, depth), 0));
                  }
              }
    

    Finally, an improvement.

    673.7 ms

  23. Removing redundant instructions - part 2

    Because we're in 64-bit mode, memory indexing operations operate with 64 bit addresses (duh). Many of our indexing values are int. Converting them to long removes some useless movxsd instructions.

    For this to work, I've changed some variables types from int to long.

    Conversely, some other values which aren't used for indexing and remain as int, are used either in arithmetic or comparisons to our newly promoted longs. Casting the long to an int is free in terms of assembly, the 32 bit version of the same register will be used instead. This adds some casts to (int) in a bunch of places.

    diff Dawg20.cs Dawg21.cs

    @@
    - public ClosureVariable(string word, uint maxEdits, PointerGraph graph, int* matrix, char* builder, List<SuggestItem> results) : this()
    + public ClosureVariable(string word, int maxEdits, PointerGraph graph, int* matrix, char* builder, List<SuggestItem> results) : this()
      {
          this.word = word;
    -     this.maxEdits = (int)maxEdits;
    +     this.maxEdits = maxEdits;
    
    @@
    - public readonly int maxEdits;
    + public readonly long maxEdits;
    
    @@
    - var builderLength = word.Length + (int)maxEdits;
    + var builderLength = word.Length + (int)maxEdits + 1;
      var builder = stackalloc char[builderLength];
      for (var i = 0; i < builderLength; i++)
      {
          builder[i] = ' ';
      }
    
    + builder++;
    +
    
    @@
    - var closure = new ClosureVariable(word, maxEdits, _graph, matrix, builder, results);
    + var closure = new ClosureVariable(word, (int)maxEdits, _graph, matrix, builder, results);
      Recurse(_graph.RootNodeIndex, 0, ref closure);
      return results;
             }
    
    -        private static void Recurse(int currentNode, int depth, ref ClosureVariable closure)
    +        private static void Recurse(int currentNode, long depth, ref ClosureVariable closure)
             {
    - if (depth == closure.word.Length + closure.maxEdits)
    + if ((int)depth == closure.word.Length + (int)closure.maxEdits)
      {
    -     return;
    +     goto end;
      }
    
    - var fIndex = closure.FirstChildEdgeIndex;
    - var firstChild = fIndex[currentNode];
    - var lastChild = fIndex[currentNode + 1];
    + var fIndex = closure.FirstChildEdgeIndex + currentNode;
    + var firstChild = *fIndex;
    + var lastChild = *(fIndex + 1);
    
    @@
    - var to = (long)Math.Min(closure.word.Length + 1, depth + closure.maxEdits + 2);
      var previousCharacter = depth > 0 ? closure.builder[depth - 1] : (char)0;
    
    - var rowLength = closure.word.Length + 1;
    + var rowLength = (long)closure.word.Length + 1;
    + var to = Math.Min(rowLength, depth + closure.maxEdits + 2);
      var previousRow = closure.matrix + depth * rowLength;
      var currentRow = previousRow + rowLength;
      ++depth;
    
    @@
          var any = false;
          var currentCharacter = closure.EdgeCharacters[childEdge];
          closure.builder[depth - 1] = currentCharacter;
    -     var calculatedCost = depth;
    +     var calculatedCost = (int)depth;
          var previousRowEntry = previousRow[from - 1];
          var targetCharacter = (char)0;
          for (var i = from; i < to; i++)
          {
              var previousTargetCharacter = targetCharacter;
    -         targetCharacter = closure.word[i - 1];
    +         targetCharacter = closure.word[(int)i - 1];
    
    @@
          if (nextNode < 0)
          {
              nextNode = -nextNode;
    -         if (depth >= closure.word.Length - closure.maxEdits
    +         if ((int)depth >= closure.word.Length - (int)closure.maxEdits
                  && calculatedCost <= 0)
              {
    -             closure.results.Add(new SuggestItem(new string(closure.builder, 0, depth), 0));
    +             closure.results.Add(new SuggestItem(new string(closure.builder, 0, (int)depth), 0));
              }
          }
    
          Recurse(nextNode, depth, ref closure);
      }
    +
    + end:;
    }
    

    A big difference for such a small change.

    637.6 ms

  24. Memory access reduction - part 10

    Reducing indirection

    closure.word.Length requires the address of word to be retrieved from ClosureVariable, then the value of Length is retrieved from word. Storing that value directly in the struct removes a layer of indirection and gives us a small boost.

    diff Dawg21.cs Dawg22.cs

    @@
          public ClosureVariable(string word, int maxEdits, PointerGraph graph, int* matrix, char* builder, List<SuggestItem> results) : this()
          {
              this.word = word;
    +         wordLength = word.Length;
              this.maxEdits = maxEdits;
    
    @@
          public readonly string word;
    +     public readonly int wordLength;
          public readonly long maxEdits;
    
    @@
      private static void Recurse(int currentNode, long depth, ref ClosureVariable closure)
      {
    -     if ((int)depth == closure.word.Length + (int)closure.maxEdits)
    +     if ((int)depth == closure.wordLength + (int)closure.maxEdits)
          {
              goto end;
          }
    
    @@
          var previousCharacter = depth > 0 ? closure.builder[depth - 1] : (char)0;
    
    -     var rowLength = (long)closure.word.Length + 1;
    +     var rowLength = (long)closure.wordLength + 1;
          var to = Math.Min(rowLength, depth + closure.maxEdits + 2);
          var previousRow = closure.matrix + depth * rowLength;
          var currentRow = previousRow + rowLength;
    
    @@
               if (targetCharacter == previousCharacter
                   && previousTargetCharacter == currentCharacter)
               {
    -              previousRowPreviousEntry = previousRow[i - closure.word.Length - 3];
    +              previousRowPreviousEntry = previousRow[i - closure.wordLength - 3];
               }
    
    @@
              if (nextNode < 0)
              {
                  nextNode = -nextNode;
    -             if ((int)depth >= closure.word.Length - (int)closure.maxEdits
    +             if ((int)depth >= closure.wordLength - (int)closure.maxEdits
                      && calculatedCost <= 0)
                  {
    

    Again, I don't know why this time is worse.

    658.2 ms

  25. Memory access reduction - part 11

    Reducing indirection

    previousCharacter is loaded outside of both loops and stored for later. The problem is that we have too many variables flying around and not enough registers. This causes previousCharacter to be stored on the stack. We don't always have transpositions in the Recurse method, so writing and loading to memory for something you might not access is not the best idea.

    Notice that the very similar address for closure.builder[depth - 1] (same LoC but depth has changed) is being indirectly computed at the start of the outer loop. If we use a variable to store that address, we can write directly to it as well as read (with an offset of -1) to grab previousCharacter on the fly. Even if this gets stored to the stack, the indirection is no worse that closure.builder. The previously eager load of previousCharacter (including a range check) now occurs much less often.

    diff Dawg22.cs Dawg23.cs

    @@
    - var previousCharacter = depth > 0 ? closure.builder[depth - 1] : (char)0;
    -
      var rowLength = (long)closure.wordLength + 1;
      var to = Math.Min(rowLength, depth + closure.maxEdits + 2);
      var previousRow = closure.matrix + depth * rowLength;
      var currentRow = previousRow + rowLength;
    + var builderPosition = closure.builder + depth;
      ++depth;
    
      for (var childEdge = firstChild; childEdge < lastChild; childEdge++)
      {
          var any = false;
          var currentCharacter = closure.EdgeCharacters[childEdge];
    -     closure.builder[depth - 1] = currentCharacter;
    +     *builderPosition = currentCharacter;
          var calculatedCost = (int)depth;
          var previousRowEntry = previousRow[from - 1];
          var tar
          getCharacter = (char)0;
    
    @@
    -             if (targetCharacter == previousCharacter
    -                 && previousTargetCharacter == currentCharacter)
    +             if (previousTargetCharacter == currentCharacter
    +                 && targetCharacter == *(builderPosition - 1))
                  {
                      previousRowPreviousEntry = previousRow[i - closure.wordLength - 3];
                  }
    

    591.2 ms

  26. Branch reduction - part 2

    Using a for loop results in the compiler creating two bound checks: one before the first iteration and one at the end for continuing. A do-while loop will omit the first of these.

    In the case of our outer loop, we can't guarantee that the first iteration will fit the bounds, but we can move that check to an earlier part of the method. Doing so saves us from performing a lot of the setup that happens before the loops.

    diff Dawg23.cs Dawg24.cs

    @@
      var fIndex = closure.FirstChildEdgeIndex + currentNode;
      var firstChild = *fIndex;
      var lastChild = *(fIndex + 1);
    + if (firstChild >= lastChild)
    + {
    +     goto end;
    + }
    
      var from = depth - closure.maxEdits;
      if (from < 0)
    
    @@
      var builderPosition = closure.builder + depth;
      ++depth;
    
    - for (var childEdge = firstChild; childEdge < lastChild; childEdge++)
    + do
      {
          var any = false;
    -     var currentCharacter = closure.EdgeCharacters[childEdge];
    +     var currentCharacter = closure.EdgeCharacters[firstChild];
          *builderPosition = currentCharacter;
          var calculatedCost = (int) depth;
    
    @@
              continue;
          }
    
    -     var nextNode = closure.EdgeToNodeIndex[childEdge];
    +     var nextNode = closure.EdgeToNodeIndex[firstChild];
          if (nextNode < 0)
          {
              nextNode = -nextNode;
    
    @@
          }
    
          Recurse(nextNode, depth, ref closure);
    - }
    + } while (++firstChild < lastChild);
    
      end:;
    }
    

    592.6 ms

  27. Branch reduction - part 3

    Math.Min uses 2 jumps to accomplish what should only need 1. By replacing the call to it with a single check and assignment, we can save some jumps and mov instructions. In the case of our outer loop, we can't guarantee that the first iteration will fit the bounds, but we can move that check to an earlier part of the method. Doing so saves us from performing a lot of the setup that happens before the loops.

    We can also replace an existing branch with some arithmetic operators. Taking the Min of a number and 0 can be accomplished with a few steps. First we take the value and perform a shift-arithmetic-right. If the number is negative we obtain a series of 1s, otherwise a series of 0s. Then we flip that series using a complement operator. Finally, a logical AND will turn that series into a mask, 0-ing the value if it was negative or preserving it otherwise.

    any can be turned from a bool to an int and avoid the branch that sets it. Previously we couldn't do much in this area because the register pressure caused any to be saved on the stack. With the help of changes through our many iterations of Dawg, we now have a free register available. By shifting the matrix values down by 1 (below 0 is good, instead of 0 or lower before), we can use the calculatedCost as a mask. Any valid cost will have a 1 in the first bit, making it a negative value. By OR-ing calculatedCost every iteration, if any of them are valid our flag will be made negative as well.

    After shifting the matrix entries I noticed that the calculatedCost = (int) depth; assignment was unnecessary. This is a repeat of Dawg15's issue. Any entries before our stripe are invalid anyways so the value was never used. Initializing with 0 is a safe bet because it will always be invalid.

    Since we're on the topic of stripes... (int) depth >= closure.wordLength - (int) closure.maxEdits can be converted to (int) to > closure.wordLength

    Now that our register pressure is a little more under control, I noticed that a lot of the stack activity was around adding results. Just by moving some code around and un-nesting the method calls and constructors, I managed to improve performance a bit.

    diff Dawg24.cs Dawg25.cs

    @@
      for (var i = 0; i < rowCount; i++)
      {
    
    -     matrix[i * rowLength] = i - (int)maxEdits;
    +     matrix[i * rowLength] = i - (int)maxEdits - 1;
    
          var stripeEnd = i + maxEdits + 1;
          if (stripeEnd <= word.Length)
          {
    -         matrix[i * rowLength + stripeEnd] = 0;
    +         matrix[i * rowLength + stripeEnd] = -1;
          }
      }
    
      for (var i = 0; i < rowLength; i++)
      {
    -     matrix[i] = i - (int)maxEdits;
    +     matrix[i] = i - (int)maxEdits - 1;
      }
    
    @@
      var from = depth - closure.maxEdits;
    - if (from < 0)
    - {
    -     from = 0;
    - }
    + from &= ~from >> 31;
    
      from++;
    
      var rowLength = (long)closure.wordLength + 1;
    - var to = Math.Min(rowLength, depth + closure.maxEdits + 2);
    + var to = depth + closure.maxEdits + 2;
    + if (rowLength < to)
    + {
    +     to = rowLength;
    + }
      var previousRow = closure.matrix + depth * rowLength;
      var currentRow = previousRow + rowLength;
    
    @@
      do
      {
    -     var any = false;
    +     var any = 0;
          var currentCharacter = closure.EdgeCharacters[firstChild];
          *builderPosition = currentCharacter;
    -     var calculatedCost = (int) depth;
    +     var calculatedCost = 0;
          var previousRowEntry = previousRow[from - 1];
          var targetCharacter = (char) 0;
    
    @@
                  calculatedCost++;
              }
    
    -         if (calculatedCost <= 0)
    -         {
    -             any = true;
    -         }
    +         any |= calculatedCost;
    
              currentRow[i] = calculatedCost;
          }
    
    -     if (!any)
    +     if (any >= 0)
          {
              continue;
          }
    
    @@
          if (nextNode < 0)
          {
              nextNode = -nextNode;
    -         if ((int) depth >= closure.wordLength - (int) closure.maxEdits
    -             && calculatedCost <= 0)
    +         if ((int) to > closure.wordLength
    +             && calculatedCost < 0)
              {
    -             closure.results.Add(new SuggestItem(new string(closure.builder, 0, (int) depth), 0));
    +             var str = new string(closure.builder, 0, (int) depth);
    +             var si = new SuggestItem(str, 0);
    +             closure.results.Add(si);
              }
          }
    

    Quite a few changes in this one, but they were only effective together. The pay-off was great though.

    498.8 ms

  28. Memory access reduction - part 12

    It looks like we're still accessing the stack in our inner loop. To alleviate this we need to reduce the number of variables we use in said loop. previousRowPreviousEntry is a good candidate because one of the branches doesn't use it at all. The other branch can be re-ordered to use that value before over-writing it, requiring one less saved variable and thus register.

    The calculation for closure.word[i-1] was also taking up a temporary variable. By shifting closure.word by 1 during the preparation phase, we can remove the -1 and remove the need for that temporary register.

    diff Dawg25.cs Dawg26.cs

    @@
    - public ClosureVariable(string word, int maxEdits, PointerGraph graph, int* matrix, char* builder, List<SuggestItem> results) : this()
    + public ClosureVariable(char* word, int wordLength, int maxEdits, PointerGraph graph, int* matrix, char* builder, List<SuggestItem> results) : this()
      {
          this.word = word;
    -     wordLength = word.Length;
    +     this.wordLength = wordLength;
    
    @@
    - public readonly string word;
    + public readonly char* word;
    
    @@
    - var closure = new ClosureVariable(word, (int)maxEdits, _graph, matrix, builder, results);
    + var wordCopy = stackalloc char[word.Length];
    + for (var i = 0; i < word.Length; i++)
    + {
    +     wordCopy[i] = word[i];
    + }
    +
    + var closure = new ClosureVariable(wordCopy - 1, word.Length, (int)maxEdits, _graph, matrix, builder, results);
      Recurse(_graph.RootNodeIndex, 0, ref closure);
      return results;
    
    @@
          for (var i = from; i < to; i++)
          {
              var previousTargetCharacter = targetCharacter;
    -         targetCharacter = closure.word[(int) i - 1];
    -
    -         var previousRowPreviousEntry = previousRowEntry;
    -         previousRowEntry = previousRow[i];
    -
    +         targetCharacter = closure.word[i];
              if (currentCharacter == targetCharacter)
              {
    -             calculatedCost = previousRowPreviousEntry;
    +             calculatedCost = previousRowEntry;
    +             previousRowEntry = previousRow[i];
              }
              else
              {
    -             if (previousRowEntry < calculatedCost)
    -             {
    -                 calculatedCost = previousRowEntry;
    -             }
    
                  if (previousTargetCharacter == currentCharacter
                      && targetCharacter == *(builderPosition - 1))
                  {
    -                 previousRowPreviousEntry = previousRow[i - closure.wordLength - 3];
    +                 previousRowEntry = previousRow[i - closure.wordLength - 3];
    +             }
    +
    +             if (previousRowEntry < calculatedCost)
    +             {
    +                 calculatedCost = previousRowEntry;
                  }
    
    -             if (previousRowPreviousEntry < calculatedCost)
    +             previousRowEntry = previousRow[i];
    +             if (previousRowEntry < calculatedCost)
                  {
    -                 calculatedCost = previousRowPreviousEntry;
    +                 calculatedCost = previousRowEntry;
                  }
    

    Another regression that doesn't add up with the assembly changes. This step is necessary for the next ones to pay off though, so wait and see.

    510.6 ms

  29. Algorithm refinement - part 5

    The best value an entry can possibly contain is equal to the value from the entry located in the previous row and previous column. This would represent a character match. If this entry is not valid, the current entry cannot possibly be valid itself. From this we introduce the concept of skipping.

    As an additional parameter in our recursion method we add int skip. This represents the number of steps to skip from the start of the stripe. Omitting these entries is less work to do and gains us a bit more performance.

    This finally gives us the opportunity to check "if ((int) depth < (int)closure.wordLength + (int)closure.maxEdits)" before calling Recurse.

    The changes made here also allowed for a small gain by re-ordering operations at the start of the method.

    diff Dawg26.cs Dawg27.cs

    @@
          public readonly char* word;
    -     public readonly int wordLength;
    +     public readonly long wordLength;
          public readonly long maxEdits;
          public readonly int* matrix;
    
    @@
      {
          var builderLength = word.Length + (int)maxEdits + 1;
          var builder = stackalloc char[builderLength];
    +     // It makes no sense but leaving this here makes it faster.
          for (var i = 0; i < builderLength; i++)
          {
              builder[i] = ' ';
    
    @@
          var closure = new ClosureVariable(wordCopy - 1, word.Length, (int)maxEdits, _graph, matrix, builder, results);
    -     Recurse(_graph.RootNodeIndex, 0, ref closure);
    +     Recurse(_graph.RootNodeIndex, 0, 0, ref closure);
          return results;
      }
    
    - private static void Recurse(int currentNode, long depth, ref ClosureVariable closure)
    + private static void Recurse(int currentNode, long depth, long skip, ref ClosureVariable closure)
      {
    -     if ((int)depth == closure.wordLength + (int)closure.maxEdits)
    +     var rowLength = closure.wordLength + 1;
    +     var to = depth + closure.maxEdits + 2;
    +     if (rowLength < to)
    +     {
    +         to = rowLength;
    +     }
    + 
    +     var from = depth - closure.maxEdits;
    +     from &= ~from >> 63;
    +     from += skip + 1;
    + 
    +     if (from >= to)
          {
              goto end;
          }
    
    +     var builderPosition = closure.builder + depth;
    +     var previousRow = closure.matrix + depth * rowLength;
    +     var currentRow = previousRow + rowLength;
    +     depth += 1;
    + 
          var fIndex = closure.FirstChildEdgeIndex + currentNode;
          var firstChild = *fIndex;
    
    @@
              goto end;
          }
    
    -     var from = depth - closure.maxEdits;
    -     from &= ~from >> 31;
    -
    -     from++;
    -
    -     var rowLength = (long)closure.wordLength + 1;
    -     var to = depth + closure.maxEdits + 2;
    -     if (rowLength < to)
    -     {
    -         to = rowLength;
    -     }
    -     var previousRow = closure.matrix + depth * rowLength;
    -     var currentRow = previousRow + rowLength;
    -     var builderPosition = closure.builder + depth;
    -     ++depth;
    -
          do
          {
    -         var any = 0;
              var currentCharacter = closure.EdgeCharacters[firstChild];
              *builderPosition = currentCharacter;
    -         var calculatedCost = 0;
              var previousRowEntry = previousRow[from - 1];
    +         var any = 0;
    +         var calculatedCost = 0;
              var targetCharacter = (char) 0;
              for (var i = from; i < to; i++)
    
    @@
              if (nextNode < 0)
              {
                  nextNode = -nextNode;
    -             if ((int) to > closure.wordLength
    -                 && calculatedCost < 0)
    +             if (calculatedCost < 0
    +                 && (int) to > (int)closure.wordLength)
                  {
                      var str = new string(closure.builder, 0, (int) depth);
    
    @@
                  }
              }
    
    -         Recurse(nextNode, depth, ref closure);
    +         if ((int) depth < (int)closure.wordLength + (int)closure.maxEdits)
    +         {
    +             var newSkip = skip;
    +             for (var i = from;; i++)
    +             {
    +                 if (currentRow[i] >= 0)
    +                 {
    +                     newSkip++;
    +                 }
    +                 else
    +                 {
    +                     break;
    +                 }
    +             }
    +
    +             Recurse(nextNode, depth, newSkip, ref closure);
    +         }
          } while (++firstChild < lastChild);
    
          end:;
    

    452.6 ms

  30. Memory access reduction - part 13

    Similarly to how we used "(int)to > (int)closure.wordLength" in Dawg25 to avoid loading closure.maxEdits and subtracting it, we can use "from" in our recursion bound check.

    I also tried adding a bound check before performing the work of calculating newSkip, but found it didn't help.

    diff Dawg27.cs Dawg28.cs

    @@
      var builderLength = word.Length + (int)maxEdits + 1;
      var builder = stackalloc char[builderLength];
    - // It makes no sense but leaving this here makes it faster.
      for (var i = 0; i < builderLength; i++)
      {
    -     builder[i] = ' ';
    +     builder[i] = (char)0;
      }
    
      builder++;
    
    @@
       var rowLength = word.Length + 1;
       var rowCount = rowLength + (int)maxEdits;
       var matrix = stackalloc int[rowLength * rowCount];
    +  var mp1 = (int) maxEdits + 1;
       for (var i = 0; i < rowCount; i++)
       {
    
    -      matrix[i * rowLength] = i - (int)maxEdits - 1;
    +      matrix[i * rowLength] = i - mp1;
    
    -      var stripeEnd = i + maxEdits + 1;
    +      var stripeEnd = i + mp1;
           if (stripeEnd <= word.Length)
           {
               matrix[i * rowLength + stripeEnd] = -1;
    
    @@
           goto end;
       }
    
    -  var builderPosition = closure.builder + depth;
       var previousRow = closure.matrix + depth * rowLength;
       var currentRow = previousRow + rowLength;
    +  var builderPosition = closure.builder + depth;
       depth += 1;
    
       var fIndex = closure.FirstChildEdgeIndex + currentNode;
    
    @@
    -      if ((int) depth < (int)closure.wordLength + (int)closure.maxEdits)
    +      if ((int)from < (int)closure.wordLength)
           { 
               var newSkip = skip;
    +          //if (from > 1)
    +          {
                   for (var i = from;; i++)
                   {
    
    @@
                           break;
                       }
                   }
    +          }
    
               Recurse(nextNode, depth, newSkip, ref closure);
           }
    

    445.9 ms

  31. Memory access reduction - part 14

    The function argument skip is only used in calculating "from" and "newSkip". In both cases "from" is added before the resulting value is used. Instead of passing "skip" and calculating "from", we pass "from" directly.

    The Max(0, from) operation becomes from += depth > closure.maxEdits ? 1 : 0; which is replaced with some fun arithmetic operations.

    The compiler wasn't making great decisions when it came to which variable to store on the stack. "depth" isn't accessed in the inner loop, and is only rarely access in the outer loop. I couldn't exactly control storing it on the stack, so instead I moved it to ClosureVariable and made the struct editable.

    Now that depth is stored in closure, all the information necessary for adding a result are in one place. I extracted it to another method with ClosureVariable as a parameter. This removed the final stack storage and makes our function much cleaner. I also marked the new function as NoInlining, to prevent it from sneaking back in and causing trouble.

    One final thing that you might notice is I've moved "depth" to the start of the ClosureVariable. If it has an offset from "rsi" there's one additional instruction in assembly

    Storing depth on the stack if I could control it might actually be preferrable. The fact that it is used at all levels of recursion means we need to decrement it when leaving the method. That's an expensive load-modify-store operation and it could be entirely avoided if each method invocation used its own stack to store it. But then again, there's no guarantee that it would help, benchmarking is always needed.

    diff Dawg28.cs Dawg29.cs

    @@
    using System;
      using System.Collections.Generic;
      using System.Diagnostics;
      using System.IO;
    + using System.Runtime.CompilerServices;
    + using System.Runtime.InteropServices;
      using System.Text;
    
    @@
      public uint WordCount { get; }
    
    - private readonly struct ClosureVariable
    + [StructLayout(LayoutKind.Explicit)]
    + private struct ClosureVariable
      {
          public ClosureVariable(char* word, int wordLength, int maxEdits, PointerGraph graph, int* matrix, char* builder, List<SuggestItem> results) : this()
    
    @@
              EdgeCharacters = graph.EdgeCharacters;
              FirstChildEdgeIndex = graph.FirstChildEdgeIndex;
              EdgeToNodeIndex = graph.EdgeToNodeIndex;
    +
    +         depth = 0;
          }
    
    -     public readonly char* word;
    +     [FieldOffset(0x00)]
    +     public long depth;
    +     [FieldOffset(0x08)]
          public readonly long wordLength;
    +     [FieldOffset(0x10)]
          public readonly long maxEdits;
    +     [FieldOffset(0x18)]
          public readonly int* matrix;
    -     public readonly char* builder;
    -     public readonly List<SuggestItem> results;
    -
    +     [FieldOffset(0x20)]
    +     public readonly char* word;
    +     [FieldOffset(0x28)]
          public readonly char* EdgeCharacters;
    +     [FieldOffset(0x30)]
          public readonly uint* FirstChildEdgeIndex;
    +     [FieldOffset(0x38)]
          public readonly int* EdgeToNodeIndex;
    +
    +     [FieldOffset(0x40)]
    +     public readonly char* builder;
    +     [FieldOffset(0x48)]
    +     public readonly List<SuggestItem> results;
    
    @@
          var closure = new ClosureVariable(wordCopy - 1, word.Length, (int)maxEdits, _graph, matrix, builder, results);
    -     Recurse(_graph.RootNodeIndex, 0, 0, ref closure);
    +     Recurse(_graph.RootNodeIndex, 1, ref closure);
          return results;
      }
    
    - private static void Recurse(int currentNode, long depth, long skip, ref ClosureVariable closure)
    + private static void Recurse(int currentNode, long from, ref ClosureVariable closure)
      {
    +     var depth = closure.depth;
          var rowLength = closure.wordLength + 1;
    +     from -= (closure.maxEdits - depth) >> 63;
          var to = depth + closure.maxEdits + 2;
          if (rowLength < to)
          {
              to = rowLength;
          }
    
    -     var from = depth - closure.maxEdits;
    -     from &= ~from >> 63;
    -     from += skip + 1;
    -
          if (from >= to)
          {
    
    @@
          var currentRow = previousRow + rowLength;
          var builderPosition = closure.builder + depth;
    -     depth += 1;
    +     closure.depth = depth + 1;
    
          var fIndex = closure.FirstChildEdgeIndex + currentNode;
          var firstChild = *fIndex;
          var lastChild = *(fIndex + 1);
          if (firstChild >= lastChild)
          {
    -         goto end;
    +         goto end2;
          }
    
          do
    
    @@
              var nextNode = closure.EdgeToNodeIndex[firstChild];
              if (nextNode < 0)
              {
    -             nextNode = -nextNode;
                  if (calculatedCost < 0
                      && (int)to > (int)closure.wordLength)
                  {
    -                 var str = new string(closure.builder, 0, (int)depth);
    -                 var si = new SuggestItem(str, 0);
    -                 closure.results.Add(si);
    +                 Add(ref closure);
    +                 nextNode = closure.EdgeToNodeIndex[firstChild];
                  }
    +             nextNode = -nextNode;
              }
    
              if ((int)from < (int)closure.wordLength)
              { 
    -             var newSkip = skip;
    -             //if (from > 1)
    -             {
    -                 for (var i = from;; i++)
    +             var newFrom = from;
    +             while (currentRow[newFrom] >= 0)
                  {
    -                     if (currentRow[i] >= 0)
    -                     {
    -                         newSkip++;
    -                     }
    -                     else
    -                     {
    -                         break;
    -                     }
    -                 }
    +                 newFrom += 1;
                  }
    
    -             Recurse(nextNode, depth, newSkip, ref closure);
    +             Recurse(nextNode, newFrom, ref closure);
              }
          } while (++firstChild < lastChild);
    
    +     end2:
    +     closure.depth -= 1;
          end: ;
      }
    
    + [MethodImpl(MethodImplOptions.NoInlining)]
    + private static void Add(ref ClosureVariable closure)
    + {
    +     var str = new string(closure.builder, 0, (int)closure.depth);
    +     var si = new SuggestItem(str, 0);
    +     closure.results.Add(si);
    + }
    +
    

    435.1 ms

Review of final version

Dawg.cs v3

[StructLayout(LayoutKind.Explicit)]
private struct ClosureVariable
{
    public ClosureVariable(char* word, int wordLength, int maxEdits, CompressedSparseRowGraph graph, int* matrix, char* builder, List<SuggestItem> results) : this()
    {
        this.word = word;
        this.wordLength = wordLength;
        this.maxEdits = maxEdits;
        this.matrix = matrix;
        this.builder = builder;
        this.results = results;

        EdgeCharacters = graph.EdgeCharacters;
        FirstChildEdgeIndex = graph.FirstChildEdgeIndex;
        EdgeToNodeIndex = graph.EdgeToNodeIndex;

        depth = 0;
    }

    [FieldOffset(0x00)]
    public long depth;
    [FieldOffset(0x08)]
    public readonly long wordLength;
    [FieldOffset(0x10)]
    public readonly long maxEdits;
    [FieldOffset(0x18)]
    public readonly int* matrix;
    [FieldOffset(0x20)]
    public readonly char* word;
    [FieldOffset(0x28)]
    public readonly char* EdgeCharacters;
    [FieldOffset(0x30)]
    public readonly uint* FirstChildEdgeIndex;
    [FieldOffset(0x38)]
    public readonly int* EdgeToNodeIndex;

    [FieldOffset(0x40)]
    public readonly char* builder;
    [FieldOffset(0x48)]
    public readonly List<SuggestItem> results;
}

public IEnumerable<SuggestItem> Lookup(string word, uint maxEdits)
{
    var builderLength = word.Length + (int)maxEdits + 1;
    var builder = stackalloc char[builderLength];
    for (var i = 0; i < builderLength; i++)
    {
        builder[i] = (char)0;
    }

    builder++;

    var results = new List<SuggestItem>();

    var rowLength = word.Length + 1;
    var rowCount = rowLength + (int)maxEdits;
    var matrix = stackalloc int[rowLength * rowCount];
    var mp1 = (int) maxEdits + 1;
    for (var i = 0; i < rowCount; i++)
    {

        matrix[i * rowLength] = i - mp1;

        var stripeEnd = i + mp1;
        if (stripeEnd <= word.Length)
        {
            matrix[i * rowLength + stripeEnd] = -1;
        }
    }

    for (var i = 0; i < rowLength; i++)
    {
        matrix[i] = i - (int)maxEdits - 1;
    }

    var wordCopy = stackalloc char[word.Length];
    for (var i = 0; i < word.Length; i++)
    {
        wordCopy[i] = word[i];
    }

    var closure = new ClosureVariable(wordCopy - 1, word.Length, (int)maxEdits, _graph, matrix, builder, results);
    Recurse(_graph.RootNodeIndex, 1, ref closure);
    return results;
}

private static void Recurse(int currentNode, long from, ref ClosureVariable closure)
{
    var depth = closure.depth;
    var rowLength = closure.wordLength + 1;
    from -= (closure.maxEdits - depth) >> 63;
    var to = depth + closure.maxEdits + 2;
    if (rowLength < to)
    {
        to = rowLength;
    }

    if (from >= to)
    {
        goto end;
    }

    var previousRow = closure.matrix + depth * rowLength;
    var currentRow = previousRow + rowLength;
    var builderPosition = closure.builder + depth;
    closure.depth = depth + 1;

    var fIndex = closure.FirstChildEdgeIndex + currentNode;
    var firstChild = *fIndex;
    var lastChild = *(fIndex + 1);
    if (firstChild >= lastChild)
    {
        goto end2;
    }

    do
    {
        var currentCharacter = closure.EdgeCharacters[firstChild];
        *builderPosition = currentCharacter;
        var previousRowEntry = previousRow[from - 1];
        var any = 0;
        var calculatedCost = 0;
        var targetCharacter = (char)0;
        for (var i = from; i < to; i++)
        {
            var previousTargetCharacter = targetCharacter;
            targetCharacter = closure.word[i];
            if (currentCharacter == targetCharacter)
            {
                calculatedCost = previousRowEntry;
                previousRowEntry = previousRow[i];
            }
            else
            {
                if (previousTargetCharacter == currentCharacter
                    && targetCharacter == *(builderPosition - 1))
                {
                    previousRowEntry = previousRow[i - closure.wordLength - 3];
                }

                if (previousRowEntry < calculatedCost)
                {
                    calculatedCost = previousRowEntry;
                }

                previousRowEntry = previousRow[i];
                if (previousRowEntry < calculatedCost)
                {
                    calculatedCost = previousRowEntry;
                }

                calculatedCost++;
            }

            any |= calculatedCost;

            currentRow[i] = calculatedCost;
        }

        if (any >= 0)
        {
            continue;
        }

        var nextNode = closure.EdgeToNodeIndex[firstChild];
        if (nextNode < 0)
        {
            if (calculatedCost < 0
                && (int)to > (int)closure.wordLength)
            {
                Add(ref closure);
                nextNode = closure.EdgeToNodeIndex[firstChild];
            }
            nextNode = -nextNode;
        }

        if ((int)from < (int)closure.wordLength)
        { 
            var newFrom = from;
            while (currentRow[newFrom] >= 0)
            {
                newFrom += 1;
            }

            Recurse(nextNode, newFrom, ref closure);
        }
    } while (++firstChild < lastChild);

    end2:
    closure.depth -= 1;
    end: ;
}

[MethodImpl(MethodImplOptions.NoInlining)]
private static void Add(ref ClosureVariable closure)
{
    var str = new string(closure.builder, 0, (int)closure.depth);
    var si = new SuggestItem(str, 0);
    closure.results.Add(si);
}

Something I wish had worked

One really clean change that can be made to the final version is to pass "currentRow" as an argument. If we use it as the previousRow for the next level, we can just add rowLength to get the new currentRow. The first call simply uses the pointer to matrix, removing the need to store it in ClosureVariable.

So we've removed a memory read for closure.matrix and removed a multiplication for calculating previousRow. Unfortunately testing revealed it was not an improvement and regressed performance by quite a bit. Setting up the call stack must be more expensive than those operations.

Another cool use of arithmetic

We already iterate through currentRow in the inner loop, it would be nice if we could calculate newFrom there instead of iterating a second time. Thankfully, there is a way. Sadly, it is slower.

We start with our newSkip variable being declared before the innerLoop and initialized to 0. Then during every iteration of the inner loop, we check "any" and update newSkip.

If any is negative, we can add 1 to newSkip. We want 1 when the value is bad, 0 otherwise. Performing a shift-arithmetic-right to any, which will yield -1 when the value is good and 0 otherwise. Using the complement operator turns that 0 into -1 and vice versa. Finally, negating the result gets us our desired value which we can add to newSkip (or just subtract without negating).

In retrospect it's easy to see why this would be a step backways for performance. Most of the calls to Recurse have a low depth and therefore can't even provide a value for skip. The dependency chain for all the arithmetic is also quite long. Compare this to the second iteration quitting after the first load from currentRow and we quickly lose any potential gains.

Inner loop skip calculation

var anyGoodEntries = any >> 31; // 1s when true, 0s otherwise
var continuousBadEntries = ~anyGoodEntries; // 1s == -1 when true; 0s == 0 otherwise
skip2 -= continuousBadEntries; // Adds 1 if the entry can be skipped when calculating the next row.