JB's BLOG

Directed Acyclic Word Graphs

Part 3 - Reducing DAWG memory footprint

Preamble

Years ago I had found some code that already included a large part of what I'm going to introduce in this article. I can't find it anymore, but I do know the author intended for it to be public domain and didn't include a license (otherwise it would at least have been a comment in my code). I'm not even sure what language it was written in... For an overview of some compression techniques you can also look at Comparisons of Efficient Implementations for DAWG by Fuketa, Morita, and Aoe. It's not published in the greatest journal and it has a fair amount of grammatical errors, but they do a good job covering some different options (and English isn't their first language so be nice).

The examples on this page will expand on the DAWG graph built in the previous article.

012345678catsfactet

Flattening Nodes

To begin, we'll place our nodes in an array with no particular order and see where that leads.

. . .012345cfats

Flattening Edges

Edges are essentially a char to Node-reference pair. Since our nodes are now stored in an array, this becomes a char to index pair. Let's store all these node-references in an array.

. . .012345cfatsa152346. . .

Furthermore, instead of storing the edges, a node can store the start and end index of it's edges in this new array. I'll even skip a step and point out that since the mappings from node to edge are ordered, we only need to store the start index, using that of the next node to determine the end. This will require the node-reference array to also contain the edge char value. Instead of storing them together, I'll create a new array, parallel to the node-reference one.

. . .012345152346. . .cfatsa. . .

You may have noticed how Node 4 and node 5 have the same first-edge index. That's because Node 4 doesn't have any edges leading from it.

The first-edge pointers in this diagram can be further reduced to another array, this one parallel to the nodes.

012345. . .023455. . .152346. . .cfatsa. . .

In the implementation, I'll make this array larger than the node array by one element. This way, when we look to the next node to find our end-index, we don't have to worry about bound checking. Therefore, the value of this last element will be the length of the edge array, ensuring that we can browse up to and including the last edge.

Ordering Nodes

You may have noticed that the node array is still holding on to another piece of information, The terminal state. Instead of placing the nodes into the array with no regard for their order, let's place the terminal ones at the beginning.

340125. . .011345. . .415236. . .scfata. . .

Now, instead of needing to store if each node is terminal or not, we can just remember how many there are.

For our example, that would be 2. So any edge with index < 2 is terminal.

Ordering Edges

Because of our construction method we get this for free, but it's important to note that we should order each node's edges alphabetically. When traversing the graph, we can now use a binary search among a node's edges. Or, if iterrating through all the edges looking for one in particular, we can exit early once we arrive to a letter coming after our target.

Code for the array-DAWG

First off, I'll start by adding the following to the DawgNode class. The function travels down the graph in a depth-first way, assigning an index position for our Node ordering, as well as counting the reachable terminal nodes for our perfect hash method. It also creates a flattened list of the nodes.

public int IndexOffsetByTerminalCount { get; private set; } = int.MinValue;

public int ReachableTerminalNodes { get; private set; }

public (int edges, int terminals) Traversal(ref int low, ref int high, ICollection<DawgNode> nodes)
{
    if (IndexOffsetByTerminalCount != int.MinValue)
    {
        return (0, ReachableTerminalNodes);
    }

    if (IsTerminal)
    {
        IndexOffsetByTerminalCount = --low;
    }
    else
    {
        IndexOffsetByTerminalCount = ++high;
    }

    var totalEdges = 0;

    nodes.Add(this);

    foreach (var child in Children)
    {
        var (childEdges, childTerminals) = child.Value.Traversal(ref low, ref high, nodes);

        //Add 1 to represent the edge to the child itself.
        totalEdges += childEdges + 1;
        ReachableTerminalNodes += childTerminals;
    }

    ReachableTerminalNodes += IsTerminal ? 1 : 0;
    return (totalEdges, ReachableTerminalNodes);
}

Next, we add a Dawg class which consumes the graph and this information, building the arrays and providing some traversal methods.

internal sealed class Dawg
{
    private readonly int _terminalNodeIndex;
    private readonly int _rootNodeIndex;
    private readonly int[] _firstChildEdgeIndex;

    private readonly int[] _edgesToNodeIndex;
    private readonly char[] _edgeCharacter;

    private readonly ushort[] _reachableTerminalNodes;

    private readonly long[] _wordCount;
        
    public Dawg(DawgBuilder builder)
    {
        var root = builder.Finish();
        var allNodes = new List<DawgNode>();
        int terminalNodeCount = 0, nonTerminalNodeCount = -1;
        var (edges, _) = root.Traversal(ref terminalNodeCount, ref nonTerminalNodeCount, allNodes);

        _edgesToNodeIndex = new int[edges];
        _edgeCharacter = new char[edges];

        _firstChildEdgeIndex = new int[allNodes.Count + 1];
        _firstChildEdgeIndex[_firstChildEdgeIndex.Length - 1] = _edgesToNodeIndex.Length;

        _rootNodeIndex = root.IndexOffsetByTerminalCount - terminalNodeCount;

        _terminalNodeIndex = -terminalNodeCount;

        _reachableTerminalNodes = new ushort[allNodes.Count];

        var orderedNodes = new DawgNode[allNodes.Count];
        foreach (var node in allNodes)
        {
            orderedNodes[node.IndexOffsetByTerminalCount - terminalNodeCount] = node;
        }

        var edgeIndex = 0;
        foreach (var node in orderedNodes)
        {
            var realId = node.IndexOffsetByTerminalCount - terminalNodeCount;
            _firstChildEdgeIndex[realId] = edgeIndex;
            _reachableTerminalNodes[realId] = (ushort)node.ReachableTerminalNodes;
            foreach (var child in node.Children)
            {
                _edgesToNodeIndex[edgeIndex] = child.Value.IndexOffsetByTerminalCount - terminalNodeCount;
                _edgeCharacter[edgeIndex] = child.Key;
                edgeIndex++;
            }
        }

        _wordCount = builder.Counting.ToArray();
    }
        
    public int GetIndex(string word)
    {
        var index = 0;
        var currentNode = _rootNodeIndex;
        foreach (var target in word)
        {
            var firstChildIndex = _firstChildEdgeIndex[currentNode];
            var lastChildIndex = _firstChildEdgeIndex[currentNode + 1];
            currentNode = -1;

            for (var i = firstChildIndex; i < lastChildIndex; i++)
            {
                var currentCharacter = _edgeCharacter[i];
                var nextNode = _edgesToNodeIndex[i];
                if (currentCharacter < target)
                {
                    index += _reachableTerminalNodes[nextNode];
                    continue;
                }

                // Because the entries are sorted, this is either equal or the word is not an entry in our DAWG.
                if (currentCharacter != target)
                {
                    return -1;
                }

                currentNode = nextNode;
                break;
            }

            // This is reached if all edges have characters < target.
            if (currentNode == -1)
            {
                return -1;
            }

            if (currentNode < _terminalNodeIndex)
            {
                index++;
            }
        }

        // Must end on a terminal currentNode.
        if (currentNode >= _terminalNodeIndex)
        {
            return -1;
        }

        // Because we want 0-indexed
        return index - 1;
    }

    public string GetWord(int index)
    {
        if (index > _wordCount.Length)
        {
            return null;
        }

        // Because we are 0-indexed
        index++;
        var currentNode = _rootNodeIndex;
        var output = new StringBuilder();
        while (index > 0)
        {
            var firstChildIndex = _firstChildEdgeIndex[currentNode];
            var lastChildIndex = _firstChildEdgeIndex[currentNode + 1];
            for (var i = firstChildIndex; i < lastChildIndex; i++)
            {
                var nextNode = _edgesToNodeIndex[i];
                var nextNumber = _reachableTerminalNodes[nextNode];
                if (nextNumber < index)
                {
                    index -= nextNumber;
                    continue;
                }

                currentNode = nextNode;
                if (currentNode < _terminalNodeIndex)
                {
                    index--;
                }

                var currentCharacter = _edgeCharacter[i];
                output.Append(currentCharacter);
                break;
            }
        }

        if (index != 0)
        {
            return null;
        }

        return output.ToString();
    }
}