JB's BLOG

Directed Acyclic Word Graphs

Part 1 - The Basics

I'd like to share some code I had a lot of fun writing. To get there, I'm going to describe some algorithms and data structures. Without this background, the code won't make much sense.

The English language

There are a lot of words in the English language. A lot. The second edition of the Oxford English Dictionary contains 171,476 words in current use. Other corpora have wildly different counts (69K, 466K, 83K), but any way you look at it, you're going to be storing a lot of strings in memory.

The size of your dictionary

The 83K-word list I mentioned includes frequency information: a 1,283 KB file with each line storing a word/count pair. 1 MB doesn't sound like much, but what about when you want to load and access that information in your application?
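
As a rough sketch of what loading that data could look like, here are a few lines of C#. The file name and the exact line format (word and count separated by whitespace) are assumptions on my part, not details from the list itself.

    // Sketch: load a word/count frequency list into a dictionary.
    // "word_frequencies.txt" and the whitespace-separated line format are assumed.
    using System;
    using System.Collections.Generic;
    using System.IO;

    var frequencies = new Dictionary<string, int>();
    foreach (var line in File.ReadLines("word_frequencies.txt"))
    {
        var parts = line.Split(' ', StringSplitOptions.RemoveEmptyEntries);
        if (parts.Length == 2 && int.TryParse(parts[1], out var count))
            frequencies[parts[0]] = count;        // keep the word and its frequency
    }
    Console.WriteLine($"{frequencies.Count} words loaded");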

There are many different ways of storing lists of strings. They all share the same O(n) iteration time, with n being the size of the dictionary. Where they do differ is in the size of their representation and the time taken to determine whether a string belongs to the language.

  1. The most obvious way is a plain list. Searching an unsorted list takes O(n), with n being the number of words; keeping the list sorted lets you binary search it in O(log n).
  2. We could use a hashset. Lookup is O(m), where m is the length of the word; hashing isn't free after all.
  3. We might consider splitting up strings into base + suffix. Think of 'run'. It's a word on its own but also extends to 'runs', 'running', 'runner', 'runners', etc. If those endings are shared with any other words, they could share a suffix list. This is a suffix tree and it can save a substantial amount of storage space. A suffix tree provides O(m) lookup on the size of the word.
  4. If we start at the beginning of the word instead of the end, we end up with a trie. Lookup time is again O(m).
  5. By combining the suffix shrinking of a suffix tree with the prefix graph of a trie, you can create a Directed Acyclic Word Graph (DAWG). It is an instance of a finite state automaton and is sometimes referred to as such. This data structure is optimized to contain the smallest number of edges and nodes, fewer than a trie or suffix tree on its own could achieve. As with the other graph-based methods, lookup is O(m); a short lookup sketch follows this list.
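
To make that O(m) lookup concrete, here is a minimal C# sketch of how a trie or DAWG answers a membership query by walking one edge per character. The class and member names are my own illustration, not the implementation from the upcoming article.

    // Minimal sketch: O(m) membership lookup over a trie/DAWG-style node structure.
    using System.Collections.Generic;

    class Node
    {
        public bool IsTerminal;                       // a valid word may end at this node
        public Dictionary<char, Node> Edges = new();  // outgoing edges, keyed by letter
    }

    static class Lookup
    {
        // Follow one edge per character; fail as soon as an edge is missing.
        public static bool Contains(Node root, string word)
        {
            var current = root;
            foreach (var c in word)
            {
                if (!current.Edges.TryGetValue(c, out current))
                    return false;                     // no edge labeled with this letter
            }
            return current.IsTerminal;                // the walk must end on a terminal node
        }
    }

Whether the nodes form a plain trie or a fully minimized DAWG, the walk is the same; the DAWG simply shares nodes so that far fewer of them are needed.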

Before we continue, I'll point out that with a suffix tree or a DAWG, additional state is required if you want to uniquely map a string to another value. Without this extra information, there are many possible ways to reach a single end-point in the graph and no way to distinguish them. One method to do this with a DAWG is described in Applications of Finite Automata Representing Large Vocabularies by Lucchesi and Kowaltowski. For now we'll skip over it, but it will come up in the next article.

Constructing a DAWG

To build our DAWG, we're going to use an algorithm described in Incremental Construction of Minimal Acyclic Finite-State Automata by Daciuk, Mihov, Watson, and Watson. For this we need a sorted list of words as input. We will construct the graph by adding one word at a time. The letters of each word will be represented by edges, while nodes will represent possible branching points. A node can also be marked as terminal, meaning you could end there and have a valid word. What we're building is essentially a minimized trie. For another explanation, see Compressing dictionaries with a DAWG by Steve Hanov.

The algorithm consists of two parts: insertion and minimization. As we loop over the collection of words, we keep track of our position in the graph, as well as the previous word processed. For each new word, we begin by back-tracking to the node representing the common prefix it shares with the previous word. As we do so, we minimize the right-hand part of the tree, the suffixes if you will. To minimize, at each node we pass while back-tracking, we look to the previously constructed parts of the tree and see if we can find an identical node. Nodes are considered identical if they have the same terminal state, the same outgoing edges, and, for each of those edges, identical children. Following the minimization to the common prefix, we insert the next word by adding edges and nodes from the current node. The last node is marked as terminal since the word ends there. After we've finished adding all the words, we minimize one last time, back to the root node.
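
Putting the two parts together, here is a compact C# sketch of that loop: a register of minimized nodes, back-tracking to the common prefix, and insertion of the remaining suffix. Names like DawgBuilder, Signature, and the register/path fields are my own; treat this as a sketch of the idea, not the implementation the next article will walk through.

    // Sketch of the incremental construction (Daciuk et al.) described above.
    // Type and member names are illustrative, not the article's implementation.
    using System.Collections.Generic;
    using System.Linq;

    class Node
    {
        private static int _nextId;
        public readonly int Id = _nextId++;            // stable identity once registered
        public bool IsTerminal;
        public Dictionary<char, Node> Edges = new();

        // Two nodes are interchangeable when they share the same terminal state and,
        // for every outgoing edge, the same label leading to the same (minimized) child.
        public string Signature() =>
            (IsTerminal ? "1" : "0") +
            string.Concat(Edges.OrderBy(e => e.Key).Select(e => $"_{e.Key}{e.Value.Id}"));
    }

    class DawgBuilder
    {
        private readonly Node _root = new();
        private readonly Dictionary<string, Node> _register = new();    // minimized nodes by signature
        private readonly List<(Node Parent, char Label, Node Child)> _path = new(); // not yet minimized
        private string _previous = "";

        public Node Build(IEnumerable<string> sortedWords)
        {
            foreach (var word in sortedWords)
            {
                int prefix = CommonPrefixLength(word, _previous);
                Minimize(prefix);          // merge the previous word's suffix nodes
                AddSuffix(word, prefix);   // then branch off with the new word's suffix
                _previous = word;
            }
            Minimize(0);                   // finally, minimize all the way back to the root
            return _root;
        }

        private static int CommonPrefixLength(string a, string b)
        {
            int i = 0;
            while (i < a.Length && i < b.Length && a[i] == b[i]) i++;
            return i;
        }

        private void AddSuffix(string word, int start)
        {
            var node = start == 0 ? _root : _path[start - 1].Child;
            for (int i = start; i < word.Length; i++)
            {
                var next = new Node();
                node.Edges[word[i]] = next;            // one edge per remaining letter
                _path.Add((node, word[i], next));
                node = next;
            }
            node.IsTerminal = true;                    // the word ends here
        }

        private void Minimize(int downTo)
        {
            // Walk back from the deepest unminimized node toward the common prefix,
            // re-pointing each edge to an identical registered node when one exists.
            for (int i = _path.Count - 1; i >= downTo; i--)
            {
                var (parent, label, child) = _path[i];
                if (_register.TryGetValue(child.Signature(), out var existing))
                    parent.Edges[label] = existing;    // shrink: reuse the identical node
                else
                    _register[child.Signature()] = child;
                _path.RemoveAt(i);
            }
        }
    }

Because children are always minimized before their parents, two equivalent subtrees end up as the very same object, which is why Signature can identify children by their Id.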

Stepping through construction

Here is an example DAWG for the following words:

  1. cat
  2. cats
  3. fact
  4. facts
  5. facet
  6. facets

We build it up as follows:

  1. [Diagram: node 0]
    We start with a root node.
  2. [Diagram: nodes 0-1, edge label 'c']
    From our current node, we add an edge to a new one. The edge is labeled with the character.
  3. [Diagram: nodes 0-2, edge labels 'ca']
    We previously added 'c'; now we add 'a'.
  4. [Diagram: nodes 0-3, edge labels 'cat']
    When we reach the end of a word, we mark the node as terminal. In this diagram I use a nested circle.
  5. [Diagram: nodes 0-4, edge labels 'cats']
    We are done processing 'cat', so we move on to the next word, 'cats'. The common prefix size is equal to our current depth, so we don't have to minimize/shrink any nodes. We continue from the last one by adding another edge for 's' and marking the new node as terminal.
  6. [Diagram: nodes 0-8, edge labels 'catsfact']
    Now we add 'fact'. There is no common prefix, so we return to the root. Since there are no branches yet, no changes are made while minimizing the existing nodes.
  7. [Diagram: nodes 0-9, edge labels 'catsfacts']
    'facts' follows much like 'cats' did after 'cat'.
  8. [Diagram: nodes 0-8, edge labels 'catsfacts']
    We now add 'facet'. The common prefix with the previous word is 'fac', so we will minimize nodes while traveling backwards from our current one to that point. On our first step back, we arrive at a node whose only edge is labeled 's' and points to a terminal node with no further edges. We shrink the graph by transferring our edge to an identical existing node.
  9. [Diagram: nodes 0-7, edge labels 'catsfact']
    On our next step back we arrive at a node whose only edge is labeled 't' and points to a terminal node, which has an edge labeled 's' pointing to a terminal node with no edges. An identical node already exists, so we move our edge to point there.
  10. [Diagram: nodes 0-9, edge labels 'catsfactet']
    Having made our way back to the node reached from the root by 'f'-'a'-'c', we can add our new word 'facet'.
  11. [Diagram: nodes 0-10, edge labels 'catsfactets']
    As we've done twice before, when the previous word is itself the common prefix, we simply extend from the current node.
  12. [Diagram: nodes 0-9, edge labels 'catsfactets']
    Now that we are done adding words, we minimize all the way back to the root. On our first step back we again notice an edge pointing to a node identical to an already existing one. We shrink the graph by re-pointing that edge.
  13. [Diagram: nodes 0-8, edge labels 'catsfactet']
    Again, we re-point our edge when finding an existing identical node.
  14. [Diagram: nodes 0-7, edge labels 'catsfacte']
    And one final time.

Looking back at the last diagram, I'll point out what would have happened if we had reduced past node 7 when going from 'facts' to 'facet'. The 'c' edge from node 6 to node 7 would instead point from 6 to 2. If we then tried to add 'facet', we would be at an impasse. From the root, 'fac' would get us to node 2, and from there we would add our 'e' edge. But 'ca' also reaches node 2, so any suffix of 'fac' would also become a suffix of 'ca'. There is no such word as 'caet' in our dictionary, and so our DAWG would be broken.
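
To see the finished graph in action, here is a usage sketch that reuses the hypothetical DawgBuilder and Node types from the earlier example. The words are passed in lexicographically sorted order, which the construction algorithm expects; after building, 'facet' is found and 'caet' is correctly rejected, because 'fac' and 'ca' lead to different nodes in the finished DAWG.

    // Usage sketch, reusing the DawgBuilder/Node types sketched earlier.
    using System;

    var words = new[] { "cat", "cats", "facet", "facets", "fact", "facts" }; // lexicographic order
    var root = new DawgBuilder().Build(words);

    Console.WriteLine(Contains(root, "facet"));  // True
    Console.WriteLine(Contains(root, "caet"));   // False: 'ca' and 'fac' end at different nodes

    // The same O(m) walk as in the lookup sketch earlier.
    static bool Contains(Node node, string word)
    {
        foreach (var c in word)
            if (!node.Edges.TryGetValue(c, out node))
                return false;
        return node.IsTerminal;
    }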

Next steps

In the next article I'll go over the code for a basic implementation of a DAWG in C#. With what we've learned and observed here, we know it needs an insertion method, a minimization method, and a data structure for the nodes.