There are a lot of words in the English language. A lot. The Second edition of the Oxford English Dictionary contains 171,476 words in current use. Other corpora have wildly different counts (69K, 466K, 83K), but any way you look at it, you're going to be storing a lot of strings in memory.
The list I mentioned with 83K words includes frequency information in a 1,283 KB file, with each line storing a word/count pair. 1 MB doesn't sound like much, but what about when you want to use and access that information in your application?
There are many different ways of storing lists of strings. They all share the same iteration time O(n) on the size of the dictionary. Where they do differ is in the size of their representation and the time taken to determine if a string belongs to the language.
- The most obvious way is a list. If sorted, you can search in O(n) with n being the number of words.
- We could use a hashset. O(m) lookup where m is the size of the word. Hashing isn't free after all.
- We might consider splitting up strings into base + suffix. Think of 'run'. It's a word on its own but also extends to 'runs', 'running', 'runner', 'runners', etc. If those endings are shared with any other words, they could share a suffix list. This is a suffix tree and it can save a substantial amount of storage space. A suffix tree provides O(m) lookup on the size of the word.
- If we start at the beginning of the word instead of the end, we end up with a trie. Lookup time is again O(m)
- By combining the suffix shrinking of a suffix tree with the prefix graph of a trie, you can create a Directed Acyclic Word Graph. It is an instance of a Finite State Automata and is sometimes referred to as such. This data structure is optimized to contain the smallest amount of edges and nodes, which will be less than a trie or suffix tree on their own could achieve. As with the other graph based methods, lookup is O(m)
Before we continue, I'll point out that with a suffix tree or a DAWG, additional state is required if you want to uniquely map a string to another value. Without this extra information, there are many possible ways to reach a single end-point in the graph and no way to distinguish them. One method to do this with a DAWG is described in Applications of Finite Automata Representing Large Vocabularies by Luccesi and Kowaltowski . For now we'll skip over it, but it will come up in the next article.
To build our DAWG, we're going to use an algorithm described in Incremental Construction of Minimal Acyclic Finite-State Automata by Daciuk, Mihov, Watson, and Watson . For this we need as input a sorted list of words. We will construct a graph by adding one word at a time. The letters of the word will be represented by edges, while nodes will represent possible branching points. A node can also be denoted as terminal, which represents that you could end there and have a valid word. What we're building is essentially a minimized trie. For another explanation see Compressing dictionaries with a DAWG by Steve Hanov .
The algorithm consists of two parts, insertion and minimization. As we loop over the collection of words, we keep track of our position on the graph, as well as the previous word processed. For each new word, we begin by back-tracking to the node representing the common prefix of it and the previous word. As we do so, we minimize the right-hand part of the tree, the suffixes if you will. To minimize, from each node as we back-track, we look to the previously constructed parts of the tree and see if we find an identical one. Nodes are considered identical if they have the same terminal state, same outgoing edges, and for each of those edges, the children of both nodes are identical. Following the minimization to the common prefix, we insert the next word by adding edges and nodes from the current one. The last node is marked as terminal since the word ends there. After we've finished adding all the words, we'll minimize back to the root node.
Here is an example DAWG for the following words:
From the last slide, I'll point out what would have happened if we reduced past node 7 when going from 'facts' to 'facet'. The 'c' edge from 6 to 7 would instead be going from 6 to 2. If we then tried to add 'facet', we would be at an impasse. From root, 'fac' gets us to node 2, and from there, we would add our 'e' edge. 'ca' also gets to node 2, therefore any suffix to fac would also need to be a suffix to 'ca'. There is no such word as caet in our dictionary and so our DAWG would be broken.
In the next article I'll go over the code for a basic implementation of a DAWG in C#. With what we've learned and observed here, we know it needs an insertion method, a minimization method, and a data structure for the nodes.