Thursday, August 23, 2007

Predictive text input (T9) algorithm in Java

I have always been fascinated by the way mobile phones complete words for you when you are creating an SMS message. So I decided to try implement that algorithm in Java.

We know that each number key on the phone is mapped to a set of 3-4 letters. 1 is (a,b,c), 2 is (d,e,f) etc. I call them "buckets". Given a sequence of key strokes, a simple algorithm is to generate all combinations of letters from the corresponding buckets and then suggest only those combinations that exist in a dictionary of words.

For example:
Sequence 63 can be any of md, me, mf, nd, ne, nf, od, oe, of. The dictionary will have only "me" and "of". You can show them in alphabetical order, or in the order of most frequently used, last used etc.

Here are the 2 main functions in this approach. (You can find the full program here):
getMatchingWords returns a list of all alternatives for a given word. It also requires a dictionary of words and the bucket structure of the keypad.

getBucketWords is the recursive function where all the action happens. Assume the word has 4 letters (so 4 buckets) of which the first 2 you have somehow filled. Now you have to fill the 3rd letter (from all characters in the 3rd bucket). The way to proceed is to pick each letter in the current bucket, append it to the so-far generated string, and call yourself recursively. Here curStr and result are used as accumulator variables.

static Set getMatchingWords(String word, Set dict,
List buckets){
Set matches = new HashSet();
List wordBuckets = getWordBuckets(word, buckets);
getBucketWords(wordBuckets, 0, "", matches);
matches.retainAll(dict);
return matches;
}

static Set getBucketWords(List wordBuckets, int bucketIndex,
String curStr, Set result){
if(bucketIndex == wordBuckets.size()){
result.add(curStr);
}else{
String curBucket = (String)wordBuckets.get(bucketIndex);
for(int i=0;<curBucket.length();i++){
String newStr = new String(curStr);
newStr = newStr + curBucket.charAt(i);
getBucketWords(wordBuckets, bucketIndex+1, newStr,
result);
}
}
return result;
}

The biggest failing of this method is all the invalid words that are generated only to be discarded. For a word of length n, if you have on average m options for each letter, then the number of words generated is m to the power n. This is the time complexity of this program.

But consider if the dictionary were already sorted into different buckets according to the key sequences. This is eminently possible because mobile phones can come loaded with such a data structure to facilitate fast lookup.

In that case, for a given key sequence, we only need to look up the data structure for all matching words and return 1 or more. Of course the dictionary pre-processing would have been done offline.

Here's a function which implements this improved algorithm. That full program is here.

public List getMatches(String word){
List matches = new ArrayList();
String key = getKeyForWord(word);
if(key != null){
if(bucketedWords.containsKey(key)){
matches = (List)bucketedWords.get(key);
}
}
return matches;
}

The dictionary is pre-processed in 2 steps: First, each character in the keypad is mapped to its bucket number. Then, for each word in the dictionary, a bucket key is obtained by concatenating the bucket number of each character in the word. The word is added to the Map entry corresponding to that key.

To do a lookup, the bucket key for a given key sequence is derived as above, and all entries in the Map matching that bucket key are returned.

Each lookup is O(number of characters in the letter) to generate the key. To generate the bucketed dictionary itself, you need to examine every character in the dictionary once. If the dictionary has n words of average length m, it's O(m * n).

5 comments:

dawninghu said...

Great article.
Can you extend the dict of normal fiction text, such as several words
per line, including delimiters?
and add exact matches according to popularity, and prefix matches?

Greg S said...

Can you share your dictionary file for the program ("c:/pramod/myworkspace/T9/morewords.txt")

Greg
gsamva@gmail.com

David said...

Great Article Pramod. Can you please share the morewords.txt ?

Regards,
David : austinrocks007@gmail.com

vignesh said...

i can understand the program but how to run it and can u send the dictionary file for the program
morewords.txt
vikky.vignesh425@gmail.com

Ashish Pujari said...

Can you share your dictionary file for the program ("c:/pramod/myworkspace/T9/morewords.txt")


pujariaa@yahoo.com