Sunday, May 26, 2013

Spelling Corrector

Language: C/C++
Solution: Dynamic programming

Description:

The application may represent a solution for a search engine which suggests a possible correction for an input of English words.
The correction is done using an English dictionary which consists of 8000 words, each line containing a word and the frequency of that word in use.

Terminology used:

 1.  Editing distance
     - The number of elementary operations that has to be applied on the first string in order to obtain a new string which exists in the dictionary (Elementary operations: insert a character anywhere in the string, delete any character from the string and replace a character by any other character).

 2.  Fixing one word
      - Done using the dictionary and the editing distance
      - The correction is the word which has the minimum editing distance comparing to the word which is being fixed
      - For two words with the same editing distance the application will pick the one with the maximum frequency

 3.  Fixing a string
      - The program first removes all spaces from the string keeping the order of characters
      - Then it inserts spaces in every possible combination and computes the sum of editing distances (ted) and the sum of frequencies for all the corrected words (tf)
      - Picks the correction based on the "ted" and "tf" numbers considering the following rules
                                 - minimum "ted"
                                 - for equal "ted" pick the correction with the minimum number of word
                                 - for equal "ted" and equal number of words pick the correction with the maximum "tf"

Solution:

I obtained substrings of the initial string by taking away the first character one by one. Then, I went through the entire dictionary for each substring and found the correction based on the "Fixing one word".
To keep track of the entire information I have found, I use a matrix of structures each storing an integer, a string and a number for frequency. The information from the dictionary is stored in a map variable.
From one iteration to another I keep an auxiliary matrix "previous" because two consecutive words in dictionary can  have a couple of repeating characters and this will help me to skip the already calculated areas of the matrix.
To get the optimal correction, I used a dynamic programming algorithm by going through the matrix diagonal by diagonal and updating the values (using a Memoization  technique).



No comments:

Post a Comment