Language:
C/C++
Solution:
Dynamic programming
Description:
The
application may represent a solution for a search engine which suggests a
possible correction for an input of English words.
The
correction is done using an English dictionary which consists of 8000 words,
each line containing a word and the frequency of that word in use.
Terminology
used:
1. Editing distance
- The number of elementary operations that has to be applied on the first string in order to obtain a new string which exists in the dictionary (Elementary operations: insert a character anywhere in the string, delete any character from the string and replace a character by any other character).
2. Fixing one word
- Done using the dictionary and the editing distance
- The correction is the word which has the minimum editing distance comparing to the word which is being fixed
- For two words with the same editing distance the application will pick the one with the maximum frequency
- Done using the dictionary and the editing distance
- The correction is the word which has the minimum editing distance comparing to the word which is being fixed
- For two words with the same editing distance the application will pick the one with the maximum frequency
3. Fixing a string
- The program first removes all
spaces from the string keeping the order of characters
- Then it inserts spaces in every
possible combination and computes the sum of editing distances (ted) and the
sum of frequencies for all the corrected words (tf)
- Picks the correction based on
the "ted" and "tf" numbers considering the following rules
-
minimum "ted"
-
for equal "ted" pick the correction with the minimum number of word
- for equal "ted" and equal
number of words pick the correction with the maximum "tf"
Solution:
I obtained substrings of the initial string by taking away the first character one by one. Then, I went through the entire dictionary for each substring and found the correction based on the "Fixing one word".
To keep track of the entire information I have found, I use a matrix of structures each storing an integer, a string and a number for frequency. The information from the dictionary is stored in a map variable.
From one iteration to another I keep an auxiliary matrix "previous" because two consecutive words in dictionary can have a couple of repeating characters and this will help me to skip the already calculated areas of the matrix.
To get the optimal correction, I used a dynamic programming algorithm by going through the matrix diagonal by diagonal and updating the values (using a Memoization technique).
No comments:
Post a Comment