Spelling Corrector

Language: C/C++

Solution: Dynamic programming

Description:

The application may represent a solution for a search engine which suggests a possible correction for an input of English words.

The correction is done using an English dictionary which consists of 8000 words, each line containing a word and the frequency of that word in use.

Terminology used:

1. Editing distance
- The number of elementary operations that has to be applied on the first string in order to obtain a new string which exists in the dictionary (Elementary operations: insert a character anywhere in the string, delete any character from the string and replace a character by any other character).

2. Fixing one word
- Done using the dictionary and the editing distance
- The correction is the word which has the minimum editing distance comparing to the word which is being fixed
- For two words with the same editing distance the application will pick the one with the maximum frequency

3. Fixing a string

- The program first removes all spaces from the string keeping the order of characters

- Then it inserts spaces in every possible combination and computes the sum of editing distances (ted) and the sum of frequencies for all the corrected words (tf)

- Picks the correction based on the "ted" and "tf" numbers considering the following rules

- minimum "ted"

- for equal "ted" pick the correction with the minimum number of word

- for equal "ted" and equal number of words pick the correction with the maximum "tf"

Solution:

I obtained substrings of the initial string by taking away the first character one by one. Then, I went through the entire dictionary for each substring and found the correction based on the "Fixing one word".

To keep track of the entire information I have found, I use a matrix of structures each storing an integer, a string and a number for frequency. The information from the dictionary is stored in a map variable.

From one iteration to another I keep an auxiliary matrix "previous" because two consecutive words in dictionary can have a couple of repeating characters and this will help me to skip the already calculated areas of the matrix.

To get the optimal correction, I used a dynamic programming algorithm by going through the matrix diagonal by diagonal and updating the values (using a Memoization technique).

Demo: Spelling Corrector

Alex Lup - Portfolio

Sunday, May 26, 2013

Spelling Corrector

No comments:

Post a Comment