Soundex was a method used in the early 20th century for categorizing
surnames in the United States census. It grouped similar-sounding names together, so even if a name was misspelled,
researchers had a chance of finding it. Soundex is still used today for much the same reason, although of course we
use computerized database servers now. Most database servers include a Soundex function.
There are several subtle variations of the Soundex algorithm. This is the one used in this chapter:
1. Keep the first letter of the name as-is.
2. Convert the remaining letters to digits, according to a specific table:
♦ B, F, P, and V become 1.
♦ C, G, J, K, Q, S, X, and Z become 2.
♦ D and T become 3.
♦ L becomes 4.
♦ M and N become 5.
♦ R becomes 6.
♦ All other letters become 9.
3. Remove consecutive duplicates.
4. Remove all 9s altogether.
5. If the result is shorter than four characters (the first letter plus three digits), pad the result with trailing zeros.
6. If the result is longer than four characters, discard everything after the fourth character.
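The six steps can be sketched in Python roughly as follows. This is a minimal illustration of the variation listed above, not necessarily the implementation the chapter builds. The names GROUPS, DIGIT_FOR, and soundex are made up for this example, and rejecting empty or non-alphabetic input is an assumption, since the steps say nothing about such input.

```python
# Digit table from step 2; any letter not listed here maps to "9".
GROUPS = {"BFPV": "1", "CGJKQSXZ": "2", "DT": "3", "L": "4", "MN": "5", "R": "6"}
DIGIT_FOR = {letter: digit for letters, digit in GROUPS.items() for letter in letters}


def soundex(name):
    """Return the four-character code for a non-empty, letters-only name."""
    if not name or not name.isalpha():
        # Input validation is this sketch's own choice; the steps do not cover it.
        raise ValueError("expected a non-empty alphabetic name")
    name = name.upper()
    # Steps 1 and 2: keep the first letter, convert the remaining letters to digits.
    digits = [DIGIT_FOR.get(ch, "9") for ch in name[1:]]
    # Step 3: drop a digit when it repeats the digit immediately before it.
    deduped = [d for i, d in enumerate(digits) if i == 0 or d != digits[i - 1]]
    # Step 4: remove the 9s.
    kept = [d for d in deduped if d != "9"]
    # Steps 5 and 6: pad with zeros, then keep only the first four characters.
    return (name[0] + "".join(kept) + "000")[:4]
```

With this sketch, soundex("Robert") and soundex("Rupert") both return "R163", which is the kind of grouping of similar-sounding names the algorithm is meant to produce.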
For example, my name, Pilgrim, becomes P942695. That has no consecutive duplicates, so nothing to do there.
Then you remove the 9s, leaving P4265. That's too long, so you discard the excess character, leaving P426.
Another example: Woo becomes W99, which becomes W9, which becomes W, which gets padded with zeros to
become W000.
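To check the two worked examples by machine, the intermediate strings can be printed one stage at a time. The snippet below is only a sketch: soundex_stages and TABLE are hypothetical names, and it assumes the input is a non-empty name made of ASCII letters.

```python
import itertools

# Translation table for step 2: the listed consonants, then the letters that become 9.
TABLE = str.maketrans(
    "BFPVCGJKQSXZDTLMNR" + "AEHIOUWY",
    "111122222222334556" + "99999999",
)


def soundex_stages(name):
    """Return the intermediate strings for a non-empty, ASCII, letters-only name."""
    name = name.upper()
    # Steps 1 and 2: first letter unchanged, remaining letters converted to digits.
    converted = name[0] + name[1:].translate(TABLE)
    # Step 3: groupby yields one key per run of equal characters.
    deduped = "".join(key for key, _ in itertools.groupby(converted))
    # Step 4: remove the 9s.
    without_nines = deduped.replace("9", "")
    # Steps 5 and 6: pad with zeros, then cut to four characters.
    final = (without_nines + "000")[:4]
    return [converted, deduped, without_nines, final]


print(soundex_stages("Pilgrim"))  # ['P942695', 'P942695', 'P4265', 'P426']
print(soundex_stages("Woo"))      # ['W99', 'W9', 'W', 'W000']
```

Each list reads left to right in the same order as the prose: after conversion, after removing consecutive duplicates, after removing 9s, and after padding or truncating.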