Soundex Explained
NOTE: This is an update to an article I wrote many years ago. It has had
some new information added and several new links to new web pages have also
been added.
Many genealogy records are indexed by a high-tech algorithm called the
Soundex Code. Well, it was “high tech” in 1918 when Robert Russell invented
it. In a nutshell, Soundex Codes provide a means of identifying words,
especially names, by the way
they sound. They were used extensively by the U.S. Work Projects Administration
(WPA) crews working in the 1930s to organize Federal Census data from 1880 to
1920. Soundex has also been used for many state and local census records and is
very popular in genealogy software and databases. If you are a genealogist, it
won't be long until you encounter Soundex.
Motor vehicle bureaus in the District of Columbia, Maryland, Michigan, Minnesota, and Missouri employ Soundex for generating the initial characters of the identification numbers on driver's licenses. The Canadian Centre for Justice Statistics uses Soundex to encode names in its crime surveys and maintain the anonymity of individuals about whom data is collected.
In the days when nearly all of the data for the Census of Population was collected by actual enumerators and individuals who walked from door to door, it was discovered that many of these people spelled surnames phonetically. Thus, one might spell Smith as "Smith" while another might spell it as "Smyth" and still another "Smythe." The census records were to be indexed by the sound of each name rather than by its spelling, and Soundex was the code system used to organize this index.
If you search many records of interest to genealogists, sooner or later you will need to use Soundex Codes. Why? Well, you can often find a person’s entry by his or her Soundex Code, even when the names have been misspelled. This becomes important when you realize that many census takers did not speak the language of the people being enumerated. In fact, in the first 150 years of U.S. census records, the majority of Americans were illiterate and did not know how to write their own last names. The spelling of many family names also has changed over the years, but often the Soundex Code remains the same.
Spelling of names varies widely in early records, especially when language difficulties have intervened. For instance, I could not find my French-speaking great-grandparents listed in the 1910 U.S. Census. I searched and searched, but never found any entries for Joseph and Sophie Theriault. Then I decided to do a Soundex search. The Soundex Code for Theriault is T643. When searching for Soundex Codes, I found several entries for T643 in Ashland, Maine, including one for the family of Joseph and Sophia Tahrihult -- improperly spelled, but with the same Soundex Code.
The census taker had a Scottish name, and he was listed on another census page in the same town as being born in Scotland. I am guessing that he did not speak French. I bet he had some difficulty when speaking with my great-grandparents, neither of whom spoke English, and neither of whom could read or write. The census taker undoubtedly spoke with a thick Scottish accent. No wonder Theriault became Tahrihult!
The Soundex Code is not difficult to learn although I still use a small reference card when I go to the archives to look at records. Every Soundex Code consists of a letter and three numbers, such as W-252. The letter is always the first letter of the surname, and the hyphen is optional. The numbers are assigned to the remaining letters of the surname according to the Soundex guide shown below. If necessary, zeroes are added at the end to produce a four-character code. Additional letters are disregarded.
Here is the Soundex Coding Guide:
Each number represents a group of letters:
1 = B, F, P and V
2 = C, G, J, K, Q, S, X and Z
3 = D and T
4 = L
5 = M and N
6 = R
Disregard the letters A, E, I, O, U, H, W, and Y.
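For readers who like to see things in code, the coding guide can be written as a small lookup table. This is only a sketch in Python; the names GROUPS, CODE_FOR, and digit_for are my own choices for illustration and do not come from any official Soundex software.

```python
# Letter groups copied from the Soundex Coding Guide.
GROUPS = {
    "1": "BFPV",
    "2": "CGJKQSXZ",
    "3": "DT",
    "4": "L",
    "5": "MN",
    "6": "R",
}

# Invert the groups into a per-letter lookup table, e.g. {"B": "1", ...}.
CODE_FOR = {
    letter: digit
    for digit, letters in GROUPS.items()
    for letter in letters
}


def digit_for(letter):
    """Return the Soundex digit for a letter, or '' if it is disregarded."""
    return CODE_FOR.get(letter.upper(), "")


print(digit_for("S"))        # a coded letter
print(repr(digit_for("E")))  # a disregarded letter gives an empty string
```

The eight disregarded letters (A, E, I, O, U, H, W, and Y) are simply left out of the table, so looking them up returns nothing.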
Here are some of the
simpler examples:
Washington is coded W252
(W, 2 for the S, 5 for the N, 2 for the G, remaining letters disregarded).
Lee is coded L000 (L, there is no Soundex Code for E, so the numbers 000 are added).
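The basic procedure (keep the first letter, code the rest, pad with zeroes, drop anything past three digits) can be sketched in a few lines of Python. This is a deliberately simplified first pass, and the function name simple_soundex is my own. It handles only the simple cases and knows nothing about the more complex rules for double letters, side-by-side letters, or H and W, so it will give wrong answers for names that need those rules.

```python
# Per-letter lookup built from the Soundex Coding Guide.
CODE_FOR = {}
for digit, group in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                     "4": "L", "5": "MN", "6": "R"}.items():
    for letter in group:
        CODE_FOR[letter] = digit


def simple_soundex(surname):
    """First-pass Soundex: first letter, then up to three digits, zero padded.

    Simplification: every coded letter after the first produces a digit,
    with no handling of double letters or letters that share a code.
    """
    letters = [ch for ch in surname.upper() if ch.isalpha()]
    if not letters:
        return ""
    # Disregarded letters map to '' and so vanish from the digit string.
    digits = "".join(CODE_FOR.get(ch, "") for ch in letters[1:])
    # Pad with zeroes, then keep only the first three digits.
    return letters[0] + (digits + "000")[:3]


print(simple_soundex("Washington"))  # W252
print(simple_soundex("Lee"))         # L000
```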
Now let’s move on to
some of the more complex rules:
Any double letters in a
name are treated as one letter. For example:
Gutierrez is coded G-362
(G, 3 for the T, 6 for the first R, second R ignored, 2 for the Z).
If the surname has
different letters side-by-side that have the same number in the Soundex coding
guide, they are treated as one letter. Examples:
Pfister is coded as
P-236 (P, F ignored, 2 for the S, 3 for the T, 6 for the R).
Jackson is coded as J-250 (J, 2 for the C, K ignored, S ignored, 5 for the N, 0 added).
Tymczak is coded as T-522 (T, 5 for the M, 2 for the C, Z ignored, 2 for the K). Since the vowel "A" separates the Z and K, the K is coded.
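The double-letter and side-by-side rules both come down to the same idea: a letter is skipped when its number matches the number of the letter just before it. Here is a rough Python sketch of that idea, with a function name (collapsed_soundex) that I made up for this article. Note the assumption: this version treats every disregarded letter, including H and W, as a separator, so it does not yet apply the “H & W Rule” described later in this article.

```python
# Per-letter lookup built from the Soundex Coding Guide.
CODE_FOR = {}
for digit, group in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                     "4": "L", "5": "MN", "6": "R"}.items():
    for letter in group:
        CODE_FOR[letter] = digit


def collapsed_soundex(surname):
    """Soundex with the double-letter and side-by-side rules applied.

    Assumption: any disregarded letter (vowels, Y, H, W) separates two
    letters with the same code; the H & W rule is NOT applied here.
    """
    letters = [ch for ch in surname.upper() if ch.isalpha()]
    if not letters:
        return ""
    digits = []
    # The first letter's own code counts, which is why the F in Pfister
    # is ignored: P and F are both coded 1.
    previous = CODE_FOR.get(letters[0], "")
    for ch in letters[1:]:
        code = CODE_FOR.get(ch, "")
        if code and code != previous:
            digits.append(code)
        previous = code  # a disregarded letter resets this to ''
    return letters[0] + ("".join(digits) + "000")[:3]


for name in ("Gutierrez", "Pfister", "Jackson", "Tymczak"):
    print(name, collapsed_soundex(name))
```

Running it on the four example names gives G362, P236, J250, and T522, matching the codes worked out by hand.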
Names with Prefixes
If a surname has a
prefix, such as Van, Con, De, Di, La, or Le, the code should ignore these
prefixes. However, coders sometimes miss this rule, so they might assign the
Soundex code either with or without the prefix. Because the surname might be
listed under either code, a thorough search of the Soundex index should include
both forms. Note, however, that Mc and Mac are not considered prefixes,
according to the National Archives and Records Administration. Once again,
however, not everyone knows this particular rule, so you might want to search
both with and without the Mc or Mac coded.
VanDeusen might be coded
two ways:
With the prefix included, V-532 (V, 5 for N, 3 for D, 2 for S)
or
With the prefix
excluded, D-250 (D, 2 for the S, 5 for the N, 0 added).
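Since a prefixed name may be filed under either code, it can help to generate both at once. The sketch below is one way to do that in Python. The helper names (strip_prefix, candidate_codes) and the way a prefix is detected are my own assumptions: a prefix is only recognized when it is followed by a space or a capital letter, as in VanDeusen or Le Blanc, so the name must be typed in mixed case. The small soundex function included here follows the rules in this article, including the “H & W Rule.”

```python
PREFIXES = ("Van", "Con", "De", "Di", "La", "Le")  # Mc and Mac are not prefixes

CODE_FOR = {}
for digit, group in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                     "4": "L", "5": "MN", "6": "R"}.items():
    for letter in group:
        CODE_FOR[letter] = digit


def soundex(surname):
    """Compact Soundex encoder following the rules in this article."""
    letters = [ch for ch in surname.upper() if ch.isalpha()]
    if not letters:
        return ""
    digits = []
    previous = CODE_FOR.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODE_FOR.get(ch, "")
        if code and code != previous:
            digits.append(code)
        previous = code
    return letters[0] + ("".join(digits) + "000")[:3]


def strip_prefix(surname):
    """Return the surname without its prefix, or None if none is detected.

    Heuristic (an assumption, not an official rule): the prefix must be
    followed by a space or a capital letter, so 'Lee' is left alone.
    """
    for prefix in PREFIXES:
        if surname.lower().startswith(prefix.lower()):
            rest = surname[len(prefix):]
            if rest and (rest[0].isupper() or rest[0] == " "):
                return rest.strip()
    return None


def candidate_codes(surname):
    """List the Soundex codes worth searching: with and without the prefix."""
    codes = [soundex(surname)]
    stripped = strip_prefix(surname)
    if stripped:
        code = soundex(stripped)
        if code not in codes:
            codes.append(code)
    return codes


print(candidate_codes("VanDeusen"))  # ['V532', 'D250']
print(candidate_codes("Lee"))        # ['L000']
```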
Consonant Separators
If a vowel (A, E, I, O,
U) separates two consonants that have the same Soundex Code, the consonant to
the right of the vowel is coded. Example:
Tymczak is coded as
T-522 (T, 5 for the M, 2 for the C, Z ignored (see "Side-by-Side"
rule above), 2 for the K). Since the vowel "A" separates the Z and K,
the K is coded.
The “H & W Rule”
If "H" or
"W" separates two consonants that have the same Soundex Code, the
consonant to the right of the H or W is not coded. Example:
Ashcraft is coded A-261
(A, 2 for the S, C ignored, 6 for the R, 1 for the F). It is not coded A-226.
The Soundex Indexing System web page on the National Archives site has been updated
to include this previously "lost" rule. Not all documents use this
extra rule, however. Use the National Archives' Soundex page as your definitive
source. The genealogical community owes a special thanks to Tony Burroughs, who
researched and rediscovered the original Soundex instructions used by the
Census Bureau.
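To see exactly what the “H & W Rule” changes, here is a Python sketch of the whole procedure with the rule as a switch. The function name soundex and the hw_rule parameter are my own inventions for illustration; this is not the code used by the National Archives or by any particular genealogy program.

```python
CODE_FOR = {}
for digit, group in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                     "4": "L", "5": "MN", "6": "R"}.items():
    for letter in group:
        CODE_FOR[letter] = digit


def soundex(surname, hw_rule=True):
    """Soundex code for a surname.

    hw_rule=True  : H and W do NOT separate letters with the same code.
    hw_rule=False : H and W behave like vowels (the older, incomplete rule).
    """
    letters = [ch for ch in surname.upper() if ch.isalpha()]
    if not letters:
        return ""
    digits = []
    previous = CODE_FOR.get(letters[0], "")
    for ch in letters[1:]:
        if hw_rule and ch in "HW":
            continue  # skip H and W without resetting the previous code
        code = CODE_FOR.get(ch, "")
        if code and code != previous:
            digits.append(code)
        previous = code  # vowels and Y reset this to ''
    return letters[0] + ("".join(digits) + "000")[:3]


print(soundex("Ashcraft"))                 # A261
print(soundex("Ashcraft", hw_rule=False))  # A226
print(soundex("Theriault"), soundex("Tahrihult"))  # T643 T643
```

With the rule switched on, Ashcraft comes out as A261; with it switched off, the same name becomes A226, which is the result the rule is meant to prevent.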
American Indian and Asian Names
A phonetically-spelled
American Indian or Asian name was sometimes coded as if it were one continuous
name. If a distinguishable surname was given, the name may have been coded in
the normal manner. For example, Dances with Wolves might have been coded as Dances
(D-522) or as Wolves (W-412), or the name Shinka-Wa-Sa may have been coded as
Shinka (S-520) or Sa (S-000).
While the rules sound a
bit complex, they do become easier with a bit of practice. For those of us who
are too lazy to go through the coding exercise, the computer age has brought
many new tools. Most modern genealogy programs will tell you the Soundex Code
of any name that you enter. In addition, a number of online Soundex Machines
are available, including those at http://www.eogn.com/soundex and http://resources.rootsweb.ancestry.com/cgi-bin/soundexconverter. On any of these sites, you can type in a last
name, and the site will display the correct Soundex Code. Yet another Soundex
Converter (YASC) at http://www.bradandkathy.com/genealogy/yasc.html will even convert a long list of names to their
Soundex equivalents; you do not have to enter them one at a time.
NOTE: You can find many more Soundex converters
online, but many of them do not follow the "H & W" Rule. To test
them, enter a name of Ashcraft. It should produce a Soundex code of A-261. If
the software produces some other code, don’t trust it.
The National Archives
and Records Administration publishes the definitive web page explaining the
coding of Soundex at http://www.archives.gov/genealogy/census/soundex.html
While Soundex is a great tool and in widespread use, it certainly is not perfect. For example, it fails for names that sound the same but have different first letters. For instance, Knowles is coded as K542 while both Noles and Nolles are N420. Likewise, Cantor is C536 while the similar sound of Kantor is K536.
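The first-letter weakness is easy to see if you build a tiny index keyed by Soundex Code, which is roughly what a Soundex-indexed database does. The Python sketch below is only an illustration; the function names soundex and index_by_code are mine, and the encoder follows the rules described in this article.

```python
from collections import defaultdict

CODE_FOR = {}
for digit, group in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                     "4": "L", "5": "MN", "6": "R"}.items():
    for letter in group:
        CODE_FOR[letter] = digit


def soundex(surname):
    """Compact Soundex encoder following the rules in this article."""
    letters = [ch for ch in surname.upper() if ch.isalpha()]
    if not letters:
        return ""
    digits = []
    previous = CODE_FOR.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODE_FOR.get(ch, "")
        if code and code != previous:
            digits.append(code)
        previous = code
    return letters[0] + ("".join(digits) + "000")[:3]


def index_by_code(names):
    """Group surnames under their Soundex Code."""
    index = defaultdict(list)
    for name in names:
        index[soundex(name)].append(name)
    return dict(index)


index = index_by_code(["Knowles", "Noles", "Nolles", "Cantor", "Kantor"])
for code, names in sorted(index.items()):
    print(code, names)
```

Noles and Nolles land together under N420, but Knowles sits alone under K542, and Cantor and Kantor are split between C536 and K536, so a search under one code never finds the other spelling.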
Soundex also has a number of shortcomings when dealing with Eastern European Jewish names. Two Jewish genealogists, Randy Daitch and Gary Mokotoff, developed a more sophisticated system, more suitable for Jewish genealogy. The Daitch-Mokotoff Soundex is becoming the de facto standard for on-line lookups on Jewish-related web sites. You can read more about the Daitch-Mokotoff Soundex in an article written by Gary Mokotoff at http://www.avotaynu.com/soundex.html
Numerous other improved Soundex methods have been developed in recent years and are in widespread use on numerous computer databases. The accuracy of the newer methods is much improved. These new and improved Soundex systems typically use more than one letter and three numbers. However, they have never seen much use in genealogy databases.
Now, have fun with census records!
The entire article on the subject can be found at:
http://blog.eogn.com/eastmans_online_genealogy/2010/08/soundex-explained.html
Poster's comments:
There are many links on this subject,
including online Soundex calculators.
The point of the story is well taken.
I have a grandmother with a maiden name of Creswell, but the names on the tombstones at the
Creswell Family Cemetery are also spelled Cresewell and Chriswell.